Tutorial


Collecting and Analyzing Data with TGrep2

 
 

Medium difficulty problem sets and notes

The first problem set talked about complement clauses (CCs) and how to find all of them (or at least all that satisfy the pattern we used) in Switchboard. Now that we have our data set, we can ask our research questions about it. We would like you to find a way, using TGrep2 and your stats knowledge, to test two of the following hypotheses H1 to H6. No fancy statistical modeling is needed: a simple chi-square will do for now*. If you don't have time for the stats, just get the data counts. Each of the problems contains some further hints that will be useful for your own research. The problems don't assume any knowledge beyond what is given in the tutorial (Section 1).

Please bring your solutions (the relevant data counts and the whole TGrep2 command line, including the patterns you used) to class so that we can go over them. Each of the hypotheses is complex, and a real test needs a lot more controls than we expect from you at this point. But try to get a first impression: would you follow up on the hypothesis, or does it seem unlikely that it holds?

NB: Don't forget to use the TGrep2 options -af so that you get accurate counts (you want to find ALL instances)! We strongly encourage you to read through the problems even if you won't do all of them, because each contains important little helpers for your research.

H1 :: ACCESSIBILITY?

    THAT-omission (the rate of ZERO complementizers) is higher for CCs with pronoun subjects than for CCs with definite NP subjects. For the current purpose we treat all and only NPs that start with "the" as definite NPs.

NOTE: As with all embedded clauses, the SBAR of a CC directly governs a clausal node /^S/, which directly governs a subject node /^NP-SBJ/. You can look up the relevant tags for determiners and for pronouns in the Penn Treebank tagset (p. 9). If you don't know how to answer this question for all pronouns, just compare the 1.SG pronoun, "I", vs. definite NP subjects.

In order to make TGrep2 count, you can "pipe" its output through "wc -l". The pattern below counts all occurrences of "I" in the corpus (rather than just the occurrences of "I" as the subject of a complement clause):

    tgrep2 -aft 'I' | wc -l

First, note that I didn't specify a corpus since I have set my environment variable TGREP2_CORPUS to point to the Switchboard corpus (as specified on the TGrep2 setup page). The option "-af" ensures that TGrep2 will return all matches (rather than one per TOP node!). The "-t" ensures that the output is exactly one line of terminal/leaf nodes per match. This is important since "wc -l" is a UNIX command that counts the lines in the output. So if I had used "tgrep2 -afl", which gives a tree-like output, the resulting count wouldn't have meant much (since the tree-like output usually has several lines per match). For my version of the Switchboard corpus (there are a couple of slightly different ones around), I got 38,042 matches for "I". The count command ("wc -l") is extremely important since we usually want to compare counts when we do corpus linguistics.
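
To get you started on H1, here is one possible pair of counting commands. This is only a sketch: it assumes that CCs with an overt complementizer look like (SBAR (IN that) (S ...)) in your version of the corpus, and it approximates the ZERO case as an SBAR without an IN daughter (check how your corpus marks the empty complementizer before trusting these counts). PRP is the Penn Treebank tag for personal pronouns:

    # Sketch: THAT + pronoun subject "I" (assumes overt complementizers are IN daughters)
    tgrep2 -aft '/^SBAR/ < (IN < that) < (/^S/ < (/^NP-SBJ/ < (PRP < I)))' | wc -l
    # Sketch: ZERO + pronoun subject "I" (approximates ZERO as "no IN daughter")
    tgrep2 -aft '/^SBAR/ !< IN < (/^S/ < (/^NP-SBJ/ < (PRP < I)))' | wc -l

For the definite NP counts, replace the pronoun part with something like (/^NP-SBJ/ <1 (DT < the)), where <1 matches the first child. Always inspect a sample of the matches (e.g. with -l and more) to verify that the patterns pick out what you intend.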

H2 :: ADJACENCY?

    THAT-omission is lower for CCs that are not adjacent to the embedding verb.

NOTE: Our current pattern only finds CCs that are adjacent to the verb. You have to extend the pattern to allow intervening material. Have a look at the TGrep2 manual to see how you could do that. What problems do you run into? What types of interveners do you find? Do you think there would be a difference between the different types of interveners and their effect on THAT-omission? (Speculate).
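
As a hint (not the full solution): TGrep2's sisterhood-and-precedence relations let you control how much material may intervene. The first command below is the adjacent case; the second is a sketch that matches exactly one intervening sister node between the verb and the CC (assuming, as before, that the CC is an SBAR sister of the embedding verb; if your shell complains about the $, see the note under H5):

    # Adjacent case: the verb's SBAR sister immediately follows it
    tgrep2 -aft '/^VB/ $. /^SBAR/' | wc -l
    # Sketch: exactly one sister node intervenes between verb and SBAR
    tgrep2 -afl '/^VB/ $. (* $. /^SBAR/)' | more

Browsing the tree output (-l) of the second search is a good way to see what the interveners actually are.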

H3 :: NEGATION?

    Everything else being ignored, CCs that contain a negation are more likely to have THAT as a complementizer (rather than ZERO).

NOTE: For the current purpose, we only look at overt negation, e.g. "not" but also "n't" (this is one of the 'words' that is tagged as negation), etc. Here's an example:

    tgrep2 "* < n't" | more

Note that I used double quotes around the search pattern (TGrep2 accepts both ' and " around patterns) because I want to search for something that includes the symbol ' and I don't want TGrep2 to think that the pattern ends at the '. The * matches any node (this is a very useful shortcut!). The command gives me output like the following (remember that by default TGrep2 prints the output in the non-tree-like Penn Treebank parentheses style):

    (RB n't)
    (RB n't)
    (RB n't)
    (RB n't)
    ...

So, it seems that negation is at least sometimes marked with the tag RB. If you want a bit more context around the negation (e.g. because you want to make sure that you are actually looking at negation), match one node further up in the tree:

    tgrep2 -l "* < (* < n't)" | more

You will see that "n't" is indeed tagged RB and that the RB is governed by a VP. The RB element "n't" is preceded by words like "was", "do", "should", etc. Play around with searches that explicitly match one of those to find out what the tag/tagset for negation seems to be. How is "won't" split up and annotated in the corpus? Is "not" marked differently from "n't"? Look at what else is included under negation tags. Use a combination of the tag and the word string to find all CCs with negation (and all CCs without negation).
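
Once you have worked out the negation tag(s), a pattern along the following lines will count CCs that contain an overt negation. This is a sketch only: it assumes that RB turns out to be the relevant tag and that your CCs are the SBARs from problem set 1 (note the double quotes because of the apostrophe; if your shell complains about the $ in the regular expression, put the pattern in a file, as described under H5):

    # Sketch: CCs (approximated as SBARs) dominating a negation word tagged RB
    tgrep2 -af "/^SBAR/ << (RB < /^(not|n't)$/)" | wc -l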

H4 :: VERB LEMMA

    CCs that are embedded by "think" have a higher THAT-omission rate than CCs that are embedded by "remember" or "believe".

How confident are you about your conclusion for each of the two contrasts ("think" vs. "remember"; "think" vs. "believe")? What could be reasons for the outcome (state a hypothesis)? How could you test your hypothesis?
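
A possible starting point, again only a sketch: it assumes that the CC is an SBAR sister following the verb and that an overt complementizer is an IN daughter of the SBAR, and it lumps the inflected forms of "think" together with a regular expression (you may want to handle the verb forms, including capitalized ones, more carefully):

    # Sketch: "think"-CCs with an overt THAT
    tgrep2 -aft '/^VB/ < /^(think|thinks|thought|thinking)$/ $.. (/^SBAR/ < (IN < that))' | wc -l
    # Sketch: "think"-CCs without an IN daughter (approximating ZERO)
    tgrep2 -aft '/^VB/ < /^(think|thinks|thought|thinking)$/ $.. (/^SBAR/ !< IN)' | wc -l

Repeat with the forms of "remember" and "believe" and compare the rates.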

H5 :: Metrical optimization?

    THAT-omission is more frequent after embedding verbs that do not end in a stressed syllable than after embedding verbs that do end in a stressed syllable.

NOTE: Start by extracting the embedding verbs (you can 'pipe' the TGrep2 output through 'sort' and 'uniq' to get only one output line per distinct verb). E.g. (assuming you have copied Florian's MACRO-file into the AFS directory you are in):

    tgrep2 -aft MACROS.ptn '/^VB/ $ @CC_OPT' | sort | uniq -i | more

NB: On some UNIX/LINUX shells the $ symbol is interpreted as a variable name. This can cause problems with the TGrep2 call. If so, copy the pattern in a file and call TGrep2 with that pattern file as input:

    tgrep2 -aft MACROS.ptn name_of_pattern_file.ptn | sort | uniq -i | more

This will give you output something like the following:

    Bet
    Guess
    Hope?
    Hoping
    Let
    Saw
    Seems
    Surprised
    ....

The 'sort' sorts the output of the TGrep2 search alphabetically. The 'uniq' collapses all adjacent identical lines in that list (the '-i' option tells 'uniq' to compare lines without being sensitive to case). Finally, the 'more' pauses the output pagewise (press 'ENTER' to see one more line, 'SPACE' for one more page, or 'q' to quit the output).

Once you have seen enough verbs, simply select three with unstressed final syllables and three with stressed final syllables and compare the rate of THAT complementizers in the CCs they embed.
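
To get the THAT rate for an individual verb, you can reuse the macro. The following is a sketch only (with "guess" as a placeholder verb, and the same assumption as above that an overt complementizer is an IN daughter of the CC); depending on how @CC_OPT is defined in the macro file, you may need to adjust the bracketing:

    # Sketch: CCs embedded by "guess" with an overt THAT
    tgrep2 -aft MACROS.ptn '/^VB/ < /^[Gg]uess$/ $ (@CC_OPT < (IN < that))' | wc -l
    # Sketch: CCs embedded by "guess" without an IN daughter (approximating ZERO)
    tgrep2 -aft MACROS.ptn '/^VB/ < /^[Gg]uess$/ $ (@CC_OPT !< IN)' | wc -l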

H6 :: Modality matters?

    There is more THAT-omission in speech than in writing.

NOTE: Compare THAT-omission in Switchboard with one or both of the written TGrep2-able corpora on AFS: the Wall Street Journal and the Brown corpus (for the location of those corpora, consult the TGrep2 setup page). How confident are you about your conclusion? What could be a possible confound?
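
You can point TGrep2 at a different corpus with the -c option, so the same CC pattern can be run over each corpus in turn. The paths below are placeholders (the actual locations are on the setup page), and the patterns again assume that an overt complementizer is an IN daughter of @CC_OPT:

    # Sketch: the same counts over a written corpus (the path is a placeholder)
    tgrep2 -c /path/to/wsj.t2c -aft MACROS.ptn '@CC_OPT < (IN < that)' | wc -l
    tgrep2 -c /path/to/wsj.t2c -aft MACROS.ptn '@CC_OPT !< IN' | wc -l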




* For large data sets that come from many speakers, chi-squares are actually rather unproblematic (as long as one only wants to test whether two variables are dependent on each other, i.e. whether one affects the other).

 