Tutorial

Collecting and Analyzing Data with TGrep2

 
[Step 0: Setting up TGrep2 | Step 1: Collecting Data with TGrep2 | Step 2: Preparing a database | Step 3: Analyzing your data]
 

Step 2: Preparing a database

Please note that this tutorial is outdated. It still contains useful information about building databases out of TGrep2 output, but the tools referred to here have since been updated and now have slightly different syntax and more functionality. This page will (hopefully) be updated soon.

So far, we have seen how TGrep2 can be used to extract sentences (strings of words) or subtrees from syntactically annotated corpora. Of course, what we really want is not just to see whether a given structure/collocation/etc. exists, but rather how frequent it is. If you have had a look at the medium difficulty problem set, you have also seen that TGrep2 can easily do that (use the "-aft" option and 'pipe' the output through "wc -l"). Such counts can then be fed into, e.g., a Chi-square test to see whether there is more THAT-omission before embedded pronoun subjects than before non-pronominal subjects (cf. H1 in the problem set).
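
For example, a call along the following lines prints the number of matching subtrees rather than the matches themselves (the pattern is only an illustration, not one of the problem set patterns, and we assume the default corpus is set as described below):

    tgrep2 -aft '/^SBAR/ < (IN < "that")' | wc -l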

Ultimately, however, our goals are more ambitious. Rather than running many Chi-square tests, we want to control for many variables at the same time. We want to see whether each of the factors we are interested in influences the variation we are interested in (the dependent variable) after all the other factors are controlled for. That is, we want to run regression analyses (usually logistic regressions, given that most of our dependent variables in syntactic corpus research will be categorical). In order to do that, we need to get the TGrep2 output(s) into the format we need for statistical analysis. Typically, this will be something like the following (with whatever column separator you use, e.g. TABs or commas):

    Item_ID Dependent_variable FactorA FactorB ... FactorN 
    value value value value ... value 
    value value value value ... value 
    ...      
    value value value value ... value 

So let's say we're still interested in complement clauses and we happen to use the TGrep2 node ID as a unique identifier for each case (how convenient!). The database would look something like this:

    Item_ID COMPL VERB EMBEDDED_SUBJECT_REF ... LENGTH 
    118:9 THAT think PRO ... 
    157:86 THAT mean INDEF ... 
    ...      
    18530:9 ZERO remember DEF ... 
    ...      
    175540:9 ZERO imagine PRO ... 14 

Having an automated way to get from TGrep2 outputs to such a file would be nice, as such (tab/comma/whatever-)delimited files can easily be opened by, e.g., R (or Excel), as shown in Section 3 of this page. This section summarizes the use of PERL scripts that turn standard TGrep2 outputs into TAB-delimited files of the structure shown above. Don't worry, you don't need to understand PERL to use the scripts, and PERL is already installed and ready to go on our HLP machines.

Getting and setting up the scripts

First, you need to get the PERL scripts from Florian Jaeger's TGrep2 resource page (this is also where future updates, potential manuals, etc. will be available). The TDT scripts were written by Florian Jaeger, incorporating scripts from Dave Orr, Neal Snider, and Elizabeth Coppock. Please understand that the scripts are alpha releases - please do not distribute them without permission. Instead, refer to the aforementioned resource page. Today we will only introduce a subset of the available scripts, so ask us if you're interested in other scripts. If you're on an HLP machine, you can copy the TDT scripts directly into your directory:

    cp -r /p/hlp/tools/TDT ~

After you copy the files, which will now be in the scripts/ directory under your home directory (~/scripts), make sure that all the PERL files are executable. To do so, execute the following commands:

    cd ~/scripts
    chmod u+x *.pl
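
If you want to double-check that this worked, listing the files should show an "x" among the user permissions (purely optional):

    ls -l ~/scripts/*.pl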

For simplicity's sake, you should also add the scripts directory to the PATH variable in your .login file. Add the following line to your .login file (for instructions on how to add information to your .login file, see the TGrep2 setup page):

    setenv PATH ~/scripts:$PATH
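
As the setenv syntax indicates, this assumes a (t)csh-style login shell. If your shell happens to be bash instead, the equivalent line (added to ~/.bash_profile rather than .login) would be:

    export PATH=~/scripts:$PATH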

Then either log out and in again, or type source .login. Once you have done this, the following command should work (try it):

    tgrep2 -afit 'TOP' | strip.pl swbd | more

Note that no corpus is specified, since we assume that the corpus environment variable (TGREP2_CORPUS) is set to the corpus of your choice. By now, you should also be familiar with the pipe "|", which redirects the output of the TGrep2 call to "strip.pl swbd" (and then to "more" for pagewise output). "Strip" is a PERL script (hence the ending ".pl") that strips annoying annotation such as traces, coindices, etc. out of the TGrep2 output. This is useful if we really only want to see the words in the output. Remember what a TGrep2 call with the -t (for terminal nodes) option looks like, e.g. tgrep2 -aft 'TOP'?
    Uh , first , um , 40004C I need 40004C to know , uh , 400077 how do you feel 400077 \[ about , \+ uh , about \] -NONE- sending , uh , an elderly , uh , family member to a nursing home ? E_S
    Well , of course , \[ it 's , \+ you know , it 's one of the last few things in the world 4002A7 -NONE- 400298 you 'd ever want 400298 to do 4002A7 , you know . Unless it 's just , you know , really , you know , \[ and \+ , uh , \[ for their , \+ uh , you know , for their \] own good . E_S
    SpeakerA3 .
    Yes . E_S
    Yeah . E_S
    ...

Sometimes that is what we want, but often it would be nicer not to see the "-NONE-", the speaker node, etc. Strip does exactly that. Here's how the output of the above call looks when piped through "strip.pl" (tgrep2 -afit 'TOP' | strip.pl swbd | more):

    Uh, first, um, I need to know, uh, how do you feel about, uh, about sending, uh, an elderly, uh, family member to a nursing home?
    Well, of course, it's, you know, it's one of the last few things in the world you'd ever want to do, you know. Unless it's just, you know, really, you know, and, uh, for their, uh, you know, for their own good.
    Yes.
    Yeah.
    ...

Notice that we called strip with "swbd" as an argument. This tells the strip program to use the filter for the Paraphrase Switchboard Corpus. If you're working with another corpus, use "other" instead (other filters will be available in the future; currently a default is applied which seems to work for all Penn Treebank corpora). Try the following:

    tgrep2 -c /p/hlp/corpora/TGrep2able/wsj_mrg.t2c.gz -afit "TOP" | strip.pl other | more

This should give you the following output (compare this to what you get if you call strip with the "swbd" option, or without piping through strip at all):

    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    ...

Finally, let's also copy any wrappers you have created into the ~/scripts directory (so that all our TGrep2 stuff is in one place).

Creating a database

Now that we know that the scripts are working properly, we can start to build a database in which each row corresponds to one case of the unit of observation we're interested in. We exemplify this using complement clauses.

First, we need a unique identifier for each case, for which we will use the TGrep2 node ID. So, let's construct a file that contains the IDs of all complement clauses. We already know the pattern for complement clauses, and we have stored it in our macro file MACROS.ptn (which should be in your home directory) as @CC_OPT (for Complement Clause with or without OPTional that). We have also seen how to get the TGrep2 node ID as output of a TGrep2 search. Thus:

    tgrep2 -afi -m "%xh\t%th\n" MACROS.ptn '@CC_OPT' | more

gives us all the IDs, each followed by a TAB and the terminal string of the complement clause (we don't need the latter, but we may as well include it to convince ourselves that we got the right thing). Now, let's save our data rather than just look at it. Make a directory for your data and write the output of the search pattern into a file in that directory. Since we will call TGrep2 with the same options (and the macro file) over and over again, we will use the wrapper (see above) called myTGrep2.pl, which stands for tgrep2 -afi -m "%xh\t%th\n" MACROS.ptn. So, we can write:
    mkdir ~/data
    myTGrep2.pl '@CC_OPT' > ~/data/EMBED-ID.dat

Convince yourself that the file indeed contains what we need:
    more ~/data/EMBED-ID.dat

We are almost there. The file with the IDs will allow us to construct a database. But first we have to think about which factors we want the database to contain and what the dependent variable will be. Let's start with a few simple factors (if we change our minds later, we can always rerun the scripts we used, since we will save a history of what we're doing anyway). The script we will use expects a file with all the variable names as input. You can use emacs to create a file of the following format, with one variable per line (let's call the file "CCfactors" and put it into ~/data):
    COMPL
    VERB
    EMBEDDED_SUBJECT_REF
    LENGTH
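
If you prefer to stay in the shell rather than opening emacs, a here-document creates the same file (just a convenience; the contents are exactly the four lines above):

    cat > ~/data/CCfactors << EOF
    COMPL
    VERB
    EMBEDDED_SUBJECT_REF
    LENGTH
    EOF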

You can download an example file or copy it from:

    cp -r /p/hlp/tools/TDT/example/CCfactors ~/data

Now let's change to the script directory and use a script to create the database (to learn more about any of the scripts, simply call it without any arguments; also, the "perl" is optional if you're on an HLP machine and have the file permissions set correctly):
    cd ~/scripts
    perl create_database.pl ~/data/CCfactors swbd ~/data/EMBED-ID.dat

The script will output the following message to the shell and create a file called "swbd.csv" in the directory ~/scripts/.

    ==========================================================================================
    ############################### Corpus Creator v0.5 ######################################
    Reading in case IDs ...
    Creating new database ...
    Created new factor COMPL (Factor_ID 0)
    ... added factor COMPL with Factor_ID 0
    Created new factor VERB (Factor_ID 1)
    ... added factor VERB with Factor_ID 1
    Created new factor EMBEDDED_SUBJECT_REF (Factor_ID 2)
    ... added factor EMBEDDED_SUBJECT_REF with Factor_ID 2
    Created new factor LENGTH (Factor_ID 3)
    ... added factor LENGTH with Factor_ID 3
    New database has Item_ID plus 4 (number of factors) columns
    ==========================================================================================

The new database contains the case IDs (Item_ID) and four columns (one for each factor) with lots of empty cells (by default, empty cells are marked with "."). Note that "Item_ID" was not mentioned in the factor file. This is because it isn't a factor; the script automatically uses the TGrep2 IDs as unique case identifiers.

    Item_ID COMPL VERB EMBEDDED_SUBJECT_REF LENGTH
    118:9 . . . .
    157:86 . . . .
    ...
    18530:9 . . . .
    ...
    175540:9 . . . .

Below, we will use scripts to add information to this new database. Which script is appropriate depends on the type of variable we want to add information to. Here we introduce four types of operations on the database:

  1. Adding values to a categorical fixed variable (with a priori defined levels)
  2. Adding values to a length variable
  3. Adding values to a categorical variable without a priori defined levels
  4. Adding strings to the database

There are additional scripts, which we won't describe on this page (but check Florian Jaeger's TGrep2 resource page), that:

  • add information about the speakers (e.g. gender, education; only available for Switchboard)
  • add information to variables that count number of occurrences (e.g. number of disfluencies in a complement clause)
  • add information on the phonology and lexical stress
  • add phonetic information (duration, pitch, intensity, speech rate, surrounding pauses, etc.; only available for Switchboard)
  • add unigram and bigram information
  • add conditional probabilities for events in the database
  • merge information in the database with other databases

Like the create_database.pl script, all these PERL scripts work on TGrep2 output files that we will have to create.

Adding values to a categorical fixed variable (with a priori defined levels)

In our little example database, we want both the dependent variable COMPL and the independent variable EMBEDDED_SUBJECT_REF to be categorical variables with a priori defined levels (see above). The former is binary (either ZERO or THAT) and the latter has three levels (PRO, DEF, and INDEF for pronominal subject, definite NP subject, and indefinite NP subject in the CC, respectively). Let's start with the dependent variable COMPL.

First, we create two TGrep2 output files. One with all @CC_OPTs with "that" and one with all @CC_OPTs without "that":

    myTGrep2.pl '@CC_OPT < (IN < "that")' > ~/data/THAT.dat
    myTGrep2.pl '@CC_OPT < @ZERO' > ~/data/ZERO.dat

Now we use a script to add the values to our database (you can call the script without any arguments to find out what kind of arguments it expects):

    cd ~/scripts
    perl cat_factor_extractor.pl swbd COMPL 1 THAT ~/data/THAT.dat ZERO ~/data/ZERO.dat

It is important that we used "swbd" as the argument for the corpus name, because the script looks for swbd.csv (which should still be in the ~/scripts directory). COMPL is the name of the variable that we want to add information to (try what happens if you use a non-existent variable name, i.e. a name that was not defined in CCfactors and therefore is not contained in the header of our new database). The "1" is a default we won't worry about here. After the three basic arguments, we provide the script with a list of (value_name, file_with_IDs_of_cases_that_should_have_that_value) pairs. The script will write THAT in the column COMPL for all case IDs that are contained in the file ~/data/THAT.dat and ZERO for all cases contained in ~/data/ZERO.dat.
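
In other words, the general shape of the call is roughly the following (the placeholder names are ours, not official documentation):

    perl cat_factor_extractor.pl <corpus_name> <variable_name> <default_switch> <value1> <ID_file1> [<value2> <ID_file2> ...]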

Let's do the same thing for a three-way distinction of the CC_OPTs' subjects. For simplicity's sake, I use simplified patterns that approximate what we want to a large extent:

    myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<: @PRO))' > ~/data/PRO.dat
    myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<, @DEF))' > ~/data/DEF.dat
    myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<, @INDEF))' > ~/data/INDEF.dat
    perl cat_factor_extractor.pl swbd EMBEDDED_SUBJECT_REF 1 PRO ~/data/PRO.dat DEF ~/data/DEF.dat INDEF ~/data/INDEF.dat

By now our database file "swbd.csv" should contain the following information (use "more" or "less" to look at the file):

    Item_ID COMPL VERB EMBEDDED_SUBJECT_REF LENGTH
    118:9 THAT . PRO .
    157:86 THAT . INDEF .
    ...
    18530:9 ZERO . DEF .
    ...
    175540:9 ZERO . PRO .

Note that some cells in the columns we added information to may still be empty. Is this a reason for concern? For EMBEDDED_SUBJECT_REF we expect many cells to be empty since, for example, CCs with quantifier NP subjects like "every dude on the whole planet" will not be contained in any of the three TGrep2 output files we created. If we find empty cells for COMPL, however, we should have a closer look at the primary data and find out why that is the case.
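
A quick way to spot such cases from the shell is to print the rows whose COMPL cell still contains the placeholder (a sketch; it assumes the database is TAB-delimited, that COMPL is the second column as in our factor file, and that empty cells are marked with "."):

    awk -F '\t' 'NR > 1 && $2 == "." { print $1 }' ~/scripts/swbd.csv | more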

Adding values to a length variable

Sometimes we want to count the number of words in a given domain (or even across constituents). For example, we may want to extract the distance of the beginning of a complement clause from the beginning of the utterance. Here we demonstrate another example, the extraction of the length of all complement clauses in our data sample. The script we use will only count real words and not other kinds of symbols, which is something that TGrep2 does not do for us. Also, the script will integrate the information into our existing database.

The input that this script expects is slightly different from the one used so far. It expects a file with one word per line, with each word preceded by the case ID of its head node. The script counts all real words associated with a case ID. For the length of the CC_OPTs, we can create such a file by calling the following TGrep2 command, or by calling myTGrep2.pl with a specific option (using node labeling as discussed in Section 1 above):

    tgrep2 -afi -m '%xm\t%t=print=\n' MACROS.ptn '*=print !< * >> (`@CC_OPT)' > ~/data/LENGTH.dat
    myTGrep2.pl -printByLine '*=print !< * >> (`@CC_OPT)' > ~/data/LENGTH.dat

Either of these commands creates the file "LENGTH.dat", which will look something like:

    118:9 that
    118:9 \[
    118:9 they
    118:9 ,
    118:9 \+
    118:9 they
    118:9 \]
    118:9 had
    118:9 a
    118:9 great
    118:9 deal
    118:9 of
    118:9 ,
    118:9 um
    157:86 that
    157:86 if
    157:86 403965
    157:86 something
    157:86 like
    157:86 that
    ...

Don't worry: the script we will use will only count the real words. But notice that for each CC_OPT there are many lines with one word each. This is the format we need. Now let's count and integrate the information into our database.

    perl cont_factor_extractor.pl swbd LENGTH ~/data/LENGTH.dat

The same can be done for any other length factor you wish to add to the database. Note that we can also count across constituents, as long as we can come up with a TGrep2 pattern that describes all the terminal nodes that we want to be counted. Here is an example that can be used to count all words before a complement clause (within a TOP node); a sketch of how to feed such a count into the database follows the pattern:

    myTGrep2.pl -printByLine '*=print = @TERMWORD .. (`@CC_OPT)'
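
To actually get such a count into the database, the same two-step recipe as for LENGTH applies. The factor name below (PRECC_LENGTH) is purely hypothetical - you would first have to add it to the CCfactors file and recreate the database:

    myTGrep2.pl -printByLine '*=print = @TERMWORD .. (`@CC_OPT)' > ~/data/PRECC_LENGTH.dat
    cd ~/scripts
    perl cont_factor_extractor.pl swbd PRECC_LENGTH ~/data/PRECC_LENGTH.dat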

Adding values to a categorical variable without a priori defined levels

Sometimes the levels of a categorical variable aren't really known in advance. This is usually the case for random effects. Consider the embedding verbs for our complement clause database. It is reasonable to assume that verbs differ in the base rates of THAT-omission in their complement clauses (i.e. CC_OPTs of some verbs may almost never occur without THAT, and CC_OPTs of other verbs may almost always have a ZERO complementizer). These differences could be due to a variety of reasons: there could be correlations (or even causal relations) between THAT-omission and properties of the embedding verb such as age of acquisition, frequency, phonological or metrical structure of the verb, the time when the verb entered the English language, the style associated with the verb, etc. For some questions, it doesn't even matter which factors cause a correlation between the verb and the rate of THAT-omission. All we're interested in is that there is such a difference.

Here we are not interested in the statistical modeling of such a variable, but rather in how we could extract such information from a TGrep2-able corpus into our database. Of course, we could write separate search patterns for all embedding verbs we can think of and then use the routine for adding categorical factors with a priori defined levels. This would result in a pretty long script call and a lot of TGrep2 patterns (there are approximately 160 verb forms stemming from 90 embedding verb lemmas in the Switchboard version used here)! Also, we would have to search through the whole corpus first to find out what all those verb forms are! Fortunately, there is a combination of TGrep2 and one more PERL script that does the work for us.

First, we need to get a file that tells us, for each case ID, what the corresponding embedding verb is. Any idea how to do that? Right, we can actually use the TGrep2 call we used in the previous section to extract length information from the corpus. We can simply call myTGrep2.pl with the -printByLine option and only search for the embedding verbs:

    myTGrep2.pl -printByLine '*=print = @TERMWORD > /^VB/ $ (`@CC_OPT)' > ~/data/VERB.dat

Looking at "VERB.dat", this seems to give us what we need, a line per case ID followed by the embedding verb (as you can see there isn't too much variation with regard to the embedding verb - people definitely think too much):

    118:9    think
    157:86   wish
    214:11   think
    215:38   think
    236:42   think
    269:9    think
    291:13   guess
    291:166  feel
    ...
    175540:9 imagine
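
If you want to put a number on that impression, a quick frequency count over the verb column does the trick (a sketch; it assumes the file is TAB-delimited, as produced by the -m format string above):

    cut -f 2 ~/data/VERB.dat | sort | uniq -c | sort -rn | more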

Let's do one additional quick check (counting the lines in the data files):

    wc -l ~/data/VERB.dat
    wc -l ~/data/EMBED-ID.dat

This tells us (for the corpus used here) that we found 6,449 verbs but there seem to be 6,792 CC_OPTs. This may point to a problem in our pattern or it could be due to something else. You can use the UNIX/LINUX commands grep and diff to compare the TGrep2 output files (the final line deletes the temporary files we create):

    grep -o -E '^[0-9]+:[0-9]+' ~/data/VERB.dat > ~/data/tmpverb
    grep -o -E '^[0-9]+:[0-9]+' ~/data/EMBED-ID.dat > ~/data/tmpid
    diff -y ~/data/tmpverb ~/data/tmpid | more
    rm -f ~/data/tmp*

The diff operation shows us which case ID is missing from which file. In this case, the first case missing in "VERB.dat" has the ID 257:43. We can now go back to "EMBED-ID.dat" and search for that case ID (to see what string it corresponds to), or we can use tgrep2 -x (cf. the TGrep2 manual, p. 4) to extract that sentence directly from the corpus (and maybe the whole TOP node as well, which has the ID 257:1).

    118:9            118:9
    157:86           157:86
    214:11           214:11
    215:38           215:38
    236:42           236:42
                   > 257:43
    269:9            269:9
    291:13           291:13
    ...

With a little bit of research we can see that the pattern we used contained a mistake (it is missing a pair of parentheses). Here is the correct call (you have to delete the file "VERB.dat" first; use rm -f ~/data/VERB.dat for that):

    myTGrep2.pl -printByLine '*=print = @TERMWORD > (/^VB/ $ (`@CC_OPT))' > ~/data/VERB.dat

If you compare the case IDs in "VERB.dat" and "EMBED-ID.dat" now, you shouldn't find any mismatches. Thus we can proceed by calling a PERL script (ignore the "0" - just yet another default switch we aren't interested in):

    perl string_factor_extractor.pl swbd 0 VERB ~/data/VERB.dat

If you have a look at "swbd.csv" again, you will see that our database now contains the verb information, which we could henceforth model as a random effect on THAT-omission.

Adding strings to the database

A nice benefit of the script introduced in the previous section is that it can also be used to add strings of any length to our database. Note that nothing in string_factor_extractor.pl requires the extracted string to be one word. As a matter of fact, for cases like the VERB variable in the previous section, we have to be careful to make sure that the column's cells are empty before we apply the PERL script. This is necessary because the script concatenates what it finds in the TGrep2 output file with whatever is already in the database (if you switch the "0" to "1" in the call above, you can see the warning the script gives whenever it overrides an existing entry). The nice thing about this is that we can use the same kind of TGrep2 output used to count the length of something to simply concatenate (rather than count) the words we find in the TGrep2 output file.

So, if we wanted to add the string of all CC_OPTs to our database, all we have to do is the following:

  1. Edit the CCfactors file to include some factor named CC_WORDS
  2. Rerun all the scripts above (hopefully you have them stored in a file anyway). Note that you don't have to rerun the TGrep2 searches if you haven't deleted the *.dat files.
  3. Add one more PERL script call to add the word information.

This final PERL script call can use the same file we used to count the number of words in CC_OPTs; this time, however, we concatenate the words rather than counting them:

    perl string_factor_extractor.pl swbd 0 CC_WORDS ~/data/LENGTH.dat

Summary

By now our database "swbd.csv" should contain values for all variables:

    Item_ID COMPL VERB EMBEDDED_SUBJECT_REF LENGTH 
    118:9 THAT think PRO 
    157:86 THAT mean INDEF 
    ...     
    18530:9 ZERO remember DEF 
    ...     
    175540:9 ZERO imagine PRO 14 

We can add more factors to our database (e.g. by adding more factors to "CCfactors"). If we save all commands (including the calls of the PERL scripts) into a file, we can simply copy the contents of that file into the shell and it will repeat all those commands. Even better, we can use a shell script for those repeated calls (a simple shell script isn't much more than a list of commands you would otherwise type into the shell). That way, we can recreate and extend our database whenever necessary - for example, because we lost it or because we realize that one of our patterns contains a bug (which happens ALWAYS).
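
For instance, a minimal shell script for the little database built on this page could look roughly like the sketch below. It simply strings together the calls used above; the file name build_cc_database.sh is our own choice, and you would run it with sh build_cc_database.sh:

    # build_cc_database.sh: rebuild the complement clause database from scratch.
    # Assumes MACROS.ptn, ~/data/CCfactors, myTGrep2.pl, and the TDT scripts are
    # set up as described on this page.
    rm -f ~/data/*.dat ~/scripts/swbd.csv   # clean slate (removes all .dat files in ~/data!)
    mkdir -p ~/data
    # Extract the case IDs and one TGrep2 output file per variable.
    myTGrep2.pl '@CC_OPT' > ~/data/EMBED-ID.dat
    myTGrep2.pl '@CC_OPT < (IN < "that")' > ~/data/THAT.dat
    myTGrep2.pl '@CC_OPT < @ZERO' > ~/data/ZERO.dat
    myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<: @PRO))' > ~/data/PRO.dat
    myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<, @DEF))' > ~/data/DEF.dat
    myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<, @INDEF))' > ~/data/INDEF.dat
    myTGrep2.pl -printByLine '*=print !< * >> (`@CC_OPT)' > ~/data/LENGTH.dat
    myTGrep2.pl -printByLine '*=print = @TERMWORD > (/^VB/ $ (`@CC_OPT))' > ~/data/VERB.dat
    # Create the database and fill in the columns.
    cd ~/scripts
    perl create_database.pl ~/data/CCfactors swbd ~/data/EMBED-ID.dat
    perl cat_factor_extractor.pl swbd COMPL 1 THAT ~/data/THAT.dat ZERO ~/data/ZERO.dat
    perl cat_factor_extractor.pl swbd EMBEDDED_SUBJECT_REF 1 PRO ~/data/PRO.dat DEF ~/data/DEF.dat INDEF ~/data/INDEF.dat
    perl cont_factor_extractor.pl swbd LENGTH ~/data/LENGTH.dat
    perl string_factor_extractor.pl swbd 0 VERB ~/data/VERB.dat
    # Optional (only if CC_WORDS has been added to CCfactors):
    # perl string_factor_extractor.pl swbd 0 CC_WORDS ~/data/LENGTH.dat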

 