Tutorial

Collecting and Analyzing Data with TGrep2
Step 2: Preparing a database

Please note that this tutorial is outdated. It still contains useful information about building databases out of TGrep2 output, but the tools referred to here have been updated and now have slightly different syntax and more functionality. This page will (hopefully) be updated soon.

So far, we have seen how TGrep2 can be used to extract sentences (strings of words) or subtrees from syntactically annotated corpora. Of course, what we really want is not just to see whether a given structure/collocation/etc. exists, but rather how frequent it is. If you have had a look at the medium difficulty problem set, you have also seen that TGrep2 can easily do that (use the "-aft" option and 'pipe' the output through "wc -l"). Such counts can then be fed into e.g. a Chi-square test to see whether there is more THAT-omission before embedded pronoun subjects than before non-pronominal subjects (cf. H1 in the problem set). Ultimately, however, our goals are more ambitious. Rather than running many Chi-square tests, we want to control for many variables at the same time. We want to see whether each of the factors we are interested in influences the variation we are interested in (the dependent variable) after all the other factors are controlled for. That is, we want to run regression analyses (usually logistic regression, given that most of our dependent variables in syntactic corpus research will be categorical). In order to do that, we need to get the TGrep2 output(s) into the format that we need for statistical analysis. Typically, this will be something like the following (with whatever separator between the columns you use, e.g. TABs or commas).
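The Chi-square test mentioned above can be sketched in a few lines. The counts below are invented for illustration (in practice they would come from the "wc -l" counts), and the helper function is our own, not part of the TDT scripts:

```python
# Sketch: a Pearson chi-square statistic for a 2x2 contingency table.
# The counts are hypothetical; rows are pronominal vs. non-pronominal
# embedded subjects, columns are ZERO vs. THAT complementizers.

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

counts = [[300, 100],   # pronominal subject:     ZERO, THAT
          [100, 200]]   # non-pronominal subject: ZERO, THAT
print(round(chi_square_2x2(counts), 2))
```

The resulting statistic can then be compared against the chi-square distribution with one degree of freedom.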
So let's say we're still interested in complement clauses and we happen to use the TGrep2 node ID as a unique identifier for each case (how convenient!). The database would look something like this:
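To make the target format concrete, here is a sketch (with invented values) of what such a TAB-delimited file boils down to: a header row naming the variables, and one row per case keyed by the TGrep2 node ID:

```python
# Hypothetical illustration of the target database format. The node IDs
# and values below are invented for this example.
header = ["Item_ID", "COMPL", "VERB", "EMBEDDED_SUBJECT_REF", "LENGTH"]
rows = [
    ["157:86", "THAT", "wish",  "PRO", "6"],
    ["214:11", "ZERO", "think", "PRO", "4"],
]
# One tab-separated line per row, header first.
lines = ["\t".join(header)] + ["\t".join(r) for r in rows]
print("\n".join(lines))
```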
Having an automated way to get from TGrep2 outputs to such a file would be nice, as such (tab/comma/whatever-)delimited files can easily be opened by e.g. R (or Excel), as shown in Section 3 of this page. This section summarizes the use of PERL scripts that turn standard TGrep2 outputs into TAB-delimited files of the structure shown above. Don't worry: you don't need to understand PERL to use the scripts, and PERL is already installed and ready to go on our HLP machines.

Getting and setting up the scripts

First, you need to get the PERL scripts from Florian Jaeger's TGrep2 resource page (this is also where future updates, potential manuals, etc. will be available). The TDT scripts have been written by Florian Jaeger, incorporating scripts from Dave Orr, Neal Snider, and Elizabeth Coppock. Please understand that the scripts are alpha releases - please do not distribute them without permission. Instead, refer to the aforementioned resource page. Today we will only introduce a subset of the scripts available, so ask us if you're interested in other scripts. If you're on an HLP machine, you can copy the TDT scripts directly into your directory:
After you copy the files, which will now be in /scripts/ in your home directory, make sure that all the PERL files are executable. For that, execute the following commands:
chmod u+x *.pl

For simplicity's sake, you should also include the scripts in your PATH variable in your .login file. Add a line like the following to your .login file (assuming the scripts are in ~/scripts; for instructions on how to add information to your .login file, see the TGrep2 setup page):

set path = ($path ~/scripts)
Then either log out and back in, or type source .login. If you have done this, the following command should work (try it):
Well , of course , \[ it 's , \+ you know , it 's one of the last few things in the world 4002A7 -NONE- 400298 you 'd ever want 400298 to do 4002A7 , you know . Unless it 's just , you know , really , you know , \[ and \+ , uh , \[ for their , \+ uh , you know , for their \] own good . E_S SpeakerA3 . Yes . E_S Yeah . E_S ...

Sometimes that is what we want, but often it would be nicer not to see the "-NONE-", the speaker node, etc. Strip does exactly that. Here's how the output of the above call looks when piped through "strip.pl" (tgrep2 -afit 'TOP' | strip.pl swbd | more):
Well, of course, it's, you know, it's one of the last few things in the world you'd ever want to do, you know. Unless it's just, you know, really, you know, and, uh, for their, uh, you know, for their own good. Yes. Yeah. ...

Notice that we called strip with "swbd" as an argument. This tells the strip program that it should use the filter for the Paraphrase Switchboard Corpus. If you're working with another corpus, use "other" instead (other filters will be available in the future; currently a standard filter is applied which seems to work for all Penn Treebank corpora). Try the following:
This should give you the following output (compare this to what you would have gotten had you called strip with the "swbd" option, or without piping through strip at all):
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. ...

Finally, let's also copy any wrappers you have created into the ~/scripts directory (so that all our TGrep2 stuff is in one place).

Creating a database

Now that we know that the scripts are working properly, we can start to build up a database where each row corresponds to one case of the unit of observation we're interested in. We exemplify this using complement clauses. First, we need a unique identifier for each case, for which we will use the TGrep2 node ID. So, let's construct a file that contains the IDs of all complement clauses. We already know the pattern for complement clauses, and we have stored it in our macro file MACROS.ptn (which should be in your home directory) as @CC_OPT (for Complement Clause with or without OPTional that). We have also seen how to get the TGrep2 node ID as output of a TGrep2 search. Thus:
myTGrep2.pl '@CC_OPT' > ~/data/EMBED-ID.dat
COMPL VERB EMBEDDED_SUBJECT_REF LENGTH

You can download an example file or copy it from:
perl create_database.pl ~/data/CCfactors swbd ~/data/EMBED-ID.dat

The script will output the following message to the shell and create a file called "swbd.csv" in the directory ~/scripts/.
############################### Corpus Creator v0.5 ######################################
Reading in case IDs ...
Creating new database ...
Created new factor COMPL (Factor_ID 0) ... added factor COMPL with Factor_ID 0
Created new factor VERB (Factor_ID 1) ... added factor VERB with Factor_ID 1
Created new factor EMBEDDED_SUBJECT_REF (Factor_ID 2) ... added factor EMBEDDED_SUBJECT_REF with Factor_ID 2
Created new factor LENGTH (Factor_ID 3) ... added factor LENGTH with Factor_ID 3
New database has Item_ID plus 4 (number of factors) columns
==========================================================================================

The new database contains the case IDs (Item_ID) and four columns (one for each factor) with lots of empty cells (per default, empty cells are marked by "."). Note that "Item_ID" was not mentioned in the factor file. This is because it isn't a factor; the script always automatically uses the TGrep2 IDs as unique case identifiers.
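What create_database.pl does at this point can be approximated by a short sketch (a hypothetical re-implementation, not the actual script): write a header with Item_ID plus the factor names, then one row per case ID with every cell set to the empty-cell marker ".":

```python
# Sketch of the database-creation step (hypothetical re-implementation
# of create_database.pl): all factor cells start out empty (".").
def create_database(case_ids, factors, empty="."):
    header = "\t".join(["Item_ID"] + factors)
    rows = ["\t".join([cid] + [empty] * len(factors)) for cid in case_ids]
    return "\n".join([header] + rows)

db = create_database(
    ["157:86", "214:11"],  # case IDs as read from EMBED-ID.dat
    ["COMPL", "VERB", "EMBEDDED_SUBJECT_REF", "LENGTH"],
)
print(db)
```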
Below, we will use scripts to add information to this new database. Which script is appropriate differs with the type of variable we want to add information to. Here we introduce four types of operations on the database: adding values to a categorical fixed variable (with a priori defined levels), adding values to a length variable, adding values to a categorical variable without a priori defined levels, and adding strings to the database.
There are additional scripts for further operations on the database, which we won't describe on this page (but check Florian Jaeger's TGrep2 resource page).
Like the create_database.pl script, all these PERL scripts work on TGrep2 output files that we will have to create.

Adding values to a categorical fixed variable (with a priori defined levels)

In our little example database, we want both the dependent variable COMPL and the independent variable EMBEDDED_SUBJECT_REF to be categorical variables with a priori defined levels (see above). The former is binary and is either ZERO or THAT, and the latter has three levels (PRO, DEF, and INDEF for pronominal subject, definite NP subject, and indefinite NP subject in the CC, respectively). Let's start with the dependent variable COMPL. First, we create two TGrep2 output files: one with all @CC_OPTs with "that" and one with all @CC_OPTs without "that":
myTGrep2.pl '@CC_OPT < @THAT' > ~/data/THAT.dat
myTGrep2.pl '@CC_OPT < @ZERO' > ~/data/ZERO.dat

(The first call assumes that MACROS.ptn also defines a @THAT macro, parallel to @ZERO.) Now we use a script to add the values to our database (you can call the script without any arguments to find out what kind of arguments it expects):
perl cat_factor_extractor.pl swbd COMPL 1 THAT ~/data/THAT.dat ZERO ~/data/ZERO.dat

It is important that we used "swbd" as the argument for the corpus name, because the script looks for swbd.csv (which should still be in the ~/scripts directory). COMPL is the name of the variable that we want to add information to (try what happens if you use a non-existent variable name, i.e. a name that was not defined in CCfactors and therefore is not contained in the header of our new database). The "1" is a default we won't worry about here. After the three basic arguments, we provide the script with a list of (value_name, file_with_IDs_of_cases_that_should_have_that_value) pairs. The script will write THAT in the column COMPL for all case IDs that are contained in the file ~/data/THAT.dat and ZERO for all cases contained in ~/data/ZERO.dat. Let's do the same thing for a three-way distinction of the CC_OPTs' subjects. For simplicity's sake, I use simplified patterns that approximate what we want to a large extent:
myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<, @PRO))' > ~/data/PRO.dat
myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<, @DEF))' > ~/data/DEF.dat
myTGrep2.pl '@CC_OPT < (/^S/ < (@SBJ <<, @INDEF))' > ~/data/INDEF.dat
perl cat_factor_extractor.pl swbd EMBEDDED_SUBJECT_REF 1 PRO ~/data/PRO.dat DEF ~/data/DEF.dat INDEF ~/data/INDEF.dat

(The first call assumes a @PRO macro, parallel to @DEF and @INDEF.) By now our database file "swbd.csv" should contain the following information (use "more" or "less" to look at the file):
Note that some cells in the columns we added information to may still be empty. Is this a reason for concern? For EMBEDDED_SUBJECT_REF we expect many cells to be empty since, for example, CCs with quantifier NP subjects like "every dude on the whole planet" will not be contained in any of the three TGrep2 output files we created. If we find empty cells for COMPL, however, we should have a closer look at the primary data and find out why that is the case.

Adding values to a length variable

Sometimes we want to count the number of words in a given domain (or even across constituents). For example, we may want to extract the distance of the beginning of a complement clause from the beginning of the utterance. Here we demonstrate another example, the extraction of the length of all complement clauses in our data sample. The script we use will only count real words and not any other kind of symbol, which is something that TGrep2 does not do for us. Also, the script will integrate the information into our existing database. The input that this script expects is slightly different from the one used so far. It expects a file with your search represented one word per line, with each word preceded by the case ID of its head node. The script counts all real words associated with a case ID. For the length of the CC_OPTs, we can create such a file by calling the following TGrep2 command or by calling myTGrep2.pl with a specific option (using node labeling as discussed in Section 1 above):
myTGrep2.pl -printByLine '*=print !< * >> (`@CC_OPT)' > ~/data/LENGTH.dat Either of these commands creates the file "LENGTH.dat", which will look something like:
118:9 \[
118:9 they
118:9 ,
118:9 \+
118:9 they
118:9 \]
118:9 had
118:9 a
118:9 great
118:9 deal
118:9 of
118:9 ,
118:9 um
157:86 that
157:86 if
157:86 403965
157:86 something
157:86 like
157:86 that
...

Don't worry: the script we will use will only count the words. But notice that for each CC_OPT there are many lines with one word each. This is the format we need. Now let's count and integrate the information into our database.
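The counting step itself (the exact TDT call is not reproduced here) boils down to the following sketch. The filter for what counts as a "real word" is our guess, not the actual script's filter:

```python
# Sketch of the word-counting step: given the one-word-per-line format
# ("<case ID> <token>"), count only real words per case ID. We guess
# that disfluency markup (\[, \+, \]), punctuation, and purely numeric
# trace IDs (like 403965) should be skipped.
import re

def count_words(lines):
    counts = {}
    for line in lines:
        case_id, token = line.split(None, 1)
        token = token.strip()
        if token in (r"\[", r"\+", r"\]") or not re.search(r"[A-Za-z]", token):
            continue  # not a real word
        counts[case_id] = counts.get(case_id, 0) + 1
    return counts

sample = [
    "157:86 that", "157:86 if", "157:86 403965",
    "157:86 something", "157:86 like", "157:86 that",
]
print(count_words(sample))
```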
The same can be done for any other length factor you wish to add to the database. Note that we can also count across constituents, as long as we can come up with a TGrep2 pattern that describes all the terminal nodes that we want to be counted (for example, all the words before a complement clause within a TOP node).
Adding values to a categorical variable without a priori defined levels

Sometimes the levels of a categorical variable aren't really known in advance. This is usually the case for random effects. Consider the embedding verbs for our complement clause database. It is reasonable to assume that verbs differ in the base rates of THAT-omission in their complement clauses (i.e. CC_OPTs of some verbs may almost never occur without THAT, and CC_OPTs of other verbs may almost always have a ZERO complementizer). These differences could be due to a variety of reasons: there could be correlations (or even causal relations) between THAT-omission and properties of the embedding verb, such as age of acquisition, frequency, phonological or metrical structure of the verb, the time when the verb entered the English language, the style associated with the verb, etc. For some questions, it doesn't even matter what factors are the cause of a correlation between the verb and the rate of THAT-omission; all we're interested in is that there is such a difference. Here we are not interested in the statistical modeling of such a variable, but rather in how we could extract such information from a TGrep2-able corpus into our database. Of course, we could write separate search patterns for all embedding verbs we can think of and then use the routine for adding categorical factors with a priori defined levels. This would result in a pretty long script call and a lot of TGrep2 patterns (there are approximately 160 verb forms stemming from 90 embedding verb lemmas in the Switchboard version used here)! Also, we would have to search through the whole corpus first to find out what all those verb forms are! Fortunately, there is a combination of TGrep2 and one more PERL script that does the work for us. First, we need to get a file that tells us, for each case ID, what the corresponding embedding verb is. Any idea how to do that?
Right, we can actually use the TGrep2 call we used in the previous section to extract length information from the corpus. We can simply call myTGrep2.pl with the -printByLine option and only search for the embedding verbs:
Looking at "VERB.dat", this seems to give us what we need, a line per case ID followed by the embedding verb (as you can see there isn't too much variation with regard to the embedding verb - people definitely think too much):
157:86 wish
214:11 think
215:38 think
236:42 think
269:9 think
291:13 guess
291:166 feel
...
175540:9 imagine

Let's do one additional quick check, counting the lines in the data files:

wc -l ~/data/VERB.dat ~/data/EMBED-ID.dat
This tells us (for the corpus used here) that we found 6,449 verbs but there seem to be 6,792 CC_OPTs. This may point to a problem in our pattern or it could be due to something else. You can use the UNIX/LINUX commands grep and diff to compare the TGrep2 output files (the final line deletes the temporary files we create):
grep -o -E '^[0-9]+:[0-9]+' ~/data/VERB.dat > ~/data/tmpverb
grep -o -E '^[0-9]+:[0-9]+' ~/data/EMBED-ID.dat > ~/data/tmpid
diff -y ~/data/tmpverb ~/data/tmpid | more
rm -f ~/data/tmp*

The diff operation shows us which case ID is missing from which file. In this case, the first case missing from "VERB.dat" has the ID 257:43. We can now go back to "EMBED-ID.dat" and search for that case ID (to see what string it corresponds to), or we can use tgrep2 -x (cf. the TGrep2 manual, p. 4) to extract that sentence directly from the corpus (and maybe the whole TOP node as well, which has the ID 257:1).
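The same consistency check can also be sketched without grep and diff: collect the case IDs from each file and report the ones missing from the other. The ID lists below are abbreviated stand-ins for the contents of the two .dat files:

```python
# Sketch: find case IDs present in one ID list but missing from the other
# (the Python analogue of the grep/diff comparison above).
def missing_ids(ids_a, ids_b):
    """IDs present in ids_b but absent from ids_a, in original order."""
    seen = set(ids_a)
    return [i for i in ids_b if i not in seen]

verb_ids  = ["157:86", "214:11", "269:9"]             # abbreviated VERB.dat IDs
embed_ids = ["157:86", "214:11", "257:43", "269:9"]   # abbreviated EMBED-ID.dat IDs
print(missing_ids(verb_ids, embed_ids))
```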
157:86     157:86
214:11     214:11
215:38     215:38
236:42     236:42
         > 257:43
269:9      269:9
291:13     291:13
...

With a little bit of research we can see that the pattern we used contained a mistake (it is missing a pair of parentheses). After correcting the pattern, rerun the call (you have to delete the file "VERB.dat" first; use rm -f ~/data/VERB.dat for that).
If you compare the case IDs in "VERB.dat" and "EMBED-ID.dat" now, you shouldn't find any mismatches. Thus we can proceed by calling the string_factor_extractor.pl PERL script (ignore the "0" - just yet another default switch we aren't interested in).
If you have a look at "swbd.csv" again, you will see that our database now contains the verb information, which we could henceforth model as a random effect on THAT-omission.

Adding strings to the database

A nice benefit of the script introduced in the previous section is that it can also be used to add strings of any length to our database. Note that nothing in string_factor_extractor.pl requires the extracted string to be one word. As a matter of fact, for cases like the VERB variable in the previous section, we have to be careful to make sure that the column's cells are empty before we apply the PERL script. This is necessary because the script concatenates what it finds in the TGrep2 output file with whatever is already in the database (if you switch the "0" to "1" in the call above, you can see the warning the script gives whenever it overrides another entry). The nice thing about this is that we can use the same kind of TGrep2 output used to count the length of something to simply concatenate (rather than count) the words we find in the TGrep2 output file. So, if we wanted to add the string of all CC_OPTs to our database, all we have to do is the following:
This final PERL script call can use the file we also used to count the number of words in CC_OPTs; just that this time we concatenate the words rather than counting them.
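The concatenation step boils down to the following sketch (a hypothetical re-implementation, not the actual string_factor_extractor.pl): same input format as for counting, but the tokens per case ID are joined into a string instead of being counted:

```python
# Sketch of the string-concatenation step: given the same
# "<case ID> <token>" format used for length counting, join the
# tokens for each case ID into one string.
def concat_words(lines):
    strings = {}
    for line in lines:
        case_id, token = line.split(None, 1)
        strings.setdefault(case_id, []).append(token.strip())
    return {cid: " ".join(toks) for cid, toks in strings.items()}

sample = ["157:86 that", "157:86 if", "157:86 something"]
print(concat_words(sample))
```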
Summary

By now our database "swbd.csv" should contain values for all variables.
We can add more factors to our database (e.g. by adding more factors to "CCfactors"). If we save all commands (including the calls of the PERL scripts) into a file, we can simply copy the content of that file into the shell and it will repeat all those commands. Even better, we can use a shell script for those repeated calls (a simple shell script isn't much more than a list of commands you would otherwise type into the shell). That way, we can recreate and extend our database whenever necessary - for example, because we lost it or because we realize that one of our patterns contains a bug (which happens ALWAYS).