Tutorial :: Collecting and Analyzing Data with TGrep2
Step 1: Collecting Data with TGrep2

Basics
Macros

Our search for complement clauses allows structures with intervening material between the verb and the complement, such as direct objects, e.g. "She told me that...". We may not want to deal with this as a factor, so we can add another condition to make sure that no word intervenes between the verb and the complement clause. If we spelled out this condition in one line, it would look like this:
This pattern involves several new features of regular expressions:
It also involves some new features of TGrep2:
If you still don't know what this pattern means, don't worry. It should be much more understandable once we simplify the pattern using macros. The part of the search before the dollar sign ($) is the general search for complement clauses. If we define that entire pattern as a macro called @COMPCLAUSE, then we can greatly simplify the pattern to just:
With that defined as a macro, you can replace that part of the pattern with @COMPCLAUSE, as long as you tell TGrep2 where to find the definition of @COMPCLAUSE. Call TGrep2 as follows:
Next, let's concentrate on this part:
Now you can replace that part of the pattern with @WORD. Call TGrep2 as follows:
Now let's concentrate on this part:
In fact (wouldn't you know it), Florian Jaeger has developed an elaborate set of macros for the purpose of extracting complement and relative clauses. Copy this file into your home directory (right-click and save it and then make sure that you transfer it to your directory on the corpus server). If you are logged into our HLP machines, you can also copy the latest version from the TDT directory (/p/hlp/tools/TDT/) by typing:
Formatting output

By default, tgrep2 prints the head of the pattern, which is the leftmost node. This matters when you want to understand the function of the -af options. These options cause tgrep2 to find all instances of the head, so for the purposes of getting counts, the head of your pattern must be the construction you are looking for.

Basic output is formatted with command line switches. -l and -t are the ones you will use most. The -l option pretty-prints the subtree matched by your pattern:
(SBAR (WHADVP (N 400B34) (WDT that)) (S (NP-SBJ_MARKABLE_human (N 400B21) (PRP we)) (VP (VBD had) (S (NP-SBJ_MARKABLE (-NONE- (N 400B21))) (VP (TO to) (VP (VB do) (NP_MARKABLE_nonconc (PRP it)) (ADVP-TMP (-NONE- (N 400B34)))))))))

The -t option prints the terminals (terminals are leaf nodes, but keep in mind that terminals can contain lots of non-word information like coreference and speaker id codes):
4002A7 -NONE- 400298 you 'd ever want 400298 to do 4002A7 , you know . Unless it 's just , you know , really , you know , \[ and \+ , uh , \[ for their , \+ uh , you know , for their \] own good

Given that the head of the construction has a special meaning, you should not rely on the head to determine what is printed. Fortunately, tgrep2 gives you great flexibility in how the output of your search is formatted. The basic syntax is to call tgrep2 with an output option (-m in the examples below) followed by a schematic representation of your desired output. The -m option causes tgrep2 to produce output for every match to your pattern. Other options include -s, which produces output for every whole tree in the corpus, regardless of whether a match was successful.

Perhaps the most useful output is the numeric code corresponding to the subtree that matches your pattern. tgrep2 codes have the following format: number(:number)+. The first number is the number of the whole tree containing the match, counting from the beginning of the corpus file to the end. The other numbers identify subtrees, counting from top to bottom and right to left in the tree. Subtree codes are very important for two reasons:
To print the code for the tree matched by pattern:
5:73
17:73
21:68

To print code and tree separated by a tab (note use of \t for tab and \n for newline):
17:73 (SBAR (WHNP_MARKABLE_human (N 400945) (WP who)) (, ,) (INTJ (UH uh)) (, ,) (S (NP-SBJ_MARKABLE (-NONE- (N 400945))) (VP (VBD ran) (NP_MARKABLE_org (NP_MARKABLE_org (DT the) (NN nursing) (NN home)) (PP-LOC (IN in) (NP_MARKABLE_place (PRP$_MARKABLE_human our) (JJ little) (NN hometown)))))))

To print code and terminals separated by a tab:
17:73 400945 who , uh , 400945 ran the nursing home in our little hometown
21:68 400B34 that 400B21 we had 400B21 to do it 400B34

To print code and pretty-printed tree separated by a tab:
21:68 (SBAR (WHADVP (N 400B34) (WDT that)) (S (NP-SBJ_MARKABLE_human (N 400B21) (PRP we)) (VP (VBD had) (S (NP-SBJ_MARKABLE (-NONE- (N 400B21))) (VP (TO to) (VP (VB do) (NP_MARKABLE_nonconc (PRP it)) (ADVP-TMP (-NONE- (N 400B34)))))))))

To print code and node marked for printing (node with backquote):
174:22 was \] lucky 403F31 too 403F31 that I only have one brother
214:8 think that 404C45 when she passed away 404C45 it was probably one of the greatest

To print code and node(s) labelled in pattern (this can also be used to print multiple nodes):
118:9 that \[ they , \+ they \] had a great deal of , um
118:25 they \]
---
157:86 [...] that my children would do something like , that for me
157:122 my children
---

To print tree and one line of context before (TOP node before the one containing the match):
Speaker1133_10 .
(SBAR (WHNP_MARKABLE_org (N 40121A) (WDT that)) (S (NP-SBJ_MARKABLE_human (PRP we)) (ADVP-TMP (RB finally)) (VP (VBD had) (NP_MARKABLE (-NONE- (N 40121A))))))

To print tree and 10 lines of context before:
Yes . E_S
Yeah . E_S
Speaker1133_4 .
400617 I 'd be very very careful \[ and , \+ uh , you know , 400617 checking them out . E_S
Uh , our , E_S
400725 -NONE- had 400725 t- , place my mother in a nursing home . E_S
She had a rather massive stroke \[ about , \+ uh , about \] uh , eight months ago I guess . E_S
Speaker1169_5 .
Uh-huh . E_S
Speaker1133_6 .
(SBAR (WHNP_MARKABLE_human (N 400945) (WP who)) (, ,) (INTJ (UH uh)) (, ,) (S (NP-SBJ_MARKABLE (-NONE- (N 400945))) (VP (VBD ran) (NP_MARKABLE_org (NP_MARKABLE_org (DT the) (NN nursing) (NN home)) (PP-LOC (IN in) (NP_MARKABLE_place (PRP$_MARKABLE_human our) (JJ little) (NN hometown)))))))

You cannot output arbitrary amounts of context, so writing a script to extract contexts might be more useful. You can also have tgrep2 output a list of trees corresponding to the codes in a file.
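If you want to process the codes in a script of your own, the code-plus-terminals output shown earlier is easy to take apart. The following is a minimal Python sketch, not part of TGrep2 or of the tutorial's scripts: the function names parse_code and group_by_tree are hypothetical, and it assumes each output line has the form code, tab, text, with codes in the number(:number)+ format described above.

```python
import re
from collections import defaultdict

# Codes look like "17:73": a tree number followed by one or more subtree numbers.
CODE_RE = re.compile(r"\d+(:\d+)+")


def parse_code(code):
    """Split a code such as '17:73' into a tuple of integers, e.g. (17, 73)."""
    code = code.strip()
    if not CODE_RE.fullmatch(code):
        raise ValueError(f"not a tree:subtree code: {code!r}")
    return tuple(int(part) for part in code.split(":"))


def group_by_tree(lines):
    """Map each tree number to the (code, text) pairs found in 'code<TAB>text' lines."""
    matches = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        code, _, text = line.partition("\t")
        tree_number = parse_code(code)[0]
        matches[tree_number].append((code, text))
    return dict(matches)


if __name__ == "__main__":
    sample = [
        "17:73\t400945 who , uh , 400945 ran the nursing home in our little hometown",
        "21:68\t400B34 that 400B21 we had 400B21 to do it 400B34",
    ]
    for tree, found in sorted(group_by_tree(sample).items()):
        print(tree, [code for code, _ in found])
```

Grouping by tree number is only one possible design. It is convenient when several matches fall in the same tree and you want to look up the surrounding context once per tree.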
All the output formatting commands work with the -e option.

Wrappers

If there is a particular set of options that you tend to reuse, or if it is cumbersome to remember the right set of options to use, then you can use a "wrapper" around TGrep2, i.e., a script that executes TGrep2. For example, suppose that you always want to look backwards 10 lines of discourse, and you don't want to keep re-typing the formatting option:
You can write a short Perl script that does this for you, so that you can just write the following instead.
Create a file called "tgrep2cxt.pl" containing the following lines of code (the .pl extension means "perl", and "cxt" stands for "context").
    # The first command-line argument is the TGrep2 pattern.
    $pattern = $ARGV[0];
    # Print 10 lines of preceding context, then the match.
    system("tgrep2 -m \'%t10b\\n%t9b\\n%t8b\\n%t7b\\n%t6b\\n%t5b\\n%t4b\\n%t3b\\n%t2b\\n%t1b\\n%lh\\n\' \'$pattern\'");

$ARGV[0] refers to the first argument given to the script on the command line. The "system" function executes a command as if it were given on the command line. Each backslash (\) in the format string is itself preceded by a backslash, because otherwise Perl would interpret "\n" as a newline character rather than as a backslash character followed by an 'n' character. The backslash is an "escape" character: the first backslash allows the following backslash to "escape" its normal interpretation. Then make the script executable by typing:
This script could be modified to allow arbitrary numbers of lines of context, with a little more Perl programming.

The wrapper that we will use for the purposes of the next step in the tutorial is called myTGrep2.pl. It was originally developed by Liz Coppock and has been slightly updated and extended by Florian Jaeger. Again, make the script executable by typing:
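As an aside on the remark that tgrep2cxt.pl could be extended to arbitrary numbers of context lines: the format string in that script follows a regular pattern, so it can be generated instead of typed out. The sketch below is in Python rather than Perl and is not one of the tutorial's scripts. The names context_format and build_command are hypothetical, and it assumes that tgrep2 is on your PATH, that the corpus is already configured as elsewhere in this tutorial, and that the %tNb items behave for other values of N as they do for 1 to 10 in tgrep2cxt.pl.

```python
def context_format(n_before):
    """Build a -m format string: n_before context items, then the match (%lh).

    Each item ends in a literal backslash followed by 'n', which is what
    tgrep2cxt.pl passes to tgrep2.
    """
    if n_before < 0:
        raise ValueError("n_before must be non-negative")
    # Count down so the most distant context line is printed first.
    parts = [f"%t{k}b\\n" for k in range(n_before, 0, -1)]
    parts.append("%lh\\n")
    return "".join(parts)


def build_command(pattern, n_before=10):
    """Return the argument list for the tgrep2 call made by tgrep2cxt.pl."""
    # A list of arguments avoids the shell quoting needed in the Perl version.
    return ["tgrep2", "-m", context_format(n_before), pattern]


if __name__ == "__main__":
    # Print the command instead of running it, so this works without tgrep2.
    print(build_command("NP", n_before=3))
```

To run the search, the list returned by build_command can be passed to subprocess.run. Because no shell is involved, the pattern does not need the extra layer of quoting used in the Perl wrapper.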
By default, the script prints out only the node ID. This behavior can be overridden if one of the following options is given:
Try it out by typing this at the command line: