Tutorial

::

Collecting and Analyzing Data with TGrep2

 
[Step 0: Setting up TGrep2 Step 1: Collecting Data with TGrep2 Step 2: Preparing a database Step 3: Analyzing your data]
 

Step 1: Collecting Data with TGrep2

Basics

  • Follow the instructions on Setting up TGrep2.
  • Exercise. Try to find all complement clauses that either have a "that" complementizer or no complementizer at all. An example would be "Neal didn't think (that) I would mess around with this". Examples like "He didn't believe what he saw." are not what we are looking for.
    1. By searching with the pattern 'TOP << told' (or use your favorite sentential complement verb), find out what tag is given to complement clauses.
    2. (a) Now do a search for just this tag. What other structures do you find besides complement clauses?
      (b) What other complementizers do you find (find at least three in addition to "that")?
    3. Modify your pattern to try to get only sentential complements. Don't worry if you can't get rid of all the extraneous hits (it's a bit beyond the tools we've given you), just get a high percentage of complement clauses in your output.
    You may find the following resources useful: Solution.

Macros

Our search for complement clauses allows structures with intervening material between the verb and the complement, such as direct objects, e.g. "She told me that...". We may not want to deal with this as a factor, so we can add another condition to make sure no word intervenes between the verb and the complement clause. If we spelled out this condition in one line, it would look like this:

    '(SBAR=cc !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/ !< "-NONE-")) $ (/^VB/ ! .. ((/^[a-zA-Z]+.*/ !< *) .. =cc))) > `VP'

This pattern involves several new features of regular expressions:

  • Character classes: [a-zA-Z] represents the class of characters that are alphabetic letters.
  • The dot (.), which matches any single character.
  • The + and * operators: * means "zero or more instances", + means "one or more".
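
To get a feel for the regular expression between the slashes, it can help to try it outside TGrep2. The following is only a rough sketch using grep -E; TGrep2 does its own regular expression matching, so details may differ. The helper name is_word is made up for this illustration, and the sample labels are taken from the tree output shown later in this tutorial.

```shell
# is_word is a hypothetical helper, not part of TGrep2.
# It succeeds if its argument matches ^[a-zA-Z]+.* (i.e., starts with a letter).
is_word() {
    printf '%s\n' "$1" | grep -Eq '^[a-zA-Z]+.*'
}

# Try a few node labels and terminals.
for label in that VBD -NONE- 400B34
do
    if is_word "$label"; then
        echo "$label: matches"
    else
        echo "$label: no match"
    fi
done
```

Note that a category label such as VBD matches just as well as the word "that". This is why the full pattern adds a further condition to restrict the match to terminal nodes.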

It also involves some new features of TGrep2:

  • The precedence operator '..': 'A .. B' means that A precedes B. Note that this does not imply that A is a sister of B.
  • Labelling: we want to refer to the SBAR node later in the pattern, so we label it "cc".
  • Selecting nodes for printing: the backquote (`) marks the VP node for printing.

If you still don't know what this pattern means, don't worry. It should be much more understandable once we simplify the pattern using macros.

The part of the search before the dollar sign ($) is the general search for complement clauses. If we define that entire pattern as a macro called @COMPCLAUSE, then we can greatly simplify the pattern to just:

    '((SBAR=cc = @COMPCLAUSE) $ (/^VB/ ! .. ((/^[a-zA-Z]+.*/ !< *) .. =cc))) > `VP'

To do this, open a new file called "my-macros.ptn" and add the following line. Put a space after the @-sign and a tab character after COMPCLAUSE, and don't forget the semicolon at the end:

    @ COMPCLAUSE       (SBAR !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/ !< "-NONE-")));

With that defined as a macro, you can replace that part of the pattern with @COMPCLAUSE, as long as you tell TGrep2 where to find the definition of @COMPCLAUSE. Call TGrep2 as follows:

    tgrep2 my-macros.ptn '((SBAR=cc = @COMPCLAUSE) $ (/^VB/ ! .. ((/^[a-zA-Z]+.*/ !< *) .. =cc))) > `VP'

Next, let's concentrate on this part:

    /^[a-zA-Z]+.*/

This means, "anything that starts with an alphabetic character", i.e., basically, "any word". Let's define this as a macro called @WORD. Open "my-macros.ptn" and add the line:

    @ WORD       /^[a-zA-Z]+.*/;

Now you can replace that part of the pattern with @WORD. Call TGrep2 as follows:

    tgrep2 my-macros.ptn '((SBAR=cc = @COMPCLAUSE) $ (/^VB/ ! .. ((@WORD !< *) .. =cc))) > `VP'

Now let's concentrate on this part:

    (@WORD !< *)

The asterisk (*) matches any node (no forward slashes required for this one). Thus this means, "a word that does not dominate any node", in other words, "a terminal node that is a word". We can define a new macro, called @TERMWORD, that expresses this pattern using the @WORD macro. Open up my-macros.ptn again and add the following line after the @WORD macro is defined:

    @ TERMWORD       (@WORD !< *);

Now we've simplified the search to:

    '((SBAR=cc = @COMPCLAUSE) $ (/^VB/ ! .. (@TERMWORD .. =cc))) > `VP'

This is more readable. It means, "a complement clause that is a sister to a verb that does not precede any terminal words which themselves precede the complement clause", in other words, "a complement clause that is adjacent to the verb that introduces it".
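
For reference, the three macro definitions from this section can be written to the file in one go. This is a sketch that assumes the macro line format described above (space after the @-sign, tab after the name, semicolon at the end); printf is used so that the tab character is explicit rather than typed by hand.

```shell
# Write the three macro definitions to my-macros.ptn.
# Each line has the form: "@ NAME<TAB>pattern;"
{
    printf '@ %s\t%s;\n' COMPCLAUSE '(SBAR !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/ !< "-NONE-")))'
    printf '@ %s\t%s;\n' WORD '/^[a-zA-Z]+.*/'
    printf '@ %s\t%s;\n' TERMWORD '(@WORD !< *)'
} > my-macros.ptn

# Show the result.
cat my-macros.ptn
```

The order matters here: @TERMWORD refers to @WORD, so @WORD is defined first.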

In fact (wouldn't you know it), Florian Jaeger has developed an elaborate set of macros for the purpose of extracting complement and relative clauses. Copy this file into your home directory (right-click and save it and then make sure that you transfer it to your directory on the corpus server). If you are logged into our HLP machines, you can also copy the latest version from the TDT directory (/p/hlp/tools/TDT/) by typing:

    cp /p/hlp/tools/TDT/MACROS.ptn ~

Now you can run the search for complement clauses that are adjacent to the verb that introduces them simply by typing:

    tgrep2 -afl MACROS.ptn '@CC_OPT'

Formatting output

By default, tgrep2 prints the head of the pattern, which is the leftmost node. This is very important when you want to understand the function of the -af options. These options cause tgrep2 to find all instances of the head, so for the purposes of getting counts, the head of your pattern must be the construction you are looking for. Basic output is formatted with command line switches. -l and -t are the ones you will use most.

The -l option pretty-prints the subtree matched by your pattern:

    tgrep2 -l 'SBAR'
(SBAR (WHADVP (N 400B34)
              (WDT that))
      (S (NP-SBJ_MARKABLE_human (N 400B21)
                                (PRP we))
         (VP (VBD had)
             (S (NP-SBJ_MARKABLE (-NONE- (N 400B21)))
                (VP (TO to)
                    (VP (VB do)
                        (NP_MARKABLE_nonconc (PRP it))
                        (ADVP-TMP (-NONE- (N 400B34)))))))))

The -t option prints the terminals (terminals are leaf nodes, but keep in mind that terminals can contain lots of non-word information like coreference and speaker id codes).

    tgrep2 -t 'SBAR'
4002A7 -NONE- 400298 you 'd ever want 400298 to do 4002A7 , you know . Unless it 's just , you know , really , you know , \[ and \+ , uh ,
 \[ for their , \+ uh , you know , for their \] own good

Because the head of the pattern has a special meaning, you should not rely on the head to determine what is printed. Fortunately, tgrep2 gives you great flexibility in how the output of your search is formatted. The basic syntax is to call tgrep2 with an output option (-m in the examples below) followed by a format string that describes your desired output. The -m option causes tgrep2 to print this output for every match to your pattern. Other options include -s, which causes output for every whole tree in the corpus, regardless of whether a match was successful.

Perhaps the most useful output is the numeric code corresponding to the subtree that matches your pattern. tgrep2 codes are in the following format: number(:number)+. The first number corresponds to the number of the whole tree containing the match, counting from the beginning of the corpus file to the end. The other numbers identify subtrees, counting from top to bottom and left to right in the tree. Subtree codes are very important for two reasons:

  1. They allow you to pull trees and subtrees directly out of the corpus.
  2. The code is a unique identifier and can therefore be used as a case id for extracted data! This allows you to go back to the primary data at any point later when you work on your database.
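
Because a code is just numbers joined by colons, it is easy to take apart or sort with standard shell tools. The sketch below uses the three codes from the example output in this section; the variable names are only illustrative.

```shell
# Split a code such as 17:73 into its tree number and its node number.
code='17:73'
tree=${code%%:*}    # everything before the first colon
node=${code#*:}     # everything after the first colon
echo "tree=$tree node=$node"

# Sort a list of codes numerically, first by tree and then by node.
printf '%s\n' 21:68 5:73 17:73 | sort -t: -k1,1n -k2,2n
```

Sorting numerically on each field keeps the case ids in corpus order, which a plain alphabetical sort would not (it would put 17:73 before 5:73).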

To print the code for the tree matched by the pattern:

    tgrep2 -m '%xh\n' 'SBAR'
5:73
17:73
21:68

To print code and tree separated by a tab (note use of \t for tab and \n for newline):

    tgrep2 -m '%xh\t%h\n' 'SBAR'
17:73   (SBAR (WHNP_MARKABLE_human (N 400945) (WP who)) (, ,) (INTJ (UH uh)) (, ,) (S (NP-SBJ_MARKABLE (-NONE- (N 400945))) (VP (VBD ran) (NP_MARKABLE_org (NP_MARKABLE_org (DT the) (NN nursing) (NN home)) (PP-LOC (IN in) (NP_MARKABLE_place (PRP$_MARKABLE_human our) (JJ little) (NN hometown)))))))

To print code and terminals separated by a tab:

    tgrep2 -m '%xh\t%th\n' 'SBAR'
17:73   400945 who , uh , 400945 ran the nursing home in our little hometown
21:68   400B34 that 400B21 we had 400B21 to do it 400B34
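
Tab-separated output like this is convenient because each column can be pulled out with standard tools. The following sketch does not call tgrep2; it rebuilds the two output lines above as sample data and separates the columns with cut.

```shell
# Two sample lines copied from the output above (code, TAB, terminals).
sample=$(printf '%s\t%s\n' \
    '17:73' '400945 who , uh , 400945 ran the nursing home in our little hometown' \
    '21:68' '400B34 that 400B21 we had 400B21 to do it 400B34')

# Column 1: the subtree codes.
printf '%s\n' "$sample" | cut -f1

# Column 2: the terminals.
printf '%s\n' "$sample" | cut -f2
```

In real use, the sample data would be replaced by the output of the tgrep2 command, piped or redirected to a file.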

To print code and pretty-printed tree separated by a tab:

    tgrep2 -m '%xh\t%lh\n' 'SBAR'
21:68   (SBAR (WHADVP (N 400B34)
              (WDT that))
      (S (NP-SBJ_MARKABLE_human (N 400B21)
                                (PRP we))
         (VP (VBD had)
             (S (NP-SBJ_MARKABLE (-NONE- (N 400B21)))
                (VP (TO to)
                    (VP (VB do)
                        (NP_MARKABLE_nonconc (PRP it))
                        (ADVP-TMP (-NONE- (N 400B34)))))))))

To print code and node marked for printing (node with backquote):

    tgrep2 -m '%xm\t%tm\n' '(SBAR !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/ !< "-NONE-")) $ /^VB/) > `VP'
174:22  was \] lucky 403F31 too 403F31 that I only have one brother
214:8   think that 404C45 when she passed away 404C45 it was probably one of the greatest

To print code and node(s) labelled in pattern (this can also be used to print multiple nodes):

    tgrep2 -m '%x=sbar=\t%t=sbar=\n%x=sbj=\t%t=sbj=\n---\n' '(SBAR=sbar !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/=sbj !< "-NONE-")) $ /^VB/) > `VP'
118:9   that \[ they , \+ they \] had a great deal of , um
118:25  they \]
---
157:86  [...] that my children would do something like , that for me
157:122 my children
---

To print tree and one line of context before (TOP node before the one containing the match):

    tgrep2 -m '%t1b\n%lh\n' 'SBAR'
Speaker1133_10 .
(SBAR (WHNP_MARKABLE_org (N 40121A)
                         (WDT that))
      (S (NP-SBJ_MARKABLE_human (PRP we))
         (ADVP-TMP (RB finally))
         (VP (VBD had)
             (NP_MARKABLE (-NONE- (N 40121A))))))

To print tree and 10 lines of context before:

    tgrep2 -m '%t10b\n%t9b\n%t8b\n%t7b\n%t6b\n%t5b\n%t4b\n%t3b\n%t2b\n%t1b\n%lh\n' 'SBAR'
Yes . E_S
Yeah . E_S
Speaker1133_4 .
400617 I 'd be very very careful \[ and , \+ uh , you know , 400617 checking them out . E_S
Uh , our , E_S
400725 -NONE- had 400725 t- , place my mother in a nursing home . E_S
She had a rather massive stroke \[ about , \+ uh , about \] uh , eight months ago I guess . E_S
Speaker1169_5 .
Uh-huh . E_S
Speaker1133_6 .
(SBAR (WHNP_MARKABLE_human (N 400945)
                           (WP who))
      (, ,)
      (INTJ (UH uh))
      (, ,)
      (S (NP-SBJ_MARKABLE (-NONE- (N 400945)))
         (VP (VBD ran)
             (NP_MARKABLE_org (NP_MARKABLE_org (DT the)
                                               (NN nursing)
                                               (NN home))
                              (PP-LOC (IN in)
                                      (NP_MARKABLE_place (PRP$_MARKABLE_human our)
                                                         (JJ little)
                                                         (NN hometown)))))))

You cannot output arbitrary amounts of context, so writing a script to extract contexts might be more useful.

You can also have tgrep2 output a list of trees corresponding to the codes in a file.

    tgrep2 -m '%xh\n' 'SBAR' > test.txt
    tgrep2 -e test.txt -l

All the output formatting commands work with the -e option.

Wrappers

If there is a particular set of options that you tend to reuse, or if it is cumbersome to remember the right set of options to use, then you can use a "wrapper" around TGrep2, i.e., a script that executes TGrep2.

For example, suppose that you always want to look backwards 10 lines of discourse, and you don't want to keep re-typing the formatting option:

    tgrep2 -m '%t10b\n%t9b\n%t8b\n%t7b\n%t6b\n%t5b\n%t4b\n%t3b\n%t2b\n%t1b\n%lh\n' SBAR

You can write a short Perl script that does this for you, so that you can just type the following instead.

    ./tgrep2cxt.pl SBAR

Create a file called "tgrep2cxt.pl" containing the following lines of code (the .pl extension means "perl", and "cxt" stands for "context").

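    #!/usr/bin/perl
    use strict;
    use warnings;

    # The search pattern is the first command-line argument.
    my $pattern = $ARGV[0];
    die "Usage: $0 PATTERN\n" unless defined $pattern;

    # Print ten trees of preceding context, then the pretty-printed match.
    system("tgrep2 -m \'%t10b\\n%t9b\\n%t8b\\n%t7b\\n%t6b\\n%t5b\\n%t4b\\n%t3b\\n%t2b\\n%t1b\\n%lh\\n\' \'$pattern\'");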

$ARGV[0] refers to the first argument given to the script on the command line. The "system" function executes a command as if it were given on the command line. All of the backslashes (\) in the format string are themselves preceded by backslashes because otherwise Perl would interpret "\n" as a newline character, rather than as a backslash character followed by an 'n' character. The backslash is an "escape" character: the first backslash allows the following backslash to "escape" its normal interpretation.

Then make the script executable by typing:

    chmod u+x tgrep2cxt.pl

and run the script as indicated above.

This script could be modified to allow arbitrary numbers of lines of context, with a little more perl programming.
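
The core of such a modification is building the format string for a given number of context lines. The following is a minimal sketch of that idea in shell rather than Perl; context_format is a hypothetical helper name, and how many lines of context tgrep2 will actually accept has not been checked here.

```shell
# context_format is a hypothetical helper: it builds the -m format string
# for N trees of preceding context followed by the pretty-printed match.
context_format() {
    n=$1
    fmt=''
    while [ "$n" -ge 1 ]; do
        # "\\n" inside double quotes yields a literal backslash followed by n.
        fmt="${fmt}%t${n}b\\n"
        n=$((n - 1))
    done
    printf '%s' "${fmt}%lh\\n"
}

# Show the format string for three lines of context.
context_format 3; echo

# Intended use (requires tgrep2 and a corpus):
#   tgrep2 -m "$(context_format 10)" 'SBAR'
```

With an argument of 10, the function produces the same format string that was typed out by hand above.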

The wrapper that we will use for the purposes of the next step in the tutorial is called myTGrep2.pl. It was originally developed by Liz Coppock and has been slightly updated and extended by Florian Jaeger.

Again, make the script executable by typing:

    chmod u+x myTGrep2.pl

By default, the script prints out only the node ID. This behavior can be overridden if one of the following options is given:

  • -printByLine: this prints out the node labelled "=print" in the pattern.
  • -printWholeString: this prints out the terminals under the matched node.

It also looks for the file named MACROS.ptn in the TDT directory (/p/hlp/tools/TDT/), your scripts directory (~/scripts/), and the directory from which you call myTGrep2.pl (in that order). It uses the last MACROS.ptn file found to interpret the search pattern you specify.
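
The "last file found wins" rule means that a MACROS.ptn in your current directory takes priority over one in ~/scripts/, which in turn takes priority over the shared copy. The sketch below illustrates the rule only; it is not the code of myTGrep2.pl, and find_macros is a hypothetical helper name.

```shell
# find_macros is a hypothetical helper illustrating "last file found wins".
# It checks each directory in order and remembers the most recent hit.
find_macros() {
    found=''
    for dir in "$@"; do
        if [ -f "$dir/MACROS.ptn" ]; then
            found="$dir/MACROS.ptn"
        fi
    done
    [ -n "$found" ] && printf '%s\n' "$found"
}

# Same search order as described above.
find_macros /p/hlp/tools/TDT "$HOME/scripts" . || echo "no MACROS.ptn found"
```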

Try it out by typing this at the command line:

    ./myTGrep2.pl -printWholeString 'NP < CC' | less



Homework exercise for Part 2 of the tutorial



 