Tutorial

::

Collecting and Analyzing Data with TGrep2

 
[Step 0: Setting up TGrep2 Step 1: Collecting Data with TGrep2 Step 2: Preparing a database Step 3: Analyzing your data]
 

Finding Complement Clauses in Parsed Corpora

Exercise

Try to find all complement clauses that either have a "that" complementizer or no complementizer at all. An example would be "Neal didn't think (that) I would mess around with this". Examples like "He didn't believe what he saw." are not what we are looking for.

Solution

When you search for 'TOP << told', you find that complement clauses are labelled SBAR, so we search specifically for that tag. We add the condition that the SBAR be immediately dominated by a VP (> means 'is immediately dominated by', that is, 'is a child of') so that we can see the verb and its complement. The VP is marked with a backquote (`) so that it is printed instead of the head node of the pattern, which is what is printed by default.

    'SBAR > `VP'

This gives you outputs like the following:

(VP (VBP think)
    (SBAR (IN that)
          (S (EDITED (RM (-DFL- \[))
                     (NP-SBJ (PRP they))
                     (, ,)
                     (IP (-DFL- \+)))
             (NP-SBJ_MARKABLE_human (PRP they)
                                    (RS (-DFL- \])))
             (VP (VBD had)
                 (NP_MARKABLE_oanim (NP_MARKABLE_oanim (DT a)
                                                       (JJ great)
                                                       (NN deal))
                                    (PP-UNF (IN of)
                                            (, ,)
                                            (INTJ (UH um))))))))

This is an example of what we want, but it also gets:

(SBAR (WHNP_MARKABLE_human (N 400945)
                           (WP who))
      (, ,)
      (INTJ (UH uh))
      (, ,)
      (S (NP-SBJ_MARKABLE (-NONE- (N 400945)))
         (VP (VBD ran)
             (NP_MARKABLE_org (NP_MARKABLE_org (DT the)
                                               (NN nursing)
                                               (NN home))
                              (PP-LOC (IN in)
                                      (NP_MARKABLE_place (PRP$_MARKABLE_human our)
                                                         (JJ little)
                                                         (NN hometown)))))))

So this search gets not only complement clauses but also free relatives, which are headed by a wh-element. We can modify the search to rule those out:

    'SBAR !< /^WH/'

/^WH/ is a regular expression that matches any node label beginning with WH. The caret (^) anchors the match to the beginning of the label. The exclamation point (!) negates the relation that follows it, so !< means "does not immediately dominate".
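To see which labels the expression picks out, here is a minimal Python sketch. It uses Python's re module only as a stand-in for TGrep2's own regular-expression matching, and the list of labels is a hand-picked illustration rather than output from the corpus.

```python
import re

# Stand-in for the TGrep2 node pattern /^WH/: match at the start of a label.
wh_label = re.compile(r"^WH")

# Illustrative labels; only the first comes from the example output above.
labels = ["WHNP_MARKABLE_human", "WHADVP", "IN", "S", "NP-SBJ"]

# Keep the labels that begin with "WH".
matches = [label for label in labels if wh_label.search(label)]
print(matches)  # ['WHNP_MARKABLE_human', 'WHADVP']
```

An SBAR with any such child is excluded by the !< /^WH/ condition.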

The revised search still returns some clauses that are introduced by a word other than "that":


(SBAR (IN Because)
      (S (EDITED (RM (-DFL- \[))
                 (S (NP-SBJ (PRP it))
                    (VP-UNF (VBD was)))
                 (, ,)
                 (IP (-DFL- \+)))
         (PRN (S (NP-SBJ_MARKABLE_human (PRP you))
                 (VP (VBP know)))
              (, ,))

The complementizer "that" is tagged IN, but so are other complementizers, such as "because". This search removes the other complementizers:

    'SBAR !< /^WH/ !< (IN !< "that")'

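What we have added means: "make sure that the SBAR does not immediately dominate a complementizer [IN] that does not itself immediately dominate the word 'that'". An SBAR with no IN child at all (a zero complementizer) still passes this test.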
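The double negation is easy to misread, so here is a small Python sketch of the same logic. It is not how TGrep2 works internally; the nested-tuple tree encoding, the helper names and the four toy clauses are all assumptions made for illustration.

```python
# A node is (label, child, child, ...); a word is a plain string.
def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return [] if isinstance(node, str) else list(node[1:])

def passes_filter(sbar):
    """Mimics: SBAR !< /^WH/ !< (IN !< "that")"""
    for child in children(sbar):
        # !< /^WH/ : no child whose label begins with WH
        if label(child).startswith("WH"):
            return False
        # !< (IN !< "that") : no IN child that lacks the word "that"
        if label(child) == "IN" and all(
            label(word) != "that" for word in children(child)
        ):
            return False
    return True

that_clause = ("SBAR", ("IN", "that"), ("S", ("NP-SBJ", ("PRP", "they"))))
because_clause = ("SBAR", ("IN", "Because"), ("S", ("NP-SBJ", ("PRP", "it"))))
zero_clause = ("SBAR", ("S", ("NP-SBJ", ("PRP", "I"))))
wh_clause = ("SBAR", ("WHNP", ("WP", "who")), ("S", ("NP-SBJ", ("-NONE-", "*T*"))))

results = [passes_filter(c) for c in
           (that_clause, because_clause, zero_clause, wh_clause)]
print(results)  # [True, False, True, False]
```

Only the "that" clause and the zero-complementizer clause survive, which is the intended effect of the added condition.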

But now we still have to rule out reduced subject-extracted relative clauses:


(SBAR -NONE-
      (S (NP-SBJ_MARKABLE (-NONE- (N 400AB0)))
         (VP (VBD was)
             (ADJP-PRD (JJ interesting)))))

Therefore we need to require S to dominate a non-zero subject:

    'SBAR !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/ !< "-NONE-"))'

This search also gets extraposition structures, though:


(VP (BES 's)
    (ADJP-PRD (JJ proven))
    (SBAR (N 40500B)
          (IN that)
          (S (NP-SBJ_MARKABLE_nonconc (PRP it))
             (VP (VBZ is)
                 (RB n't)
                 (ADJP-PRD (JJ true))))))

So we need to require SBAR to be the sister of a verb, which we approximate by requiring that it be immediately dominated by a VP:

    '(SBAR !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/ !< "-NONE-"))) > `VP'

The backquote (`) tells TGrep2 to print the node it precedes. The greater-than sign (>) means "is immediately dominated by".

This search should give us all of the (properly annotated) complement clauses with "that" or a zero complementizer.
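As a cross-check on how the conditions combine, the final pattern can be paraphrased as a Python predicate over toy trees. This is only a sketch: the nested-tuple encoding, the function names and the two sample trees are hypothetical, and TGrep2's actual matching and reporting may differ in detail (for example, in how many matches it reports per sentence).

```python
# A node is (label, child, child, ...); a word is a plain string.
def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return [] if isinstance(node, str) else list(node[1:])

def has_overt_subject(sbar):
    """< (S < (/^NP-SBJ/ !< "-NONE-"))"""
    for s in children(sbar):
        if label(s) != "S":
            continue
        for np in children(s):
            if label(np).startswith("NP-SBJ") and all(
                label(c) != "-NONE-" for c in children(np)
            ):
                return True
    return False

def is_target_sbar(node):
    """SBAR !< /^WH/ !< (IN !< "that") < (S < (/^NP-SBJ/ !< "-NONE-"))"""
    if label(node) != "SBAR":
        return False
    for child in children(node):
        if label(child).startswith("WH"):
            return False
        if label(child) == "IN" and all(
            label(w) != "that" for w in children(child)
        ):
            return False
    return has_overt_subject(node)

def matching_vps(tree):
    """Yield each VP that has a qualifying SBAR child (the backquoted node)."""
    if label(tree) == "VP" and any(is_target_sbar(k) for k in children(tree)):
        yield tree
    for k in children(tree):
        yield from matching_vps(k)

think_tree = ("S", ("NP-SBJ", ("PRP", "I")),
              ("VP", ("VBP", "think"),
               ("SBAR", ("IN", "that"),
                ("S", ("NP-SBJ", ("PRP", "they")), ("VP", ("VBD", "left"))))))
reduced_tree = ("VP", ("VBD", "saw"),
                ("SBAR", ("S", ("NP-SBJ", ("-NONE-", "*T*")),
                          ("VP", ("VBD", "was")))))

verbs = [vp[1][1] for vp in matching_vps(think_tree)]
rejected = list(matching_vps(reduced_tree))
print(verbs, rejected)  # ['think'] []
```

The first tree yields the VP headed by "think"; the second is rejected because its embedded subject is empty.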

 