TGrep2 Database Tools

 
[Articles | Manuscripts | Reviews | Talks | Tutorials | Class-related output]
 

Overview

I currently don't have the time to update this page or add detailed manuals. For an introduction to the available alpha release, I refer you to the tutorial I gave for the Laboratory Syntax class on 10/30. The current release is from 12/08/05 (alpha release v.27). See the history of changes for details.

 

Available Features

I haven't released all scripts yet, but here is a list of the released and the unreleased features. Please contact me if you would like to use scripts that I have not made publicly available. The most recent version is v.29, which contains many improvements over the most recent public release.

    Unreleased Features
  • add variables that count the number of occurrences of an event (e.g. the number of disfluencies in a complement clause)
  • add information on the phonology and lexical stress
  • add phonetic information (duration, pitch, intensity, speech rate, surrounding pauses, etc.; only available for Switchboard)
  • add unigram and bigram information
  • add conditional probabilities for events in the database
  • merge information in the database with other databases
  • add information about priming (alpha and beta persistence)
    Coming up/Planned Features
  • add information about the addressee to addConversationInfo.pl.
  • import speaker-mapping corrections. The current version of addConversationInfo.pl uses the old Switchboard mappings, which are swapped for some conversations. Because the entire conversation mapping is swapped (i.e. speaker A is speaker B and vice versa), this doesn't matter much unless one is interested in research on speakers across conversations.
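
The bigram and conditional-probability features listed above are unreleased, so the following is not the author's implementation. It is only a minimal Python sketch (the actual scripts are Perl) of one common way to estimate the conditional probability of a word given the preceding word from bigram counts. The function name `bigram_conditional_probs` and the input format (one list of words per sentence) are assumptions made for illustration.

```python
from collections import Counter


def bigram_conditional_probs(sentences):
    """Estimate P(w2 | w1) by relative frequency.

    `sentences` is a list of word lists (hypothetical input format).
    Returns a dict mapping (w1, w2) to count(w1 w2) / count(w1 as a
    bigram-initial word), so the estimates for each w1 sum to 1.
    """
    bigram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        # Adjacent word pairs within one sentence; no pairs across sentences.
        pairs = list(zip(words, words[1:]))
        bigram_counts.update(pairs)
        context_counts.update(w1 for w1, _ in pairs)
    return {
        (w1, w2): n / context_counts[w1]
        for (w1, w2), n in bigram_counts.items()
    }
```

Counting the context word only where it starts a bigram keeps each conditional distribution normalized. Dividing by the plain unigram count instead is also common, but then sentence-final occurrences lower the estimates.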

 

Download and Package Information

  • Scripts - these are Perl scripts that take TGrep2 outputs and format them. More will come later.
  • The other scripts that I have released so far are available in this directory (along with the necessary databases). The scripts use the CMU Pronouncing Dictionary and a database I constructed out of tables that come with the original Switchboard Corpus release. The database is used to extract information about speakers and conversation topic, etc. from the Switchboard corpus. If you want to use this option you need a Switchboard license.

 

Known bugs

Please let me know if you find any bugs.

  • There seems to be a bug in addWordformBigrams.pl (v.04) but I haven't found it yet.

 

History of Changes

  • v.27
    • Debugging of core functions in format.pl.
    • Completely revised addConversationInfo.pl (new: v.61), which is now much more efficient and also records, for each sentence, its position within the conversation (NB: this is not the same as 'turn' information, since 'turns' can contain several contentful TOP nodes).
    • Debugging of addPriming.pl (new: v.37). This was a serious bug which led to wrong counting of primes. The new script counts distance in TOP nodes (only those that have content) and in words. Turn information is currently not used but should be implemented in the future.
    • Wrote several small helper programs that help to create databases (such as the one used to add speaker information, etc. for the Switchboard corpus). One of these helpers is used to overcome a shortcoming of TGrep2 with regard to the counting of words in a sentence when there are several cases (observations of interest) within one TOP node.
  • v.26
    • Added several new core functions in format.pl that help in developing further scripts.
    • Added a script for importing factors from other databases, import_factor.pl. It makes it easy to merge information from several databases into one. This is useful if there is additional annotation, etc. that you want to merge into your database.
    • CHANGE: Modification of factors that are not in the database causes scripts to die (unless option '+c' is used). This is necessary as a precaution (otherwise one ends up too often with wrong data).
    • Further standardization and debugging of scripts.
  • v.25
    • New script addConversationInfo.pl supersedes addSpeakerinfo.pl. The script adds information about the speaker (gender, education, age, dialect, payment, whether it was the caller 'A' or callee 'B'), conversation (topic, ID, rated complexity of transcription, rated variation in topic-continuity, rated naturalness), and the position of the utterance within the conversation.
    • Some minor bug fixes (warning messages that didn't work; spurious print commands)
  • v.24
    • New option to switch warnings on and off (+/-w).
    • New option to create variables that are not given in the factor file (+/-c). Existing variables are overridden if this option is selected.
    • The default for each option is "-".
    • Some of the command line syntax changed due to the introduction of the options (but they can be omitted). Call the scripts without arguments for details.
    • NB: Not all scripts actually use these features yet, but I have included them since they will be used in the future.
  • v.2
    • The syntax of create_database.pl has changed slightly after release v.2.
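
The v.27 notes above say that addPriming.pl counts the distance between prime and target both in contentful TOP nodes and in words. The script itself is not shown on this page, so the following Python sketch (the actual scripts are Perl) only illustrates one possible reading of that description. The function name, the input format (one list of words per TOP node, in conversation order), the treatment of an empty list as a node without content, and the exact distance definitions are all assumptions.

```python
def priming_distance(top_nodes, prime, target):
    """Return (node_distance, word_distance) between two TOP nodes.

    top_nodes: list of word lists, one per TOP node (hypothetical format);
               an empty list stands for a TOP node without content.
    prime, target: indices into top_nodes, with prime before target.

    Assumed definitions:
      node_distance = number of contentful TOP nodes strictly between
                      prime and target, plus 1 (adjacent nodes give 1)
      word_distance = number of words in those intervening nodes
    """
    if not 0 <= prime < target < len(top_nodes):
        raise ValueError("need 0 <= prime < target < len(top_nodes)")
    # Skip intervening TOP nodes that have no content.
    between = [node for node in top_nodes[prime + 1:target] if node]
    node_distance = len(between) + 1
    word_distance = sum(len(node) for node in between)
    return node_distance, word_distance
```

For instance, with a prime in the first node, an empty node, a one-word node, and the target in the fourth node, this reading gives a distance of 2 TOP nodes and 1 word. A word-level variant that also counts the words after the prime and before the target inside their own TOP nodes would need word positions for each case, which this sketch leaves out.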
 
 