Overview
I currently don't have time to update this page or add detailed manuals. For an introduction to the available alpha release,
see the tutorial I gave for the
Laboratory Syntax class on 10/30. The current release is from 12/08/05 (alpha release v.27). Have a look
at the history of changes.
Available Features
I haven't released all scripts yet, but here is a list of the released
and the unreleased features. Please contact me if you would like to use scripts that I
have not made publicly available. The most recent version is v.29,
which contains many improvements over the most recent public release.
Unreleased Features
- add information to variables that count number of occurrences
(e.g. number of disfluencies in a complement clause)
- add information on the phonology and lexical stress
- add phonetic information (duration, pitch, intensity, speech rate,
surrounding pauses, etc.; only available for Switchboard)
- add unigram and bigram information
- add conditional probabilities for events in the database
- merge information in the database with other databases
- add information about priming (alpha and beta persistence)
Coming up/Planned Features
- add information about addressee to addConversationInfo.pl.
- import speaker-mapping corrections (the current version of addConversationInfo.pl uses the old Switchboard mappings, which are swapped
for some conversations. This doesn't matter much unless one is interested in research on speakers across conversations, since the entire
conversation mapping is swapped, i.e. speaker A is speaker B and vice versa).
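To illustrate why a swapped mapping is harmless within a conversation but matters when tracking speakers across conversations, here is a minimal Python sketch. The conversation IDs, speaker IDs, the `SWAPPED` set, and both function names are hypothetical; none of them come from the Switchboard tables or from addConversationInfo.pl.

```python
# Hypothetical data: conversation ID -> {side: speaker ID}, as old mappings might list them.
OLD_MAPPING = {
    1001: {"A": 11, "B": 22},
    1002: {"A": 22, "B": 33},
}

# Hypothetical set of conversations whose sides are swapped in the old mappings.
SWAPPED = {1002}


def corrected_mapping(old, swapped):
    """Return a copy of the mapping with sides A and B exchanged for swapped conversations."""
    fixed = {}
    for conv_id, sides in old.items():
        if conv_id in swapped:
            fixed[conv_id] = {"A": sides["B"], "B": sides["A"]}
        else:
            fixed[conv_id] = dict(sides)
    return fixed


def conversations_by_speaker(mapping, side):
    """Group conversation IDs by the speaker found on the given side."""
    result = {}
    for conv_id, sides in mapping.items():
        result.setdefault(sides[side], []).append(conv_id)
    return result
```

In this toy data, each conversation still has two distinct sides before and after the correction, so within-conversation analyses are unaffected. Only analyses that follow one speaker across conversations (here, speaker 22 on side B) depend on the corrected mapping.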
Download and Package Information
- Scripts - these are Perl scripts that take TGrep2 output and format it. More will come later.
- The other scripts that I have released so far are available in this directory (along with the
necessary databases). The scripts use the
CMU Pronouncing Dictionary and a database I constructed out of tables that come with the original Switchboard Corpus
release. The database is used to extract information about speakers and conversation topic, etc. from the
Switchboard corpus. If you want to use this option you need a Switchboard license.
Known bugs
Please let me know if you find any bugs.
- There seems to be a bug in addWordformBigrams.pl (v.04) but I haven't found it yet.
History of Changes
- v.27
- Debugging of core functions in format.pl.
- Completely revised addConversationInfo.pl (new: v.61), which is now much faster and also records, for each sentence,
its position within the conversation (NB: this is not the same as 'turn' information, since a 'turn'
can contain several contentful TOP nodes).
- Debugging of addPriming.pl (new: v.37) (this was a serious bug which led to wrong counting of primes). The new script
counts distance in TOP nodes (only those that have content) and in words. Turn information is currently not used but should be implemented
in the future.
- Wrote several small helper programs that help to create databases (such as the one used to add speaker information, etc.
for the Switchboard corpus). One of these helpers is used to overcome a shortcoming of TGrep2 with regard to the counting of
words in a sentence when there are several cases (observations of interest) within one TOP node.
- v.26
- Added several new core functions in format.pl that help in developing further scripts.
- Added a script for importing factors from other databases, import_factor.pl. It makes it easy to merge information
from several databases into one. This is useful if there is additional annotation, etc. that you want to merge into your
database.
- CHANGE: Modifying factors that are not in the database now causes scripts to die (unless option '+c' is used). This is
necessary as a precaution (otherwise one ends up too often with wrong data).
- Further standardization and debugging of scripts.
- v.25
- New script addConversationInfo.pl supersedes addSpeakerinfo.pl. The script adds information about
the speaker (gender, education, age, dialect, payment, whether it was the caller 'A' or callee 'B'), conversation (topic,
ID, rated complexity of transcription, rated variation in topic-continuity, rated naturalness), and the position of the utterance
within the conversation.
- Some minor bug fixes (warning messages that didn't work; spurious print commands)
- v.24
- New option to switch warnings on and off (+/-w).
- New option to create variables that are not given in the factor file (+/-c). Existing variables are overridden
if this option is selected.
- All options default to "-" (off).
- Some of the command line syntax changed due to the introduction of the options (but they can be omitted). Call the
scripts without arguments for details.
- NB: Not all scripts actually use these features yet, but I have included them since they will be used in the
future.
- v.2
- The syntax of create_database.pl has changed slightly after release v.2.
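The v.27 entry above says that addPriming.pl counts the distance between prime and target both in contentful TOP nodes and in words. A minimal Python sketch of one way to count such distances follows. It is not the script's actual implementation: the function name, the data layout (one word list per TOP node), and the assumption that a TOP node "has content" when it contains at least one word are all illustrative choices.

```python
def priming_distance(top_nodes, prime, target):
    """Distance between a prime and a target, under illustrative assumptions.

    top_nodes: list of word lists, one per TOP node, in corpus order.
    prime, target: (node_index, word_index) pairs; the prime must not follow the target.
    Returns (distance in contentful TOP nodes, distance in words).
    """
    p_node, p_word = prime
    t_node, t_word = target
    if (p_node, p_word) > (t_node, t_word):
        raise ValueError("prime must not follow target")

    # Assumption: a TOP node counts as contentful if it holds at least one word.
    # Count the contentful nodes after the prime's node, up to and including the target's node.
    node_distance = sum(1 for words in top_nodes[p_node + 1 : t_node + 1] if words)

    def offset(node, word):
        # Absolute word position, counted from the start of the first TOP node.
        return sum(len(words) for words in top_nodes[:node]) + word

    word_distance = offset(t_node, t_word) - offset(p_node, p_word)
    return node_distance, word_distance


# Hypothetical example: the second TOP node is empty and is skipped in the node count.
nodes = [["i", "think", "so"], [], ["yeah"], ["she", "said", "that", "he", "left"]]
print(priming_distance(nodes, prime=(0, 1), target=(3, 1)))  # prints (2, 4)
```

Skipping empty nodes keeps the node distance from being inflated by TOP nodes that carry no words. A turn-based distance, which the entry notes is not yet used, would need speaker information in addition to this layout.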