Center for Language Sciences, University of Rochester

Van Durme Abstracts

Benjamin Van Durme and Ashwin Lall. (2009) Streaming Pointwise Mutual Information. NIPS.

Recent work has led to the ability to perform space-efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real-world data.
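
As a minimal illustration of the quantity being computed (not the paper's streaming algorithm; the function and counts below are hypothetical), PMI reduces to a small formula over counts, whether those counts are exact or approximate:

```python
import math

def pmi(pair_count, x_count, y_count, total_pairs, total_tokens):
    """PMI(x, y) = log( P(x, y) / (P(x) P(y)) ), from raw counts.
    With approximate streaming counts, the same formula yields
    approximate PMI; the paper bounds the error of doing so online."""
    p_xy = pair_count / total_pairs
    p_x = x_count / total_tokens
    p_y = y_count / total_tokens
    return math.log(p_xy / (p_x * p_y))

# Independent events have PMI 0, since P(x, y) = P(x) P(y):
print(round(pmi(1, 10, 10, 100, 100), 6))  # 0.0
```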


Benjamin Van Durme and Ashwin Lall. (2009) Probabilistic Counting with Randomized Storage. Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09).

Previous work by Talbot and Osborne [2007] explored the use of randomized storage mechanisms in language modeling. These structures trade a small amount of error for significant space savings, enabling the use of larger language models on relatively modest hardware. Going beyond space-efficient count storage, here we present the Talbot Osborne Morris Bloom (TOMB) Counter, an extended model for performing space-efficient counting over streams of finite length. Theoretical and experimental results are given, showing the promise of approximate counting over large vocabularies in the context of limited space.
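
The "Morris" component of the counter's name refers to Morris-style probabilistic counting, which can be sketched as follows (an illustrative toy, with names of my own choosing, not the TOMB structure itself):

```python
import random

class MorrisCounter:
    """Approximate counter: store only an exponent c and estimate the
    count as 2**c - 1. Each increment succeeds with probability 2**-c,
    so the expected estimate tracks the true count while the stored
    state needs only O(log log n) bits."""
    def __init__(self):
        self.c = 0

    def increment(self):
        if random.random() < 2.0 ** (-self.c):
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1
```

A Bloom-filter layer, as in TOMB, would share many such small counters across a large vocabulary, trading occasional collisions for further space savings.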


Benjamin Van Durme, Phillip Michalak and Lenhart K. Schubert. (2009) Deriving Generalized Knowledge from Corpora using WordNet Abstraction. EACL'09.

Existing work in the extraction of commonsense knowledge from text has been primarily restricted to factoids that serve as statements about what may possibly obtain in the world. We present an approach to deriving stronger, more general claims by abstracting over large sets of factoids. Our goal is to coalesce the observed nominals for a given predicate argument into a few predominant types, obtained as WordNet synsets. The results can be construed as generically quantified sentences restricting the semantic type of an argument position of a predicate.
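
The coalescing step can be sketched in miniature, with a hand-made hypernym map standing in for WordNet's synset hierarchy (all names and data here are illustrative, not from the paper):

```python
from collections import Counter

# Hypothetical hypernym map playing the role of WordNet.
HYPERNYMS = {
    "dog": ["canine", "animal"],
    "cat": ["feline", "animal"],
    "sparrow": ["bird", "animal"],
    "hammer": ["tool", "artifact"],
}

def predominant_types(nominals, k=1):
    """Coalesce observed argument nominals into the k types that cover
    the most observations, mimicking abstraction to WordNet synsets."""
    counts = Counter()
    for nominal in nominals:
        for hypernym in HYPERNYMS.get(nominal, []):
            counts[hypernym] += 1
    return [t for t, _ in counts.most_common(k)]

# Nominals observed as the subject of some predicate:
print(predominant_types(["dog", "cat", "sparrow", "hammer"]))  # ['animal']
```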


Benjamin Van Durme, Ting Qian and Lenhart K. Schubert. (2008) Class-Driven Attribute Extraction. COLING'08.

We report on the large-scale acquisition of class attributes with and without the use of lists of representative instances, as well as the discovery of unary attributes, such as those typically expressed in English through prenominal adjectival modification. Our method employs a system based on compositional language processing, as applied to the British National Corpus. Experimental results suggest that document-based, open-class attribute extraction can produce results of quality comparable to those obtained using web query logs, indicating the utility of exploiting explicit occurrences of class labels in text.
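
As a hedged baseline illustration (the paper itself uses compositional language processing, not a surface pattern), lexical patterns like "the A of the C" are a common way to surface candidate (attribute, class) pairs from raw text:

```python
import re

# Classic "the <attribute> of the <class>" baseline pattern; purely
# illustrative, not the method described in the abstract above.
PATTERN = re.compile(r"\bthe (\w+) of the (\w+)\b", re.IGNORECASE)

def candidate_attributes(text):
    """Return (attribute, class) candidates found in raw text."""
    return PATTERN.findall(text)

print(candidate_attributes("He noted the population of the city grew."))
# [('population', 'city')]
```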


Benjamin Van Durme and Marius Pasca. (2008) Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction. Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08).

A method is given for the extraction of large numbers of semantic classes along with their corresponding instances. The method is based on the recombination of elements clustered through distributional similarity; experimental results show that the procedure allows for a parametric trade-off between high precision and expanded recall.


Dekang Lin, Shaojun Zhao, Benjamin Van Durme and Marius Pasca. (2008) Mining Parenthetical Translations from the Web by Word Alignment. The 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08).

Documents in languages such as Chinese, Japanese and Korean sometimes annotate terms with their translations in English inside a pair of parentheses. We present a method to extract such translations from a large collection of web documents by building a partially parallel corpus and using a word alignment algorithm to identify the terms being translated. The method is able to generalize across the translations for different terms and can reliably extract translations that occurred only once in the entire web. Our experiment on Chinese web pages produced more than 26 million pairs of translations, which is over two orders of magnitude more than previous results. We show that the addition of the extracted translation pairs as training data provides a significant increase in the BLEU score for a statistical machine translation system.
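
The raw candidates such a method starts from can be harvested with a simple pattern (a sketch of my own; the paper's partially parallel corpus and word-alignment stage are far more involved):

```python
import re

# A run of CJK characters immediately followed by a parenthesized ASCII
# phrase; handles both ASCII and full-width parentheses.
PAREN_PATTERN = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'&-]*[A-Za-z.])\s*[)）]"
)

def candidate_translations(text):
    """Return (CJK term, parenthesized English phrase) candidate pairs."""
    return PAREN_PATTERN.findall(text)

print(candidate_translations("机器翻译（machine translation）的研究"))
# [('机器翻译', 'machine translation')]
```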


Marius Pasca and Benjamin Van Durme. (2008) Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. The 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08).

A new approach to large-scale information extraction exploits both Web documents and query logs to acquire thousands of open-domain classes of instances, along with relevant sets of open-domain class attributes at precision levels previously obtained only on small-scale, manually-assembled classes.
