Van Durme Abstracts
Benjamin Van Durme and Ashwin Lall. (2009) Streaming Pointwise Mutual
Information. NIPS.
Recent work has led to the ability to perform space efficient,
approximate counting over large vocabularies in a streaming context.
Motivated by the existence of data structures of this type, we explore
the computation of associativity scores, otherwise known as
pointwise mutual information (PMI), in a streaming context. We give
theoretical bounds showing the impracticality of perfect online PMI
computation, and detail an algorithm with high expected accuracy.
Experiments on news articles show our approach gives high accuracy on
real world data.
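For reference, the quantity being computed can be illustrated with a small Python sketch over exact corpus counts (the function and toy numbers below are illustrative only, not the paper's streaming algorithm):

    import math

    def pmi(count_xy, count_x, count_y, total):
        # Pointwise mutual information: log of how much more often x and y
        # co-occur than they would if the two were independent.
        p_xy = count_xy / total
        p_x = count_x / total
        p_y = count_y / total
        return math.log(p_xy / (p_x * p_y))

    # Toy counts: a strongly associated pair yields a large positive PMI.
    print(pmi(count_xy=500, count_x=2000, count_y=800, total=1_000_000))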
Benjamin Van Durme and Ashwin Lall. (2009) Probabilistic Counting with
Randomized Storage. Twenty-First International Joint Conference on
Artificial Intelligence (IJCAI-09).
Previous work by Talbot and Osborne [2007] explored the use of
randomized storage mechanisms in language modeling. These structures
trade a small amount of error for significant space savings, enabling
the use of larger language models on relatively modest hardware. Going
beyond space efficient count storage, here we present the Talbot
Osborne Morris Bloom (TOMB) Counter, an extended model for performing
space efficient counting over streams of finite length. Theoretical
and experimental results are given, showing the promise of approximate
counting over large vocabularies in the context of limited space.
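As background, the "Morris" component of the counter's name refers to classic probabilistic counting, which the following Python sketch illustrates (this is the textbook Morris counter, not the TOMB structure itself):

    import random

    class MorrisCounter:
        """Approximate counter that stores only a small exponent c;
        the estimate 2**c - 1 equals the true count in expectation."""

        def __init__(self):
            self.c = 0

        def increment(self):
            # Advance the exponent with probability 2**-c.
            if random.random() < 2.0 ** -self.c:
                self.c += 1

        def estimate(self):
            return 2 ** self.c - 1

    counter = MorrisCounter()
    for _ in range(100_000):
        counter.increment()
    print(counter.estimate())  # close to 100,000 in expectation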
Benjamin Van Durme, Phillip Michalak and Lenhart K. Schubert. (2009) Deriving
Generalized Knowledge from Corpora using WordNet Abstraction. EACL'09.
Existing work in the extraction of commonsense knowledge from text
has been primarily restricted to factoids that serve as statements
about what may possibly obtain in the world. We present an approach to
deriving stronger, more general claims by abstracting over large sets
of factoids. Our goal is to coalesce the observed nominals for a given
predicate argument into a few predominant types, obtained as WordNet
synsets. The results can be construed as generically quantified
sentences restricting the semantic type of an argument position of a
predicate.
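A minimal sketch of the abstraction step, assuming NLTK with the WordNet corpus installed (the first-sense selection and simple tallying here are placeholders, not the paper's method):

    from collections import Counter
    from nltk.corpus import wordnet as wn

    def predominant_hypernyms(nominals):
        # Map observed argument nominals to WordNet hypernyms of their
        # first noun sense and tally which abstractions recur.
        tally = Counter()
        for word in nominals:
            for synset in wn.synsets(word, pos=wn.NOUN)[:1]:
                for hyper in synset.hypernyms():
                    tally[hyper] += 1
        return tally.most_common()

    # Observed fillers of some predicate argument slot, e.g. "X barked":
    print(predominant_hypernyms(["dog", "poodle", "terrier", "fox"]))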
Benjamin Van Durme, Ting Qian and Lenhart K. Schubert. (2008) Class-Driven
Attribute Extraction. COLING'08.
We report on the large-scale acquisition of class attributes with and
without the use of lists of representative instances, as well as the
discovery of unary attributes, such as are typically expressed in English
through prenominal adjectival modification. Our method employs a
system based on compositional language processing, as applied to the
British National Corpus. Experimental results suggest that
document-based, open class attribute extraction can produce results of
comparable quality to those obtained using web query logs, indicating
the utility of exploiting explicit occurrences of class labels in
text.
Benjamin Van Durme and Marius Pasca. (2008) Finding Cars, Goddesses and
Enzymes: Parametrizable Acquisition of Labeled Instances for
Open-Domain Information Extraction. Twenty-Third AAAI Conference on
Artificial Intelligence (AAAI-08).
A method is given for the extraction of large numbers of semantic
classes along with their corresponding instances. The method is based on
the recombination of elements clustered through distributional similarity;
experimental results show that it allows for a parametric trade-off between
high precision and expanded recall.
Dekang Lin, Shaojun Zhao, Benjamin Van Durme and Marius Pasca. (2008) Mining
Parenthetical Translations from the Web by Word Alignment. The 46th
Annual Meeting of the Association of Computational Linguistics: Human
Language Technologies (ACL-08).
Documents in languages such as Chinese, Japanese and Korean sometimes
annotate terms with their translations in English inside a pair of
parentheses. We present a method to extract such translations from a
large collection of web documents by building a partially parallel
corpus and using a word alignment algorithm to identify the terms being
translated. The method is able to generalize across the translations
for different terms and can reliably extract translations that
occurred only once in the entire web. Our experiment on Chinese web
pages produced more than 26 million pairs of translations, which is
over two orders of magnitude more than previous results. We show that
the addition of the extracted translation pairs as training data
provides a significant increase in the BLEU score for a statistical
machine translation system.
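The surface pattern being mined can be illustrated with a short regular-expression sketch (hypothetical and naive: it simply grabs the preceding run of Chinese characters, whereas finding the true left boundary of the translated term is exactly what the paper's word-alignment approach addresses):

    import re

    # Chinese term immediately followed by a parenthesized English gloss.
    PAREN_TRANSLATION = re.compile(
        r"([\u4e00-\u9fff]{1,10})\s*[（(]\s*([A-Za-z][A-Za-z0-9 \-']{1,60})\s*[）)]"
    )

    text = "世界贸易组织（World Trade Organization）于1995年成立。"
    for term, gloss in PAREN_TRANSLATION.findall(text):
        print(term, "->", gloss)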
Marius Pasca and Benjamin Van Durme. (2008) Weakly-Supervised Acquisition of
Open-Domain Classes and Class Attributes from Web Documents and Query
Logs. The 46th Annual Meeting of the Association of Computational
Linguistics: Human Language Technologies (ACL-08).
A new approach to large-scale information extraction exploits both Web
documents and query logs to acquire thousands of open-domain classes
of instances, along with relevant sets of open-domain class attributes
at precision levels previously obtained only on small-scale,
manually-assembled classes.