Talks

Large-scale paraphrasing for natural language understanding and generation

Wednesday, April 23, 2014, 11:00 am-12:00 pm

Abstract

I will present my method for learning paraphrases - pairs of English expressions with equivalent meaning - from bilingual parallel corpora, which are more commonly used to train statistical machine translation systems. My method pairs English phrases like <thrown into jail, imprisoned> when they share an aligned foreign phrase like festgenommen. Because bitexts are large and because a phrase can be aligned to many different foreign phrases (including phrases in multiple foreign languages), the method extracts a diverse set of paraphrases. For thrown into jail, we not only learn imprisoned, but also arrested, detained, incarcerated, jailed, locked up, taken into custody, and thrown into prison, along with a set of incorrect/noisy paraphrases. I'll show a number of methods for filtering out the poor paraphrases, by defining a paraphrase probability calculated from translation model probabilities, and by re-ranking the candidate paraphrases using monolingual distributional similarity measures. In addition to lexical and phrasal paraphrases, I'll show how the bilingual pivoting method can be extended to learn meaning-preserving syntactic transformations like the English possessive rule or dative shift. I'll describe a way of using synchronous context-free grammars (SCFGs) to represent these rules. This formalism allows us to re-use much of the machinery from statistical machine translation to perform sentential paraphrasing. We can adapt our "paraphrase grammars" to do monolingual text-to-text generation tasks like sentence compression or simplification. I'll also briefly sketch future directions for adding a semantics to the paraphrases, which my lab will be exploring in the DARPA DEFT program.
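
As a rough illustration of the pivoting idea in the abstract, the sketch below scores paraphrase candidates by summing, over every shared foreign phrase f, the product p(f | e1) * p(e2 | f). This is one common way to define a paraphrase probability from translation model probabilities; the abstract does not spell out the exact formula, so treat it as an assumption. The function name, the second German phrase (inhaftiert), and all probability values are hypothetical toy data, not results from the talk.

```python
from collections import defaultdict


def paraphrase_probabilities(p_f_given_e, p_e_given_f, phrase):
    """Score paraphrases of `phrase` by pivoting through foreign phrases.

    Assumed definition: p(e2 | e1) = sum over f of p(f | e1) * p(e2 | f).
    Both arguments are nested dicts standing in for phrase-table lookups.
    """
    scores = defaultdict(float)
    for foreign, p_f in p_f_given_e.get(phrase, {}).items():
        for candidate, p_e in p_e_given_f.get(foreign, {}).items():
            if candidate != phrase:  # skip the phrase itself
                scores[candidate] += p_f * p_e
    # Highest-scoring candidates first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical toy translation tables (made-up numbers, for illustration only).
p_f_given_e = {
    "thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4},
}
p_e_given_f = {
    "festgenommen": {"arrested": 0.5, "imprisoned": 0.3, "thrown into jail": 0.2},
    "inhaftiert": {"imprisoned": 0.5, "detained": 0.3, "thrown into jail": 0.2},
}

ranked = paraphrase_probabilities(p_f_given_e, p_e_given_f, "thrown into jail")
for candidate, score in ranked.items():
    print(f"{candidate}: {score:.2f}")
```

A candidate reachable through several foreign phrases accumulates probability from each of them, which is why imprisoned outranks arrested in this toy table even though arrested has the single largest translation probability.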

Bio

I am an assistant professor in the Computer and Information Science Department at the University of Pennsylvania. Before joining Penn, I was a research faculty member for 6 years at the Center for Language and Speech Processing at Johns Hopkins University. I was the Chair of the Executive Board of the North American chapter of the Association for Computational Linguistics (NAACL) from 2011 to 2013. I have served on the editorial boards of the journals Transactions of the ACL (TACL) and Computational Linguistics. I have more than 80 publications, which have been cited more than 5000 times. I am a Sloan Research Fellow, and I have received faculty research awards from Google, Microsoft and Facebook in addition to funding from DARPA and the NSF.

This talk is organized by Hal Daume III