Talks

PhD Defense: Multifaceted Geotagging for Streaming News

Michael Lieberman - University of Maryland, College Park

3258 A.V. Williams Building (AVW)

Wednesday, August 22, 2012, 2:00-3:00 pm

You are subscribed to this talk through .
You are watching this talk through .
You are subscribed to this talk. (unsubscribe, watch)
You are watching this talk. (unwatch, subscribe)
You are not subscribed to this talk. (watch, subscribe)

Abstract

THE DISSERTATION DEFENSE FOR THE DEGREE OF Ph.D. IN COMPUTER SCIENCE FOR

Michael Lieberman

News sources on the Web generate constant streams of information, describing the events that shape our world. In particular, geography plays a key role in the news, and understanding the geographic information present in news allows for its useful spatial browsing and retrieval. This process of understanding is called geotagging, and involves first finding in the document all textual references to geographic locations, known as toponyms, and second, assigning the correct lat/long values to each toponym, steps which are termed toponym recognition and toponym resolution, respectively. These steps are difficult due to ambiguities in natural language: some toponyms share names with non-location entities, and further, a given toponym can have many location interpretations. Removing these ambiguities is crucial for successful geotagging.

To this end, geotagging methods are described which were developed for streaming news. First, a spatio-textual search engine named STEWARD, and an interactive map-based news browsing system named NewsStand are described, which feature geotaggers as central components, and served as motivating systems and experimental testbeds for developing geotagging methods. Next, a geotagging methodology is presented that follows a multifaceted approach involving a variety of techniques.

First, a multifaceted toponym recognition process is described that uses both rule-based and machine learning--based methods to ensure high toponym recall. Next, various forms of toponym resolution evidence are explored. One such type of evidence is lists of toponyms, termed comma groups, wherein toponyms share a common thread in their geographic properties that enables correct resolution. In addition to explicit evidence, authors take advantage of the implicit geographic knowledge of their audiences. Understanding the local places known by an audience, termed its local lexicon, affords great performance gains when geotagging articles from local newspapers, which account for the vast majority of news on the Web. Finally, considering windows of text of varying size around each toponym, termed adaptive context, allows for a tradeoff between geotagging execution speed and toponym resolution accuracy. Extensive experimental evaluations of all the above methods, using existing and two newly-created, large corpora of streaming news, show great performance gains over several competing prominent geotagging methods.

Examining Committee:

Committee Chair: Dr. Hanan Samet

Dean's Representative: Dr. Paul Torrens

Committee Members: Dr. Larry Davis

Dr. William Rand

Dr. Amitabh Varshney

Dr. Amy Weinberg

This talk is organized by Jeff Foster