Talks

MS Defense: Large Scale Document Image Retrieval

Rajiv Jain - University of Maryland, College Park

3450 A.V. Williams Building (AVW)

Friday, April 27, 2012, 9:00-10:00 am

Abstract

THE THESIS DEFENSE FOR THE M.S. DEGREE IN COMPUTER SCIENCE FOR

Rajiv Jain

Despite the explosion of textual content on the Internet, hard copy documents, scanned in image form, still play a significant role across many domains. How best to perform retrieval on these document images to satisfy a user’s information needs remains an open research question. The most common approach has been to perform text retrieval on the content generated by Optical Character Recognition (OCR) algorithms. Instead, this thesis presents a segmentation-free algorithm that queries sub-images within a page and scales to millions of images. Experimental results show that this technique can be used reliably with both graphical objects, such as logos, and text. The algorithm is state of the art in comparison to other document image retrieval approaches and is shown to positively affect user relevance on a seven-million-document dataset of scanned images from the tobacco litigation corpus used in the TREC Legal track.

Examining Committee:

Committee Member(s): Dr. Larry Davis

Dr. David Jacobs

Dr. Douglas Oard

EVERYONE IS INVITED TO ATTEND THE PRESENTATION PORTION OF THIS DEFENSE

This talk is organized by Jeff Foster