THE THESIS DEFENSE FOR THE M.S. DEGREE IN COMPUTER SCIENCE FOR
Rajiv Jain
Despite the explosion of textual content on the Internet, hard-copy documents, scanned in image form, still play a significant role across many domains. How best to perform retrieval on these document images to satisfy a user’s information needs remains an open research question. The most common approach has been to perform text retrieval on the output of Optical Character Recognition (OCR) algorithms. This thesis instead presents a segmentation-free algorithm that queries sub-images within a page and scales to millions of images. Experimental results show that the technique can be used reliably with both text and graphical objects such as logos. The algorithm performs at the state of the art relative to other document image retrieval approaches and is shown to improve relevance for users on a seven-million-document dataset of scanned images from the tobacco litigation corpus used in the TREC Legal track.
Examining Committee:
Committee Member(s): Dr. Larry Davis
Dr. David Jacobs
Dr. Douglas Oard
EVERYONE IS INVITED TO ATTEND THE PRESENTATION PORTION OF THIS DEFENSE