In this proposal, we present an approach that leverages evaluation-driven information retrieval (IR) techniques. These techniques optimize an objective function that balances the value of finding relevant content against the imperative to protect sensitive information. This leads to the design of a new evaluation metric that balances relevance and sensitivity. We then introduce several baselines for the problem, along with a proposed approach based on a listwise learning to rank (LtR) model trained with a modified loss function that optimizes for the new metric. In our experiments, we use OHSUMED, one of the LETOR benchmark datasets, with a subset of the Medical Subject Headings (MeSH) labels as a surrogate for sensitive documents. Results show the efficacy of the proposed approach when evaluated with the new metric. This work raises two challenges to be addressed by my proposed future work.
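To make the idea of a metric that balances relevance against sensitivity concrete, the sketch below shows one hypothetical form: a DCG-style score in which relevant documents contribute discounted gain while retrieved sensitive documents incur a discounted penalty. The function name and the penalty weight are illustrative assumptions, not the proposal's actual formulation.

```python
import math

def cost_sensitive_dcg(ranking, k=10, penalty=2.0):
    """Hypothetical gain/penalty metric over a ranked list.

    ranking: list of (relevance, is_sensitive) tuples, best result first.
    Relevant documents add graded gain; sensitive documents subtract a
    penalty, both discounted by rank position as in standard DCG.
    """
    score = 0.0
    for i, (rel, sensitive) in enumerate(ranking[:k]):
        discount = 1.0 / math.log2(i + 2)       # standard DCG rank discount
        if sensitive:
            score -= penalty * discount          # cost of exposing sensitive content
        else:
            score += (2 ** rel - 1) * discount   # graded relevance gain
    return score
```

Under such a metric, a ranker is rewarded for placing relevant, non-sensitive documents early and penalized most heavily when sensitive documents appear near the top of the list.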
First, our experiments were conducted on OHSUMED, which contains medical documents, and we used metadata from that collection to treat some categories as if they represented sensitive content. This motivates us to develop a new test collection with realistic sensitive content, e.g., personal information or private conversations. The target test collection should have four components: 1) a set of documents, 2) search topics that represent information needs, 3) relevance judgments, and 4) sensitivity annotations. We propose to work with corporate email datasets, e.g., Avocado. This test collection will help us understand how sensitive content is represented, so that we can build a learning model that classifies emails containing sensitive information. The resulting model will be integrated with an LtR model to rank documents by both relevance and sensitivity.
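One simple way the classifier's output could be integrated with an LtR model at ranking time is a linear combination of the relevance score and the predicted sensitivity probability. The combination below, and the trade-off weight `lambda_`, are assumptions for illustration, not the proposal's implementation.

```python
def rank(docs, rel_score, prob_sensitive, lambda_=1.0):
    """Illustrative sketch: sort documents by LtR relevance score
    discounted by the sensitivity classifier's predicted probability.

    docs: list of document ids
    rel_score: dict mapping doc id -> relevance score from the LtR model
    prob_sensitive: dict mapping doc id -> P(sensitive) from the classifier
    lambda_: assumed trade-off weight between relevance and sensitivity
    """
    return sorted(docs,
                  key=lambda d: rel_score[d] - lambda_ * prob_sensitive[d],
                  reverse=True)
```

With a large `lambda_`, even highly relevant documents are pushed down the list once the classifier judges them likely to be sensitive.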
Second, our fully automatic approaches may be risky because they can still make mistakes and place sensitive content in the search results. People, however, are far better than machines at drawing inferences from running text. We therefore propose an active learning strategy in which an archivist intervenes to manually review the content about which the sensitivity classifier is most uncertain. Assuming the archivist makes perfect relevance and sensitivity judgments, a relevant, non-sensitive document is sent to the searcher as an additional result, helping them refine future queries. The archivist's feedback also enables the sensitivity classifier to adapt to the sensitivities present in the collection. To measure how well the proposed system works, we propose a new evaluation metric that captures the searcher's gain (obtaining at least one relevant document) while minimizing the archivist's review effort.
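The selection of documents for the archivist's review can be sketched as standard uncertainty sampling: route to manual review the documents whose predicted sensitivity probability is closest to 0.5. The function name, the classifier interface, and the review budget below are assumptions for illustration.

```python
def select_for_review(docs, prob_sensitive, budget=5):
    """Illustrative uncertainty-sampling sketch for the archivist's queue.

    docs: list of document ids
    prob_sensitive: dict mapping doc id -> P(sensitive) from the classifier
    budget: assumed number of documents the archivist can review

    Returns the `budget` documents whose predicted probability is
    closest to 0.5, i.e., those the classifier is least sure about.
    """
    by_uncertainty = sorted(docs, key=lambda d: abs(prob_sensitive[d] - 0.5))
    return by_uncertainty[:budget]
```

The archivist's judgments on these maximally uncertain documents are exactly the labels from which an adaptive classifier would be expected to learn the most.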
Dept rep: Dr. Ashok Agrawala
Members: Dr. Marine Carpuat
Mahmoud F. Sayed is a Ph.D. student in the Department of Computer Science at the University of Maryland, College Park. His research interests include Information Retrieval and Machine Learning. In particular, he is interested in multi-criteria learning to rank.