Text mining extracts valuable insights from a text corpus. Many interesting text mining problems, such as identifying the characteristics of a group of documents or selecting high-quality comments to promote, are open-ended tasks for which no ground truth exists. Humans must still provide world knowledge, reasoning, and context for these tasks. However, manual analysis does not scale to large corpora, and automating these tasks is proving to be a challenging problem. While sophisticated text mining algorithms are becoming increasingly proficient at extracting themes, identifying insightful documents, or labeling images, the lack of formative evaluation makes it difficult to assess and improve them.
My research investigates a general framework for transforming state-of-the-art text mining algorithms into interactive analytics processes using visual representations. This framework consists of two components that integrate with different parts of the standard text mining pipeline: (1) a sensing mechanism and (2) a steering mechanism. To amplify the human analyst's cognitive ability to manage large data, text data is first preprocessed with natural language processing methods to extract features. These features are then summarized using statistical methods to produce high-level abstractions. The sensing mechanism presents these abstractions visually so that users can understand the outputs and explore them interactively to answer open-ended questions. Based on this understanding, users can then act upon the models through the steering mechanism, forming an interactive loop.
I explore the design space of the sensing and steering mechanisms for text analysis through several case studies. In the first part, I explore how the sensing mechanism can enable the characterization of clusters. ParallelSpaces examines how to understand the results of topic modeling for Yelp business reviews: businesses and their reviews each constitute a separate visual space, and exploring these spaces enables the characterization of each space using the other. However, the scatterplot-based approach in ParallelSpaces does not scale to categorical variables because of overplotting. The follow-up work, ATOM, proposes an improved layout algorithm for those cases that still maintains individual objects. Another limitation of clustering methods is that the number of clusters is fixed as a hyperparameter. TopicLens addresses this with a Magic Lens-type interaction technique, in which the documents under the lens are clustered according to topics in real time.
In the second part, I explore how humans can act upon their findings and give feedback to the machine process through the steering mechanism, in the domain of comment analysis. Based on their understanding of the output, users can directly manipulate the model parameters. CommentIQ is a comment moderation tool in which moderators can adjust model parameters according to their context or goals. However, the features used in the previous studies are syntactic, which limits concept-based analysis. To help users analyze documents semantically, a follow-up study, ConceptVector, develops a technique for user-driven text mining in which users build a dictionary for a topic or concept. ConceptVector uses word embeddings to generate dictionaries interactively and uses those dictionaries to analyze the documents.
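The dictionary-building idea can be illustrated with a minimal sketch. This is not the ConceptVector implementation; it assumes a simple scheme in which candidate words are ranked by cosine similarity to the mean of the user's seed words, and a document is scored by the fraction of its tokens found in the dictionary. The toy vectors and the names `EMBEDDINGS`, `expand_concept`, and `concept_score` are hypothetical and exist only for illustration.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real system would load pretrained word embeddings instead.
EMBEDDINGS = {
    "tasty": (0.9, 0.1, 0.0),
    "delicious": (0.85, 0.15, 0.05),
    "yummy": (0.8, 0.2, 0.0),
    "rude": (0.0, 0.9, 0.1),
    "slow": (0.1, 0.8, 0.2),
    "parking": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_concept(seeds, embeddings, top_k=2):
    """Suggest the top_k non-seed words closest to the mean seed vector."""
    known = [embeddings[w] for w in seeds if w in embeddings]
    if not known:
        return []
    dim = len(known[0])
    centroid = tuple(sum(v[i] for v in known) / len(known) for i in range(dim))
    candidates = [
        (cosine(centroid, vec), word)
        for word, vec in embeddings.items()
        if word not in seeds
    ]
    candidates.sort(reverse=True)  # highest similarity first
    return [word for _, word in candidates[:top_k]]


def concept_score(document, dictionary):
    """Fraction of whitespace-separated tokens that appear in the dictionary."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return sum(token in dictionary for token in tokens) / len(tokens)


# One round of the interactive loop: the user supplies a seed word,
# reviews the suggestions, and the accepted words extend the dictionary.
suggestions = expand_concept(["tasty"], EMBEDDINGS, top_k=2)
dictionary = {"tasty", *suggestions}
score = concept_score("The food was tasty and delicious", dictionary)
```

In an interactive setting, the user would accept or reject each suggested word before it enters the dictionary, and the loop would repeat with the updated seed set.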
My dissertation will contribute a general framework for integrating the human into text mining loops that are currently non-interactive. The case studies I present in this dissertation provide concrete and operational techniques for directly improving several state-of-the-art text mining algorithms. I will summarize the generalizable lessons from these case studies and discuss the limitations of the visual analytics approach. On a more abstract level, I will crystallize the lessons learned from applying my framework in multiple studies into design guidelines, which will guide the transformation of any linear algorithmic process into an interactive one. This, in turn, can facilitate the scaling of open-ended tasks and empower human analysts in the future.
Dean's rep: Dr. Hector Corrada Bravo
Members: Dr. Hal Daume
Dr. Bongshin Lee
Dr. Jaegul Choo