Topic models discover latent topics in large collections of documents. They infer which topics each document discusses and which words make up each topic. In real-world corpora, however, links are also pervasive at both the document and word levels, and they carry rich information. Topic models are flexible enough to incorporate these links, either as external knowledge or as part of the generative story.
This proposal explores new methods to uncover the latent structures in weighted links and integrate them into topic modeling. We first look at binary document links, which indicate topical similarity between documents, e.g., citation links between scientific papers. Instead of predicting links directly from topic distributions, LBH-RTM, a relational topic model (RTM) with lexical weights, block priors, and hinge loss, first identifies latent blocks in the document network, then learns block topic distributions to guide document topic inference. Finally, it predicts links using topical, lexical, and block features under a max-margin objective. LBH-RTM outperforms RTM in both link prediction and topic coherence.
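As a rough illustration of how topical, lexical, and block features could be combined under a max-margin objective, the sketch below scores a document pair and computes a hinge loss. It is a minimal sketch, not the LBH-RTM formulation itself: the function names, the weight vectors `eta` and `tau`, the same-block bonus `rho`, and the additive form of the score are all assumptions made for illustration.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def link_score(topics_d, topics_e, words_d, words_e,
               block_d, block_e, eta, tau, rho):
    """Hypothetical link score for a document pair (d, e).

    topics_*: per-document topic proportions
    words_*:  per-document word indicator (or frequency) vectors
    block_*:  block assignment of each document
    eta, tau: assumed weights for topical and lexical features
    rho:      assumed bonus when both documents share a block
    """
    topical = dot(eta, hadamard(topics_d, topics_e))
    lexical = dot(tau, hadamard(words_d, words_e))
    block = rho if block_d == block_e else 0.0
    return topical + lexical + block


def hinge_loss(score, linked):
    """Max-margin loss: zero once the score clears the margin of 1."""
    y = 1.0 if linked else -1.0
    return max(0.0, 1.0 - y * score)


# Toy example with two topics and a two-word vocabulary.
score = link_score([0.5, 0.5], [0.5, 0.5], [1, 0], [1, 0],
                   block_d=0, block_e=0,
                   eta=[2.0, 2.0], tau=[0.5, 0.5], rho=0.5)
```

With the hinge loss, a correctly predicted pair whose score clears the margin contributes no loss, so training concentrates on pairs near or on the wrong side of the decision boundary.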
In addition to document links, words are also linked. To incorporate real-valued word associations, we use three methods to organize the words into a tree structure that serves as a prior (a tree prior) for tree LDA (tLDA). The methods are straightforward but effective: they yield more coherent topics than vanilla LDA and slightly improve extrinsic classification performance.
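To make the idea of turning pairwise word associations into a tree concrete, here is a minimal sketch of one possible construction: greedy agglomerative merging by average association. This is an illustrative assumption and not necessarily one of the three methods used for tLDA; the names `build_tree`, `avg_association`, and the toy `assoc` scores are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Return the words under a tree node (a word or a pair of subtrees)."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def avg_association(a, b, assoc):
    """Mean association over all cross pairs of words in subtrees a and b."""
    pairs = [(u, v) for u in leaves(a) for v in leaves(b)]
    total = sum(assoc.get(frozenset((u, v)), 0.0) for u, v in pairs)
    return total / len(pairs)


def build_tree(words, assoc):
    """Greedily merge the two most associated clusters until one tree remains.

    The result is a binary tree of nested 2-tuples with words at the leaves.
    """
    clusters = list(words)
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_association(clusters[ij[0]], clusters[ij[1]], assoc),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Toy real-valued associations (hypothetical scores); missing pairs count as 0.
assoc = {
    frozenset(("cat", "dog")): 0.9,
    frozenset(("stock", "bond")): 0.8,
    frozenset(("dog", "stock")): 0.1,
}
tree = build_tree(["cat", "dog", "stock", "bond"], assoc)
```

In this toy case the strongly associated pairs end up as siblings, which is the property a tree prior relies on: words that share a subtree are encouraged to receive similar probability within a topic.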
In the proposed work, we dig deeper into the tree structure. Instead of using a pre-constructed, static tree prior, we propose to build tree priors through probabilistic hierarchical clustering with the coalescent. The tree priors and topics are thus learned jointly, and we expect them to benefit from each other.
We also develop topic models to learn weighted topic links. Unlike document and word links, ground-truth topic links are hard to obtain because topics are latent rather than observed. We propose to apply such topic models to multilingual corpora. Specifically, we assume that each language has its own topic distributions, which consist only of words in that language, and that topic links bridge the topics across languages. We plan to validate this approach by modeling low-resource languages together with high-resource ones on text related to natural disasters. The model will be evaluated on an extrinsic multi-class, multi-label situation frame classification task.
Dept. rep: Dr. Max Leiserson
Members: Dr. Jordan Boyd-Graber