With the advent of self-supervised learning, Transformer models have gained immense popularity across a wide variety of tasks spanning natural language processing, computer vision, speech processing, document intelligence, and beyond, and they represent the state of the art across many modalities, from language understanding and document intelligence to image classification and protein sequence modeling. A common weakness of Transformers is the quadratic memory complexity of the self-attention mechanism, which restricts their application to domains requiring longer sequence lengths. Hence, these models and their associated methods suffer from an input-length limitation when reasoning in long-context scenarios. Existing research has proposed extensions of the standard Transformer architecture (e.g., Longformer, Big Bird, Reformer) to encode longer input sequences. However, such methods are not task-agnostic, trade the ability to model long-form input for reduced performance relative to regular Transformer models, do not show consistent performance gains across different tasks, and require extensive training data and compute to be usable in specialized domains such as law, finance, news, and contracts, where supervised data is scarce.
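As a back-of-the-envelope illustration of the quadratic-memory claim above, the sketch below computes the size of the attention score matrix that standard self-attention materializes: for a sequence of length n, each head forms an n × n matrix, so doubling the sequence length quadruples that matrix's memory. The function name and the 4-bytes-per-float assumption are illustrative, not part of the proposal.

```python
def attention_score_memory(seq_len: int, bytes_per_float: int = 4) -> int:
    """Memory in bytes for one head's full self-attention score matrix.

    Standard self-attention materializes an (n x n) matrix of
    query-key scores, so memory grows quadratically with the
    sequence length n.
    """
    return seq_len * seq_len * bytes_per_float

# Doubling the sequence length quadruples the score-matrix memory:
# 512 -> ~1 MB, 1024 -> ~4 MB, 2048 -> ~16 MB per head (at fp32).
for n in (512, 1024, 2048):
    print(n, attention_score_memory(n))
```

This quadratic growth is exactly what sparse-attention variants such as Longformer and Big Bird try to avoid, at the cost of the trade-offs discussed above.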
Our research focuses on building predictive models for long-context (also called document-level) multimodal understanding by expanding the capabilities of Transformer language models to capture both local context and long-range global information. It is broadly divided into four parts. First, we design and train supervised methods for document-level text information extraction, looking at tasks such as temporal event relation extraction, temporal dependency parsing, and natural language inference at document scale. Second, we explore multimodal hierarchical structure extraction in visually-rich documents and visual-linguistic-spatial learning for automated document manipulations. Third, we investigate methods for building text-to-speech systems for semi-structured long-form text and for improving speech recognition systems to handle long-term dependencies and better predict words with domain-specific contexts. Lastly, we develop methods to extract information from multimodal long-form videos (e.g., conference calls) for downstream time-series prediction, studying how document-level transcripts, long-form audio-visual recordings, and tabular information can be combined for financial prediction tasks.
Puneet is pursuing a Ph.D. in Computer Science at the University of Maryland, College Park, advised by Prof. Dinesh Manocha. His research focuses on long-context multimodal understanding (documents, language, audio, video), spanning machine learning, natural language processing, speech processing, video understanding, and multimodal deep learning. He completed his Master's in Computer Science at UMD in 2021 and his Bachelor of Engineering (B.E.) in Computer Engineering at NSIT (Delhi University). He has previously interned at Dataminr, Adobe Research, and Meta AI (formerly Facebook).