The machine learning community has embraced specialized models tailored to specific data domains. However, relying on a single data type can constrain flexibility and generality, demanding additional labeled data and hindering user interaction. To address these challenges, my research objective is to build efficient, generalizable, interactive intelligent systems that learn from perceiving the physical world and from interacting with humans, enabling them to carry out diverse and complex tasks that assist people. These systems should support seamless interaction with humans and computers, both in digital software environments and in tangible real-world contexts, by aligning representations from vision and language. In this talk, I will describe my approaches across three dimensions: perception, imagination, and action, with a focus on large language models, generative models, and robotics. These findings mitigate limitations of existing model setups that cannot be overcome by simply scaling up, opening avenues for multimodal representations to unify a wide range of signals within a single, comprehensive model.
Boyi Li is a postdoctoral scholar at UC Berkeley, advised by Prof. Jitendra Malik and Prof. Trevor Darrell. She is also a researcher at NVIDIA Research. She received her Ph.D. from Cornell University, where she was advised by Prof. Serge Belongie and Prof. Kilian Q. Weinberger. Her research interests lie in machine learning and multimodal systems. She aims to develop generalizable algorithms and interactive intelligent systems, with an emphasis on large language models, generative models, and robotics. To this end, she works on aligning representations from multimodal data, specifically vision and language, to enhance, redefine, and extend the capabilities of intelligent systems in perceiving, understanding, and interacting with the world in a human-like manner.