Embodied AI deals with the physical manifestation of intelligence, with agents interacting with their environment to achieve specific objectives or tasks. Traditional approaches to these tasks have often relied on closed, supervised-learning-based solutions, constraining agents to specific datasets, language instructions, and simulators. This limits their adaptability to novel environments and reduces their applicability to real-world scenarios. In this proposal, we aim to tackle these issues by presenting three contributions towards developing the navigation capabilities of a Generalist Embodied Agent (GEA): an agent capable of performing tasks of real-world significance without any supervision.
First, we propose a zero-shot navigation approach that leverages Vision-Language Models (VLMs) to explore the environment, demonstrating consistent performance on the Vision-and-Language Navigation (VLN) task across diverse Matterport3D environments. This enables agents to generalize navigation capabilities without environment-specific training.
Second, we present a framework for utilizing Large Language Model (LLM) outputs to interpret free-form human language guidance and explore novel environments, achieving state-of-the-art performance on the zero-shot ObjectNav task on RoboTHOR. This allows agents to act on unstructured, human-like guidance beyond predefined labels.
Third, we democratize the evaluation of language-guided navigation models across simulation environments by providing researchers with a tool to synthesize human-like wayfinding instructions using LLMs. Our approach removes the dependence on curated simulator-based datasets, enabling VLN evaluation across multiple simulator platforms.
Finally, we present ongoing work towards the GEA in the form of three novel embodied tasks that add real-world complexity to prior art: 1) P-ObjectNav, which extends ObjectNav to dynamic scenes by introducing non-stationary targets; 2) Assisted ObjectNav, which encourages collaboration with other agents in the scene to improve navigation; and 3) S-EQA, which extends Embodied Question Answering to more realistic situational queries. By developing novel tasks and solutions in these areas, we aim to help embodied agents better adapt to real-world situations.
Vishnu is a fourth-year PhD student advised by Dr. Dinesh Manocha. Prior to this, he completed his master's in robotics at UMD. Vishnu's research concerns Embodied AI, where he is broadly interested in enabling generalist agent behavior in human-centric environments.