Foundation Models, trained on Internet-scale data, show an excellent understanding of natural language and images, possess common sense, and perform logical reasoning and prediction. All of these capabilities are essential for developing intelligent and autonomous robots. How can we tap into the power of these Foundation Models in robotics? In this talk, I will cover three recent papers from Google DeepMind about building Robot Foundation Models: SayCan, RT-2, and ROSIE. They not only demonstrate how to ground large language models (LLMs) and vision-language models (VLMs) in robots' physical capabilities and real-world environments, but also show how to leverage generative AI models for large-scale data augmentation. Pre-trained on Internet-scale text and image data, and fine-tuned with robotic data collected in the real world or imagined through diffusion models, these Robot Foundation Models show unprecedented capabilities for long-horizon task planning and generalizable low-level skills.
Host: Dinesh Manocha
Jie Tan is a Senior Staff Research Scientist at Google DeepMind. He leads the Robot Mobility and Embodied General Reasoning Teams, whose mission is to build intelligent and autonomous robots that can assist humans with daily tasks in human-centered environments. His research focuses on applying foundation models and deep reinforcement learning methods to robots, with interests spanning locomotion, navigation, manipulation, simulation, and sim-to-real transfer.