Embodied agents capable of sequential decision-making lie at the heart of intelligent behavior, yet generalization across tasks, embodiments, and modalities remains a key challenge. This work presents a unified framework centered on latent world-model-based representation learning, aiming to endow embodied agents with compact yet generalizable abstractions of both state and action that capture the underlying dynamics of the physical world.
I begin with TACO and Premier-TACO, which learn predictive latent state representations through temporal contrastive objectives. These representations encode control-relevant dynamics, significantly improving data efficiency and enabling generalization to unseen tasks compared with existing pre-training objectives. Building on this idea, FLARE extends world-model-based learning to large vision-language-action (VLA) foundation models through a future latent alignment objective, achieving state-of-the-art multitask policy learning and enabling training directly from human video data.
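The core of a temporal contrastive objective of this kind can be sketched as an InfoNCE loss that pulls a latent encoding of the current state (and intervening actions) toward the latent encoding of the future state, with other batch elements serving as negatives. The sketch below is a minimal, self-contained illustration in NumPy with toy random latents; the encoders, temperature value, and batch construction are illustrative assumptions, not the actual TACO implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: for each row, the matching row of `positives` is the
    positive pair and all other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                  # cross-entropy, identity labels

# Toy batch: anchor = encoding of state s_t plus the action sequence up to t+k,
# positive = latent encoding of the future state s_{t+k} (here simulated as a
# slightly perturbed copy, standing in for a well-trained encoder pair).
B, D = 8, 16
z_anchor = rng.normal(size=(B, D))
z_future = z_anchor + 0.05 * rng.normal(size=(B, D))
loss = info_nce(z_anchor, z_future)
```

Minimizing this loss makes the latent state predictive of where the system will be after the chosen actions, which is what makes the representation control-relevant.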
Beyond state representation learning, latent world-model-based future prediction can also be leveraged to learn effective temporal action abstractions. My work PRISE demonstrates how discretizing and tokenizing raw trajectories into higher-level temporal actions shortens the effective planning horizon and improves the efficiency and generalization of multitask imitation learning.
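The idea of turning raw trajectories into higher-level temporal actions can be illustrated in two steps: quantize continuous actions against a small codebook, then greedily merge frequent adjacent token pairs (BPE-style) into multi-step tokens, so each token stands for a short action sequence. The sketch below is a deliberately tiny 1-D toy, not the PRISE algorithm itself; the codebook, merge count, and scalar actions are illustrative assumptions.

```python
from collections import Counter

def quantize(actions, codebook):
    """Map each continuous (here: scalar) action to its nearest codebook index."""
    return [min(range(len(codebook)), key=lambda i: abs(a - codebook[i]))
            for a in actions]

def merge_pair(seq, pair, token):
    """Replace every occurrence of the adjacent `pair` in `seq` with `token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(token); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def learn_temporal_tokens(seqs, num_merges):
    """Greedily merge the most frequent adjacent token pair, creating a new
    token per merge; each new token spans a longer stretch of raw actions."""
    next_token = max(t for s in seqs for t in s) + 1
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        seqs = [merge_pair(s, best, next_token) for s in seqs]
        next_token += 1
    return seqs

# Toy trajectories of scalar actions, quantized with a 3-entry codebook.
codebook = [-1.0, 0.0, 1.0]
trajs = [quantize(t, codebook) for t in ([0.9, -0.8, 0.9, -0.8], [0.9, -0.8, 0.1])]
compressed = learn_temporal_tokens(trajs, num_merges=1)
```

After one merge, the frequent pair of primitive tokens collapses into a single temporal token, so each sequence becomes shorter: the policy now plans over fewer, higher-level decisions.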
Finally, beyond learning-based state and action representations, my research also explores how symbolic representations can bridge the perception–action gap in large VLA models. My work TraceVLA introduces an explicit visual prompting technique that encodes a robot’s execution history as a symbolic visual trace, providing richer spatio-temporal grounding and substantially improving real-world manipulation performance of existing VLA models.
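The visual prompting idea can be pictured as overlaying the robot's past 2-D positions directly onto the camera observation before it is fed to the VLA model, so the policy sees its own execution history in image space. The sketch below is a minimal illustration with a blank frame and hand-picked pixel coordinates; the marker color, radius, and the tracked positions are illustrative assumptions, not TraceVLA's actual rendering.

```python
import numpy as np

def overlay_trace(image, trace, color=(255, 0, 0), radius=1):
    """Draw past 2-D positions (e.g. tracked gripper pixels) onto a copy of
    the observation, forming a visual trace prompt. `trace` holds (row, col)
    pixel coordinates, oldest first."""
    out = image.copy()
    h, w, _ = out.shape
    for r, c in trace:
        r0, r1 = max(0, r - radius), min(h, r + radius + 1)
        c0, c1 = max(0, c - radius), min(w, c + radius + 1)
        out[r0:r1, c0:c1] = color                 # stamp a small square marker
    return out

obs = np.zeros((64, 64, 3), dtype=np.uint8)       # placeholder camera frame
past_positions = [(10, 10), (12, 14), (15, 19)]   # hypothetical tracked points
prompted = overlay_trace(obs, past_positions)
```

Because the history is rendered into the image rather than appended as extra tokens, any off-the-shelf VLA model can consume it without architectural changes.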
Ruijie Zheng is a Ph.D. candidate in Computer Science at the University of Maryland, advised by Professors Furong Huang and Hal Daumé III. His research lies at the intersection of deep reinforcement learning and robot learning, focusing on developing efficient representations for generalist embodied foundation models. He has previously interned at Microsoft Research and NVIDIA.

