Foundation models have achieved remarkable progress in vision, language, and multimodal reasoning. However, their development has been largely driven by large-scale supervised learning and distillation from teacher models, both of which require substantial human effort and computational resources while still exhibiting unreliable and inefficient behaviors. This dissertation explores how foundation models can be systematically steered toward reliable and efficient behavior through the interplay of data design and modeling strategies, without relying on external supervision.
On the modeling side, we investigate self-critique mechanisms and value-guided generation frameworks that enable models to assess, regulate, and refine their own outputs, forming the basis for continuous self-improvement. On the data side, we explore model-in-the-loop data generation and selection pipelines that establish a closed feedback loop between model behavior and data construction, allowing training signals to scale in both volume and quality. By tightly coupling self-evaluation, reward modeling, and data generation within unified training pipelines, this dissertation aims to establish principled approaches for scalable, controllable, and self-improving foundation model training.
Xiyao Wang is a fourth-year PhD student in the Department of Computer Science at the University of Maryland, College Park, advised by Prof. Furong Huang. His research focuses on developing efficient and scalable algorithms for building reliable large multimodal language models and world models. He has published several papers at top-tier conferences, including ICML, NeurIPS, ICLR, CVPR, ACL, EMNLP, and NAACL.

