Baby VLM: Democratizing Research on the Pretraining of Vision Large Language Models
Boqing Gong
IRB 4105 or Zoom https://umd.zoom.us/j/92499087387?pwd=V576oxa5ktSyfEbmT9RiIldaPlutHd.1&jst=2
Tuesday, December 16, 2025, 11:00 am-12:00 pm
Abstract
Pretraining vision foundation models (VFMs) is prohibitively expensive, making it a privilege of institutions with abundant resources and leaving independent researchers to downstream tasks such as benchmarking, interpreting, and aligning VFMs. This situation is a crisis for computer vision research: as Richard Feynman put it, “What I cannot create, I do not understand.” Independent researchers and the public cannot gain a true understanding of, trust in, and safe use of VFMs passively from open weights or APIs. Meanwhile, the few privileged VFM creators could soon reach a plateau without the broad research community’s nurturing.
 
Hence, we propose democratizing VFM pretraining by scaling it down to a developmentally plausible framework that is scientifically reasonable and computationally friendly to university budgets. The aim is to promote exploration rather than exploitation of pretraining and to enable independent researchers to build general-purpose VFMs that approach “baby intelligence,” benefiting efforts towards “grown-up” AI. This framework will closely mimic the minimal yet highly informative sensory experiences of human infants, encompassing 1) pretraining data curated from longitudinal, egocentric audiovisual recordings of babies, 2) a suite of developmentally aligned evaluation benchmarks assessing VFM capabilities against cognitive milestones such as object permanence, social skills, and language acquisition, and 3) a user-friendly pretraining codebase and baseline models.
Bio
Boqing Gong (https://boqinggong.github.io) is a computer science faculty member at Boston University and a part-time research scientist at Google DeepMind. His research on machine learning and computer vision focuses on visual recognition, video, and AI models’ generalization and efficiency.
 
Host: Please contact Ruohan Gao (rhgao@umd.edu) if you have any questions.
This talk is organized by Samuel Malede Zewdu