Multimodal foundation models have recently shown great promise across a wide array of tasks involving images, text, and 3D geometry, pushing the boundaries of what was considered possible just a few years ago. However, these models are not easily interpretable, which hinders their use and adaptation for many tasks.
In this talk, I will present an ongoing line of research that leverages multimodal foundation models for the task of semantic editing of 3D objects, and demonstrate the paradigm shift between how we approached this task several years ago and how we approach it today with these powerful models at hand. By examining both their internal mechanisms and their functional subparts in a standalone manner, our work also offers new angles for understanding what these multimodal foundation models learn. Finally, I will discuss several directions for future work.