Video editing is vital for transforming raw footage into structured visual narratives. However, unlike controlled studio productions, real-world videos feature a complex interplay between dynamic scene motion and unconstrained camera trajectories. This complexity hinders advanced tasks such as precise viewpoint synthesis and object-level manipulation. Furthermore, while modern generative models offer powerful text-to-video synthesis, high-level textual prompts lack the precision required for fine-grained control, preventing users from achieving specific, predictable results.
In this thesis, we address challenging video editing tasks by introducing methodologies for fine-grained controllability. First, we present a layer-decomposition framework that isolates objects and their correlated effects, enabling versatile scene manipulation. Second, we develop an efficient novel view synthesis representation that accelerates training and rendering for monocular videos. Finally, we establish 3D point tracks as a unified representation for generative motion editing, allowing simultaneous control over camera trajectories and scene-object movements. Collectively, these works provide a technical foundation for 3D-aware, controllable editing in unconstrained, real-world environments.
Yao-Chih Lee is a 4th-year PhD student advised by Prof. Jia-Bin Huang. His research focuses on 3D computer vision and generative video models. During his doctoral studies, he has completed research internships at Google DeepMind and Adobe Research.
Examining Committee:
Chair: Dr. Jia-Bin Huang
Dean's Representative: Dr. Maria Cameron
Members: Dr. Ramani Duraiswami, Dr. Abhinav Shrivastava, Dr. Furong Huang