Synthesizing novel views of a scene from existing observations is a long-standing problem with broad practical applications, such as virtual museums and online tours, where objects and scenes can only be captured from a limited number of viewpoints, yet viewers can freely move and rotate their camera. The emergence of virtual reality technology has also made it easier for people to observe and interact with virtual environments, highlighting the need for robust and efficient view synthesis. In recent years, this field has seen major advances driven by neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS).
Although these methods achieve efficient, high-quality view synthesis, they require careful and dense capturing in controlled environments. Much research effort has therefore gone into relaxing these requirements, for example by allowing dynamic scenes or handling variations in lighting. This thesis presents our work along this path of enabling novel view synthesis from challenging inputs. In particular, we focus on three scenarios.
First, we tackle the problem of unwanted foreground objects, such as moving people or vehicles in front of a building. Because these objects cast shadows and reflections, naively masking them out leaves artifacts in the background reconstruction. We propose a method that decomposes the foreground objects, together with the effects they cast, into separate 2D layers and a clean 3D background layer.
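As a rough sketch of this layered formulation (the notation below is illustrative, not the exact model), the final image can be obtained by alpha-compositing the 2D foreground layers over the rendered 3D background:
\[
C(\mathbf{p}) = \sum_{k=1}^{K} \alpha_k(\mathbf{p})\, c_k(\mathbf{p}) \prod_{j<k} \bigl(1 - \alpha_j(\mathbf{p})\bigr) + B(\mathbf{p}) \prod_{k=1}^{K} \bigl(1 - \alpha_k(\mathbf{p})\bigr),
\]
where \(c_k\) and \(\alpha_k\) denote the color and opacity of the \(k\)-th 2D foreground layer (including its cast shadows and reflections) at pixel \(\mathbf{p}\), and \(B\) is the rendering of the clean 3D background layer.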
Second, we address view synthesis from very few inputs. With as few as three input views, we leverage recent large image and video generation priors to interpolate in-between views that provide additional supervision for scene reconstruction. To improve both efficiency and quality, we use a feedforward geometry foundation model to obtain a dense point cloud, whose renderings serve as conditioning images for the image priors. In addition, we introduce optimizable image warps and a robust view sampling strategy to handle inconsistencies in the generated images.
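One plausible way to write the resulting objective (the weight \(\lambda\) and the warp parameterization are illustrative assumptions rather than the exact formulation):
\[
\min_{\theta,\,\{\phi_j\}} \; \sum_{i \in \mathcal{I}} \bigl\| R_\theta(\pi_i) - I_i \bigr\| + \lambda \sum_{j \in \mathcal{G}} \bigl\| R_\theta(\pi_j) - \mathcal{W}_{\phi_j}\!\bigl(\hat{I}_j\bigr) \bigr\|,
\]
where \(R_\theta(\pi)\) renders the reconstructed scene from camera pose \(\pi\), \(I_i\) are the few input views, \(\hat{I}_j\) are the in-between views generated by the image or video prior conditioned on point cloud renderings, and \(\mathcal{W}_{\phi_j}\) is an optimizable image warp that absorbs inconsistencies in the generated views.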
Lastly, we consider the extended problem of inverse rendering, which decomposes a scene into geometry, material properties, and environment lighting. This not only enables synthesis of novel views via rendering, but also supports additional capabilities such as scene editing and relighting. We propose a simple capture procedure in which the object is rotated several times while photos are taken. With this setup, we show that artifacts caused by the ambiguity between materials and lighting can be drastically reduced. We model the scene with 2D Gaussian primitives for computational efficiency, and use a proxy geometry together with a residual constraint to further improve the handling of global illumination.
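For reference, the decomposition targeted by inverse rendering can be summarized by the standard rendering equation (generic notation, not specific to our formulation):
\[
L_o(\mathbf{x}, \omega_o) = \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, (\mathbf{n} \cdot \omega_i)\, \mathrm{d}\omega_i,
\]
where the geometry determines the surface point \(\mathbf{x}\) and normal \(\mathbf{n}\), the material is captured by the BRDF \(f_r\), and the environment lighting, together with global illumination, determines the incident radiance \(L_i\). Rotating the object changes \(L_i\) relative to the object, which intuitively helps disentangle material and lighting.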
The works presented in this thesis improve the quality and robustness of novel view synthesis from challenging input data. Further research along these lines can enable casual capturing and lower the barrier to creating and sharing digital content from the real world.
Geng Lin is a PhD student at the University of Maryland, College Park. He is advised by Prof. Matthias Zwicker. His research mainly focuses on novel view synthesis and inverse rendering.