Capturing the shape and material of objects and scenes is a cornerstone of computer vision research, with significant applications across augmented reality, e-commerce, healthcare, real estate, and robotics. This thesis explores two primary capture methods: Multiview Stereo (MVS), which leverages varying viewpoints, and Photometric Stereo (PS), which utilizes changes in lighting. To address some of the limitations inherent in these techniques, we introduce several novel methods.
In the first part, we present a user-friendly PS setup requiring only a camera, a flashlight, and optionally a tripod—simple enough for home assembly. To support high-resolution captures from this setup, we introduce RecNet, a novel recursive architecture trained on low-resolution synthetic data yet capable of predicting high-resolution geometry and reflectance. RecNet outperforms state-of-the-art PS systems, even with only a few input images. Traditionally, PS assumes distant lighting, an assumption that breaks down for large objects or objects in confined spaces. Building on RecNet, we propose a novel method that integrates per-pixel lighting estimates and recursive depth estimation to address the challenges of near-field lighting, thus broadening PS's applicability.
While PS excels at capturing fine details, it often struggles with global geometry, introducing low-frequency distortions that complicate the stitching of multiple views into a complete object. Conversely, MVS captures global geometry effectively but tends to miss finer details. In the second part, we address the Multiview Photometric Stereo (MVPS) problem, which leverages variations in both lighting and viewpoint. Our feedforward architecture, inspired by both MVS and PS techniques, enables geometry reconstruction that matches or exceeds the state of the art in quality, while being orders of magnitude faster.
In scenarios where adjusting lighting conditions is impractical, such as in large or outdoor scenes, changing viewpoints often proves more feasible, especially when cameras are mounted on mobile platforms like drones or vehicles. Large field-of-view (FoV) cameras are preferable for these expansive scenes, as they enable faster and easier capture. However, adapting MVS models developed for small-FoV cameras to large-FoV ones requires significant modifications and traditionally depends on scarce large-FoV training data. In the third part, we introduce novel architectures and data augmentation techniques that train networks on abundant small-FoV data while allowing them to generalize to large-FoV scenarios. This approach demonstrates strong generalization across both indoor and outdoor datasets, effectively eliminating the need to acquire costly large-FoV-specific datasets for training large-FoV MVS models.
Through these contributions, we aim to streamline and enhance the capture of shape and material, making it faster and more practical for a broad range of users—from casual hobbyists to industrial systems.
Daniel Lichy is a Ph.D. candidate at the University of Maryland, College Park, advised by Professor David Jacobs. His research focuses on machine learning, computer vision, and computational photography with an emphasis on reconstructing 3D geometry and material properties using multi-view and multi-illumination data. He interned with Nvidia's Learning and Perception Research Group, developing multiview stereo algorithms that can generalize between camera models. Daniel earned his Bachelor's in Mathematics from the University of Maryland and completed a Post-Baccalaureate fellowship at the National Institute of Biomedical Imaging and Bioengineering.