The growing scale of large models has enabled impressive generalization across language, vision, and multimodal tasks, but it has also introduced significant challenges in computation, memory, and deployment. This dissertation aims to improve the efficiency of large models by leveraging structural insights and designing scalable, adaptive architectures.
We begin by analyzing the internal redundancy of Transformer models, demonstrating that a substantial portion of attention and MLP components can be removed or sparsified with minimal impact on performance. These findings motivate the development of conditional computation techniques, such as Mixture-of-Experts (MoE) and dynamic depth routing, that reduce unnecessary computation based on input characteristics. To support scalable inference, we further propose capacity-aware routing and token rescheduling strategies that mitigate straggler effects and improve hardware utilization, as illustrated in the sketch below.
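To make the capacity-aware routing idea more concrete, here is a minimal, illustrative sketch of top-k expert routing with a per-expert capacity cap, a common way to bound per-expert load during MoE inference. The function and parameter names (route_tokens, capacity_factor, and so on) are assumptions introduced for illustration and are not taken from the dissertation.

```python
# Minimal sketch of capacity-aware top-k expert routing (illustrative only).
# Names such as `route_tokens` and `capacity_factor` are hypothetical.
import torch

def route_tokens(hidden, router_weight, num_experts=8, top_k=2, capacity_factor=1.25):
    """Assign each token to its top-k experts, dropping token-expert pairs
    once an expert reaches its capacity."""
    num_tokens = hidden.size(0)
    # Per-expert capacity: an equal share of the routed tokens, scaled by a slack factor.
    capacity = int(capacity_factor * num_tokens * top_k / num_experts)

    logits = hidden @ router_weight                      # [num_tokens, num_experts]
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_experts = probs.topk(top_k, dim=-1)

    assignments = []                                     # (token_idx, expert_idx, gate_weight)
    load = torch.zeros(num_experts, dtype=torch.long)
    for t in range(num_tokens):
        for k in range(top_k):
            e = topk_experts[t, k].item()
            if load[e] < capacity:                       # skip overflow beyond the capacity cap
                assignments.append((t, e, topk_probs[t, k].item()))
                load[e] += 1
    return assignments, load

# Toy usage: 16 tokens with hidden size 32 routed across 8 experts.
hidden = torch.randn(16, 32)
router_weight = torch.randn(32, 8)
assignments, load = route_tokens(hidden, router_weight)
print(load)  # per-expert load, each bounded by the computed capacity
```

Capping per-expert load in this way is one standard means of avoiding stragglers: no single expert can accumulate far more tokens than its peers, which keeps per-device work balanced during batched inference.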
Our methods are validated across multiple application domains, including natural language processing, representation learning, and vision-language understanding. Together, these contributions offer a principled framework for building large models that are both efficient and deployment-ready.
Shwai He is a second-year Ph.D. student in the Department of Computer Science at the University of Maryland, College Park, advised by Prof. Ang Li. His research focuses on efficient large models, with interests in structural redundancy analysis, dynamic architectures, and scalable inference.
He is a recipient of the Qualcomm Innovation Fellowship (QIF) for his work on Transformer efficiency. His recent contributions include Router-Tuning for dynamic depth adaptation, SparseAdapter for parameter-efficient fine-tuning, and capacity-aware inference techniques for Mixture-of-Experts (MoE) models. His work has been applied to large-scale systems in natural language processing, embedding models, and vision-language tasks, bridging efficient model design and practical deployment.