Modern machine learning models, particularly deep neural networks, have demonstrated remarkable capabilities across a variety of tasks. However, their success hinges not only on model capacity but also on the structure of the data, and it can be undermined when spurious correlations are present. This dissertation investigates the foundations of learning non-spurious decision functions: functions that capture stable, generalizable patterns rather than superficial artifacts in the data.
We begin by analyzing the expressive power of neural networks, defined by the function classes that different architectures can represent. In contrast to traditional approaches that fix model structure a priori, we propose a data-informed perspective, in which structural choices are guided by the information content and complexity of the data. This view enables more efficient and targeted model designs that align with the intrinsic complexity of the learning task.
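As a toy illustration of what a data-informed structural choice might look like (the abstract does not say how complexity is measured, so the effective-dimension proxy and the sizing rule below are assumptions made purely for illustration), one could estimate how many directions of variation the data actually occupies and size a hidden layer from that estimate rather than fixing it a priori:

```python
# Illustrative sketch only: use the number of principal components needed to
# explain 95% of the variance as a crude proxy for data complexity, then let
# that proxy drive a structural choice (here, the width of a hidden layer).
import numpy as np
from sklearn.decomposition import PCA

def effective_dim(X: np.ndarray, var_threshold: float = 0.95) -> int:
    """Smallest number of principal components explaining `var_threshold`
    of the variance; a simple stand-in for intrinsic data complexity."""
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum, var_threshold) + 1)

rng = np.random.default_rng(0)
# Data that is 200-dimensional on paper but lies near a 10-dimensional subspace.
Z = rng.normal(size=(1000, 10))
X = Z @ rng.normal(size=(10, 200)) + 0.01 * rng.normal(size=(1000, 200))

k = effective_dim(X)
hidden_width = 4 * k  # structural choice informed by the data, not fixed a priori
print(f"effective dimension = {k}, chosen hidden width = {hidden_width}")
```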
Building on this foundation, we explore how models trained on observational datasets are prone to learning spurious shortcuts, particularly when superficial features correlate with labels. We study the conditions under which such behavior emerges and propose methods to encourage the learning of non-spurious functions through structural and algorithmic interventions.
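For concreteness, the following synthetic sketch (not taken from the dissertation) shows the failure mode in miniature: a superficial feature that correlates with the label only in the training split dominates a standard classifier, and accuracy collapses once that correlation is broken at test time.

```python
# Minimal synthetic illustration of a spurious shortcut: a weak "core" feature
# is stably predictive, while a "spurious" feature matches the label almost
# perfectly in training but is uninformative at test time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_split(n, spurious_corr):
    """Labels y in {0, 1}; the core feature is noisy but causal, the spurious
    feature agrees with y with probability `spurious_corr`."""
    y = rng.integers(0, 2, size=n)
    core = y + rng.normal(0, 1.0, size=n)                      # weak, stable signal
    agree = rng.random(n) < spurious_corr
    spurious = np.where(agree, y, 1 - y) + rng.normal(0, 0.1, size=n)
    return np.column_stack([core, spurious]), y

X_train, y_train = make_split(5000, spurious_corr=0.95)  # shortcut available
X_test, y_test = make_split(5000, spurious_corr=0.50)    # shortcut removed

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy (correlation broken):", clf.score(X_test, y_test))
print("learned weights [core, spurious]:", clf.coef_[0])
```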
Finally, we focus on improving the data itself. We present a framework for constructing or refining datasets to better support non-spurious learning. This includes data augmentation strategies and self-improvement techniques assisted by large language models (LLMs), enabling models to learn from enhanced or more semantically aligned examples.
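A schematic sketch of such an LLM-assisted refinement loop is given below. The `llm_rewrite` and `label_consistent` functions are hypothetical placeholders standing in for an actual LLM call and a filtering criterion; the dissertation's framework may implement both quite differently.

```python
# Schematic sketch (not the dissertation's actual pipeline): rewrite each example
# to vary its surface form while preserving its label, then keep the rewrite only
# if a label-consistency filter accepts it.
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: int

def llm_rewrite(text: str) -> str:
    """Placeholder for an LLM call that paraphrases the input while preserving
    its meaning (and hence its label). Here: a trivial surface change."""
    return text.replace("movie", "film")

def label_consistent(original: Example, rewritten_text: str) -> bool:
    """Placeholder filter. A real pipeline might ask the LLM to re-label the
    rewrite, or check agreement with a trained classifier."""
    return len(rewritten_text) > 0

def augment(dataset: list[Example]) -> list[Example]:
    augmented = list(dataset)
    for ex in dataset:
        rewrite = llm_rewrite(ex.text)
        if rewrite != ex.text and label_consistent(ex, rewrite):
            augmented.append(Example(rewrite, ex.label))
    return augmented

seed = [Example("a great movie with a weak ending", 1)]
print(augment(seed))
```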
Together, these contributions offer a unified framework that connects model expressivity, data quality, and structural design. Our goal is to inform the development of learning systems that not only perform well empirically, but also base their decisions on meaningful, generalizable signals in the data.
Xiaoyu Liu is a Ph.D. candidate in the Computer Science Department at the University of Maryland, College Park, advised by Prof. Furong Huang. Her research focuses on causal inference and representation learning, developing efficient and interpretable methods to learn non-spurious decision functions from observational data.