This dissertation investigates how language models, including contemporary LLMs, can perpetuate social biases related to gender, race, and ethnicity as inferred from first names. Guided by the principle of counterfactual fairness, we use name substitution to uncover, understand, and mitigate these biases across three domains: stereotypes about personal attributes, occupational bias, and overgeneralized assumptions about romantic relationships.
By analyzing model behavior across diverse names, this dissertation reveals patterns of unfair treatment, including demographically influenced personality judgments in social commonsense reasoning, hiring discrimination based on gender, race, and ethnicity, and heteronormative bias in relationship predictions. To address these issues, we propose open-ended diagnostic frameworks, interpretability analyses based on contextualized embeddings, and a novel consistency-guided fine-tuning method.
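To make the name-substitution idea concrete, the sketch below holds a sentence fixed, varies only the first name, and checks whether a model's prediction shifts. It is a minimal illustration rather than the dissertation's actual experimental setup: the template, the name list, and the off-the-shelf sentiment classifier are assumptions chosen for demonstration.

```python
# Minimal sketch of a counterfactual name-substitution probe (illustrative only).
# The template, name list, and sentiment classifier are assumptions for
# demonstration, not the dissertation's benchmarks or models.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

TEMPLATE = "{name} is known around the office for how they handle deadlines."
NAMES = ["Emily", "Lakisha", "Jamal", "Brendan"]  # hypothetical name set

for name in NAMES:
    text = TEMPLATE.format(name=name)
    pred = classifier(text)[0]
    # Under counterfactual fairness, the prediction should be (near) invariant to the name.
    print(f"{name:>8}: {pred['label']} ({pred['score']:.3f})")
```

Systematic score gaps across names on otherwise identical inputs are the kind of signal the dissertation's diagnostics are designed to surface and that its consistency-guided fine-tuning aims to reduce.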
Together, these contributions aim to build fairer, more interpretable, and more inclusive language technologies.
Haozhe An is a Ph.D. candidate in the Department of Computer Science, advised by Professor Rachel Rudinger, and a member of the CLIP lab. His research aims to understand and mitigate issues of bias and unfairness in NLP systems.