Analyzing language often requires assessing a cause-effect relationship. Does making a complaint polite increase the chance of someone resolving it quickly? Does adding self-identifying details to a Reddit flair change how other users respond? Causal inference provides a framework for modeling and estimating such causal effects from data, under a variety of assumptions. Despite the role causality plays in the sciences, using causal inference to analyze text is challenging because, unlike in other domains, the relevant variables are often unmeasured and are instead encoded in unstructured text data. In contrast, the field of machine learning (ML) has arguably succeeded at extracting task-relevant information from unstructured inputs like text, but largely in the context of analyzing associations. How can ML methods be used to draw causal conclusions from text data? In this talk, I'll discuss two use cases of ML for drawing valid causal inferences. First, I'll introduce causally sufficient text embeddings, a general method to adjust for the biases that arise in observational text data, allowing us to answer causal questions about interventions of interest. Next, I'll review an extension of these causally sufficient embeddings that allows us to analyze text when the causes of interest are unobserved. Finally, I'll conclude by highlighting the potential role of causality as a tool to better understand large language models.
Organizer's note: This talk will be virtual.
Dhanya Sridhar is an assistant professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, a core academic member of Mila, and a Canada CIFAR AI Chair. Prior to this, she was a postdoctoral researcher at Columbia University. She received her doctorate from the University of California, Santa Cruz. In brief, her research focuses on combining causality and machine learning in service of AI systems that are robust to distribution shifts, adapt to new tasks efficiently, and discover new knowledge alongside us.