PhD Proposal: Supporting Data De-Identification in an Era of Open Science
Wentao Guo
Abstract
Human-subjects researchers are increasingly expected to de-identify and publish data about research participants in order to bolster reproducibility, empower meta-analysis, and create transparency. However, sharing data puts research participants at risk of harm, and de-identification is a difficult task that fundamentally lacks objective solutions for balancing privacy and utility. In my thesis, I confront the inescapable tensions of de-identification from a user-centric perspective, focusing on the practices and needs of researchers who collect and publish data about people. Through this work, I ultimately aim to help researchers, as well as the policymakers, repository curators, research participants, and others who shape the production and publication of data, make more informed decisions that account for privacy, utility, and ethics.
First, to better understand how researchers are currently informed about threats and strategies, I conducted a thematic analysis of 38 recent online de-identification guides. I characterize techniques and attacks, and I identify some concerning patterns around the definition of key terms, coverage of threats, and the usability of guides. Next, to investigate how researchers navigate the tensions surrounding de-identification in practice, I conducted semi-structured interviews with 24 researchers who have de-identified data for publication. I find that researchers account for important risks, but they address them through manual and social processes rather than systematic assessments of risk across the dataset. I explore why researchers take this approach and highlight three main barriers to stronger de-identification of research data related to threat modeling, incentives, and tools. Finally, to explore how computational de-identification tools and methods could support researchers, I propose to conduct design probes with researchers who have graduate-level training in quantitative human-subjects research methods. I hope to reveal how researchers judge the acceptability of computational de-identification tools and methods and, more broadly, which workflows and capabilities de-identification tools should support in order to meet researchers' needs.
Bio
Wentao Guo is a PhD student researching human-centered security and privacy, advised by Michelle Mazurek. His research focuses on unlocking the power of professionals to protect users' security and privacy. During his PhD so far, he has studied how tech product reviewers evaluate the security and privacy of the devices they write about; how researchers approach the challenging task of de-identifying sensitive data; and how experts tailor security and privacy advice for individuals facing heightened risks.
This talk is organized by Migo Gui