Talks

Privacy-Preserving Search of Similar Patients in Genomic Data

Shai Halevi - IBM

Thursday, March 9, 2017, 11:00-11:59 am

You are subscribed to this talk through .
You are watching this talk through .
You are subscribed to this talk. (unsubscribe, watch)
You are watching this talk. (unwatch, subscribe)
You are not subscribed to this talk. (watch, subscribe)

Abstract

The growing availability of genomic data holds great promise for advancing medicine and research, but unlocking its full potential requires adequate methods for protecting the privacy of individuals whose genome data we use. One example of this tension is running Similar Patient Query on remote genomic data: In this setting a doctor that holds the genome of his/her patient may try to find other individuals with "close" genomic data (in edit distance), and use the data of these individuals to help diagnose and find effective treatment for that patient's conditions. This is clearly a desirable mode of operation, however, the privacy exposure implications are considerable, so we would like to carry out the above "closeness" computation in a privacy preserving manner.

Secure-computation techniques offer a way out of this dilemma, but the high cost of computing edit distance privately poses a great challenge. Wang et al. proposed recently [ACM-CCS'15] an efficient solution, for situations where the genome sequences are so close that edit distance between two genomes can be well approximated just by looking at the indexes in which they differ from the reference genome. However, this solution does not extend well to cases with high divergence among individual genomes, and different techniques are needed there.

In this work we put forward a new approach for highly efficient edit-distance approximation, that works well even in settings with much higher divergence. We present contributions both in the design the approximation method itself and in the protocol for computing it privately. Our tests indicate that our approximation method works well even in regions of the genome where the distance between individuals is 5% or more with many insertions and deletions (compared to 99.5% similarly with mostly substitutions, as considered by Wang et al.). As for speed, our protocol implementation takes just a few seconds to run on databases with thousands of records, each of length thousands of alleles, and it scales almost linearly with both the database size and the length of the sequences in it. As an example, in the datasets of the recent iDASH competition, it takes less than two seconds to find the nearest five records to the query, in a size 500 dataset of length 3500 sequences. This is 2-3 orders of magnitude faster than using straightforward approaches.

This talk is organized by Jonathan Katz