Given two strings A and B from the DNA alphabet, the Levenshtein edit distance between A and B, $LED(A,B)$, is defined to be the minimum number of single character insertions, deletions and replacements to transform A to B (equivalently B to A). If in addition to the single character edits, one is permitted to perform segmental (block) edits in the form of (i) moving a block from any location to another, (ii) copying a block to any location, and (iii) uncopying (i.e. deleting one of the two occurrences of) a block, the resulting "block edit distance", $BED(A,B)$, captures much of our current understanding of the relation between individual genome sequences. If among two communicating parties, Alice (holding genome sequence A) and Bob (holding genome sequence B), wants to compute $BED(A,B)$ then, theoretically, the total number of bits Bob needs to send to Alice is $O^{\tilde}(BED(A,B)\cdot\log BED(A,B))$ [Cormode et al., SODA 2000]. Considering that between a typical donor genome B and a reference genome A, the number of single character differences are on the order of a few million and the number of structural (i.e., blockwise) differences are in the order of tens of thousands, it should be possible to communicate genomes by exchanging only a few million bytes! Yet, today, the most effective way of communicating genome sequence data involves physically exchanging hard disks.
In this talk we will try to explain the wide gap between theoretical expectations and the current reality in genome communication, as well as storage, and pose some theoretical and practical problems on the way to the Google-ization of genome search and analysis. We will also try to explore the extent of our theoretical predictions for genome sequences hold for the RNA-Seq data. Finally we will briefly go through some of the recent developments in transcriptome sequence analysis, especially in the context of disease studies.
S. Cenk Sahinalp is a Professor of Computing Science at Simon Fraser University, Canada. He received his Ph.D. in Computer Science from the University of Maryland at College Park in early 1997. Sahinalp is an NSF Career Awardee, a Canada Research Chair, a Michael Smith Foundation for Health Research Scholar and an NSERC Discovery Accelerator Awardee. His papers on genomics and bioinformatics have been highlighted by scientific journals and magazines (e.g. Genome Research, Nature Biotech, Genome Technology) and won several awards (e.g. ISMB'12 Best Student Paper, ISMB'11 HitSeq Best Paper). He was/is the conference general chair of RECOMB'11, PC chair of APBC'13, and has served as the sequence analysis area chair for ISMB and CSHL Genome Informatics Conferences. He has also co-founded the RECOMB-Seq conference series on Massively Parallel Sequencing. He is an area co-editor for BMC Bioinformatics and is on the editorial board of Bioinformatics and several other journals. He co-directs the SFU undergraduate program in Bioinformatics and the SFU Bioinformatics for Combating Infectious Diseases Research Program. His research interests include computational genomics, in particular algorithms for high throughput sequence data, network biology, RNA structure and interaction prediction and chemoinformatics algorithms.