Technological progress and economies of scale in biotechnology now allow us to measure ever more DNA and RNA, and to make more complete measurements of genes and genomes. However, our overwhelming success in data generation has led to monumental computational challenges. In this talk, I will discuss some recent work from the lab on sequence indexing, which aims to address some of them. Specifically, I will discuss two related but separate challenges.
First, I will discuss recent work in large-scale "raw" data indexing, which seeks to build indexing data structures that support queries over very large repositories of "raw" sequencing data (so-called sequencing reads). These tools allow researchers to comb through mountains of experimental data to search for sequences of interest, especially those that may be lost when standard downstream pipelines are applied to process this data.
Second, I will discuss recent work in scaling reference-based indexes. As we assemble ever more partial and complete genomes across the tree of life, we often seek to analyze new data in light of the genomes we have successfully assembled. Most such analyses start by mapping or aligning the new data against a collection of one or more of these references to identify what is present in each new sample, and in what abundance. This requires indexing data structures that can answer so-called "locate" queries, which speed up the process of sequence alignment. I will discuss recent work in developing indexes based upon the compacted colored de Bruijn graph, and how efficient sampling schemes can be derived that allow substantial reductions in index size while retaining asymptotically optimal query performance.
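As a rough illustration of what a "locate" query returns, here is a minimal Python sketch. It is a naive hash-table index over k-mers, not the compacted colored de Bruijn graph index described above, and it does no sampling: every occurrence is stored explicitly. The function names, the toy references and the choice of k are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_locate_index(references, k):
    """Map each k-mer to every (reference name, offset) where it occurs.

    Naive illustration only: memory grows with the total reference length,
    which is exactly the cost that sampled indexes try to reduce.
    """
    index = defaultdict(list)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def locate(index, kmer):
    """Return all (reference name, offset) hits for a k-mer, or an empty list."""
    return index.get(kmer, [])


# Hypothetical toy references.
refs = {"ref1": "ACGTACGT", "ref2": "TTACGA"}
idx = build_locate_index(refs, 3)
hits = locate(idx, "ACG")
misses = locate(idx, "GGG")
```

Here `hits` lists each reference and offset at which the k-mer occurs, and that set of positions is what an aligner uses as seeds. A sampled index would keep only a subset of these positions explicitly and recover the rest at query time.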
Rob Patro is an Associate Professor of Computer Science at the University of Maryland and a member of the Center for Bioinformatics and Computational Biology (CBCB) and the University of Maryland Institute for Advanced Computer Studies (UMIACS). He leads the Computational Biology and Network Evolution (COMBINE) lab. Prior to joining Maryland, Rob was an Assistant Professor of Computer Science at Stony Brook University. He obtained his Ph.D. in Computer Science from the University of Maryland and was a postdoctoral research associate in the Kingsford Group at the Lane Center for Computational Biology (now the Department of Computational Biology) at Carnegie Mellon University. He received an NSF CAREER award in 2018 and an Allen Newell Award for Research Excellence in 2020.
His main research interests are in the design of algorithms and data structures for processing, organizing, indexing and querying high-throughput genomics data and in the intersection between efficient algorithms and statistical inference. His current research focuses on the development of computational methods for accurate, efficient and uncertainty-aware transcriptome analysis using RNA-seq (both bulk and single-cell) as well as on the design of scalable (often succinct) data structures for indexing and querying raw sequencing data at the scale of public repositories. His lab develops a number of open-source tools for high-throughput genomic and transcriptomic analysis, most of which are available from GitHub at https://github.com/COMBINE-lab