How to factor a genome for fun and profit: De Bruijn’s legacy and the mathematical models, data structures, and algorithms at the core of modern genomics
Rob Patro
IRB 4150 or https://umd.zoom.us/j/93754397716?pwd=GuzthRJybpRS8HOidKRoXWcFV7sC4c.1
Wednesday, September 10, 2025, 3:00-4:00 pm
Abstract

Modern sequencing experiments generate enormous amounts of data — an estimated 1,000 petabytes (roughly 1 exabyte) each year. These experiments support research ranging from tracking microbial food contamination to studying cancer evolution in patients. The sheer scale of data production has transformed modern biology into a data-intensive discipline and, in many cases, a computational one. Realizing the full potential of these experimental capabilities for advancing our understanding of biological systems and improving human health requires more than applying established computer science principles at scale. Rather, it requires the development of fundamentally new and efficient algorithms, data structures, and computational methods.


In this talk, I will discuss how the De Bruijn graph, a mathematical construct from graph theory introduced in 1946, has evolved from a relatively esoteric object into a foundational model and a powerful tool in modern genomics. I will focus on two main lines of work from our lab that advance the state of the art in applying the De Bruijn graph to sequencing data at the exabyte scale. First, I will describe our efforts to develop highly parallel and memory-efficient methods for constructing the compacted and colored De Bruijn graph. These graph variants are essential in practice, and we build them both from large collections of reference genomes and from raw sequencing measurements. I will emphasize how careful modeling and succinct data structures can be applied to this problem. Second, I will present our work on data structures for indexing compacted and colored De Bruijn graphs. These indexes enable efficient querying of large genomic datasets at massive scale. Finally, I will highlight downstream applications of this work. For example, our methods have been used to compress, to a first-order approximation, all publicly available sequencing data in the NCBI Sequence Read Archive (SRA). They have also been applied to develop tools for accurate and efficient estimation of gene expression at the single-cell level, which the Alex’s Lemonade Stand Foundation has used to build a pediatric single-cell cancer atlas.
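To make the objects in the abstract concrete: in a De Bruijn graph built from sequencing reads, nodes are (k-1)-mers, edges are k-mers, and the compacted variant collapses maximal non-branching paths into single strings (unitigs). The sketch below is a minimal, illustrative toy in Python — the function names are hypothetical, and it is not the highly parallel, memory-efficient, or succinct machinery the talk describes (it ignores reverse complements, colors, and scale entirely).

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Toy De Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    out_edges = defaultdict(set)   # node -> set of successor nodes
    in_deg = defaultdict(int)      # node -> number of incoming edges
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]   # edge u -> v labeled by kmer
            nodes.update((u, v))
            if v not in out_edges[u]:
                out_edges[u].add(v)
                in_deg[v] += 1
    return nodes, out_edges, in_deg

def unitigs(nodes, out_edges, in_deg):
    """Compaction: walk maximal non-branching paths into unitig strings."""
    def branching(v):
        return not (in_deg[v] == 1 and len(out_edges[v]) == 1)

    paths = []
    for u in nodes:
        if branching(u):                 # unitigs start at branching nodes
            for v in out_edges[u]:
                path = u
                while not branching(v):  # extend through internal nodes
                    path += v[-1]
                    v = next(iter(out_edges[v]))
                path += v[-1]
                paths.append(path)
    return paths
```

For example, the reads `["ACGTT", "CGTTA"]` with k = 3 share overlapping k-mers and compact into the single unitig `ACGTTA`. A colored De Bruijn graph would additionally tag each k-mer with the set of input samples or genomes it came from; doing any of this at repository scale is precisely where the succinct data structures in the talk come in.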

Bio

Rob Patro is an Associate Professor of Computer Science at the University of Maryland and a member of the University of Maryland Institute for Advanced Computer Studies and the Center for Bioinformatics and Computational Biology (CBCB), where he leads the COMBINE lab. Prior to joining Maryland, Rob was an Assistant Professor of Computer Science at Stony Brook University. He obtained his Ph.D. in Computer Science from the University of Maryland and was a postdoctoral research associate in the Lane Center for Computational Biology in the Department of Computer Science at Carnegie Mellon University. He is the recipient of several awards including multiple best paper awards, the NSF CAREER award, and the Allen Newell Award for Research Excellence for his contributions in introducing and developing alignment-free methods for gene expression estimation.

Rob’s main research interests are in the design of algorithms and data structures for processing, organizing, indexing and querying high-throughput genomics data and in the intersection between efficient algorithms and statistical inference. His current research focuses on the development of computational methods for accurate, efficient and uncertainty-aware transcriptome analysis using RNA-seq (both bulk and single-cell) as well as on the design of scalable (often succinct) data structures for indexing and querying sequencing data at the scale of public repositories. His lab develops a number of open-source tools for high-throughput genomic and transcriptomic analysis, most of which are available from GitHub at https://github.com/COMBINE-lab.


This talk is organized by Samuel Malede Zewdu