log in  |  register  |  feedback?  |  help  |  web accessibility
Logo
PhD Defense: Algorithms and Data Structures for Indexing, Querying and Analyzing Large Collections of Sequencing Data in the Presence or Absence of a Reference
Fatemeh Almodaresi
Remote
Friday, July 24, 2020, 1:30-3:30 pm Calendar
  • You are subscribed to this talk through .
  • You are watching this talk through .
  • You are subscribed to this talk. (unsubscribe, watch)
  • You are watching this talk. (unwatch, subscribe)
  • You are not subscribed to this talk. (watch, subscribe)
Abstract
High-throughput sequencing has helped to transform our study of biological organisms and processes. For example, RNA-seq is one popular sequencing assay that allows measuring dynamic transcriptomes and enables the discovery (via assembly) of novel transcripts. Likewise, metagenomic sequencing lets us probe natural environments to profile organismal diversity and to discover new strains and species that may be integral to the environment or process being studied. The vast amount of available sequencing data, and its growth rate over the past decade also brings with it some immense computational challenges. One of these is how do we design memory-efficient structures for indexing and querying this data. This challenge is not limited to only raw sequencing data (i.e. reads) but also to the growing collection of reference sequences (genomes, and genes) that are assembled from this raw data. We have developed new data structures (both reference-based and reference-free) to index raw sequencing data and assembled reference sequences. Specifically, we describe three separate indices, “Pufferfish”, an index over a set of genomes or transcriptomes, and “Rainbowfish” and “Mantis” which are both indices for indexing a set of raw sequencing data sets. All of these indices are designed with consideration of support for high query performance and memory efficient construction and query.

The Pufferfish data structure is based on constructing a compacted, colored, reference de Bruijn graph (ccdbg), and then indexing this structure in an efficient manner. We have developed both sparse and dense indexing schemes which allow trading index space for query speed (though queries always remain asymptotically optimal). Pufferfish provides a full reference index that can return the set of references, positions and orientations of any k-mer (substring of fixed length “k” ) in the input genomes. We have built an alignment tool, Pu Aligner, around this index for aligning sequencing reads to reference sequences. We demonstrate that Pu Aligner is able to produce highly-sensitive alignments, similar to those of Bowtie , but much more quickly and exhibits speed similar to the ultrafast ligner while requiring considerably less memory to construct its index and align reads.

The Rainbowfish and Mantis data structures, on the other hand, are based on reference-free colored de Bruijn graphs (cdbg) constructed over raw sequencing data. Rainbowfish introduces a new efficient representation of the color information which is then adopted and refined by Mantis. Mantis supports graph traversal and other topological analyses, but is also particularly well-suited for large-scale sequence-level search over thousands of samples. We develop multiple and successively- refined versions of the Mantis index, culminating in an index that adopts a minimizer-partitioned representation of the underlying k-mer set and a referential encoding of the color information that exploits fast near-neighbor search and efficient encoding via a minimum spanning tree. We describe, further, how this index can be made incrementally updatable by developing an efficient merge algorithm and storing the overall index in a multi-level log-structured merge (LSM) tree. We demonstrate the utility of this index by building a searchable Mantis, via recursive merging, over , raw sequencing samples, which we then scale to over , samples via incremental update. This index can be queried, on a commodity server, to discover the samples likely containing thousands of reference sequences in only a few minutes.

Examining Committee: 
 
                           Chair:              Dr. Rob Patro      
                           Dean's rep:      Dr.   Steven M. Mount 
                          Members:         Dr.  David Mount
                                                Dr. Hector Corrada Bravo
                                                Dr. Mihai Pop
Bio

Fatemeh is a PhD student in Combine-Lab led by Dr. Rob Patro. She has worked on various projects in the field of computational biology under his supervision, mainly focused on designing algorithms and data structures that enable fast and memory-efficient queries over large collections of sequences.

This talk is organized by Tom Hurst