We propose three separate indices, “Pufferfish”, an index over a set of genomes or transcriptomes, “Rainbowfish” and “Mantis” which are indices for indexing a set of raw sequencing data sets. Specifically we designed a new, succinct representation for the color information of colored de Bruijn graphs (cdbg) in “Rainbowfish”. This representation can be used in de novo assembly and variant detection to keep information about the sample of origin of each k-mer when combining many samples. For example, our data structure allows building a cdbg on a metagenomic data set in just a few gigabytes while the space taken by other structures is hundreds of gigabytes, showing an order-of-magnitude improvement. Moreover, the query time for searching a k-mer and fetching the color information in our representation is almost the same as the existing data structures.
Using a similar representation as Rainbowfish to store the list of samples each k-mer appears in , combined with the use of the counting quotient filter to find the index associated to each k-mer, we’ve developed Mantis, a space-frugal index over thousands of samples with a fast query time up to 108 times faster than the sequence search tools at the time. Mantis is also a cdbg representation which supports fast graph traversal and other topological analyses in addition to large-scale sequence-level searches. Later, we update the color information representation by adopting a hierarchical encoding that exploits correlations among color vectors. Our results show significant improvements in the overall size and scalability of representation of the color information.
In Pufferfish, however, we’ve developed a new data structure for indexing large collections of reference sequences rather than raw sequence samples by building an index for the compacted, colored de Bruijn graph (ccdbg) over the reference sequences. We have developed both a sparse and dense indexing scheme which allows one to trade off index space for query speed (though queries always remain asymptotically optimal). In addition to indexing the k-mers, we store informations such as position and orientation of each k-mer in the reference set which makes it useful for purposes such as alignment and quantification.
For my thesis, I propose advancing the ideas explained earlier in two specific, but separate directions. For the reference-base index scheme, I focus on one of the applications of Pufferfish as the alignment tool used in estimating abundance of known genomes in metagenomic analysis. This work is along the lines of the tools such as Bracken and Karp and slightly different from metagenomic classification methods such as Kraken. In addition to that, I will continue working on the scalability limits of Mantis as a large-sequence-search index by investigating the potentials of merging two Mantises without the need to construct the color information in bit vector format and convert it to the hierarchical structure and by controlling the construction memory in a limited range with respect to the size of the input data.
Dept rep: Dr. David Mount
Members: Dr. Hector Corrada Bravo
Dr. Mihai Pop
Fatemeh Almodaresi is a a fifth year PhD student in Computer Science at University of Maryland. Her main research focus is on developing scalable data structures and algorithms for processing high-throughput genomic data; specifically, algorithms and data structures that are scalable to large-scale transcriptomic analysis and complex metagenomic experiments. She's been working on efficient ways of representing the colored de bruijn graphs, representing and indexing compacted de bruijn graphs (cdbg), and using these representations to solve classic problems of variant detection, read mapping, read alignment, and taxonomic read assignment. In the Combine-lab, researchers make use of different statistical approaches in addition to succinct data structures and hashing algorithms to optimize our approximate alignments or assembly outputs and have a more statistically accurate interpretation of the estimated results.