log in  |  register  |  feedback?  |  help  |  web accessibility
Logo
PhD Proposal: Algorithms and Data Structures to Organize, Index, and Search High-throughput Genomic Data
Jamshed Khan
Friday, July 28, 2023, 11:00 am-1:00 pm Calendar
  • You are subscribed to this talk through .
  • You are watching this talk through .
  • You are subscribed to this talk. (unsubscribe, watch)
  • You are watching this talk. (unwatch, subscribe)
  • You are not subscribed to this talk. (watch, subscribe)
Abstract
Rapid developments in the throughput and affordability of modern genome-sequencing technologies have made the generation of billions of short-read sequences from biological samples highly time- and cost-efficient. The exponential increase in available genomic data has resulted in bottlenecks where, in many cases, computational data analysis has become more expensive and time-consuming than the experiments that generate these data. These needs have continuously spurred computer scientists to design and implement ever more efficient and scalable methods for a variety of genomic analysis tasks. As such, development of algorithmic approaches capable of processing these data in a fundamentally scalable manner has been a focus at the forefront of computational genomics.

My ongoing PhD research work has been predominantly focused on developing scalable algorithms and data structures for (1) compressing, (2) organizing, (3) indexing, and (4) searching high-throughput genomic data, in ways so as to increase the scale at which existing analyses can be performed. Toward these objectives, and against the backdrop of ever-increasing high-throughput genomic data, a mathematical object called the de Bruijn graph—and the associated data structure and variants—has become a compact data representation of increasing importance and utility across computational genomics. Developing efficent algorithms and data structures to enable the use of the de Bruijn graph for tasks (1)-(4) mentioned above is the foundational characteristic of my PhD research.

In this proposal, I will first discuss compression; specifically a compact lossless representation of the de Bruijn graph, used to represent sets of genome sequences. Genomic data are quite costly in terms of space, and hence building a compact representation while retaining relevant features is usually the first step in many bioinformatics pipelines. We developed a state-of-the-art algorithm, Cuttlefish (along with an open-source implementation), that constructs the compacted graph from genome references with much better performance than other state-of-the-art tools—proposing a novel modeling approach for each graph vertex as an individual finite-state automaton and treating the graph as a collection of automata.

This algorithm is applicable for a specific class of input data—genome sequences. In subsequent work, we extended this method to be more general and be able take any form of input data—specifically, we enabled the method to also facilitate construction from sequencing read sets as input, which is fundamentally a more difficult problem due to structural properties of raw sequencing data. We proposed a new state-of-the-art solution, Cuttlefish 2, for this through generalization of the earlier algorithm, and made available a very efficient implementation of such.

Finally, I will propose ongoing and future work following this line of research. I will discuss further potential algorithmic and methodological improvements to enable scaling compacted de Bruijn graphs even further, both in efficiency and input size—while also incorporating source-annotation information (i.e. colors) into the graph. Then I will discuss some potential approaches into indexing and querying this graph for its annotation information—also known as the large-scale sequence search problem in computational genomics.
 
Examining Committee

Chair:

Dr. Robert Patro

Department Representative:

Dr. Laxman Dhulipala

Members:

Dr. Erin Molloy

Bio

Bio:

Jamshed Khan is a PhD student in Computer Science, advised by Dr. Rob Patro. Jamshed's research pertains to designing scalable data structures and parallel algorithms to efficiently compress, organize, index, and search high-throughput genomic data.
This talk is organized by Tom Hurst