PhD Defense: Analyzing and indexing huge reference sequence collections
Jason Fan
3137 or https://umd.zoom.us/j/93457774326
Thursday, August 31, 2023, 1:30-3:30 pm
Abstract
Recent advancements in genome-scale assays and high-throughput sequencing have made systematic measurement of model organisms both accessible and abundant. As a result, novel algorithms that exploit similarities across multiple samples or compare measurements against multiple reference organisms have been designed to improve analyses and gain new insights. However, such models and algorithms can be difficult to apply in practice. Furthermore, analysis of high-throughput sequencing data across multiple samples and multiple reference genomic sequences can be prohibitively costly in terms of space and time. In three parts, this dissertation investigates novel computational techniques that improve analyses at various scales.

In Part I, we present two general matrix-factorization algorithms designed to analyze and compare biological measurements of related species that can be summarized as networks. In Part II, we present methods that improve analyses of high-throughput sequencing data. The first method, ScalpelSig, reduces the computational burden of applying mutational signature analysis in resource-limited settings. The second method, a derivation of perplexity for gene and transcript expression estimation models, enables effective model selection in experimental RNA-seq data where ground truth is absent.

In Part III, we tackle the difficulties of indexing and analyzing huge collections of reference sequences. We introduce the spectrum preserving tiling (SPT), a new computational and mathematical abstraction. Mathematically, the SPT explicitly relates past work on compactly representing k-mer sets (namely the compacted de Bruijn graph and recent derivations of spectrum preserving string sets) to the indexing of k-mer positions and metadata in reference sequences. Computationally, the SPT makes possible an entire class of efficient and modular k-mer indexes. To this end, we introduce a pair of indexing schemes designed, respectively, to support rapid locate queries and k-mer “color” queries in small space. In the final chapter of this dissertation, we show how these modular indexes can be effectively and generically implemented.
 
Examining Committee

Chair: Dr. Robert Patro

Dean's Representative: Dr. Najib El-sayed

Members: Dr. Erin Molloy, Dr. Mihai Pop, Dr. Michael Cummings

Bio

Jason Fan is a PhD student at the Center for Bioinformatics and Computational Biology. In this oral exam, Jason will focus on his recent work on indexing huge reference sequence collections, which forms the final chapters of his dissertation.

This talk is organized by Tom Hurst