The advent of modern high-throughput sequencing technologies has empowered researchers to rapidly and cost-effectively sequence large numbers of DNA and RNA fragments from biological samples. A crucial step following sequencing is the estimation of feature abundances, which requires mapping or aligning reads to a reference that is far longer than the individual reads. Efficient read-mapping algorithms are therefore essential, especially because many datasets comprise millions of reads. However, even with efficient mapping, overlapping sequences among features introduce uncertainty in determining the true locus of origin of a read. This dissertation seeks to address these challenges in the context of bulk RNA-Seq and single-cell ATAC-Seq data, with applications in transcriptomics and genomics.
First, we introduce TreeTerminus, a tree-based framework designed to incorporate uncertainty in abundance estimates for RNA-Seq data. TreeTerminus constructs hierarchical trees from the samples in an RNA-Seq experiment, where the leaf nodes represent individual transcripts and the inner nodes represent aggregated transcript groups. Uncertainty decreases as one ascends the tree. The tree thus allows data to be analyzed at nodes of different resolution, with the level chosen to suit the analysis of interest.
Next, we present mehenDi, a method for uncertainty-aware differential analysis that leverages the tree generated by TreeTerminus. mehenDi identifies candidate differential nodes, which can include both transcripts and inner nodes, maximizing signal detection while controlling for inferential uncertainty in a data-driven manner. Applying mehenDi to both simulated and experimental datasets, we discovered inner nodes with a strong differential signal that would have been overlooked when analyzing the individual transcripts alone.
Finally, we introduce alevin-fry-atac, a method for processing and mapping single-cell ATAC-Seq data. We propose a novel pseudoalignment algorithm and a caching scheme that together enable fast and memory-efficient read mapping to the genome using the piscem index. Currently, alevin-fry-atac is three times faster than Chromap and uses one-third of the memory. Chromap, the only other fully open-source alternative, is itself significantly faster than Cell Ranger ATAC, a method developed by 10x Genomics. With the introduction of alevin-fry-atac, we establish the first fully open-source ecosystem capable of processing both single-cell RNA-Seq and ATAC-Seq data, facilitating seamless multi-omic analysis.
Noor is a PhD candidate in the Department of Computer Science, advised by Dr. Rob Patro. His research is in bioinformatics and computational biology, with a focus on developing algorithms and statistical methods for analyzing different types of genomic data.