Talks

PhD Defense: Computational approaches for improving the accuracy and efficiency of RNA-seq analysis

Hirak Sarkar

Remote

Friday, July 10, 2020, 2:00-4:00 pm

You are subscribed to this talk through .
You are watching this talk through .
You are subscribed to this talk. (unsubscribe, watch)
You are watching this talk. (unwatch, subscribe)
You are not subscribed to this talk. (watch, subscribe)

Abstract

The past decade has seen tremendous growth in the area of high throughput sequencing technology, which simultaneously improved the biological resolution and subsequent processing of publicly-available sequencing datasets. This enormous amount of data also calls for better algorithms to process, extract and filter useful knowledge from the data. In this thesis, we concentrate on the challenges and solutions related to the processing of bulk RNA-seq data. An RNA-seq dataset consists of raw nucleotide sequences, drawn from the expressed mixture of transcripts one or more samples. One of the most common uses of RNA-seq is obtaining transcript or gene level abundance information from the raw nucleotide read sequences and then using these abundances for downstream analyses such as differential expression. A typical computational pipeline for such processing broadly involves two steps: assigning reads to the reference sequence through alignment or mapping algorithms, and subsequently quantifying such assignments to obtain the expression of the reference transcripts or genes. In practice, this two-step process poses multitudes of challenges, starting from the presence of noise and experimental artifacts in the raw sequences to the disambiguation of multi-mapped read sequences. In this thesis, we have described these problems and demonstrated efficient state-of-the-art solutions to a number of them.

The defense presentation will explore multiple uses for an alternate representation of an RNA-seq experiment encoded in equivalence classes and their associated counts. In this representation, instead of treating a read fragment individually, multiple fragments are simultaneously assigned to a set of transcripts depending on the underlying characteristics of the read-to-transcript mapping. We used the equivalence classes for a number of applications ranging from developing data-driven compression methodologies to clustering de-novo transcriptome assemblies. Specifically, we introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly-reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups together transcripts in a data-driven manner, allowing transcript-level analysis where it can be confidently supported, and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.

Examining Committee:

Chair: Dr. Rob Patro
Dean's rep: Dr. Steve Mount
Members: Dr. Mihai Pop

Dr. Hector Corrada Bravo
Dr. John Dickerson

Bio

Hirak Sarkar is a Ph.D student at the department of Computer Science, working under the supervision of Prof. Robert Patro. His research interests include different areas of computational biology with a special focus in analyzing RNA-seq datasets.

This talk is organized by Tom Hurst