PhD Proposal: Clustering Algorithms for Characterizing Microbial Communities
Tu Luan
Tuesday, April 25, 2023, 1:00-3:00 pm
Abstract
Genomic sequence clustering, particularly 16S rRNA gene sequence clustering, is an important step in characterizing the diversity of microbial communities through an amplicon-based approach. As 16S rRNA gene datasets grow in size, existing sequence clustering algorithms are increasingly becoming an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We present SCRAPT, an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the dataset, allowing users to stop the clustering process once enough clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the process identifies the larger clusters in the dataset first. Using real datasets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while effectively capturing the large clusters in the dataset. The experiments also show that SCRAPT produces Operational Taxonomic Units (OTUs) that are less fragmented than those produced by popular tools such as UCLUST, CD-HIT, and DNACLUST.
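
To make the sample-then-recruit idea in the abstract concrete, here is a minimal toy sketch in Python. It is not the SCRAPT implementation. The function names, the position-wise identity measure, the default parameters, and the rule that returns undersized clusters to the pool are all assumptions made for illustration. The adaptive sampling and mode-shifting steps mentioned above are omitted.

```python
import random


def identity(a, b):
    """Fraction of matching positions (toy stand-in for alignment-based similarity)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_centers(seqs, threshold):
    """Greedily cluster a small sample and return one center per cluster."""
    centers = []
    for s in seqs:
        if not any(identity(s, c) >= threshold for c in centers):
            centers.append(s)
    return centers


def iterative_sample_cluster(seqs, threshold=0.9, sample_rate=0.1,
                             max_iters=10, min_cluster_size=2,
                             enough_clusters=None, seed=0):
    """Hypothetical iterative sampling-based clustering loop.

    Returns (clusters, remaining), where remaining holds unclustered sequences.
    """
    rng = random.Random(seed)
    remaining = list(seqs)
    clusters = []
    for _ in range(max_iters):
        if not remaining:
            break
        # Early stop once the caller has enough clusters for their analysis.
        if enough_clusters is not None and len(clusters) >= enough_clusters:
            break
        # Sample a small fraction; large clusters are the most likely to be hit.
        k = max(1, int(len(remaining) * sample_rate))
        sample = rng.sample(remaining, k)
        centers = greedy_centers(sample, threshold)
        # Recruit every remaining sequence to the first center it matches.
        members = {c: [] for c in centers}
        leftover = []
        for s in remaining:
            hit = next((c for c in centers if identity(s, c) >= threshold), None)
            if hit is None:
                leftover.append(s)
            else:
                members[hit].append(s)
        # Assumption: undersized clusters go back to the pool for later rounds.
        for group in members.values():
            if len(group) >= min_cluster_size:
                clusters.append(group)
            else:
                leftover.extend(group)
        remaining = leftover
    return clusters, remaining
```

On a toy input of 60 copies of one sequence, 30 copies of a second, and a single copy of a third, calling `iterative_sample_cluster(data, max_iters=5)` returns the two large groups as clusters and leaves the singleton in the remaining pool, since a random sample of that input can contain the singleton at most once and so always includes a member of a large group.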

The emergence of long-read sequencing technologies, capable of producing reads of 10,000 base pairs or longer, provides opportunities in various areas of genomic studies. In the latter sections of this proposal, we outline our future plans for characterizing microbial communities using clustering algorithms that incorporate long-read sequencing technologies. We plan to extend the SCRAPT algorithm to cluster full-length 16S rRNA gene sequences generated by long-read sequencing platforms.

Metagenomic scaffolding reconstructs the original genomic sequences of organisms from metagenomic sequencing data. It can be viewed as clustering contigs assembled from metagenomic data that originate from the same organism, then building a graph layout from mate-pair or paired-end read information. Our second objective is to extend MetaCarvel, a specialized tool for metagenomic scaffolding, to perform hybrid metagenomic scaffolding, combining the strengths of short- and long-read sequencing data to improve the contiguity and repeat resolution of the resulting scaffolds.
 
Examining Committee

Chair:

Dr. Mihai Pop

Department Representative:

Dr. Aravind Srinivasan

Members:

Dr. Brantley Hall

Bio

Tu Luan is a fifth-year PhD student at the University of Maryland, advised by Dr. Mihai Pop. Her research interests lie in computational biology, with a focus on metagenomics. She obtained her bachelor's degree from Bryn Mawr College in 2018.

This talk is organized by Tom Hurst