PhD Defense: Data-driven algorithms for characterizing microbial communities

Nidhi Shah

Remote

Wednesday, April 14, 2021, 1:00-3:00 pm

Abstract

Complex microbial communities play a crucial role in environmental and human health. Traditionally, microbes have been studied by isolating and culturing them, an approach that misses organisms that cannot grow in standard laboratory conditions and loses information about microbe-microbe interactions. With affordable high-throughput sequencing, a new field of metagenomics has emerged that studies the genomic content of the microbial community as a whole. Metagenomics enables researchers to characterize complex microbial communities; however, many computational challenges remain in the downstream analysis of large sequencing datasets. Here, we explore some fundamental problems in metagenomics and present simple algorithms, along with open-source software tools that implement them.

In the first section, we focus on using a reference database of known organisms (and genomic segments within) to taxonomically classify sequences and estimate abundances of taxa in a metagenomic sample. We developed a “BLAST outlier detection” algorithm that identifies significant outliers within database search results. We extended this method and developed ATLAS, which uses significant database hits to group sequences in the database into partitions. These partitions capture the extent of ambiguity within the classification of a sample. Besides taxonomically classifying sequences, we also explored the problem of taxonomic abundance profiling, i.e., estimating the abundance of different species in the community. We describe TIPP2, a marker gene-based abundance profiling method, which combines phylogenetic placement with statistical techniques to control classification accuracy. TIPP2 includes an updated set of reference packages and several algorithmic improvements over the original TIPP method.

Next, we explore how to reconstruct genomes from metagenomic samples. Despite advances in metagenome assembly algorithms, assembling reads into complete genomes is still a computationally challenging problem because of repeated sequences within and among genomes, uneven abundances of organisms, sequencing errors, and strain-level variation. To improve upon the fragmented assemblies, researchers use a strategy called binning, which involves clustering together genomic fragments that likely originate from an individual organism. We describe Binnacle, a tool that explicitly accounts for scaffold information in binning. We describe novel algorithms for estimating the scaffold-level depth of coverage information and show that variation-aware scaffolders help further improve the completeness and quality of the resulting metagenomic bins.

Finally, we explore how to organize the enormous sets of sequence data generated through the surge of metagenomic studies. There have been recent efforts to organize and document genes found in microbial communities in “microbial gene catalogs”. Although gene catalogs are commonly used, they have not been critically evaluated for their effectiveness as a basis for metagenomic analyses. We investigated one such catalog, focusing on both the approach used to construct it and its effectiveness when used as a reference for microbiome studies. Our results highlight important limitations of the approach used to construct the catalog and call into question the broad usefulness of gene catalogs. We also recommend best practices for the construction and use of gene catalogs in microbiome studies and highlight opportunities for future research.

With the increasing volume of data being generated across metagenomic studies, we believe our ideas, algorithms, and software tools are well timed to meet the needs of the community.

Examining Committee:

Chair: Dr. Mihai Pop
Dean's rep: Dr. Michael P. Cummings
Members: Dr. Marine Carpuat
Dr. Robert Patro
Dr. Brantley Hall

Bio

Nidhi Shah is a PhD student in Computer Science at the University of Maryland, College Park, advised by Dr. Mihai Pop. Her research interests are in developing algorithms to analyze large-scale metagenomic datasets.

This talk is organized by Tom Hurst