PhD Defense: Computational Metagenomics: Network, Classification and Assembly
Bo Liu - University of Maryland, College Park
Thursday, July 12, 2012, 2:00-3:00 pm
Abstract

THE DISSERTATION DEFENSE FOR THE DEGREE OF Ph.D. IN COMPUTER SCIENCE FOR

Bo Liu

Owing to rapid advances in DNA sequencing technologies over the past 10 years, large numbers of short DNA reads can now be obtained quickly and cheaply. Metagenomics is a new scientific field that analyzes genomic DNA obtained directly from the environment, enabling studies of novel microbial systems. My research focuses on developing efficient algorithms and tools to tackle challenging computational problems in this field.

From the functional perspective, a metagenomic sample can be represented as a weighted metabolic network: nodes are molecules, edges are enzymes, and the weights are the strengths of the reactions. The goal is to find metabolic subnetworks that are differentially abundant between two groups of samples. Previous tools usually partition the whole metabolic network into a set of connected components and apply traditional statistical comparisons. We have developed a statistical network analysis tool, MetaPath, which uses a greedy search algorithm to find maximum-weight subnetworks and a nonparametric permutation test to measure their statistical significance. Unlike previous approaches, MetaPath explicitly searches for significant subnetworks in the global network, enabling us to detect signatures at a finer level. In addition, we developed statistical methods that take the network topology into account.
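
As an illustration of the two ideas named above (a greedy search for a heavy connected subnetwork, followed by a permutation test), here is a minimal Python sketch. It is not MetaPath's code: the edge weight (absolute difference in group means minus a `penalty`), the function names, and the stopping rule are all assumptions made for this example.

```python
import random
from statistics import mean

def edge_weights(abundance, labels, penalty):
    """Hypothetical weight: |mean(group 0) - mean(group 1)| - penalty per edge."""
    weights = {}
    for edge, values in abundance.items():
        a = [x for x, g in zip(values, labels) if g == 0]
        b = [x for x, g in zip(values, labels) if g == 1]
        weights[edge] = abs(mean(a) - mean(b)) - penalty
    return weights

def greedy_subnetwork(edges, weights):
    """Grow a connected edge set from the heaviest edge, always adding the
    heaviest adjacent edge, and stop when no adjacent edge has positive weight."""
    seed = max(weights, key=weights.get)
    chosen, nodes, total = {seed}, set(edges[seed]), weights[seed]
    while True:
        frontier = [e for e, (u, v) in edges.items()
                    if e not in chosen and (u in nodes or v in nodes)]
        if not frontier:
            break
        best = max(frontier, key=weights.get)
        if weights[best] <= 0:
            break
        chosen.add(best)
        nodes.update(edges[best])
        total += weights[best]
    return chosen, total

def permutation_p_value(edges, abundance, labels, penalty, n_perm=200, seed=0):
    """Shuffle the sample labels, rerun the search, and count how often the
    permuted score reaches the observed score."""
    rng = random.Random(seed)
    _, observed = greedy_subnetwork(edges, edge_weights(abundance, labels, penalty))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, score = greedy_subnetwork(edges, edge_weights(abundance, shuffled, penalty))
        if score >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy network: molecules A-D, enzymes e1-e3, six samples in two groups.
edges = {"e1": ("A", "B"), "e2": ("B", "C"), "e3": ("C", "D")}
abundance = {"e1": [10, 11, 9, 1, 2, 1],
             "e2": [5, 6, 5, 1, 1, 2],
             "e3": [3, 3, 3, 3, 3, 3]}
labels = [0, 0, 0, 1, 1, 1]
chosen, _ = greedy_subnetwork(edges, edge_weights(abundance, labels, penalty=1.0))
observed, p_value = permutation_p_value(edges, abundance, labels, penalty=1.0)
```

Rerunning the whole search on each permutation, rather than re-scoring the subnetwork found on the real labels, keeps the p-value honest about the selection step: the null distribution is the best score a search could find by chance.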

Another computational problem is how to classify anonymous DNA sequences. There are several challenges here: (i) How to model the hierarchical structure of the labels? (ii) How to compute a confidence value? (iii) How to analyze billions of data items quickly? We have developed a novel hierarchical classifier (MetaPhyler) for the classification of anonymous biological sequences. During training, MetaPhyler models the pairwise similarities within different hierarchical groups with Gaussian distributions, and the classification threshold is calculated by discriminant analysis. For a given query, the classification only requires its nearest neighbor; the confidence score is computed as a p-value from a formal hypothesis test. Through benchmark comparisons, we have shown that MetaPhyler is significantly faster and more accurate than previous tools.
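
The thresholding idea can be sketched in one dimension. The example below is not MetaPhyler's implementation; it assumes that a query is scored by its similarity to its nearest neighbor, that "same group" and "different group" similarities are each Gaussian with a shared (pooled) variance, and that the threshold is the resulting linear discriminant boundary. All names and the toy scores are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist, mean, variance

def fit_threshold(same_scores, diff_scores, prior_same=0.5):
    """Fit two Gaussians with a pooled variance and return the score at which
    the 'same group' class becomes more likely, plus the null distribution."""
    mu1, mu0 = mean(same_scores), mean(diff_scores)
    n1, n0 = len(same_scores), len(diff_scores)
    pooled_var = ((n1 - 1) * variance(same_scores)
                  + (n0 - 1) * variance(diff_scores)) / (n1 + n0 - 2)
    # 1-D linear discriminant boundary (assumes mu1 > mu0).
    threshold = (mu1 + mu0) / 2 + pooled_var * log((1 - prior_same) / prior_same) / (mu1 - mu0)
    null = NormalDist(mu0, sqrt(pooled_var))  # scores of unrelated sequences
    return threshold, null

def classify(score, threshold, null):
    """Accept the nearest neighbor's label if the score clears the threshold.
    The p-value is the chance of a score this high under the null."""
    p_value = 1.0 - null.cdf(score)
    return score >= threshold, p_value

# Toy training similarities (e.g., percent identity) for one taxonomic level.
same = [90, 92, 94, 96, 88]
diff = [60, 65, 70, 55, 62]
threshold, null = fit_threshold(same, diff)
accepted_hi, p_hi = classify(95, threshold, null)
accepted_lo, p_lo = classify(60, threshold, null)
```

In a hierarchical setting one would presumably fit a separate threshold per taxonomic level and report the deepest level at which the query is accepted; that extension is left out here.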

DNA sequencing machines can only produce very short strings (e.g., 100 bp) relative to the genome (e.g., a typical bacterial genome is 5 Mbp). One of the most challenging computational tasks is to assemble millions of short reads into longer contigs. In this project, we have developed a comparative metagenomic assembler (MetaCompass), which uses previously sequenced genomes to produce long contigs through read mapping and assembly. Given the availability of thousands of existing bacterial genomes, MetaCompass first chooses, for a particular sample, the best subset of genomes to use as references based on the sample's taxonomic composition. Then, the reads are aligned against these genomes using MUMmer-map or Bowtie2. Afterwards, we use a greedy algorithm for the minimum set-cover problem to build long contigs, and the consensus sequences are computed by majority rule. We also propose an iterative approach to improve the performance. Finally, MetaCompass has been evaluated and tested on over 20 terabytes of data.
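
Two of the steps above, the greedy set-cover heuristic and majority-rule consensus, can be sketched generically in Python. This is not MetaCompass code. The abstract does not say what plays the role of elements and sets, so the example assumes, purely for illustration, that the elements are reads and each candidate set is a reference genome together with the reads that map to it. The consensus function assumes ungapped alignments given as hypothetical `(start, sequence)` pairs.

```python
from collections import Counter

def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            break  # the remaining elements cannot be covered by any subset
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

def majority_consensus(length, alignments, gap="N"):
    """Call the most frequent base in each reference column; columns with no
    coverage get the gap symbol. Ties are broken arbitrarily."""
    columns = [Counter() for _ in range(length)]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < length:
                columns[pos][base] += 1
    return "".join(col.most_common(1)[0][0] if col else gap for col in columns)

# Toy data: five reads, four candidate genomes, and four aligned reads.
reads = {1, 2, 3, 4, 5}
mapped = {"gA": {1, 2, 3}, "gB": {3, 4}, "gC": {4, 5}, "gD": {1}}
chosen, uncovered = greedy_set_cover(reads, mapped)
consensus = majority_consensus(8, [(0, "ACGT"), (2, "GTTA"), (2, "GATA"), (3, "TTAC")])
```

The greedy rule does not guarantee a minimum cover, but it is the standard approximation for set cover and is simple to run at scale.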

 

Examining Committee:

Committee Chair: Dr. Mihai Pop

Dean's Representative: Dr. Charles Delwiche

Committee Members: Dr. Hector Corrada-Bravo, Dr. Carl Kingsford, Dr. Sridhar Hannenhalli

This talk is organized by Jeff Foster