Talks

Statistical Consistency of Quartet-based Species Tree Methods Under a Unified Model of Duplication, Loss, and Incomplete Lineage Sorting

Rachel Parsons - Molloy Lab, University of Maryland

4105 Brendan Iribe Center for Computer Science and Engineering (IRB)

Thursday, April 16, 2026, 2:00-3:00 pm

Abstract

Motivation: ASTRAL-pro (A-pro) is the leading method for reconstructing species trees under complex evolutionary scenarios involving gene duplication, loss, and coalescence. A major open question is whether A-pro is statistically consistent under a unified model of these processes, called DLCoal. This question is challenging to address because A-pro seeks a species tree that maximizes the number of four-taxon trees (called quartets) also displayed by the input (multi-copy) gene trees, excluding those induced by duplications and agglomerating those that are homeomorphic up to duplications. Critically, there is no notion of correctness when tagging gene tree vertices as duplication or speciation events in the context of deep coalescence.

Results: Here, we propose that a gene tree vertex is correctly tagged as a duplication if it is the most recent common ancestor of at least one pair of gene copies related via a duplication event. Under our definition, deep coalescence propagates duplication tags across gene tree vertices, sometimes resulting in the exclusion of quartets on orthologous gene copies. Nevertheless, we show that A-pro is statistically consistent under the DLCoal model for an exclusion-only version of its objective function, assuming the input gene trees are correctly rooted and tagged. To empirically evaluate this modification, we exclude "duplication quartets" in the related method TREE-QMC and find that it achieves accuracy similar to A-pro on simulated data under varying rates of deep coalescence, duplication and loss, and gene tree estimation error, as well as on a plant data set.
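
The tagging definition proposed in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation, nor the code of ASTRAL-pro or TREE-QMC. It assumes a rooted gene tree stored as a child-to-parent map and a list of leaf pairs already known to be related via a duplication event. The function name `tag_duplications`, the toy tree, and the leaf labels are all hypothetical.

```python
def tag_duplications(parent, dup_pairs):
    """Return the set of vertices tagged as duplications.

    parent    -- dict mapping each vertex to its parent (the root maps to None);
                 an assumed, illustrative encoding of a rooted gene tree
    dup_pairs -- iterable of (leaf_a, leaf_b) pairs of gene copies assumed to be
                 related via a duplication event
    A vertex is tagged if it is the most recent common ancestor (MRCA) of at
    least one such pair; every other vertex is implicitly a speciation.
    """
    def path_to_root(v):
        # List v and all of its ancestors, ending at the root.
        path = [v]
        while parent[v] is not None:
            v = parent[v]
            path.append(v)
        return path

    def mrca(a, b):
        # First vertex on b's path to the root that also lies on a's path.
        on_a_path = set(path_to_root(a))
        for v in path_to_root(b):
            if v in on_a_path:
                return v
        raise ValueError("vertices are not in the same rooted tree")

    return {mrca(a, b) for a, b in dup_pairs}


# Toy gene tree ((a1,b1)x,(a2,c1)y)r with two copies, a1 and a2, from species A.
parent = {"r": None, "x": "r", "y": "r",
          "a1": "x", "b1": "x", "a2": "y", "c1": "y"}

# Assume a1 and a2 are related via a duplication; their MRCA is the root r.
tagged = tag_duplications(parent, [("a1", "a2")])
```

In this toy tree only the root is tagged, so an exclusion-only objective of the kind described above would ignore quartets attributed to the duplication at `r`. How such quartets are identified in practice is beyond this sketch.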

This talk is organized by Marcus Fedarko