It is well known that biological databases contain errors. Less appreciated is the role that computation plays in the introduction and propagation of those errors. Advances in generative AI have made it easier than ever to generate large volumes of data. To ensure the integrity of biological research, it is imperative that we develop evidence-based practices that limit the potential for generative AI and other computational tools to introduce and amplify data errors.
In my talk, I will provide a brief overview of the current state of the art and describe initial research from my lab aimed at understanding how data errors propagate through databases, in the context of taxonomic annotation of 16S rRNA gene sequence data.