Modern single-cell genomics protocols encode critical metadata, such as cell barcodes, unique molecular identifiers, and linker sequences, directly into sequencing reads. Accurately identifying and extracting these elements is a prerequisite for all downstream biological analyses, yet the task is complicated by the proliferation of distinct library chemistries. Each chemistry features its own read geometry (i.e., the configuration and encoding of this technical data) and is subject to sequencing and PCR errors that can corrupt the expected structure. This thesis presents SEQPROC, a general-purpose sequence preprocessing tool built around a declarative domain-specific language called the Extended Fragment Geometry Description Language (EFGDL). EFGDL allows users to specify read structures without prescribing how the parsing should be performed. The language is compiled into an efficient execution graph by the ANTISEQUENCE library, which processes reads in a batched, multi-threaded fashion with a focus on memory reuse. We describe the design and implementation of EFGDL and ANTISEQUENCE, alongside a series of optimizations that produced a 4.8x speedup and O(1) memory usage. Furthermore, we detail the addition of edit distance matching and orientation-aware processing for long-read data. On benchmarks spanning four single-cell protocols, SEQPROC is consistently the fastest and most memory-efficient tool, using 3 to 305 times less memory than alternatives while achieving the highest read recovery on long-read datasets.
Elan Fisher is a Master’s student in Computer Science at the University of Maryland, College Park (UMD), where he is advised by Dr. Rob Patro. He previously earned his Bachelor of Science in Computer Science from UMD in 2024. Elan’s research focuses on the development of algorithms, domain-specific languages (DSLs), and computational tools tailored for high-throughput sequencing data. More generally, he is interested in creating tools that enable scientists to derive insights faster and more efficiently. In his spare time, Elan enjoys acting in plays, sailing, running, and spending time outdoors.

