Modern transformer-based machine learning has produced powerful systems, but most problems of practical interest are not natural-language problems. This dissertation studies what it takes to bring the autoregressive next-token prediction recipe out of its native habitat and into domains and training regimes that look different from conventional benchmark NLP. The first half concerns adapting the recipe to new practical domains, primarily through how data is represented to the model. We introduce STEP, which shows that a standard decoder-only transformer trained with a causal language modeling loss outperforms specialized architectures on tabular event prediction, given column-aware tokenization and simple training-time augmentations. In the second half, we adapt the training regime itself: GATES is a self-distillation framework that derives supervision online from consensus among privileged-context rollouts, improving language models on math reasoning benchmarks without external supervision. Across these contributions, the recurring observation is that the autoregressive recipe is more general than its origins suggest, but only when it is adapted carefully, with the right representations and explicit training signals.
Alex is a PhD candidate in Computer Science at the University of Maryland, College Park, advised by Professors Tom Goldstein and John Dickerson. His research explores the adaptability of next-token prediction beyond traditional text generation, including structured data modeling, efficient training via self-distillation, and inline code-editing agents.
During the summer of 2024, he was an Applied Research intern at Capital One, where he explored using transformers for event prediction. In 2025, he was a quantitative researcher on the structured learning team at Two Sigma, where he will return full-time following graduation.
Alex holds a Bachelor of Science in Computer Science and Operations Research from Columbia University.

