PhD Proposal: Architectural Approaches to Reasoning in Language Models
Sean McLeish
IRB-5165 | Zoom: https://umd.zoom.us/j/95765075269?pwd=8CSiFgr2JEKK4KBPHAyKMi1agxAeKt.1
Monday, March 30, 2026, 10:00-11:30 am
Abstract

Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In the first part of this work, we study scaling laws across multiple architectural shapes and hyperparameter choices, highlighting their impact on the resulting prescriptions. Our checkpoints also enable more detailed studies of scaling, such as an analysis of the relationship between width and depth, where we find that increased depth improves both final loss and benchmark accuracy.
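The abstract does not specify a fitting procedure, but the basic idea of fitting a scaling law can be sketched as a power-law regression of final loss against model size; the data points below are purely hypothetical placeholders for one model family:

```python
import numpy as np

# Hypothetical (parameter count, final loss) pairs for one model family;
# a scaling-law study fits many such families across architectural shapes.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([4.2, 3.8, 3.4, 3.1, 2.8])

# Fit loss ~ a * N^b by linear least squares in log-log space.
b, log_a = np.polyfit(np.log(params), np.log(loss), 1)
a = np.exp(log_a)
print(f"L(N) ~ {a:.2f} * N^({b:.3f})")  # b < 0: loss falls with scale
```

Comparing the fitted exponents across families with different widths and depths is one way such a study can surface the effect of architectural shape on the resulting prescriptions.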

Next, we explore how increasing model depth via depth recurrence may improve the arithmetic reasoning capabilities of transformers. We begin by studying addition and find that the poor performance of transformers on such arithmetic tasks stems in large part from their inability to keep track of the exact position of each digit within a long span of digits. We address this problem by adding an embedding to each digit that encodes its position relative to the start of the number. Beyond the boost these embeddings provide on their own, we show that other architectural modifications, such as input injection and recurrent layers, can improve performance even further.
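A minimal sketch of the digit-position idea, assuming a NumPy setup with hypothetical names (the table would be a learned embedding in practice): each digit receives an extra embedding indexed by its offset from the start of its own number, added to the ordinary token embedding before the transformer.

```python
import numpy as np

# Hypothetical embedding table; in a real model this would be learned.
rng = np.random.default_rng(0)
dim, max_digits = 8, 16
digit_pos_table = rng.normal(size=(max_digits, dim))

def digit_offsets(chars):
    # Offsets count from the start of each number and reset at every
    # non-digit character, so "12+345" -> [0, 1, 0, 0, 1, 2].
    offs, k = [], 0
    for ch in chars:
        if ch.isdigit():
            offs.append(k)
            k += 1
        else:
            offs.append(0)
            k = 0
    return offs

def add_digit_embeddings(token_emb, chars):
    # token_emb: (seq, dim) array of ordinary token embeddings.
    return token_emb + digit_pos_table[digit_offsets(chars)]
```

The key property is that a digit's embedding depends on its position within its number, not within the whole sequence, so the model can align corresponding digits of two operands regardless of where the numbers sit in the context.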

Finally, we extend our study of the relationship between transformer depth and reasoning capability to general language modeling, and develop a procedure for converting existing pretrained non-recurrent language models into depth-recurrent ones. In experiments on mathematical tasks, we observe that converting a pretrained model to a depth-recurrent one yields better performance at a given compute budget than simply post-training the original non-recurrent model.
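The conversion procedure itself is not spelled out in the abstract, but the core of a depth-recurrent forward pass can be sketched as follows (hypothetical stand-in layers, not the actual method): one shared block of layers is applied repeatedly, so effective depth grows with the recurrence count at a fixed parameter count.

```python
import numpy as np

def layer(x, w):
    # Stand-in for a transformer layer: a residual nonlinear map.
    return x + np.tanh(x @ w)

def recurrent_forward(x, block_weights, recurrences):
    # 'block_weights' parameterize a single shared block; converting a
    # pretrained non-recurrent model would amount to choosing which of
    # its existing layers form this block.
    for _ in range(recurrences):
        for w in block_weights:
            x = layer(x, w)
    return x
```

Because the same weights are reused at every recurrence, compute at inference time can be scaled up or down simply by changing `recurrences`, which is what makes the compute-budget comparison against post-training the original model meaningful.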

Bio

Sean McLeish is a third-year Computer Science PhD student at the University of Maryland, advised by Tom Goldstein. He graduated from the University of Warwick in 2023 with a first-class BSc (Hons) in Discrete Mathematics. His research currently focuses on algorithmic reasoning and language modelling.


This talk is organized by Migo Gui