Reducing the Memory Cost of Long Context Transformers
Ryan Synk
Thursday, May 8, 2025, 2:00-4:00 pm
Abstract

This proposal discusses methods for reducing the memory cost incurred by long-context inputs to transformer models in both the inference and training regimes. Demand is growing for inference and training of transformer models with hundreds of thousands of input tokens, yet employing models at these context lengths incurs significant memory costs. To combat this problem during inference, we construct a tunable mechanism that selects the most relevant tokens from the context. We cast the task of selecting the most relevant tokens as a problem of approximate nearest neighbor search, allowing us to maintain low-latency generation. We showcase the efficiency gains afforded by our method by performing inference on context windows of up to 1M tokens using approximately 16GB of GPU RAM. Our experiments reveal that models are capable of handling the sparsity induced by the reduced number of keys and values.
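
The abstract does not give implementation details, but the general idea can be sketched as follows, assuming a FAISS inverted-file index built over one attention head's cached keys. The function names and parameters below (build_key_index, sparse_attention_step, k, nprobe) are illustrative assumptions, not taken from the talk.

import numpy as np
import faiss

def build_key_index(keys: np.ndarray, nlist: int = 1024) -> faiss.Index:
    """Build an approximate nearest-neighbor index over cached attention keys.

    keys: (num_tokens, head_dim) float32 array of keys for one attention head;
    for long contexts, num_tokens is large enough to train the IVF quantizer.
    """
    d = keys.shape[1]
    quantizer = faiss.IndexFlatIP(d)  # inner product matches attention scoring
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(keys)
    index.add(keys)
    return index

def sparse_attention_step(query, keys, values, index, k=256, nprobe=8):
    """Attend over only the k keys/values most relevant to the current query."""
    index.nprobe = nprobe                      # trades search accuracy for latency
    _, idx = index.search(query[None, :], k)   # approximate top-k key ids
    idx = idx[0]
    idx = idx[idx >= 0]                        # drop empty slots, if any
    k_sel, v_sel = keys[idx], values[idx]      # gather only the selected KV pairs
    scores = k_sel @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_sel                     # attention output from the sparse subset

In this sketch, only the k retrieved keys and values need to sit in GPU memory at each generation step; the full key/value cache can live off-device behind the index.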

We then turn to reducing memory costs at training time, suggesting strategies to alleviate the high memory cost of the transformer's intermediate activations. We propose several methods for approximating these activations via compression, and highlight opportunities for future work in this area. A rough sketch of what activation compression can look like in code is given below.
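
As an illustration only, here is a minimal PyTorch sketch of one standard flavor of activation compression: a custom autograd function that stores an 8-bit quantized copy of a linear layer's input and reconstructs it for the backward pass. The compression scheme and names are assumptions for illustration, not the methods proposed in the talk.

import torch

class CompressedLinear(torch.autograd.Function):
    """Linear map y = x @ W^T that saves a quantized copy of x for backward.

    Assumes x is 2-D (batch, in_features) and weight is (out_features, in_features).
    Per-tensor int8 quantization is a placeholder for a real compression scheme.
    """

    @staticmethod
    def forward(ctx, x, weight):
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        x_q = torch.round(x / scale).to(torch.int8)   # compressed activation
        ctx.save_for_backward(x_q, scale, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x_q, scale, weight = ctx.saved_tensors
        x_hat = x_q.float() * scale                   # approximate reconstruction
        grad_x = grad_out @ weight
        grad_w = grad_out.t() @ x_hat                 # weight gradient uses the approximation
        return grad_x, grad_w

# Usage: y = CompressedLinear.apply(x, weight); the int8 copy replaces the
# full-precision activation that would otherwise be held until backward.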

Bio

Ryan Synk is a fourth-year PhD student at the University of Maryland, where he is advised by Ramani Duraiswami and Tom Goldstein. His research is in deep learning, and he currently focuses on algorithms for accelerating transformer models.


This talk is organized by Migo Gui