PhD Proposal: Improving Efficiency of Transformer Foundation Models
Armin Gerami
Remote: https://umd.zoom.us/j/96431729723?pwd=ZVV1TU5RaC9RQ3Y1L2p0ZlFZcjJEZz09
Tuesday, July 1, 2025, 11:00 am-1:00 pm
Abstract

Transformers are the foundational deep learning architecture behind many recent successes in diverse fields such as natural language processing, speech, computer vision, and biology. We examine key computational inefficiencies within the Transformer architecture and explore potential remedies. Specifically, we address three primary challenges. First, the standard attention mechanism scales quadratically with input sequence length due to its use of a softmax-based exponential kernel. We discuss how approximating this kernel with a linear estimate can reduce the complexity to linear time. Second, the all-to-all attention calculation, while necessary during training, becomes largely redundant during inference because most attention values are negligible. We review successful hierarchical strategies that combine coarse-grained token compression with fine-grained token selection to preserve both global context and local precision efficiently. Finally, the feed-forward networks (FFNs) in deep neural architectures, including Transformers, often develop sparse weights, a phenomenon described by the "Lottery Ticket Hypothesis." We explore how efficient sparse matrix multiplication accelerators can exploit this sparsity to speed up both inference and fine-tuning.
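
To make the first point concrete, the short NumPy sketch below contrasts standard softmax attention with a generic kernelized linear-attention variant. The feature map phi (a simple non-negative map) and the function names are illustrative assumptions, not the specific approximation proposed in the talk; the sketch only shows how associativity avoids forming the n x n attention matrix, turning O(n^2 d) work into O(n d^2).

import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the n x n score matrix makes this O(n^2 * d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Linearized attention: replace exp(q . k) with phi(q) . phi(k) and use
    # associativity, (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V), so the
    # n x n matrix is never formed; cost is O(n * d^2).
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                      # (d, d_v) summary of keys and values
    normalizer = Qf @ Kf.sum(axis=0)   # per-query normalization term
    return (Qf @ KV) / normalizer[:, None]

# Toy check: both produce an (n, d_v) output for a length-n sequence.
n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)

The two functions are not numerically equivalent (the kernel is only approximated), but the linear variant's cost grows with sequence length n rather than n^2, which is the efficiency gain the abstract refers to.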

Bio

Armin Gerami is a Computer Science PhD student specializing in High Performance Computing (HPC) and Differentiable Programming. His research focuses on enhancing the foundational efficiency of deep neural networks. He is currently developing novel architectures for Transformer models that use linearly scaling attention mechanisms (Linear Attention) and attention matrix sparsification to achieve significant speedups. Additionally, Armin's work involves accelerating sparse matrix multiplication at arbitrary densities, a critical technique for optimizing deep neural networks, whose weights are often highly sparse. Previously, Armin worked on efficient spatial audio rendering by estimating impulse responses with infinite impulse response (IIR) filters.

This talk is organized by Migo Gui