Data-Efficient and Fault-Tolerant Exascale Computing

Yafan Huang

IRB 4105 or https://umd.zoom.us/j/93666933047?pwd=gWgqOgGbBP6laZclyURdDG2mNdArBt.1

Thursday, March 26, 2026, 11:00 am-12:00 pm

Abstract

Modern high-performance computing (HPC) systems operate at massive scales, comprising thousands of nodes equipped with high-end CPUs and GPUs to support complex workloads such as large language model training, quantum simulation, and high-resolution scientific simulations. As these systems continue to scale, two major challenges identified by the U.S. Department of Energy (DOE) become increasingly critical: managing the growing volume of data and ensuring robust error resilience.

My research addresses both challenges by developing flexible, efficient, and broadly applicable software solutions. On the data-efficiency side, I design ultra-fast GPU-based compression frameworks, such as cuSZp, that achieve high compression ratios while preserving data fidelity for diverse applications. On the reliability side, I develop low-overhead fault-tolerance techniques that detect complex faults with minimal performance impact. Together, these contributions improve data efficiency and reliability in next-generation HPC and AI systems.

Bio

Yafan Huang is a Ph.D. candidate in the Department of Computer Science at the University of Iowa, advised by Prof. Guanpeng Li. He has been a visiting graduate student at Argonne National Laboratory since 2021, where he works with Dr. Sheng Di and Dr. Franck Cappello. His research focuses on high-performance computing (HPC) and scientific applications, with particular interests in data compression, fault tolerance, parallel computing, and compiler optimizations. Yafan is the recipient of the 2025 ACM–IEEE CS George Michael Memorial HPC Fellowship and has received multiple best paper award and finalist recognitions at systems conferences, including SC'22, SC'24, ICS'25, and LDAV'25.

This talk is organized by Samuel Malede Zewdu