Talks

PhD Defense: Longitudinal Data Analytics of HPC Systems and Applications

Onur Cankur

IRB-5165 https://umd.zoom.us/j/3309400543?pwd=c01CL1VqUk8vS2NObHNEYk5zNlZnZz09&omn=93479749930&jst=2

Wednesday, March 25, 2026, 10:00-11:30 am

Abstract

Modern high-performance computing systems continuously collect large volumes of telemetry through always-on monitoring services. These data describe system behavior, application resource usage, and changing operating conditions over time, and they enable performance analysis, workload characterization, and proactive resource management. As supercomputers grow in scale and complexity, analyzing such longitudinal telemetry becomes increasingly important because it helps reveal how different scientific workloads use modern systems and how system operation can be improved to better support diverse applications. Without effective analysis methods, important patterns in system utilization remain hidden. However, analyzing such data is difficult because HPC systems are large, heterogeneous, and dynamic. In addition, the relationship between low-level monitoring data and application behavior is often indirect and complex. Traditional analysis methods often rely on application-specific instrumentation or pre-run benchmarking, or they do not scale well in production environments with diverse workloads. This dissertation studies how longitudinal telemetry can be used to better understand and manage modern HPC systems. It uses production monitoring data from a leadership-class supercomputer in two complementary ways. First, it combines system-wide monitoring data with scheduler metadata to examine GPU workload behavior. This analysis characterizes how production applications use GPU resources, how utilization varies across GPUs and over time, and how workload behavior relates to compute activity, memory activity, and resource imbalance. Second, it combines scheduler metadata with network telemetry to identify similar runs, model performance variation, and predict, from telemetry collected near the beginning of execution, which runs are likely to perform unusually slowly relative to similar runs. This is done without application-specific profiling or prior knowledge of the code.
This dissertation shows that always-on monitoring data can be used not only to understand workload behavior at scale, but also to anticipate performance problems and support more informed operational decisions.

Bio

Onur Cankur is a Ph.D. student in Computer Science at the University of Maryland, College Park, where he is advised by Prof. Abhinav Bhatele in the Parallel Software and Systems Group (PSSG). His research focuses on high-performance computing (HPC), with an emphasis on performance analysis and operational data analytics for large-scale systems. Onur develops techniques for understanding and improving HPC systems using system telemetry and machine learning.

This talk is organized by Migo Gui