Modern high-performance computing (HPC) systems stream massive volumes of telemetry data via always-on monitoring services such as the Lightweight Distributed Metric Service (LDMS). This longitudinal data enables post-mortem performance analysis, continuous system health monitoring, early anomaly detection, and predictive resource management for proactive scheduling. However, effectively leveraging such telemetry remains challenging due to the scale, heterogeneity, and evolving nature of HPC workloads and system behavior. Without advanced analysis techniques, key patterns in the telemetry may go undetected, resulting in missed optimization opportunities, inaccurate diagnostics, inefficient resource management, and ultimately reduced system utilization. This proposal presents research efforts that use rich telemetry data from a leading supercomputer to enhance performance, reliability, and resource management in HPC systems. First, I explain our work on identifying spatial and temporal trends in GPU usage by analyzing previously under-explored hardware counters. Second, I propose an anomaly detection approach that uses job-aware clustering and graph-based system modeling to identify performance anomalies. Third, I propose a phase-aware online detection and forecasting framework that identifies and predicts application execution phases from system-level telemetry to enable proactive resource management.
Onur Cankur is a Ph.D. student in Computer Science at the University of Maryland, College Park, where he is advised by Prof. Abhinav Bhatele in the Parallel Software and Systems Group (PSSG). His research focuses on high-performance computing (HPC), with an emphasis on performance analysis and operational data analytics for large-scale systems. Onur develops techniques for understanding and improving HPC systems using system telemetry and machine learning.