PhD Preliminary: Learning Parallel Code and System Behaviors Across Modalities
Daniel Nichols
Abstract
Performance modeling is an integral part of the research process for computational scientists. It enables them to understand how different factors contribute to the final runtime of an application. This understanding is crucial to developing efficient scientific applications and simulations. While important, performance modeling is difficult because a large number of factors may contribute to final performance. Factors such as the algorithm, problem size, implementation, architecture, and systems software stack all impact performance, often in complex relationships. Analytical models can be employed to study these causal variables and performance; however, they are difficult to scale to a large number of input variables. Additionally, the relationship between the causal variables and performance may be unknown or complex, making it challenging to derive an analytical model. Fortunately, machine learning (ML) can help address these challenges, as ML algorithms excel at modeling unknown and complex relationships. Furthermore, ML-based performance models can handle a large number of input variables, making them ideal for modeling complex scientific codes. By training ML models on historical performance data, computational scientists can develop accurate models that predict the performance of new applications and simulations under different scenarios. However, current ML-based modeling approaches are limited to modeling one or two sources of performance data, such as hardware counters or application features. This limitation prevents models from making use of all available causal variables that may impact performance. In this proposal, we introduce novel approaches to performance modeling that can make use of all available data sources. We additionally propose performance latent spaces that can be used to model various output metrics, such as runtime or energy consumption. Finally, we propose a method to integrate these latent spaces into large language models to enable natural language discussion of code performance.
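As a concrete illustration of the kind of ML-based performance model the abstract describes, the sketch below is a hypothetical example (not taken from the proposal) that fuses two data sources, application features and hardware counters, into a single feature vector and trains a regressor to predict runtime. All feature names and the synthetic data are assumptions made for illustration.

```python
# Hypothetical sketch of a multimodal ML performance model:
# combine application features and hardware counters, then regress runtime.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(0)
n = 500

# Application-level features (illustrative: problem size, process count).
problem_size = rng.integers(1_000, 1_000_000, n)
num_procs = rng.choice([64, 128, 256, 512], n)

# Hardware-counter features (illustrative: cache miss rate, FLOP count).
cache_miss_rate = rng.uniform(0.01, 0.3, n)
flops = problem_size * rng.uniform(10, 20, n)

# Synthetic "measured" runtime with a nontrivial dependence on the inputs,
# standing in for historical performance data.
runtime = flops / (num_procs * 1e6) * (1 + 5 * cache_miss_rate)
runtime *= rng.lognormal(0.0, 0.1, n)  # measurement noise

# Fuse all causal variables into one feature matrix and fit the model.
X = np.column_stack([problem_size, num_procs, cache_miss_rate, flops])
X_tr, X_te, y_tr, y_te = train_test_split(X, runtime, random_state=0)

model = GradientBoostingRegressor().fit(X_tr, y_tr)
print(f"MAPE: {mean_absolute_percentage_error(y_te, model.predict(X_te)):.2%}")
```

In practice a single regressor over concatenated features is only a baseline; the proposal's multimodal approaches and performance latent spaces aim to go beyond this by learning shared representations across all available data sources.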
Bio
Daniel Nichols is a PhD student in Computer Science at the University of Maryland, College Park, working with the Parallel Software and Systems Group and advised by Prof. Abhinav Bhatele. His research interests lie at the intersection of high-performance computing and machine learning, where he focuses on applying ML to computer systems problems that arise in supercomputing. His work enables more efficient use of supercomputers through intelligent job scheduling and resource placement, large language model-guided performance optimization, and ML-driven performance modeling. Previously, he completed his BS in Computer Science at the University of Tennessee, Knoxville.
This talk is organized by Migo Gui.