Talks

Analyzing Large-scale Data in High Performance Computing using Machine Learning

Abhinav Bhatele

Virtual: https://umd.zoom.us/j/93637673064?pwd=TzJYcE15UXg0MTJSQXJ5UFFLMlBNZz09

Friday, October 2, 2020, 11:00 am-12:00 pm

Abstract

In the last decade, the amount of data generated on parallel systems from
computational science simulations and monitoring tools has grown exponentially
because of the increase in system sizes, the availability of additional
hardware counters and sensors, and larger parallel storage capabilities.
Research on analyzing such data has been rapidly moving from manual and one-off
tool efforts to statistical analysis and machine learning. Several HPC
facilities have started continuous monitoring of their systems and user jobs to
collect performance-related data for understanding performance and operational
efficiency. Such data can be used to optimize the performance of individual
jobs and the overall system by creating data-driven models that can predict the
performance of pending jobs.

In this talk, I will present our work on modeling the performance of
representative control jobs using longitudinal system-wide monitoring data to
explore the causes of performance variability. Using machine learning, we are
able to predict the performance of unseen jobs before they are executed, based
on the current system state. We analyze these prediction models in detail to
identify the features that are dominant predictors of performance. We
demonstrate that such models can be application-agnostic and can be used to
predict the performance of applications that are not included in training. I
will also briefly mention other research directions that my group is working
on: parallel deep learning, analyzing large graphs, and modeling epidemic
diffusion (more details here: https://pssg.cs.umd.edu).

Bio

Abhinav Bhatele is an assistant professor in the Department of Computer Science
at the University of Maryland, College Park. Previously, he was a principal
computer scientist in the Center for Applied Scientific Computing at Lawrence
Livermore National Laboratory. His research interests are broadly in systems
and networks, with a focus on parallel computing and big data analytics. He has
published research on programming models and runtimes, network design and
simulation, applications of machine learning to parallel systems, and
analyzing, modeling, and optimizing the performance of parallel software and
systems.

Abhinav received a B.Tech. degree in Computer Science and Engineering from
I.I.T. Kanpur, India, in May 2005, and M.S. and Ph.D. degrees in Computer
Science from the University of Illinois at Urbana-Champaign in 2007 and 2010,
respectively. Abhinav was an ACM-IEEE CS George Michael Memorial HPC Fellow in
2009. He has received best paper awards at Euro-Par 2009, IPDPS 2013, and IPDPS
2016. Abhinav was selected as a recipient of the IEEE TCSC Young Achievers in
Scalable Computing award in 2014 and the LLNL Early and Mid-Career Recognition
award in 2018.

This talk is organized by Richa Mathur.