Analyzing Large-scale Data in High Performance Computing using Machine Learning
Abhinav Bhatele
Friday, October 2, 2020, 11:00 am-12:00 pm
In the last decade, the amount of data generated on parallel systems from
computational science simulations and monitoring tools has grown exponentially
because of increases in system sizes, the availability of additional hardware
counters and sensors, and larger parallel storage capabilities. Research on
analyzing such data has been rapidly moving from manual, one-off tool
efforts to statistical analysis and machine learning. Several HPC
facilities have started continuous monitoring of their systems and user jobs to
collect performance-related data for understanding performance and operational
efficiency. Such data can be used to optimize the performance of individual
jobs and the overall system by creating data-driven models that can predict the
performance of pending jobs.
In this talk, I will present our work on modeling the performance of
representative control jobs using longitudinal system-wide monitoring data to
explore the causes of performance variability.  Using machine learning, we are
able to predict the performance of unseen jobs before they are executed based
on the current system state. We analyze these prediction models in detail to
identify the features that are dominant predictors of performance. We
demonstrate that such models can be application-agnostic and can be used for
predicting the performance of applications that were not included in training. I
will also briefly mention other research directions that my research group is
pursuing: parallel deep learning, analyzing large graphs, and modeling epidemic
diffusion (more details at https://pssg.cs.umd.edu).
Abhinav Bhatele is an assistant professor in the Department of Computer Science
at the University of Maryland, College Park. Previously, he was a principal
computer scientist in the Center for Applied Scientific Computing at Lawrence
Livermore National Laboratory. His research interests are broadly in systems
and networks, with a focus on parallel computing and big data analytics. He has
published research in programming models and runtimes, network design and
simulation, applications of machine learning to parallel systems, and on
analyzing, modeling and optimizing the performance of parallel software and
systems.
Abhinav received a B.Tech. degree in Computer Science and Engineering from
I.I.T. Kanpur, India in May 2005, and M.S. and Ph.D. degrees in Computer
Science from the University of Illinois at Urbana-Champaign in 2007 and 2010,
respectively. Abhinav was an ACM-IEEE CS George Michael Memorial HPC Fellow in
2009. He has received best paper awards at Euro-Par 2009, IPDPS 2013 and IPDPS
2016. Abhinav was selected as a recipient of the IEEE TCSC Young Achievers in
Scalable Computing award in 2014, and the LLNL Early and Mid-Career Recognition
award in 2018.
This talk is organized by Richa Mathur