High Performance Computing at Extreme Scale: Resilience, Energy Efficiency, and Scalability
Scientific discovery and engineering innovation have traditionally required extensive physical experiments. However, physical experiments face many limitations, e.g., safety considerations, cost in time and money, and real-world feasibility. As a result, many researchers now rely on computer simulation in areas such as airplane design, weather forecasting, quantum chemistry, and deep learning with neural networks. This creates great demand for computational power, which is why we need High Performance Computing (HPC). HPC is the practice of aggregating computing power in a way that delivers much higher performance than a traditional desktop or workstation in order to solve larger problems in science, engineering, or business. However, many challenges arise when using HPC systems at extreme scale, e.g., reliability, energy efficiency, performance, and portability. This talk will focus on one of the most widely used classes of computing systems in HPC: heterogeneous systems with GPUs. Linear algebra operations are the building blocks of many HPC applications. I will introduce several algorithm-based approaches to improve the fault tolerance, energy efficiency, and performance of linear algebra operations on heterogeneous systems with GPUs. In addition, I will talk about my current and future work, which includes virtualization in HPC, performance optimization on NVIDIA DGX-1, etc.
Mr. Jieyang Chen is a Ph.D. candidate in Computer Science at the University of California, Riverside. He received a Master's degree in Computer Science from the University of California, Riverside in 2014 and a Bachelor's degree in Computer Science and Engineering from Beijing University of Technology in 2012. His research primarily focuses on high performance computing on GPUs with CUDA, algorithm-based fault tolerance, energy-efficient computing, and virtualized computing. He completed a summer internship at Los Alamos National Laboratory in 2017 and is currently collaborating with Los Alamos National Laboratory on HPC virtualization. He is also collaborating with Pacific Northwest National Laboratory on GPU performance optimization.