Cray User Group 2023 Birds of a Feather
HPC System Test: Challenges and Lessons Learned Deploying Bleeding-Edge Network Technologies
Description
In the last couple of years, CUG sites across the world have begun deploying and testing the latest generation of HPE/Cray EX systems, including Perlmutter at NERSC (USA), LUMI at CSC (Finland), Frontier at ORNL (USA), and Setonix at Pawsey (Australia). In this birds of a feather session, we aim to gather center and vendor staff from across the globe to describe the deployment processes used, discuss challenges encountered, and share lessons learned during the deployment of HPE’s Slingshot technology at different scales. The session will first feature speakers from the HPE Engineering and ORNL HPC Scalable Systems teams, followed by an interactive discussion to encourage participants to share their own experiences.
The session will focus on these primary goals: (1) identify common challenges across sites, (2) gather information on tests and tools that could be leveraged by the community, and (3) define and share best practices that can be used to validate the functionality, performance, and stability of Slingshot-based systems.
When & Where
- Date: Monday, May 8, 2023, from 4:30pm to 6pm (Eastern European Summer Time - GMT+3).
- Location: Bysa 1 room, Clarion Hotel Helsinki, Helsinki, Finland
- Event Information: CUG 2023 Technical Program
Agenda
- Overview and goals
- Opening survey
- Lightning invited speakers:
  - Forest Godfrey (HPE)
  - Matt Ezell (ORNL)
- Discussion
- Summary
Invited Speakers
Forest Godfrey (HPE)
Forest Godfrey is a Distinguished Technologist at HPE and one of the software architects for Slingshot software. After attaining a Bachelor of Science in Computer Science from Carnegie Mellon University, Forest has spent a 23-year career at HPE, Cray, and SGI working on a number of supercomputer platforms in the areas of OS kernel, system management, and overall system architecture. He holds 13 patents in the field of high performance computing.
Matt Ezell (ORNL)
Matt Ezell is a high performance computing systems engineer at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL). His team is responsible for the deployment and day-to-day operation of the OLCF’s largest HPC resources. Matt is also the technical lead for the Frontier supercomputer, ORNL’s exascale system, which ranked first on the TOP500 list as of November 2022.
Moderators
Bilel Hadri (KAUST)
Bilel Hadri has been a Computational Scientist at the Supercomputing Core Lab at KAUST since July 2013. He contributes to benchmarking and performance optimization, helps with system procurements and upgrades, and provides regular training to users. He received his Masters in Applied Mathematics and his PhD in Computer Science from the University of Houston in 2008. He joined the National Institute for Computational Sciences at Oak Ridge National Laboratory as a computational scientist in December 2009, following a postdoctoral position begun in June 2008 at the University of Tennessee Innovative Computing Laboratory, led by Dr. Jack Dongarra. His areas of expertise include performance analysis, tuning, and optimization; system utilization analysis; monitoring and library usage tracking; porting and optimizing scientific applications on accelerator architectures (NVIDIA GPUs, Intel Xeon Phi); linear algebra; numerical analysis; and multicore algorithms.
Verónica G. Melesse Vergara (ORNL)
Verónica G. Melesse Vergara (Vergara Larrea) is originally from Quito, Ecuador. Verónica earned a B.A. in Mathematics/Physics at Reed College and an M.S. in Computational Science at Florida State University. Verónica has over a decade of experience in the high performance computing field and is currently Group Leader of the User Assistance - Pre-production Systems Group at the Oak Ridge Leadership Computing Facility. In addition to providing assistance to OLCF users, Verónica is part of the systems testing team, led acceptance for Summit, and is leading acceptance for Frontier, ORNL’s exascale supercomputer. Her research interests include high performance computing, large-scale system testing, and performance evaluation and optimization of scientific applications. Verónica is a member of both IEEE and ACM and serves on the ACM SIGHPC Executive Committee and the SC Steering Committee.