HPCTESTS 2024
Second International Workshop on HPC Testing and Evaluation of Systems, Tools, and Software
https://olcf.github.io/hpc-system-test-wg/hpctests/hpctests2024
Details
- When: 08:30-12:00 ET on Friday, November 22, 2024, held in conjunction with SC24.
- Where: Room B309, Georgia World Congress Center (Atlanta, Georgia, USA)
Description
The Second International Workshop on HPC Testing and Evaluation of Systems, Tools, and Software (HPCTESTS) will bring together experts from high performance computing (HPC) centers around the globe to present and discuss state-of-the-art HPC system testing methodologies, tools, and best practices. The workshop encourages submissions that highlight the benchmarks, tests, and procedures used in today's HPC systems. It provides an avenue to showcase newly developed tools and methodologies, as well as those still under active design, allowing authors to gather community feedback that can help guide their projects. As machine learning (ML) and deep learning (DL) workloads become more prevalent, HPC centers must provide a wider range of services and more robust, resilient resources to support both traditional HPC and ML/DL workloads. The workshop also invites submissions that look ahead to the post-exascale future of HPC system testing, helping the community consider alternative mechanisms for adapting to evolving and emerging workloads. The event welcomes international participation from HPC centers, academic institutions, and vendors in the supercomputing space.
The event will include a keynote on current HPC system testing topics, followed by a series of presentations of peer-reviewed accepted papers, and will conclude with a panel discussion.
Call for Papers
In addition to the topics outlined in the workshop description above, submissions may describe the procedures and tools utilized, along with challenges, lessons learned, and best practices for regression testing, acceptance testing, and hardware evaluations. Furthermore, the workshop encourages submissions that explore testbed evaluations as a means to gather preliminary results on system readiness, assisting system design and deployment efforts.
This half-day workshop will kick off with a keynote focused on novel HPC system testing techniques, followed by a series of paper presentations selected from the open call for submissions. The workshop will conclude with a panel discussing key concepts impacting HPC system testing and related topics.
The workshop description, program committee, Call for Papers (CFP), papers, and presentations will be archived on the HPC System Test Working Group website (https://olcf.github.io/hpc-system-test-wg/) under a dedicated section for the HPCTESTS workshop.
Due to the wide range of systems currently in production and the rapidly evolving landscape of supercomputer architectures, the community should leverage individual center efforts to develop a collaborative set of best practices that evolves from the publication of current tools, tests, and benchmarks. The diversity of architectures, combined with the distinct user workloads supported at each center, has resulted in each HPC center developing its own mechanisms, tests, benchmarks, and tools for acceptance and regression testing. In this venue, we hope to attract stakeholders from vendors, HPC center staff, students, and faculty interested in exploring HPC system testing topics in depth.
Topics of interest include, but are not limited to:
- Testing methodologies and procedures
- Tools and frameworks for regression testing (see the sketch following this list)
- Automated and continuous regression testing and performance monitoring
- Selection and development of proxy applications and benchmarks; synthetic vs. real applications
- Efforts to improve reproducibility, sustainability, and availability of tests that can be leveraged by the community
- Hardware and component focused testing (compute, memory, network, storage) at all scales from a single server (CPU/GPU) to clusters and cloud environments
- System software, programming language, and library testing
- Monitoring and analysis of test results
- Best practices and lessons learned
- Early failure detection and/or classification using ML approaches
- Benchmark and testing for emerging technologies (e.g., quantum computing, ML/AI)
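Several of these topics, particularly regression-testing frameworks and continuous performance monitoring, are concrete enough that a short example helps illustrate the intended scope. The sketch below is a minimal regression test written for ReFrame, one community framework in this space. It assumes ReFrame 4.x; the `stream.c` source file and its output patterns ("Solution Validates", a "Triad:" bandwidth line) are hypothetical stand-ins for a STREAM-like benchmark, not part of the workshop materials.

```python
# Minimal ReFrame regression-test sketch (assumes ReFrame 4.x).
# The source file 'stream.c' and its output format are hypothetical.
import reframe as rfm
import reframe.utility.sanity as sn
from reframe.core.builtins import performance_function, sanity_function


@rfm.simple_test
class StreamTest(rfm.RegressionTest):
    valid_systems = ['*']          # run on any configured system
    valid_prog_environs = ['*']    # ...with any programming environment
    build_system = 'SingleSource'  # compile a single source file
    sourcepath = 'stream.c'        # hypothetical STREAM-like benchmark

    @sanity_function
    def validate(self):
        # The test passes only if the benchmark self-validates its results.
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def triad_bandwidth(self):
        # Extracted on every run, enabling continuous performance monitoring
        # against per-system reference values set in the test or site config.
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)
```

With ReFrame installed, a test like this is typically invoked as `reframe -c stream_test.py -r` and can be scheduled from CI to provide the kind of continuous regression testing and performance monitoring listed above.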
Paper Submissions
The workshop will publish its proceedings with the SC24 conference. Authors must follow the formatting guidelines for SC24 Papers. Submissions can be 5-10 two-column pages (U.S. letter, 8.5 x 11 inches), excluding the bibliography, using the IEEE proceedings template. The IEEE conference proceedings templates for LaTeX and MS Word provided by IEEE eXpress Conference Publishing are available for download.
Submissions will be accepted through the SC24 Submissions site: https://submissions.supercomputing.org/. After you create an account, you will be able to submit to the HPCTESTS 2024 form.
Workshop Deadlines
- Paper Submission Deadline: August 16, 2024 AoE
- Author Notification: September 6, 2024 AoE
- Camera-ready: September 27, 2024 AoE
Organizing Committees
HPCTESTS 2024 General Chairs
- Verónica G. Melesse Vergara (Oak Ridge National Laboratory, USA)
- Bilel Hadri (King Abdullah University of Science and Technology, Saudi Arabia)
- Vasileios Karakasis (NVIDIA, Switzerland)
HPCTESTS 2024 Steering Committee
- Keita Teranishi (Oak Ridge National Laboratory, USA)
- Maciej Cytowski (Pawsey Supercomputing Centre, Australia)
- Michèle Weiland (Edinburgh Parallel Computing Centre / University of Edinburgh, Scotland)
- Olga Pearce (Lawrence Livermore National Laboratory, USA)
- Oscar Hernandez (Oak Ridge National Laboratory, USA)
HPCTESTS 2024 Program Committee
- Jay Blair (P&G, USA)
- Tina Declerck (Lawrence Berkeley National Laboratory, USA)
- Dan Dietz (Oak Ridge National Laboratory, USA)
- Jens Domke (RIKEN, Japan)
- Pascal Jahan Elahi (Pawsey Supercomputing Research Centre, Australia)
- Paul Ferrell (Los Alamos National Laboratory, USA)
- Bilel Hadri (King Abdullah University of Science and Technology, Saudi Arabia)
- Nick Hagerty (Oak Ridge National Laboratory, USA)
- John Holmen (Oak Ridge National Laboratory, USA)
- Adrian Jackson (University of Edinburgh, Scotland)
- Vasileios Karakasis (NVIDIA, Switzerland)
- Eirini Koutsaniti (Swiss National Supercomputing Centre, Switzerland)
- James Lin (Shanghai Jiao Tong University, China)
- Amiya K. Maji (Purdue University, USA)
- Alessandro Marani (CINECA, Italy)
- Zachary Tschirhart (HPE, USA)
- Andy Warner (HPE, USA)
Agenda
- 08:30-08:40 ET Welcome and Introduction (Verónica G. Melesse Vergara)
- 08:40-09:20 ET Architecting and deploying compute clusters for large language models (Mike Houston, NVIDIA)
- 09:20-09:40 ET Testing GPU Numerics: Finding Numerical Differences between NVIDIA and AMD GPUs (A. Zahid, et al.)
- 09:40-10:00 ET Performance Analysis of Scientific Applications on a Grace System (A. Ruhela, et al.)
- 10:00-10:30 ET BREAK
- 10:30-10:50 ET Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric (G. Schieffer, et al.)
- 10:50-11:10 ET Testing the Unknown: A Framework for OpenMP Testing via Random Program Generation (I. Laguna, et al.)
- 11:10-11:30 ET Benchmarking and Continuous Performance Monitoring of Ookami, an ARM Fujitsu A64FX Testbed Cluster (N. Simakov, et al.)
- 11:30-11:50 ET Perspectives and Discussions Panel
- 11:50-12:00 ET Closing Remarks by Bilel Hadri, Verónica Melesse Vergara