HPC System Test Workshop - 2026 HPCTESTATHON (Follow-up Edition)

Details

Description

This workshop brings together HPC experts, system engineers, and vendors to continue discussions on how HPC systems are tested, building on the first HPCTestathon held at CUG 2024.

As HPC systems become more complex, with mixed architectures, AI workloads, and energy constraints, testing is no longer just about running more tests. HPC centers need to better understand which tests are truly useful, when to run them, and how test results help day-to-day operations over the lifetime of a system.

HPCTestathon 2026 is a follow-up event that builds on the work started in 2024, while giving participants space to introduce new challenges, test cases, and emerging topics. The workshop focuses on sharing real-world experience and structured group discussions, with the goal of capturing practical insights that can be reused by the wider HPC community.

Goals

The overall objective of this event is to continue the open cataloging and sharing of HPC system tests, building on the outcomes of previous editions and on participant-driven recommendations. It also aims to foster open discussions on the challenges associated with each testing sub-theme, and to encourage structured reflection guided by testing areas.

Focus Areas

In addition to the focus areas addressed in 2024 (applications, file systems, node health, performance, schedulers, continuous testing), HPCTestathon 2026 will introduce new themes reflecting emerging challenges, including:

Program

HPCTestathon 2026 is a one-day event structured as follows:

Saturday morning, April 25, 2026: Experience Sharing

The goal of the morning sessions is to share lessons learned, challenges, and practical insights.

9:15 - 9:30: Welcome coffee

9:30 - 11:00: Experience Sharing

Format: 10-minute presentation + 5-minute Q&A

Time | Speaker | Title | Short description
9:30 - 9:45 | Cédric Jourdain, Mathieu Cloirec (CINES) | Introduction |
9:45 - 10:00 | Bilel Hadri (KAUST) | Keeping HPC Systems Nice and Calm: Regression Testing from KAUST Supercomputing Lab | We will share our experience integrating new CPU, GPU, and storage architectures, along with managing regular OS and PrgEnv updates across our HPE-Cray systems. The session highlights the testing frameworks, automation practices, and real-world wins that keep large-scale HPC/AI platforms stable, predictable, and ready to deliver a great user experience.
10:00 - 10:15 | Guilherme Peretti-Pezzi / Jonathan Coles (CSCS) | vCluster validation on Alps with ReFrame | Alps is a general-purpose compute and data Research Infrastructure, composed of a dynamic and heterogeneous set of vClusters. In this talk we will present an overview of the main challenges in testing these systems, in order to provide a smooth user experience throughout the vCluster life cycle.
10:15 - 10:30 | Maciej Cytowski (Pawsey) | ReFrame testing on Setonix at Pawsey | We will summarise the current operational ReFrame testing environment used on the Setonix supercomputer at the Pawsey Supercomputing Research Centre. We will cover types of tests, challenges, use cases, and recent developments on the performance monitoring side.
10:30 - 10:45 | Isa Wazirzada (HPE) | Hit Me with Your Best Shot: Portable Node Stress Testing with Torch Hammer | We will briefly introduce Torch Hammer, an open-source PyTorch-based framework we developed for node-level stress testing across CPUs, GPUs, and APUs. The focus will be on how it helps us check performance and stability expectations (e.g. GEMM performance, memory bandwidth, power/thermal
10:45 - 11:00 | Nick Hagerty, Verónica G. Melesse Vergara (ORNL) | Discovering the next Frontier of system testing @ OLCF | This talk will share updates and new features in the Oak Ridge Leadership Computing Facility (OLCF) Test Harness, present recent work integrating a database into the OLCF Test Harness, and discuss current collaborative efforts with other sites in the HPC system testing area.

11:00 - 11:30: Break

11:30 - 12:30: Experience Sharing

Time | Speaker | Title | Short description
11:30 - 11:45 | TBD | |
11:45 - 12:00 | TBD | |
12:00 - 12:15 | Mathieu Gontier (AMD) | Best practices for prolog/epilog in AMD HPC GPU batch systems | We will discuss common practices across Tier 1 AMD/HPC GPU systems for addressing system state, memory fragmentation, GPU health/performance, and RAS metrics. We will briefly cover RVS, a low-level performance benchmarking tool used in such flows.
12:15 - 12:30 | TBD | |

12:30 - 14:00: Lunch break

Buffet lunch on the Panoramic Terrace, with free access to the museum (guide available).

Saturday afternoon, April 25, 2026: Working Groups

14:00 - 14:30: Demonstration of the movement of the Eiffel Dome and the giant telescope

14:30 - 17:00: Working groups: different aspects of the HPC testing environment

Participants will be divided into working groups focusing on key aspects of HPC system testing. Each group will explicitly build on the existing test lists (provided as files in the working-groups folder), refining them based on current needs.

Discussions will be guided by a small, common set of questions, defined in advance with the working group leaders, in order to structure the exchanges while preserving open discussion.

17:00 - 17:30: Conclusions and Results

Connect with the HPC System Test Community

This workshop is organized as part of the HPC System Test Working Group.
Participants are encouraged to join the community Slack workspace to continue discussions beyond the event.

Workshop Registration

HPCTestathon 2026 Organizing Committee

TBD