Adding a New Test

The OLCF Test Harness (OTH) requires a specific source code repository structure for application tests. This document describes how to add an application and associated tests for a specific system to the harness.

Application Source Code Repository Structure

The top-level group in a GitLab group (or GitHub repo) contains a sub-group for each target machine.

Repository URL

Each application that can be tested on a given machine has a repository within the machine’s group.

Repository structure

The application repository must be structured as shown below:

<application name>/<test name>/Scripts/
                                      /rgt_test_input.ini
                                      /<check script>
                                      /<report script>
                                      /<job script template>
<application name>/Source/
                         /<build script>
                         /<other application source and build files>
                         /<inputs, scripts shared by multiple tests>

First, each application test must have its own subdirectory. The test directory has a mandatory Scripts subdirectory, which should contain the test input file (see Application Test Input below) and other required scripts (see Required Application Test Scripts below). This directory contains templates and input files for the test – a test must not modify this directory, but it may use files stored in this directory.

Second, the application’s source code and required build script must reside within the Source directory of the repository.

Example Repository Structure

For instance, let’s assume we have a Git repository for an application called hello_mpi.

To add a single node test and a two node test, we would create a separate subdirectory for each test, including their required Scripts subdirectory:

hello_mpi/c_n001/Scripts/
         /c_n002/Scripts/
         /Source

Note that the test names are not required to follow any specific naming convention, but you should avoid spaces and special characters in the names.

Application Test Input

Each test scripts directory should contain a test input file named rgt_test_input.ini. The test input file contains information that is used by the OTH to build, submit, and check the results of application tests. All the fields in the [Replacements] section can be used in the job script template and will be replaced when creating the batch script (see Job Script Template section below). The fields in the [EnvVars] section allow you to set environment variables that all stages of your test will be able to use.

Note

Environment variables cannot be used in the definition of other environment variables – ie, foo = $bar (See: Issue 132).

The following is a sample input for the single node test of the hello_mpi application mentioned above:

[Replacements]
#-- This is a comment
#-- The following variables are required
job_name = hello_mpi_c
walltime = 10
#-- %(<variablename>)s is the notation to use the value of a previously-defined variable
batch_filename = run_%(job_name)s.sh
build_cmd = ./build_hello_mpi_c.sh
check_cmd = ./check_hello_mpi_c.sh
report_cmd = ./report_hello_mpi_c.sh
#-- The following variables are optional
executable_path = hello
resubmit = 0
#-- Use in conjunction with resubmit argument to limit total submissions/runs of a test (inclusive of initial run)
#-- Set to 0 for indefinite resubmissions
max_submissions = 3


#-- The following are user-defined and used for Key-Value replacements
#-- ie, nodes replaces __nodes__ in the job script template
nodes = 1
total_processes = 16
processes_per_node = 16

[EnvVars]
FOO = bar

Note

Setting a variable in the Replacements section to <obtain_from_environment> pulls in the value set by an environment variable. For example, if you set nodes = <obtain_from_environemnt> and set RGT_NODES=4 in your environment, then __nodes__ will be replaced with 4.

Required Application Test Scripts

The OTH requires each application test to provide a build script, a job script template, a check script, and a reporting script. These scripts should be placed in the locations described in Repository structure. If the OTH cannot find the scripts specified in the test input, it will fail to launch.

Build Script

The build script can be a shell script, a Python script, or other executable command. It is specified in the test input file as build_cmd, and the OTH will execute the provided value as a subprocess. The build script should return 0 on success, non-zero otherwise.

For hello_mpi, an example build script named build_hello_mpi_c.sh may contain the following:

#!/bin/bash -l

module load gcc
module load openmpi
module list

mkdir -p bin
mpicc hello_mpi.c -o bin/hello

The build command be executed from the directory $BUILD_DIR, which is a copy of the contents of Source/. This means the build script should be written as if it were executed from Source/, regardless of where it actually is.

Likewise, the path to the build script given by build_cmd in rgt_test_input.ini should be relative to the Source/ directory.

Job Script Template

The OTH will generate the batch job script from the job script template by replacing keywords of the form __keyword__ with the values specified in the test input [Replacements] section.

The job script template must be named appropriately to match the specific scheduler of the target machine. For SLURM systems, use slurm.template.x as the name. For LSF systems, use lsf.template.x. An example SLURM template script for the hello_mpi application follows:

#!/bin/bash -l
#SBATCH -J __job_name__
#SBATCH -N __nodes__
#SBATCH -t __walltime__
#SBATCH -o __job_name__.o%j

# Define environment variables needed
export EXECUTABLE="__executable_path__"
export SCRIPTS_DIR="__scripts_dir__"
export WORK_DIR="__working_dir__"
export RESULTS_DIR="__results_dir__"
export HARNESS_ID="__harness_id__"
export BUILD_DIR="__build_dir__"

echo "Printing test directory environment variables:"
env | fgrep RGT_APP_SOURCE_
env | fgrep RGT_TEST_
echo

# Placing the environment setup script in a shared location reduces code duplication
# and ensures you have the same environment in building & running
source $BUILD_DIR/Common_Scripts/setup_env.sh

# Ensure we are in the starting directory
cd $SCRIPTS_DIR

# Make the working scratch space directory.
if [ ! -e $WORK_DIR ]
then
    mkdir -p $WORK_DIR
fi

# Change directory to the working directory.
cd $WORK_DIR

env &> job.environ
scontrol show hostnames > job.nodes

# Run the executable.
log_binary_execution_time.py --scriptsdir $SCRIPTS_DIR --uniqueid $HARNESS_ID --mode start

CMD="srun -n __total_processes__ -N __nodes__ $BUILD_DIR/bin/$EXECUTABLE"
echo "$CMD"
$CMD

log_binary_execution_time.py --scriptsdir $SCRIPTS_DIR --uniqueid $HARNESS_ID --mode final

# Ensure we return to the starting directory.
cd $SCRIPTS_DIR

# Copy the output and results back to the $RESULTS_DIR
# Depending on the size of files in $WORK_DIR, you may want to change this
cp -rf $WORK_DIR/* $RESULTS_DIR
cp $BUILD_DIR/output_build*.txt $RESULTS_DIR

# Check the final results.
check_executable_driver.py -p $RESULTS_DIR -i $HARNESS_ID

# Resubmit if needed:
# If you always want tests to resubmit if ``.kill_test`` is not present,
# then remove the conditional around calling ``test_harness_driver.py``.
case __resubmit__ in
    0)
       echo "No resubmit";;
    1)
       test_harness_driver.py -r __max_submissions__ ;;
esac

Using the job template above, the job will be submitted from the test Run_Archive/ directory and starts there. This is $RESULTS_DIR in the job template. The executable should then be run from $WORK_DIR directory, which is a scratch workspace derived from $RGT_PATH_TO_SSPACE.

One can access or copy any files relative to the Scripts/ directory using the $SCRIPT_DIR environment variable. For example, if one stores a CorrectResults directory at the same level as Scripts and Run_Archive for a test case, it can be be copied by adding the line

cp -a ${SCRIPT_DIR}/../CorrectResults ${WORK_DIR}/

inside the job script.

The environment variable $EXECUTABLE is also populated based on executable_path entry in rgt_test_input.ini file. The executable may still be inside $BUILD_DIR from the previous step, so one would need to either copy it to $WORK_DIR or provide the absolute path in the job script such as $BUILD_DIR/$EXECUTABLE.

Check Script

The check script can be a shell script, Python script, or other executable command.

Check scripts are used to verify that application tests ran as expected, and thus use standardized return codes to inform the OTH on the test result. Checking performance is optional but recommended for most tests. The check script return value should be one of the following:

  • 0: test succeeded

  • 1: test failed

  • 2: test completed but gave an incorrect answer

  • 5: test completed correctly but failed a performance target

These exit codes have no built-in meaning in the OTH other than 0 is a successful test and non-zero is a failed test. This set of test exit codes has been developed as a standard for test exit codes. The check script is launched from $RESULTS_DIR and stdout/stderr is captured in $RESULTS_DIR/output_check.txt.

For hello_mpi, an example check script named check_hello_mpi_c.sh may contain the following:

#!/bin/bash
echo "This is the check script for hello_mpi."
echo
echo -n "Working Directory: "; pwd
echo
echo "Test Result Files:"
ls ./*
echo
exit 0

Report Script

Like the check script, the report script can be a shell script, Python script, or other executable command. Report scripts are generally used to compute performance metrics from the run. The exit code of report scripts is not checked by the OTH. The report script is launched from $RESULTS_DIR and stdout/stderr is captured in $RESULTS_DIR/output_report.txt.

Note

In many cases, the check script serves the function of both the check and report script. In that event, report scripts often just exit 0.

Example Test from the Ground Up

This section details the thought process when developing a new test from the ground up. In this section, we develop an application repository named mpi-tests, which contains two “Hello, World!” MPI tests at different node counts. This section ignores Git integration and focuses on developing tests on an empty file system.

At the completion of this section, we will have created a directory structure that looks like the following:

mpi-tests/
         /Source/
                /build.sh
                /Common_Scripts/
                               /setup_env.sh
                               /slurm.template.x
                               /check_hello_world.sh
         /hello_world_n0001/Scripts/
                                   /rgt_test_input.ini
                                   /slurm.template.x -> ../../Source/Common_Scripts/slurm.template.x
                                   /check.sh -> ../../Source/Common_Scripts/check_hello_world.sh
                                   /report.sh -> ../../Source/Common_Scripts/check_hello_world.sh
         /hello_world_n0002/Scripts/
                                   /rgt_test_input.ini
                                   /slurm.template.x -> ../../Source/Common_Scripts/slurm.template.x
                                   /check.sh -> ../../Source/Common_Scripts/check_hello_world.sh
                                   /report.sh -> ../../Source/Common_Scripts/check_hello_world.sh

First, we create the top-level directory structure:

# Create the application's directory
mkdir mpi-tests
cd mpi-tests/
# Create the Source directory
mkdir ./Source/
# Create directories for two tests -- hello_world_n0001 and hello_world_n0002
mkdir -p ./hello_world_n0001/Scripts ./hello_world_n0002/Scripts

Both of these tests will use the same source code (this is very common for many tests), so we can go ahead and create that:

# from mpi-tests root:
cd Source
# create a directory to hold the source files
mkdir test_src
echo '#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  printf("Hello, World from rank %d of %d!\n",rank,nranks);
  MPI_Finalize();
}' > test_src/hello_world.c

The environment and build scripts will also be the same for both tests, so we can create a build script and a script to set up the environment:

# from mpi-tests root:
cd Source
# create a directory to hold shared scripts -- "Common_Scripts" is a good name for it, but not required
mkdir Common_Scripts
# Create a basic environment file:
echo '#!/bin/bash
# As an example, we do a ``module reset`` here
module reset
# The OTH is loaded by a module, so we need to re-add this to our environment
module use $OLCF_HARNESS_DIR/modulefiles
module load olcf_harness
# Now, we load a basic gcc and openmpi
module load gcc
module load openmpi
' > Common_Scripts/setup_env.sh
# Now, create a build script in the top-level of the Source directory:
echo '#!/bin/bash
# Setup the environment:
source ./Common_Scripts/setup_env.sh
# Compile the code into a binary:
cd test_src/
mpicc -O1 -g -Wall -o hello_world hello_world.c
' > ./build.sh

Let’s give some thought to how we want to construct these tests. We’ll start by working on the rgt_test_input.ini for the single-node Hello, World! test. Below is a file that can be used for the rgt_test_input.ini, with discussion infused as comments.

[Replacements]
job_name = hello_world_n0001
walltime = 5
nodes = 1
# Since nodes is defined, defining the number of MPI ranks per node (processes per node) might be useful, too
ppn = 2
# %(<variable>)s uses the value held by that variable
batch_filename = run_%(job_name)s.sh
# executable is in ${BUILD_DIR}/test_src/hello_world
executable_path = test_src/hello_world
# build.sh is in Source/build.sh directory
build_cmd = ./build.sh
# check.sh is in ${SCRIPTS_DIR}/check.sh
# I think that providing the total number of expected ranks to the check & report script might be useful in validating
# This can always be removed later
check_cmd = ./check.sh $((%(nodes)s*%(ppn)s))
# report.sh is in ${SCRIPTS_DIR}/check.sh
report_cmd = ./report.sh $((%(nodes)s*%(ppn)s))
# Don't allow resubmissions currently
resubmit = 0

[EnvVars]
# We don't currently have anything here

Notice that the only lines specific to this test are the job_name and nodes. This should help us re-use as much code as possible. Duplicate code will make tests difficult to maintain in the long run.

Next up is the Slurm template. Moving from 1 to 2 nodes shouldn’t change much about the job template, so let’s try to develop a generic Slurm job template for Hello, World! programs:

#!/bin/bash

#SBATCH -J __job_name__
#SBATCH -N __nodes__
#SBATCH -t __walltime__

# Define environment variables needed
export EXECUTABLE="__executable_path__"
export SCRIPTS_DIR="__scripts_dir__"
export WORK_DIR="__working_dir__"
export RESULTS_DIR="__results_dir__"
export HARNESS_ID="__harness_id__"
export BUILD_DIR="__build_dir__"

echo "Printing test directory environment variables:"
env | fgrep RGT_APP_SOURCE_
env | fgrep RGT_TEST_
echo

# Placing the environment setup script in a shared location reduces code duplication
# and ensures you have the same environment in building & running
source $BUILD_DIR/Common_Scripts/setup_env.sh

# Ensure we are in the starting directory
cd $SCRIPTS_DIR

# Make the working scratch space directory.
if [ ! -e $WORK_DIR ]
then
    mkdir -p $WORK_DIR
fi

# Change directory to the working directory.
cd $WORK_DIR

# These are very useful for debugging
env &> job.environ
scontrol show hostnames > job.nodes

# Run the executable.
log_binary_execution_time.py --scriptsdir $SCRIPTS_DIR --uniqueid $HARNESS_ID --mode start

# We use ${SLURM_NNODES} over __nodes__ for several reasons:
#   1. for testing purposes, it's good to ensure that SLURM_NNODES is correct, since users will use that
#   2. if you inadvertently set $RGT_SUBMIT_ARGS, using SLURM_NNODES will adapt to the size of the job
set -x
srun -N ${SLURM_NNODES} -n $((${SLURM_NNODES}*__ppn__)) --ntasks-per-node=__ppn__ $BUILD_DIR/$EXECUTABLE &> stdout.txt
set +x

log_binary_execution_time.py --scriptsdir $SCRIPTS_DIR --uniqueid $HARNESS_ID --mode final

# Ensure we return to the starting directory.
cd $SCRIPTS_DIR

# Copy the output and results back to the $RESULTS_DIR
# Depending on the size of files in $WORK_DIR, you may want to change this
cp -rf $WORK_DIR/* $RESULTS_DIR
cp $BUILD_DIR/output_build*.txt $RESULTS_DIR

# Check the final results -- this will call your command specified by `check_cmd`
check_executable_driver.py -p $RESULTS_DIR -i $HARNESS_ID

# Resubmit if needed:
# If you always want tests to resubmit if ``.kill_test`` is not present,
# then remove the conditional around calling ``test_harness_driver.py``.
case __resubmit__ in
    0)
       echo "No resubmit";;
    1)
       test_harness_driver.py -r __max_submissions__ ;;
esac

This job script will leave the output from the application in a file named stdout.txt. Let’s write a check script that can validate the output from this file. Recall that we provided the check script with the total number of tasks to expect as a command-line argument.

#!/bin/bash

expected_ranks=$1
nranks=$(grep "Hello, World from rank" ${RESULTS_DIR}/stdout.txt | wc -l)
if [ ! "${nranks}" == "${expected_ranks}" ]; then
    echo "Found ${nranks}, expected ${expected_ranks}"
    exit 1
fi
echo "Success! Found ${nranks}."
exit 0

This check script is generic and should be able to be re-used in multiple tests, so let’s put it in Source/Common_Scripts/check_hello_world.sh.

The OTH also wants a report script, but there’s not much to report here. You can either create a script that immediately exits, or just link to your check script. Here, we will just link to the check script.

The Slurm template and check and report scripts are required in the Scripts directory, so we use symbolic links to achieve this:

# from mpi-tests
cd hello_world_n0001/Scripts
ln -s ../../Source/Common_Scripts/slurm.template.x .
ln -s ../../Source/Common_Scripts/check_hello_world.sh ./check.sh
ln -s ../../Source/Common_Scripts/check_hello_world.sh ./report.sh

To expand to a 2-node Hello, World! test, we can just copy the Scripts directory from the single-node test, then modify the rgt_test_input.ini to specify 2 nodes instead of 1. Everything else is generalized, so no modification is needed.