Extensions

Optional extensions have been developed for use with the OLCF Test Harness (OTH). Extensions are enabled through environment variables and metadata files placed in the Run_Archive directory of a test launch.

InfluxDB Event Logging

The OTH records event information in files on the file system, including the timestamp, the file system paths of the work, scratch, and archive directories, the test name, and so on. This information can additionally be logged to an InfluxDB time-series database. To enable this extension, add the following variables to your environment (e.g., with export in bash):

  • RGT_INFLUX_URI : the URL (with appropriate endpoint) for your InfluxDB instance (e.g., https://my-influxdb.domain.com/api/v2/write?org=my-org&bucket=my-bucket&precision=ns)

  • RGT_INFLUX_TOKEN : the token for your InfluxDB instance

Note that the RGT_INFLUX_URI contains information such as the organization name, bucket name, and desired precision. Currently the OTH only supports nanosecond precision.
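As an illustration, the following minimal Python sketch assembles a URI of the shape shown above and exports both variables for any child process (such as a harness launch started from a wrapper script). The helper name build_influx_uri, the host, the organization, the bucket, and the token value are all placeholders, not part of the OTH.

```python
import os
from urllib.parse import urlencode


def build_influx_uri(base_url, org, bucket):
    """Return an InfluxDB write URL that requests nanosecond precision."""
    # The OTH currently supports only nanosecond precision, hence "ns".
    query = urlencode({"org": org, "bucket": bucket, "precision": "ns"})
    return f"{base_url.rstrip('/')}/api/v2/write?{query}"


# Placeholder values copied from the example URL; replace with your own.
uri = build_influx_uri("https://my-influxdb.domain.com", "my-org", "my-bucket")

# Variables set here are visible to this process and its children only.
os.environ["RGT_INFLUX_URI"] = uri
os.environ["RGT_INFLUX_TOKEN"] = "my-secret-token"  # placeholder token

print(uri)
```

In an interactive shell, the equivalent is to export the same two variables before launching the harness.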

Events are logged using the following data as tags in the InfluxDB measurement:

  • time

  • runtag : system log tag for this harness launch, equal to RGT_SYSTEM_LOG_TAG

  • app : name of the application

  • test : name of the test

  • test_id : test identifier of this test instance

  • machine : machine name

Tags are identifiers that InfluxDB uses to index records. If you write two records to InfluxDB with the same set of tags and the same time, only the most recent is kept. Each set of tags is associated with a set of fields. Event fields logged to InfluxDB are:

  • build_directory : path to the build directory

  • run_archive : path to the Run_Archive directory

  • workdir : path to the work directory for this test

  • rgt_path_to_sspace : path to the scratch space derived from the RGT_PATH_TO_SSPACE environment variable

  • event_filename : name of the event file that this information is mirrored in

  • event_name : name of the event (e.g., ‘build_start’)

  • event_subtype : specifies whether the event is a start or an end event

  • event_time : time of the event

  • event_type : type of the event (e.g., ‘build’, ‘binary_execute’)

  • event_value : status code of the event. ‘0’ indicates success

  • hostname : hostname of the system that the event was run on

  • job_account_id : account ID used to submit the job to the scheduler

  • job_id : JobID referenced by the scheduler

  • path_to_rgt_package : path to the harness source code used

  • rgt_system_log_tag : log tag defined for this run (mirrors runtag)

  • test_instance : a string set to “$app,$test,$test_id”

  • user : username of the user that launched the harness

  • comment : enables the user to log comments to specific events

  • reason : used by some harness utility scripts to log explanations to InfluxDB, such as node failure messages

  • output_txt : output of specific events mined from files (last 64 kB only)

  • check_alias : optional alphanumeric supplement to event_value (see Check Alias)

These fields are largely self-explanatory, but output_txt warrants more detail. It is constructed for build_end, submit_end, binary_execute_end, and check_end events. When the harness encounters one of these events, it searches for a file that follows a specific naming convention. For build_end, the OTH reads the last 64 kB of output_build.txt, a file the harness creates automatically to store the output of the build process. For submit_end, the OTH reads submit.err, which the harness also creates automatically during job submission. For binary_execute_end, the OTH looks for a file with the extension .o${job_id} and reads the last 64 kB of it; the harness does not create this file automatically. For check_end, the OTH looks for output_check.txt, which the harness creates automatically to store the output of the check script.
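To make the "last 64 kB" behavior concrete, here is a minimal sketch of reading the tail of an output file. It is not the harness's own code: the function name read_tail is hypothetical, 64 kB is assumed to mean 64 * 1024 bytes, and UTF-8 decoding with replacement is an assumption about how partial characters could be handled.

```python
import os

MAX_OUTPUT_BYTES = 64 * 1024  # assumption: "64 kB" means 65536 bytes


def read_tail(path, max_bytes=MAX_OUTPUT_BYTES):
    """Return at most the last max_bytes of the file at path, as text."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Start max_bytes before the end, or at the beginning for small files.
        f.seek(max(0, size - max_bytes))
        data = f.read()
    # The cut may land inside a multi-byte character, so decode leniently.
    return data.decode("utf-8", errors="replace")
```

A file smaller than the limit is returned whole, so short build logs are not affected by the truncation.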

Note

If compute nodes do not have access to the internet or if InfluxDB was not enabled for a set of runs, runtests.py provides --mode influx_log, which finds all event files and logs them to InfluxDB after the run is completed. This mode also applies to metric and node health logging.

This mode is not selective: it processes every test instance that has not already been sent to InfluxDB and has not explicitly disabled InfluxDB logging. If you do not want a test instance to be logged, set RGT_INFLUX_DISABLED=1 at run-time, or create a file named $RUNARCHIVE_DIR/.influx_disabled.
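The marker-file option can be scripted. The sketch below creates the .influx_disabled file in a given Run_Archive directory; the function name disable_influx_logging is hypothetical, and the caller is assumed to know the directory path already.

```python
from pathlib import Path


def disable_influx_logging(run_archive_dir):
    """Create the .influx_disabled marker file and return its path."""
    marker = Path(run_archive_dir) / ".influx_disabled"
    # touch() creates an empty file and leaves an existing marker untouched.
    marker.touch(exist_ok=True)
    return marker
```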

Logging application metrics to InfluxDB

The OTH can log metrics from each test to InfluxDB, which is useful for visualizing the performance of a test over time. This extension requires that InfluxDB event logging is enabled.

To enable this extension, create a file named metrics.txt in the Run_Archive directory of a test launch (e.g., /Path/to/Tests/$app/$test/Run_Archive/$test_id/metrics.txt). This file must exist by the end of the report_cmd execution. A common place to create metrics.txt is in the check script, check_cmd. Each line of this file must conform to one of the following formats:

# Comment lines begin with a hash character
metric_name_1=value_1
metric_name_2 = value_2
# It is not recommended to use spaces in metric names, but it is allowable
metric name 3 = value_3
metric_name_3\t=\tvalue_3

The OTH logs metrics to the InfluxDB database using the same tags that InfluxDB event logging uses. When at least one metric is defined, the OTH also automatically calculates the time between the build_start and build_end events, and between the binary_execute_start and binary_execute_end events. These durations are logged as build_time and execution_time, respectively. If you are interested only in build_time and execution_time, have your check script create a dummy metrics.txt file with a line like dummy=1. Note that execution_time is the difference in time between the two log_binary_execution_time.py calls, so computing it correctly requires that those calls are placed properly in the job script.
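A check script written in Python could produce metrics.txt with a helper along these lines. This is a sketch under stated assumptions: the function name write_metrics is hypothetical, and the validation of metric names is a precaution added here, not a rule taken from the OTH.

```python
from pathlib import Path


def write_metrics(run_archive_dir, metrics):
    """Write a metrics.txt file using the 'name = value' line format."""
    lines = ["# Written by the check script"]
    for name, value in metrics.items():
        # Assumed precaution: avoid names that would make a line ambiguous.
        if not name or "=" in name or "\n" in name or name.startswith("#"):
            raise ValueError(f"unsuitable metric name: {name!r}")
        lines.append(f"{name} = {value}")
    path = Path(run_archive_dir) / "metrics.txt"
    path.write_text("\n".join(lines) + "\n")
    return path
```

For example, write_metrics(run_archive_dir, {"dummy": 1}) writes the single metric line dummy = 1, which is enough to trigger the build_time and execution_time calculation described above.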

Monitoring the health of individual nodes

In systems with many nodes, it can be very difficult to monitor the health of specific nodes. To address this, the OTH supports node-centric monitoring. As with metrics logging, this extension requires that InfluxDB event logging is enabled, and it is triggered by the presence of a nodecheck.txt file in the Run_Archive directory of a test launch. By default, this extension also requires location information for each node, which is discussed later in this section. Each line of nodecheck.txt must have the following format:

# Comment lines begin with a hash character
# Format: <nodename> <status> <message>
Node1 PASS Some optional message that can have any number of spaces in it to associate with Node1
Node2 FAIL optional message to associate with Node2
Node3 HW-FAIL optional messaging to associate with Node3

The second column has a defined set of possible values, which are reduced to four common statuses for usability in the database and dashboards. Each status in nodecheck.txt must appear in the square brackets of one of the four common statuses. These values are:

  1. FAILED : [‘FAILED’, ‘FAIL’, ‘BAD’]

  2. SUCCESS : [‘SUCCESS’, ‘OK’, ‘GOOD’, ‘PASS’, ‘PASSED’]

  3. HW-FAIL : [‘INCORRECT’, ‘HW-FAIL’]

  4. PERF-FAIL : [‘PERF’, ‘PERF-FAIL’]

For example, to classify a successful test on a node, the line in nodecheck.txt may use the SUCCESS, OK, GOOD, PASS, or PASSED keyword. Keywords are not case-sensitive, so success also works.

These four values are intended to present a known set of statuses to the InfluxDB database and dashboards, for ease of visualization. FAILED, SUCCESS, and PERF-FAIL are self-explanatory. HW-FAIL is intended for hardware failures (e.g., bus errors, power faults, network failures).
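The mapping from accepted keywords to the common statuses can be sketched as a lookup table. The code below is an illustration of the rules described above, not the harness's implementation; the names STATUS_ALIASES and parse_nodecheck_line are hypothetical, and raising ValueError for an unknown status is an assumption about error handling.

```python
# Common status -> accepted keywords, copied from the list above.
STATUS_ALIASES = {
    "FAILED": ["FAILED", "FAIL", "BAD"],
    "SUCCESS": ["SUCCESS", "OK", "GOOD", "PASS", "PASSED"],
    "HW-FAIL": ["INCORRECT", "HW-FAIL"],
    "PERF-FAIL": ["PERF", "PERF-FAIL"],
}

# Inverted table: keyword -> common status.
_LOOKUP = {
    alias: common
    for common, aliases in STATUS_ALIASES.items()
    for alias in aliases
}


def parse_nodecheck_line(line):
    """Return (node, common_status, message), or None for comments/blanks."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Split off the first two columns; the rest is the optional message.
    parts = stripped.split(None, 2)
    if len(parts) < 2:
        raise ValueError(f"expected '<nodename> <status> [message]': {line!r}")
    node, raw_status = parts[0], parts[1]
    message = parts[2] if len(parts) == 3 else ""
    # Keywords are not case-sensitive, so normalize before the lookup.
    status = _LOOKUP.get(raw_status.upper())
    if status is None:
        raise ValueError(f"unknown status {raw_status!r} for node {node}")
    return node, status, message
```

Running a check script's output through a parser like this before writing nodecheck.txt catches misspelled statuses early.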

This extension logs results to the node_health measurement (table) of InfluxDB using machine, node, and test as tags. By default, it also requires location information for each node (e.g., cabinet number, board number, row number), which is used as an InfluxDB tag to correlate failures by location. To bypass this requirement (especially useful for small systems), set the RGT_IGNORE_NODE_LOCATION environment variable to 1. To use the feature, set the RGT_NODE_LOCATION_FILE environment variable to the absolute path of a JSON file containing the desired information. A portion of such a file may look like:

{
    "node001": {
        "cabinet": "c0",
        "switch": "s0",
        "slot": "s0"
    },
    "node002": {
        "cabinet": "c0",
        "switch": "s0",
        "slot": "s1"
    },
    ...
}

Then, when querying InfluxDB in this example, you may use cabinet, switch, and slot to filter records.
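Before a run, it can help to confirm that the location file covers every node you expect to report on. The following sketch loads a file of the shape shown above and returns the location entries for one node; the function name location_tags is hypothetical, and this is not the harness's own lookup code.

```python
import json


def location_tags(location_file, node):
    """Return the location entries (e.g. cabinet, switch, slot) for a node."""
    with open(location_file) as f:
        locations = json.load(f)
    if node not in locations:
        raise KeyError(f"no location entry for node {node!r}")
    # Copy so callers cannot modify the loaded mapping by accident.
    return dict(locations[node])
```

With the example file, location_tags(path, "node002") would return the cabinet, switch, and slot values listed for node002.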

Check Alias

Check aliasing allows codes to provide an alphanumeric explanation for the check script exit code. For example, the OTH uses a check_end event value of 1 to indicate a failure. Failures have many possible causes, so a check alias can distinguish them with values such as MPI_ERR, BUS_ERR, TIMEOUT, or INPUT_ERR. This adds a descriptive label for categorizing failures.

To enable this extension, create a file named check_alias.txt in the Run_Archive directory of a test launch (e.g., /Path/to/Tests/$app/$test/Run_Archive/$test_id/check_alias.txt). check_alias is set to the content of the first line of this file. The check_alias field is set in each event file, so InfluxDB is not required for this extension. If InfluxDB is enabled, check_alias is sent to InfluxDB alongside the standard event metadata.
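A Python check script could record its alias with a small helper such as the one below. This is a sketch: the function name write_check_alias is hypothetical, and rejecting empty or multi-line aliases is a precaution chosen here because only the first line of the file is used.

```python
from pathlib import Path


def write_check_alias(run_archive_dir, alias):
    """Write a single-line alias (e.g. 'TIMEOUT') to check_alias.txt."""
    alias = alias.strip()
    # Only the first line of the file is read, so keep the alias to one line.
    if not alias or "\n" in alias:
        raise ValueError("alias must be a single non-empty line")
    path = Path(run_archive_dir) / "check_alias.txt"
    path.write_text(alias + "\n")
    return path
```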