Extensions
Optional extensions have been developed for use with the OLCF Test Harness (OTH). Extensions are enabled through environment variables and metadata files placed in the Run_Archive directory of a test launch.
InfluxDB Event Logging
The OTH records event information in files on the file system, including the timestamp, the paths of the work, scratch, and archive directories, the test name, and so on. This information can additionally be logged to an InfluxDB time-series database. To enable this extension, add the following variables to your environment (i.e., export them in bash):
|
Note that this is a minimal configuration. To log to multiple InfluxDB instances, supply a semicolon-separated list of URIs and tokens. Please see Extension-specific Variables for the full list of InfluxDB-related environment variables.
Events are logged using the following data as tags in the InfluxDB measurement:
|
Tags are a set of identifiers used by InfluxDB to index records. If you write two records to the same measurement with the same set of tags and the same timestamp, the most recent write takes precedence. Each set of tags is associated with a set of fields. Event fields logged to InfluxDB are:
|
These fields are largely self-explanatory, but output_txt warrants additional detail. output_txt is constructed for build_end, submit_end, binary_execute_end, and check_end events. When the harness encounters one of those events, it searches for a file that follows a specific naming convention. For build_end, the OTH reads the last 64 kB from output_build.txt, a file automatically created by the harness to store the output of the build process. For submit_end, the OTH reads the submit.err file, which is also automatically created by the harness during job submission. For binary_execute_end, the OTH looks for a file with the extension .o${job_id} and reads the last 64 kB from that file. This file is not automatically created by the harness. For check_end, the OTH looks for a file named output_check.txt, which is automatically created by the harness to store output from the check script.
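As an illustration of the "last 64 kB" behavior described above, the following is a minimal sketch of reading the tail of an output file. It is not the harness's own code: the function name read_tail is hypothetical, and treating 64 kB as 64 * 1024 bytes and decoding as UTF-8 are assumptions.

```python
import os

# Assumption: "64 kB" is taken to mean 64 * 1024 bytes.
TAIL_BYTES = 64 * 1024


def read_tail(path, max_bytes=TAIL_BYTES):
    """Return at most the last max_bytes of a file, decoded as text.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)            # jump to the end to learn the size
        size = f.tell()
        f.seek(max(0, size - max_bytes))  # back up by at most max_bytes
        data = f.read()
    # Replace undecodable bytes, since the tail may start mid-character.
    return data.decode("utf-8", errors="replace")
```

A file shorter than the limit is returned whole, so the same call works for short and long build logs.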
Logging application metrics to InfluxDB
The OTH can log metrics from each test to InfluxDB, which makes it possible to visualize the performance of a test over time. This extension requires that InfluxDB event logging is enabled.
To enable this extension, create a file named metrics.txt in the Run_Archive directory of a test launch (i.e., /Path/to/Tests/$app/$test/Run_Archive/$test_id/metrics.txt). This file must exist by the end of the report_cmd execution. A common place to create metrics.txt is in the check script, check_cmd. Each line of this file must conform to one of the following formats:
# Comment lines begin with hashtags
metric_name_1=value_1
metric_name_2 = value_2
# It is not recommended to use spaces in metric names, but it is allowable
metric name 3 = value_3
metric_name_3\t=\tvalue_3
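To make the accepted line formats concrete, here is a minimal sketch of a parser for metrics.txt content. It is not the harness's own parser: the function name parse_metrics is hypothetical, and skipping blank lines, rejecting lines without an equals sign, and keeping values as strings are assumptions.

```python
def parse_metrics(text):
    """Parse metrics.txt content into a dict mapping name to value string.

    Hypothetical helper for illustration only.
    """
    metrics = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines are skipped like comments
        if "=" not in line:
            raise ValueError(f"not a name=value line: {raw!r}")
        # Split on the first '=' only; strip spaces or tabs around each side.
        name, value = line.split("=", 1)
        metrics[name.strip()] = value.strip()
    return metrics
```

Stripping whitespace on both sides of the equals sign is what lets the spaced, unspaced, and tab-separated forms above all yield the same kind of record.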
The OTH logs each metric to InfluxDB using the same tags that InfluxDB event logging uses.
When at least one metric is defined, the OTH also automatically calculates the time between the build_start and build_end events, and between the binary_execute_start and binary_execute_end events.
These durations are logged as build_time and execution_time, respectively.
If you’re interested only in build_time and execution_time, have your check script create a dummy metrics.txt file with a line like dummy=1.
Note that the correct computation of execution_time requires proper placement of the log_binary_execution_time.py calls in the job script,
since execution_time is the difference in time between the two log_binary_execution_time.py calls.
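A check script might create metrics.txt along the following lines. This is a sketch under stated assumptions: the function name write_metrics is hypothetical, and the path to the Run_Archive directory of the test launch is assumed to be known to the check script and is passed in explicitly.

```python
import os


def write_metrics(run_archive_dir, metrics):
    """Write a metrics.txt file into the given Run_Archive directory.

    Hypothetical helper for illustration only; run_archive_dir is assumed
    to be the Run_Archive directory of the current test launch.
    """
    path = os.path.join(run_archive_dir, "metrics.txt")
    with open(path, "w") as f:
        f.write("# Written by the check script\n")
        for name, value in metrics.items():
            f.write(f"{name} = {value}\n")
    return path
```

Calling write_metrics(run_archive_dir, {"dummy": 1}) would produce the placeholder file described above, for cases where only build_time and execution_time are of interest.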
Monitoring the health of individual nodes
In many-node systems, it can be very difficult to monitor the health of each node. To address this, the OTH supports node-centric monitoring. As with metrics logging, this extension requires that InfluxDB event logging is enabled, and it is triggered by the presence of a nodecheck.txt file in the Run_Archive directory of a test launch. By default, this extension also requires geospatial information about each node, which is discussed later in this section. Each line of nodecheck.txt must have the following format:
# Comment lines begin with hashtags
# Format: <nodename> <status> <message>
Node1 PASS Some optional message that can have any number of spaces in it to associate with Node1
Node2 FAIL optional message to associate with Node2
Node3 HW-FAIL optional messaging to associate with Node3
The second column has a defined set of possible values, which are reduced to 4 common strings for usability in the database and dashboards. Each status in nodecheck.txt must appear in the square brackets for one of the 4 common statuses. These values are:
FAILED : ['FAILED', 'FAIL', 'BAD']
SUCCESS : ['SUCCESS', 'OK', 'GOOD', 'PASS', 'PASSED']
HW-FAIL : ['INCORRECT', 'HW-FAIL']
PERF-FAIL : ['PERF', 'PERF-FAIL']
For example, to classify a successful test on a node, the line in nodecheck.txt may use the SUCCESS, OK, GOOD, PASS, or PASSED keyword. Keywords are not case-sensitive, so success also works.
These 4 values are intended to present a known set of statuses to the InfluxDB database and dashboards, for ease of visualization.
FAILED, SUCCESS, and PERF-FAIL are self-explanatory.
HW-FAIL is intended to be a status associated with a hardware failure (e.g., bus errors, power fault, network failure).
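The mapping from accepted keywords to the 4 common statuses can be sketched as follows. This is an illustration, not the harness's implementation: the names STATUS_ALIASES and parse_nodecheck_line are hypothetical, and raising ValueError for malformed lines or unknown statuses is an assumption.

```python
# Accepted keywords for each common status, as listed above.
STATUS_ALIASES = {
    "FAILED": ["FAILED", "FAIL", "BAD"],
    "SUCCESS": ["SUCCESS", "OK", "GOOD", "PASS", "PASSED"],
    "HW-FAIL": ["INCORRECT", "HW-FAIL"],
    "PERF-FAIL": ["PERF", "PERF-FAIL"],
}

# Reverse lookup: keyword -> common status.
_COMMON = {
    alias: common
    for common, aliases in STATUS_ALIASES.items()
    for alias in aliases
}


def parse_nodecheck_line(line):
    """Return (node, common_status, message), or None for comments/blanks.

    Hypothetical helper for illustration only.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Split into at most 3 parts so the message keeps its internal spaces.
    parts = line.split(None, 2)
    if len(parts) < 2:
        raise ValueError(f"expected '<nodename> <status> [message]': {line!r}")
    node, status = parts[0], parts[1]
    message = parts[2] if len(parts) > 2 else ""
    key = status.upper()  # keywords are not case-sensitive
    if key not in _COMMON:
        raise ValueError(f"unknown status {status!r} for node {node!r}")
    return node, _COMMON[key], message
```

For instance, a line reading "Node1 PASS all checks ok" would be reduced to the SUCCESS status for Node1, with the remainder of the line kept as the message.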
This extension logs results to the node_health measurement (table) of InfluxDB using machine, node, and test as tags.
By default, this extension also requires geospatial information about each node (e.g., cabinet number, board number, row number).
This information is used as an InfluxDB tag to correlate failures by location.
To bypass this feature (common for single-cabinet systems), set the RGT_NODE_LOCATION_FILE environment variable to none (not case-sensitive).
To utilize this feature, provide the absolute path to a JSON file containing the desired information by using the RGT_NODE_LOCATION_FILE environment variable.
An example portion from this file may look like:
{
    "node001": {
        "cabinet": "c0",
        "switch": "s0",
        "slot": "s0"
    },
    "node002": {
        "cabinet": "c0",
        "switch": "s0",
        "slot": "s1"
    },
    ...
}
Then, when querying InfluxDB in this example, you may use cabinet, switch, and slot to filter records.
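The handling of RGT_NODE_LOCATION_FILE described above can be sketched as follows. This is not the harness's own code: the function names load_node_locations and location_tags are hypothetical, and returning an empty tag set for an unknown node is an assumption.

```python
import json


def load_node_locations(setting):
    """Load the node location file named by RGT_NODE_LOCATION_FILE.

    `setting` is the value of the environment variable. Returns None when
    the feature is bypassed with "none" (not case-sensitive), otherwise the
    parsed JSON object. Hypothetical helper for illustration only.
    """
    if setting.strip().lower() == "none":
        return None
    with open(setting) as f:
        return json.load(f)


def location_tags(locations, node):
    """Return the location entries for a node to use as extra tags.

    Assumption: a bypassed feature or an unknown node yields no tags.
    """
    if locations is None:
        return {}
    return dict(locations.get(node, {}))
```

In practice the setting would come from os.environ["RGT_NODE_LOCATION_FILE"]. With the example file above, location_tags(locations, "node002") would contain the cabinet, switch, and slot entries for that node.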
Check Alias
Check aliasing allows codes to attach an alphanumeric explanation to the check script exit code.
For example, the OTH uses a check_end event value of 1 to indicate a failure.
Failures come in many shapes and sizes, so a check alias can distinguish them with values such as MPI_ERR, BUS_ERR, TIMEOUT, or INPUT_ERR.
This adds a descriptive label for categorizing failures.
To enable this extension, create a file named check_alias.txt in the Run_Archive directory of a test launch (i.e., /Path/to/Tests/$app/$test/Run_Archive/$test_id/check_alias.txt). check_alias is set to the content of the first line of this file. The check_alias field is set in each event file, so InfluxDB is not required for this extension. If InfluxDB is enabled, check_alias is also sent to InfluxDB alongside the standard event metadata.
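A check script could write the alias, and a reader could recover it from the first line, roughly as follows. This is a sketch: the function names write_check_alias and read_check_alias are hypothetical, the Run_Archive path is assumed to be passed in, and stripping surrounding whitespace from the first line is an assumption about how the value is consumed.

```python
import os


def write_check_alias(run_archive_dir, alias):
    """Write check_alias.txt with the alias on its first line.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(run_archive_dir, "check_alias.txt")
    with open(path, "w") as f:
        f.write(alias.strip() + "\n")
    return path


def read_check_alias(run_archive_dir):
    """Return the first line of check_alias.txt, without the newline."""
    path = os.path.join(run_archive_dir, "check_alias.txt")
    with open(path) as f:
        return f.readline().strip()
```

Only the first line is read, so any further lines in the file would be ignored by this sketch.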