OTH Utilities
The OLCF Test Harness (OTH) packages several scripts that may be useful in the course of testing. For example, many of these scripts handle keeping the remote database up-to-date with current runs, in the event of job timeouts, connectivity issues, or data loss. These scripts are documented below.
rgt_archive_utility.py
The rgt_archive_utility.py script allows you to "archive" a test. Here, "archiving" means selecting when to keep or discard a test's build and work directories, then copying the test into a single location on the file system, with no symlinks needed between the Run_Archive and scratch areas. This is useful, for example, for taking all tests older than 6 months and consolidating each one into a single directory, ready to be tarred and archived (if desired).
The --help message for the rgt_archive_utility.py script is provided below.
usage: rgt_archive_tests.py [-h] --path-to-tests PATH_TO_TESTS
--path-to-archive PATH_TO_ARCHIVE
[--starttime STARTTIME] [--endtime ENDTIME]
[--keep-workdir {ON_FAIL,ALWAYS,NEVER}]
[--keep-builddir {ON_FAIL,ALWAYS,NEVER}]
[--delete-scratch-dir] [--delete-run-dir]
[--users USERS [USERS ...]]
[--machines MACHINES [MACHINES ...]]
[--apps APPS [APPS ...]]
[--tests TESTS [TESTS ...]]
[--runtags RUNTAGS [RUNTAGS ...]] [--no-tqdm]
[--print-summary] [--force] [--compress]
[--limit LIMIT] [--stop-after STOP_AFTER]
[--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--logfile LOGFILE]
Locates and archives tests, condensing test output into a simplified directory
structure.
optional arguments:
-h, --help show this help message and exit
--path-to-tests PATH_TO_TESTS
Path to the application repository directories (ie, Path_to_tests).
--path-to-archive PATH_TO_ARCHIVE
Path to the archive location.
--starttime STARTTIME
Absolute start time. Format: YYYY-MM-DDTHH:MM.
--endtime ENDTIME Absolute end time. Format: YYYY-MM-DDTHH:MM.
--keep-workdir {ON_FAIL,ALWAYS,NEVER}
Customize when to copy the work directory to archive
(default: ON_FAIL).
--keep-builddir {ON_FAIL,ALWAYS,NEVER}
Customize when to copy the build directory to archive
(default: ON_FAIL).
--delete-scratch-dir DANGEROUS. If set, deletes the build and work
directories after archiving.
--delete-run-dir DANGEROUS. If set, deletes the Run_Archive and Status
directories after archiving.
--users USERS [USERS ...]
Specifies one or more UNIX users to archive jobs for
(default: all).
--machines MACHINES [MACHINES ...]
Specifies one or more machines to archive jobs for
(default: all).
--apps APPS [APPS ...]
Specifies one or more apps to archive jobs for
(default: all).
--tests TESTS [TESTS ...]
Specifies one or more tests to archive jobs for
(default: all).
--runtags RUNTAGS [RUNTAGS ...]
Specifies one or more runtags to archive jobs for
(default: all). This filter supports regex.
--no-tqdm If set, disables using TQDM progress bars.
--print-summary If set, prints a summary of how many test instances
are archived for each app-test.
--force DANGEROUS. If set, will remove the archive of an
existing test if found, then re-archive.
--compress If set, tar's and gzip's the resulting archive directory.
--limit LIMIT Maximum number of tests to archive.
--stop-after STOP_AFTER
Specify a number of hours after which to cleanly pause
archiving and exit.
--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}
Specify verbosity
--logfile LOGFILE Name/location of the log file (default: archive.log).
Set to /dev/null to disable log file.
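As an illustration, a hypothetical invocation might look like the following. The paths, cutoff date, and filter values are placeholders; adjust them to your site layout. This sketch archives all tests that finished before the given end time, keeps work directories only for failed tests, discards build directories, and compresses each resulting archive:

```shell
# Hypothetical paths and date; substitute your own values.
python3 rgt_archive_utility.py \
    --path-to-tests /path/to/olcf_harness_tests \
    --path-to-archive /path/to/test_archive \
    --endtime 2023-01-01T00:00 \
    --keep-workdir ON_FAIL \
    --keep-builddir NEVER \
    --compress \
    --print-summary
```

Note that --delete-scratch-dir and --delete-run-dir are deliberately left off here; since they remove the original directories after archiving, it is safer to verify the archive first.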
update_databases.py
The update_databases.py script retrieves all incomplete tests from the remote database (i.e., an InfluxDB instance) and tries to determine whether each test completed but never logged its completion message. The script supports the Slurm job scheduler: it looks up the job ID of the given test to check whether the job completed. If the job failed, the script logs a completion message to the database with information about how long the job ran, whether it timed out, whether it hit a node failure, and so on.
The --help message for the update_databases.py script is provided below.
This script requires the same environment variables as the core harness requires to enable the database backend, as described in _influxdb_event_logging.
usage: update_databases.py [-h] [--time TIME] [--starttime STARTTIME]
[--endtime ENDTIME] [--user USER] --machine MACHINE
[--app APP] [--test TEST] [--runtag RUNTAG]
[--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--dry-run] [--build-timeout BUILD_TIMEOUT]
Updates harness runs in database backends using event and Slurm data
optional arguments:
-h, --help show this help message and exit
--time TIME, -t TIME How far back to look for jobs relative to now (ex: 1h, 2d).
--starttime STARTTIME
Absolute start time. Format: YYYY-MM-DDTHH:MM:SSZ.
Overrides --time
--endtime ENDTIME Absolute end time. Format: YYYY-MM-DDTHH:MM:SSZ.
Should only be used with --starttime.
--user USER, -u USER Specifies the UNIX user to update jobs for.
--machine MACHINE, -m MACHINE
Specifies the machine to look for jobs for. Setting a
wrong machine may lead to SLURM job IDs not being found.
--app APP Specifies the app to update jobs for.
--test TEST Specifies the test to update jobs for.
--runtag RUNTAG Specifies the runtag to update jobs for.
--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}
Specify verbosity
--dry-run When set, prints messages to send to databases, but does not send them.
--build-timeout BUILD_TIMEOUT
Number of hours after a build_start event before
logging a failed build_end event.
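A hypothetical dry run might look like the following. The machine name here is a placeholder for your target system; the InfluxDB environment variables described in _influxdb_event_logging are assumed to already be set. With --dry-run, the script prints the completion messages it would send without writing to the database:

```shell
# Assumes the harness's InfluxDB environment variables are exported.
# "frontier" is an example machine name; use your own.
python3 update_databases.py \
    --machine frontier \
    --time 2d \
    --dry-run
```

Once the dry-run output looks correct, re-running without --dry-run sends the updates to the database.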
add_comment_to_databases.py
The add_comment_to_databases.py script adds a comment to a specific test instance in the remote database (i.e., an InfluxDB instance).
The --help message for the add_comment_to_databases.py script is provided below.
This script requires the same environment variables as the core harness requires to enable the database backend, as described in _influxdb_event_logging.
usage: add_comment_to_databases.py [-h] [--time TIME] --testid TESTID
[--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--dry-run] --message MESSAGE
[--event {logging_start,build_start,build_end,submit_start,submit_end,job_queued,binary_execute_start,binary_execute_end,check_start,check_end}]
Add a comment to a specific test ID in the events database.
optional arguments:
-h, --help show this help message and exit
--time TIME How far back to look for jobs relative to now (ex: 1h, 2d).
--testid TESTID Specifies the harness test id to update jobs for.
--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}
Specify verbosity
--dry-run When set, prints messages to send to databases, but does not send them.
--message MESSAGE Comment to add to the record.
--event {logging_start,build_start,build_end,submit_start,submit_end,job_queued,binary_execute_start,binary_execute_end,check_start,check_end}
Specifies the harness event to add the comment to.
Defaults to most recent event.
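For example, a hypothetical invocation annotating the check_end event of a specific test might look like this. The test ID is a placeholder (use the harness test ID of the instance you want to annotate), and the database environment variables are assumed to be set:

```shell
# <harness-test-id> is a placeholder for a real harness test ID.
python3 add_comment_to_databases.py \
    --testid <harness-test-id> \
    --event check_end \
    --message "Failure traced to a known node issue" \
    --dry-run
```

Dropping --dry-run sends the comment to the database; omitting --event attaches the comment to the most recent event instead.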
report_to_databases.py
The report_to_databases.py script lets you use a remote database to store custom, non-harness metrics.
The --help message for the report_to_databases.py script is provided below.
This script requires the same environment variables as the core harness requires to enable the database backend, as described in _influxdb_event_logging.
usage: report_to_databases.py [-h] [--time TIME] --keys KEYS --values VALUES
[--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--table_name TABLE_NAME] [--dry-run]
Post a custom metric to Databases
optional arguments:
-h, --help show this help message and exit
--time TIME, -t TIME Timestamp to post record as. Format: YYYY-MM-DDTHH:MM:SS[.MS][Z]
--keys KEYS, -k KEYS A set of comma-separated keys to identify your metric by. Ex: machine=frontier
--values VALUES, -v VALUES
A set of comma-separated values to post. Ex:
value_a=1,value_b=2. These may or may not be quoted
--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}
Specify verbosity
--table_name TABLE_NAME
Specifies the name of the table (measurement) to post to.
--dry-run When set, print the message to the databases, but do not send.
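As a sketch, a hypothetical invocation posting a custom metric might look like the following. The key and value names (app, bandwidth_gbs) and the table name are illustrative placeholders; machine=frontier mirrors the example in the --help output above:

```shell
# Keys identify the metric (tags); values are the measured data.
# app=lammps, bandwidth_gbs, and custom_metrics are hypothetical names.
python3 report_to_databases.py \
    --keys machine=frontier,app=lammps \
    --values bandwidth_gbs=1.2 \
    --table_name custom_metrics \
    --dry-run
```

As with the other scripts, --dry-run prints the record that would be posted so you can confirm the keys and values before writing to the database.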