Vapor

System Overview

Vapor is an Azure-based cluster that provides experimental high-performance computing resources for the NCRC.

The cluster is configured as follows:

File Systems

The cluster offers two types of filesystems for storage and computation:

  1. NFS Home Directories: accessible at /ccs/home/<username>

  2. Lustre Filesystem: a high-performance Lustre filesystem with 1 TB of total available storage, mounted at /lustre/OLCFLustre

Storage Areas Overview

The Lustre filesystem is organized into the following storage areas for project-specific work:

Area          Path                                          Type    Permissions  On Compute Nodes
Member Work   /lustre/OLCFLustre/[projid]/scratch/[userid]  Lustre  700          Yes
Project Work  /lustre/OLCFLustre/[projid]/proj-shared       Lustre  770          Yes
World Work    /lustre/OLCFLustre/[projid]/world-shared      Lustre  775          Yes

Logging In to Vapor

To log in to Vapor, use your OLCF username and passcode. First ssh to hub.ccs.ornl.gov, then from there ssh to the Vapor login node:

$ ssh <username>@hub.ccs.ornl.gov
$ ssh <username>@login1.vapor.olcf.ornl.gov

Programming Environment

Vapor users are provided with many pre-installed software packages and scientific libraries. To facilitate this, environment management tools are used to handle the necessary changes to the shell environment.

Environment Modules (Lmod)

Environment modules are provided through Lmod, a Lua-based module system for dynamically altering shell environments. By managing changes to the shell’s environment variables (such as PATH, LD_LIBRARY_PATH, and PKG_CONFIG_PATH), Lmod allows you to alter the software available in your shell environment without the risk of creating package and version combinations that cannot coexist in a single environment.

General Usage

The interface to Lmod is provided by the module command:

Command                           Description
module -t list                    Shows a terse list of the currently loaded modules
module avail                      Shows a table of the currently available modules
module help <modulename>          Shows help information about <modulename>
module show <modulename>          Shows the environment changes made by the <modulename> modulefile
module spider <string>            Searches all possible modules according to <string>
module load <modulename> [...]    Loads the given <modulename>(s) into the current environment
module use <path>                 Adds <path> to the modulefile search cache and MODULEPATH
module unuse <path>               Removes <path> from the modulefile search cache and MODULEPATH
module purge                      Unloads all modules
module reset                      Resets loaded modules to system defaults
module update                     Reloads all currently loaded modules
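A typical sequence might look like the following (the gcc module name is only an example; use module avail to see what is actually installed):

$ module -t list
$ module load gcc
$ module -t list
$ module unload gcc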

Searching for Modules

Modules with dependencies are only available when the underlying dependencies, such as compiler families, are loaded. Thus, module avail will only display modules that are compatible with the current state of the environment. To search the entire hierarchy across all possible dependencies, the spider sub-command can be used as summarized in the following table.

Command                                 Description
module spider                           Shows the entire possible graph of modules
module spider <modulename>              Searches for modules named <modulename> in the graph of possible modules
module spider <modulename>/<version>    Searches for a specific version of <modulename> in the graph of possible modules
module spider <string>                  Searches for modulefiles containing <string>
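For example, to see which versions of an MPI library are available and what must be loaded to access a particular one (the mpich name and 3.4.2 version here are only illustrative):

$ module spider mpich
$ module spider mpich/3.4.2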

Compilers

AMD, GCC, Intel, and LLVM compilers are provided through modules. The system version of GCC is also available in /usr/bin. The table below lists the details.

Vendor  Module  Language  Compiler
AMD     aocc    C         clang
                C++       clang++
                Fortran   flang
Intel   oneapi  C         icx
                C++       icpx
                Fortran   ifx
LLVM    llvm    C         clang
                C++       clang++
                Fortran   flang
GCC     gcc     C         gcc
                C++       g++
                Fortran   gfortran

MPI

Both MPICH and OpenMPI modules are available, but MPICH is recommended and is loaded by default. Use the mpicc, mpicxx, and mpifort compiler wrappers to compile C, C++, and Fortran MPI codes, respectively; each wrapper invokes the compiler from the currently loaded compiler module, as sketched below.
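As a minimal sketch, assuming a GCC compiler module and the default MPICH module are loaded (exact module names may differ; check module avail), compiling and launching an MPI code might look like this, where hello_mpi.c and hello_mpi.f90 are placeholder source files:

$ module load gcc mpich
$ mpicc -O2 hello_mpi.c -o hello_mpi
$ mpifort -O2 hello_mpi.f90 -o hello_mpi_f90
$ srun -n 4 ./hello_mpi

Note that srun must be run from within a Slurm allocation; see the Running Jobs section below.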

Running Jobs

Computational work on Vapor is performed by jobs. A job typically consists of a batch script, one or more executables, and the executables' input and output files.

In general, the process for running a job is to:

  1. Prepare executables and input files.

  2. Write a batch script.

  3. Submit the batch script to the batch scheduler.

  4. Optionally monitor the job before and during execution.

The following sections describe how to run programs on Vapor's compute nodes, including a brief overview of Slurm and how to create, submit, and manage jobs. Vapor uses SchedMD's Slurm Workload Manager as the batch scheduling system.

Login vs Compute Nodes

Vapor contains two node types: login and compute. When you connect to the system, you are placed on a login node. Login nodes are used for tasks such as code editing, compiling, etc. They are shared among all users of the system, so it is not appropriate to run long or computationally intensive tasks on login nodes. Users should also limit the number of simultaneous tasks on login nodes (e.g., concurrent tar commands, parallel make jobs, etc.).

Compute nodes are the appropriate place for long-running, computationally intensive tasks. When you start a batch job, your batch script (or interactive shell for batch-interactive jobs) runs on one of your allocated compute nodes.

Slurm Workload Manager

Slurm is the workload manager used to interact with the compute nodes on Vapor. In the following subsections, the most commonly used Slurm commands for submitting, running, and monitoring jobs will be covered, but users are encouraged to visit the official documentation and man pages for more information.

Batch Scheduler and Job Launcher

Slurm provides 3 ways of submitting and launching jobs on Vapor's compute nodes: batch scripts, interactive, and single-command. The Slurm commands associated with these methods are shown in the table below and examples of their use can be found in the related subsections.

Command   Description
sbatch    Used to submit a batch script to allocate a Slurm job allocation. The script
          contains options preceded with #SBATCH.
          (see Batch Scripts section below)
salloc    Used to allocate an interactive Slurm job allocation, where one or more job steps
          (i.e., srun commands) can then be launched on the allocated resources (i.e., nodes).
          (see Interactive Jobs section below)
srun      Used to run a parallel job (job step) on the resources allocated with sbatch or salloc.
          If necessary, srun will first create a resource allocation in which to run the parallel job(s).
          (see Single Command section below)

Batch Scripts

A batch script can be used to submit a job to run on the compute nodes at a later time. In this case, stdout and stderr will be written to a file(s) that can be opened after the job completes. Here is an example of a simple batch script:

#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J <job_name>
#SBATCH -o %x-%j.out
#SBATCH -t 00:05:00
#SBATCH -p <partition>
#SBATCH -N 2

srun -n4 --ntasks-per-node=2 ./a.out

The Slurm submission options are preceded by #SBATCH, making them appear as comments to a shell (since comments begin with #). Slurm will look for submission options from the first line through the first non-comment line. Options encountered after the first non-comment line will not be read by Slurm. In the example script, the lines are:

Line  Description
1     [Optional] shell interpreter line
2     OLCF project to charge
3     Job name
4     stdout file name (%x represents job name, %j represents job id)
5     Walltime requested (HH:MM:SS)
6     Batch queue
7     Number of compute nodes requested
8     Blank line
9     srun command to launch parallel job (requesting 4 processes - 2 per node)
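After saving the script (as submit.sl here; the name is arbitrary), submit it with sbatch and, if desired, monitor it with squeue. The job ID shown below is illustrative:

$ sbatch submit.sl
Submitted batch job 12345

$ squeue -u <username>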

Interactive Jobs

To request an interactive job where multiple job steps (i.e., multiple srun commands) can be launched on the allocated compute node(s), the salloc command can be used:

$ salloc -A <project_id> -J <job_name> -t 00:05:00 -p <partition> -N 2
salloc: Granted job allocation 313
salloc: Waiting for resource configuration
salloc: Nodes vapor[01-02] are ready for job

$ srun -n 4 --ntasks-per-node=2 ./a.out
<output printed to terminal>

$ srun -n 2 --ntasks-per-node=1 ./a.out
<output printed to terminal>

Here, salloc is used to request an allocation of compute nodes for 5 minutes. Once the resources become available, the user is granted access to the compute nodes (vapor01 and vapor02 in this case) and can launch job steps on them using srun.

Single Command (non-interactive)

$ srun -A <project_id> -t 00:05:00 -p <partition> -N 2 -n 4 --ntasks-per-node=2 ./a.out
<output printed to terminal>

The job name and output options have been removed since stdout/stderr are typically desired in the terminal window in this usage mode.

Common Slurm Submission Options

The table below summarizes commonly-used Slurm job submission options:

Flag                  Description
-A <project_id>       Project ID to charge
-J <job_name>         Name of job
-p <partition>        Partition / batch queue
-t <time>             Wall clock time <HH:MM:SS>
-N <number_of_nodes>  Number of compute nodes
-o <file_name>        Standard output file name
-e <file_name>        Standard error file name

For more information about these and/or other options, please see the sbatch man page.

Other Common Slurm Commands

The table below summarizes commonly-used Slurm commands:

Command   Description
sinfo     Used to view partition and node information.
          E.g., to view user-defined details about the batch partition:
          sinfo -p partition -o "%15N %10D %10P %10a %10c %10z"
squeue    Used to view job and job step information for jobs in the scheduling queue.
          E.g., to see all jobs from a specific user:
          squeue -l -u <user_id>
sacct     Used to view accounting data for jobs and job steps in the job accounting log
          (currently in the queue or recently completed).
          E.g., to see selected information about all jobs submitted or run by a user since 1 PM on October 4, 2025:
          sacct -u <username> -S 2025-10-04T13:00:00 -o "jobid%5,jobname%25,user%15,nodelist%20" -X
scancel   Used to signal or cancel jobs or job steps.
          E.g., to cancel a job:
          scancel <jobid>
scontrol  Used to view or modify job configuration.
          E.g., to place a job on hold:
          scontrol hold <jobid>

Vapor Container Guide

You can also build and run containers on Vapor with Apptainer; Vapor provides Apptainer v1.4.1. You can build containers from Apptainer definition files or pull images from a registry such as Docker Hub.
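For example, pulling an existing image from Docker Hub and opening a shell inside it might look like the following sketch (the opensuse/leap:15.6 image is only an illustration; by default, apptainer pull writes the image to leap_15.6.sif in the current directory):

$ apptainer pull docker://opensuse/leap:15.6
$ apptainer shell leap_15.6.sif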

Building and running OSU Benchmarks container example (with GCC and MPICH)

  • Create a file named simplempich.def

    Bootstrap: docker
    From: opensuse/leap:15.6
    %environment
        # Point to MPICH binaries, libraries man pages
        export MPICH_DIR=/opt/mpich
        export PATH="$MPICH_DIR/bin:$PATH"
        export LD_LIBRARY_PATH="$MPICH_DIR/lib:$LD_LIBRARY_PATH"
        export MANPATH=$MPICH_DIR/share/man:$MANPATH
        # Point to rocm locations
        export ROCM_PATH=/opt/rocm
        export LD_LIBRARY_PATH="/opt/rocm/lib:/opt/rocm/lib64:$LD_LIBRARY_PATH"
        export PATH="/opt/rocm/bin:$PATH"
    
    %post
    echo "Installing required packages..."
    export DEBIAN_FRONTEND=noninteractive
    zypper install -y wget tar make sudo git fakeroot gzip gcc gcc-c++ gcc-fortran
    export MPICH_VERSION=3.4.2
    export MPICH_URL="http://www.mpich.org/static/downloads/$MPICH_VERSION/mpich-$MPICH_VERSION.tar.gz"
    export MPICH_DIR=/opt/mpich
    echo "Installing MPICH..."
    mkdir -p /mpich
    mkdir -p /opt
    # Download
    cd /mpich && wget -O mpich-$MPICH_VERSION.tar.gz $MPICH_URL && tar --no-same-owner -xzf mpich-$MPICH_VERSION.tar.gz
    # Compile and install
    cd /mpich/mpich-$MPICH_VERSION && ./configure --disable-fortran --with-device=ch4:ofi --prefix=$MPICH_DIR && make install
    rm -rf /mpich
    # Set env variables so we can compile our application
    
    export PATH=$MPICH_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$MPICH_DIR/lib:$LD_LIBRARY_PATH
    echo "Compiling the MPI application..."
    cd /
    curl -o osubenchmarks-7.2.tar.gz https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz && tar -xzf osubenchmarks-7.2.tar.gz --no-same-owner
    cd osu-micro-benchmarks-7.2 && ./configure CC=mpicc CXX=mpicc  && make  && rm ../osubenchmarks-7.2.tar.gz
  • Build the container with

    apptainer build simplempich.sif simplempich.def
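  • Optionally, sanity-check the image. This is a minimal sketch that assumes the MPICH build completed and that the %environment PATH settings above are in effect:

    # Print the MPICH version built into the image
    apptainer exec simplempich.sif mpichversion
    # Confirm the mpicc wrapper is on the container's PATH
    apptainer exec simplempich.sif sh -c 'command -v mpicc'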

Building and running OSU Benchmarks container example (with Intel and Intel MPI)

If you want to build an application in the container with the Intel classic compilers and Intel MPI, you will need to install the appropriate version of the Intel oneAPI release and set up several environment variables.

  • First, create a file named intelenvs containing the required environment variables. This file will be copied into the container image and sourced every time the container starts, to set up the environment.

    export INTEL_PATH=/opt/intel/oneapi/compiler/2023.2.0
    export INTEL_VERSION=2023.2.0
    export INTEL_COMPILER_TYPE=CLASSIC
    export LD_LIBRARY_PATH=/opt/intel/oneapi/mpi/2021.10.0/lib/release:/opt/intel/oneapi/compiler/2023.2.0/linux/lib:/opt/intel/oneapi/compiler/2023.2.0/linux/lib/x64:/opt/intel/oneapi/compiler/2023.2.0/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2023.2.0/linux/compiler/lib/intel64_lin:$LD_LIBRARY_PATH
    export CMAKE_PREFIX_PATH=/opt/intel/oneapi/compiler/2023.2.0/linux/IntelDPCPP:$CMAKE_PREFIX_PATH
    export NLSPATH=/opt/intel/oneapi/compiler/2023.2.0/linux/compiler/lib/intel64_lin/locale/%l_%t/%N:$NLSPATH
    export OCL_ICD_FILENAMES=libintelocl_emu.so:libalteracl.so:/opt/intel/oneapi/compiler/2023.2.0/linux/lib/x64/libintelocl.so
    export ACL_BOARD_VENDOR_PATH=/opt/intel/OpenCLFPGA/oneAPI/Boards
    export FPGA_VARS_DIR=/opt/intel/oneapi/compiler/2023.2.0/linux/lib/oclfpga
    export CMPLR_ROOT=/opt/intel/oneapi/compiler/2023.2.0
    export INTELFPGAOCLSDKROOT=/opt/intel/oneapi/compiler/2023.2.0/linux/lib/oclfpga
    export LIBRARY_PATH=/opt/intel/oneapi/mpi/2021.10.0/lib/release:/opt/intel/oneapi/mpi/2021.10.0/lib/:/opt/intel/oneapi/mpi/2021.10.0/lib/:/opt/intel/oneapi/compiler/2023.2.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2023.2.0/linux/lib:$LIBRARY_PATH
    export DIAGUTIL_PATH=/opt/intel/oneapi/compiler/2023.2.0/sys_check/sys_check.sh:$DIAGUTIL_PATH
    export MANPATH=/opt/intel/oneapi/compiler/2023.2.0/documentation/en/man/common:$MANPATH
    export PATH=/opt/intel/oneapi/compiler/2023.2.0/linux/bin/intel64:/opt/intel/oneapi/compiler/2023.2.0/linux/lib/oclfpga/bin:/opt/intel/oneapi/compiler/2023.2.0/linux/bin/intel64:/opt/intel/oneapi/compiler/2023.2.0/linux/bin:$PATH
    export PKG_CONFIG_PATH=/opt/intel/oneapi/compiler/2023.2.0/lib/pkgconfig:$PKG_CONFIG_PATH
    export LD_LIBRARY_PATH=/opt/intel/oneapi/mpi/2021.10.0/lib/:/opt/intel/oneapi/mkl/2023.2.0/lib/intel64:$LD_LIBRARY_PATH
    export CPATH=/opt/intel/oneapi/compiler/2023.2.0/linux/include:/opt/intel/oneapi/mkl/2023.2.0/include:$CPATH
    export NLSPATH=/opt/intel/oneapi/mkl/2023.2.0/lib/intel64/locale/%l_%t/%N:$NLSPATH
    export LIBRARY_PATH=/opt/intel/oneapi/mkl/2023.2.0/lib/intel64:$LIBRARY_PATH
    export MKLROOT=/opt/intel/oneapi/mkl/2023.2.0
    export PATH=/opt/intel/oneapi/mpi/2021.10.0/bin:/opt/intel/oneapi/mkl/2023.2.0/bin/intel64:$PATH
    export PKG_CONFIG_PATH=/opt/intel/oneapi/mkl/2023.2.0/lib/pkgconfig:$PKG_CONFIG_PATH
    export INCLUDE_PATH=/opt/intel/oneapi/mpi/2021.10.0/include:$INCLUDE_PATH
    export I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
  • Create the file simpleintelmpi.def

    Bootstrap: docker
    From: opensuse/leap:15.6
    
    %files
    ./intelenvs /intelenvs
    
    %environment
        # Point to MPICH binaries, libraries man pages
        export MPICH_DIR=/opt/mpich
        export PATH="$MPICH_DIR/bin:$PATH"
        export LD_LIBRARY_PATH="$MPICH_DIR/lib:$LD_LIBRARY_PATH"
        export MANPATH=$MPICH_DIR/share/man:$MANPATH
        # Point to rocm locations
        export ROCM_PATH=/opt/rocm
        export LD_LIBRARY_PATH="/opt/rocm/lib:/opt/rocm/lib64:$LD_LIBRARY_PATH"
        export PATH="/opt/rocm/bin:$PATH"
        source /intelenvs
    
    %post
    set -xe
    echo "Installing required packages..."
    export DEBIAN_FRONTEND=noninteractive
    zypper install -y wget tar make sudo git fakeroot gzip gcc gcc-c++ gcc-fortran which vim
    
    
    ## adding intel and internal cray pkg repos
    tee > /etc/zypp/repos.d/oneAPI.repo << EOF
    [oneAPI]
    name=Intel® oneAPI repository
    baseurl=https://yum.repos.intel.com/oneapi
    enabled=1
    gpgcheck=1
    repo_gpgcheck=1
    gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
    EOF
    
    
    zypper --releasever=15.6 --non-interactive --gpg-auto-import-keys  refresh
    ## installing intel 2023.2 since that is the version that has intel-classic 2021.10 (and 2023.2 is the last release that provides intel-classic)
    zypper --non-interactive --gpg-auto-import-keys install -y intel-dpcpp-cpp-compiler-2023.2.0  intel-oneapi-compiler-fortran-2023.2.0 intel-oneapi-mpi-devel-2021.10.0
    
    source /intelenvs
    which mpicc
    echo "Compiling the MPI application..."
    cd /
    curl -o osubenchmarks-7.2.tar.gz https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz && tar -xzf osubenchmarks-7.2.tar.gz --no-same-owner
    cd osu-micro-benchmarks-7.2 && ./configure CC=mpiicc CXX=mpiicpc  && make  && rm ../osubenchmarks-7.2.tar.gz
  • Build the container image with

    apptainer build simpleintelmpi.sif simpleintelmpi.def
  • To run the container, write a job script that binds the host's MPI libraries into the container. For example, create the file submitbind.sl shown below and submit the job with sbatch submitbind.sl.

    #!/bin/bash
    
    #SBATCH -A stf007uanofn
    #SBATCH -J test
    #SBATCH -N 2
    #SBATCH -o logs/subil_%j.out
    #SBATCH -t 01:00:00
    ###SBATCH --ntasks-per-node=16
    
    module reset
    module load oneapi
    
    
    export APPTAINERENV_LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib64/libibverbs::\$LD_LIBRARY_PATH"
    export APPTAINER_CONTAINLIBS="/usr/lib64/libjansson.so.4,/usr/lib64/libjson-c.so.5,/usr/lib64/libnl-3.so.200,/usr/lib64/libibverbs.so.1,/usr/lib64/libnuma.so.1,/usr/lib64/libnl-cli-3.so.200,/usr/lib64/libnl-genl-3.so.200,/usr/lib64/libnl-nf-3.so.200,/usr/lib64/libnl-route-3.so.200,/usr/lib64/libnl-3.so.200,/usr/lib64/libnl-idiag-3.so.200,/usr/lib64/libnl-xfrm-3.so.200,/usr/lib64/libnl-genl-3.so.200"
    export APPTAINER_BIND=/sw/vapor,/var/spool/slurmd,${PWD},/etc/libibverbs.d,/usr/lib64/libibverbs,/usr/lib64/libnl,${HOME}
    
    set -x
    
    srun --ntasks-per-node=16 apptainer exec --writable-tmpfs simplempich.sif /osu-micro-benchmarks-7.2/c//mpi/collective/blocking/osu_alltoall -m 4096
    srun --ntasks-per-node=16 apptainer exec --writable-tmpfs simpleintelmpi.sif /osu-micro-benchmarks-7.2/c//mpi/collective/blocking/osu_alltoall -m 4096