Crusher (OLCF)

The Crusher cluster is located at OLCF. Each node contains 4 AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs) for a total of 8 GCDs per node. You can think of the 8 GCDs as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).
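If you want to verify this mapping yourself, the ROCm system management tool lists each GCD as a separate device (a quick check, assuming you are on a compute node with the rocm module loaded):

# should report 8 GPUs per node, one per GCD
rocm-smi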

Introduction

If you are new to this system, please see the following resources:

  • Crusher user guide

  • Batch system: Slurm

  • Production directories:

    • $PROJWORK/$proj/: shared with all members of a project, purged every 90 days (recommended)

    • $MEMBERWORK/$proj/: single user, purged every 90 days (usually smaller quota)

    • $WORLDWORK/$proj/: shared with all users, purged every 90 days

    • Note that the $HOME directory is mounted as read-only on compute nodes. That means you cannot run simulations from your $HOME; use one of the work file systems above instead (see the example below).
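For example, a minimal way to set up a run directory on the recommended project work file system (the directory name is just a placeholder):

# hypothetical run directory on the project work file system; $proj is your project ID
mkdir -p $PROJWORK/$proj/$USER/my_first_run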

Installation

Use the following command to download the WarpX source code:

git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx

We use the following modules and environments on the system ($HOME/crusher_warpx.profile).

Listing 16 You can copy this file from Tools/machines/crusher-olcf/crusher_warpx.profile.example.
# please set your project account
#    note: WarpX ECP members use aph114_crusher
#export proj=<yourProject>

# required dependencies
module load cpe/22.08
module load cmake/3.23.2
module load craype-accel-amd-gfx90a
module load rocm/5.3.0
module load cray-mpich/8.1.23
module load cce/15.0.0  # must be loaded after rocm

# optional: faster builds
module load ccache
module load ninja

# optional: just an additional text editor
module load nano

# optional: for PSATD in RZ geometry support (no blaspp/lapackpp modules available yet; using manually built versions instead)
#module load blaspp
#module load lapackpp
export CMAKE_PREFIX_PATH=$HOME/sw/crusher/blaspp-master:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=$HOME/sw/crusher/lapackpp-master:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$HOME/sw/crusher/blaspp-master/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/sw/crusher/lapackpp-master/lib64:$LD_LIBRARY_PATH

# optional: for QED lookup table generation support
#module load boost/1.78.0-cxx17

# optional: for openPMD support
module load adios2/2.8.3
module load cray-hdf5-parallel/1.12.2.1

# optional: for Python bindings or libEnsemble
module load cray-python/3.9.12.1

# suggested by HPE
module unload xalt/1.3.0

# fix system defaults: do not escape $ with a \ on tab completion
shopt -s direxpand

# an alias to request an interactive batch node for one hour
#   for parallel execution, start on the batch node: srun <command>
alias getNode="salloc -A $proj -J warpx -t 01:00:00 -p batch -N 1 -c 8 --ntasks-per-node=8"
# an alias to run a command on a batch node for up to 30min
#   usage: runNode <command>
alias runNode="srun -A $proj -J warpx -t 00:30:00 -p batch -N 1 -c 8 --ntasks-per-node=8"

# GPU-aware MPI
export MPICH_GPU_SUPPORT_ENABLED=1

# optimize ROCm/HIP compilation for MI250X
export AMREX_AMD_ARCH=gfx90a

# compiler environment hints
export CC=$(which cc)
export CXX=$(which CC)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include"
#export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64"

We recommend storing the above lines in a file, such as $HOME/crusher_warpx.profile, and loading it into your shell after login:

source $HOME/crusher_warpx.profile
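As a quick check that the profile took effect, inspect one of the variables it exports:

echo $AMREX_AMD_ARCH   # should print gfx90a if the profile was sourced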

Since Crusher does not yet provide modules for BLAS++ and LAPACK++ (needed for PSATD in RZ geometry), install them manually:

# BLAS++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/blaspp.git src/blaspp
rm -rf src/blaspp-crusher-build
cmake -S src/blaspp -B src/blaspp-crusher-build -Duse_openmp=OFF -Dgpu_backend=hip -DCMAKE_CXX_STANDARD=17 -DCMAKE_INSTALL_PREFIX=$HOME/sw/crusher/blaspp-master
cmake --build src/blaspp-crusher-build --target install --parallel 10

# LAPACK++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/lapackpp.git src/lapackpp
rm -rf src/lapackpp-crusher-build
cmake -S src/lapackpp -B src/lapackpp-crusher-build -DCMAKE_CXX_STANDARD=17 -Dbuild_tests=OFF -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_INSTALL_PREFIX=$HOME/sw/crusher/lapackpp-master
cmake --build src/lapackpp-crusher-build --target install --parallel 10
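If the builds succeeded, the shared libraries should end up in the directories that the profile above adds to LD_LIBRARY_PATH (depending on CMake defaults, the install prefix may use lib instead of lib64); a quick way to verify:

# both directories should contain the freshly built shared libraries
ls $HOME/sw/crusher/blaspp-master/lib64/
ls $HOME/sw/crusher/lapackpp-master/lib64/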

Then, cd into the directory $HOME/src/warpx and use the following commands to compile:

cd $HOME/src/warpx
rm -rf build

cmake -S . -B build -DWarpX_DIMS=3 -DWarpX_COMPUTE=HIP -DWarpX_PSATD=ON
cmake --build build -j 10

The general cmake compile-time options apply as usual.
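For example, if you also want an RZ geometry executable (which uses the BLAS++/LAPACK++ installs above for PSATD), a hypothetical configure in a separate build directory (build_rz) could look like this:

# sketch only: RZ geometry with PSATD, reusing the same HIP settings
cmake -S . -B build_rz -DWarpX_DIMS=RZ -DWarpX_COMPUTE=HIP -DWarpX_PSATD=ON
cmake --build build_rz -j 10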

That’s it! A 3D WarpX executable is now in build/bin/ and can be run with a 3D example inputs file. Most people execute the binary directly or copy it out to a location in $PROJWORK/$proj/.
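For example (using the placeholder run directory created earlier; depending on the WarpX version, the executable name may carry a suffix encoding the build options):

# stage the executable and an inputs file of your choice in a run directory
cp build/bin/warpx* $PROJWORK/$proj/$USER/my_first_run/
cd $PROJWORK/$proj/$USER/my_first_run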

Running

MI250X GPUs (2x64 GB)

ECP WarpX project members, use the aph114 project ID.

After requesting an interactive node with the getNode alias above, run a simulation like this, here using 8 MPI ranks and a single node:

runNode ./warpx inputs

Or in non-interactive runs:

Listing 17 You can copy this file from Tools/machines/crusher-olcf/submit.sh.
#!/usr/bin/env bash

#SBATCH -A <project id>
#    note: WarpX ECP members use aph114
#SBATCH -J warpx
#SBATCH -o %x-%j.out
#SBATCH -t 00:10:00
#SBATCH -p batch
#SBATCH --ntasks-per-node=8
# note: --cpus-per-task is broken since 01/2023, so the directive below is deliberately disabled
#S BATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH -N 1

# From the documentation:
# Each Crusher compute node consists of [1x] 64-core AMD EPYC 7A53
# "Optimized 3rd Gen EPYC" CPU (with 2 hardware threads per physical core) with
# access to 512 GB of DDR4 memory.
# Each node also contains [4x] AMD MI250X, each with 2 Graphics Compute Dies
# (GCDs) for a total of 8 GCDs per node. The programmer can think of the 8 GCDs
# as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).

# note (5-16-22, OLCFHELP-6888)
# this environment setting is currently needed on Crusher to work-around a
# known issue with Libfabric
#export FI_MR_CACHE_MAX_COUNT=0  # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks  # alternative cache monitor

# note (9-2-22, OLCFDEV-1079)
# this environment setting is needed to prevent rocFFT from writing a cache in
# the home directory, which does not scale.
export ROCFFT_RTC_CACHE_PATH=/dev/null

export OMP_NUM_THREADS=8
srun ./warpx inputs > output.txt
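Assuming the script is saved as submit.sh next to your executable and inputs file, submit and monitor it with the usual Slurm commands:

sbatch submit.sh    # submit the job
squeue -u $USER     # check its state in the queue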

Post-Processing

For post-processing, most users use Python via OLCF's Jupyter service (Docs).

Please follow the same guidance as for OLCF Summit post-processing.

Known System Issues

Warning

May 16th, 2022 (OLCFHELP-6888): There is a caching bug in Libfabric that causes WarpX simulations to occasionally hang on Crusher when using more than one node.

As a work-around, please export the following environment variable in your job scripts until the issue is fixed:

#export FI_MR_CACHE_MAX_COUNT=0  # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks  # alternative cache monitor

Warning

Sep 2nd, 2022 (OLCFDEV-1079): rocFFT in ROCm 5.1+ tries to write to a cache in the home area by default. This does not scale; disable it via:

export ROCFFT_RTC_CACHE_PATH=/dev/null

Warning

January, 2023 (OLCFDEV-1284, AMD Ticket: ORNLA-130): We discovered a regression in AMD ROCm that leads to 2x slower current deposition (and other slowdowns) in ROCm 5.3 and 5.4. The issue has been reported to AMD and is under investigation.

Stay with the ROCm 5.2 module to avoid this regression.
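If a ROCm 5.2 module is installed on the system (an assumption; check with module avail rocm), the swap in the profile above would look like this, keeping in mind that the compiler module must still be loaded after ROCm:

# assumed module version; verify the exact name with: module avail rocm
module load rocm/5.2.0
module load cce/15.0.0  # must be loaded after rocm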