Lassen (LLNL)

The Lassen V100 GPU cluster is located at LLNL.

Introduction

If you are new to this system, please see the LLNL HPC documentation for Lassen at hpc.llnl.gov (for example, the Using LC's Sierra Systems tutorial referenced in the batch script below).

Installation

Use the following command to download the WarpX source code:

git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx
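The clone above checks out the repository's default branch. If you want to build a tagged release instead, you can switch to it explicitly; the tag name below is only an example, list the available releases with git tag and pick the one you need:

cd $HOME/src/warpx
git fetch --tags
# hypothetical example tag; choose the release you actually want
git checkout 22.02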

We use the following modules and environments on the system ($HOME/lassen_warpx.profile).

Listing 23 You can copy this file from Tools/machines/lassen-llnl/lassen_warpx.profile.example.
# please set your project account
#export proj=<yourProject>

# required dependencies
module load cmake/3.21.1
module load gcc/8.3.1
module load cuda/11.2.0

# optional: for PSATD support
module load fftw/3.3.8

# optional: for QED lookup table generation support
module load boost/1.70.0

# optional: for openPMD support
module load hdf5-parallel/1.12.2
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/c-blosc-1.21.1:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/adios2-2.7.1:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/c-blosc-1.21.1/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/adios2-2.7.1/lib64:$LD_LIBRARY_PATH

# optional: for PSATD in RZ geometry support
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/blaspp-master:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/lapackpp-master:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/blaspp-master/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/lapackpp-master/lib64:$LD_LIBRARY_PATH

# optional: for Python bindings
module load python/3.8.2

# optional: an alias to request an interactive node for two hours
alias getNode="bsub -G $proj -W 2:00 -nnodes 1 -Is /bin/bash"

# fix system defaults: do not escape $ with a \ on tab completion
shopt -s direxpand

# optimize CUDA compilation for V100
export AMREX_CUDA_ARCH=7.0

# compiler environment hints
export CC=$(which gcc)
export CXX=$(which g++)
export FC=$(which gfortran)
export CUDACXX=$(which nvcc)
export CUDAHOSTCXX=$(which g++)

We recommend storing the above lines in a file, such as $HOME/lassen_warpx.profile, and loading it into your shell after login:

source $HOME/lassen_warpx.profile
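If you want this environment to be loaded automatically at every login, you can source the profile from your shell startup file (a minimal sketch, assuming a bash login shell):

# append to ~/.profile so the WarpX environment is loaded on login
echo 'source $HOME/lassen_warpx.profile' >> ~/.profile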

Since Lassen does not yet provide modules for them, install c-blosc, ADIOS2, BLAS++, and LAPACK++:

# c-blosc (I/O compression)
git clone -b v1.21.1 https://github.com/Blosc/c-blosc.git src/c-blosc
rm -rf src/c-blosc-lassen-build
cmake -S src/c-blosc -B src/c-blosc-lassen-build -DBUILD_TESTS=OFF -DBUILD_BENCHMARKS=OFF -DDEACTIVATE_AVX2=OFF -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/c-blosc-1.21.1
cmake --build src/c-blosc-lassen-build --target install --parallel 16

# ADIOS2
git clone -b v2.7.1 https://github.com/ornladios/ADIOS2.git src/adios2
rm -rf src/adios2-lassen-build
cmake -S src/adios2 -B src/adios2-lassen-build -DADIOS2_USE_Blosc=ON -DADIOS2_USE_Fortran=OFF -DADIOS2_USE_Python=OFF -DADIOS2_USE_SST=OFF -DADIOS2_USE_ZeroMQ=OFF -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/adios2-2.7.1
cmake --build src/adios2-lassen-build --target install -j 16

# BLAS++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/blaspp.git src/blaspp
rm -rf src/blaspp-lassen-build
cmake -S src/blaspp -B src/blaspp-lassen-build -Duse_openmp=ON -Dgpu_backend=CUDA -Duse_cmake_find_blas=ON -DBLA_VENDOR=IBMESSL -DCMAKE_CXX_STANDARD=17 -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/blaspp-master
cmake --build src/blaspp-lassen-build --target install --parallel 16

# LAPACK++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/lapackpp.git src/lapackpp
rm -rf src/lapackpp-lassen-build
CXXFLAGS="-DLAPACK_FORTRAN_ADD_" cmake -S src/lapackpp -B src/lapackpp-lassen-build -Duse_cmake_find_lapack=ON -DBLA_VENDOR=IBMESSL -DCMAKE_CXX_STANDARD=17 -Dbuild_tests=OFF -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/lapackpp-master -DLAPACK_LIBRARIES=/usr/lib64/liblapack.so
cmake --build src/lapackpp-lassen-build --target install --parallel 16
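As a quick sanity check, you can verify that the install prefixes referenced in $HOME/lassen_warpx.profile were populated; the paths below follow the LD_LIBRARY_PATH entries set in the profile (on some systems the libraries land in lib instead of lib64). If a directory is missing or empty, re-run the corresponding build step above:

ls $HOME/sw/lassen/c-blosc-1.21.1/lib64
ls $HOME/sw/lassen/adios2-2.7.1/lib64
ls $HOME/sw/lassen/blaspp-master/lib64
ls $HOME/sw/lassen/lapackpp-master/lib64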

Then, cd into the directory $HOME/src/warpx and use the following commands to compile:

cd $HOME/src/warpx
rm -rf build

cmake -S . -B build -DWarpX_COMPUTE=CUDA -DWarpX_PSATD=ON
cmake --build build -j 10

The other general compile-time options apply as usual.
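For example, if you also want openPMD output enabled in the build, you can pass the corresponding option to the same configure step and rebuild; the option name is taken from the general WarpX build documentation and shown here only as an illustration:

# example: re-configure the same build directory with openPMD I/O enabled
cmake -S . -B build -DWarpX_COMPUTE=CUDA -DWarpX_PSATD=ON -DWarpX_OPENPMD=ON
cmake --build build -j 10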

That’s it! A 3D WarpX executable is now in build/bin/ and can be run with a 3D example inputs file. Most people execute the binary directly or copy it out to a location in /p/gpfs1/$(whoami).
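For example, to stage the executable and an inputs file into the parallel file system before a production run (the run directory layout and executable name pattern are illustrative; check build/bin/ for the exact name of your build):

# hypothetical staging example; adjust the run directory and inputs file
mkdir -p /p/gpfs1/$(whoami)/warpx_runs/run_001
cp build/bin/warpx* /p/gpfs1/$(whoami)/warpx_runs/run_001/
cp <input file>     /p/gpfs1/$(whoami)/warpx_runs/run_001/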

Running

V100 GPUs (16GB)

The batch script below can be used to run a WarpX simulation on 2 nodes of the Lassen supercomputer at LLNL. Replace the descriptions between chevrons <> with relevant values; for instance, <input file> could be plasma_mirror_inputs. Note that the only option so far is to run with one MPI rank per GPU.

Listing 24 You can copy this file from Tools/machines/lassen-llnl/lassen.bsub.
#!/bin/bash

# Copyright 2020-2022 Axel Huebl
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
#   https://jsrunvisualizer.olcf.ornl.gov/?s4f0o11n6c7g1r11d1b1l0=
#   https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-system#quick16

#BSUB -G <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 2
#BSUB -alloc_flags smt4
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J

# Work-around OpenMPI bug with chunked HDF5
#   https://github.com/open-mpi/ompi/issues/7795
export OMPI_MCA_io=ompio

# Work-around for broken IBM "libcollectives" MPI_Allgatherv
#   https://github.com/ECP-WarpX/WarpX/pull/2874
export OMPI_MCA_coll_ibm_skip_allgatherv=true

# ROMIO has a hint for GPFS named IBM_largeblock_io which optimizes I/O with operations on large blocks
export IBM_largeblock_io=true

# MPI-I/O: ROMIO hints for parallel HDF5 performance
export ROMIO_HINTS=./romio-hints
#   number of hosts: unique node names minus batch node
NUM_HOSTS=$(( $(echo $LSB_HOSTS | tr ' ' '\n' | uniq | wc -l) - 1 ))
cat > romio-hints << EOL
   romio_cb_write enable
   romio_ds_write enable
   cb_buffer_size 16777216
   cb_nodes ${NUM_HOSTS}
EOL

# OpenMPI file locks are slow and not needed
# https://github.com/open-mpi/ompi/issues/10053
export OMPI_MCA_sharedfp=^lockedfile,individual

# HDF5: disable slow locks (promise not to open half-written files)
export HDF5_USE_FILE_LOCKING=FALSE

# OpenMP: 1 thread per MPI rank
export OMP_NUM_THREADS=1

# store out task host mapping: helps identify broken nodes at scale
jsrun -r 4 -a 1 -g 1 -c 7 -e prepended hostname > task_host_mapping.txt

# run WarpX
jsrun -r 4 -a 1 -g 1 -c 7 -l GPU-CPU -d packed -b rs -e prepended -M "-gpu" <path/to/executable> <input file> > output.txt

To run a simulation, copy the lines above to a file lassen.bsub and run

bsub lassen.bsub

to submit the job.
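After submission, the usual LSF commands can be used to monitor and manage the job (replace <jobid> with the ID printed by bsub):

bjobs              # list your pending and running jobs
bpeek <jobid>      # peek at the output of a running job
bkill <jobid>      # cancel a job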

For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on V100 GPUs, for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:

  • amr.max_grid_size=256 and amr.blocking_factor=128.

  • One MPI rank per GPU (e.g., 4 MPI ranks for the 4 GPUs on each Lassen node)

  • Two 128x128x128 grids per GPU, or one 128x128x256 grid per GPU (see the example inputs fragment after this list).
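As an illustration only, the parameters above could be combined into an inputs fragment like the following for a 2-node (8 GPU) run; the global cell counts are hypothetical and need to be adapted to your physical setup so that each GPU ends up with roughly one 128x128x256 grid:

# hypothetical inputs fragment for 8 V100 GPUs (2 Lassen nodes)
amr.n_cell          = 256 256 512
amr.max_grid_size   = 256
amr.blocking_factor = 128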

Known System Issues

Warning

Feb 17th, 2022 (INC0278922): The implementation of AllGatherv in IBM’s MPI optimization library “libcollectives” is broken and leads to HDF5 crashes for multi-node runs.

The batch script template above applies this work-around before the call to jsrun; it skips the broken IBM routines and falls back to OpenMPI's implementation of the collectives:

export OMPI_MCA_coll_ibm_skip_allgatherv=true