Lassen (LLNL)
The Lassen V100 GPU cluster is located at LLNL.
Introduction
If you are new to this system, please see the following resources:
Batch system: LSF
- /p/gpfs1/$(whoami): personal directory on the parallel filesystem
Note that the $HOME directory and the /usr/workspace/$(whoami) space are NFS-mounted and not suitable for production-quality data generation.
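To see which filesystem backs a directory before writing simulation output there, a minimal sketch like the following may help. It assumes GNU coreutils `stat` is available. The helper name `fs_type` is hypothetical, and the type strings reported (for example `nfs` or `gpfs`) depend on the system.

```shell
# Print the filesystem type backing a directory (assumes GNU coreutils stat).
# fs_type is a hypothetical helper, not part of WarpX or the Lassen tooling.
fs_type() {
    stat -f -c %T "$1"
}

# Example: inspect the current working directory
fs_type "$PWD"
```

Running it on a candidate output directory before launching a job is a quick way to notice that the directory sits on an NFS mount.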
Installation
Use the following command to download the WarpX source code:
git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx
We use the following modules and environments on the system ($HOME/lassen_warpx.profile).
# please set your project account
#export proj=<yourProject>
# required dependencies
module load cmake/3.21.1
module load gcc/8.3.1
module load cuda/11.2.0
# optional: for PSATD support
module load fftw/3.3.8
# optional: for QED lookup table generation support
module load boost/1.70.0
# optional: for openPMD support
module load hdf5-parallel/1.12.2
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/c-blosc-1.21.1:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/adios2-2.7.1:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/c-blosc-1.21.1/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/adios2-2.7.1/lib64:$LD_LIBRARY_PATH
# optional: for PSATD in RZ geometry support
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/blaspp-master:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=$HOME/sw/lassen/lapackpp-master:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/blaspp-master/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/sw/lassen/lapackpp-master/lib64:$LD_LIBRARY_PATH
# optional: for Python bindings
module load python/3.8.2
# optional: an alias to request an interactive node for two hours
alias getNode="bsub -G $proj -W 2:00 -nnodes 1 -Is /bin/bash"
# fix system defaults: do not escape $ with a \ on tab completion
shopt -s direxpand
# optimize CUDA compilation for V100
export AMREX_CUDA_ARCH=7.0
# compiler environment hints
export CC=$(which gcc)
export CXX=$(which g++)
export FC=$(which gfortran)
export CUDACXX=$(which nvcc)
export CUDAHOSTCXX=$(which g++)
We recommend storing the above lines in a file, such as $HOME/lassen_warpx.profile, and loading it into your shell after each login:
source $HOME/lassen_warpx.profile
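To confirm that the profile took effect in the current shell, one option is to check the variables it exports. The sketch below is an illustration only: `check_warpx_env` is a hypothetical helper name, and the variable list is copied from the profile lines above.

```shell
# Report which of the variables set by the profile are missing or empty.
# check_warpx_env is a hypothetical helper name used for illustration.
check_warpx_env() {
    missing=0
    for var in AMREX_CUDA_ARCH CC CXX FC CUDACXX CUDAHOSTCXX; do
        # indirect lookup of the variable named in $var (POSIX-compatible)
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "not set: $var"
            missing=1
        fi
    done
    return "$missing"
}

# Example: print a hint if the profile has not been sourced in this shell
check_warpx_env || echo "hint: source \$HOME/lassen_warpx.profile"
```

The function returns a non-zero status when any variable is missing, so it can also guard a build script.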
Since Lassen does not yet provide modules for them, install c-blosc, ADIOS2, BLAS++ and LAPACK++:
# c-blosc (I/O compression)
git clone -b v1.21.1 https://github.com/Blosc/c-blosc.git src/c-blosc
rm -rf src/c-blosc-lassen-build
cmake -S src/c-blosc -B src/c-blosc-lassen-build -DBUILD_TESTS=OFF -DBUILD_BENCHMARKS=OFF -DDEACTIVATE_AVX2=OFF -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/c-blosc-1.21.1
cmake --build src/c-blosc-lassen-build --target install --parallel 16
# ADIOS2
git clone -b v2.7.1 https://github.com/ornladios/ADIOS2.git src/adios2
rm -rf src/adios2-lassen-build
cmake -S src/adios2 -B src/adios2-lassen-build -DADIOS2_USE_Blosc=ON -DADIOS2_USE_Fortran=OFF -DADIOS2_USE_Python=OFF -DADIOS2_USE_SST=OFF -DADIOS2_USE_ZeroMQ=OFF -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/adios2-2.7.1
cmake --build src/adios2-lassen-build --target install -j 16
# BLAS++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/blaspp.git src/blaspp
rm -rf src/blaspp-lassen-build
cmake -S src/blaspp -B src/blaspp-lassen-build -Duse_openmp=ON -Dgpu_backend=CUDA -Duse_cmake_find_blas=ON -DBLA_VENDOR=IBMESSL -DCMAKE_CXX_STANDARD=17 -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/blaspp-master
cmake --build src/blaspp-lassen-build --target install --parallel 16
# LAPACK++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/lapackpp.git src/lapackpp
rm -rf src/lapackpp-lassen-build
CXXFLAGS="-DLAPACK_FORTRAN_ADD_" cmake -S src/lapackpp -B src/lapackpp-lassen-build -Duse_cmake_find_lapack=ON -DBLA_VENDOR=IBMESSL -DCMAKE_CXX_STANDARD=17 -Dbuild_tests=OFF -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_INSTALL_PREFIX=$HOME/sw/lassen/lapackpp-master -DLAPACK_LIBRARIES=/usr/lib64/liblapack.so
cmake --build src/lapackpp-lassen-build --target install --parallel 16
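After these builds, it can be useful to verify that every prefix listed in `CMAKE_PREFIX_PATH` exists on disk, since a failed install otherwise only shows up later during the WarpX configure step. The following is a minimal sketch; `missing_prefixes` is a hypothetical helper name.

```shell
# List entries of a colon-separated path variable that do not exist on disk.
# missing_prefixes is a hypothetical helper name used for illustration.
missing_prefixes() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        if [ -n "$dir" ] && [ ! -d "$dir" ]; then
            echo "missing: $dir"
        fi
    done
    IFS=$old_ifs
}

# Example: check the install prefixes exported by the profile
missing_prefixes "${CMAKE_PREFIX_PATH:-}"
```

No output means that every listed prefix is present.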
Then, cd into the directory $HOME/src/warpx and use the following commands to compile:
cd $HOME/src/warpx
rm -rf build
cmake -S . -B build -DWarpX_COMPUTE=CUDA -DWarpX_PSATD=ON
cmake --build build -j 10
The other general compile-time options apply as usual.
That’s it! A 3D WarpX executable is now in build/bin/ and can be run with a 3D example inputs file. Most people execute the binary directly or copy it out to a location in /p/gpfs1/$(whoami).
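Copying the executable to the parallel filesystem could be scripted as sketched below. This is an illustration under assumptions: `stage_executables` is a hypothetical helper, the run directory name is made up, and the sketch copies every executable file it finds rather than relying on a particular binary name.

```shell
# Copy every executable file from a build directory into a run directory.
# stage_executables and both paths are hypothetical; adjust to your setup.
stage_executables() {
    bin_dir="$1"
    run_dir="$2"
    mkdir -p "$run_dir"
    for exe in "$bin_dir"/*; do
        if [ -f "$exe" ] && [ -x "$exe" ]; then
            cp "$exe" "$run_dir"/
        fi
    done
}

# Example (the run directory name is an assumption):
# stage_executables "$HOME/src/warpx/build/bin" "/p/gpfs1/$(whoami)/my_run"
```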
Running
V100 GPUs (16GB)
The batch script below can be used to run a WarpX simulation on 2 nodes of the Lassen supercomputer at LLNL. Replace descriptions between chevrons <> with relevant values; for instance, <input file> could be plasma_mirror_inputs. Note that the only option so far is to run with one MPI rank per GPU.
#!/bin/bash
# Copyright 2020-2022 Axel Huebl
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
# https://jsrunvisualizer.olcf.ornl.gov/?s4f0o11n6c7g1r11d1b1l0=
# https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-system#quick16
#BSUB -G <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 2
#BSUB -alloc_flags smt4
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J
# Work-around OpenMPI bug with chunked HDF5
# https://github.com/open-mpi/ompi/issues/7795
export OMPI_MCA_io=ompio
# Work-around for broken IBM "libcollectives" MPI_Allgatherv
# https://github.com/ECP-WarpX/WarpX/pull/2874
export OMPI_MCA_coll_ibm_skip_allgatherv=true
# ROMIO has a hint for GPFS named IBM_largeblock_io which optimizes I/O with operations on large blocks
export IBM_largeblock_io=true
# MPI-I/O: ROMIO hints for parallel HDF5 performance
export ROMIO_HINTS=./romio-hints
# number of hosts: unique node names minus batch node
NUM_HOSTS=$(( $(echo $LSB_HOSTS | tr ' ' '\n' | uniq | wc -l) - 1 ))
cat > romio-hints << EOL
romio_cb_write enable
romio_ds_write enable
cb_buffer_size 16777216
cb_nodes ${NUM_HOSTS}
EOL
# OpenMPI file locks are slow and not needed
# https://github.com/open-mpi/ompi/issues/10053
export OMPI_MCA_sharedfp=^lockedfile,individual
# HDF5: disable slow locks (promise not to open half-written files)
export HDF5_USE_FILE_LOCKING=FALSE
# OpenMP: 1 thread per MPI rank
export OMP_NUM_THREADS=1
# store the task-to-host mapping: helps identify broken nodes at scale
jsrun -r 4 -a1 -g 1 -c 7 -e prepended hostname > task_host_mapping.txt
# run WarpX
jsrun -r 4 -a 1 -g 1 -c 7 -l GPU-CPU -d packed -b rs -e prepended -M "-gpu" <path/to/executable> <input file> > output.txt
To run a simulation, copy the lines above to a file lassen.bsub and run
bsub lassen.bsub
to submit the job.
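The `NUM_HOSTS` arithmetic in the script can be tried outside a job with a mock host list. The sketch below is an illustration under assumptions: the host names are made up, `count_compute_hosts` is a hypothetical helper, and it uses `sort -u` in place of the script's `uniq` so that repeated names are collapsed even when they are not adjacent.

```shell
# Count unique host names in an LSB_HOSTS-style list, minus one batch node.
# count_compute_hosts is a hypothetical helper name used for illustration.
count_compute_hosts() {
    unique=$(printf '%s\n' "$1" | tr ' ' '\n' | sort -u | wc -l)
    echo $(( unique - 1 ))
}

# Mock list: one batch node plus two compute nodes (names are made up)
mock_hosts="batch1 node1 node1 node2 node2"
count_compute_hosts "$mock_hosts"
```

For the mock list this prints 2, which is the value the script would write as `cb_nodes` into the romio-hints file for a 2-node job.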
For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on V100 GPUs for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:
- amr.max_grid_size=256 and amr.blocking_factor=128.
- One MPI rank per GPU (e.g., 4 MPI ranks for the 4 GPUs on each Lassen node)
- Two `128x128x128` grids per GPU, or one `128x128x256` grid per GPU.
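As a rough sizing aid, the layout above can be turned into rank and cell counts for a given number of nodes. This is a back-of-the-envelope sketch, not a tuning rule: the function names are hypothetical, and it simply multiplies out the figures quoted above (4 GPUs per node, two `128x128x128` grids per GPU).

```shell
# Estimate MPI ranks and total cells for the layout described above.
gpus_per_node=4                           # one MPI rank per GPU
cells_per_gpu=$(( 2 * 128 * 128 * 128 ))  # two 128x128x128 grids per GPU

# total_ranks and total_cells are hypothetical helper names
total_ranks() {
    echo $(( $1 * gpus_per_node ))
}

total_cells() {
    echo $(( $1 * gpus_per_node * cells_per_gpu ))
}

# Example: the 2-node job from the batch script
total_ranks 2
total_cells 2
```

For 2 nodes this gives 8 ranks and 33554432 cells, which matches, for example, a 256x256x512 domain.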
Known System Issues
Warning
Feb 17th, 2022 (INC0278922): The implementation of AllGatherv in IBM’s MPI optimization library “libcollectives” is broken and leads to HDF5 crashes for multi-node runs.
Our batch script templates above apply this work-around before the call to jsrun, which avoids the broken routines from IBM and trades them for an OpenMPI implementation of collectives:
export OMPI_MCA_coll_ibm_skip_allgatherv=true