Install and run on HPC
This page describes how to install and run Eilmer on some of the HPC cluster systems our research group typically has access to. If you have access to another cluster, you still might find the information here useful. You might find your local cluster setup is similar to one of those listed here. Likely, you should be able to adapt the instructions with minimal hassle.
This section assumes you know how to set up your environment for running eilmer on the cluster machine. What we give you here are specific steps required to build and run the MPI version.
A note on D compiler versions
As D is a relatively new language, and particularly so in HPC environments, we find it best to bring your own D compiler. We recommend having a very recent version of the D compiler installed in your own account and committing to updating it on a regular basis. For optimised builds, use the LLVM compiler. Install notes for the LLVM D compiler (LDC) are available here.
A 3-minute introduction to queue systems
HPC systems are shared access systems for multiple users. The compute jobs are controlled by a queueing system. Different queueing systems are installed on various clusters, but they tend to have some commonalities. Those commonalities are: a way to submit your jobs to the queue; a way to check on your jobs' status; and a way to remove a job from the queue or terminate it while it’s running.
In the clusters we use regularly, there are two queueing systems: PBS(/Pro) and SLURM. Here is a brief summary on how to interact with those systems.
Preparing a submission script
Your compute job will run in batch mode on the cluster. What this means is that you prepare the instructions (ie. the list of commands you would type) ahead of time and place these in a text file. This is what we call the submission script. We also include some directions to the queue system itself in the submission script. So, there are two parts in a submission script:
- Directives for the queue system. Each of these lines starts with a `#` symbol and a phrase associated with the particular system. On PBS, use `#PBS`. On SLURM, use `#SBATCH`.
- Commands to launch your compute job. These are the commands as you would actually type them if running your job interactively at a terminal.

We need to add a caveat on that last statement. You do type commands as you would at the terminal to start your job, but remember your job is running in batch mode and you won’t see its output or error messages. For this reason, it’s common to redirect that output with the stdout redirect `>` and the stderr redirect `2>`. You will find this in the examples below.
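In its generic form, that redirection looks something like this (the command name here is just a placeholder):
some-command > LOGFILE 2> ERRFILE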
Here is an example script to give those points above some concreteness.
#!/bin/bash
#SBATCH --job-name=my-MPI-job
#SBATCH --nodes=1
#SBATCH --ntasks=24
e4shared --job=myJob --run > LOGFILE 2> ERRFILE
The first line sets our shell.
The next three lines are directives for the SLURM queue manager.
We won’t explain them here because they are specific to each HPC cluster.
Here, we are just emphasising the general layout of a submission script.
There is one command in this script: a command to launch Eilmer.
Note that stdout is directed to a file called `LOGFILE` and stderr to `ERRFILE`.

Save your text file submission script.
The extension on the script doesn’t matter.
I usually save PBS scripts with a `.qsub` extension
and SLURM scripts with a `.sbatch` extension.
The reason is related to submission: those extensions remind me
which queue system I prepared my job for.
Submitting a job
After saving your submission script into a file,
you are ready to submit this to the queue system.
The submission command in PBS is `qsub`.
In SLURM, use `sbatch` to submit a job.
This is typed at the command prompt on the login node of a cluster.
For example, assuming a PBS system and a submission script
called `run-my-job.qsub`, type:
qsub run-my-job.qsub
The system will return to you a job number.
Checking job status
Your job may not launch instantly.
In fact, it might be several hours before your job starts.
To check the job’s status (or multiple jobs, if you have launched multiples),
use the queue status command.
On PBS, use `qstat`.
On SLURM, use `squeue`.
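For example, to list just your own jobs, you can pass your username to the status command; something like the following should work on SLURM and PBS respectively:
squeue --user=$USER
qstat -u $USER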
Removing a job
If your job runs successfully to completion, you don’t need to do anything. The job simply terminates and leaves output in your working directory.
Occasionally, you might detect an error in your script or
you might have changed your mind about parameters in your job.
You will need a way to remove your job from either the queue,
or running on the cluster.
Here is where your job number comes in handy.
You can remove a job using the appropriate command followed
by your job number.
On PBS, delete a job using `qdel`.
On SLURM, cancel a job using `scancel`.
This example shows what to type to remove a job with
job id 3465 on a SLURM system:
scancel 3465
As tempting as it might be to remove others' jobs so your job gets on faster, these commands won’t let you do that. You can only remove your jobs from the queue.
Summary of commands and where to find more information
There are several more commands available to interact with a queue system. However, the main three you will need are: a command to submit a job; a command to check status; and a command to remove a job if you need. These are summarised here.
| Action | PBS | SLURM |
|---|---|---|
| submit a job | `qsub my-job.qsub` | `sbatch my-job.sbatch` |
| check job status | `qstat` | `squeue` |
| remove a job | `qdel 3654.pbs` | `scancel 3654` |
There is more information that is specific to each of the clusters
in the sections below.
You should consult the user guides for specific clusters for hints
on queue submission.
You can also use `man` to find the full list of options
available on these queue commands.
What I’ve introduced here is just their very basic usage.
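For example, to see the full set of options for the SLURM submission command, type:
man sbatch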
Goliath: EAIT Faculty Cluster
Hardware: Dell servers, Intel CPUs, 24-cores per node
Operating system: CentOS Linux 7.4
Queue system: SLURM
Compiling
As this is a modern RedHat-flavoured system, you will need
to load the openmpi module to both build and run `e4mpi`.
To compile:
module load mpi/openmpi-x86_64
cd gdtk/src/eilmer
make FLAVOUR=fast WITH_MPI=1 install
To complete the install, remember to set your environment variables as described on the install page.
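As a rough sketch only (assuming Eilmer was installed to the default location of `gdtkinst` in your home directory; the install page has the authoritative list), the entries in your `.bashrc` would look something like:
export DGD=${HOME}/gdtkinst
export PATH=${PATH}:${DGD}/bin
export DGD_LUA_PATH=${DGD}/lib/?.lua
export DGD_LUA_CPATH=${DGD}/lib/?.so
export PYTHONPATH=${PYTHONPATH}:${DGD}/lib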
Running
Where possible, try to use a full node at a time on Goliath.
That means trying to set up your simulations to use 24 cores
per node.
Let’s look at an example SLURM submit script where you
have 24 MPI tasks to run on 1 node.
I’ve named this file `run-on-goliath.slurm`.
#!/bin/bash
#SBATCH --job-name=my-MPI-job
#SBATCH --nodes=1
#SBATCH --ntasks=24
module load mpi/openmpi-x86_64
mpirun e4mpi --job=myJob --run > LOGFILE
We submit that job using `sbatch`:
sbatch run-on-goliath.slurm
Next, we’ll look at running an oversubscribed job. An oversubscribed job is when we set up more MPI tasks than we have cores. This might be the case if we have many blocks (eg. 120) and we wish to run them on the 24 cores of a single node.
#!/bin/bash
#SBATCH --job-name=my-oversubscribed-MPI-job
#SBATCH --nodes=1
#SBATCH --ntasks=120
#SBATCH --overcommit
module load mpi/openmpi-x86_64
mpirun e4mpi --job=myJob --run > LOGFILE
Note the main changes are to correctly specify the number of
MPI tasks (here we have 120) and to add the directive `--overcommit`.
Bunya: UQ RCC Cluster
Hardware: AMD EPYC 7313 Processors, 96 cores per node, Infiniband HDR cluster interconnect
Operating system: Rocky Linux v8.6 (Green Obsidian)
Queue system: SLURM
User Guide: Bunya User Guide
Compiling
Bunya is a new system commissioned by UQ’s Research Computing Centre and opened
in early 2023, featuring cutting-edge hardware and an extensive ecosystem of
software modules. Installation of Eilmer requires the following modules,
loaded in a file named `.bash_profile`, which lives in your home directory.
module load foss
module load python
You will need to install the LLVM D compiler. Version 1.30 has been tested with Bunya. RCC do not allow users to run any calculations, including compiling the code, on the login nodes. To build the code, please use the following command to request an interactive session.
$ salloc --nodes=1 --ntasks=1 --ntasks-per-core=1 --mem=50G --job-name=qcompile --time=00:30:00 --partition=general --account=a_hypersonics srun --export=PATH,TERM,HOME,LANG --pty /bin/bash -l
And then build an optimised version of eilmer 4 with the following commands.
$ cd gdtk/src/eilmer
$ make WITH_MPI=1 WITH_COMPLEX_NUMBERS=1 WITH_NK=1 FLAVOUR=fast install
Eilmer v5, aka lmr, can be built instead using:
$ cd gdtk/src/lmr
$ make FLAVOUR=fast install
It is possible to finish the process by setting up your environment variables in the normal way, but you may also elect to use the module file that gets installed with the code.
$ cd
$ mkdir -p privatemodules
$ mkdir -p privatemodules/gdtk
$ ln -s $HOME/gdtkinst/share/gdtk-module privatemodules/gdtk/production
Then add the following lines to your `.bash_profile`:
module use -a $HOME/privatemodules
module load gdtk/production
Make sure to log out and then log back in to refresh your terminal environment
after making changes to `.bash_profile`.
Running
Nodes on Bunya are relatively large, consisting of 96 physical cores and a staggering 2TB of RAM. An example submission script is shown below, which requests 32 cores and eight hours of compute time. Larger allocations can be requested by simply increasing the ntasks argument.
#!/bin/bash -login
#SBATCH --job-name=cultivation_nk
#SBATCH --account=a_hypersonics
#SBATCH --partition=general
#SBATCH --ntasks=32
#SBATCH --ntasks-per-core=1
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=10G
d=$(date +%b%d-%H%M)
srun e4-nk-dist --job=bc --snapshot-start=last
echo "... End mpirun"
Submission of a SLURM script is accomplished using:
$ sbatch submit.sh
Job status can be viewed using
$ squeue
Gadi: NCI Cluster
Hardware: Fujitsu servers, Intel ‘Cascade Lake’ processors: 48 cores-per-node (two 24-core CPUs), 192 Gigabytes of RAM per node, HDR InfiniBand interconnect
Operating system: Linux, CentOS 8
Queue system: PBSPro
Compiling
Gadi does not have a module for the ldc2 compiler, so you will need to install it. See the instructions on “Installing the LLVM D compiler”. In terms of modules, you only need to load the openmpi module. I have had success with the 4.0.2 version.
module load openmpi/4.0.2
As described for other systems, use the `install-transient-solvers.sh` script
to get an optimised build of the distributed-memory (MPI) transient solver.
cd gdtk/install-scripts
./install-transient-solvers.sh
To complete the install, remember to set your environment variables as described on the install page.
An example setup in .bash_profile
Here, I include what I have added to the end of my `.bash_profile` file on Gadi.
It sets my environment up to compile and run Eilmer.
Note also that I configure access to a locally installed version of the ldc2 compiler.
export DGD=${HOME}/gdtkinst
append_path PATH ${DGD}/bin
append_path PATH ${HOME}/opt/ldc2/bin
append_path PYTHONPATH ${DGD}/lib
export DGD_LUA_PATH=$DGD/lib/?.lua
export DGD_LUA_CPATH=$DGD/lib/?.so
module load openmpi/4.0.2
Running
As is common on large cluster computers, you will need to request the CPU resources of entire nodes if your job spans multiple nodes. On Gadi, that means CPU request numbers are in multiples of 48. Here is a submit script where I’ve set up 192 MPI tasks for my job.
#!/bin/bash
#PBS -N my-MPI-job
#PBS -P dc7
#PBS -l walltime=00:30:00
#PBS -l ncpus=192
#PBS -l mem=200GB
#PBS -l wd
#PBS -l storage=scratch/dc7
mpirun e4mpi --job=myJob --run > LOGFILE
Jobs on Gadi are submitted using `qsub`.
Take note of the `storage` directive that appears in the submission script.
It is strongly encouraged on Gadi to set up your jobs in the scratch
area associated with your project.
On Gadi, you need to request explicit access to the scratch area in your
submission script so that filesystem is available to your job.
The `storage` directive is used to request that access.
In the next example, I have set my job up to run in an oversubscribed mode. Here I have 80 MPI tasks but I’d like to use only 48 CPUs.
#!/bin/bash
#PBS -N my-oversubscribed-MPI-job
#PBS -P dc7
#PBS -l walltime=00:30:00
#PBS -l ncpus=48
#PBS -l mem=50GB
#PBS -l wd
#PBS -l storage=scratch/dc7
mpirun --oversubscribe -n 80 e4mpi --job=myJob --run > LOGFILE-oversubscribed
Setonix: Pawsey Supercomputing Centre’s Flagship Research Machine
Hardware: AMD EPYC “Milan” CPU nodes (65k cores total, 2x64 cores per node, 256 GB memory per node)
Operating system: Linux, SUSE
Queue system: SLURM
User Guide: support.pawsey.org.au/documentation/display/US/Setonix+User+Guide
Execution Test Date: May 2024
Compiling
As of May 2024, the default loaded modules on Setonix are appropriate for compiling eilmer. That default list is:
$ module list
Currently Loaded Modules:
1) craype-x86-milan 6) pawseyenv/2023.08 11) cray-libsci/23.02.1.1
2) libfabric/1.15.2.0 7) gcc/12.2.0 12) PrgEnv-gnu/8.3.3
3) craype-network-ofi 8) craype/2.7.20 13) pawsey
4) perftools-base/23.03.0 9) cray-dsmml/0.2.2 14) pawseytools
5) xpmem/2.5.2-2.4_3.47__gd0f7936.shasta 10) cray-mpich/8.1.25 15) slurm/22.05.2
As per other machines, you will need to install the LDC compiler, following the instructions in “Installing the LLVM D compiler”.
The latest LDC compiler in May 2024 works on Setonix (eg. 1.38.0).
You should try installing the latest stable LDC compiler.
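As a rough sketch (assuming the 1.38.0 pre-built release from the LDC GitHub releases page and the ${MYSOFTWARE} area described in the next paragraph), fetching and unpacking the compiler might look something like:
$ cd ${MYSOFTWARE}
$ wget https://github.com/ldc-developers/ldc/releases/download/v1.38.0/ldc2-1.38.0-linux-x86_64.tar.xz
$ tar xf ldc2-1.38.0-linux-x86_64.tar.xz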
Setonix uses the `.profile` file in your home directory for adding things to `$PATH`,
and long-term software can be installed in the `/software` directory located at:
/software/projects/PROJECTNAME/USERNAME
In this string, `PROJECTNAME` should be replaced with your project, and `USERNAME`
with your Pawsey ID, which should just be the username displayed at the command line.
The environment on Setonix conveniently sets a variable for you, `${MYSOFTWARE}`,
that has already taken care of constructing this string.
Go ahead and clone the gdtk repository into `/software`.
$ cd ${MYSOFTWARE}
$ git clone https://github.com/gdtk-uq/gdtk.git
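Assuming the source layout matches the Bunya instructions above, change into the lmr source directory before building:
$ cd gdtk/src/lmr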
Building and installing eilmer v5 with `FLAVOUR=fast` for Setonix can be accomplished in the following manner:
$ make FLAVOUR=fast WITH_MPICH=1 INSTALL_DIR=${MYSOFTWARE}/gdtkinst install
NOTE: What differs from other systems described on this page is the instruction `WITH_MPICH=1`. This is because the Setonix system uses the Cray MPICH library as its MPI layer.
An example setup in .profile
A summary of the required entries is as follows, using my file as an example:
export LDC2_PATH=${MYSOFTWARE}/ldc2-1.38.0-linux-x86_64
export PATH=${PATH}:${LDC2_PATH}/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${LDC2_PATH}/lib
export DGD_REPO=${MYSOFTWARE}/gdtk
export DGD=${MYSOFTWARE}/gdtkinst
export PATH=${PATH}:${DGD}/bin
export DGD_LUA_PATH=${DGD}/lib/?.lua
export DGD_LUA_CPATH=${DGD}/lib/?.so
export PYTHONPATH=${PYTHONPATH}:${DGD}/lib
export RUBYPATH=${RUBYPATH}:${DGD}/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${DGD}/lib
Running
Setonix uses SLURM to queue jobs. Unlike most clusters, users do not have access to mpirun directly, and srun is used to launch jobs through the queueing system. An example submission script is as follows.
#!/bin/bash -login
#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --exclusive
echo "Begin job, loaded modules are:"
module list
srun -m block:block:block lmrZ-mpi-run | tee -a log.txt
echo "... End job"
The `#SBATCH` options in this script specify various queuing options.
`account` is a required string to determine which allocation your job will be charged to.
`partition` should usually be left as work unless your job has unusual requirements.
`ntasks` is the total number of cores your job is split across.
`ntasks-per-node` is the number assigned to each physical node in the machine, which should not exceed 128.
`cpus-per-task` should generally be 1 for CFD jobs.
`time` is the maximum time allowed for the job: exceeding this limit will result in your simulation being automatically killed by the queuing system.
The final option, `exclusive`, specifies that you do not wish to share the node with jobs submitted by other people.
Using full nodes (in multiples of 128 cores) in exclusive mode is recommended for best performance, but be warned that you will be charged for all of the cores on all the nodes requested, whether you are using them or not.
The `-m block:block:block` option to `srun` is there to ensure
that threads are packed on contiguous cores.
You can read more about that at the link provided at the bottom of this section.
The script can be submitted using sbatch as follows.
$ sbatch submit.sh
More information about Setonix submission scripts and various examples can be found here.