Install and run on HPC

This page describes how to install and run Eilmer on some of the HPC cluster systems our research group typically has access to. If you have access to another cluster, you might still find the information here useful: your local cluster’s setup is likely similar to one of those listed here, and you should be able to adapt the instructions with minimal hassle.

This section assumes you know how to set up your environment for running Eilmer on the cluster machine. What we give you here are the specific steps required to build and run the MPI version.

A note on D compiler versions

As D is a relatively new language, and particularly so in HPC environments, we find it best to bring your own D compiler. We recommend keeping a very recent version of the D compiler installed in your own account and committing to updating it on a regular basis. For optimised builds, use the LLVM-based compiler. Install notes for the LLVM D compiler (LDC) are available here.
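A minimal sketch of a user-local LDC install, assuming the login node has curl and outbound network access and that the official dlang install script suits your system, looks like the following. The version number in the activate path will match whatever version the script installs.

# Fetch and run the official dlang install script; this installs LDC under ~/dlang
# and needs no root access.
curl -fsS https://dlang.org/install.sh | bash -s ldc

# Activate the compiler in the current shell (adjust the version number to suit),
# or add the ldc2 bin directory to PATH in your shell profile.
source ~/dlang/ldc-1.38.0/activate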

A 3-minute introduction to queue systems

HPC systems are shared access systems for multiple users. The compute jobs are controlled by a queueing system. Different queueing systems are installed on various clusters, but they tend to have some commonalities. Those commonalities are: a way to submit your jobs to the queue; a way to check on your jobs' status; and a way to remove a job from the queue or terminate it while it’s running.

In the clusters we use regularly, there are two queueing systems: PBS(/Pro) and SLURM. Here is a brief summary on how to interact with those systems.

Preparing a submission script

Your compute job will run in batch mode on the cluster. This means you prepare the instructions (i.e. the list of commands you would type) ahead of time and place them in a text file. This is what we call the submission script. We also include some directives for the queue system itself in the submission script. So, there are two parts to a submission script:

  1. Directives for the queue system. Each of these lines starts with a # symbol followed by a phrase associated with the particular system. On PBS, use #PBS. On SLURM, use #SBATCH.
  2. Commands to launch your compute job. These are the commands as you would actually type them if running your job interactively at a terminal.

We need to add a caveat on that last statement. You do type commands as you would at the terminal to start your job, but remember your job is running in batch mode and you won’t see its output or error messages. For this reason, it’s common to redirect that output with the stdout redirect > and stderr redirect 2>. You will find this in the examples below.

Here is an example script to give those points above some concreteness.

#!/bin/bash
#SBATCH --job-name=my-MPI-job
#SBATCH --nodes=1
#SBATCH --ntasks=24

e4shared --job=myJob --run  > LOGFILE 2> ERRFILE

The first line sets our shell. The next three lines are directives for the SLURM queue manager. We won’t explain them here because they are specific to each HPC cluster; here we are just emphasising the general layout of a submission script. There is one command in this script: a command to launch Eilmer. Note that stdout is redirected to a file called LOGFILE and stderr to ERRFILE.

Save your submission script as a text file. The extension on the script doesn’t matter. I usually save PBS scripts with a .qsub extension and SLURM scripts with a .sbatch extension; those extensions remind me which queue system I prepared the job for.

Submitting a job

After saving your submission script into a file, you are ready to submit it to the queue system. The submission command in PBS is qsub. In SLURM, use sbatch. This is typed at the command prompt on the login node of the cluster. For example, assuming a PBS system and a submission script called run-my-job.qsub, type:

qsub run-my-job.qsub

The system will respond with a job number.
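The exact form of the response depends on the queue system. On a SLURM system, for example, the acknowledgement looks something like this, where the number is the job id:

Submitted batch job 3465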

Checking job status

Your job may not launch instantly. In fact, it might be several hours before your job starts. To check the status of your job (or jobs, if you have launched several), use the queue status command. On PBS, use qstat. On SLURM, use squeue.
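By default these commands list every job on the system. A common usage, since the -u option is available on both qstat and squeue, is to restrict the listing to your own jobs:

# SLURM: show only your jobs
squeue -u $USER

# PBS: show only your jobs
qstat -u $USER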

Removing a job

If your job runs successfully to completion, you don’t need to do anything. The job simply terminates and leaves output in your working directory.

Occasionally, you might detect an error in your script or you might have changed your mind about parameters in your job. You will need a way to remove your job, whether it is still waiting in the queue or already running on the cluster. Here is where your job number comes in handy. You can remove a job using the appropriate command followed by your job number. On PBS, delete a job using qdel. On SLURM, cancel a job using scancel.

This example shows what to type to remove a job with a job id 3465 on a SLURM system:

scancel 3465

As tempting as it might be to remove other people’s jobs so that yours starts sooner, these commands won’t let you do that: you can only remove your own jobs from the queue.

Summary of commands and where to find more information

There are several more commands available to interact with a queue system. However, the main three you will need are: a command to submit a job; a command to check status; and a command to remove a job if you need. These are summarised here.

Action              PBS                  SLURM
submit a job        qsub my-job.qsub     sbatch my-job.sbatch
check job status    qstat                squeue
remove a job        qdel 3654.pbs        scancel 3654

More information specific to each of the clusters is given in the sections below. You should also consult the user guide for your particular cluster for hints on queue submission, and you can use man to see the full list of options available for these queue commands. What I’ve introduced here is just their very basic usage.

Goliath: EAIT Faculty Cluster

Hardware: Dell servers, Intel CPUs, 24-cores per node

Operating system: CentOS Linux 7.4

Queue system: SLURM

Compiling

As this is a modern RedHat-flavoured system, you will need to load the openmpi module to both build and run e4mpi.

To compile:

module load mpi/openmpi-x86_64
cd gdtk/src/eilmer
make FLAVOUR=fast WITH_MPI=1 install

To complete the install, remember to set your environment variables as described on the install page.
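For reference, a minimal sketch of those environment settings, assuming the default installation directory of ${HOME}/gdtkinst, added to your ~/.bashrc or ~/.bash_profile might look like the following; see the install page for the authoritative list.

export DGD=${HOME}/gdtkinst
export PATH=${PATH}:${DGD}/bin
export DGD_LUA_PATH=${DGD}/lib/?.lua
export DGD_LUA_CPATH=${DGD}/lib/?.so
export PYTHONPATH=${PYTHONPATH}:${DGD}/lib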

Running

Where possible, try to use a full node at a time on Goliath. That means trying to set up your simulations to use 24 cores per node. Let’s look at an example SLURM submit script where you have 24 MPI tasks to run on 1 node. I’ve named this file run-on-goliath.slurm.

#!/bin/bash
#SBATCH --job-name=my-MPI-job
#SBATCH --nodes=1
#SBATCH --ntasks=24

module load mpi/openmpi-x86_64
mpirun e4mpi --job=myJob --run  > LOGFILE

We submit that job using sbatch:

sbatch run-on-goliath.slurm

Next, we’ll look at running an oversubscribed job. An oversubscribed job is when we set up more MPI tasks than we have cores. This might be the case if we have many blocks (e.g. 120) and we wish to run them on the 24 cores of a single node.

#!/bin/bash
#SBATCH --job-name=my-oversubscribed-MPI-job
#SBATCH --nodes=1
#SBATCH --ntasks=120
#SBATCH --overcommit

module load mpi/openmpi-x86_64
mpirun e4mpi --job=myJob --run  > LOGFILE

Note the main changes are to correctly specify the number of MPI tasks (here we have 120) and to add the directive --overcommit.

Bunya: UQ RCC Cluster

Hardware: AMD EPYC 7313 Processors, 96 cores per node, Infiniband HDR cluster interconnect

Operating system: Rocky Linux v8.6 (Green Obsidian)

Queue system: SLURM

User Guide: Bunya User Guide

Compiling

Bunya is a new system commissioned by UQ’s Research Computing Centre and opened in early 2023, featuring cutting-edge hardware and an extensive ecosystem of software modules. Installation of Eilmer requires the following modules, loaded in a file named .bash_profile, which lives in your home directory.

module load foss
module load python

You will need to install the LLVM D compiler. Version 1.30 has been tested with Bunya. RCC do not allow users to run any calculations, including compiling the code, on the login nodes. To build the code, please use the following command to request an interactive session.

$ salloc --nodes=1 --ntasks=1 --ntasks-per-core=1 --mem=50G --job-name=qcompile --time=00:30:00 --partition=general --account=a_hypersonics srun --export=PATH,TERM,HOME,LANG --pty /bin/bash -l
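Once the interactive session starts, it is worth checking that your locally installed LDC compiler is on your PATH (assuming you have added it in your .bash_profile):

$ ldc2 --version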

And then build an optimised version of eilmer 4 with the following commands.

$ cd gdtk/src/eilmer
$ make WITH_MPI=1 WITH_COMPLEX_NUMBERS=1 WITH_NK=1 FLAVOUR=fast install

Eilmer v5, aka lmr, can be built instead using:

$ cd gdtk/src/lmr
$ make FLAVOUR=fast install

It is possible to finish the process by setting up your environment variables in the normal way, but you may also elect to use the module file that gets installed with the code.

$ cd
$ mkdir -p privatemodules
$ mkdir -p privatemodules/gdtk
$ ln -s $HOME/gdtkinst/share/gdtk-module privatemodules/gdtk/production

Then add the following lines to your .bash_profile:

module use -a $HOME/privatemodules
module load gdtk/production

Make sure to log out and then log back in to refresh your terminal environment after making changes to .bash_profile.
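Alternatively, you can pick up the changes in your current session by sourcing the file directly:

$ source ~/.bash_profile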

Running

Nodes on Bunya are relatively large, consisting of 96 physical cores and a staggering 2TB of RAM. An example submission script is shown below, which requests 32 cores and eight hours of compute time. Larger allocations can be requested by simply increasing the ntasks argument.

#!/bin/bash  -login
#SBATCH --job-name=cultivation_nk
#SBATCH --account=a_hypersonics
#SBATCH --partition=general
#SBATCH --ntasks=32
#SBATCH --ntasks-per-core=1
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=10G

d=$(date +%b%d-%H%M)
srun e4-nk-dist --job=bc --snapshot-start=last

echo "... End mpirun"

Submission of a SLURM script is accomplished using:

$ sbatch submit.sh

Job status can be viewed using

$ squeue

Gadi: NCI Cluster

Hardware: Fujitsu servers, Intel ‘Cascade Lake’ processors: 48 cores-per-node (two 24-core CPUs), 192 Gigabytes of RAM per node, HDR InfiniBand interconnect

Operating system: Linux, CentOS 8

Queue system: PBSPro

Compiling

Gadi does not have a module for the ldc2 compiler, so you will need to install it. See the instructions on “Installing the LLVM D compiler”. In terms of modules, you only need to load the openmpi module. I have had success with the 4.0.2 version.

module load openmpi/4.0.2

As described for other systems, use the install-transient-solvers.sh script to get an optimised build of the distributed-memory (MPI) transient solver.

cd gdtk/install-scripts
./install-transient-solvers.sh

To complete the install, remember to set your environment variables as described on the install page.

An example setup in .bash_profile

Here, I include what I have added to the end of my .bash_profile file on Gadi. It sets my environment up to compile and run Eilmer. Note also that I configure access to a locally installed version of the ldc2 compiler.

export DGD=${HOME}/gdtkinst
append_path PATH ${DGD}/bin
append_path PATH ${HOME}/opt/ldc2/bin
append_path PYTHONPATH ${DGD}/lib

export DGD_LUA_PATH=$DGD/lib/?.lua
export DGD_LUA_CPATH=$DGD/lib/?.so

module load openmpi/4.0.2

Running

As is common on large cluster computers, you will need to request the CPU resources of whole nodes if your job spans multiple nodes. On Gadi, that means CPU request numbers are in multiples of 48. Here is a submit script where I’ve set up 192 MPI tasks for my job.

#!/bin/bash
#PBS -N my-MPI-job
#PBS -P dc7
#PBS -l walltime=00:30:00
#PBS -l ncpus=192
#PBS -l mem=200GB
#PBS -l wd
#PBS -l storage=scratch/dc7

mpirun e4mpi --job=myJob --run > LOGFILE

Jobs on Gadi are submitted using qsub.
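For example, if the script above were saved as run-on-gadi.qsub (a name chosen here for illustration; use whatever you called your script), the job is submitted with:

qsub run-on-gadi.qsub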

Take note of the storage directive that appears in the submission script. On Gadi you are strongly encouraged to set up your jobs in the scratch area associated with your project, and you need to request explicit access to that scratch area in your submission script so that the filesystem is available to your job. The storage directive is used to request that access.

In the next example, I have set my job up to run in an oversubscribed mode. Here I have 80 MPI tasks but I’d like to use only 48 CPUs.

#!/bin/bash
#PBS -N my-oversubscribed-MPI-job
#PBS -P dc7
#PBS -l walltime=00:30:00
#PBS -l ncpus=48
#PBS -l mem=50GB
#PBS -l wd
#PBS -l storage=scratch/dc7

mpirun --oversubscribe -n 80 e4mpi --job=myJob --run > LOGFILE-oversubscribed

Setonix: Pawsey Supercomputing Centre’s Flagship Research Machine

Hardware: AMD EPYC “Milan” CPU nodes (65k cores total, 2x64 cores per node, 256 GB memory per node)

Operating system: Linux, SUSE

Queue system: SLURM

User Guide: support.pawsey.org.au/documentation/display/US/Setonix+User+Guide

Execution Test Date: May 2024

Compiling

As of May 2024, the default loaded modules on Setonix are appropriate for compiling eilmer. That default list is:

$ module list

Currently Loaded Modules:
  1) craype-x86-milan                        6) pawseyenv/2023.08  11) cray-libsci/23.02.1.1
  2) libfabric/1.15.2.0                      7) gcc/12.2.0         12) PrgEnv-gnu/8.3.3
  3) craype-network-ofi                      8) craype/2.7.20      13) pawsey
  4) perftools-base/23.03.0                  9) cray-dsmml/0.2.2   14) pawseytools
  5) xpmem/2.5.2-2.4_3.47__gd0f7936.shasta  10) cray-mpich/8.1.25  15) slurm/22.05.2

As per other machines, you will need to install the LDC compiler, following the instructions in “Installing the LLVM D compiler”. The latest LDC compiler as of May 2024 (e.g. 1.38.0) works on Setonix; you should try installing the latest stable release. Setonix uses the .profile file in your home directory for adding things to $PATH, and long-term software can be installed in your /software directory, located at:

/software/projects/PROJECTNAME/USERNAME

In this string, PROJECTNAME should be replaced with your project name and USERNAME with your Pawsey ID, which is just the username displayed at the command line. The environment on Setonix conveniently sets a variable, ${MYSOFTWARE}, that has this string already constructed for you.
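You can check where it points with:

$ echo ${MYSOFTWARE}
/software/projects/PROJECTNAME/USERNAME

Your own project name and username will appear in place of the placeholders.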

Go ahead and clone the gdtk repository into /software.

$ cd ${MYSOFTWARE}
$ git clone https://github.com/gdtk-uq/gdtk.git

Building and installing eilmer v5 in FAST mode for Setonix can be accomplished in the following manner:

$ cd ${MYSOFTWARE}/gdtk/src/lmr
$ make FLAVOUR=fast WITH_MPICH=1 INSTALL_DIR=${MYSOFTWARE}/gdtkinst install

NOTE: What differs from other systems described on this page is the instruction WITH_MPICH=1. This is because the Setonix system uses the Cray MPICH library as its MPI layer.

An example setup in .profile

A summary of the required entries is as follows, using my file as an example:

export LDC2_PATH=${MYSOFTWARE}/ldc2-1.38.0-linux-x86_64
export PATH=${PATH}:${LDC2_PATH}/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${LDC2_PATH}/lib
export DGD_REPO=${MYSOFTWARE}/gdtk
export DGD=${MYSOFTWARE}/gdtkinst
export PATH=${PATH}:${DGD}/bin
export DGD_LUA_PATH=${DGD}/lib/?.lua
export DGD_LUA_CPATH=${DGD}/lib/?.so
export PYTHONPATH=${PYTHONPATH}:${DGD}/lib
export RUBYPATH=${RUBYPATH}:${DGD}/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${DGD}/lib

Running

Setonix uses SLURM to queue jobs. Unlike most clusters, users do not have access to mpirun directly; srun is used to launch jobs through the queueing system. An example submission script is as follows.

#!/bin/bash  -login

#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --exclusive

echo "Begin job, loaded modules are:"
module list

srun -m block:block:block lmrZ-mpi-run | tee -a log.txt
echo "... End job"

The #SBATCH options in this script specify various queuing options. account is a required string that determines which allocation your job will be charged to. partition should usually be left as work unless your job has unusual requirements. ntasks is the total number of cores your job is split across. ntasks-per-node is the number of tasks assigned to each physical node, which should not exceed 128. cpus-per-task should generally be 1 for CFD jobs. time is the maximum time allowed for the job; exceeding this limit will result in your simulation being killed automatically by the queuing system. The final option, exclusive, specifies that you do not wish to share the nodes with jobs submitted by other people.

Using full nodes (multiples of 128 cores) in exclusive mode is recommended for best performance, but be warned that you will be charged for all of the cores on all of the nodes requested, whether you are using them or not. The -m block:block:block option to srun ensures that threads are packed onto contiguous cores; you can read more about that at the link provided at the bottom of this section. The script can be submitted using sbatch as follows.

$ sbatch submit.sh

More information about Setonix submission scripts and various examples can be found here.