Compiling and Running MPI Software
==================================

Introduction
------------

This guide is intended to give an overview of what is needed to compile and run MPI software on the ARC cluster systems.

The guide shows how to:

- Compile an MPI application,
- Prepare a job submission script and
- Submit the job.


About MPI
---------

MPI stands for Message Passing Interface, an interface standard that defines a number of library routines aimed at the programming of message-passing
(distributed-processing) applications.  The interface specifications were designed by a group of researchers from both academia and industry and cover
bindings for C, C++ and Fortran.

Being standardised, MPI programming leads to highly portable code.  Nevertheless, the MPI standard has many implementations in libraries (both commercial
and open source software), and the quality and performance of MPI libraries can differ significantly.

Any MPI library implementation has a number of tools that help programmers build and run MPI applications.  The main tools are:

- compiler utilities and
- an application run agent.

Compiler utilities (``mpicc``, ``mpiicc``, ``mpicxx``, ``mpif77``, ``mpif90``, ``mpiifort``) are used to compile and link MPI programs.
These are not compilers as such but wrappers around back-end compilers (e.g. the GNU or Intel compilers) and are designed to make compiling
and linking against the MPI library easy.

The run agent launches and manages the execution of an MPI executable on distributed computer systems.  This agent is called ``mpirun`` or ``mpiexec``,
with ``mpirun`` being the more frequently used of the two.
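Because the compiler utilities are only wrappers, it can be helpful to see which back-end command a wrapper would actually run. The sketch below assumes the installed wrappers accept their usual inspection options (``--showme`` for OpenMPI, ``-show`` for Intel MPI and other MPICH-derived libraries); ``show_backend`` is a hypothetical helper name, not something provided on the ARC systems.

```shell
#!/bin/bash
# show_backend WRAPPER
# Print the underlying compile/link command that an MPI wrapper would run.
# Tries the OpenMPI spelling first, then the MPICH / Intel MPI spelling.
# (show_backend is an illustrative helper, not an ARC-provided tool.)
show_backend() {
  wrapper="$1"
  if ! command -v "$wrapper" >/dev/null 2>&1; then
    echo "$wrapper: not found (load a toolchain module first)"
    return 1
  fi
  "$wrapper" --showme 2>/dev/null || "$wrapper" -show
}

# Harmless if no MPI module is loaded: the helper just reports that.
show_backend mpicc || true
```

With a toolchain module loaded, the printed line should start with the back-end compiler followed by the include and library options that the wrapper adds.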

MPI on the ARC systems
----------------------

The ARC clusters have two main MPI implementations installed; however, this guide is intended to be independent of any particular flavour of MPI.
The MPI libraries available per cluster system are presented below.

The MPI implementations OpenMPI and Intel-MPI are installed on the ARC and HTC clusters, and are optimised and configured to use the InfiniBand interconnect.
Each MPI implementation has several versions installed, and may be used with different compilers.  All installations are managed through the environment
module system.

 
Preparing and Running An Example
--------------------------------

Preparation

Log in to one of the ARC clusters and ensure you are running on an interactive node (this is important!).  Then create a directory in which to do some work and change into it.  The sequence of commands is::

  srun -p interactive --pty /bin/bash
  cd $DATA
  mkdir examples
  cd examples
 

Then, copy the ARC MPI example files to your newly created directory::

  cp /apps/common/examples/mpi/* .
 
Run the command ``ls`` to list the copied files.  Simple C (``cluster_myprog.c``) and Fortran (``cluster_myprog.f``) MPI example codes are provided.
There is also a submission script, ``slurm.sh``, which you can edit and adapt for the cluster on which you are running the example.

Compiling the application
-------------------------

The compilation and linking of an MPI program is managed by the compiler wrappers (``mpicc`` and ``mpif77`` for GCC, ``mpiicc`` and ``mpiifort`` for Intel)
and performed by the back-end compiler. The MPI wrapper scripts ensure the correct options for MPI operation are supplied to the compiler.

Toolchains
----------

The ARC and HTC systems have a number of compiler, MPI and maths library combinations grouped into toolchains, which are versioned every six months
(a and b versions). These are based upon the EasyBuild standard toolchain definitions to ensure reproducibility. For Intel compilers these are named
``intel`` and for GCC they are named ``foss`` (free open-source software).

For example the ``intel/2020a`` toolchain contains the following components::

  module load intel/2020a
  module list

  Currently Loaded Modules:
    1) GCCcore/9.3.0               3) binutils/2.34-GCCcore-9.3.0   5) impi/2019.7.217-iccifort-2020.1.217   7) imkl/2020.1.217-iimpi-2020a
    2) zlib/1.2.11-GCCcore-9.3.0   4) iccifort/2020.1.217           6) iimpi/2020a                           8) intel/2020a
 

The ``foss/2020a`` toolchain contains::

  module load foss/2020a
  module list

  Currently Loaded Modules:
    1) GCCcore/9.3.0                 4) GCC/9.3.0                      7) libxml2/2.9.10-GCCcore-9.3.0     10) OpenMPI/4.0.3-GCC-9.3.0   13) FFTW/3.3.8-gompi-2020a
    2) zlib/1.2.11-GCCcore-9.3.0     5) numactl/2.0.13-GCCcore-9.3.0   8) libpciaccess/0.16-GCCcore-9.3.0  11) OpenBLAS/0.3.9-GCC-9.3.0  14) ScaLAPACK/2.1.0-gompi-2020a
    3) binutils/2.34-GCCcore-9.3.0   6) XZ/5.2.5-GCCcore-9.3.0         9) hwloc/2.2.0-GCCcore-9.3.0        12) gompi/2020a               15) foss/2020a
 

Important Note for Intel toolchain users: When using the intel toolchain, the MPI build wrappers ``mpicc``, ``mpicxx`` and ``mpifc`` point to the GCC compilers. To
use the Intel compilers you should use the wrappers: ``mpiicc``, ``mpiicpc`` and ``mpiifort`` respectively. If you are using a third-party build which cannot be
easily modified, you can override the behaviour of the ``mpicc``, ``mpicxx`` and ``mpifc`` wrappers to use Intel compilers by setting the following environment
variables::

  export MPICH_CC=icc

  export MPICH_FC=ifort
  export MPICH_F90=ifort
  export MPICH_F77=ifort

  export MPICH_CPP="icc -E"

  export MPICH_CXX=icpc
  export MPICH_CCC=icpc
 
Other toolchains and versions can be made available; a list of EasyBuild supported versions can be found `here <https://docs.easybuild.io/en/master/version-specific/toolchains.html>`_. Please note that, due to operating system compatibility, the ARC systems only support ``foss/2018b``
and newer, and ``intel/2020a`` and newer.

Compilation
-----------

After loading your chosen toolchain module, compile one of the source files:

For the ``foss`` toolchain use::

  mpicc cluster_myprog.c -o cluster_myprog

Or (for the Fortran code)::

  mpif77 cluster_myprog.f -o cluster_myprog

 
For the ``intel`` toolchain use::

  mpiicc cluster_myprog.c -o cluster_myprog

Or (for the Fortran code)::

  mpiifort cluster_myprog.f -o cluster_myprog
 
Run the ``ls`` command to verify that the executable ``cluster_myprog`` was created.
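If the build is driven from a script, the choice of wrapper can be derived from the toolchain name. The following is a minimal sketch of that mapping, using only the wrapper names listed in this guide; ``wrapper_for`` is a hypothetical helper, and the script only prints the compile command rather than running it.

```shell
#!/bin/bash
# wrapper_for TOOLCHAIN LANGUAGE
# Print the compiler wrapper this guide uses for that combination.
# (wrapper_for is an illustrative helper, not an ARC-provided tool.)
wrapper_for() {
  case "$1:$2" in
    foss:c)        echo mpicc ;;
    foss:fortran)  echo mpif77 ;;
    intel:c)       echo mpiicc ;;
    intel:fortran) echo mpiifort ;;
    *) echo "unknown combination: $1 $2" >&2; return 1 ;;
  esac
}

toolchain=foss            # or: intel
cc=$(wrapper_for "$toolchain" c)

# Show the command that would build the C example for this toolchain.
echo "compile command: ${cc} cluster_myprog.c -o cluster_myprog"
```

Keeping the mapping in one place makes it harder to mix, for example, the ``intel`` module with the GCC-backed ``mpicc`` wrapper by accident.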

Preparing the submission script
-------------------------------

Edit the provided submission script, ``slurm.sh``, to enter the details of the job.  The key lines to pay attention to in the script are:

- the request for resources (number of nodes and walltime) 
- the chosen toolchain and
- the mpirun command.

The submission script should look like this for a foss toolchain build::

 #!/bin/bash

 #SBATCH --job-name=myprog
 #SBATCH --time=00:10:00
 #SBATCH --nodes=2
 #SBATCH --ntasks-per-node=8
 #SBATCH --constraint="[scratch:weka|scratch:gpfs]"
 #SBATCH --mail-type=BEGIN,END
 #SBATCH --mail-user=my.name@email.com

 module load foss/2020a

 mpirun ./cluster_myprog
 
or for an ``intel`` toolchain build::

 #!/bin/bash 

 #SBATCH --job-name=myprog 
 #SBATCH --time=00:10:00 
 #SBATCH --nodes=2 
 #SBATCH --ntasks-per-node=8
 #SBATCH --constraint="[scratch:weka|scratch:gpfs]"
 #SBATCH --mail-type=BEGIN,END 
 #SBATCH --mail-user=my.name@email.com

 module load intel/2020a 

 mpirun ./cluster_myprog
 
In this example, SLURM is instructed to allocate 2 nodes (``--nodes=2``) for 10 minutes (``--time=00:10:00``).  The run is also scheduled for 8 MPI processes per node (``--ntasks-per-node=8``); this maps each MPI process to a physical core, leading to a (generally) optimal run configuration.

N.B. ARC nodes have 48 cores each, but this example uses only 8 cores per node.
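The resource request above can be checked with a little arithmetic. The sketch below assumes that ``mpirun``, called without an explicit process count as in the scripts above, starts one MPI process per task in the SLURM allocation; the variable names are illustrative only.

```shell
#!/bin/bash
# Values copied from the example #SBATCH directives.
nodes=2
ntasks_per_node=8
cores_per_node=48   # ARC figure quoted in the text

# Total MPI processes = nodes x tasks per node.
total_ranks=$(( nodes * ntasks_per_node ))

# Cores on each node that this job does not use.
unused_cores_per_node=$(( cores_per_node - ntasks_per_node ))

echo "MPI processes in the job: ${total_ranks}"
echo "Cores per node not used by this job: ${unused_cores_per_node}"
```

Changing ``ntasks_per_node`` to 48 in this calculation corresponds to filling every core of an ARC node.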

The command line ``mpirun ./cluster_myprog`` runs the executable ``cluster_myprog`` built with the appropriate toolchain MPI library.

.. note::
  Specifying the scratch file system type is especially important if you are running a multi-node (MPI) code. If you do not specify a scratch constraint, then you might be allocated nodes with different scratch file systems which could cause problems for your job, even if you do not use scratch.

  The options are::

    --constraint="[scratch:weka|scratch:gpfs]"   - Use either WEKA or GPFS scratch
    --constraint="[scratch:weka]"                - Use WEKA scratch ONLY
    --constraint="[scratch:gpfs]"                - Use GPFS scratch **not recommended**


Running the application
-----------------------

After having prepared the submission script, submit the job with::

 sbatch slurm.sh

This will print a job number and return control to the Linux prompt at once.  Monitor its execution using the SLURM ``squeue`` command.
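When submitting from a script, the job number can be captured and passed to ``squeue``. This is a minimal sketch that assumes ``sbatch`` prints its usual confirmation line of the form ``Submitted batch job <number>``; ``job_id_from`` is a hypothetical helper name.

```shell
#!/bin/bash
# job_id_from "CONFIRMATION LINE"
# Extract the numeric job ID from sbatch's confirmation message,
# e.g. "Submitted batch job 123456" (format assumed, see lead-in).
job_id_from() {
  printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Only attempt a real submission where SLURM and the script are present.
if command -v sbatch >/dev/null 2>&1 && [ -f slurm.sh ]; then
  reply=$(sbatch slurm.sh)
  job_id=$(job_id_from "$reply")
  echo "Submitted job ${job_id}"
  squeue -j "${job_id}"      # show the state of this job only
else
  echo "sbatch or slurm.sh not found: run this in the examples directory on the cluster"
fi
```

SLURM also offers ``sbatch --parsable``, which prints the job ID without the surrounding text and avoids the parsing step.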

Checking the results
--------------------

After the job has run, you should have two email notifications (one for the start of the job, one for its end) and a couple of extra files in your directory.  The SLURM scheduler will create a single output file, ``slurm-XXXX.out``, where ``XXXX`` is the job ID number.

The output file ``slurm-XXXX.out`` should contain the output from the execution, which can be viewed with, for example::

 cat slurm-XXXX.out

The output should look like this (the lines appear in no particular order because the processes run in parallel)::

 Process  2  received  from process  1
 Process  9  received  from process  4
 Process  1  received  from process  0
 Process  15 received  from process  14
 Process  11 received  from process  10
 Process  13 received  from process  12
 Process  4  received  from process  3
 Process  6  received  from process  5
 Process  12 received  from process  11
 Process  10 received  from process  9
 Process  7  received  from process  6
 Process  8  received  from process  7
 Process  0  received  from process  16
 Process  2  received  from process  1
 Process  3  received  from process  2
 Process  5  received  from process  4
 Process  14 received  from process  13
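Since the lines arrive in no fixed order, sorting them by rank makes the output easier to check. The sketch below works on a small stand-in file (four made-up lines in the same format, not real job output) and assumes each process prints exactly one line; on the cluster, replace the stand-in with your own ``slurm-XXXX.out``.

```shell
#!/bin/bash
# Stand-in for slurm-XXXX.out: four made-up lines in the format shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 Process  2  received  from process  1
 Process  0  received  from process  3
 Process  3  received  from process  2
 Process  1  received  from process  0
EOF

# List the lines in order of the receiving rank (the second field).
sort -k2,2n "$sample"

# Receiving rank on the first line after sorting.
first_rank=$(sort -k2,2n "$sample" | head -n 1 | awk '{print $2}')

# Count the distinct receiving ranks that reported.
ranks_seen=$(awk '{print $2}' "$sample" | sort -un | awk 'END {print NR}')
echo "distinct ranks: ${ranks_seen}"
```

Under the one-line-per-process assumption, the count for a real run should match the number of MPI processes requested in the submission script.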

MPI Core Allocation (and OpenMP)
--------------------------------
 
In the above examples we have used the SLURM ``--ntasks-per-node`` option to allocate a single CPU core to each MPI process.  There may be occasions where we want to run fewer MPI processes per node, and instead use OpenMP for the remaining allocated cores. We can do this using the ``--cpus-per-task`` option.

Below is an example submission script (for OpenMPI) which requests two nodes with 1 MPI process each, where each MPI process can use 8 cores (for OpenMP) - so a total allocation of 16 cores::

 #!/bin/bash

 #SBATCH --nodes=2
 #SBATCH --ntasks-per-node=1
 #SBATCH --cpus-per-task=8
 #SBATCH --time=00:10:00
 #SBATCH --partition=devel

 module load mpitest/1.0

 mpirun --map-by numa:pe=${SLURM_CPUS_PER_TASK} mpisize
 

The ``mpisize`` command, provided by the ``mpitest`` module, outputs the following information::

 Hello from host "arc-c303". This is MPI task 1, the total MPI Size is 2, and there are 8 CPU core(s) allocated to *this* MPI task, these being { 0 1 2 3 4 5 6 7 }
 Hello from host "arc-c302". This is MPI task 0, the total MPI Size is 2, and there are 8 CPU core(s) allocated to *this* MPI task, these being { 0 1 2 3 4 5 6 7 }
 

From the results above we can see that, as expected, two MPI processes ran, one on node ``arc-c302`` and the other on ``arc-c303``, and each of these processes was allocated 8 CPUs.


Note: The mpirun option ``--map-by numa:pe=${SLURM_CPUS_PER_TASK}`` is not required if running with Intel MPI.
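For a hybrid MPI and OpenMP application, a common convention (not shown in the script above, and an assumption about your own code) is to set the OpenMP thread count from the cores SLURM has allocated to each task. The sketch below uses the standard ``OMP_NUM_THREADS`` variable; ``omp_threads`` is a hypothetical helper name.

```shell
#!/bin/bash
# omp_threads
# Print the thread count to use per MPI process: the value of
# SLURM_CPUS_PER_TASK inside a job that set --cpus-per-task,
# or 1 when the variable is absent (for example outside a job).
omp_threads() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Place this before the mpirun line of a submission script.
OMP_NUM_THREADS=$(omp_threads)
export OMP_NUM_THREADS
echo "OpenMP threads per MPI process: ${OMP_NUM_THREADS}"
```

With ``--cpus-per-task=8`` as in the example above, each MPI process would then run up to 8 OpenMP threads on its allocated cores.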