MPI and UPC Users Guide
MPI use depends upon the type of MPI being used. There are three fundamentally different modes of operation used by these various MPI implementations.
- Slurm directly launches the tasks and performs initialization of communications through the PMI2 or PMIx APIs. (Supported by most modern MPI implementations.)
- Slurm creates a resource allocation for the job and then mpirun launches tasks using Slurm's infrastructure (older versions of OpenMPI).
- Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm, such as SSH or RSH. These tasks are initiated outside of Slurm's monitoring or control. Slurm's epilog should be configured to purge these tasks when the job's allocation is relinquished. The use of pam_slurm_adopt is also strongly recommended.
Note: Slurm is not directly launching the user application in case 3, which may prevent the desired behavior of binding tasks to CPUs and/or accounting. Some versions of some MPI implementations work, so testing your particular installation may be required to determine the actual behavior.
Two Slurm parameters control which MPI implementation will be supported. Proper configuration is essential for Slurm to establish the proper environment for the MPI job, such as setting the appropriate environment variables. The MpiDefault configuration parameter in slurm.conf establishes the system default MPI to be supported. The srun option --mpi= (or the equivalent environment variable SLURM_MPI_TYPE) can be used to specify when a different MPI implementation is to be supported for an individual job.
Note: Use of an MPI implementation without the appropriate Slurm plugin may result in application failure. If multiple MPI implementations are used on a system then some users may be required to explicitly specify a suitable Slurm MPI plugin.
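For example, a site-wide default can be set in slurm.conf and overridden per job from the srun command line or the environment. This is only an illustrative sketch; the plugin names shown (pmix, pmi2) are the ones discussed below, and the choice of default is site-specific:

# slurm.conf (illustrative default)
MpiDefault=pmix

# Per-job override, either as an srun option or via the environment:
srun --mpi=pmi2 -n 4 ./a.out
export SLURM_MPI_TYPE=pmi2
srun -n 4 ./a.out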
Links to instructions for using several varieties of MPI/PMI with Slurm are provided below.
PMIx
Building PMIx
Before building PMIx, it is advisable to read these How-To Guides. They provide some details on building dependencies and installation steps, as well as some relevant notes with regard to Slurm support.
This section is intended to complement the PMIx FAQ with some notes on how to prepare Slurm and PMIx to work together. PMIx can be obtained from the official PMIx GitHub repository, either by cloning the repository or by downloading a packaged release.
Slurm support for PMIx was first included in Slurm 16.05 based on the PMIx v1.2 release. It has since been updated to support the PMIx v2.x and v3.x series, as per the following table:
- Slurm 16.05+ supports only the PMIx v1.x series, starting with v1.2.0. This Slurm version specifically does not support PMIx v2.x and above.
- Slurm 17.11+ supports both PMIx v1.2+ and v2.x but not v3.x.
- Slurm 18.08+ supports PMIx v1.2+, v2.x and v3.x.
Although it is recommended to build PMIx from the packaged releases, here is an example of how to build PMIx v2.1 by cloning the git repository:
user@testbox:~/git$ mkdir -p pmix/build/2.1 pmix/install/2.1
user@testbox:~/git$ cd pmix
user@testbox:~/git/pmix$ git clone https://github.com/openpmix/openpmix.git source
user@testbox:~/git/pmix$ cd source/
user@testbox:~/git/pmix/source$ git branch -a
user@testbox:~/git/pmix/source$ git checkout v2.1
user@testbox:~/git/pmix/source$ git pull
user@testbox:~/git/pmix/source$ ./autogen.sh
user@testbox:~/git/pmix/source$ cd ../build/2.1/
user@testbox:~/git/pmix/build/2.1$ ../../source/configure \
> --prefix=/home/user/git/pmix/install/2.1
user@testbox:~/git/pmix/build/2.1$ make -j install >/dev/null
user@testbox:~/git/pmix/build/2.1$ cd ../../install/2.1/
user@testbox:~/git/pmix/install/2.1$ grep PMIX_VERSION include/pmix_version.h
#define PMIX_VERSION_MAJOR 2L
#define PMIX_VERSION_MINOR 1L
user@testbox:~/git/pmix/install/2.1$
For the purpose of these instructions let's imagine PMIx v2.1 has been installed on the following path:
user@testbox:/home/user/git/pmix/install/2.1
Additional PMIx notes can be found in the SchedMD Publications and Presentations page.
Building Slurm with PMIx support
At configure time, Slurm looks by default for a PMIx installation under:
/usr
/usr/local
If PMIx isn't installed in either of these locations, the Slurm configure script can be pointed to the non-default location. Here's an example:
user@testbox:~/slurm/17.11/testbox/slurm$ ../../slurm/configure \
> --prefix=/home/user/slurm/17.11/testbox \
> --with-pmix=/home/user/git/pmix/install/2.1
Or the equivalent for RPM-based building:
[user@testbox Downloads]$ rpmbuild \
> --define '_prefix /home/user/slurm/17.11/testbox' \
> --define '_slurm_sysconfdir /home/user/slurm/17.11/testbox/etc' \
> --define '_with_pmix --with-pmix=/home/user/git/pmix/install/2.1' \
> -ta slurm-17.11.1.tar.bz2
NOTE: It is also possible to build against multiple PMIx versions with a ':' separator. For instance to build against 1.2 and 2.1:
...
> --with-pmix=/path/to/pmix/install/1.2:/path/to/pmix/install/2.1 \
...
NOTE: When submitting a job, the desired version can then be selected using any of the names reported by --mpi=list. The default for pmix will be the highest version of the library:
user@testbox:~/t$ srun --mpi=list
srun: MPI types are...
srun: pmix_v1
srun: pmi2
srun: none
srun: pmix
srun: pmix_v2
Continuing with the configuration, if Slurm is unable to locate the PMIx installation and/or finds it but considers it not usable, the configure output should log something like this:
checking for pmix installation... configure: WARNING: unable to locate pmix installation
Otherwise, if Slurm finds it and considers it usable, something like this:
checking for pmix installation... /home/user/git/pmix/install/2.1
Inspecting the generated config.log in the Slurm build directory might provide even more detail for troubleshooting purposes. After configuration, we can proceed to install Slurm (using make or rpm accordingly):
user@testbox:~/slurm/17.11/testbox/slurm$ make -j install > /dev/null
user@testbox:~/slurm/17.11/testbox/slurm$ cd ../lib/slurm/
user@testbox:~/slurm/17.11/testbox/lib/slurm$ ls -l | grep pmix
lrwxrwxrwx 1 user user      16 Dec  6 19:35 mpi_pmix.so -> ./mpi_pmix_v2.so
-rw-r--r-- 1 user user 7008260 Dec  6 19:35 mpi_pmix_v2.a
-rwxr-xr-x 1 user user    1040 Dec  6 19:35 mpi_pmix_v2.la
-rwxr-xr-x 1 user user 1020112 Dec  6 19:35 mpi_pmix_v2.so
user@testbox:~/slurm/17.11/testbox/lib/slurm$
If support for the PMI2 API is also needed, it can be installed from the contribs directory:
user@testbox:~/slurm/17.11/testbox/lib/slurm$ cd ../../slurm/contribs/pmi2
user@testbox:~/slurm/17.11/testbox/slurm/contribs/pmi2$ make -j install
user@testbox:~/slurm/17.11/testbox/slurm/contribs/pmi2$ cd ../../../lib
user@testbox:~/slurm/17.11/testbox/lib$ ls -l | grep pmi
-rw-r--r-- 1 user user 498144 Dec  6 20:05 libpmi2.a
-rwxr-xr-x 1 user user    958 Dec  6 20:05 libpmi2.la
lrwxrwxrwx 1 user user     16 Dec  6 20:05 libpmi2.so -> libpmi2.so.0.0.0
lrwxrwxrwx 1 user user     16 Dec  6 20:05 libpmi2.so.0 -> libpmi2.so.0.0.0
-rwxr-xr-x 1 user user 214512 Dec  6 20:05 libpmi2.so.0.0.0
-rw-r--r-- 1 user user 414680 Dec  6 19:35 libpmi.a
-rwxr-xr-x 1 user user   1011 Dec  6 19:35 libpmi.la
lrwxrwxrwx 1 user user     15 Dec  6 19:35 libpmi.so -> libpmi.so.0.0.0
lrwxrwxrwx 1 user user     15 Dec  6 19:35 libpmi.so.0 -> libpmi.so.0.0.0
-rwxr-xr-x 1 user user 230552 Dec  6 19:35 libpmi.so.0.0.0
user@testbox:~/slurm/17.11/testbox/lib$
NOTE: Since both Slurm and PMIx provide libpmi[2].so libraries, we recommend installing the two pieces of software in different locations. Otherwise, the libraries provided by Slurm and PMIx might both end up installed under standard locations like /usr/lib64, causing the package manager to error out and report the conflict. It is planned to alleviate that by putting these libraries in a separate libpmi-slurm package.
NOTE: If you are setting up a test environment using multiple-slurmd, the TmpFS option in your slurm.conf needs to be specified and the number of directory paths created needs to equal the number of nodes. These directories are used by the Slurm PMIx plugin to create temporary files and/or UNIX sockets. Here's an example setup for two nodes named compute[1-2]:
TmpFS=/home/user/slurm/17.11/testbox/spool/slurmd-tmpfs-%n

user@testbox:~/slurm/17.11/testbox$ mkdir spool/slurmd-tmpfs-compute1
user@testbox:~/slurm/17.11/testbox$ mkdir spool/slurmd-tmpfs-compute2
Testing Slurm and PMIx
It is possible to directly test Slurm and PMIx without the need for an MPI implementation to be installed. Here's an example indicating that both components work properly:
user@testbox:~/t$ srun --mpi=list
srun: MPI types are...
srun: pmi2
srun: none
srun: pmix
srun: pmix_v2
user@testbox:~/t$ srun --mpi=pmix -n2 -N2 \
> ~/git/pmix/build/2.1/test/pmix_client -n 2 --job-fence -c
OK
OK
user@testbox:~/t$ srun --mpi=pmix_v2 -n2 -N2 \
> ~/git/pmix/build/2.1/test/pmix_client -n 2 --job-fence -c
OK
OK
user@testbox:~/t$
OpenMPI
The current versions of Slurm and Open MPI support task launch using the srun command. This relies upon Slurm managing reservations of communication ports for use by Open MPI version 1.5.
If OpenMPI is configured with --with-pmi pointing to either Slurm's PMI-1 (libpmi.so) or PMI-2 (libpmi2.so) library, OMPI jobs can be launched directly using the srun command. This is the preferred mode of operation. If the pmi2 support is enabled, the option '--mpi=pmi2' must be specified on the srun command line. Alternately, configure 'MpiDefault=pmi2' in slurm.conf.
Starting with Open MPI version 3.1, PMIx version 2 is natively supported. To launch an Open MPI application using PMIx version 2, the '--mpi=pmix_v2' option must be specified on the srun command line or 'MpiDefault=pmix_v2' configured in slurm.conf. Open MPI version 4.0 adds support for PMIx version 3 and is invoked in the same way, with '--mpi=pmix_v3'.
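For example (a sketch using the options named above; the task counts are arbitrary):

srun --mpi=pmix_v2 -n 4 ./a.out    # Open MPI 3.1+ built with PMIx v2
srun --mpi=pmix_v3 -n 4 ./a.out    # Open MPI 4.0+ built with PMIx v3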
In Open MPI version 2.0, PMIx is natively supported too. To launch an Open MPI application using PMIx, the '--mpi=pmix' or '--mpi=pmix_v1' option has to be specified on the srun command line. It is also possible to build OpenMPI against an external PMIx installation using its --with-pmix configure option. That option can take one of three values: "internal", "external", or a valid directory name. "internal" (or no DIR value) forces Open MPI to use its internal copy of PMIx. "external" forces Open MPI to use an external installation of PMIx. Supplying a valid directory name also forces Open MPI to use an external installation of PMIx, and adds DIR/include, DIR/lib, and DIR/lib64 to the search path for headers and libraries. Note that Open MPI does not support --without-pmix. Note also that if building OpenMPI against an external PMIx installation, both OpenMPI and PMIx need to be built against the same libevent/hwloc installations, otherwise a warning is shown. The OpenMPI configure script provides the options --with-libevent=PATH and/or --with-hwloc=PATH to make OpenMPI match what PMIx was built against.
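As a sketch, building OpenMPI against the external PMIx installation used earlier in this guide could look like the following (the build directories, prefix and dependency paths are assumptions and must match your system):

user@testbox:~/git/ompi/build$ ../source/configure \
> --prefix=/home/user/git/ompi/install \
> --with-pmix=/home/user/git/pmix/install/2.1 \
> --with-libevent=/usr \
> --with-hwloc=/usr
user@testbox:~/git/ompi/build$ make -j install >/dev/null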
A set of environment variables is available to control the behavior of the Slurm PMIx plugin (a usage sketch follows this list):
- SLURM_PMIX_SRV_TMPDIR base directory for PMIx/server service files.
- SLURM_PMIX_TMPDIR base directory for applications session directories.
- SLURM_PMIX_DIRECT_CONN (default: yes) enables (1/yes/true) or disables (0/no/false) direct connections between slurmstepd processes; when disabled, Slurm RPCs are used for data exchange. Direct connections show better performance for fully packed nodes when PMIx is running in direct-modex mode.
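A minimal usage sketch, assuming the variable is exported in the environment from which the step is launched (the value is illustrative only; the defaults are normally appropriate):

# Disable direct slurmstepd-to-slurmstepd connections for this run (illustrative).
export SLURM_PMIX_DIRECT_CONN=false
srun --mpi=pmix -n 4 ./a.out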
For older versions of OMPI not compiled with PMI support, the system administrator must specify the range of ports to be reserved in the slurm.conf file using the MpiParams parameter. For example: MpiParams=ports=12000-12999
Alternatively, tasks can be launched using the srun command with the --resv-ports option, or using the environment variable SLURM_RESV_PORT, which is equivalent to always including --resv-ports on srun's command line. The ports reserved on every allocated node will be identified in an environment variable available to the tasks, as shown here: SLURM_STEP_RESV_PORTS=12000-12015
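As a quick check of the reservation mechanism (a sketch; the reported range depends on the MpiParams configuration and the step size):

$ srun --resv-ports -n2 env | grep SLURM_STEP_RESV_PORTS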
$ salloc -n4 sh   # allocates 4 processors and spawns shell for job
> srun a.out
> exit            # exits shell spawned by initial salloc command
or
> srun -n 4 a.out
or using the pmi2 support
> srun --mpi=pmi2 -n 4 a.out
or using the pmix support
> srun --mpi=pmix -n 4 a.out
If the ports reserved for a job step are found by the Open MPI library
to be in use, a message of this form will be printed and the job step
will be re-launched:
srun: error: sun000: task 0 unable to claim reserved port, retrying
After three failed attempts, the job step will be aborted.
Repeated failures should be reported to your system administrator in
order to rectify the problem by cancelling the processes holding those
ports.
NOTE: OpenMPI has a limitation in that it does not support calls to MPI_Comm_spawn() from within a Slurm allocation. If you need to use the MPI_Comm_spawn() function, you will need to use another MPI implementation combined with PMI-2, since PMIx doesn't support it either.
NOTE: Some kernels and system configurations have resulted in a locked memory limit too small for proper Open MPI functionality, resulting in application failure with a segmentation fault. This may be fixed by configuring the slurmd daemon to execute with a larger limit. For example, add "LimitMEMLOCK=infinity" to your slurmd.service file.
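With systemd, for example, this can be done through a drop-in file for the slurmd unit (the drop-in path below is an assumption; adjust it to your distribution):

# /etc/systemd/system/slurmd.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity

# Reload systemd and restart slurmd for the change to take effect:
systemctl daemon-reload
systemctl restart slurmd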
Intel MPI
Intel® MPI Library for Linux OS supports the following methods of launching MPI jobs under the control of the Slurm job manager:
- The mpirun command over the MPD Process Manager (PM)
- The mpirun command over the Hydra PM
- The mpiexec.hydra command (Hydra PM)
- The srun command (Slurm, recommended)
This description provides detailed information on all of these methods.
The mpirun Command over the MPD Process Manager
Slurm is supported by the mpirun command of the Intel® MPI Library 3.1 Build 029 for Linux OS and later releases.
When launched within a session allocated using the Slurm commands sbatch or salloc, the mpirun command automatically detects and queries certain Slurm environment variables to obtain the list of the allocated cluster nodes.
Use the following commands to start an MPI job within an existing Slurm session over the MPD PM:
export I_MPI_PROCESS_MANAGER=mpd
mpirun -n <num_procs> a.out
The mpirun Command over the Hydra Process Manager
Slurm is supported by the mpirun command of the Intel® MPI Library 4.0 Update 3 through the Hydra PM by default. The behavior of this command is analogous to the MPD case described above.
Use one of the following commands to start an MPI job within an existing Slurm session over the Hydra PM:
mpirun -n <num_procs> a.out
or
mpirun -bootstrap slurm -n <num_procs> a.out
We recommend that you use the second command. It uses the srun command rather than the default ssh based method to launch the remote Hydra PM service processes.
The mpiexec.hydra Command (Hydra Process Manager)
Slurm is supported by the Intel® MPI Library 4.0 Update 3 directly through the Hydra PM.
Use the following command to start an MPI job within an existing Slurm session:
mpiexec.hydra -bootstrap slurm -n <num_procs> a.out
The srun Command (Slurm, recommended)
This advanced method is supported by the Intel® MPI Library 4.0 Update 3. This method is the best integrated with Slurm and supports process tracking, accounting, task affinity, suspend/resume and other features. Use the following commands to allocate a Slurm session and start an MPI job in it, or to start an MPI job within a Slurm session already created using the sbatch or salloc commands:
- Set the I_MPI_PMI_LIBRARY environment variable to point to the Slurm Process Management Interface (PMI) library:
export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
NOTE: Due to licensing reasons, IMPI doesn't link directly to any external PMI implementation. Instead, one must point to the desired library by exporting the environment variable mentioned above, which will then be dlopened by IMPI. Also, there is no official support provided by Intel for use against PMIx libraries. Since IMPI is based on MPICH, using PMIx with Intel MPI may work, because PMIx maintains compatibility with PMI and PMI2 (the interfaces used in MPICH), but it is not guaranteed to run in all cases.
- Use the srun command to launch the MPI job:
srun -n <num_procs> a.out
Above information used by permission from Intel. For more information see Intel MPI Library.
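As a sketch, the two steps above can be combined into a batch script (the library path here is an assumption; point it at the libpmi.so of your Slurm installation):

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 8
# Assumed location of Slurm's PMI-1 library; adjust to your site.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun ./a.out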
MPICH (a.k.a. MPICH2)
MPICH2 jobs can be launched using the srun command with PMI version 1 or 2, or using mpiexec. All modes of operation are described below.
MPICH2 with srun and PMI version 2
MPICH2 must be built specifically for use with Slurm and PMI2 using a configure line similar to that shown below.
./configure --with-slurm=<PATH> --with-pmi=pmi2
The PATH must point to the Slurm installation directory, in other words the parent directory of bin and lib. In addition, if Slurm is not configured with MpiDefault=pmi2, then the srun command must be invoked with the option --mpi=pmi2 as shown in the example below.
srun -n4 --mpi=pmi2 ./a.out
The PMI2 support in Slurm works only if the MPI implementation supports it, in other words if the MPI has
the PMI2 interface implemented. The --mpi=pmi2 will load the library lib/slurm/mpi_pmi2.so
which provides the server side functionality but the client side must implement PMI2_Init()
and the other interface calls.
This does require that the MPICH installation has been built with the --with-pmi=pmi2 configure option.
To check if the MPI version you are using supports PMI2, check for PMI2_* symbols in the MPI library, as shown below.
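One way to check for those symbols is to list the MPI library's dynamic symbol table (a sketch; substitute the actual path to your MPI library):

$ nm -D /path/to/libmpi.so | grep ' PMI2_'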
Slurm provides a version of the PMI2 client library in the contribs directory. This library gets installed in the Slurm lib directory. If your MPI implementation supports PMI2 and you wish to use the Slurm provided library you have to link the Slurm provided library explicitly:
$ mpicc -L<path_to_pmi2_lib> -lpmi2 ...
$ srun -n20 a.out
MPICH2 with srun and PMI version 1
Link your program with Slurm's implementation of the PMI library so that tasks can communicate host and port information at startup. (The system administrator can add these options to the mpicc and mpif77 commands directly, so the user will not need to bother.) For example:
$ mpicc -L<path_to_slurm_lib> -lpmi ...
$ srun -n20 a.out
NOTES:
- Some MPICH2 functions are not currently supported by the PMI library integrated with Slurm
- Set the environment variable PMI_DEBUG to a numeric value of 1 or higher for the PMI library to print debugging information. Use srun's -l option for better clarity.
- Set the environment variable SLURM_PMI_KVS_NO_DUP_KEYS for improved performance with MPICH2 by eliminating a test for duplicate keys.
- The following environment variables can be used to tune performance depending upon network characteristics: PMI_FANOUT, PMI_FANOUT_OFF_HOST, and PMI_TIME. See the srun man page, INPUT ENVIRONMENT VARIABLES section, for more information. A sketch of setting these variables follows these notes.
- Information about building MPICH2 for use with Slurm is described on the MPICH2 FAQ web page and below.
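As a sketch (the values shown are illustrative assumptions, not recommendations):

export PMI_DEBUG=1                    # print PMI debugging information
export PMI_FANOUT=32                  # illustrative value
export PMI_FANOUT_OFF_HOST=32         # illustrative value
export PMI_TIME=500                   # illustrative value
export SLURM_PMI_KVS_NO_DUP_KEYS=1    # assumed value; see the note above
srun -l -n20 a.out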
MPICH2 with mpiexec
Do not add any flags to MPICH and build the default
(e.g. "./configure -prefix ..."; do NOT pass the --with-slurm,
--with-pmi, or --enable-pmiport options).
Do not add -lpmi to your application (it will force Slurm's PMI 1
interface, which doesn't support PMI_Spawn_multiple).
Launch the application using salloc to create the job allocation and mpiexec
to launch the tasks. A simple example is shown below.
salloc -N 2 mpiexec my_application
MPI_Comm_spawn now works fine, going through Hydra's PMI 1.1 interface.
MVAPICH (a.k.a. MVAPICH2)
MVAPICH2 supports launching multithreaded programs by Slurm as well as mpirun_rsh. Please note that if you intend to use srun, you need to build MVAPICH2 with Slurm support with a command line of this sort:
$ ./configure --with-pmi=pmi2 --with-pm=slurm
Use of Slurm's pmi2 plugin provides substantially higher performance and scalability than Slurm's pmi plugin. If pmi2 is not configured to be Slurm's default MPI plugin at your site, this can be specified using the srun command's "--mpi=pmi2" option as shown below or with the environment variable setting of "SLURM_MPI_TYPE=pmi2".
$ srun -n16 --mpi=pmi2 a.out
MVAPICH2 can be built using the following options:
--with-pmi=pmi2 \
--with-pm=slurm \
--with-slurm=<install directory> \
--enable-slurm=yes
For more information, please see the MVAPICH2 User Guide.
UPC (Unified Parallel C)
Berkeley UPC (and likely other UPC implementations) relies upon Slurm to allocate resources and launch the application's tasks. The UPC library then reads Slurm environment variables in order to determine the job's task count and location. One would build the UPC program in the normal manner, then initiate it using a command line of this sort:
$ srun -N4 -n16 a.out
Last modified 30 October 2020