Installation and usage of TURBOMOLE > Parallel Runs

TURBOMOLE + Slurm 22.05: MPI not working?


hermannschwaerzler:
Hi everybody,

we are having a problem with MPI parallelisation on our cluster (Slurm 22.05, TURBOMOLE 7.5 and 7.7 Demo).
As input we are using the benzene example from the tutorial, and this job script:

--- Code: ---#!/bin/bash
#SBATCH --job-name=turbomole-test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --hint=nomultithread
#SBATCH --mem-per-cpu=10G
#SBATCH --time=00:30:00

export TURBODIR=/path/to/TURBOMOLE
export PATH=$TURBODIR/scripts:$TURBODIR/mpirun_scripts:$PATH

## set locale to C
unset LANG
unset LC_CTYPE

# set stack size limit to unlimited:
ulimit -s unlimited

# Set environment variables for an MPI job
export PARA_ARCH=MPI
export PATH="${TURBODIR}/bin/`sysname`:${PATH}"
export PARNODES=$SLURM_NTASKS

ridft

--- End code ---

When we submit this job it seems to run, but the only output produced is this:

--- Code: ---STARTING ridft VIA YOUR QUEUING SYSTEM!
RUNNING PROGRAM /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi.
/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin/mpirun -machinefile NodeFile.1604633 -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE -genv I_MPI_PIN=off -genv OMP_STACK_SIZE=256M -genv LD_LIBRARY_PATH=/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/libfabric/lib:/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/lib/release:/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/lib:/path/to/TURBOMOLE/libso/em64t-unknown-linux-gnu_mpi /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi

--- End code ---

When we check with top on the node where the job is running, we see the situation in the attachment:

* ridft has detected that it has to use srun (which is good!).
* But it uses "-N 1 -n 1", i.e. it requests one node and only one task (although we asked for 4 in our job script).
* There is a hydra_pmi_proxy running that uses 100% of one CPU (but produces no output).
Why is this not working?
Are we missing anything?

uwe:
Hello,

please set:

export MPI_USESRUN=yes

in your submit script before calling ridft and check if that helps.
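
For example, the end of your job script would then look like this (only the new export line added, everything else unchanged from your script):

--- Code: ---# Set environment variables for an MPI job
export PARA_ARCH=MPI
export PATH="${TURBODIR}/bin/`sysname`:${PATH}"
export PARNODES=$SLURM_NTASKS

# rely on srun instead of a -machinefile for process placement
export MPI_USESRUN=yes

ridft

--- End code ---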

Why this might help:

For parallel MPI jobs, Turbomole calls a wrapper script instead of the binary: 'ridft' here is just a script that detects the queuing system and sets up the environment (see $TURBODIR/bin/em64t-unknown-linux-gnu_mpi/ridft). It is actually the same script for all Turbomole modules that run in parallel with MPI.

In there you will find the section that detects Slurm; the lines start with:

--- Code: ---#
# SLURM environment
#
if [ $DONE -eq 0 -a -n "${SLURMD_NODENAME}" ]; then
   # SLURM detected

--- End code ---

If srun is configured by the queuing system, Intel MPI uses it automatically, but by default the Turbomole script also creates a file with the names of all nodes. It is called NodeFile.<PID_of_the_script>, in your case NodeFile.1604633, and is passed to the mpirun command via the command-line option -machinefile.

That is how these lines in your output are generated:
RUNNING PROGRAM /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi.
/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin/mpirun -machinefile NodeFile.1604633 -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE/[...]

With srun the -machinefile option might not be helpful and can prevent Intel MPI from starting the right number of processes on the right nodes.

Setting $MPI_USESRUN to some value (it just must not be empty) makes the script skip the -machinefile option.
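
Roughly, the logic in the wrapper has this shape (a simplified sketch, not the actual Turbomole script; how the node file is filled, e.g. via scontrol, is an assumption here):

--- Code: ---# simplified sketch of the wrapper behaviour, not the real Turbomole script
if [ -n "${MPI_USESRUN}" ]; then
    # leave process placement to Intel MPI / srun
    MACHINEFILE_OPT=""
else
    # write the node names to NodeFile.<PID of the script> ...
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "NodeFile.$$"
    # ... and hand that file to mpirun
    MACHINEFILE_OPT="-machinefile NodeFile.$$"
fi
mpirun $MACHINEFILE_OPT -genv OMP_NUM_THREADS=1 ridft_mpi

--- End code ---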



hermannschwaerzler:
Thanks for the hint.
I tried that. Some changes are visible, but the basic problem still exists:

--- Code: --- `- slurmstepd: [63509.batch]
     `- /bin/bash /var/spool/slurm/slurmd/job63509/slurm_script
         `- /bin/bash /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft
             `- /bin/sh /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin/mpirun -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE -genv I_MPI_PIN=off -genv OMP_STACK_SIZE=256M -genv LD_LIBRARY_PATH=/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/libfabric/lib:/gpfs/g+
                 `- mpiexec.hydra -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE -genv I_MPI_PIN=off -genv OMP_STACK_SIZE=256M -genv LD_LIBRARY_PATH=/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/libfabric/lib:/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel+
                     `- /usr/slurm/bin/srun -N 1 -n 1 --nodelist n054 --input none /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin//hydra_bstrap_proxy --upstream-host n054.domain --upstream-port 34219 --pgid 0 --launcher slurm --launcher-number 1 --base-path /path/to/TURBOM+
                         `- /usr/slurm/bin/srun -N 1 -n 1 --nodelist n054 --input none /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin//hydra_bstrap_proxy --upstream-host n054.domain --upstream-port 34219 --pgid 0 --launcher slurm --launcher-number 1 --base-path /path/to/TU+
 `- slurmstepd: [63509.0]
     `- /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9

--- End code ---

We have a cluster (not the central one) in our Chemistry department where this works; there, hydra_pmi_proxy starts the requested number of ridft_mpi processes. In our case, as you can see, hydra_pmi_proxy starts no such processes.

So sorry, no: MPI_USESRUN=1 did not help. :-(

hermannschwaerzler:
OK, I found the problem and kind of a solution:

It looks like the Intel MPI installation that is shipped with TURBOMOLE (in mpirun_scripts/IMPI) is somehow broken, at least in our setup.
Since we have our own version of Intel MPI available on our systems, I "fixed" this by "patching" mpirun_scripts/IMPI/intel64/bin/mpirun so that the corresponding module of our system is loaded right before mpiexec.hydra is called (around line 66).
The hydra_pmi_proxy that is used now does the right thing and starts the appropriate number of (in my case) ridft_mpi processes.
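
For reference, the change amounts to something like this (a sketch only; the modules init path and module name are site-specific, and the exact insertion point depends on the mpirun script version):

--- Code: ---# in $TURBODIR/mpirun_scripts/IMPI/intel64/bin/mpirun, right before
# mpiexec.hydra is invoked (around line 66 in our copy):
. /etc/profile.d/modules.sh   # make the 'module' command available (site-specific path)
module load intel-mpi         # site-specific module name
mpiexec.hydra "$@"

--- End code ---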

So it does work now, but with one major limitation: mpiexec.hydra uses

srun -N 1 -n 1 ...

to start a hydra_bstrap_proxy, which in turn starts said hydra_pmi_proxy, which starts the processes that do the actual calculation. But since we use cgroups to limit access to CPUs, these srun options have the effect that, if the job is configured to run more than one task per node, all those tasks on a node run on the very same single CPU!
Which is obviously a performance problem.
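
(One way to see this on the compute node, assuming taskset and pgrep are available there, is to check the CPU affinity of the compute processes:)

--- Code: ---# print the list of CPUs each ridft_mpi process is allowed to run on
for pid in $(pgrep -f ridft_mpi); do
    taskset -cp "$pid"
done

--- End code ---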

So I guess I am still missing something?

uwe:
Hi,

did you try to completely replace the Intel MPI version that comes with Turbomole with your local one? I tried the MPI binaries with both Intel MPI 2019 and 2021, and the runtime libraries seem to be compatible. Just remove the content of $TURBODIR/mpirun_scripts/IMPI (it should be just an intel64 folder) and link intel64 to your local Intel MPI installation.
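
Something along these lines (a sketch; the path of your local Intel MPI is just an example and must point to the directory that contains the usual bin/ and lib/ folders):

--- Code: ---cd $TURBODIR/mpirun_scripts/IMPI
mv intel64 intel64.turbomole           # keep the shipped version as a backup
# example path only -- replace with your site's Intel MPI installation
ln -s /opt/intel/oneapi/mpi/2021.7.0 intel64

--- End code ---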

Turbomole is not using anything 'special' when it comes to MPI, so if your Intel MPI works well with your queuing system, it should also run Turbomole correctly.
