Author Topic: TURBOMOLE + Slurm 22.05: MPI not working? (Read 5615 times)

hermannschwaerzler · « **on:** February 28, 2023, 03:57:26 PM »

Hi everybody,

we are having a problem running MPI-parallel jobs on our cluster (Slurm 22.05, TURBOMOLE 7.5 and 7.7 Demo).
As input we are using the benzene example from the tutorial, and this is our job script:

Code:

#!/bin/bash
#SBATCH --job-name=turbomole-test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --hint=nomultithread
#SBATCH --mem-per-cpu=10G
#SBATCH --time=00:30:00

export TURBODIR=/path/to/TURBOMOLE
export PATH=$TURBODIR/scripts:$TURBODIR/mpirun_scripts:$PATH

## set locale to C
unset LANG
unset LC_CTYPE

# set stack size limit to unlimited:
ulimit -s unlimited

# Set environment variables for an MPI job
export PARA_ARCH=MPI
export PATH="${TURBODIR}/bin/`sysname`:${PATH}"
export PARNODES=$SLURM_NTASKS

ridft

When we submit this job it seems to run, but the only output that is produced is this:

Code:

STARTING ridft VIA YOUR QUEUING SYSTEM!
RUNNING PROGRAM /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi.
/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin/mpirun -machinefile NodeFile.1604633 -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE -genv I_MPI_PIN=off -genv OMP_STACK_SIZE=256M -genv LD_LIBRARY_PATH=/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/libfabric/lib:/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/lib/release:/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/lib:/path/to/TURBOMOLE/libso/em64t-unknown-linux-gnu_mpi /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi

When we check with top on the node where it is running we see the situation in the attachment:

ridft has detected that it has to use srun (which is good!).
But it uses "-N 1 -n 1" i.e. it requests one node and only one task (although we asked for 4 in our job-script).
There is a hydra_pmi_proxy running that uses 100% of one CPU (but produces no output).

Why is this not working?
Are we missing anything?

uwe · « **Reply #1 on:** March 02, 2023, 09:28:28 PM »

Hello,

please set:

export MPI_USESRUN=yes

in your submit script before calling ridft and check if that helps.

Why this might help:

For parallel MPI jobs Turbomole calls a script instead of the binary, so 'ridft' here is just a script that checks for the queuing system and sets up the environment (see $TURBODIR/bin/em64t-unknown-linux-gnu_mpi/ridft). The same script is used for all Turbomole modules that run in parallel using MPI.

In there you will find the section that detects SLURM, the lines start with:

#
# SLURM environment
#
if [ $DONE -eq 0 -a -n "${SLURMD_NODENAME}" ]; then
   # SLURM detected

If srun is configured by the queuing system, Intel MPI uses it automatically, but by default the Turbomole script creates a file listing the names of all nodes. This file is named NodeFile.<PID_of_the_script> (in your case NodeFile.1604633) and is passed to the mpirun command with the command line option -machinefile.

In your output that's how the lines:
RUNNING PROGRAM /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi.
/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin/mpirun -machinefile NodeFile.1604633 -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE/[...]
are generated.

When srun is used, the -machinefile option may be unhelpful and can prevent Intel MPI from starting the right number of processes on the right nodes.

Setting $MPI_USESRUN to some value (it just should not be empty) avoids adding the -machinefile option.
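The decision described above can be pictured roughly as follows. This is only a sketch and not the actual code of the Turbomole wrapper script: the function name build_mpirun_opts and its argument are hypothetical, and only MPI_USESRUN, the NodeFile name and -machinefile are taken from the description.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the real Turbomole wrapper: shows how a non-empty
# MPI_USESRUN could suppress the -machinefile option.
build_mpirun_opts() {
    local nodefile="$1"    # e.g. NodeFile.<PID_of_the_script>
    if [ -n "${MPI_USESRUN}" ]; then
        # Any non-empty value counts: leave process placement to srun.
        echo ""
    else
        # Default: hand the generated node list to mpirun.
        echo "-machinefile ${nodefile}"
    fi
}

unset MPI_USESRUN
build_mpirun_opts "NodeFile.$$"    # prints "-machinefile NodeFile.<pid>"
export MPI_USESRUN=yes
build_mpirun_opts "NodeFile.$$"    # prints an empty line
```

The value itself is irrelevant in this sketch, which matches the statement that the variable just must not be empty.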

hermannschwaerzler · « **Reply #2 on:** March 03, 2023, 10:06:55 AM »

Thanks for the hint.
I tried that. There are some changes visible but the basic problem still exists:

Code:

 `- slurmstepd: [63509.batch]
     `- /bin/bash /var/spool/slurm/slurmd/job63509/slurm_script
         `- /bin/bash /path/to/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/ridft
             `- /bin/sh /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin/mpirun -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE -genv I_MPI_PIN=off -genv OMP_STACK_SIZE=256M -genv LD_LIBRARY_PATH=/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/libfabric/lib:/gpfs/g+
                 `- mpiexec.hydra -genv OMP_NUM_THREADS=1 -genv TURBODIR=/path/to/TURBOMOLE -genv I_MPI_PIN=off -genv OMP_STACK_SIZE=256M -genv LD_LIBRARY_PATH=/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/libfabric/lib:/path/to/TURBOMOLE/mpirun_scripts/IMPI/intel+
                     `- /usr/slurm/bin/srun -N 1 -n 1 --nodelist n054 --input none /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin//hydra_bstrap_proxy --upstream-host n054.domain --upstream-port 34219 --pgid 0 --launcher slurm --launcher-number 1 --base-path /path/to/TURBOM+
                         `- /usr/slurm/bin/srun -N 1 -n 1 --nodelist n054 --input none /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin//hydra_bstrap_proxy --upstream-host n054.domain --upstream-port 34219 --pgid 0 --launcher slurm --launcher-number 1 --base-path /path/to/TU+
 `- slurmstepd: [63509.0]
     `- /path/to/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9

We have another cluster (not the central one) in our Chemistry department where this works. When it works, hydra_pmi_proxy starts the requested number of ridft_mpi processes. In our case, as you can see, hydra_pmi_proxy starts no such process.

So sorry no, MPI_USESRUN=1 did not help. :-(

hermannschwaerzler · « **Reply #3 on:** March 03, 2023, 04:32:21 PM »

OK, I found the problem and kind of a solution:

It looks like the Intel MPI installation that is shipped with TURBOMOLE (in mpirun_scripts/IMPI) is somehow (at least in our setup) borked.
As we have our own version of Intel MPI available on our systems, I "fixed" this by "patching" mpirun_scripts/IMPI/intel64/bin/mpirun so that the corresponding module of our system is loaded right before mpiexec.hydra is called (around line 66).
The hydra_pmi_proxy that is used now is doing the right thing and starts the appropriate number of (in my case) ridft_mpi processes.

So it does work now, but with one major limitation: mpiexec.hydra is running in a way that it uses

srun -N 1 -n 1 ...

to start a hydra_bstrap_proxy that in turn starts said hydra_pmi_proxy, which starts the processes that do the calculation. But as we are using cgroups to limit access to CPUs, these srun options have the effect that, if the job is configured to run more than one task per node, all tasks on a node run on the very same single CPU!
Which is obviously a performance problem.
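One way to check this from inside a job is to compare the CPUs the batch step may use with the CPUs a step started via "srun -N 1 -n 1" may use. The following is a minimal sketch, assuming a Linux node where /proc/<pid>/status contains a Cpus_allowed_list line; the helper name allowed_cpus is made up for illustration.

```shell
#!/bin/bash
# Print the CPUs a process may run on, read from a /proc/<pid>/status file.
# With no argument, the status of the calling environment is used.
allowed_cpus() {
    local status_file="${1:-/proc/self/status}"
    awk '/^Cpus_allowed_list:/ {print $2}' "$status_file"
}

# Show the CPUs available to the current shell (Linux only).
if [ -r /proc/self/status ]; then
    echo "allowed CPUs here: $(allowed_cpus)"
fi

# Inside a job script one could compare this with a single-task step:
#   srun -N 1 -n 1 grep Cpus_allowed_list /proc/self/status
```

If the single-task step reports only one CPU while the batch step reports all CPUs of the job, every process started below that step would be confined to that one CPU, which would match the behaviour described above.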

So I guess I am still missing something?

uwe · « **Reply #4 on:** March 05, 2023, 12:36:22 PM »

Hi,

did you try to completely replace the Intel MPI version that comes with Turbomole with your local one? I tried the MPI binaries with both Intel MPI version 2019 and 2021, and the runtime libraries seem to be compatible. Just remove the content of $TURBODIR/mpirun_scripts/IMPI (it should be just an intel64 folder) and link intel64 to your local Intel MPI version.
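As a rough sketch, the replacement could look like this. The function name relink_impi is made up, the site path in the example call is a placeholder, and it is an assumption that the linked directory must contain bin/mpirun next to the lib directories. Unlike the plain removal suggested above, the sketch moves the bundled copy aside so that it can be restored.

```shell
#!/bin/bash
# Sketch: swap the bundled Intel MPI for a site installation via a symlink.
relink_impi() {
    local turbodir="$1"      # TURBOMOLE installation directory
    local local_impi="$2"    # ASSUMPTION: site Intel MPI dir containing bin/ and lib/
    local impi_dir="${turbodir}/mpirun_scripts/IMPI"

    # Refuse to link to a directory that has no mpirun launcher.
    if [ ! -x "${local_impi}/bin/mpirun" ]; then
        echo "no executable mpirun in ${local_impi}/bin" >&2
        return 1
    fi
    # Keep the bundled copy under a new name instead of deleting it.
    if [ -e "${impi_dir}/intel64" ] && [ ! -L "${impi_dir}/intel64" ]; then
        mv "${impi_dir}/intel64" "${impi_dir}/intel64.bundled" || return 1
    fi
    ln -sfn "${local_impi}" "${impi_dir}/intel64"
}

# Example call (the second path is a placeholder for the site installation):
#   relink_impi "$TURBODIR" /path/to/local/intel-mpi
```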

Turbomole is not using anything 'special' when it comes to MPI, so if your Intel MPI works well with your queuing system, it should also run Turbomole correctly.

hermannschwaerzler · « **Reply #5 on:** March 07, 2023, 04:37:11 PM »

Hi uwe,

thanks for your confirmation that I went in the right direction. :-)
Loading that module in mpirun does imo essentially replace the IntelMPI installation of TURBOMOLE as it prepends PATH, LD_LIBRARY_PATH and others with directories that point to the working IntelMPI version.

So this is the official way of doing things?
Why does the provided version not work out of the box?

In our case I had to go one step further and had to add

export TM_MPIADDOPT="-bootstrap ssh"

to my job-script in order to make IntelMPI use all the CPUs of the node it was running on (without this option it was running all processes on one single CPU). Unfortunately this works only for single-node jobs, but my "customers" are happy with that.
When more than one node is used with this option, the processes on the other nodes are started via ssh, outside the control of Slurm, which is a no-go for us.
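To make sure the ssh bootstrap is never used for multi-node jobs, the job script could set the option only when Slurm reports exactly one node. This is a minimal sketch; the function name is made up, and SLURM_JOB_NUM_NODES is the variable Slurm sets inside a job allocation.

```shell
#!/bin/bash
# Sketch: only switch Intel MPI's bootstrap to ssh for single-node jobs, so
# that no process is started on another node outside Slurm's control.
set_bootstrap_for_single_node() {
    if [ "${SLURM_JOB_NUM_NODES:-0}" -eq 1 ]; then
        export TM_MPIADDOPT="-bootstrap ssh"
    else
        # Leave TM_MPIADDOPT untouched and tell the user why.
        echo "not a single-node job, ssh bootstrap not enabled" >&2
    fi
}

set_bootstrap_for_single_node
echo "TM_MPIADDOPT=${TM_MPIADDOPT:-<unset>}"
```

Called before ridft in the job script, this keeps the workaround limited to the one case where it is acceptable here.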

Regards,
Hermann

uwe · « **Reply #6 on:** March 07, 2023, 11:20:21 PM »

Hello,

there is no 'official' way when it comes to queuing systems. We also use SLURM and do not have similar issues, but we just use the default settings.

Quote

It looks like the Intel MPI installation that is shipped with TURBOMOLE (in mpirun_scripts/IMPI) is somehow (at least in our setup) borked.

Intel MPI in Turbomole is not changed or modified in any way, it is simply 'pure'. So I do not see any reason why it should behave differently in Turbomole.
If you have another application on your cluster that runs fine with Intel MPI, using the very same settings might help.

Quote

But as we are using cgroups to limit access to CPUs these srun-options have the effect that if the job is configured to run more than one task per node, all those tasks of one node run on the very same single CPU!

There have indeed been several cases where cgroups and cpuset (and most likely any other tool that tries to pin tasks to CPU cores) caused similar problems when running MPI jobs. Intel MPI by default enables process pinning. So if SLURM, cgroups and Intel MPI all try to pin the processes to CPU cores in an unlucky way, it might well happen that they all run on the same core... We use the default Intel MPI pinning and do not have anything else in use, perhaps this helps to avoid such problems. But it seems that not using cgroups is not an option for you.

Did you try to set I_MPI_PIN=0 to disable pinning?
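To see whether a pinning change has any effect, one could count on how many distinct cores the ridft_mpi processes are currently running. This is a small sketch using the psr column of ps (the processor a process last ran on); the helper name count_distinct_cores is made up, and the result is only a snapshot, so it is worth repeating a few times.

```shell
#!/bin/bash
# Count distinct core numbers read from stdin, one number per line
# (for example the "psr" column printed by ps).
count_distinct_cores() {
    awk 'NF && !($1 in seen) { seen[$1] = 1; n++ } END { print n + 0 }'
}

# Four processes that all sit on core 5:
printf '5\n5\n5\n5\n' | count_distinct_cores    # prints 1

# On the compute node, while the job is running:
#   ps -C ridft_mpi -o psr= | count_distinct_cores
```

A result of 1 for a job with several tasks on the node would indicate that all processes share a single core.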
