Author Topic: SLURM multinode parallel problem  (Read 3197 times)

saikat403

  • Newbie
  • *
  • Posts: 7
  • Karma: +0/-0
SLURM multinode parallel problem
« on: September 20, 2021, 12:40:41 PM »
Hello all,
I am using TURBOMOLE 7.3 with the SLURM queuing system. Both ridft and rdgrad run perfectly fine on a single node with 40 processes. However, when I try to run in parallel on more than one node, rdgrad seems to cause a problem. Here is the SLURM script I am using:
Code: [Select]
#!/bin/bash
#SBATCH -J turbo-test   
#SBATCH -p standard-low
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH -t 00:15:00    # walltime in HH:MM:SS, Max value 72:00:00

export TURBODIR=/home/17cy91r04/apps/turbomole730
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/`$TURBODIR/scripts/sysname`:$PATH
export TURBOMOLE_SYSNAME=x86_64-unknown-linux-gnu
export PATH=$TURBODIR/bin/${TURBOMOLE_SYSNAME}_mpi:$TURBODIR/mpirun_scripts:$TURBODIR/scripts:$PATH
export PARNODES=$SLURM_NTASKS
jobex -ri -c 999 > jobe.out

The error I get in job.1 looks like this:
Code: [Select]
rdgrad_mpi         0000000000433429  Unknown               Unknown  Unknown
rdgrad_mpi         00000000029265FD  Unknown               Unknown  Unknown
rdgrad_mpi         0000000000476960  dlp3_                     427  dlp3.f
libpthread-2.17.s  00007FD82B82C630  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F086A13C630  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
libpthread-2.17.s  00007FCBAA17B630  Unknown               Unknown  Unknown
rdgrad_mpi         0000000000476960  dlp3_                     427  dlp3.f
rdgrad_mpi         0000000000471B2F  twoder_                   418  twoder.f
rdgrad_mpi         0000000000480C0E  dasra3_                   155  dasra3.f
rdgrad_mpi         0000000000480C0E  dasra3_                   155  dasra3.f
rdgrad_mpi         0000000000480C0E  dasra3_                   155  dasra3.f
rdgrad_mpi         0000000000471B2F  twoder_                   418  twoder.f

Is there anything wrong with the SLURM script?
How can I run TURBOMOLE in parallel across several nodes?
Thanks in advance.

with regards,
saikat


uwe

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 558
  • Karma: +0/-0
Re: SLURM multinode parallel problem
« Reply #1 on: September 20, 2021, 12:56:50 PM »
Hello,

Segmentation faults are often caused by memory limits that are too small; the stack size limit in particular is a common cause of such crashes.

See: https://forum.turbomole.org/index.php/topic,23.0.html

Note that queuing systems often set the stack size limit themselves. If you have 'ulimit -s unlimited' in your SLURM submit script, it only takes effect on the first node, where the script itself runs. The processes that MPI starts on the other nodes get the default stack size limit, and it seems that this default is too small on the cluster you are using.
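
A minimal sketch for checking this from inside a job. It assumes passwordless ssh between the allocated nodes and that the remote MPI processes are started through a fresh non-interactive shell; both are assumptions about the cluster, not something confirmed in this thread. The function names are only illustrative.

```shell
#!/bin/bash
# Print "<host> stack=<soft stack limit>" for the current shell.
report_stack_limit() {
    echo "${HOSTNAME:-$(uname -n)} stack=$(ulimit -s)"
}

# Print the limit that a fresh non-interactive shell gets on every allocated node.
# Assumption: passwordless ssh between compute nodes is allowed on this cluster.
report_all_nodes() {
    for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
        # Single quotes: the commands are expanded on the remote node.
        ssh "$host" 'echo "$(uname -n) stack=$(ulimit -s)"'
    done
}

if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    report_all_nodes      # inside a SLURM job: one line per node
else
    report_stack_limit    # outside a job: only the local shell
fi
```

If the first node reports "unlimited" while another node reports a small number of kbytes, the limit set in the submit script did not reach the remote processes.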

Regards

saikat403

Re: SLURM multinode parallel problem
« Reply #2 on: September 20, 2021, 01:45:37 PM »
Thanks for the reply.

I added the following lines to the submit script:
Code: [Select]
echo $HOSTNAME >> mylimits.out
ulimit -s  unlimited
ulimit -a >> mylimits.out

and this is the content of mylimits.out:
Code: [Select]
cn337
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 768120
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 176128000
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

You are right, this output comes from only one node.
Is there a way to set the same limit on the other node?

Or do I have to ask for /etc/security/limits.conf to be changed?

saikat403

Re: SLURM multinode parallel problem
« Reply #3 on: September 21, 2021, 06:55:02 PM »
I found a way out: setting ulimit -s unlimited in bashrc seems to solve the problem.
Thanks
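
For anyone finding this later, a sketch of what such a bashrc entry could look like. The function name is just illustrative, and the fallback to the hard limit is my own addition for clusters where unlimited is not allowed; it is not something suggested in the replies above.

```shell
# Raise the soft stack limit as far as the hard limit permits.
raise_stack_limit() {
    # Try unlimited first; if the hard limit forbids that, fall back to the hard limit.
    ulimit -S -s unlimited 2>/dev/null || ulimit -S -s "$(ulimit -H -s)" 2>/dev/null
    return 0   # never let a failed ulimit abort the rest of the startup file
}
raise_stack_limit
```

The entry prints nothing, so it should not disturb scp or other non-interactive logins. If the bashrc contains an early return for non-interactive shells, the entry has to come before it, otherwise remotely started processes will probably not pick it up.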