TURBOMOLE Users Forum
Installation and usage of TURBOMOLE => Parallel Runs => Topic started by: saikat403 on September 20, 2021, 12:40:41 PM
-
Hello all,
I am using TURBOMOLE 7.3 with the SLURM queuing system. Both ridft and rdgrad run perfectly fine on a single node with 40 processes. However, when I try to run in parallel across more than one node, rdgrad seems to cause a problem. Here is the SLURM script I am using:
#!/bin/bash
#SBATCH -J turbo-test
#SBATCH -p standard-low
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH -t 00:15:00 # walltime in HH:MM:SS, Max value 72:00:00
export TURBODIR=/home/17cy91r04/apps/turbomole730
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/`$TURBODIR/scripts/sysname`:$PATH
export TURBOMOLE_SYSNAME=x86_64-unknown-linux-gnu
export PATH=$TURBODIR/bin/${TURBOMOLE_SYSNAME}_mpi:$TURBODIR/mpirun_scripts:$TURBODIR/scripts:$PATH
export PARNODES=$SLURM_NTASKS # number of parallel TURBOMOLE processes
jobex -ri -c 999 > jobe.out
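For a quick sanity check, one could print the parallel settings just before the jobex line (these two lines are illustrative additions, not part of my actual script):
echo "PARNODES=$PARNODES"   # should print the total task count, 80 here
which rdgrad_mpi            # should resolve to a path under bin/..._mpi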
The error I am getting in job.1 looks like this:
The tracebacks of the three crashing MPI ranks are interleaved; a representative one reads:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
libpthread-2.17.s  00007FD82B82C630  Unknown    Unknown  Unknown
rdgrad_mpi         0000000000476960  dlp3_          427  dlp3.f
rdgrad_mpi         0000000000471B2F  twoder_        418  twoder.f
rdgrad_mpi         0000000000480C0E  dasra3_        155  dasra3.f
rdgrad_mpi         00000000029265FD  Unknown    Unknown  Unknown
rdgrad_mpi         0000000000433429  Unknown    Unknown  Unknown
Is there anything wrong with the SLURM script?
How do I make TURBOMOLE run in parallel across nodes?
Thanks in advance.
with regards,
saikat
-
Hello,
segmentation faults often happen because memory limits are set too low; the stack size limit in particular is a common cause of such crashes.
See: https://forum.turbomole.org/index.php/topic,23.0.html
Note that queuing systems often set the stack size limit themselves. If you have 'ulimit -s unlimited' in your SLURM submit script, it takes effect only on the first node. The processes that MPI starts on the other nodes get the default stack size limit, and that default seems to be too small on the cluster you are using.
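A quick way to check this (a suggestion assuming a standard SLURM setup, not something TURBOMOLE requires) is to report the limit from one task per allocated node inside the job:
# print hostname and stack size limit once per node of the allocation
srun --ntasks-per-node=1 bash -c 'echo "$HOSTNAME: stack = $(ulimit -s)"'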
Regards
-
Thanks for the reply.
I have added the following lines to my submit script:
echo $HOSTNAME >> mylimits.out
ulimit -s unlimited
ulimit -a >> mylimits.out
and the output in mylimits.out is:
cn337
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 768120
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 176128000
open files (-n) 131072
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
You are right, this output is from only one node.
Is there a way to set the same limit on the other node?
Or do I have to ask the administrators to change /etc/security/limits.conf?
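For reference, raising the limit system-wide would mean entries like the following in /etc/security/limits.conf on every compute node (a sketch assuming a standard PAM setup; only administrators can change this file):
# domain  type  item   value
*         soft  stack  unlimited
*         hard  stack  unlimited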
-
I found a way out: setting 'ulimit -s unlimited' in ~/.bashrc seems to solve the problem.
Thanks
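In other words, one line appended to ~/.bashrc on the cluster (placed before any early 'return' for non-interactive shells, since the shells the MPI launcher starts on the remote nodes are non-interactive but do read this file):
# raise the stack limit for every shell, including the non-interactive
# ones that launch the rdgrad_mpi processes on the other nodes
ulimit -s unlimited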