TURBOMOLE Users Forum
Installation and usage of TURBOMOLE => Parallel Runs => Topic started by: saikat403 on September 20, 2021, 12:40:41 PM
-
Hello all,
I am using TURBOMOLE 7.3 with the SLURM queuing system. Both ridft and rdgrad run perfectly fine on a single node with 40 processes. However, when I try to run in parallel across more than one node, rdgrad seems to cause a problem. Here is the SLURM script I am using:
#!/bin/bash
#SBATCH -J turbo-test
#SBATCH -p standard-low
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH -t 00:15:00 # walltime in HH:MM:SS, Max value 72:00:00
export TURBODIR=/home/17cy91r04/apps/turbomole730
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/`$TURBODIR/scripts/sysname`:$PATH
export TURBOMOLE_SYSNAME=x86_64-unknown-linux-gnu
export PATH=$TURBODIR/bin/${TURBOMOLE_SYSNAME}_mpi:$TURBODIR/mpirun_scripts:$TURBODIR/scripts:$PATH
export PARNODES=$SLURM_NTASKS # number of parallel TURBOMOLE processes
jobex -ri -c 999 > jobe.out
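For a quick sanity check, one could print the parallel settings just before the jobex line (these two lines are illustrative additions, not part of my actual script):
echo "PARNODES=$PARNODES"   # should print the total task count, 80 here
which rdgrad_mpi            # should resolve to a path under bin/..._mpi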
The error I am getting in job.1 looks like this:
The tracebacks of the three crashing MPI ranks are interleaved; a representative one reads:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
libpthread-2.17.s  00007FD82B82C630  Unknown    Unknown  Unknown
rdgrad_mpi         0000000000476960  dlp3_          427  dlp3.f
rdgrad_mpi         0000000000471B2F  twoder_        418  twoder.f
rdgrad_mpi         0000000000480C0E  dasra3_        155  dasra3.f
rdgrad_mpi         00000000029265FD  Unknown    Unknown  Unknown
rdgrad_mpi         0000000000433429  Unknown    Unknown  Unknown
Is there anything wrong with the SLURM script?
How do I make TURBOMOLE run in parallel across nodes?
Thanks in advance.
with regards,
saikat
-
Hello,
segmentation faults often happen because memory limits are set too low; the stack size limit in particular is a common cause of such crashes.
See: https://forum.turbomole.org/index.php/topic,23.0.html
Note that queuing systems often set the stack size limit themselves. If you have 'ulimit -s unlimited' in your SLURM submit script, it takes effect only on the first node. The processes that MPI starts on the other nodes get the default stack size limit, and that default seems to be too small on the cluster you are using.
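A quick way to check this (a suggestion assuming a standard SLURM setup, not something TURBOMOLE requires) is to report the limit from one task per allocated node inside the job:
# print hostname and stack size limit once per node of the allocation
srun --ntasks-per-node=1 bash -c 'echo "$HOSTNAME: stack = $(ulimit -s)"'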
Regards
-
Thanks for the reply.
I have added the following lines to my submit script:
echo $HOSTNAME >> mylimits.out
ulimit -s unlimited
ulimit -a >> mylimits.out
and the output in mylimits.out is:
cn337
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 768120
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 176128000
open files (-n) 131072
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
You are right, this output is from only one node.
Is there a way to set the same limit on the other node?
Or do I have to ask the administrators to change /etc/security/limits.conf?
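For reference, raising the limit system-wide would mean entries like the following in /etc/security/limits.conf on every compute node (a sketch assuming a standard PAM setup; only administrators can change this file):
# domain  type  item   value
*         soft  stack  unlimited
*         hard  stack  unlimited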
-
I found a way out: setting 'ulimit -s unlimited' in ~/.bashrc seems to solve the problem.
Thanks
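In other words, one line appended to ~/.bashrc on the cluster (placed before any early 'return' for non-interactive shells, since the shells the MPI launcher starts on the remote nodes are non-interactive but do read this file):
# raise the stack limit for every shell, including the non-interactive
# ones that launch the rdgrad_mpi processes on the other nodes
ulimit -s unlimited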