Installation and usage of TURBOMOLE > Parallel Runs
ricc2 dipole moment calculation problem for MPI+amd nodes
glebreto:
Dear all,
I am new to TURBOMOLE and I am trying to perform ADC(2) or CC2 excited-state calculations. I run on two architectures: x86_64 and em64t. With em64t, the same input computes both the excitation energies and the oscillator strengths with MPI or SMP. With x86_64, the excitation energies are computed with both MPI and SMP, but only SMP works when I ask for the oscillator strengths. MPI leads to memory issues, which is strange since this is only a small molecule and quite a lot of RAM is available (128 GB).
Here are the inputs:
coord:
$coord natoms= 2
0.00000000000000 0.00000000000000 -0.02489783 cl
0.00000000000000 0.00000000000000 2.38483140 h
$user-defined bonds
$end
control:
$coord file=coord
$atoms
basis =cc-pVDZ
cbasis =cc-pVDZ
$symmetry c1
$denconv 1.d-8
$eht charge=0 unpaired=0
$ricc2
adc(2)
maxiter = 100
mxdiis = 50
conv=8
iprint=5
$excitations
irrep=a multiplicity=1 nexc=4
spectrum states=all operators=diplen
maxiter = 100
mxdiis = 50
conv=8
$freeze
defcore
$maxcor 70000 mib per_node
$end
and the submission file:
#!/bin/ksh
#$ -N turbomol
#$ -q batch
#$ -pe dmp* 32
#$ -l vendor=amd
module purge
export TURBODIR=/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE
export PATH=$TURBODIR/scripts:$PATH
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/`sysname`:$PATH
tmpdir='/tmp3/'${JOB_ID}'/TMP'
mkdir -p $tmpdir
export TURBOTMPDIR=$tmpdir
export PARNODES=2
export OMP_NUM_THREADS=2
echo $NSLOTS
ulimit -a
date
dscf &> dscf.out
ricc2 &> ricc2.out
date
and the beginning and end of ricc2.out:
tmpdir in control file set to "/tmp3/2402/TMP".
This directory must exist and be writable by the master process (slave1).
STARTING ricc2 VIA YOUR QUEUING SYSTEM!
RUNNING PROGRAM /work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/bin/x86_64-unknown-linux-gnu_mpi/ricc2_mpi.
/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/mpirun_scripts/IMPI/intel64/bin/mpirun -machinefile NodeFile.50523 -genv OMP_NUM_THREADS=2 -genv TURBODIR=/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE -genv I_MPI_PIN=off -genv OMP_STACK_SIZE=256M -genv LD_LIBRARY_PATH=/beegfs/data/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/mpirun_scripts/IMPI/intel64//libfabric/lib:/beegfs/data/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/mpirun_scripts/IMPI/intel64//lib/release:/beegfs/data/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/mpirun_scripts/IMPI/intel64//lib:/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/libso/x86_64-unknown-linux-gnu_mpi /work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/bin/x86_64-unknown-linux-gnu_mpi/ricc2_mpi
this is node-proc. number 1 running on node part064.u-bourgogne.fr
the total number of node-proc. spawned is 33
parallel platform: MPP or cluster with fast interconnect
OpenMP run-time library returned nthreads = 2
Program not compiled with OMP parallelization
... only 1 thread can used... 0 2
ricc2 (part064.u-bourgogne.fr) : TURBOMOLE rev. V7-8 compiled 22 Nov 2023 at 12:25:37
Copyright (C) 2023 TURBOMOLE GmbH, Karlsruhe
2024-02-15 17:48:55.656
R I C C 2 - PROGRAM
the quantum chemistry groups
at the universities in
Karlsruhe & Bochum
Germany
-------------------------------
scaled vector with: 0.983513025232158
renormalized left eigenvector 2
overlap (left|right): 1.0338E+00
scaled vector with: 9.8351E-01
norm of right eigenvector: 1.00056976448406 1.01702333053144
scaled vector with: 0.983261612570339
renormalized left eigenvector 3
overlap (left|right): 1.0343E+00
scaled vector with: 9.8326E-01
norm of right eigenvector: 1.00065470833162 1.01825960571866
scaled vector with: 0.982067828659689
renormalized left eigenvector 4
overlap (left|right): 1.0369E+00
scaled vector with: 9.8207E-01
The semi-canonical algorithm will be used for densities
======== CC DENSITY MODULE ========
current wave-function model: ADC(2)
calculating 4 xi densities
a semicanonical algorithm will be used when possible
density nr. cpu/min wall/min L R
------------------------------------------------------
total memory allocated in ccn5den1: 1 Mbyte
number of batches in I-loop: 2
memory allocated per RI-intermediate in I-loop: 1 MByte
memory allocated per RI-intermediate in j-loop: 1 MByte
total memory allocated in cc_ybcont: 1 Mbyte
time in cc_ybcont cpu: 0.00 sec wall: 0.00 sec ratio: 1.0
-----
total memory allocated in ccn5den1: 1 Mbyte
number of batches in I-loop: 2
memory allocated per RI-intermediate in I-loop: 1 MByte
memory allocated per RI-intermediate in j-loop: 1 MByte
total memory allocated in cc_ybcont: 1 Mbyte
time in cc_ybcont cpu: 0.00 sec wall: 0.00 sec ratio: 1.0
number of batches in I-loop: 2
memory allocated per RI-intermediate in I-loop: 1 MByte
memory allocated per RI-intermediate in j-loop: 1 MByte
total memory allocated in cc_ybcont: 1 Mbyte
time in cc_ybcont cpu: 0.00 sec wall: 0.00 sec ratio: 1.0
number of batches in I-loop: 2
memory allocated per RI-intermediate in I-loop: 1 MByte
memory allocated per RI-intermediate in j-loop: 1 MByte
total memory allocated in cc_ybcont: 1 Mbyte
time in cc_ybcont cpu: 0.00 sec wall: 0.00 sec ratio: 1.0
2 0.00 0.00 LE0 R0
total memory allocated in ccn5den1: 1 Mbyte
Abort(403292676) on node 9 (rank 8 in comm 496): Fatal error in PMPI_Recv: Invalid tag, error stack:
PMPI_Recv(173): MPI_Recv(buf=0x2b6faeb1f7c0, count=1260, dtype=0x4c000829, src=1, tag=1048577, comm=0x84000002, status=0x7ffe3587d280) failed
PMPI_Recv(105): Invalid tag, value is 1048577
Abort(269074948) on node 10 (rank 9 in comm 496): Fatal error in PMPI_Recv: Invalid tag, error stack:
PMPI_Recv(173): MPI_Recv(buf=0x2b58b7625b40, count=1260, dtype=0x4c000829, src=1, tag=1048577, comm=0x84000002, status=0x7ffccfe71c80) failed
PMPI_Recv(105): Invalid tag, value is 1048577
-----
PMPI_Recv(105): Invalid tag, value is 1048577
Abort(805945860) on node 13 (rank 12 in comm 496): Fatal error in PMPI_Recv: Invalid tag, error stack:
PMPI_Recv(173): MPI_Recv(buf=0x2b4fef781840, count=1260, dtype=0x4c000829, src=1, tag=1048577, comm=0x84000002, status=0x7ffe3cb10f00) failed
PMPI_Recv(105): Invalid tag, value is 1048577
Abort(671728132) on node 18 (rank 17 in comm 496): Fatal error in PMPI_Recv: Invalid tag, error stack:
PMPI_Recv(173): MPI_Recv(buf=0x2b39ec0dc1c0, count=1260, dtype=0x4c000829, src=1, tag=1048577, comm=0x84000002, status=0x7ffd404f0200) failed
PMPI_Recv(105): Invalid tag, value is 1048577
Abort(671728132) on node 29 (rank 28 in comm 496): Fatal error in PMPI_Recv: Invalid tag, error stack:
PMPI_Recv(173): MPI_Recv(buf=0x2b634cd4b7c0, count=1260, dtype=0x4c000829, src=1, tag=1048577, comm=0x84000002, status=0x7fffa1659c00) failed
PMPI_Recv(105): Invalid tag, value is 1048577
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 50712 RUNNING AT part064.u-bourgogne.fr
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
Here is the output of the submission script:
Prologue begin
Starter begin : part064.u-bourgogne.fr(49194)
Thu Feb 15 17:48:43 CET 2024
CentOS version: 7.7
Starter(49194): PATH=/usr/ccub/sge/scripts:/tmp3/2402.1.batch:/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/bin/em64t-unknown-linux-gnu:/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE/scripts:/soft/c7/gv/6.1.1/gv:/soft/c7/spack/0.18.0/packages/linux-centos7-haswell/gcc/11.2.0/gcc/4.8.5/g75x5bhqcqxorvp32f6vs2h3e4vb7tpm/bin:/usr/lib64/qt-3.3/bin:/soft/c7/modules/4.1.2/bin:/usr/ccub/sge-8.1.8/bin:/usr/ccub/sge-8.1.8/bin/lx-amd64:/user1/icmub/gu9875le/bin:/bin:/usr/bin:/usr/sbin:/etc:/usr/ccub/bin:/usr/local/bin:/user1/icmub/gu9875le/bin:.:/work/shared/icmub/bin:/soft/c7/gaussian/16avx2/g16/bsd:/soft/c7/gaussian/16avx2/g16/local:/soft/c7/gaussian/16avx2/g16/extras:/soft/c7/gaussian/16avx2/g16
Starter exec(49194) : '/usr/ccub/sge-8.1.8/ccub/spool/part064/job_scripts/2402'
32
time(cpu-seconds) unlimited
file(blocks) unlimited
coredump(blocks) unlimited
data(KiB) unlimited
stack(KiB) unlimited
lockedmem(KiB) unlimited
nofiles(descriptors) 1024
processes unlimited
flocks unlimited
sigpending 513331
msgqueue(bytes) 819200
maxnice 0
maxrtprio 0
address-space(KiB) unlimited
Thu Feb 15 17:48:44 CET 2024
Thu Feb 15 17:49:01 CET 2024
Starter(49194): Return code=0
Starter end(49194)
Do you have some ideas?
Best,
Guillaume
uwe:
Hello,
if the number of processes is large but the input is small, some MPI processes will not get any tasks to do in some parts of the code. That might result in messaging problems. It should of course not happen, so it is good to get a bug report.
Depending on how the parallelization is done, the limiting factor for the size of the input could be something like the number of occupied orbitals, which is 9 in your HCl case. You tried to run the job on 33 cores, which is most likely too many for a two-atom input where the serial version runs for just a couple of seconds.
Your input runs fine for me on 2 cores in parallel using MPI, but I'd recommend trying a larger input for testing.
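For reference: since your log shows ricc2 being started via the queuing system, and 33 node-procs were spawned from the 32 granted slots, the process count appears to follow NSLOTS rather than PARNODES. A minimal sketch of a 2-slot submission (assuming the same `dmp*` parallel environment from your script; details may differ on your cluster) could look like this:

```shell
#!/bin/ksh
#$ -N turbomol_2proc
#$ -q batch
# Request only 2 slots: under SGE the MPI startup seems to use NSLOTS,
# so reducing the slot count here is what actually limits the processes.
#$ -pe dmp* 2
export TURBODIR=/work/shared/icmub/TurboMole/TmoleX2024/TURBOMOLE
export PATH=$TURBODIR/scripts:$PATH
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/`sysname`:$PATH
# Kept consistent with the slot count requested above.
export PARNODES=2
dscf  &> dscf.out
ricc2 &> ricc2.out
```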
Best Regards
glebreto:
Hello,
Thanks for your suggestions. How should I run on 2 cores in parallel using MPI on a machine that has 32 cores? In the above input, I set:
export PARNODES=2
export OMP_NUM_THREADS=2
yet instead of 2 processes, 33 are started.
I also tried aniline and this molecule:
$coord
-8.31772213412044 2.35528661159307 0.00000566917837 c
-5.74708611281726 1.68082068100064 0.00002456643962 c
-5.02669605794417 -0.91132420305303 0.00001322808287 c
-6.84130556911606 -2.84734483827988 -0.00004346370087 c
-9.35925035688630 -2.15343929504342 -0.00006236096211 c
-10.07869932814933 0.42537357120101 -0.00002834589187 c
-2.35207029772896 -0.84979094098296 0.00002267671350 c
-1.73118377194804 1.77591925849631 -0.00000377945225 n
-3.71886196051766 3.29465057088330 0.00003401507024 n
-0.61420256447812 -2.62628089676891 0.00004913287924 n
1.73137274456050 -1.77546383450027 0.00001133835675 n
2.35210053334695 0.84903127108086 -0.00000188972612 c
0.61405138638815 2.62631302211303 0.00003968424862 n
5.02651086478396 0.91116546605856 0.00007180959274 c
5.74706154637764 -1.68116650088145 0.00005858150986 c
8.31808685126249 -2.35518456638234 0.00002456643962 c
10.07865208499621 -0.42515436297056 -0.00005291233149 c
9.35889319864875 2.15379267382873 -0.00006803014049 c
6.84105045608924 2.84742798622936 -0.00004157397474 c
3.71926825163446 -3.29486977911375 -0.00000566917837 n
-10.81255368166037 -3.59482033769723 -0.00009826575848 h
-6.27814639699021 -4.81502594499871 -0.00006614041436 h
-8.87810151911696 4.32341369367732 0.00003212534412 h
6.27756436134383 4.81503161417709 -0.00006614041436 h
10.81221731041019 3.59517182675641 -0.00013606028097 h
8.87871190065522 -4.32324739777835 0.00004913287924 h
12.07361327805014 -0.89154632943269 -0.00008692740173 h
-12.07358493215827 0.89215293151870 -0.00003590479637 h
$end
and got similar issues.
If it helps, in the meantime I also had memory issues while performing CC2/ADC(2) geometry optimizations at the same step of the calculation, i.e. while calling the CC density module using MPI. Note that it worked fine in SMP.
Best,
Guillaume
uwe:
Hello,
Christof found out what the problem is: the MPI version used in newer TURBOMOLE releases is too new. Intel MPI 2019/2021 reduced the maximum allowed value for MPI tags (that is an implementation detail and can only be changed when building ricc2, not at runtime).
TURBOMOLE 7.5.1 came with an older Intel MPI version which still works. So as a workaround, if you have older TURBOMOLE releases available, use 7.5.1 (or older) for this job. The alternative is of course to use the latest version with SMP instead of MPI.
A fix will be available in one of the next official releases of Turbomole.
Sorry for the inconvenience...
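As a side note on the failure mode: the rejected tag 1048577 sits just above 2^20. A quick sanity check against a tag ceiling of 2^20 - 1 (an assumed value inferred from the error message, not an official Intel MPI figure) shows why the receive is refused:

```shell
# Assumed ceiling: newer Intel MPI reportedly reduced MPI_TAG_UB to 2^20 - 1.
# The tag value below is taken from the PMPI_Recv error in the log above.
TAG=1048577
TAG_UB=$(( (1 << 20) - 1 ))
echo "MPI_TAG_UB (assumed) = $TAG_UB"
if [ "$TAG" -gt "$TAG_UB" ]; then
    echo "tag $TAG exceeds the limit -> Invalid tag"
fi
```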