Dear All,
I recently switched from Turbomole v5.10 to v6.1 and have a few questions about memory allocation by the ridft module in v6.1. The same job, which runs flawlessly with v5.10, crashes with the following error in v6.1:
ridft ended abnormally
MPI Application rank 1 exited before MPI_Finalize() with status 13
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 0000003310A99730 Unknown Unknown Unknown
libc.so.6 0000003310ACCDA4 Unknown Unknown Unknown
ridft_mpi 00000000009847F0 pa_rcv_nap_ 220 pa_mp.f
ridft_mpi 0000000000995270 ridftserver_ 113 ridftserver.f
ridft_mpi 0000000000987FDD serversub_ 286 serversub.F
ridft_mpi 000000000098324D pa_ninit_ 83 pa_slave.f
ridft_mpi 000000000091D9FB conny_ 135 conny.f
ridft_mpi 000000000091A924 cntrlp_ 248 cntrlp.f
ridft_mpi 0000000000496E42 prelim_ 125 prelim.f
ridft_mpi 00000000004314D6 MAIN__.A 516 ridft.f
ridft_mpi 000000000040C2BC Unknown Unknown Unknown
libc.so.6 0000003310A1D974 Unknown Unknown Unknown
ridft_mpi 000000000040C1EA Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
ridft_mpi 0000000000A1CE88 gvonr_ 2 gvonr.f
ridft_mpi 00000000009D9487 shrnk0_ 44 shrnk0.f
ridft_mpi 00000000009A4D4A bound_nc_ 276 bound_nc.f
ridft_mpi 000000000044DD42 MAIN__.A 907 ridft.f
ridft_mpi 000000000040C2BC Unknown Unknown Unknown
libc.so.6 000000371701D974 Unknown Unknown Unknown
ridft_mpi 000000000040C1EA Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
ridft_mpi 0000000000A25B4F thirdl_ 108 thirdl.f
ridft_mpi 0000000000A1B726 asra_ 263 asra.f
ridft_mpi 0000000000A1CCFD gvonr_ 97 gvonr.f
ridft_mpi 00000000009D9487 shrnk0_ 44 shrnk0.f
ridft_mpi 00000000009A4D4A bound_nc_ 276 bound_nc.f
ridft_mpi 000000000044DD42 MAIN__.A 907 ridft.f
ridft_mpi 000000000040C2BC Unknown Unknown Unknown
libc.so.6 0000003CFBE1D974 Unknown Unknown Unknown
ridft_mpi 000000000040C1EA Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 000000328EC306A7 Unknown Unknown Unknown
ridft_mpi 000000000101CDE3 Unknown Unknown Unknown
ridft_mpi 00000000009D9BFA shrnk0_ 69 shrnk0.f
ridft_mpi 00000000009A4D4A bound_nc_ 276 bound_nc.f
ridft_mpi 000000000044DD42 MAIN__.A 907 ridft.f
ridft_mpi 000000000040C2BC Unknown Unknown Unknown
libc.so.6 000000328EC1D974 Unknown Unknown Unknown
ridft_mpi 000000000040C1EA Unknown Unknown Unknown
This job is rather large, and I intended to run it on 8 Opteron cores with 8 GB RAM per core.
However, as shown below, the ridft module allocated much more memory than is available:
in slave6.output
Memory core needed for (P|Q) and Cholesky 590 MByte
Memory core minimum needed except of (P|Q) 93 MByte
Total minimum memory core needed (sum) 683 MByte
number of direct tasks: 160
****************************************
Memory allocated for RI-J 30985 MByte
****************************************
Allocation failure for ipqof, xcore in <ridft>
abnormal termination
ridft ended abnormally
The same job run with the same parameters ($ricore 4000) on v5.10 shows a different memory allocation:
in slave6.output
Memory core needed for (P|Q) and Cholesky 590 MByte
Memory core minimum needed except of (P|Q) 93 MByte
Total minimum memory core needed (sum) 683 MByte
number of direct tasks: 160
****************************************
Memory allocated for RI-J 5046 MByte
****************************************
-----------------
-T+V- integrals
-----------------
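To compare runs without reading every file by hand, the RI-J allocation line can be extracted from each slave output. The following is only a minimal sketch: the `slave*.output` file naming is assumed from the single file name quoted above, the line format is assumed to match the excerpts, and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Matches the line format seen in the excerpts, e.g.
# "Memory allocated for RI-J 5046 MByte"
RIJ_LINE = re.compile(r"Memory allocated for RI-J\s+(\d+)\s+MByte")


def rij_allocation_mb(text):
    """Return the first RI-J allocation (in MByte) found in the text, or None."""
    match = RIJ_LINE.search(text)
    return int(match.group(1)) if match else None


def collect_allocations(directory="."):
    """Map each slave*.output file name (assumed naming) to its RI-J allocation."""
    result = {}
    for path in sorted(Path(directory).glob("slave*.output")):
        result[path.name] = rij_allocation_mb(path.read_text(errors="replace"))
    return result


if __name__ == "__main__":
    # Prints one line per slave output found in the current directory.
    for name, mb in collect_allocations().items():
        print(name, mb)
```

Run in the job directory, this would list one value per slave process, which makes it easy to see whether all processes request the same amount.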
As a test, I set "$ricore 0" and ran the calculation on 8, 4 and 2 cores (all using v6.1), which gave the following memory allocations:
for 8 cores:
****************************************
Memory allocated for RI-J 10049 MByte
****************************************
Allocation failure for ipqof, xcore in <ridft>
abnormal termination
ridft ended abnormally
for 4 cores: (job runs fine)
****************************************
Memory allocated for RI-J 4388 MByte
****************************************
for 2 cores: (job runs fine)
****************************************
Memory allocated for RI-J 820 MByte
****************************************
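As a quick sanity check, the reported allocations can be compared with the 8 GB available per core. This sketch uses only the numbers quoted in this post. It assumes that the figure printed in a slave output is what one process tries to allocate, and that 8 GB per core corresponds to 8192 MByte usable by that process; neither assumption is confirmed by the output itself.

```python
# RI-J allocations in MByte, copied from the output excerpts in this post.
RUNS = [
    ("v6.1, $ricore 4000", 30985),
    ("v5.10, $ricore 4000", 5046),
    ("v6.1, $ricore 0, 8 cores", 10049),
    ("v6.1, $ricore 0, 4 cores", 4388),
    ("v6.1, $ricore 0, 2 cores", 820),
]

# Assumption: "8GB RAM per core" means 8 * 1024 MByte usable by one process.
RAM_PER_CORE_MB = 8 * 1024


def fits(allocated_mb, limit_mb=RAM_PER_CORE_MB):
    """True if one process's reported allocation stays within the limit."""
    return allocated_mb <= limit_mb


def over_limit(runs, limit_mb=RAM_PER_CORE_MB):
    """Labels of the runs whose reported allocation exceeds the limit."""
    return [label for label, mb in runs if not fits(mb, limit_mb)]


for label, mb in RUNS:
    status = "within" if fits(mb) else "exceeds"
    print(f"{label}: {mb} MByte {status} {RAM_PER_CORE_MB} MByte")
```

Under these assumptions, the two runs flagged as exceeding the limit are the same two that stop with "Allocation failure", which is consistent with a per-process limit but does not prove it.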
Can someone explain such differences in the memory allocated by the ridft module in Turbomole 6.1 for different numbers of processors, and why I cannot run this job on, e.g., 8 cores? I have never used version 6.0, so maybe I am missing something here?
Thank you for any help,
best wishes,
Jakub.
P.S.
To submit jobs with version 6.1, I used the following script:
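#!/bin/csh
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# Point to the Turbomole 6.1 installation and put its scripts and binaries on the path
setenv TURBODIR /usr/local/TURBOMOLE_6.1
set path=($TURBODIR/scripts $path)
set path=($TURBODIR/bin/`sysname` $path)
setenv TURBOTMPDIR /tmp/
# Request the MPI-parallel version with 8 processes
setenv PARA_ARCH MPI
setenv PARNODES 8
# Remove the stack-size limit and record the effective limits
limit stacksize unlimited
limit > limit.out
# Run the RI-DFT calculation
ridft > ridft.out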