Topic: Memory allocation by ridft, Turbomole v5.10 vs v6.1

Jakub

Memory allocation by ridft, Turbomole v5.10 vs v6.1
« on: January 06, 2010, 03:21:06 PM »
Dear All,

I recently switched from Turbomole v5.10 to v6.1 and have a few questions about memory allocation by the ridft module in v6.1. A job that runs flawlessly under v5.10 crashes under v6.1 with the following error:

Code:
ridft ended abnormally
MPI Application rank 1 exited before MPI_Finalize() with status 13
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libc.so.6          0000003310A99730  Unknown               Unknown  Unknown
libc.so.6          0000003310ACCDA4  Unknown               Unknown  Unknown
ridft_mpi          00000000009847F0  pa_rcv_nap_               220  pa_mp.f
ridft_mpi          0000000000995270  ridftserver_              113  ridftserver.f
ridft_mpi          0000000000987FDD  serversub_                286  serversub.F
ridft_mpi          000000000098324D  pa_ninit_                  83  pa_slave.f
ridft_mpi          000000000091D9FB  conny_                    135  conny.f
ridft_mpi          000000000091A924  cntrlp_                   248  cntrlp.f
ridft_mpi          0000000000496E42  prelim_                   125  prelim.f
ridft_mpi          00000000004314D6  MAIN__.A                  516  ridft.f
ridft_mpi          000000000040C2BC  Unknown               Unknown  Unknown
libc.so.6          0000003310A1D974  Unknown               Unknown  Unknown
ridft_mpi          000000000040C1EA  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
ridft_mpi          0000000000A1CE88  gvonr_                      2  gvonr.f
ridft_mpi          00000000009D9487  shrnk0_                    44  shrnk0.f
ridft_mpi          00000000009A4D4A  bound_nc_                 276  bound_nc.f
ridft_mpi          000000000044DD42  MAIN__.A                  907  ridft.f
ridft_mpi          000000000040C2BC  Unknown               Unknown  Unknown
libc.so.6          000000371701D974  Unknown               Unknown  Unknown
ridft_mpi          000000000040C1EA  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
ridft_mpi          0000000000A25B4F  thirdl_                   108  thirdl.f
ridft_mpi          0000000000A1B726  asra_                     263  asra.f
ridft_mpi          0000000000A1CCFD  gvonr_                     97  gvonr.f
ridft_mpi          00000000009D9487  shrnk0_                    44  shrnk0.f
ridft_mpi          00000000009A4D4A  bound_nc_                 276  bound_nc.f
ridft_mpi          000000000044DD42  MAIN__.A                  907  ridft.f
ridft_mpi          000000000040C2BC  Unknown               Unknown  Unknown
libc.so.6          0000003CFBE1D974  Unknown               Unknown  Unknown
ridft_mpi          000000000040C1EA  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libc.so.6          000000328EC306A7  Unknown               Unknown  Unknown
ridft_mpi          000000000101CDE3  Unknown               Unknown  Unknown
ridft_mpi          00000000009D9BFA  shrnk0_                    69  shrnk0.f
ridft_mpi          00000000009A4D4A  bound_nc_                 276  bound_nc.f
ridft_mpi          000000000044DD42  MAIN__.A                  907  ridft.f
ridft_mpi          000000000040C2BC  Unknown               Unknown  Unknown
libc.so.6          000000328EC1D974  Unknown               Unknown  Unknown
ridft_mpi          000000000040C1EA  Unknown               Unknown  Unknown

This job is rather large, and I intended to run it on 8 Opteron cores with 8 GB of RAM per core.
However, as shown below, the ridft module tried to allocate much more memory than is available:

In slave6.output:

Code:

Memory core needed for (P|Q) and Cholesky    590 MByte
 Memory core minimum needed except of (P|Q)    93 MByte
 Total minimum memory core needed (sum)       683 MByte
     number of direct tasks:         160

 ****************************************
 Memory allocated for RI-J 30985 MByte
 ****************************************


 Allocation failure for ipqof, xcore in <ridft>


  abnormal termination
 ridft ended abnormally
(END)

The same job, run with the same parameters ($ricore 4000) under v5.10, shows a different memory allocation:

In slave6.output:

Code:

Memory core needed for (P|Q) and Cholesky    590 MByte
 Memory core minimum needed except of (P|Q)    93 MByte
 Total minimum memory core needed (sum)       683 MByte
     number of direct tasks:         160

 ****************************************
 Memory allocated for RI-J  5046 MByte
 ****************************************


          -----------------
          -T+V- integrals
          -----------------

As a test, I set "$ricore 0" and ran the calculation on 8, 4 and 2 cores (all with v6.1), which gave the following memory allocations:

For 8 cores:
Code:
****************************************
 Memory allocated for RI-J 10049 MByte
 ****************************************
 Allocation failure for ipqof, xcore in <ridft>

 abnormal termination
 ridft ended abnormally

For 4 cores (job runs fine):
Code:
****************************************
 Memory allocated for RI-J  4388 MByte
 ****************************************

For 2 cores (job runs fine):
Code:
****************************************
 Memory allocated for RI-J   820 MByte
****************************************
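
The comparison above can be automated. The following is a minimal sketch, not part of Turbomole: it assumes the per-process output files match the pattern slave*.output (only slave6.output is named in this post), that the allocation line looks exactly like the ones quoted here, and that the budget per process is about 8000 MByte. The function names are hypothetical.

```python
import glob
import re

# Pattern modelled on the "Memory allocated for RI-J ... MByte" lines quoted above.
RIJ_LINE = re.compile(r"Memory allocated for RI-J\s+(\d+)\s+MByte")


def rij_allocations(text):
    """Return every RI-J allocation (in MByte) reported in an output text."""
    return [int(value) for value in RIJ_LINE.findall(text)]


def over_budget(text, budget_mb):
    """Return the reported allocations that exceed budget_mb."""
    return [mb for mb in rij_allocations(text) if mb > budget_mb]


if __name__ == "__main__":
    budget_mb = 8000  # assumption: roughly 8 GB of RAM available per process
    # Assumption: one output file per slave process, named slave<N>.output.
    for path in sorted(glob.glob("slave*.output")):
        with open(path, errors="replace") as handle:
            text = handle.read()
        for mb in rij_allocations(text):
            status = "exceeds budget" if mb > budget_mb else "ok"
            print(f"{path}: RI-J {mb} MByte ({status})")
```

Run in the job directory, this prints one line per reported allocation, which makes it easier to see which processes request more than the node can provide.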

Can someone explain these differences in memory allocation by the ridft module in Turbomole 6.1 for different numbers of processors, and why I cannot run this job on, e.g., 8 cores? I have never used version 6.0, so maybe I am missing something here?

Thank you for any help,

best wishes,

Jakub.


P.S.
To submit jobs with version 6.1, I used the following script:

Code:
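#!/bin/csh
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Turbomole installation, scripts and architecture-specific binaries
setenv TURBODIR /usr/local/TURBOMOLE_6.1
set path=($TURBODIR/scripts $path)
set path=($TURBODIR/bin/`sysname` $path)
setenv TURBOTMPDIR /tmp/

# Parallel (MPI) run on 8 processes
setenv PARA_ARCH MPI
setenv PARNODES 8
limit stacksize unlimited

# Record the shell limits, then run ridft
limit > limit.out
ridft > ridft.out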

uwe

Re: Memory allocation by ridft, Turbomole v5.10 vs v6.1
« Reply #1 on: January 07, 2010, 10:31:55 AM »
Hi,

I am not aware of any changes in this area since Turbomole 5.10, except that in 5.10 the memory limit for $ricore was 16 GB in total because of integer overflows, which has been fixed in 6.1 (I have tried it with up to 24 GB, but in theory it should also work for larger amounts of memory). Note that $ricore is set per core (CPU), not per node!
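
Because $ricore is given per core, the total requested for a parallel run scales with the number of processes. A minimal sketch of that arithmetic, using the $ricore 4000 and PARNODES 8 values from the original post. The function name is hypothetical, and the 32-bit word-count constant is only an assumed illustration of how a ceiling of about 16 GB can arise from integer overflow, not a statement about Turbomole's internals.

```python
def total_ricore_mb(ricore_mb, n_processes):
    """Total RI-J integral storage requested when $ricore applies per process."""
    return ricore_mb * n_processes


# Largest byte count addressable with a signed 32-bit count of 8-byte words.
# Assumption: illustrative only; the thread does not say which counter overflowed.
INT32_WORD_LIMIT_BYTES = (2**31 - 1) * 8

if __name__ == "__main__":
    total = total_ricore_mb(4000, 8)  # $ricore 4000 with PARNODES 8
    print(f"total requested: {total} MByte")
    print(f"32-bit word-count ceiling: {INT32_WORD_LIMIT_BYTES / 2**30:.2f} GiB")
```

With these inputs the total request is 32000 MByte, which is why a per-core setting that looks modest can exceed what a single node provides.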

To reduce the memory requirements, please try setting the following in the control file:

$ricore 1
$ricore_slave 1

and rerun on 8 CPUs.

For such large systems, memory for RI-integral storage usually does not speed up the calculation significantly.

However, $marij should always be added; it usually gives a speed-up of a factor of 4 to 10 without introducing additional errors.
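
If the control file is edited by script rather than by hand, the settings suggested in this reply could be applied as in the sketch below. It is only an illustration under stated assumptions: the helper set_keyword is hypothetical, it treats each keyword as a single line starting with "$", and it assumes the file is terminated by a "$end" line. Keywords that carry continuation lines are not handled.

```python
def set_keyword(control_text, keyword, value=""):
    """Set `$keyword value` in control-file text.

    Replaces an existing single-line entry, or inserts the entry just
    before `$end` (or at the end of the text if `$end` is absent).
    """
    new_line = f"${keyword} {value}".rstrip()
    out = []
    replaced = False
    for line in control_text.splitlines():
        tokens = line.split()
        # Compare whole tokens so that $ricore does not match $ricore_slave.
        if tokens and tokens[0] == f"${keyword}":
            out.append(new_line)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        index = next(
            (i for i, line in enumerate(out) if line.strip() == "$end"),
            len(out),
        )
        out.insert(index, new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Assumption: a tiny stand-in for a real control file.
    text = "$title\n$ricore 4000\n$end\n"
    for key, val in (("ricore", "1"), ("ricore_slave", "1"), ("marij", "")):
        text = set_keyword(text, key, val)
    print(text, end="")
```

Keeping a backup copy of the control file before rewriting it is advisable, since the sketch does not validate the rest of the file.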

Hope it helps,

Uwe