Author Topic: SCS-MP2 error in gradient step (1) for large system using RI-CC2 module

mike

  • Jr. Member
  • **
  • Posts: 10
  • Karma: +0/-0
Dear all,

I'm having some problems with SCS-MP2 geometry optimizations performed using the RICC2 code.

For smaller molecules, my geometry optimizations converge with no problems at all, while for larger molecules the calculation fails with the error message "error in gradient step (1)". I've tried increasing the memory (while staying below the maximum of 16 GB), checked that the stack size is unlimited, and also tried the _huge executables, but nothing has solved the problem.
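
For reference, this is roughly what I mean by increasing the memory and the stack size (the numbers are illustrative rather than my exact settings):

   # shell limit before starting the job
   ulimit -s unlimited

   # in the control file ($maxcor is given in MB)
   $maxcor 16000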

Any tips/suggestions as to what to try next would be much appreciated! The contents of the calculation directory are in the attached .tgz (without the large mos and ddens files).

Here is some of the job.last output:

                 ========   CC DENSITY MODULE   ========

                      current wave-function model: MP2     

  calculating CC ground state density

   a semicanonical algorithm will be used

    density nr.      cpu/min        wall/min    L     R
   ------------------------------------------------------
         1            63.03           63.34    L0    R0
   ------------------------------------------------------
     time in cc_1den       cpu:  1 h  3 min  wall:  1 h  3 min  ratio:  1.0
     time in lpzwei        cpu:  0.43 sec    wall:  0.44 sec    ratio:  1.0
     time in invmetri      cpu:  3.24 sec    wall:  3.25 sec    ratio:  1.0

 reading orbital data $scfmo  from file mos .

 orbital characterization : scfconv=9
 
     time in RI-CPHF prep  cpu:  5.27 sec    wall:  5.47 sec    ratio:  1.0
error in gradient step (1)

Many thanks,

Mike

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Hi Mike,

The "error in gradient step (1)" is definitely one of the nasty ones to debug. Your report is a very good one, that helps. Here are few thoughts:

1) In your job script turboGeomOpt.pbs you define the memory as:
#PBS -lmem=24GB

I'm not familiar with all PBS variants, but doesn't this option reserve the "Maximum physical memory used by all processes"? Since every ricc2_mpi process may use as much as $maxcor, your four processes can end up allocating 4 * 16 GB = 64 GB. This may hit the PBS memory limit (see the sketch below point 3).

2) For MPI processes setting the stacksize can be rather nasty if the queue system starts the processes via SSH. Please see http://www.turbo-forum.com/index.php/topic,23.0.html (in particular the last section on parallel runs)

3) If the above solutions don't help, you could try running the job with the SMP parallel version (OpenMP): http://www.turbomole-gmbh.com/manuals/version_6_5/Documentation_html/node26.html
In this case the $maxcor keyword gives the actual total memory allocated by ricc2. The OpenMP version is not usually as efficient as the MPI version, though.
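
To illustrate points 1) and 3), here is a rough sketch of the kind of settings I have in mind. The resource names (pmem etc.) depend on your PBS variant and the numbers are only illustrative, so treat this as a hypothetical example rather than a recipe:

   # MPI run: $maxcor is allocated per ricc2_mpi process, so the per-process
   # PBS limit (pmem on many Torque/PBS setups) must cover $maxcor plus overhead
   #PBS -l nodes=1:ppn=4
   #PBS -l pmem=6gb              # e.g. with $maxcor 4000 (MB) per process
   export PARA_ARCH=MPI
   export PARNODES=4

   # OpenMP (SMP) run: here $maxcor is the total memory allocated by ricc2
   export PARA_ARCH=SMP
   export PARNODES=4             # number of threads (or set OMP_NUM_THREADS)
   export PATH=$TURBODIR/bin/`sysname`:$PATH   # picks up the _smp binaries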

Hope this helps,
Antti

mike

  • Jr. Member
  • **
  • Posts: 10
  • Karma: +0/-0
Hi Antti,

Many thanks for the suggestions. You're right, I was mixing up the way memory settings work for MPI and OpenMP.

However, unfortunately this does not seem to be the problem: when I reran the MPI calculation with $maxcor set to 4000, the job failed in the same way.

I also checked the stack size in the way recommended for MPI jobs ("ssh <hostname> ulimit -a"), which showed that stack size, max memory size, file size, and virtual memory size are all set to unlimited.

Finally, I also tried the SMP (OpenMP) RICC2 version by setting PARA_ARCH=SMP and $maxcor to 16000. This also failed in the same way, with "error in gradient step".

Any more ideas/suggestions would be much appreciated! In the meantime, I'll see if a serial run will work.

Many thanks,

Mike

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Hi Mike,

Seems like a really nasty case. Did you confirm that the PARA_ARCH=SMP job actually used the OpenMP binaries? Just to make sure: ricc2 should report at the beginning of the output that it is "using 4 threads" or something like that. The serial version is definitely also worth testing, although the OpenMP version is already closely related to the serial version, which suggests that the serial version might crash, too. If the serial version crashes as well, I recommend contacting Turbomole support.
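
One quick way to double-check which binary gets picked up before submitting (the paths here just follow the usual installation layout, so adjust as needed):

   export PARA_ARCH=SMP
   export PATH=$TURBODIR/bin/`sysname`:$PATH
   which ricc2        # should point to the ..._smp directory of your installation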

Antti

mike

  • Jr. Member
  • **
  • Posts: 10
  • Karma: +0/-0
Hi Antti,

I checked that the SMP binaries were indeed being used - the job.last contained "OpenMP run-time library returned nthreads =  4" and the PBS standard error file indicated that the em64t-unknown-linux-gnu_smp/ricc2 binary was being used.

Alas, the serial job also failed.

Many thanks for your help, I'll take the issue up with Turbomole support.

Mike