TURBOMOLE Users Forum
TURBOMOLE Modules => Ricc2 => Topic started by: mike on April 09, 2014, 12:50:51 PM
-
Dear all,
I'm having some problems with SCS-MP2 geometry optimizations performed using the RICC2 code.
For smaller molecules, my geometry optimizations converge with no problems at all, while for larger molecules, the calculation fails with an error message "error in gradient step (1)". I've tried increasing the memory (but still below the maximum of 16GB), and checked that the stack size is unlimited and also tried using the _huge executables, but nothing has solved the problem.
Any tips/suggestions as to what to try next would be much appreciated! The contents of the calculation directory are in the attached .tgz (but without the large mos and ddens files).
Here is some of the job.last output:
======== CC DENSITY MODULE ========
current wave-function model: MP2
calculating CC ground state density
a semicanonical algorithm will be used
density nr. cpu/min wall/min L R
------------------------------------------------------
1 63.03 63.34 L0 R0
------------------------------------------------------
time in cc_1den cpu: 1 h 3 min wall: 1 h 3 min ratio: 1.0
time in lpzwei cpu: 0.43 sec wall: 0.44 sec ratio: 1.0
time in invmetri cpu: 3.24 sec wall: 3.25 sec ratio: 1.0
reading orbital data $scfmo from file mos .
orbital characterization : scfconv=9
time in RI-CPHF prep cpu: 5.27 sec wall: 5.47 sec ratio: 1.0
error in gradient step (1)
Many thanks,
Mike
-
Hi Mike,
The "error in gradient step (1)" is definitely one of the nasty ones to debug. Your report is a very good one, which helps. Here are a few thoughts:
1) In your job script turboGeomOpt.pbs you define the memory as:
#PBS -lmem=24GB
I'm not familiar with all PBS variants, but doesn't this option reserve the "Maximum physical memory used by all processes"? Since every ricc2_mpi process may allocate as much as $maxcor, your four processes can end up allocating 4 * 16 GB = 64 GB in total. This may hit the 24 GB PBS memory limit.
2) For MPI processes setting the stacksize can be rather nasty if the queue system starts the processes via SSH. Please see http://www.turbo-forum.com/index.php/topic,23.0.html (in particular the last section on parallel runs)
3) If the above solutions don't help, you could try running the job with the SMP parallel version (OpenMP): http://www.turbomole-gmbh.com/manuals/version_6_5/Documentation_html/node26.html
In this case the $maxcor keyword gives the actual total memory allocated by ricc2. The OpenMP version is not usually as efficient as the MPI version, though.
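To make the arithmetic in point 1 concrete, here is a minimal Python sketch of the worst-case check. The function names are made up for illustration, and it assumes that $maxcor is given in MB, that 1 GB = 1000 MB, and that each MPI process allocates no more than $maxcor (any memory used outside $maxcor is ignored):

```python
def total_mpi_memory_mb(n_procs: int, maxcor_mb: int) -> int:
    """Worst-case total memory in MB if every MPI process uses its full $maxcor."""
    return n_procs * maxcor_mb


def fits_in_limit(n_procs: int, maxcor_mb: int, limit_gb: int) -> bool:
    """True if the worst-case total stays within the queue memory limit.

    Assumption: 1 GB = 1000 MB, and nothing is allocated beyond $maxcor.
    """
    return total_mpi_memory_mb(n_procs, maxcor_mb) <= limit_gb * 1000


if __name__ == "__main__":
    # Four MPI processes with $maxcor 16000 against "#PBS -lmem=24GB".
    print(total_mpi_memory_mb(4, 16000))  # 64000 MB, i.e. 64 GB
    print(fits_in_limit(4, 16000, 24))    # False
    # A smaller per-process $maxcor stays below the same limit.
    print(fits_in_limit(4, 4000, 24))     # True
```

For the SMP (OpenMP) version the same check would use a single process, since there $maxcor is the total allocation.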
Hope this helps,
Antti
-
Hi Antti,
Many thanks for the suggestions. You're right, I was mixing up the way memory settings work for MPI and OpenMP.
However, unfortunately this does not seem to be the problem - when I reran the MPI calculation with $maxcor set to 4000, the job failed in the same way.
I also checked the stack size in the way recommended for MPI jobs ("ssh <hostname> ulimit -a"), which showed that stack size, max memory size, file size and virtual memory size were all set to unlimited.
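A complementary check is to query the limit from inside the job itself, since that is the environment the processes actually inherit. A minimal Python sketch follows, assuming Python is available on the compute node; the helper names are hypothetical, and the values are reported in bytes rather than the kB shown by "ulimit -s":

```python
import resource


def describe_limit(value: int) -> str:
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def stack_limits() -> tuple:
    """Return the (soft, hard) stack size limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print("stack size soft limit:", soft)
    print("stack size hard limit:", hard)
```

Running this from the job script just before the ricc2 call would show whether the soft limit seen by the job matches what the ssh test reports.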
Finally, I also tried the SMP (OpenMP) RICC2 version by setting PARA_ARCH=SMP and setting $maxcor to 16000. This also failed in the same way, with "error in gradient step".
Any more ideas/suggestions would be much appreciated! In the meantime, I'll see if a serial run will work.
Many thanks,
Mike
-
Hi Mike,
Seems like a really nasty case. Did you confirm that the PARA_ARCH=SMP job used the OpenMP binaries (ricc2 should report at the beginning of the output that it's "using 4 threads" or something like this)? Just to make sure. The serial version is definitely also worth testing, although the OpenMP version is closely related to the serial version, which suggests that the serial version might crash, too. If the serial version crashes as well, I recommend contacting Turbomole support.
Antti
-
Hi Antti,
I checked that the SMP binaries were indeed being used - the job.last contained "OpenMP run-time library returned nthreads = 4" and the PBS standard error file indicated that the em64t-unknown-linux-gnu_smp/ricc2 binary was being used.
Alas, the serial job also failed.
Many thanks for your help, I'll take the issue up with Turbomole support.
Mike