Installation and usage of TURBOMOLE => Parallel Runs => Topic started by: YangSiYa on January 21, 2009, 04:20:32 PM
Title: relax step ended abnormally
Post by: YangSiYa on January 21, 2009, 04:20:32 PM
My system contains 50 atoms, the calculation method is RI-MP2/TZVPP, and the number of processes is 4. How can I set my computer's memory and disk size? Or is there any other way to make my cluster run normally?
In addition,I am confronted with these problems in job.1:
upper limit for coordinate changes = 0.3000
interpolation/extrapolation has been enabled
display optimization statistics for the last 5 cycles
cannot find any information which may be used to optimize geometry ...
MODTRACE: no modules on stack
so long GRANAT ! relax ended abnormally
relax step ended abnormally
next step = relax
Then, the system warns " GEO_OPT_FAILED ".
Can you help me?
Thank you.
Title: Re: relax step ended abnormally
Post by: scope on January 26, 2009, 10:04:30 AM
Hello, I'm having the same problem with parallel DFT (without RI). I can successfully run smaller parallel jobs with the same basis set, functional and scheduler script. This one is rather large and I cannot run it in serial mode to check if it's somehow a problem of the job itself. The limits on the nodes look alright:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 65536
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
The system is running on Rocks 5.1 (basically a clone of RHEL5.1), using gridengine 6.1 as scheduler, Kernel 2.6.18-92.1.13.el5. TM Version is 5.10.
fine, there is no data group "$actual step"
next step = grad
------------------
tail job.1:
cannot find any information which may be used to optimize geometry ...
MODTRACE: no modules on stack
so long GRANAT !
relax ended abnormally
relax step ended abnormally
next step = relax
------------------
I'm running with MPI flags to reduce the CPU load of the server process: "-e MPI_FLAGS=y0 -np 1" (there is some other thread about this). The system is rather large:
total number of primitive shells          :  265
total number of contracted shells         :  762
total number of cartesian basis functions : 2893
total number of SCF-basis functions       : 2438
I would be happy to supply further info or files if that helps. Any ideas?
Title: Re: relax step ended abnormally
Post by: uwe on January 26, 2009, 05:18:50 PM
Hi,
there is no gradient. As far as I can see from the output you have cited, the energy did run, but the gradient step seems to be missing.
Did your energy converge? Check in the output of job.last whether "ENERGY CONVERGED" is printed after the SCF iterations.
job.1 should start with the gradient calculation. Could you scan the output from the beginning for an error message? The last error message comes from relax telling you that you have no gradients - but there must be some output of the gradient step itself.
is there anything written in the GEO_OPT_FAILED file?
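The checks above can be run from the jobex working directory. Here is a small helper that bundles them; the function name and messages are my own sketch, not a TURBOMOLE tool, and it assumes the standard jobex file names (job.last, job.1, GEO_OPT_FAILED):

```shell
# check_jobex: summarize why a jobex geometry optimization stopped
# (hypothetical helper, based on the diagnostic steps suggested above)
check_jobex() {
  # 1. SCF convergence is reported in job.last
  if grep -q "ENERGY CONVERGED" job.last 2>/dev/null; then
    echo "SCF converged"
  else
    echo "SCF did NOT converge - check job.last"
  fi
  # 2. first error messages of the gradient step, if any, appear in job.1
  grep -i "error" job.1 2>/dev/null | head -n 3
  # 3. jobex writes the failure reason to GEO_OPT_FAILED
  if [ -s GEO_OPT_FAILED ]; then
    echo "jobex reported:"
    cat GEO_OPT_FAILED
  fi
}
# typical use, inside the jobex run directory:
#   check_jobex
```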
Regards,
Uwe
Title: Re: relax step ended abnormally
Post by: scope on January 27, 2009, 09:38:46 AM
Hi Uwe,
job.last:
ENERGY CONVERGED !
current damping :  0.500
 ITERATION  ENERGY           1e-ENERGY       2e-ENERGY      NORM[dD(SAO)]  TOL
   14  -3705.2911461897   -22510.944136    10127.467691    0.155D+00   0.329D-11
                          Exc = -409.639318637214     N = 388.00007566
 max. resid. norm for Fia-block =  3.566D-05 for orbital  54a
 max. resid. fock norm          =  1.564D-03 for orbital 870a
convergence criteria satisfied after 14 iterations
-------------------------------------------
In job.1 I found errors:
<rddim> : input of entry tasksize from data group '$pardft' failed !
Default values taken
<rddim> : input of entry memdiv from data group '$pardft' failed !
Default values taken
DSCF: memory allocation for DFT gridpoints
MEMORY is divided by 1 as DEFAULT
Each node can hold at most the 1 -th part of the gridpoints
------------------------
and later on:
disc space (kbyte) allocation for 1 2e-integral file(s) :
--------------------------------------------------------
/tmp/twoint_ntrapp.compute-1-2.local.out1          0
--------------------------------------------------------
maximum number of buffers which may be written onto 2e-file(s) = 0
STARTING INTEGRAL EVALUATION FOR 1st SCF ITERATION
time elapsed for pre-SCF steps : cpu 54.183 sec  wall 54.183 sec
WARNING: no static tasks at all
continue fully direct - check size of twoint files
WARNING: no static task received, twoint file is empty
- input too small or statistics run missing:
this client continues in direct mode
--------------------------------------
....and still later on:
could not remove file /tmp/twoint_ntrapp.compute-1-2.local.out1
bye bye cruel world ....
PARALLEL run finished by another client
As far as I can tell, the statistics run was completed. It determines 0 kb file space for twoint. However, there is a warning in dscf.statistics.parallel:
after task assignment : resulting number of tasks = 765
dynamic tasks = 765
number of shell pairs = 290703
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
WARNING  for data group $2e-ints_shell_statistics an                    WARNING
WARNING  external file reference is highly recommended !                WARNING
WARNING  writing data onto default file 'metastase'                     WARNING
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
--------------------------------
The grad.statistics.parallel file contains an error at the very end:
**** grad : all done ****
2009-01-25 01:23:41.697
Deallocation failure for xaij in <grad>
MODTRACE: no modules on stack
abnormal termination
grad ended abnormally
---------------------
GEO_OPT_FAILED:
ERROR: Module relax failed to run properly - please check output job.1 and job.last for the reason
HTH and cheers, Nils
Title: Re: relax step ended abnormally
Post by: scope on January 30, 2009, 02:46:17 PM
Just a question: how does parallel Turbomole handle twoint files? All slaves run on the same machine (with 8 cores). Is it possible that the errors are related to this? I changed the standard twoint entry in $control so I don't collide with other users on the same machine. Then I start the run with:
export PARA_ARCH=MPI
export MPIRUN_OPTIONS="-intra=nic -e MPI_FLAGS=y0"
export PATH=$TURBODIR/bin/`sysname`:$PATH
export HOSTS_FILE=$TMPDIR/machines
export PARNODES=8
nohup jobex
The slaveX.output files contain this:
WARNING: no static tasks at all
continue fully direct - check size of twoint files
WARNING: no static task received, twoint file is empty - input too small or statistics run missing: this client continues in direct mode
And: smaller jobs run without problems, although the program also warns about the twoint file. In other words: does Turbomole need one twoint file for each slave process? If so, is this the cause of the error, and can I do something about it? If I run across nodes, must the twoint file be readable everywhere, or is it local?
Title: Re: relax step ended abnormally
Post by: uwe on January 30, 2009, 08:19:00 PM
Hi,
there are a lot of warnings which might look a bit more alarming than they should. Everything is O.K. with your calculation; the binaries just want to give a hint that a fully dynamic task distribution might be less efficient than a partly static one. But this is not true for SMP systems, and Turbomole 6.0 will use a fully dynamic task distribution by default.
The only error, as far as I can see, is:
Deallocation failure for xaij in <grad>
Is this Turbomole 5.10? The gradient did run to the end; you just get an error from freeing some memory which had not been allocated. This should not result in an error, but obviously grad or rdgrad exits as if something were wrong. I would wait for Turbomole 6.0 and check if the error is still there.
Quote
Does Turbomole need one twoint file for each slave process ? If so, is this the cause of the error and can I do something about it ? If I run across nodes, must the twoint file be readable anywhere or is it local ?
twoint is a local scratch file for each of the slaves. It should be placed on local scratch disks, and if you have just an SMP system, it is recommended to set its size to zero. A twoint file only speeds up the calculation if you either have very fast disks (RAID) or run just one client per node, so that two clients do not have to wait for each other's I/O on the same node...
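For reference, the control-file data group that holds the twoint settings is $scfintunit; a sketch with the size set to zero, as suggested above (the unit number and file path here are just examples, check what your own control file contains):

```
$scfintunit
 unit=30   size=0   file=twoint
```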
Hope this helps,
Uwe
Title: Re: relax step ended abnormally
Post by: scope on February 20, 2009, 09:38:43 AM
Hi, I tried again with TM6.0. Still getting errors, but now they are different because statpt is used. On stdout:
dscf ended normally
 0
-------------------------------------
FORTRAN server ends
dscf ended normally
dscf ended normally
dscf ended normally
OPTIMIZATION CYCLE 1
grad ended normally
mpirun: Cannot allocate job ID: No such file or directory
statpt ended abnormally
program stopped.
statpt ended abnormally
program stopped.
run statpt for a cartesian step
statpt ended abnormally
program stopped.
ERROR: Module statpt failed to run properly - please check output job.1 and job.last for the reason
at the end of job.1:
<getgrd> : data group $grad is missing
MODTRACE: no modules on stack
error reading energy and gradient in rdcor
statpt ended abnormally
statpt step ended abnormally
next step = statpt
Should I try to use cartesian coordinates? It seems the deallocation error is gone.
Title: Re: relax step ended abnormally
Post by: uwe on February 25, 2009, 03:24:09 PM
Hi,
the gradient part did not work, so you have no gradients. The single line 'grad ended normally' comes from the statistics run that is performed (with the serial binary) to generate the task distribution.
What confuses me is the error message 'No such file or directory'. mpirun does not seem to find the parallel grad_mpi binary. Did you check the permissions in $TURBODIR/bin/`sysname`_mpi/ to see whether dscf_mpi and grad_mpi have different execution or read permissions?
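The permission check can be scripted. This is my own sketch, not a TURBOMOLE tool; the directory argument would be $TURBODIR/bin/`sysname`_mpi in a real installation:

```shell
# check_mpi_binaries: verify that the parallel binaries jobex needs
# exist and are executable in the given directory (hypothetical helper)
check_mpi_binaries() {
  dir="$1"
  for prog in dscf_mpi grad_mpi; do
    if [ -x "$dir/$prog" ]; then
      echo "$prog: ok"
    else
      echo "$prog: missing or not executable"
    fi
  done
}
# typical call:
#   check_mpi_binaries "$TURBODIR/bin/`sysname`_mpi"
```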
I guess that you have to send that to the Turbomole support, since this looks like a problem with installation or system settings.
Regards,
Uwe
Title: Re: relax step ended abnormally
Post by: scope on February 25, 2009, 04:23:29 PM
Hi Uwe, that does not seem to be the problem. I can run a smaller test job without any problems, using the same scheduler, the same scheduler script, and the same machine.