Installation and usage of TURBOMOLE => Parallel Runs => Topic started by: YangSiYa on January 21, 2009, 04:20:32 PM
Title: relax step ended abnormally
Post by: YangSiYa on January 21, 2009, 04:20:32 PM
My system contains 50 atoms, the calculation method is RI-MP2/TZVPP, and the number of processes is 4. How can I set my computer's memory and disk size? Or is there any other way to make my cluster run normally?
In addition,I am confronted with these problems in job.1:
upper limit for coordinate changes = 0.3000
interpolation/extrapolation has been enabled
display optimization statistics for the last 5 cycles
cannot find any information which may be used to optimize geometry ...
MODTRACE: no modules on stack
so long GRANAT ! relax ended abnormally
relax step ended abnormally
next step = relax
Then, the system warns " GEO_OPT_FAILED ".
Can you help me?
Thank you.
Title: Re: relax step ended abnormally
Post by: scope on January 26, 2009, 10:04:30 AM
Hello, I'm having the same problem with parallel DFT (without RI). I can successfully run smaller parallel jobs with the same basis set, functional and scheduler script. This one is rather large and I cannot run it in serial mode to check if it's somehow a problem of the job itself. The limits on the nodes look alright:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 65536
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
The system is running on Rocks 5.1 (basically a clone of RHEL5.1), using gridengine 6.1 as scheduler, Kernel 2.6.18-92.1.13.el5. TM Version is 5.10.
fine, there is no data group "$actual step"
next step = grad
------------------
tail job.1:
cannot find any information which may be used to optimize geometry ...
MODTRACE: no modules on stack
so long GRANAT !
relax ended abnormally
relax step ended abnormally
next step = relax
------------------
I'm running with MPI flags to reduce the CPU load of the server process: "-e MPI_FLAGS=y0 -np 1" (there is some other thread about this). The system is rather large:
total number of primitive shells          :  265
total number of contracted shells         :  762
total number of cartesian basis functions : 2893
total number of SCF-basis functions       : 2438
I would be happy to supply further info or files if that helps. Any ideas?
Title: Re: relax step ended abnormally
Post by: uwe on January 26, 2009, 05:18:50 PM
Hi,
there is no gradient. As far as I can see from the output you have cited, the energy did run, but the gradient step seems to be missing.
Did your energy converge? Check in the output of job.last whether "ENERGY CONVERGED" is printed after the SCF iterations.
job.1 should start with the gradient calculation. Could you scan the output from the beginning for an error message? The last error message comes from relax telling you that you have no gradients - but there must be some output of the gradient step itself.
is there anything written in the GEO_OPT_FAILED file?
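The checks above can be run from the jobex working directory. Here is a small helper that bundles them; the function name and messages are my own sketch, not a TURBOMOLE tool, and it assumes the standard jobex file names (job.last, job.1, GEO_OPT_FAILED):

```shell
# check_jobex: summarize why a jobex geometry optimization stopped
# (hypothetical helper, based on the diagnostic steps suggested above)
check_jobex() {
  # 1. SCF convergence is reported in job.last
  if grep -q "ENERGY CONVERGED" job.last 2>/dev/null; then
    echo "SCF converged"
  else
    echo "SCF did NOT converge - check job.last"
  fi
  # 2. first error messages of the gradient step, if any, appear in job.1
  grep -i "error" job.1 2>/dev/null | head -n 3
  # 3. jobex writes the failure reason to GEO_OPT_FAILED
  if [ -s GEO_OPT_FAILED ]; then
    echo "jobex reported:"
    cat GEO_OPT_FAILED
  fi
}
# typical use, inside the jobex run directory:
#   check_jobex
```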
Regards,
Uwe
Title: Re: relax step ended abnormally
Post by: scope on January 27, 2009, 09:38:46 AM
Hi Uwe,
job.last:
ENERGY CONVERGED !
current damping :  0.500
 ITERATION  ENERGY           1e-ENERGY       2e-ENERGY      NORM[dD(SAO)]  TOL
   14  -3705.2911461897   -22510.944136    10127.467691    0.155D+00   0.329D-11
                          Exc = -409.639318637214     N = 388.00007566
 max. resid. norm for Fia-block =  3.566D-05 for orbital  54a
 max. resid. fock norm          =  1.564D-03 for orbital 870a
convergence criteria satisfied after 14 iterations
-------------------------------------------
In job.1 I found errors:
<rddim> : input of entry tasksize from data group '$pardft' failed !
Default values taken
<rddim> : input of entry memdiv from data group '$pardft' failed !
Default values taken
DSCF: memory allocation for DFT gridpoints
MEMORY is divided by 1 as DEFAULT
Each node can hold at most the 1 -th part of the gridpoints
------------------------
and later on:
disc space (kbyte) allocation for 1 2e-integral file(s) :
--------------------------------------------------------
/tmp/twoint_ntrapp.compute-1-2.local.out1          0
--------------------------------------------------------
maximum number of buffers which may be written onto 2e-file(s) = 0
STARTING INTEGRAL EVALUATION FOR 1st SCF ITERATION
time elapsed for pre-SCF steps : cpu 54.183 sec  wall 54.183 sec
WARNING: no static tasks at all
continue fully direct - check size of twoint files
WARNING: no static task received, twoint file is empty
- input too small or statistics run missing:
this client continues in direct mode
--------------------------------------
....and still later on:
could not remove file /tmp/twoint_ntrapp.compute-1-2.local.out1
bye bye cruel world ....
PARALLEL run finished by another client
As far as I can tell, the statistics run was completed. It determines 0 kb file space for twoint. However, there is a warning in dscf.statistics.parallel:
after task assignment : resulting number of tasks = 765
dynamic tasks = 765
number of shell pairs = 290703
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
WARNING  for data group $2e-ints_shell_statistics an                    WARNING
WARNING  external file reference is highly recommended !                WARNING
WARNING  writing data onto default file 'metastase'                     WARNING
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
--------------------------------
The grad.statistics.parallel file contains an error at the very end:
**** grad : all done ****
2009-01-25 01:23:41.697
Deallocation failure for xaij in <grad>
MODTRACE: no modules on stack
abnormal termination
grad ended abnormally
---------------------
GEO_OPT_FAILED:
ERROR: Module relax failed to run properly - please check output job.1 and job.last for the reason
HTH and cheers, Nils
Title: Re: relax step ended abnormally
Post by: scope on January 30, 2009, 02:46:17 PM
Just a question: how does parallel Turbomole handle twoint files? All slaves run on the same machine (with 8 cores). Is it possible that the errors are related to this? I changed the standard twoint entry in $control so I don't collide with other users on the same machine. Then I start the run with:
export PARA_ARCH=MPI
export MPIRUN_OPTIONS="-intra=nic -e MPI_FLAGS=y0"
export PATH=$TURBODIR/bin/`sysname`:$PATH
export HOSTS_FILE=$TMPDIR/machines
export PARNODES=8
nohup jobex
The slaveX.output files contain this:
WARNING: no static tasks at all
continue fully direct - check size of twoint files
WARNING: no static task received, twoint file is empty - input too small or statistics run missing: this client continues in direct mode
And: smaller jobs run without problems, although the program also warns about the twoint file. In other words: does Turbomole need one twoint file for each slave process? If so, is this the cause of the error, and can I do something about it? If I run across nodes, must the twoint file be readable everywhere, or is it local?
Title: Re: relax step ended abnormally
Post by: uwe on January 30, 2009, 08:19:00 PM
Hi,
there are a lot of warnings which might look a bit more alarming than they should. Everything is O.K. with your calculation; the binaries just want to give a hint that a fully dynamic task distribution might be less efficient than a partly static one. But this is not true for SMP systems, and Turbomole 6.0 will use a fully dynamic task distribution by default.
The only error, as far as I can see, is:
Deallocation failure for xaij in <grad>
Is this Turbomole 5.10? The gradient did run to the end; you just get an error from freeing some memory which had not been allocated. This should not result in an error, but obviously grad or rdgrad exits as if something were wrong. I would wait for Turbomole 6.0 and check if the error is still there.
Quote
Does Turbomole need one twoint file for each slave process ? If so, is this the cause of the error and can I do something about it ? If I run across nodes, must the twoint file be readable anywhere or is it local ?
twoint is a local scratch file for each of the slaves. It should be placed on local scratch disks, and if you have just an SMP system, it is recommended to set its size to zero. A twoint file only speeds up the calculation if you either have very fast disks (RAID) or run just one client per node, so that two clients do not have to wait for each other's I/O on the same node...
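For reference, the control-file data group that holds the twoint settings is $scfintunit; a sketch with the size set to zero, as suggested above (the unit number and file path here are just examples, check what your own control file contains):

```
$scfintunit
 unit=30   size=0   file=twoint
```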
Hope this helps,
Uwe
Title: Re: relax step ended abnormally
Post by: scope on February 20, 2009, 09:38:43 AM
Hi, I tried again with TM6.0. Still getting errors, but now they are different because statpt is used. On stdout:
dscf ended normally
 0
-------------------------------------
FORTRAN server ends
dscf ended normally
dscf ended normally
dscf ended normally
OPTIMIZATION CYCLE 1
grad ended normally
mpirun: Cannot allocate job ID: No such file or directory
statpt ended abnormally
program stopped.
statpt ended abnormally
program stopped.
run statpt for a cartesian step
statpt ended abnormally
program stopped.
ERROR: Module statpt failed to run properly - please check output job.1 and job.last for the reason
at the end of job.1:
<getgrd> : data group $grad is missing
MODTRACE: no modules on stack
error reading energy and gradient in rdcor
statpt ended abnormally
statpt step ended abnormally
next step = statpt
Should I try to use cartesian coordinates? It seems the deallocation error is gone.
Title: Re: relax step ended abnormally
Post by: uwe on February 25, 2009, 03:24:09 PM
Hi,
the gradient part did not work, so you have no gradients. The single line 'grad ended normally' comes from the statistics run that is performed (with the serial binary) to generate the task distribution.
What confuses me is the error message 'No such file or directory'. mpirun does not seem to find the parallel grad_mpi binary. Did you check the permissions in $TURBODIR/bin/`sysname`_mpi/ to see whether dscf_mpi and grad_mpi have different execution or read permissions?
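The permission check can be scripted. This is my own sketch, not a TURBOMOLE tool; the directory argument would be $TURBODIR/bin/`sysname`_mpi in a real installation:

```shell
# check_mpi_binaries: verify that the parallel binaries jobex needs
# exist and are executable in the given directory (hypothetical helper)
check_mpi_binaries() {
  dir="$1"
  for prog in dscf_mpi grad_mpi; do
    if [ -x "$dir/$prog" ]; then
      echo "$prog: ok"
    else
      echo "$prog: missing or not executable"
    fi
  done
}
# typical call:
#   check_mpi_binaries "$TURBODIR/bin/`sysname`_mpi"
```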
I guess that you have to send that to the Turbomole support, since this looks like a problem with installation or system settings.
Regards,
Uwe
Title: Re: relax step ended abnormally
Post by: scope on February 25, 2009, 04:23:29 PM
Hi Uwe, that does not seem to be the problem. I can run a smaller test job without any problems, using the same scheduler, the same scheduler script, and the same machine.