Topic: relax step ended abnormally

YangSiYa

  • Newbie
  • *
  • Posts: 5
  • Karma: +0/-0
relax step ended abnormally
« on: January 21, 2009, 04:20:32 PM »
 My system has 50 atoms, the method is RI-MP2/TZVPP (rimp2), and I am running on 4 processes.
How should I set the memory and disk size? Or is there another way to get the job to run properly on my cluster?

In addition, I get the following errors in job.1:

 upper limit for coordinate changes =   0.3000
 interpolation/extrapolation has been enabled
 display optimization statistics for the last   5 cycles

 ------------------------------------------------------------------------------

     relaxation of NUCLEAR COORDINATES in cartesian space

 ------------------------------------------------------------------------------

 reading data block $coord from file <coord>


 <getgrd> : data group $grad  is missing


 cannot find any information which may be used to optimize geometry ...


 MODTRACE: no modules on stack

  so long GRANAT !
 relax ended abnormally
relax step ended abnormally
next step = relax
     

The run then stops with "GEO_OPT_FAILED".

Can you help me?

Thank you.

scope

  • Jr. Member
  • **
  • Posts: 18
  • Karma: +0/-0
Re: relax step ended abnormally
« Reply #1 on: January 26, 2009, 10:04:30 AM »
Hello,
I'm having the same problem with parallel DFT (without RI). I can successfully run smaller parallel jobs with the same basis set, functional and scheduler script. This one is rather large, and I cannot run it in serial mode to check whether the problem lies in the job itself. The limits on the nodes look all right:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 65536
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The system is running on Rocks 5.1 (basically a clone of RHEL5.1), using gridengine 6.1 as scheduler, Kernel 2.6.18-92.1.13.el5.
TM Version is 5.10.

------------------
tail job.last:

            scf.post             9.1    0.01             9.1    0.01
        dscf.postscf        129350.6  100.00    1232845849.4  ******

fine, there is no data group "$actual step"
next step = grad
------------------
tail job.1:


 cannot find any information which may be used to optimize geometry ...


 MODTRACE: no modules on stack

  so long GRANAT !
 relax ended abnormally
relax step ended abnormally
next step = relax
------------------

I'm running with MPI flags to reduce the CPU load of the server process: "-e MPI_FLAGS=y0 -np 1" (there is another thread about this).
The system is rather large:
 
  total number of primitive shells          :  265
   total number of contracted shells         :  762
   total number of cartesian basis functions : 2893
   total number of SCF-basis functions       : 2438

I would be happy to supply further info or files if that helps. Any ideas?

uwe

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 491
  • Karma: +0/-0
Re: relax step ended abnormally
« Reply #2 on: January 26, 2009, 05:18:50 PM »
Hi,

There is no gradient. As far as I can see from the output you have cited, the energy step did run, but the gradient step seems to be missing.

  • Did your energy converge? Check in job.last whether "ENERGY CONVERGED" is printed after the SCF iterations.
  • job.1 should start with the gradient calculation. Could you scan the output from the beginning for an error message? The last error message comes from relax, telling you that you have no gradients, but there must be some output from the gradient step itself.
  • Is there anything written in the GEO_OPT_FAILED file?
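
The three checks above can be run from the job directory roughly as follows. This is only a sketch: the function name `check_jobex_run` is made up for illustration, and the grep pattern for "suspicious" lines is a guess at what to look for, not an exhaustive list of Turbomole error messages.

```shell
# Hypothetical helper: run the three checks in a jobex working directory.
check_jobex_run() {
    dir="${1:-.}"

    # 1. Did the SCF converge? "ENERGY CONVERGED" should appear in job.last.
    if grep -q "ENERGY CONVERGED" "$dir/job.last" 2>/dev/null; then
        echo "energy: converged"
    else
        echo "energy: NOT converged (or job.last missing)"
    fi

    # 2. First lines in job.1 that look like errors (pattern is an assumption).
    echo "first suspicious lines in job.1:"
    grep -n -i -E "abnormal|error|fail" "$dir/job.1" 2>/dev/null | head -n 5 || true

    # 3. Contents of GEO_OPT_FAILED, if the file exists and is not empty.
    if [ -s "$dir/GEO_OPT_FAILED" ]; then
        echo "GEO_OPT_FAILED says:"
        cat "$dir/GEO_OPT_FAILED"
    fi
}
```

Calling `check_jobex_run .` in the directory that holds job.1 and job.last prints a short summary; the line numbers from step 2 show where to start reading job.1.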

Regards,

Uwe

scope

Re: relax step ended abnormally
« Reply #3 on: January 27, 2009, 09:38:46 AM »
Hi Uwe,
job.last:
ENERGY CONVERGED !


                                              current damping :  0.500
 ITERATION  ENERGY          1e-ENERGY        2e-ENERGY     NORM[dD(SAO)]  TOL
  14  -3705.2911461897    -22510.944136     10127.467691    0.155D+00 0.329D-11
                            Exc =  -409.639318637214     N = 388.00007566   
          max. resid. norm for Fia-block=  3.566D-05 for orbital     54a       
 
          max. resid. fock norm         =  1.564D-03 for orbital    870a       
 

 convergence criteria satisfied after 14 iterations
-------------------------------------------
in job.1 I found errors:

 <rddim> : input of entry tasksize
           from data group '$pardft' failed !

               Default values taken

 <rddim> : input of entry memdiv
           from data group '$pardft' failed !

               Default values taken
  DSCF: memory allocation for DFT gridpoints
  MEMORY is divided by 1 as DEFAULT
 Each node can hold at most  the             1 -th part
 of the gridpoints   
------------------------
and later on:
   disc space (kbyte) allocation for  1 2e-integral file(s) :
   --------------------------------------------------------
   /tmp/twoint_ntrapp.compute-1-2.local.out1              0
   --------------------------------------------------------

 maximum number of buffers which may be written onto 2e-file(s) =         0


  STARTING INTEGRAL EVALUATION FOR 1st SCF ITERATION
  time elapsed for pre-SCF steps : cpu           54.183 sec
                                   wall          54.183 sec

 WARNING: no static tasks at all
 continue fully direct - check size of twoint files
 WARNING: no static task received, twoint file is empty - input too small or
          statistics run missing:  this client continues in direct mode
--------------------------------------
... and still later on:
 could not remove file /tmp/twoint_ntrapp.compute-1-2.local.out1

 bye bye cruel world ....
 PARALLEL run finished by another client


As far as I can tell, the statistics run was completed. It determines 0 kb file space for twoint. However, there is a warning in dscf.statistics.parallel:


   after task assignment :
         resulting number of tasks =       765
                     dynamic tasks =       765
             number of shell pairs =    290703

      WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
      WARNING                                                 WARNING
      WARNING   for data group $2e-ints_shell_statistics an   WARNING
      WARNING external file reference is highly recommended ! WARNING
      WARNING   writing data onto default file 'metastase'    WARNING
      WARNING                                                 WARNING
      WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
--------------------------------
The grad.statistics.parallel file contains an error at the very end:
   ****  grad : all done  ****


    2009-01-25 01:23:41.697


 Deallocation failure for xaij in <grad>


 MODTRACE: no modules on stack

  abnormal termination
 grad ended abnormally

---------------------
GEO_OPT_FAILED:
ERROR: Module relax failed to run properly - please check output job.1 and job.last for the reason


HTH and cheers,
Nils

scope

Re: relax step ended abnormally
« Reply #4 on: January 30, 2009, 02:46:17 PM »
Just a question: How does parallel Turbomole handle twoint files? All slaves run on the same machine (with 8 cores). Is it possible that the errors are related to this?
I change the standard entry in $control to something like this:

$scfintunit
 unit=30       size=1        file=/tmp/twoint_ntrapp

so I don't collide with other users on the same machine. Then I start the run with:
export PARA_ARCH=MPI
export MPIRUN_OPTIONS="-intra=nic -e MPI_FLAGS=y0"
export PATH=$TURBODIR/bin/`sysname`:$PATH
export HOSTS_FILE=$TMPDIR/machines
export PARNODES=8 
nohup jobex


The slaveX.output files contain this:
WARNING: no static tasks at all
 continue fully direct - check size of twoint files
 WARNING: no static task received, twoint file is empty - input too small or
          statistics run missing:  this client continues in direct mode

Also, smaller jobs run without problems, although the program warns about the twoint file there as well.
In other words: Does Turbomole need one twoint file for each slave process? If so, is this the cause of the error, and can I do something about it? If I run across nodes, must the twoint file be readable from every node, or is it local?

uwe

Re: relax step ended abnormally
« Reply #5 on: January 30, 2009, 08:19:00 PM »
Hi,

There are a lot of warnings that may look more alarming than they should. Everything is o.k. with your calculation; the binaries just want to hint that a fully dynamic task distribution might be less efficient than a partly static one. That is not true for SMP systems, though, and Turbomole 6.0 will use a fully dynamic task distribution by default.

The only error, as far as I can see, is:

Deallocation failure for xaij in <grad>

Is this Turbomole 5.10? The gradient did run to the end; you just get an error from freeing memory that was never allocated. This should not be fatal, but evidently grad or rdgrad terminates as if something were wrong. I would wait for Turbomole 6.0 and check whether the error is still there.

Quote
Does Turbomole need one twoint file for each slave process ? If so, is this the cause of the error and can I do something about it ? If I run across nodes, must the twoint file be readable anywhere or is it local ?

twoint is a local scratch file for each of the slaves. It should be placed on local scratch disks, and if you have just an SMP system, it is recommended to set the size to zero. A twoint file only speeds up the calculation if you have either very fast disks (RAID), or if you run just one client per node, so that two clients do not have to wait for the I/O of the other client on the same node...
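
As a rough illustration of setting the size to zero: the sketch below prints a copy of a control file in which the `size=` entry of the `$scfintunit` data group is replaced by `size=0`. The function name `twoint_size_zero` is made up for this example, and it assumes the entry has the `unit=... size=... file=...` form quoted earlier in the thread.

```shell
# Hypothetical helper: print a control file with the twoint size set to 0.
twoint_size_zero() {
    awk '
        /^\$/   { in_unit = ($1 == "$scfintunit") }   # track the current data group
        in_unit { sub(/size=[0-9]+/, "size=0") }      # only touch $scfintunit lines
                { print }
    ' "$1"
}
```

One cautious way to use it is `cp control control.bak && twoint_size_zero control.bak > control`, so the original file is kept as a backup.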

Hope this helps,

Uwe

scope

Re: relax step ended abnormally
« Reply #6 on: February 20, 2009, 09:38:43 AM »
Hi,
I tried again with TM6.0. Still getting errors, but now they are different because statpt is used. On stdout:

 dscf ended normally
0 -------------------------------------FORTRAN server ends
 dscf ended normally
 dscf ended normally
 dscf ended normally
OPTIMIZATION CYCLE 1
 grad ended normally
mpirun: Cannot allocate job ID: No such file or directory
 statpt ended abnormally
program stopped.
 statpt ended abnormally
program stopped.
run statpt for a cartesian step
 statpt ended abnormally
program stopped.
ERROR: Module statpt failed to run properly - please check output job.1 and job.last for the reason

in the end of job.1:

 <getgrd> : data group $grad  is missing

 
 MODTRACE: no modules on stack

 error reading energy and gradient in rdcor
 statpt ended abnormally
statpt step ended abnormally
next step = statpt


Should I try to use Cartesian coordinates? It seems the deallocation error is gone.

uwe

Re: relax step ended abnormally
« Reply #7 on: February 25, 2009, 03:24:09 PM »
Hi,

The gradient part did not work, so you have no gradients. The single line 'grad ended normally' comes from the statistics run that is performed (with the serial binary) to generate the task distribution.
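
Whether a gradient was actually written can also be checked directly. A minimal sketch, assuming the usual layout in which `control` holds a reference like `$grad file=gradient` and the data itself lives in a file named `gradient`; the function name `has_gradient` is made up for illustration.

```shell
# Hypothetical helper: is there a $grad data group for relax/statpt to read?
# Assumes the gradient is stored in the file "gradient" or directly in "control".
has_gradient() {
    dir="${1:-.}"
    for f in gradient control; do
        # A "$grad" keyword line that is NOT just a "file=" reference.
        if grep -E '^\$grad([[:space:]]|$)' "$dir/$f" 2>/dev/null | grep -v -q 'file='; then
            echo "found \$grad data in $f"
            return 0
        fi
    done
    echo "no \$grad data found"
    return 1
}
```

If this reports no data after the gradient step, relax or statpt will stop with the "data group $grad is missing" message shown above.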

What confuses me is the error message 'No such file or directory'. mpirun does not seem to find the parallel grad_mpi binary. Did you check whether dscf_mpi and grad_mpi in $TURBODIR/bin/`sysname`_mpi/ have different execute or read permissions?
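
A quick way to compare the two binaries, as a sketch only (the function name `check_mpi_binaries` is invented for this example; the directory is the one named above):

```shell
# Hypothetical helper: report whether the parallel binaries exist and are executable.
check_mpi_binaries() {
    bindir="$1"
    for exe in dscf_mpi grad_mpi; do
        if [ -x "$bindir/$exe" ]; then
            echo "$exe: ok"
        elif [ -e "$bindir/$exe" ]; then
            echo "$exe: present but not executable"
        else
            echo "$exe: missing"
        fi
    done
}
```

For example, `check_mpi_binaries "$TURBODIR/bin/$(sysname)_mpi"`, followed by `ls -l` on the same directory to see the full permission bits. Since the check depends on the user and node it runs on, it is most meaningful when run on the compute node as the user who submits the job.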

I guess you will have to send this to Turbomole support, since it looks like a problem with the installation or the system settings.

Regards,

Uwe

scope

Re: relax step ended abnormally
« Reply #8 on: February 25, 2009, 04:23:29 PM »
Hi Uwe,
That does not seem to be the problem. I can run a smaller test job without any problems, using the same scheduler, scheduler script and machine.

Cheers
nils