Author Topic: Running v6.0 on an SGE Cluster  (Read 14857 times)

Jerry

  • Jr. Member
  • **
  • Posts: 16
  • Karma: +0/-0
Running v6.0 on an SGE Cluster
« on: March 31, 2009, 08:54:39 PM »
Hi,
I'm trying to run TM v6.0 jobs on our SGE cluster.  My input script is as follows:
------------------------
#!/bin/csh

setenv TURBODIR /cluster/home/tanoury/TURBOMOLE
set path=($TURBODIR/scripts $path)
set path=($TURBODIR/bin/`sysname` $path)
setenv TURBOTMPDIR /scratch

setenv PARA_ARCH MPI
setenv PARNODES 16
limit stacksize unlimited

#$ -pe mpich1 17
#$ -r n
#$ -cwd
#$ -o out_file
#$ -e error_file
#$ -V

# collect the node hostnames (lines containing "hpc") into a one-host-per-line machines file
grep hpc out_file | grep -v /hpc > machines

jobex -ri -c 400 > jobex.out
-----------------------------
16 slave outputs are listed in my job directory, but 18 CPUs are being used on the cluster.  The number of CPUs running on the node is always (the -pe slot count + 1), not (PARNODES + 1).  Does anyone have experience with this?  Is there a better script than the one I am using? (I'm a novice at writing these SGE scripts.)

Thanks,
Jerry

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Running v6.0 on an SGE Cluster
« Reply #1 on: April 01, 2009, 09:11:19 AM »
Hello,

Your script looks quite good; I think only some minor tweaking is required. It seems that you write a list of available nodes into the file "machines" but never set the Turbomole environment variable HOSTS_FILE, which tells TM about your own hostfile. The TM 6.0 mpirun_scripts contain the following section:
Code: [Select]
# check for environment variable containing hostfile name.
MACHINEFILE=""
if [ -n "${PBS_NODEFILE}" ]; then   # PBS/Torque/Maui
   MACHINEFILE="${PBS_NODEFILE}"
elif [ -n "${HOSTS_FILE}" ]; then   # manual settings
   MACHINEFILE="${HOSTS_FILE}"
elif [ -n "${TMPDIR}" -a -f "${TMPDIR}/machines" ]; then  # LSF, untested
   MACHINEFILE="${TMPDIR}/machines"
   KNOTEN=$NSLOTS
fi
and even though the script says that the last option is for "LSF", it actually works for SGE as well. So, because you have not set HOSTS_FILE, the mpirun_scripts will read the file ${TMPDIR}/machines and use the SGE variable $NSLOTS as the number of computing processes. Furthermore, the mpirun_scripts add one extra CPU for the dscf/grad/ridft/rdgrad server process:
Code: [Select]
if [ "${PARA_ARCH}" = "MPI" ] ; then
  KNOTEN=`expr $KNOTEN \+ 1`
fi
So, since your script uses the SGE setting "#$ -pe mpich1 17", you end up with 18 processes: 17 computing processes plus one server process. The server process should not consume much CPU, but as we have already discussed in another thread, in TM 6.0 it actually does.

My suggestions:
1) Define the environment variable HOSTS_FILE in your script after setting PARA_ARCH and PARNODES:
Code: [Select]
  setenv HOSTS_FILE machines
Note that in the case of SGE you should also be able to find the list of available nodes in the file "$TMPDIR/machines". With HOSTS_FILE set, Turbomole should respect your PARNODES setting.
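
If your parallel environment does not already leave a usable machines file behind, you can build one from $PE_HOSTFILE, the standard SGE variable pointing to a file that lists "hostname slots queue range" for each granted node. A minimal csh sketch (the awk expansion is my assumption about what your PE provides; adjust to your site):
Code: [Select]
  # expand each SGE host entry ("hostname slots ...") to one line per CPU slot
  awk '{ for (i = 0; i < $2; i++) print $1 }' $PE_HOSTFILE > machines
  setenv HOSTS_FILE $PWD/machines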

2) For clarity, I suggest moving all SGE directives (#$) to the beginning of the file, right after the "#!/bin/csh" line.

Hope this helps,
Antti
« Last Edit: April 01, 2009, 09:14:53 AM by antti_karttunen »

himansu

  • Newbie
  • *
  • Posts: 4
  • Karma: +0/-0
Re: Running v6.0 on an SGE Cluster
« Reply #2 on: December 21, 2012, 05:33:24 AM »
Hello,
I am facing a problem running a test job with a recently installed TURBOMOLE V6.4 on an SGE cluster. While submitting the job script, I get the following error message (from job.last):

<getgrd> : data group $grad  is missing

  MODTRACE: no modules on stack
 error reading energy and gradient in rdcor
 statpt ended abnormally
statpt step ended abnormally
next step = statpt

Here is my script

#!/bin/csh

#$ -N Turbo
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -V
#$ -q long_prl.q

set JOB=H2O
#$ -e error.$JOB_ID.$JOB_NAME
#$ -o H2O.$JOB_NAME
#$ -P himansu_prj
#$ -pe intelmpi 8
#$ -v I_MPI_MPD_TMPDIR=/tmp

set workdir=$PWD
set scratch=$HOME/lustre/scratch/$JOB_ID
echo $scratch
mkdir -p $scratch
lfs setstripe -s 32M -c 14 -i -1 $scratch
#cd $scratch
#cp -a $workdir/new.nw $scratch

setenv I_MPI_MPD_TMPDIR /tmp

setenv LD_LIBRARY_PATH /opt/intel/mkl/10.2.0.013/lib/em64t
setenv LD_LIBRARY_PATH /opt/intel/impi/3.2.1.009/lib64:$PATH

setenv TURBODIR /opt/intel/apps/TURBOMOLE
setenv TURBOTMPDIR /scratch

##### Parallel job
# Set environment variables for an MPI job

setenv PARA_ARCH MPI
setenv PARNODES 8
setenv HOSTS_FILE machines
limit stacksize unlimited

jobex -c 500 -energy 6 -gcart 3

echo "Job finished at: `date`"

I would be very thankful if somebody could help me out.
Thanks,
Himansu

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Running v6.0 on an SGE Cluster
« Reply #3 on: December 21, 2012, 08:05:14 AM »
Hi,

it looks like statpt cannot read the energy and gradient data it needs for updating the geometry. So, probably ridft/rdgrad (or dscf/grad) has failed. Please check the contents of GEO_OPT_FAILED and also go through job.last and try to find the error message from the modules that calculate the energy and gradient.
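
For example, a quick way to hunt for the first real error message (just a sketch; the file names are the standard jobex outputs that already appear in your log):
Code: [Select]
cat GEO_OPT_FAILED                  # jobex records the failing step here
grep -iE 'error|abnormal' job.last  # locate the first module that died
cat slave1.output                   # per-slave output of the parallel run, if it exists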

Regards,
Antti

himansu

  • Newbie
  • *
  • Posts: 4
  • Karma: +0/-0
Re: Running v6.0 on an SGE Cluster
« Reply #4 on: December 21, 2012, 08:17:17 AM »
Thanks a lot for your reply. I am posting the control file as well as the job.last file here; I hope you can suggest something.

control file
$title
$operating system unix
$symmetry cs
$redundant    file=coord
$coord    file=coord
$user-defined bonds    file=coord
$atoms
o  1                                                                           \
   basis =o TZVPP                                                              \
   jbas  =o TZVPP
h  2-3                                                                         \
   basis =h TZVPP                                                              \
   jbas  =h TZVPP
$basis    file=basis
$rundimensions
   dim(fock,dens)=2337
   natoms=3
   nshell=23
   nbf(CAO)=66
   nbf(AO)=59
   dim(trafo[SAO<-->AO/CAO])=80
   rhfshells=1
$scfmo   file=mos
$closed shells
 a'      1-4                                    ( 2 )
 a"      1                                      ( 2 )
$scfiterlimit       30
$thize     0.10000000E-04
$thime        5
$scfdump
$scfintunit
 unit=30       size=0        file=twoint
$scfdiis
$scforbitalshift  automatic=.1
$drvopt
   cartesian  on
   basis      off
   global     off
   hessian    on
   dipole     on
   nuclear polarizability
$interconversion  off
   qconv=1.d-7
   maxiter=25
$optimize
   internal   on
   redundant  on
   cartesian  off
   global     off
   basis      off   logarithm
$coordinateupdate
   dqmax=0.3
   interpolate  on
   statistics    5
$forceupdate
   ahlrichs numgeo=0  mingeo=3 maxgeo=4 modus=<g|dq> dynamic fail=0.3
   threig=0.005  reseig=0.005  thrbig=3.0  scale=1.00  damping=0.0
$forceinit on
   diag=default
$energy    file=energy
$grad    file=gradient
$forceapprox    file=forceapprox
$lock off
$dft
   functional b97-d
   gridsize   m3
$scfconv   6
$scfdamp   start=0.700  step=0.050  min=0.050
$ricore     3000
$rij
$jbas    file=auxbasis
$actual step      statpt
$end
$TMPDIR /scratch

job.last

Environment variable MPI_ROOT could not be set to a valid path!
TURBOTMPDIR environment variable set to "/scratch".
This directory must exist and be writable by the master process (slave1).
NOTE: the number of nodes in your machine list:
      machines
      is LOWER than the number of nodes requested
      PARNODES has been set to 7

      PARNODES will be ignored - change your machine file
      to use more nodes. Remember to add a node with several CPUs
      multiple times to your machine file (one line = one CPU)

Calculation will continue on 0 CPUs
STARTING ridft ON 0 PROCESSORS!
RUNNING PROGRAM /opt/intel/apps/TURBOMOLE/bin/em64t-sgi-linux-gnu_mpi/ridft_mpi.
PLEASE WAIT UNTIL ridft HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is machines
No file slave1.output found?
fine, there is no data group "$actual step"
script actual: unknown actual step define
next step = unknown

GEO_OPT_FAILED
ERROR: Module statpt failed to run properly - please check output job.1 and job.last for the reason

job.1
OPTIMIZATION CYCLE 1
Fri Dec 21 12:43:11 IST 2012
Environment variable MPI_ROOT could not be set to a valid path!
NOTE: the number of nodes in your machine list:
      machines
      is LOWER than the number of nodes requested
      PARNODES has been set to 7

      PARNODES will be ignored - change your machine file
      to use more nodes. Remember to add a node with several CPUs
      multiple times to your machine file (one line = one CPU)

Calculation will continue on 0 CPUs
STARTING rdgrad ON 0 PROCESSORS!
RUNNING PROGRAM /opt/intel/apps/TURBOMOLE/bin/em64t-sgi-linux-gnu_mpi/rdgrad_mpi.
PLEASE WAIT UNTIL rdgrad HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is machines
No file slave1.output found?
fine, there is no data group "$actual step"
script actual: unknown actual step define
next step = unknown
 operating system is UNIX !
 hostname is         r1i1n2

 statpt (r1i1n2) : TURBOMOLE V6.4 3 Apr 2012 at 16:44:13
 Copyright (C) 2012 TURBOMOLE GmbH, Karlsruhe


    2012-12-21 12:43:11.865



                           this is S T A T P T   


                     hessian and coordinate update for
                          stationary point search

                     by barbara unterreiner, marek sierka,
                           and reinhart ahlrichs

                          quantum chemistry group
                          universitaet  karlsruhe
                                  germany


  Keyword $statpt not found - using default options
 
        ***************  Stationary point options ******************
        ************************************************************
           Maximum allowed trust radius:           3.000000E-01
           Minimum allowed trust radius:           1.000000E-03
           Initial trust radius:                   1.500000E-01
           GDIIS used if gradient norm <           1.000000E-02
           Number of previous steps for GDIIS:       5
           Hessian update method:                  BFGS
                        *** Convergence criteria ***               
           Threshold for energy change:            1.000000E-06
           Threshold for max displacement element: 1.000000E-03
           Threshold for max gradient element :    1.000000E-03
           Threshold for RMS of displacement:      5.000000E-04
           Threshold for RMS of gradient:          5.000000E-04
        ************************************************************
 

 <getgrd> : data group $grad  is missing

 
 MODTRACE: no modules on stack

 error reading energy and gradient in rdcor
 statpt ended abnormally
statpt step ended abnormally
next step = statpt
 operating system is UNIX !
 hostname is         r1i1n2

 data group $actual step is not empty
 due to the abend of statpt

[the same statpt banner, default options, and "<getgrd> : data group $grad  is missing" error repeat twice more, at 12:43:11.946 and 12:43:12.025]

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Running v6.0 on an SGE Cluster
« Reply #5 on: December 21, 2012, 04:07:27 PM »
Hi,

There seem to be quite a few problems with the parallel run. Are you able to execute the job in serial mode?
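
For a quick serial check, something like this should work (a sketch; unsetting PARA_ARCH makes the Turbomole scripts pick the serial binaries):
Code: [Select]
unsetenv PARA_ARCH      # fall back to the serial binaries
ridft  > ridft.out
rdgrad > rdgrad.out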

Actually, I just noticed that you are using csh syntax in your script but tell SGE to run it with bash (#$ -S /bin/bash). This will probably cause problems.
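
The simplest fixes are, for example (a sketch, pick one):
Code: [Select]
# Option 1: keep the csh script and match the interpreter directive
#$ -S /bin/csh

# Option 2: keep "#$ -S /bin/bash" and convert the csh commands to bash syntax
export PARA_ARCH=MPI            # instead of "setenv PARA_ARCH MPI"
export PARNODES=8
export HOSTS_FILE=$PWD/machines
ulimit -s unlimited             # instead of "limit stacksize unlimited"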

Regards,
Antti