Author Topic: Running v6.0 on an SGE Cluster  (Read 14857 times)

Jerry

  • Jr. Member
  • **
  • Posts: 16
  • Karma: +0/-0
Running v6.0 on an SGE Cluster
« on: March 31, 2009, 08:54:39 PM »
Hi,
I'm trying to run TM v6.0 jobs on our SGE cluster.  My input script is as follows:
------------------------
#!/bin/csh

setenv TURBODIR /cluster/home/tanoury/TURBOMOLE
set path=($TURBODIR/scripts $path)
set path=($TURBODIR/bin/`sysname` $path)
setenv TURBOTMPDIR /scratch

setenv PARA_ARCH MPI
setenv PARNODES 16
limit stacksize unlimited

#$ -pe mpich1 17
#$ -r n
#$ -cwd
#$ -o out_file
#$ -e error_file
#$ -V

# collect the node hostnames (lines containing "hpc") into a one-host-per-line machines file
grep hpc out_file | grep -v /hpc > machines

jobex -ri -c 400 > jobex.out
-----------------------------
16 slave outputs are listed in my job directory, but 18 CPUs are being used on the cluster.  The number of CPUs running on the node is always (the -pe slot count + 1), not (PARNODES + 1).  Does anyone have experience with this?  Is there a better script than the one I am using? (I'm a novice at writing these SGE scripts.)

Thanks,
Jerry

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Running v6.0 on an SGE Cluster
« Reply #1 on: April 01, 2009, 09:11:19 AM »
Hello,

Your script looks quite good; I think only some minor tweaking is required. It seems that you write a list of available nodes into the file "machines" but never set the Turbomole environment variable HOSTS_FILE, which tells TM about your own hostfile. The TM 6.0 mpirun_scripts contain the following section:
Code: [Select]
# check for environment variable containing hostfile name.
MACHINEFILE=""
if [ -n "${PBS_NODEFILE}" ]; then   # PBS/Torque/Maui
   MACHINEFILE="${PBS_NODEFILE}"
elif [ -n "${HOSTS_FILE}" ]; then   # manual settings
   MACHINEFILE="${HOSTS_FILE}"
elif [ -n "${TMPDIR}" -a -f "${TMPDIR}/machines" ]; then  # LSF, untested
   MACHINEFILE="${TMPDIR}/machines"
   KNOTEN=$NSLOTS
fi
and even though the script says that the last option is for "LSF", it actually works for SGE as well. So, because you have not set HOSTS_FILE, the mpirun_scripts will read the file ${TMPDIR}/machines and use the SGE variable $NSLOTS as the number of computing processes. Furthermore, the mpirun_scripts add one extra CPU for the dscf/grad/ridft/rdgrad server process:
Code: [Select]
if [ "${PARA_ARCH}" = "MPI" ] ; then
  KNOTEN=`expr $KNOTEN \+ 1`
fi
So, since your script uses the SGE setting "#$ -pe mpich1 17", you end up with 18 processes: 17 computing processes plus one server process. The server process should not consume much CPU, but as we have already discussed in another thread, in TM 6.0 it actually does.

My suggestions:
1) Define the environment variable HOSTS_FILE in your script after setting PARA_ARCH and PARNODES:
Code: [Select]
  setenv HOSTS_FILE machines
Note that in the case of SGE you should also be able to find the list of available nodes in the file "$TMPDIR/machines". With HOSTS_FILE set, Turbomole should respect your PARNODES setting.
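
If your parallel environment does not already leave a usable machines file behind, you can build one from $PE_HOSTFILE, the standard SGE variable pointing to a file that lists "hostname slots queue range" for each granted node. A minimal csh sketch (the awk expansion is my assumption about what your PE provides; adjust to your site):
Code: [Select]
  # expand each SGE host entry ("hostname slots ...") to one line per CPU slot
  awk '{ for (i = 0; i < $2; i++) print $1 }' $PE_HOSTFILE > machines
  setenv HOSTS_FILE $PWD/machines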

2) For clarity, I suggest moving all SGE directives (#$) to the beginning of the file, right after the "#!/bin/csh" line.

Hope this helps,
Antti
« Last Edit: April 01, 2009, 09:14:53 AM by antti_karttunen »

himansu

  • Newbie
  • *
  • Posts: 4
  • Karma: +0/-0
Re: Running v6.0 on an SGE Cluster
« Reply #2 on: December 21, 2012, 05:33:24 AM »
Hello,
I am facing a problem running a test job with a recently installed TURBOMOLE V6.4 on an SGE cluster. While submitting the job script, I get the following error message (from job.last):

<getgrd> : data group $grad  is missing

  MODTRACE: no modules on stack
 error reading energy and gradient in rdcor
 statpt ended abnormally
statpt step ended abnormally
next step = statpt

Here is my script

#!/bin/csh

#$ -N Turbo
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -V
#$ -q long_prl.q

set JOB=H2O
#$ -e error.$JOB_ID.$JOB_NAME
#$ -o H2O.$JOB_NAME
#$ -P himansu_prj
#$ -pe intelmpi 8
#$ -v I_MPI_MPD_TMPDIR=/tmp

set workdir=$PWD
set scratch=$HOME/lustre/scratch/$JOB_ID
echo $scratch
mkdir -p $scratch
lfs setstripe -s 32M -c 14 -i -1 $scratch
#cd $scratch
#cp -a $workdir/new.nw $scratch

setenv I_MPI_MPD_TMPDIR /tmp

setenv LD_LIBRARY_PATH /opt/intel/mkl/10.2.0.013/lib/em64t
setenv LD_LIBRARY_PATH /opt/intel/impi/3.2.1.009/lib64:$PATH

setenv TURBODIR /opt/intel/apps/TURBOMOLE
setenv TURBOTMPDIR /scratch

##### Parallel job
# Set environment variables for an MPI job

setenv PARA_ARCH MPI
setenv PARNODES 8
setenv HOSTS_FILE machines
limit stacksize unlimited

jobex -c 500 -energy 6 -gcart 3

echo "Job finished at: `date`"

I would be very thankful if somebody could help me out.
Thanks,
Himansu

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Running v6.0 on an SGE Cluster
« Reply #3 on: December 21, 2012, 08:05:14 AM »
Hi,

it looks like statpt cannot read the energy and gradient data it needs for updating the geometry. So, probably ridft/rdgrad (or dscf/grad) has failed. Please check the contents of GEO_OPT_FAILED and also go through job.last and try to find the error message from the modules that calculate the energy and gradient.
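
For example, a quick way to hunt for the first real error message (just a sketch; the file names are the standard jobex outputs that already appear in your log):
Code: [Select]
cat GEO_OPT_FAILED                  # jobex records the failing step here
grep -iE 'error|abnormal' job.last  # locate the first module that died
cat slave1.output                   # per-slave output of the parallel run, if it exists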

Regards,
Antti

himansu

  • Newbie
  • *
  • Posts: 4
  • Karma: +0/-0
Re: Running v6.0 on an SGE Cluster
« Reply #4 on: December 21, 2012, 08:17:17 AM »
Thanks a lot for your reply. I am posting the control file as well as the job.last file here; I hope you can suggest something.

control file
$title
$operating system unix
$symmetry cs
$redundant    file=coord
$coord    file=coord
$user-defined bonds    file=coord
$atoms
o  1                                                                           \
   basis =o TZVPP                                                              \
   jbas  =o TZVPP
h  2-3                                                                         \
   basis =h TZVPP                                                              \
   jbas  =h TZVPP
$basis    file=basis
$rundimensions
   dim(fock,dens)=2337
   natoms=3
   nshell=23
   nbf(CAO)=66
   nbf(AO)=59
   dim(trafo[SAO<-->AO/CAO])=80
   rhfshells=1
$scfmo   file=mos
$closed shells
 a'      1-4                                    ( 2 )
 a"      1                                      ( 2 )
$scfiterlimit       30
$thize     0.10000000E-04
$thime        5
$scfdump
$scfintunit
 unit=30       size=0        file=twoint
$scfdiis
$scforbitalshift  automatic=.1
$drvopt
   cartesian  on
   basis      off
   global     off
   hessian    on
   dipole     on
   nuclear polarizability
$interconversion  off
   qconv=1.d-7
   maxiter=25
$optimize
   internal   on
   redundant  on
   cartesian  off
   global     off
   basis      off   logarithm
$coordinateupdate
   dqmax=0.3
   interpolate  on
   statistics    5
$forceupdate
   ahlrichs numgeo=0  mingeo=3 maxgeo=4 modus=<g|dq> dynamic fail=0.3
   threig=0.005  reseig=0.005  thrbig=3.0  scale=1.00  damping=0.0
$forceinit on
   diag=default
$energy    file=energy
$grad    file=gradient
$forceapprox    file=forceapprox
$lock off
$dft
   functional b97-d
   gridsize   m3
$scfconv   6
$scfdamp   start=0.700  step=0.050  min=0.050
$ricore     3000
$rij
$jbas    file=auxbasis
$actual step      statpt
$end
$TMPDIR /scratch

job.last

Environment variable MPI_ROOT could not be set to a valid path!
TURBOTMPDIR environment variable set to "/scratch".
This directory must exist and be writable by the master process (slave1).
NOTE: the number of nodes in your machine list:
      machines
      is LOWER than the number of nodes requested
      PARNODES has been set to 7

      PARNODES will be ignored - change your machine file
      to use more nodes. Remember to add a node with several CPUs
      multiple times to your machine file (one line = one CPU)

Calculation will continue on 0 CPUs
STARTING ridft ON 0 PROCESSORS!
RUNNING PROGRAM /opt/intel/apps/TURBOMOLE/bin/em64t-sgi-linux-gnu_mpi/ridft_mpi.
PLEASE WAIT UNTIL ridft HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is machines
No file slave1.output found?
fine, there is no data group "$actual step"
script actual: unknown actual step define
next step = unknown

GEO_OPT_FAILED
ERROR: Module statpt failed to run properly - please check output job.1 and job.last for the reason

job.1
OPTIMIZATION CYCLE 1
Fri Dec 21 12:43:11 IST 2012
Environment variable MPI_ROOT could not be set to a valid path!
NOTE: the number of nodes in your machine list:
      machines
      is LOWER than the number of nodes requested
      PARNODES has been set to 7

      PARNODES will be ignored - change your machine file
      to use more nodes. Remember to add a node with several CPUs
      multiple times to your machine file (one line = one CPU)

Calculation will continue on 0 CPUs
STARTING rdgrad ON 0 PROCESSORS!
RUNNING PROGRAM /opt/intel/apps/TURBOMOLE/bin/em64t-sgi-linux-gnu_mpi/rdgrad_mpi.
PLEASE WAIT UNTIL rdgrad HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is machines
No file slave1.output found?
fine, there is no data group "$actual step"
script actual: unknown actual step define
next step = unknown
 operating system is UNIX !
 hostname is         r1i1n2

 statpt (r1i1n2) : TURBOMOLE V6.4 3 Apr 2012 at 16:44:13
 Copyright (C) 2012 TURBOMOLE GmbH, Karlsruhe


    2012-12-21 12:43:11.865



                           this is S T A T P T   


                     hessian and coordinate update for
                          stationary point search

                     by barbara unterreiner, marek sierka,
                           and reinhart ahlrichs

                          quantum chemistry group
                          universitaet  karlsruhe
                                  germany


  Keyword $statpt not found - using default options
 
        ***************  Stationary point options ******************
        ************************************************************
           Maximum allowed trust radius:           3.000000E-01
           Minimum allowed trust radius:           1.000000E-03
           Initial trust radius:                   1.500000E-01
           GDIIS used if gradient norm <           1.000000E-02
           Number of previous steps for GDIIS:       5
           Hessian update method:                  BFGS
                        *** Convergence criteria ***               
           Threshold for energy change:            1.000000E-06
           Threshold for max displacement element: 1.000000E-03
           Threshold for max gradient element :    1.000000E-03
           Threshold for RMS of displacement:      5.000000E-04
           Threshold for RMS of gradient:          5.000000E-04
        ************************************************************
 

 <getgrd> : data group $grad  is missing

 
 MODTRACE: no modules on stack

 error reading energy and gradient in rdcor
 statpt ended abnormally
statpt step ended abnormally
next step = statpt
 operating system is UNIX !
 hostname is         r1i1n2

 data group $actual step is not empty
 due to the abend of statpt

[the same statpt banner, default options, and "<getgrd> : data group $grad  is missing" error repeat twice more, at 12:43:11.946 and 12:43:12.025]

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Running v6.0 on an SGE Cluster
« Reply #5 on: December 21, 2012, 04:07:27 PM »
Hi,

There seem to be quite a few problems with the parallel run. Are you able to execute the job in serial mode?
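
For a quick serial check, something like this should work (a sketch; unsetting PARA_ARCH makes the Turbomole scripts pick the serial binaries):
Code: [Select]
unsetenv PARA_ARCH      # fall back to the serial binaries
ridft  > ridft.out
rdgrad > rdgrad.out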

Actually, I just noticed that you are using csh syntax in your script but tell SGE to run it with bash (#$ -S /bin/bash). This will probably cause problems.
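
The simplest fixes are, for example (a sketch, pick one):
Code: [Select]
# Option 1: keep the csh script and match the interpreter directive
#$ -S /bin/csh

# Option 2: keep "#$ -S /bin/bash" and convert the csh commands to bash syntax
export PARA_ARCH=MPI            # instead of "setenv PARA_ARCH MPI"
export PARNODES=8
export HOSTS_FILE=$PWD/machines
ulimit -s unlimited             # instead of "limit stacksize unlimited"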

Regards,
Antti