TURBOMOLE Users Forum
Installation and usage of TURBOMOLE => Parallel Runs => Topic started by: Jerry on March 31, 2009, 08:54:39 PM
-
Hi,
I'm trying to run TM v6.0 jobs on our SGE cluster. My input script is as follows:
------------------------
#!/bin/csh
setenv TURBODIR /cluster/home/tanoury/TURBOMOLE
set path=($TURBODIR/scripts $path)
set path=($TURBODIR/bin/`sysname` $path)
setenv TURBOTMPDIR /scratch
setenv PARA_ARCH MPI
setenv PARNODES 16
limit stacksize unlimited
#$ -pe mpich1 17
#$ -r n
#$ -cwd
#$ -o out_file
#$ -e error_file
#$ -V
grep hpc out_file | grep -v /hpc > machines
jobex -ri -c 400 > jobex.out
-----------------------------
16 slave outputs are listed in my job directory, but 18 CPUs are being used on the cluster. The number of CPUs in use on the node is always (-pe + 1), not (PARNODES + 1). Does anyone have experience with this? Is there a better script than the one I am using? (I'm a novice at writing these SGE scripts.)
Thanks,
Jerry
-
Hello,
Your script looks fine; I think it only needs some minor tweaking. You write a list of available nodes to the file "machines", but you do not set the TURBOMOLE environment variable HOSTS_FILE, which tells TM about your own hostfile. The TM 6.0 mpirun_scripts contain the following section:
# check for environment variable containing hostfile name.
MACHINEFILE=""
if [ -n "${PBS_NODEFILE}" ]; then # PBS/Torque/Maui
MACHINEFILE="${PBS_NODEFILE}"
elif [ -n "${HOSTS_FILE}" ]; then # manual settings
MACHINEFILE="${HOSTS_FILE}"
elif [ -n "${TMPDIR}" -a -f "${TMPDIR}/machines" ]; then # LSF, untested
MACHINEFILE="${TMPDIR}/machines"
KNOTEN=$NSLOTS
fi
Even though the comment in the script says that the last option is for "LSF", it works for SGE as well. So, because you have not set HOSTS_FILE, the mpirun_scripts read the file ${TMPDIR}/machines and use the SGE variable $NSLOTS as the number of computing processes. Furthermore, the mpirun_scripts add one additional CPU for the dscf/grad/ridft/rdgrad server process:
if [ "${PARA_ARCH}" = "MPI" ] ; then
KNOTEN=`expr $KNOTEN \+ 1`
fi
So, as your script uses the SGE setting "#$ -pe mpich1 17", you end up with 18 processes: 17 computing processes plus one server process. The server process should not consume much CPU, but as we have already discussed in another thread, it actually does in the case of TM 6.0.
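The counting described above can be reproduced in a few lines of shell. This is only a sketch of the logic quoted from the mpirun_scripts, not the script itself; the function name total_processes is made up for illustration.

```shell
#!/bin/sh
# Sketch only: mimics the process counting quoted above from the TM 6.0
# mpirun_scripts. The function name total_processes is hypothetical.
#   $1 = number of computing processes (KNOTEN, e.g. $NSLOTS or $PARNODES)
#   $2 = value of PARA_ARCH
total_processes() {
    knoten=$1
    if [ "$2" = "MPI" ]; then
        # one extra process for the dscf/grad/ridft/rdgrad server
        knoten=$((knoten + 1))
    fi
    echo "$knoten"
}

# "#$ -pe mpich1 17" without HOSTS_FILE: NSLOTS=17 is used as KNOTEN
total_processes 17 MPI    # prints 18
# if PARNODES=16 is respected instead, the total should match the 17 slots
total_processes 16 MPI    # prints 17
```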
My suggestions:
1) Define the environment variable HOSTS_FILE in your script after setting PARA_ARCH and PARNODES:
setenv HOSTS_FILE machines
Note that with SGE you should also be able to find the list of available nodes in the file "$TMPDIR/machines". With HOSTS_FILE set, TURBOMOLE should respect your PARNODES setting.
2) For clarity, I suggest that you move all SGE directives (#$) to the beginning of the file, right after the "#!/bin/csh" line.
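Before pointing HOSTS_FILE at a file, it may be worth checking that the file exists and lists enough CPUs. The sketch below is written in sh syntax (in a csh job script the variable itself is still set with setenv, as in suggestion 1). The helper name check_hosts_file is hypothetical, and the check assumes the hostfile has one non-empty line per CPU.

```shell
#!/bin/sh
# Sketch only: check_hosts_file is a hypothetical helper, not part of TURBOMOLE.
# Assumption: the hostfile lists one CPU per non-empty line.
#   $1 = hostfile name, $2 = number of CPUs requested (e.g. $PARNODES)
check_hosts_file() {
    file=$1
    wanted=$2
    if [ ! -f "$file" ]; then
        echo "hosts file '$file' not found" >&2
        return 1
    fi
    # count non-empty lines; "|| true" keeps a zero count from aborting
    have=$(grep -c . "$file" || true)
    if [ "$have" -lt "$wanted" ]; then
        echo "'$file' lists $have CPUs, fewer than the $wanted requested" >&2
        return 1
    fi
    echo "'$file' lists $have CPUs for $wanted requested"
}

# Possible use in a job script, before calling jobex:
#   check_hosts_file machines "$PARNODES" || exit 1
```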
Hope this helps,
Antti
-
Hello,
I am facing a problem running a test job with a recently installed TURBOMOLE V6.4 on an SGE cluster. When I submit the job script, I get the following error message (job.last):
<getgrd> : data group $grad is missing
MODTRACE: no modules on stack
error reading energy and gradient in rdcor
statpt ended abnormally
statpt step ended abnormally
next step = statpt
Here is my script
#!/bin/csh
#$ -N Turbo
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -V
#$ -q long_prl.q
set JOB=H2O
#$ -e error.$JOB_ID.$JOB_NAME
#$ -o H2O.$JOB_NAME
#$ -P himansu_prj
#$ -pe intelmpi 8
#$ -v I_MPI_MPD_TMPDIR=/tmp
set workdir=$PWD
set scratch=$HOME/lustre/scratch/$JOB_ID
echo $scratch
mkdir -p $scratch
lfs setstripe -s 32M -c 14 -i -1 $scratch
#cd $scratch
#cp -a $workdir/new.nw $scratch
setenv I_MPI_MPD_TMPDIR /tmp
setenv LD_LIBRARY_PATH /opt/intel/mkl/10.2.0.013/lib/em64t
setenv LD_LIBRARY_PATH /opt/intel/impi/3.2.1.009/lib64:$PATH
setenv TURBODIR /opt/intel/apps/TURBOMOLE
setenv TURBOTMPDIR /scratch
##### Parallel job
# Set environment variables for a MPI job
setenv PARA_ARCH MPI
setenv PARNODES 8
setenv HOSTS_FILE machines
limit stacksize unlimited
jobex -c 500 -energy 6 -gcart 3
echo "Job finished at: `date`"
I would be very thankful if somebody could help me out.
Thanks,
Himansu
-
Hi,
it looks like statpt cannot read the energy and gradient data it needs for updating the geometry. So, probably ridft/rdgrad (or dscf/grad) has failed. Please check the contents of GEO_OPT_FAILED and also go through job.last and try to find the error message from the modules that calculate the energy and gradient.
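A quick way to do this check from the job directory could look like the sketch below. The helper name report_failure and the grep patterns are only guesses at what is worth looking for, not an official list of TURBOMOLE error strings.

```shell
#!/bin/sh
# Sketch only: report_failure is a hypothetical helper. The search patterns
# are assumptions, not a complete list of TURBOMOLE error messages.
#   $1 = job directory (defaults to the current directory)
report_failure() {
    dir=${1:-.}
    if [ -f "$dir/GEO_OPT_FAILED" ]; then
        echo "== GEO_OPT_FAILED =="
        cat "$dir/GEO_OPT_FAILED"
    fi
    if [ -f "$dir/job.last" ]; then
        echo "== suspicious lines in job.last =="
        # -n prints line numbers so the context can be looked up in job.last
        grep -n -i -E 'error|abnormally|could not|no file' "$dir/job.last" \
            || echo "(no matching lines)"
    fi
}

# Possible use after a failed jobex run:
#   report_failure .
```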
Regards,
Antti
-
Thanks a lot for your reply. I am posting the control file as well as the job.last file here; I hope you can suggest something.
control file
$title
$operating system unix
$symmetry cs
$redundant file=coord
$coord file=coord
$user-defined bonds file=coord
$atoms
o 1 \
basis =o TZVPP \
jbas =o TZVPP
h 2-3 \
basis =h TZVPP \
jbas =h TZVPP
$basis file=basis
$rundimensions
dim(fock,dens)=2337
natoms=3
nshell=23
nbf(CAO)=66
nbf(AO)=59
dim(trafo[SAO<-->AO/CAO])=80
rhfshells=1
$scfmo file=mos
$closed shells
a' 1-4 ( 2 )
a" 1 ( 2 )
$scfiterlimit 30
$thize 0.10000000E-04
$thime 5
$scfdump
$scfintunit
unit=30 size=0 file=twoint
$scfdiis
$scforbitalshift automatic=.1
$drvopt
cartesian on
basis off
global off
hessian on
dipole on
nuclear polarizability
$interconversion off
qconv=1.d-7
maxiter=25
$optimize
internal on
redundant on
cartesian off
global off
basis off logarithm
$coordinateupdate
dqmax=0.3
interpolate on
statistics 5
$forceupdate
ahlrichs numgeo=0 mingeo=3 maxgeo=4 modus=<g|dq> dynamic fail=0.3
threig=0.005 reseig=0.005 thrbig=3.0 scale=1.00 damping=0.0
$forceinit on
diag=default
$energy file=energy
$grad file=gradient
$forceapprox file=forceapprox
$lock off
$dft
functional b97-d
gridsize m3
$scfconv 6
$scfdamp start=0.700 step=0.050 min=0.050
$ricore 3000
$rij
$jbas file=auxbasis
$actual step statpt
$end
$TMPDIR /scratch
job.last
Environment variable MPI_ROOT could not be set to a valid path!
TURBOTMPDIR environment variable set to "/scratch".
This directory must exist and be writable by the master process (slave1).
NOTE: the number of nodes in your machine list:
machines
is LOWER than the number of nodes requested
PARNODES has been set to 7
PARNODES will be ignored - change your machine file
to use more nodes. Remember to add a node with several CPUs
multiple times to your machine file (one line = one CPU)
Calculation will continue on 0 CPUs
STARTING ridft ON 0 PROCESSORS!
RUNNING PROGRAM /opt/intel/apps/TURBOMOLE/bin/em64t-sgi-linux-gnu_mpi/ridft_mpi.
PLEASE WAIT UNTIL ridft HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is machines
No file slave1.output found?
fine, there is no data group "$actual step"
script actual: unknown actual step define
next step = unknown
GEO_OPT_FAILED
ERROR: Module statpt failed to run properly - please check output job.1 and job.last for the reason
job.1
OPTIMIZATION CYCLE 1
Fri Dec 21 12:43:11 IST 2012
Environment variable MPI_ROOT could not be set to a valid path!
NOTE: the number of nodes in your machine list:
machines
is LOWER than the number of nodes requested
PARNODES has been set to 7
PARNODES will be ignored - change your machine file
to use more nodes. Remember to add a node with several CPUs
multiple times to your machine file (one line = one CPU)
Calculation will continue on 0 CPUs
STARTING rdgrad ON 0 PROCESSORS!
RUNNING PROGRAM /opt/intel/apps/TURBOMOLE/bin/em64t-sgi-linux-gnu_mpi/rdgrad_mpi.
PLEASE WAIT UNTIL rdgrad HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is machines
No file slave1.output found?
fine, there is no data group "$actual step"
script actual: unknown actual step define
next step = unknown
operating system is UNIX !
hostname is r1i1n2
statpt (r1i1n2) : TURBOMOLE V6.4 3 Apr 2012 at 16:44:13
Copyright (C) 2012 TURBOMOLE GmbH, Karlsruhe
2012-12-21 12:43:11.865
this is S T A T P T
hessian and coordinate update for
stationary point search
by barbara unterreiner, marek sierka,
and reinhart ahlrichs
quantum chemistry group
universitaet karlsruhe
germany
Keyword $statpt not found - using default options
*************** Stationary point options ******************
************************************************************
Maximum allowed trust radius: 3.000000E-01
Minimum allowed trust radius: 1.000000E-03
Initial trust radius: 1.500000E-01
GDIIS used if gradient norm < 1.000000E-02
Number of previous steps for GDIIS: 5
Hessian update method: BFGS
*** Convergence criteria ***
Threshold for energy change: 1.000000E-06
Threshold for max displacement element: 1.000000E-03
Threshold for max gradient element : 1.000000E-03
Threshold for RMS of displacement: 5.000000E-04
Threshold for RMS of gradient: 5.000000E-04
************************************************************
<getgrd> : data group $grad is missing
MODTRACE: no modules on stack
error reading energy and gradient in rdcor
statpt ended abnormally
statpt step ended abnormally
next step = statpt
operating system is UNIX !
hostname is r1i1n2
data group $actual step is not empty
due to the abend of statpt
statpt (r1i1n2) : TURBOMOLE V6.4 3 Apr 2012 at 16:44:13
Copyright (C) 2012 TURBOMOLE GmbH, Karlsruhe
2012-12-21 12:43:11.946
this is S T A T P T
hessian and coordinate update for
stationary point search
by barbara unterreiner, marek sierka,
and reinhart ahlrichs
quantum chemistry group
universitaet karlsruhe
germany
Keyword $statpt not found - using default options
*************** Stationary point options ******************
************************************************************
Maximum allowed trust radius: 3.000000E-01
Minimum allowed trust radius: 1.000000E-03
Initial trust radius: 1.500000E-01
GDIIS used if gradient norm < 1.000000E-02
Number of previous steps for GDIIS: 5
Hessian update method: BFGS
*** Convergence criteria ***
Threshold for energy change: 1.000000E-06
Threshold for max displacement element: 1.000000E-03
Threshold for max gradient element : 1.000000E-03
Threshold for RMS of displacement: 5.000000E-04
Threshold for RMS of gradient: 5.000000E-04
************************************************************
<getgrd> : data group $grad is missing
MODTRACE: no modules on stack
error reading energy and gradient in rdcor
statpt ended abnormally
statpt step ended abnormally
next step = statpt
operating system is UNIX !
hostname is r1i1n2
data group $actual step is not empty
due to the abend of statpt
statpt (r1i1n2) : TURBOMOLE V6.4 3 Apr 2012 at 16:44:13
Copyright (C) 2012 TURBOMOLE GmbH, Karlsruhe
2012-12-21 12:43:12.025
this is S T A T P T
hessian and coordinate update for
stationary point search
by barbara unterreiner, marek sierka,
and reinhart ahlrichs
quantum chemistry group
universitaet karlsruhe
germany
Keyword $statpt not found - using default options
*************** Stationary point options ******************
************************************************************
Maximum allowed trust radius: 3.000000E-01
Minimum allowed trust radius: 1.000000E-03
Initial trust radius: 1.500000E-01
GDIIS used if gradient norm < 1.000000E-02
Number of previous steps for GDIIS: 5
Hessian update method: BFGS
*** Convergence criteria ***
Threshold for energy change: 1.000000E-06
Threshold for max displacement element: 1.000000E-03
Threshold for max gradient element : 1.000000E-03
Threshold for RMS of displacement: 5.000000E-04
Threshold for RMS of gradient: 5.000000E-04
************************************************************
<getgrd> : data group $grad is missing
MODTRACE: no modules on stack
error reading energy and gradient in rdcor
statpt ended abnormally
statpt step ended abnormally
next step = statpt
-
Hi,
There seem to be quite a few problems with the parallel run. Are you able to execute the job in serial mode?
Actually, I just noticed that you are using csh syntax in your script, but require SGE to run it with bash: #$ -S /bin/bash
This will probably cause problems.
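A mismatch like this can be spotted before submitting. The sketch below compares the "#!" line of a job script with its "#$ -S" directive; the helper name check_shell_mismatch is hypothetical, and whether SGE actually follows "-S" rather than the "#!" line depends on how the queue is configured.

```shell
#!/bin/sh
# Sketch only: check_shell_mismatch is a hypothetical helper.
# Compares the interpreter on the "#!" line with the one requested via "#$ -S".
#   $1 = job script to inspect
check_shell_mismatch() {
    script=$1
    # interpreter path from the first line, e.g. /bin/csh
    shebang=$(sed -n '1s/^#![ ]*\([^ ]*\).*/\1/p' "$script")
    # interpreter path from the first "#$ -S" directive, e.g. /bin/bash
    sge=$(sed -n 's/^#\$[ ]*-S[ ]*\([^ ]*\).*/\1/p' "$script" | head -n 1)
    if [ -n "$shebang" ] && [ -n "$sge" ] && \
       [ "$(basename "$shebang")" != "$(basename "$sge")" ]; then
        echo "mismatch: script starts with #!$shebang but requests -S $sge"
        return 1
    fi
    echo "no mismatch found"
}

# Possible use (the script name is a placeholder):
#   check_shell_mismatch myjob.sh
```

For the script posted above, this would report /bin/csh against /bin/bash; either the "#$ -S" line or the script syntax would then need to change so that both agree.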
Regards,
Antti