TURBOMOLE Users Forum
Installation and usage of TURBOMOLE => Parallel Runs => Topic started by: SCILX on July 10, 2007, 10:04:26 AM
-
Hi all, a research group at my university uses TURBOMOLE 5.7 for their research.
To test TURBOMOLE in parallel mode I used a short job that works fine with the same TURBOMOLE version in serial mode.
These outputs appear (in parallel mode)...
user@scilx:~/TEST_SHORT_RUNNING$ cat job.start
AN OPTIMIZATION WITH MAX. 1000 CYCLES WILL BE PERFORMED
CONVERGENCY CRITERION FOR TOTAL SCF-ENERGY IS 10**(-6)
CONVERGENCY CRITERION FOR MAXIMUM NORM OF SCF-ENERGY GRADIENT IS 10**(-3)
CONVERGENCY CRITERION FOR MAXIMUM NORM OF BASIS SET GRADIENT IS 10**(-3)
PROCEDURE WILL START WITH A dscf like STEP
LOAD MODULES WILL BE TAKEN FROM DIRECTORY /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi
DSCF = /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi/dscf
GRAD = /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi/grad
RELAX = /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi/relax
user@scilx:~/TEST_SHORT_RUNNING$ cat job.last
STARTING dscf ON 2 PROCESSORS!
PLEASE WAIT UNTIL dscf HAS FINISHED
Look for the output in slave1.output
MACHINEFILE is /var/spool/torque/aux//3517.scilx
No file slave1.output found?
fine, there is no data group "$actual step"
next step = dscf
user@scilx:~/TEST_SHORT_RUNNING$ cat job.1
CYCLE 1
Tue Jul 10 09:47:53 CEST 2007
STARTING grad ON 2 PROCESSORS!
PLEASE WAIT UNTIL grad HAS FINISHED
Look for the output in slave1.output
MACHINEFILE is /var/spool/torque/aux//3517.scilx
No file slave1.output found?
error in gradient step (1)
energy file is empty!!!
What happened?
My system is:
DEBIAN 4.0 with kernel 2.6.18-4-k7 #1 SMP
MPICH Version: 1.2.7p1
MPICH Release date: $Date: 2005/11/04 11:54:51$
MPICH Patches applied: none
MPICH configure: --with-device=ch_p4 --prefix=/opt/mpich-1.2.7-GNU
MPICH Device: ch_p4
GCCversion 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
GLIB version 1.2.10 (output with glib-config)
Thanks for any help.
Best regards
-
Hi,
I would recommend running a simple small job (not too small, though) on 2 CPUs by hand. TTEST and jobex both redirect the output to files, and it is impossible to find out what the problem is without being able to check all files after such a run.
Usually a simple water input with the default basis set is sufficient to check whether the problem is a general one. Create a usual serial input for RI-DFT, add
$numprocs 2
$parallel_platform MPP
to the control file, and start
mpirun -np 3 $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err
and check the output and the error output.
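For reference, the control-file edit above can be scripted. This is a minimal sketch: the stand-in control file below is a placeholder for demonstration only (in practice you would edit the control file produced by define for your RI-DFT input), and the mpirun line is left commented out because it needs an actual TURBOMOLE installation.

```shell
# Stand-in control file, for demonstration only.
printf '$title\nwater test\n$end\n' > control

# Insert the two parallel keywords just before the closing $end:
sed -i 's/^\$end/$numprocs 2\n$parallel_platform MPP\n$end/' control
cat control

# Then start the MPI binary by hand. Note -np 3: one server task
# plus the two worker tasks requested by $numprocs.
# mpirun -np 3 $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err
```

The extra task in `-np 3` matters: TURBOMOLE's MPI binaries run a dedicated server process in addition to the workers, which is why the log later shows "ridftserver" alongside two clients.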
My guess is that your system is too new for the MPICH library that was used for building Turbomole 5.7.
Regards,
Uwe
-
Hi!
I have read the previous messages after also encountering the error message: "No file slave1.output found?" when running turbomole 5.9.1 through the PBS queuing system on our Opteron cluster. The problem does not appear when I run the test suite (TTEST) or a simple small job interactively. Could it be that the problem comes from PBS and the MPICH library?
Thanks for your help.
Valérie
-
Hi Valérie,
If you get the error "No file slave1.output found?" when running turbomole 5.9.1 through the PBS (or Torque) queuing system, it generally means that Turbomole couldn't initialize MPI and, hence, hasn't done a thing. I suggest that you look through the two log files created by PBS (error and output) to see what the real problem is. For us it was misconfigured MPI: Turbomole parallel scripts use HP-MPI, which does not work out-of-the-box on a SGI Altix SMP cluster.
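The check Heikki describes can be sketched like this. The job id 3517 and file names are placeholders, and the first line only simulates what a failed run might leave in the PBS error file; on a real cluster you would inspect the `<jobname>.o<jobid>` and `<jobname>.e<jobid>` files that Torque/PBS returns to the submission directory.

```shell
# Stand-in PBS error file (placeholder job id, simulated content):
printf 'p0_30458: p4_error: Could not gethostbyname for host node01\n' > job.e3517

# Scan the PBS error file for typical MPI startup failures:
grep -E 'p4_error|MPI Abort|gethostbyname|net_send' job.e3517
```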
Cheers,
Heikki
-
Hi,
I would recommend running a simple small job (not too small, though) on 2 CPUs by hand. TTEST and jobex both redirect the output to files, and it is impossible to find out what the problem is without being able to check all files after such a run.
Usually a simple water input with the default basis set is sufficient to check whether the problem is a general one. Create a usual serial input for RI-DFT, add
$numprocs 2
$parallel_platform MPP
to the control file, and start
mpirun -np 3 $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err
and check the output and the error output.
My guess is that your system is too new for the MPICH library that was used for building Turbomole 5.7.
Regards,
Uwe
Thanks Uwe, I tried your solution but I'm very confused...
userxxx@node17:~/TEST-UWE$ mpirun -np 3 -machinefile ./machinefile $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err
userxxx@node17:~/TEST-UWE$ ls
auxbasis basis control coord energy machinefile mos slave1.output slave2.output test.err test.out zzz
userxxx@node17:~/TEST-UWE$ cat test.out
operating system is UNIX !
hostname is node17
ridft
hostname is node17
ridftserver(node17) : TURBOMOLE V5-7-1 11 May 2005 at 17:08:32
Copyright (C) 2005 University of Karlsruhe
2007-07-17 08:46:59.156
ridft server
r.ahlrichs, s.brode, m. ehrig, h.horn
quantum chemistry group
universitaet karlsruhe, germany
operating system is UNIX !
operating system is UNIX !
***************************************************************************
$operating system unix
***************************************************************************
hostname is node12
hostname is node15
$numprocs 2
<read_control>: keyword $numprocs found in <control>
n_clients is 2
Initialization MPI : 2 tasks intended
2 tasks spawned
0 1 2
0 1 2
node process 1 runs on node15 1
node process 2 runs on node12 2
nshell = 12, fockdim = 352, nfock = 352
time elapsed since starting is : cpu 0.010 sec
wall 0.063 sec
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=1 ---- dynamic distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
total turnaround time of ridftserver 1.90 sec
cpu elapsed by ridftserver process 0.01 sec
*** ridft server : all done ***
2007-07-17 08:47:01.052
userxxx@node17:~/TEST-UWE$ cat test.err
FORTRAN STOP ridft ended normally
FORTRAN STOP ridft ended normally
FORTRAN STOP 0 -------------------------------------FORTRAN server ends
userxxx@node17:~/TEST-UWE$ cat energy
$energy SCF SCFKIN SCFPOT
1 -76.36168107830 75.84269475416 -152.20437583247
$end
node17, node15 and node12 run this Linux system:
userxxx@node17:~/TEST-UWE$ uname -a
Linux node17 2.6.8-3-k7-smp #1 SMP Thu Sep 7 04:08:38 UTC 2006 i686 GNU/Linux
userxxx@node17:~/TEST-UWE$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
userxxx@node17:~/TEST-UWE$ mpichversion
MPICH Version: 1.2.7p1
MPICH Release date: $Date: 2005/11/04 11:54:51$
MPICH Patches applied: none
MPICH configure: --with-device=ch_p4 --prefix=/opt/mpich-1.2.7-GNU
MPICH Device: ch_p4
userxxx@node17:~/TEST-UWE$ glib-config --version
1.2.10
But if I use node02, node03 and node04, this problem appears...
userxxx@node01:~/TEST-UWE-NEW$ cat test.err
userxxx@node01:~/TEST-UWE-NEW$ cat test.out
p0_30458: p4_error: Could not gethostbyname for host node01; may be invalid name
: 61
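That p4_error points at name resolution failing on the node itself: MPICH's ch_p4 device calls gethostbyname() at startup, so a host name missing from /etc/hosts (or DNS) aborts the run before TURBOMOLE does anything. A quick way to check is sketched below; localhost is used as a stand-in that should always resolve, and you would substitute the failing node's name (e.g. node01).

```shell
# Check that each host name resolves the way gethostbyname() would.
for h in localhost "$(hostname)"; do
    if getent hosts "$h" > /dev/null 2>&1; then
        echo "$h resolves"
    else
        echo "$h does NOT resolve -- add it to /etc/hosts"
    fi
done
```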
userxxx@node01:~/TEST-UWE-NEW$ uname -a
Linux node01 2.6.18-4-k7 #1 SMP Wed May 9 23:42:01 UTC 2007 i686 GNU/Linux
userxxx@node01:~/TEST-UWE-NEW$ glib-config --version
1.2.10
userxxx@node01:~/TEST-UWE-NEW$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
I also tried to run the TURBOTEST test suite on node17, node15 and node12, with these results...
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ export PARNODES=2
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ export HOSTS_FILE=/opt/TURBOMOLE_5.7/TURBOTEST/machines
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ ./TEST -short all
/opt/TURBOMOLE_5.7/TURBOTEST
===============================
THIS IS TEST dscf/short/Ag2.SCF.E
mv: cannot stat `statistics': No such file or directory
Tue Jul 17 08:59:38 CEST 2007
TESTING PROGRAM /opt/TURBOMOLE_5.7/bin/athlon-pc-linux-gnu_mpi/dscf
THERE IS A PROBLEM WITH /opt/TURBOMOLE_5.7/bin/athlon-pc-linux-gnu_mpi/dscf
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST/dscf/short/Ag2.SCF.E$ cat TESTDIR.athlon-pc-linux-gnu_mpi/master
general input file <control> is empty !
general input file <control> is empty !
MODTRACE: no modules on stack
CONTRL empty input
dscf ended abnormally
p2_12043: p4_error: : 16
p2_12043: (0.041032) net_send: could not write to fd=5, errno = 32
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST/dscf/short/Ag2.SCF.E$ cat TESTDIR.athlon-pc-linux-gnu_mpi/output.test
FORTRAN STOP dscf ended normally
STARTING dscf ON 2 PROCESSORS!
PLEASE WAIT UNTIL dscf HAS FINISHED
Look for the output in slave1.output
MACHINEFILE is /opt/TURBOMOLE_5.7/TURBOTEST/machines
dscf ended abnormally
[2] MPI Abort by user Aborting program !
[2] Aborting program!
No file slave1.output found?