Hi,
I would recommend running a simple, small job (though not too small) by hand on 2 CPUs. TTEST and jobex both redirect the output to files, so it is impossible to find out what the problem is without being able to check all the files after such a run.
Usually a simple water input with the default basis set is sufficient to check whether the problem is a general one. Create a normal serial input for RI-DFT, add
$numprocs 2
$parallel_platform MPP
to the control file, and start (with -np 3, because a server process runs in addition to the two clients)
mpirun -np 3 $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err
and check the output and the error output.
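For anyone who prefers to script the edit of the control file, here is a minimal sketch in Python. It is not part of Turbomole; the function name `add_parallel_keywords` is made up for illustration, and it assumes the control file is terminated by a `$end` line, with the two keywords placed before it.

```python
def add_parallel_keywords(control_text, nprocs=2, platform="MPP"):
    """Return control_text with $numprocs and $parallel_platform placed before $end.

    Hypothetical helper for illustration only; it works on the file content
    as a string and does not touch any file on disk.
    """
    new_lines = [f"$numprocs {nprocs}", f"$parallel_platform {platform}"]
    out = []
    inserted = False
    for line in control_text.splitlines():
        stripped = line.strip()
        # Drop existing copies of the two keywords so they are not duplicated.
        if stripped.startswith("$numprocs") or stripped.startswith("$parallel_platform"):
            continue
        # Put the keywords directly in front of the first $end line.
        if stripped == "$end" and not inserted:
            out.extend(new_lines)
            inserted = True
        out.append(line)
    if not inserted:
        # Assumption: if there is no $end line, simply append the keywords.
        out.extend(new_lines)
    return "\n".join(out) + "\n"


example = "$title\nwater\n$coord    file=coord\n$end\n"
print(add_parallel_keywords(example))
```

Calling the function a second time on its own output leaves the text unchanged, so it should be safe to rerun on a control file that already has the keywords.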
My guess is that your system is too new for the MPICH library that was used for building Turbomole 5.7.
Regards,
Uwe
Thanks Uwe. I tried your solution, but I'm very confused...
userxxx@node17:~/TEST-UWE$ mpirun -np 3 -machinefile ./machinefile $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err
userxxx@node17:~/TEST-UWE$ ls
auxbasis basis control coord energy machinefile mos slave1.output slave2.output test.err test.out zzz
userxxx@node17:~/TEST-UWE$ cat test.out
operating system is UNIX !
hostname is node17
ridft
hostname is node17
ridftserver(node17) : TURBOMOLE V5-7-1 11 May 2005 at 17:08:32
Copyright (C) 2005 University of Karlsruhe
2007-07-17 08:46:59.156
ridft server
r.ahlrichs, s.brode, m. ehrig, h.horn
quantum chemistry group
universitaet karlsruhe, germany
operating system is UNIX !
operating system is UNIX !
***************************************************************************
$operating system unix
***************************************************************************
hostname is node12
hostname is node15
$numprocs 2
<read_control>: keyword $numprocs found in <control>
n_clients is 2
Initialization MPI : 2 tasks intended
2 tasks spawned
0 1 2
0 1 2
node process 1 runs on node15 1
node process 2 runs on node12 2
nshell = 12, fockdim = 352, nfock = 352
time elapsed since starting is : cpu 0.010 sec
wall 0.063 sec
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=1 ---- dynamic distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
itscf.gt.1.or.idirec.eq.0-static distribution of tasks
ridftserv: entering DFT calculation :
dyndft=0 ---- static distribution of tasks
total turnaround time of ridftserver 1.90 sec
cpu elapsed by ridftserver process 0.01 sec
*** ridft server : all done ***
2007-07-17 08:47:01.052
userxxx@node17:~/TEST-UWE$ cat test.err
FORTRAN STOP ridft ended normally
FORTRAN STOP ridft ended normally
FORTRAN STOP 0 -------------------------------------FORTRAN server ends
userxxx@node17:~/TEST-UWE$ cat energy
$energy SCF SCFKIN SCFPOT
1 -76.36168107830 75.84269475416 -152.20437583247
$end
node17, node15 and node12 have this Linux system:
userxxx@node17:~/TEST-UWE$ uname -a
Linux node17 2.6.8-3-k7-smp #1 SMP Thu Sep 7 04:08:38 UTC 2006 i686 GNU/Linux
userxxx@node17:~/TEST-UWE$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
userxxx@node17:~/TEST-UWE$ mpichversion
MPICH Version: 1.2.7p1
MPICH Release date: $Date: 2005/11/04 11:54:51$
MPICH Patches applied: none
MPICH configure: --with-device=ch_p4 --prefix=/opt/mpich-1.2.7-GNU
MPICH Device: ch_p4
userxxx@node17:~/TEST-UWE$ glib-config --version
1.2.10
But if I use node02, node03 and node04, this problem appears:
userxxx@node01:~/TEST-UWE-NEW$ cat test.err
userxxx@node01:~/TEST-UWE-NEW$ cat test.out
p0_30458: p4_error: Could not gethostbyname for host node01; may be invalid name
: 61
userxxx@node01:~/TEST-UWE-NEW$ uname -a
Linux node01 2.6.18-4-k7 #1 SMP Wed May 9 23:42:01 UTC 2007 i686 GNU/Linux
userxxx@node01:~/TEST-UWE-NEW$ glib-config --version
1.2.10
userxxx@node01:~/TEST-UWE-NEW$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
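The p4_error in test.out above says that the name node01 could not be resolved on the machine where the process started. One way to narrow this down is to check, on each node, whether every host in the machinefile resolves. The following is only a sketch in Python: the names `parse_machinefile` and `unresolved_hosts` are hypothetical, and it assumes the machinefile has one host per line, optionally written as `host:ncpus`, with `#` starting a comment.

```python
import socket


def parse_machinefile(text):
    """Return the distinct hostnames from machinefile content, in order."""
    hosts = []
    for line in text.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Keep only the hostname from entries of the form "host:ncpus".
        host = line.split()[0].split(":", 1)[0]
        if host not in hosts:
            hosts.append(host)
    return hosts


def unresolved_hosts(hosts, resolve=socket.gethostbyname):
    """Return the hosts for which the resolver raises an error.

    The resolve parameter exists only so the lookup can be replaced
    in a test; by default the system resolver is used.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            failed.append(host)
    return failed
```

If `unresolved_hosts(parse_machinefile(open("machinefile").read()))` returns a non-empty list on one of the nodes, the name lookup on that node (for example /etc/hosts or DNS) would be the first thing to look at.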
I also tried to run the TURBOTEST test suite on node17, node15 and node12, with these results:
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ export PARNODES=2
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ export HOSTS_FILE=/opt/TURBOMOLE_5.7/TURBOTEST/machines
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ ./TEST -short all
/opt/TURBOMOLE_5.7/TURBOTEST
===============================
THIS IS TEST dscf/short/Ag2.SCF.E
mv: cannot stat `statistics': No such file or directory
Tue Jul 17 08:59:38 CEST 2007
TESTING PROGRAM /opt/TURBOMOLE_5.7/bin/athlon-pc-linux-gnu_mpi/dscf
THERE IS A PROBLEM WITH /opt/TURBOMOLE_5.7/bin/athlon-pc-linux-gnu_mpi/dscf
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST/dscf/short/Ag2.SCF.E$ cat TESTDIR.athlon-pc-linux-gnu_mpi/master
general input file <control> is empty !
general input file <control> is empty !
MODTRACE: no modules on stack
CONTRL empty input
dscf ended abnormally
p2_12043: p4_error: : 16
p2_12043: (0.041032) net_send: could not write to fd=5, errno = 32
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST/dscf/short/Ag2.SCF.E$ cat TESTDIR.athlon-pc-linux-gnu_mpi/output.test
FORTRAN STOP dscf ended normally
STARTING dscf ON 2 PROCESSORS!
PLEASE WAIT UNTIL dscf HAS FINISHED
Look for the output in slave1.output
MACHINEFILE is /opt/TURBOMOLE_5.7/TURBOTEST/machines
dscf ended abnormally
[2] MPI Abort by user Aborting program !
[2] Aborting program!
No file slave1.output found?