Topic: No file slave1.output found?

SCILX

  • Newbie
  • *
  • Posts: 3
  • Karma: +0/-0
No file slave1.output found?
« on: July 10, 2007, 10:04:26 AM »
Hi all, a research group at my university uses TURBOMOLE 5.7 for their research.

To test TURBOMOLE in parallel mode, I used a short job that runs fine in serial mode with the same TURBOMOLE version.

In parallel mode I get the following output:

user@scilx:~/TEST_SHORT_RUNNING$ cat job.start
AN OPTIMIZATION WITH MAX. 1000 CYCLES WILL BE PERFORMED
CONVERGENCY CRITERION FOR TOTAL SCF-ENERGY IS 10**(-6)
CONVERGENCY CRITERION FOR MAXIMUM NORM OF SCF-ENERGY GRADIENT IS 10**(-3)
CONVERGENCY CRITERION FOR MAXIMUM NORM OF BASIS SET GRADIENT IS 10**(-3)
PROCEDURE WILL START WITH A dscf like STEP
LOAD MODULES WILL BE TAKEN FROM DIRECTORY /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi
DSCF  = /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi/dscf
GRAD  = /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi/grad
RELAX = /opt/TURBOMOLE/bin/athlon-pc-linux-gnu_mpi/relax


user@scilx:~/TEST_SHORT_RUNNING$ cat job.last
STARTING dscf ON 2 PROCESSORS!
PLEASE WAIT UNTIL dscf HAS FINISHED
Look for the output in slave1.output
MACHINEFILE is /var/spool/torque/aux//3517.scilx
No file slave1.output found?
fine, there is no data group "$actual step"
next step = dscf


user@scilx:~/TEST_SHORT_RUNNING$ cat job.1
CYCLE 1
Tue Jul 10 09:47:53 CEST 2007
STARTING grad ON 2 PROCESSORS!
PLEASE WAIT UNTIL grad HAS FINISHED
Look for the output in slave1.output
MACHINEFILE is /var/spool/torque/aux//3517.scilx
No file slave1.output found?
error in gradient step (1)

energy file is empty!!!


What happened?


My system is:
DEBIAN 4.0 with kernel 2.6.18-4-k7 #1 SMP
MPICH Version:          1.2.7p1
MPICH Release date:     $Date: 2005/11/04 11:54:51$
MPICH Patches applied:  none
MPICH configure:        --with-device=ch_p4 --prefix=/opt/mpich-1.2.7-GNU
MPICH Device:           ch_p4
GCC version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
GLIB version 1.2.10 (output with glib-config)


Thanks for any help.

Best regards

uwe

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 486
  • Karma: +0/-0
Re: No file slave1.output found?
« Reply #1 on: July 12, 2007, 03:43:44 PM »
Hi,

I would recommend running a simple small job (not too small, though) on 2 CPUs by hand. TTEST and jobex both redirect the output to files, so it is impossible to find out what the problem is without being able to check all files after such a run.

Usually a simple water input with the default basis set is sufficient to check whether the problem is a general one. Create a normal serial input for RI-DFT, add

$numprocs 2
$parallel_platform MPP

to the control file, and start

mpirun -np 3 $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err

and check the output and the error output.
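
The control-file step above can be scripted. The following is only a minimal sketch: the function name add_parallel_keywords is hypothetical, and placing the two keywords directly before the closing $end line is an assumption about where they belong in the control file.

```shell
#!/bin/sh
# Sketch (hypothetical helper): add the two parallel keywords to a
# Turbomole control file, just before its closing $end line.
add_parallel_keywords() {
    control=$1      # path to the control file
    nprocs=$2       # value for $numprocs, e.g. 2

    # Leave the file alone if it already contains a $numprocs keyword.
    if grep -q '^\$numprocs' "$control"; then
        return 0
    fi

    tmp=$(mktemp) || return 1
    # Copy the file, printing the new keywords right before $end.
    # Assumption: the file has a $end line; without one nothing is added.
    awk -v n="$nprocs" '
        /^\$end/ { print "$numprocs " n; print "$parallel_platform MPP" }
        { print }
    ' "$control" > "$tmp" && cat "$tmp" > "$control"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

Running add_parallel_keywords control 2 in the job directory would prepare the input for the mpirun command shown above. Calling it a second time changes nothing, because the function returns early when $numprocs is already present.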

My guess is that your system is too new for the MPICH library that was used to build Turbomole 5.7.

Regards,

Uwe

vvallet

  • Newbie
  • *
  • Posts: 9
  • Karma: +0/-0
Re: No file slave1.output found?
« Reply #2 on: July 14, 2007, 05:00:13 PM »
Hi!

I read the previous messages after also encountering the error message "No file slave1.output found?" when running Turbomole 5.9.1 through the PBS queuing system on our Opteron cluster. The problem does not appear when I run the test suite (TTEST) or a simple small job interactively. Could the problem come from the combination of PBS and the MPICH library?

Thanks for your help.

Valérie

hetuonon

  • Newbie
  • *
  • Posts: 1
  • Karma: +0/-0
Re: No file slave1.output found?
« Reply #3 on: July 14, 2007, 09:20:01 PM »
Hi Valérie,

If you get the error "No file slave1.output found?" when running Turbomole 5.9.1 through the PBS (or Torque) queuing system, it generally means that Turbomole couldn't initialize MPI and, hence, hasn't done a thing. I suggest that you look through the two log files created by PBS (error and output) to see what the real problem is. For us it was a misconfigured MPI: the Turbomole parallel scripts use HP-MPI, which does not work out of the box on an SGI Altix SMP cluster.
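
A quick way to go through those two log files is to search them for typical start-up failures. This is a rough sketch only: the function name scan_pbs_logs is hypothetical, the file name pattern assumes the usual PBS/Torque naming (jobname.o<jobid> and jobname.e<jobid>), and the search patterns are a guess at what MPI start-up problems tend to look like, not a complete list.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print suspicious lines from PBS/Torque
# job logs found in a directory (default: current directory).
scan_pbs_logs() {
    dir=${1:-.}
    # Assumption: logs follow the usual <jobname>.o<jobid> / .e<jobid> naming.
    for f in "$dir"/*.o[0-9]* "$dir"/*.e[0-9]*; do
        [ -f "$f" ] || continue
        # /dev/null as a second file makes grep print the file name.
        grep -n -i -E \
            'p4_error|gethostbyname|mpi abort|permission denied|command not found|no such file' \
            "$f" /dev/null
    done
    return 0
}
```

Each hit is printed as file:line:text, so an empty result only means that none of these particular patterns occurred. The logs are still worth reading in full.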

Cheers,

Heikki

SCILX

  • Newbie
  • *
  • Posts: 3
  • Karma: +0/-0
Re: No file slave1.output found?
« Reply #4 on: July 17, 2007, 09:05:46 AM »
Thanks Uwe, I tried your suggestion, but I'm very confused...

userxxx@node17:~/TEST-UWE$ mpirun -np 3 -machinefile ./machinefile $TURBODIR/bin/`sysname`/ridft_mpi > test.out 2> test.err
userxxx@node17:~/TEST-UWE$ ls
auxbasis  basis  control  coord  energy  machinefile  mos  slave1.output  slave2.output  test.err  test.out  zzz
userxxx@node17:~/TEST-UWE$ cat test.out
 operating system is UNIX !
 hostname is         node17
 ridft


 hostname is    node17

 ridftserver(node17) : TURBOMOLE V5-7-1 11 May 2005 at 17:08:32
 Copyright (C) 2005 University of Karlsruhe


    2007-07-17 08:46:59.156


                                        ridft     server


                     r.ahlrichs, s.brode, m. ehrig, h.horn


                            quantum chemistry group

                        universitaet karlsruhe, germany


 operating system is UNIX !
 operating system is UNIX !
   ***************************************************************************
$operating system unix
   ***************************************************************************
 hostname is         node12
 hostname is         node15
 $numprocs 2

  <read_control>: keyword $numprocs found in <control>
                  n_clients is            2
  Initialization MPI :            2  tasks intended
                                  2  tasks spawned
           0           1           2
           0           1           2
 node process     1 runs on node15           1
 node process     2 runs on node12           2
 nshell =   12, fockdim =       352, nfock =       352
time elapsed since starting is : cpu   0.010 sec
                                  wall    0.063 sec

 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=1 ---- dynamic distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 itscf.gt.1.or.idirec.eq.0-static distribution of tasks
  ridftserv: entering DFT calculation  :
  dyndft=0 ---- static distribution of tasks
 total turnaround time of ridftserver     1.90 sec
 cpu elapsed by ridftserver process       0.01 sec

 ***  ridft server : all done  ***



    2007-07-17 08:47:01.052

userxxx@node17:~/TEST-UWE$ cat test.err
FORTRAN STOP ridft ended normally
FORTRAN STOP ridft ended normally
FORTRAN STOP 0 -------------------------------------FORTRAN server ends
userxxx@node17:~/TEST-UWE$ cat energy
$energy      SCF               SCFKIN            SCFPOT
     1   -76.36168107830    75.84269475416  -152.20437583247
$end
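
The checks done by hand above (stderr reports a normal end, slave1.output exists, the energy file has an entry) can be collected into one function. This is a minimal sketch under assumptions: the name check_parallel_run is hypothetical, and the file names test.out and test.err are simply the ones used in the mpirun command of this thread.

```shell
#!/bin/sh
# Sketch (hypothetical helper): sanity-check the files left by a manual
# parallel run in a directory (default: current directory).
check_parallel_run() {
    dir=${1:-.}
    status=0

    # A successful run wrote "... ended normally" to stderr.
    if ! grep -q 'ended normally' "$dir/test.err" 2>/dev/null; then
        echo "no 'ended normally' line in test.err"
        status=1
    fi
    if grep -q 'ended abnormally' "$dir/test.err" "$dir/test.out" 2>/dev/null; then
        echo "a module ended abnormally"
        status=1
    fi

    # The job scripts look for this file, so it should exist.
    if [ ! -f "$dir/slave1.output" ]; then
        echo "slave1.output missing"
        status=1
    fi

    # Count energy lines, i.e. lines that start with a cycle number.
    n=$(grep -c '^ *[0-9]' "$dir/energy" 2>/dev/null)
    if [ "${n:-0}" -lt 1 ]; then
        echo "energy file is empty"
        status=1
    fi
    return $status
}
```

The function prints one line per failed check and returns a non-zero status if any check failed, so it can be called right after mpirun in a test script.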

node17, node15 and node12 have this Linux system:
userxxx@node17:~/TEST-UWE$ uname -a
Linux node17 2.6.8-3-k7-smp #1 SMP Thu Sep 7 04:08:38 UTC 2006 i686 GNU/Linux
userxxx@node17:~/TEST-UWE$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
userxxx@node17:~/TEST-UWE$ mpichversion
MPICH Version:          1.2.7p1
MPICH Release date:     $Date: 2005/11/04 11:54:51$
MPICH Patches applied:  none
MPICH configure:        --with-device=ch_p4 --prefix=/opt/mpich-1.2.7-GNU
MPICH Device:           ch_p4
userxxx@node17:~/TEST-UWE$ glib-config --version
1.2.10


But if I use node02, node03 and node04, this problem appears:

userxxx@node01:~/TEST-UWE-NEW$ cat test.err
userxxx@node01:~/TEST-UWE-NEW$ cat test.out
p0_30458:  p4_error: Could not gethostbyname for host node01; may be invalid name
: 61
userxxx@node01:~/TEST-UWE-NEW$ uname -a
Linux node01 2.6.18-4-k7 #1 SMP Wed May 9 23:42:01 UTC 2007 i686 GNU/Linux
userxxx@node01:~/TEST-UWE-NEW$ glib-config --version
1.2.10
userxxx@node01:~/TEST-UWE-NEW$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
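
The p4_error above says that a host name could not be resolved, so one thing worth checking is whether every name in the machinefile resolves on the node that starts mpirun. The following is a minimal sketch: the function names are hypothetical, getent hosts is used as an approximation of the gethostbyname lookup, and the machinefile is assumed to contain one "host" or "host:ncpus" entry per line.

```shell
#!/bin/sh
# Sketch (hypothetical helpers): check that every host listed in an
# MPICH machinefile can be resolved on the local node.

# Assumption: getent is available (glibc systems such as Debian).
resolve_host() {
    getent hosts "$1" >/dev/null 2>&1
}

check_machinefile_hosts() {
    machinefile=$1
    status=0
    # Drop comments and any ":ncpus" suffix, skip blank lines,
    # and look at each host name only once.
    for host in $(sed -e 's/#.*//' -e 's/:.*//' "$machinefile" |
                  awk 'NF { print $1 }' | sort -u); do
        if resolve_host "$host"; then
            echo "ok $host"
        else
            echo "UNRESOLVED $host"
            status=1
        fi
    done
    return $status
}
```

If a name is reported as UNRESOLVED, a missing or wrong entry in /etc/hosts (or in DNS) is a common cause, although that is not confirmed for this cluster. Since the error above names node01, the local host name itself may be the one that cannot be resolved there.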

I also tried to run the TURBOTEST test suite on node17, node15 and node12, with these results:

userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ export PARNODES=2
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ export HOSTS_FILE=/opt/TURBOMOLE_5.7/TURBOTEST/machines
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST$ ./TEST -short all
/opt/TURBOMOLE_5.7/TURBOTEST

===============================
THIS IS TEST dscf/short/Ag2.SCF.E
mv: cannot stat `statistics': No such file or directory
Tue Jul 17 08:59:38 CEST 2007
TESTING PROGRAM /opt/TURBOMOLE_5.7/bin/athlon-pc-linux-gnu_mpi/dscf
THERE IS A PROBLEM WITH /opt/TURBOMOLE_5.7/bin/athlon-pc-linux-gnu_mpi/dscf

userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST/dscf/short/Ag2.SCF.E$ cat TESTDIR.athlon-pc-linux-gnu_mpi/master

 general input file <control> is empty !


 general input file <control> is empty !


 MODTRACE: no modules on stack

  CONTRL empty input
 dscf ended abnormally
p2_12043:  p4_error: : 16
p2_12043: (0.041032) net_send: could not write to fd=5, errno = 32
userxxx@node17:/opt/TURBOMOLE_5.7/TURBOTEST/dscf/short/Ag2.SCF.E$ cat TESTDIR.athlon-pc-linux-gnu_mpi/output.test
FORTRAN STOP  dscf ended normally
STARTING dscf ON 2 PROCESSORS!
PLEASE WAIT UNTIL dscf HAS FINISHED
Look for the output in slave1.output
MACHINEFILE is /opt/TURBOMOLE_5.7/TURBOTEST/machines
 dscf ended abnormally
[2] MPI Abort by user Aborting program !
[2] Aborting program!
No file slave1.output found?