Author Topic: problem about parallel TM job on two nodes with 8 cpus  (Read 13974 times)

ardie

  • Newbie
  • *
  • Posts: 8
  • Karma: +0/-0
problem about parallel TM job on two nodes with 8 cpus
« on: April 01, 2009, 07:56:25 PM »
Dear all,
A parallel TM job was submitted to two nodes with 8 CPUs in total (each node has 4 CPUs). If I don't set the variable $HOSTS_FILE, all 8 processes run on the first node and nothing runs on the second node. But if $HOSTS_FILE is set, the whole job stops. The queue system used to submit TM jobs is LJRS, which is similar to PBS. The submission script is:

#!/bin/sh
#LJRS -N qjob
#LJRS -l nodes=2:ppn=4
STARTDIR=$LJRS_O_WORKDIR
cd $STARTDIR
export MPI_ROOT=$TURBODIR/mpirun_scripts/HPMPI
sed 's/c/g/g' $LJRS_NODEFILE > $STARTDIR/parallel.nodes
HOSTS_FILE=$STARTDIR/parallel.nodes
export HOSTS_FILE
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/em64t-unknown-linux-gnu_mpi:$TURBODIR/scripts:$PATH
export PARNODES=8
dscf > dscf.out

The output file generated by the queue system, qjob.o1135, contains the following:
 
Host key verification failed.
mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.

Can anyone tell me why the parallel job can not run successfully? Any suggestion is appreciated.

Ardie

uwe

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 558
  • Karma: +0/-0
Re: problem about parallel TM job on two nodes with 8 cpus
« Reply #1 on: April 01, 2009, 11:04:37 PM »
Hi,

could you please check if you can do a passwordless ssh to the machines in your generated parallel.nodes file?
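One way to run this check for every node at once is sketched below. It is only an illustration: NODEFILE and SSH_CMD are names made up for this example, not Turbomole or LJRS variables, and the file is assumed to list one host name per line. BatchMode=yes makes ssh fail instead of asking for a password, so a node that still needs a password (or whose host key has not been accepted yet, which is what "Host key verification failed" usually points to) shows up as FAILED.

```shell
#!/bin/sh
# Check non-interactive ssh to every distinct host in a node file.
# NODEFILE and SSH_CMD are illustrative names (assumptions), not Turbomole variables.
NODEFILE=${NODEFILE:-parallel.nodes}
SSH_CMD=${SSH_CMD:-ssh}

check_nodes() {
    failed=0
    # sort -u collapses repeated host names (one line per CPU slot)
    for host in $(sort -u "$1"); do
        # BatchMode=yes: never prompt, fail instead
        if $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
            </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}

# Only run the check if the node file is actually present.
if [ -f "$NODEFILE" ]; then
    check_nodes "$NODEFILE"
fi
```

Run it from the job directory after the node file has been written, or from inside the submission script right before dscf is started.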

Uwe

ardie

Re: problem about parallel TM job on two nodes with 8 cpus
« Reply #2 on: April 02, 2009, 03:00:38 AM »
Hi Uwe, thanks for your quick reply. I tried to ssh from one node to another, and a password is required. What should I do?
Ardie

uwe

Re: problem about parallel TM job on two nodes with 8 cpus
« Reply #3 on: April 02, 2009, 09:44:03 AM »
Hi,

this can be done by simply copying your public ssh key to all machines.

How to do this in detail is described on countless web sites, just google for passwordless ssh. For example:

http://www.debian-administration.org/articles/152

On our systems it was sufficient to do two things:

1. run ssh-keygen -t rsa  and ssh-keygen -t dsa  (since I do not know the settings of your machines, it is safe to generate keys of both types). Do not enter a passphrase.

2. copy the generated *.pub keys into the .ssh directory under your home directory on every machine you want to reach with passwordless ssh (.ssh/*.pub)

However, you should always ask your system administrator first. One never knows if your Linux setup is a default one or not...
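As a rough sketch of step 2: on a typical OpenSSH installation the public key also has to be listed in ~/.ssh/authorized_keys on the target machine, so copying the *.pub file alone may not be enough. The helper below is a hypothetical example (install_pubkey is not a standard command) that appends a key to an authorized_keys file only if it is not there already and tightens the permissions. It assumes the home directory is shared between the nodes, which may not be true at your site.

```shell
#!/bin/sh
# Append a public key to an authorized_keys file unless it is already listed.
# install_pubkey is a hypothetical helper; both paths are passed by the caller.
install_pubkey() {
    pub=$1
    auth=$2
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    # -F fixed string, -x whole line, -q quiet: true if the key is already present
    if ! grep -qxF -- "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    # sshd commonly rejects key files that other users can write to
    chmod 700 "$(dirname "$auth")"
    chmod 600 "$auth"
}

# Typical use with a shared home directory (assumption):
#   ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
#   install_pubkey "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
# Without a shared home, ssh-copy-id user@host does the equivalent remotely.
```

After that, one manual ssh to each node is usually still needed so that its host key gets accepted and stored in ~/.ssh/known_hosts.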

Regards,

Uwe

ardie

Re: problem about parallel TM job on two nodes with 8 cpus
« Reply #4 on: April 02, 2009, 01:55:44 PM »
Hi Uwe,
Following your suggestion, I have set up ssh so that it no longer asks for a password. But the parallel job stopped with this error message in the dsf.log file:

STARTING dscf ON 8 PROCESSORS!
RUNNING PROGRAM /export/soft/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/dscf_mpi.
PLEASE WAIT UNTIL dscf HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is /home/ardie/job1/parallel.nodes
No file slave1.output found?

How can I resolve this?

Thanks

Ardie

uwe

Re: problem about parallel TM job on two nodes with 8 cpus
« Reply #5 on: April 04, 2009, 09:12:36 PM »
Hi,

It seems that mpirun did not start successfully. Is there any other output file that contains an error message? Usually a file called master is generated, and the error message is either in there or appears on the screen, depending on where the error comes from.
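A quick way to collect whatever diagnostic files exist in the job directory is sketched here. show_logs is a made-up helper for this example, and the file names (master, slave*.output) are simply the ones mentioned in this thread; other installations may write different files.

```shell
#!/bin/sh
# Print the tail of each diagnostic file found in a job directory.
# show_logs is a hypothetical helper; file names are taken from this thread.
show_logs() {
    dir=${1:-.}
    found=0
    for f in "$dir"/master "$dir"/slave*.output; do
        # An unmatched glob stays literal, so skip anything that is not a file
        [ -f "$f" ] || continue
        found=1
        echo "==== $f ===="
        tail -n 20 "$f"
    done
    [ "$found" -eq 1 ] || echo "no master or slave*.output files in $dir"
}
```

Calling show_logs "$STARTDIR" at the end of the submission script would put these messages into the queue system's output file next to any mpirun warnings.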

Uwe