TURBOMOLE Users Forum
Installation and usage of TURBOMOLE => Parallel Runs => Topic started by: ardie on April 01, 2009, 07:56:25 PM
-
Dear all,
A parallel TM job was submitted to two nodes with 8 CPUs in total (each node has 4 CPUs). When I did not set the variable $HOSTS_FILE, all 8 processes ran on the first node and nothing ran on the second node. But when $HOSTS_FILE was set, the whole job stopped. The queue system used to submit TM jobs is LJRS, which is similar to PBS. The submission script is:
#!/bin/sh
#LJRS -N qjob
#LJRS -l nodes=2:ppn=4
STARTDIR=$LJRS_O_WORKDIR
cd $STARTDIR
export MPI_ROOT=$TURBODIR/mpirun_scripts/HPMPI
sed 's/c/g/g' $LJRS_NODEFILE > $STARTDIR/parallel.nodes
HOSTS_FILE=$STARTDIR/parallel.nodes
export HOSTS_FILE
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/em64t-unknown-linux-gnu_mpi:$TURBODIR/scripts:$PATH
export PARNODES=8
dscf > dscf.out
The output file generated by the queue system, qjob.o1135, contains the following:
Host key verification failed.
mpirun: Warning one or more remote shell commands exited with non-zero status, which may indicate a remote access problem.
Can anyone tell me why the parallel job does not run successfully? Any suggestions are appreciated.
Ardie
-
Hi,
could you please check if you can do a passwordless ssh to the machines in your generated parallel.nodes file?
Uwe
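One way to run this check over every host in the file is sketched below. It assumes the node file lists one hostname per line (typically repeated once per CPU); the function names are made up for illustration and are not part of TURBOMOLE or the queue system. With BatchMode=yes, ssh fails instead of prompting, so a host that still asks for a password, or whose host key has not been accepted yet, is reported as FAILED.

```shell
#!/bin/sh
# Hypothetical helpers; not part of TURBOMOLE or the queueing system.

# Print each hostname in the node file once (the file typically repeats
# a host once per CPU), skipping blank lines.
unique_hosts() {
    grep -v '^[[:space:]]*$' "$1" | sort -u
}

# Try a non-interactive login to every host and report the result.
check_ssh() {
    for host in $(unique_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}
```

After sourcing these functions, the check could be run from the job directory as: check_ssh parallel.nodes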
-
Hi Uwe, thanks for your quick reply. I tried to ssh from one node to another, and a password is required. What should I do?
Ardie
-
Hi,
this can be done by simply copying your public ssh key to all machines.
How to do this in detail is described on countless web sites, just google for passwordless ssh. For example:
http://www.debian-administration.org/articles/152
On our systems it was sufficient to do two things:
1. Run ssh-keygen -t rsa and ssh-keygen -t dsa (since I do not know the settings of your machines, it is safest to generate keys of both types). Do not enter a passphrase.
2. Copy the generated *.pub keys into the .ssh directory of your home directory on every machine where you want passwordless ssh.
However, you should always ask your system administrator first. One never knows if your Linux setup is a default one or not...
Regards,
Uwe
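The two steps above might look like the sketch below. It is an illustration under stated assumptions, not the exact procedure used on the systems Uwe describes: it generates only an RSA key, and it appends the public key to authorized_keys, which is the file the OpenSSH server consults by default. The function name and its directory argument are hypothetical. If the home directory is shared between the nodes, running it once should be enough; otherwise the public key has to be appended on each node.

```shell
#!/bin/sh
# Hypothetical helper: create a passphrase-less RSA key pair in the given
# .ssh directory (if none exists yet) and authorize it for logins.
setup_passwordless_key() {
    ssh_dir=$1
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    if [ ! -f "$ssh_dir/id_rsa" ]; then
        # -N "" sets an empty passphrase, -q suppresses the progress output.
        ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa" || return 1
    fi
    # Append the public key unless an identical line is already present.
    touch "$ssh_dir/authorized_keys"
    if ! grep -qxF "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys"; then
        cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}
```

A typical call would be: setup_passwordless_key "$HOME/.ssh"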
-
Hi Uwe,
Following your suggestion, I have set up ssh and it no longer asks for a password. But the parallel job now stops with this error message in the dsf.log file:
STARTING dscf ON 8 PROCESSORS!
RUNNING PROGRAM /export/soft/TURBOMOLE/bin/em64t-unknown-linux-gnu_mpi/dscf_mpi.
PLEASE WAIT UNTIL dscf HAS FINISHED.
Look for the output in slave1.output.
MACHINEFILE is /home/ardie/job1/parallel.nodes
No file slave1.output found?
How can I resolve this?
Thanks
Ardie
-
Hi,
It seems that the start of mpirun was not successful. Is there any other output file that contains an error message? Usually a file called master is generated, and the error message is either in there or appears on the screen, depending on where the error comes from.
Uwe
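A quick way to collect whatever the failed start left behind is sketched below. The file names master and slave*.output are the ones mentioned in this thread; the function name and the 20-line tail are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show the end of every candidate output file in the
# job directory so an error message from mpirun or dscf is easy to spot.
show_job_errors() {
    job_dir=${1:-.}
    found=0
    for f in "$job_dir"/master "$job_dir"/slave*.output; do
        [ -f "$f" ] || continue      # also skips a glob that matched nothing
        found=1
        echo "==== $f ===="
        tail -n 20 "$f"
    done
    [ "$found" -eq 1 ] || echo "no master or slave*.output file in $job_dir"
}
```

Called without an argument, show_job_errors looks in the current directory, so it can be run directly in the job directory after a failed run.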