Dear all,
I submitted a parallel TM job to two nodes with 4 CPUs each (8 CPUs in total). If I do not set the variable $HOSTS_FILE, all 8 processes run on the first node and nothing runs on the second node. If $HOSTS_FILE is set, the whole job stops. The queue system used to submit TM jobs is LJRS, which is similar to PBS. The submission script is:
#!/bin/sh
#LJRS -N qjob
#LJRS -l nodes=2:ppn=4
STARTDIR=$LJRS_O_WORKDIR
cd $STARTDIR
export MPI_ROOT=$TURBODIR/mpirun_scripts/HPMPI
sed 's/c/g/g' $LJRS_NODEFILE > $STARTDIR/parallel.nodes
HOSTS_FILE=$STARTDIR/parallel.nodes
export HOSTS_FILE
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/em64t-unknown-linux-gnu_mpi:$TURBODIR/scripts:$PATH
export PARNODES=8
dscf > dscf.out
The output file generated by the queue system, qjob.o1135, contains the following:
Host key verification failed.
mpirun: Warning one or more remote shell commands exited with non-zero status, which may indicate a remote access problem.
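The "Host key verification failed." line is an ssh message, so one thing that could be checked is whether every host listed in parallel.nodes accepts a non-interactive ssh login from the first node. The following is only a minimal sketch under that assumption: it assumes mpirun starts its remote processes through ssh, and the helper names unique_hosts and check_hosts are hypothetical, not part of TM or HP-MPI.

```shell
#!/bin/sh
# Sketch only: unique_hosts and check_hosts are hypothetical helper names.
# Assumption: the remote shell used by mpirun is ssh.

# Print each distinct host name found in a node file, one per line.
unique_hosts() {
    sort -u "$1"
}

# Try a non-interactive ssh login to every distinct host in the node file.
# BatchMode=yes makes ssh fail instead of prompting, so a missing or
# mismatched host key (or a missing login key) shows up as a FAILED line.
check_hosts() {
    for host in $(unique_hosts "$1"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host ok"
        else
            echo "$host FAILED"
        fi
    done
}
```

Inside the job script this could be called as check_hosts "$STARTDIR/parallel.nodes" just before the dscf line. A host reported as FAILED would suggest that the renamed host names written by the sed command are not yet known to ssh for the user running the job.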
Can anyone tell me why the parallel job cannot run successfully? Any suggestions are appreciated.
Ardie