Author Topic: MPICH2 problem with TM-5.9.1 (Read 18748 times)

Daniel · « **on:** May 29, 2007, 10:46:37 AM »

Hello

I've installed Turbomole 5.9.1 on our Rocks cluster, and the serial version works quite well.
After setting 'export PATH=$TURBODIR/mpirun_scripts/MPICH2:$PATH' to avoid a conflict with the preinstalled MPICH2 version, TM also runs in parallel on 2 CPUs (1 node).
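
A quick way to confirm that the PATH change has the intended effect is to ask the shell which mpirun it resolves first. This is only a sketch: the fallback install path /opt/turbomole is a made-up placeholder, not something from the Turbomole distribution.

```shell
# Hypothetical install location; your real TURBODIR is used if it is already set.
TURBODIR="${TURBODIR:-/opt/turbomole}"

# Prepend the bundled MPICH2 scripts so they shadow any system-wide mpirun.
export PATH="$TURBODIR/mpirun_scripts/MPICH2:$PATH"

# Show which mpirun the shell resolves first (prints a notice if none is found).
command -v mpirun || echo "no mpirun found on PATH"
```

If the printed path does not start with $TURBODIR/mpirun_scripts, the preinstalled MPICH2 is probably still taking precedence.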

To use more CPUs, I've set 'export PARNODES=8' and 'export HOSTS_FILE=hostsfile' and also generated the 'mpd.hosts' file in my home directory. By the way, ssh and rsh to the compute nodes work without a password prompt. Now, when I start Turbomole, the following error messages occur:

convgrep will be taken out of the TURBODIR directory
 ridft ended abnormally
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -16) - process 1
 ridft ended abnormally
[cli_8]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -16) - process 8
OPTIMIZATION CYCLE 1
[cli_5]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(795)............................: MPI_Bcast(buf=0xbfffdf68, count=1, dtype=0x4c000430, root=0, comm=0x84000000) failed
MPIR_Bcast(193)...........................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=6,errno=104:Connection reset by peer)
[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(795)............................: MPI_Bcast(buf=0xbfffdf68, count=1, dtype=0x4c000430, root=0, comm=0x84000000) failed
MPIR_Bcast(193)...........................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
 rdgrad ended abnormally
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -16) - process 1
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=3,errno=104:Connection reset by peer)
[cli_3]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(795)............................: MPI_Bcast(buf=0xbfffdf68, count=1, dtype=0x4c000430, root=0, comm=0x84000000) failed
MPIR_Bcast(193)...........................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=3,errno=104:Connection reset by peer)

Does anybody know how I can solve this problem?

Thanks
Daniel

uwe · « **Reply #1 on:** May 29, 2007, 03:16:19 PM »

Hello,

First of all:

This Turbomole forum is not a support forum, but the place where Turbomole users can share their experiences, problems, solutions, thoughts, ...

If you have purchased a Turbomole license, you will get support directly from the company from which you purchased it (COSMOlogic in your case).

Second:

Only the i686-pc-linux-gnu version of Turbomole 5.9.1 uses MPICH2. All other Linux binary versions are now based on HP-MPI. If you start one of those HP-MPI-based parallel versions with MPICH2, it simply will not work. HP-MPI comes with the Turbomole distribution, and if you are using the parallel scripts (from $TURBODIR/bin/`sysname`/ if sysname prints out *_mpi), the correct mpirun will be taken from the $TURBODIR/mpirun_scripts/HPMPI or $TURBODIR/mpirun_scripts/MPICH2 directory.

Now to your problem:

The mpd.hosts file will not be taken into account, but the file named in HOSTS_FILE should be. If the job runs on one node with 2 CPUs, but not on 8 CPUs spread over several nodes, there are several possible reasons:

1. input too small to be run on 8 CPUs - the parallel version of Turbomole will crash if one of the clients does not get a single task to calculate. So if the serial calculation is finished in a few seconds, it is very likely that using several CPUs will fail.

2. ssh needs a password - check if you can start jobs via ssh by using the IP address rather than the name of the hosts

3. stack size limit when running in parallel. See the FAQ about stack size:

http://www.turbo-forum.com/index.php?topic=23.msg38#msg38
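
Points 2 and 3 can be checked from the command line before starting a job. The following is a rough sketch, not an official Turbomole tool: the function names are made up for illustration, and it assumes the hosts file lists one host per line. BatchMode=yes and ConnectTimeout are standard OpenSSH options; with BatchMode, ssh fails instead of asking for a password.

```shell
#!/bin/sh
# Pre-flight checks before a multi-node run (illustrative sketch only).

# Print each distinct host in a hosts file (assumed format: one host per line).
unique_hosts() {
  sort -u "$1" | sed '/^[[:space:]]*$/d'
}

# Return 0 if a command can be started on the host without a password prompt.
check_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Fall back to the file name used earlier in this thread if HOSTS_FILE is unset.
hosts_file="${HOSTS_FILE:-hostsfile}"
if [ -r "$hosts_file" ]; then
  for h in $(unique_hosts "$hosts_file"); do
    if check_ssh "$h"; then
      echo "ssh ok: $h"
    else
      echo "ssh FAILED: $h"
    fi
  done
else
  echo "hosts file not readable: $hosts_file"
fi

# Show the stack size limit of the current shell (see the stack size FAQ).
echo "stack size limit: $(ulimit -s)"
```

To follow the advice in point 2, run it once with host names and once with a hosts file that contains the IP addresses of the same nodes.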

If the parallel Turbomole still does not run, please contact the Turbomole support.

Regards,

Uwe

Daniel · « **Reply #2 on:** May 29, 2007, 03:40:00 PM »

Thank you very much!

I've tested it with the calculation of water, so the input was just too small for 8 CPUs. With bigger molecules there is no problem.

Best regards
Daniel
