Hi,
ridft, dscf, grad, rdgrad, and mpgrad need an extra server process to run in parallel. This process does not do any calculations; it only controls the distribution of the tasks. The scripts in $TURBODIR/mpirun_scripts automatically start that extra process.
There are really two separate topics here:
- ricc2
ricc2 is a relatively new program and does not need such an extra process. Unfortunately, the scripts start the extra process for ricc2 anyway. While a server task does not need much CPU time (see below), the extra ricc2 task of course does.
- CPU time for the server
The server process more or less just waits for a signal from the clients and then sends a few bytes with the new task information. Whether the process stays active all the time, busily polling for data from the clients, or sleeps and is woken up as soon as a message arrives, depends on the MPI library and on the interconnect used (Ethernet, InfiniBand, Myrinet, SMP, ...).
If the MPI library vendor wants to keep the latency as low as possible, the process will be active all the time, and top will show 100% CPU usage. We do not have anything but Ethernet interconnects here, but at least for TCP/IP that should not be the case with HP-MPI.
We fixed the bug with the extra ricc2 process shortly after the release of 5.9.1 and replaced the files in the official tar ball on the ftp server. There have been some other minor issues with the scripts, so if you have problems with the parallel version, please copy the latest scripts for Linux/PC (i686, i786, x86_64, em64t) from our ftp server:
ftp.chemie.uni-karlsruhe.de/pub/mpirun_scripts_5.9.1/
Copy the file to $TURBODIR and unpack it there. It will NOT overwrite HPMPI, so please do not remove the old mpirun_scripts directory before unpacking the new file.
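In case it helps, the update boils down to something like the following (the archive name below is only a placeholder, please use the file that is actually listed in that ftp directory):

    # download the updated scripts into $TURBODIR
    # (mpirun_scripts_LINUX.tar.gz is an example name - take the file you
    #  actually find in the ftp directory)
    cd $TURBODIR
    wget ftp://ftp.chemie.uni-karlsruhe.de/pub/mpirun_scripts_5.9.1/mpirun_scripts_LINUX.tar.gz

    # unpack on top of the existing installation; the old mpirun_scripts
    # directory stays in place and HPMPI is not touched
    tar xzf mpirun_scripts_LINUX.tar.gz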
Note: If you are running Turbomole 5.9.1 in parallel from a queuing system, or with a list of machines (by setting $HOSTS_FILE to a file that contains that list), the parallel scripts of Turbomole 5.9.1 for the i786, x86_64, and em64t platforms (which now use HP-MPI) no longer use $PARNODES as the number of CPUs. The number of nodes that you have requested from the queuing system (or the number of lines in $HOSTS_FILE) is taken instead. For dscf, grad, ridft, rdgrad, and mpgrad, the first node is taken twice. Hence, if you set $PARNODES to, say, 4, make sure that you also request 4 CPUs from your queuing system; see the sketch below.
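As a rough illustration only (PBS-style syntax is assumed here; the resource request, the variable that holds the node list, and whether you have to set $HOSTS_FILE yourself depend on your queuing system and setup):

    #!/bin/sh
    #PBS -l nodes=4                     # request 4 CPUs from the queue
    # the usual parallel Turbomole environment (PATH, PARA_ARCH=MPI, ...)
    # is assumed to be set up already
    export HOSTS_FILE=$PBS_NODEFILE     # node list provided by the queue
    export PARNODES=4                   # keep consistent with the 4 CPUs above
    cd $PBS_O_WORKDIR
    ridft > ridft.out 2>&1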
Regards,
Uwe