TURBOMOLE Users Forum

TURBOMOLE Modules => Ricc2 => Topic started by: mradon on July 26, 2011, 02:27:33 PM

Title: Efficiency of RICC2_smp
Post by: mradon on July 26, 2011, 02:27:33 PM: Dear Turbomole Developers,

I wonder if you have some benchmarks showing how efficient the OpenMP parallelization in ricc2_smp is?

With this module I have performed a few RI-CCSD(T) calculations, each running on 4 threads (theoretically). However, the actual CPU usage oscillated between 10% and 390%. The final timings
------------------------------------------------------------------------
total cpu-time : 1 days 6 hours 50 minutes and 33 seconds
total wall-time : 1 days 3 hours 36 minutes and 59 seconds
------------------------------------------------------------------------
show, indeed, only a small difference between the cpu-time and the wall-time. Notably, the wall-time is shorter, but only slightly. As far as I am aware, the ideal performance of a program running on 4 cores would correspond to a cpu-time 4 times larger than the wall-time. Is that right?
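To put a number on this, here is a minimal Python sketch that converts the two timing lines above to seconds and takes their ratio. The helper name `parse_duration` is made up for illustration, and the regular expression only assumes the "N days N hours N minutes and N seconds" layout shown above. The ratio is a rough indicator only, since it relies on the cpu-time being summed over all threads.

```python
import re

# Seconds per unit, for the unit words seen in the timing lines above.
UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}

def parse_duration(line):
    """Convert e.g. '1 days 6 hours 50 minutes and 33 seconds' to seconds."""
    total = 0.0
    pattern = r"(\d+(?:\.\d+)?)\s+(day|hour|minute|second)s?"
    for value, unit in re.findall(pattern, line):
        total += float(value) * UNIT_SECONDS[unit]
    return total

cpu = parse_duration("total cpu-time : 1 days 6 hours 50 minutes and 33 seconds")
wall = parse_duration("total wall-time : 1 days 3 hours 36 minutes and 59 seconds")

# Ratio of accumulated cpu-time to elapsed wall-time.
ratio = cpu / wall
print(f"cpu = {cpu:.0f} s, wall = {wall:.0f} s, cpu/wall = {ratio:.2f}")
```

For the timings above this gives a cpu/wall ratio of about 1.12, far from the value of 4 one would naively expect for four fully busy threads.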

Do you think there is any way to speed up my calculations by optimizing I/O? I suspect that the I/O overhead might still not be optimal in my configuration. But maybe I have already reached the intrinsic limit of ricc2_smp and there is not much reason to optimize I/O?

Your comments will be greatly appreciated. Thank you in advance!
Title: Re: Efficiency of RICC2_smp
Post by: Arnim on July 27, 2011, 08:46:51 PM: Hi,

the difference between cpu-time and wall-time cannot be used to measure the parallel speedup. For the OpenMP version, what is actually collected in the cpu-time depends on the platform and on the compiler. The wall-time should be ok, though. But for real speedup, you have to run the program on one and then on four threads.
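As a sketch of how the two wall-times would be compared: the speedup is the one-thread wall-time divided by the n-thread wall-time, and the parallel efficiency is the speedup divided by the number of threads. The function name and the wall-times below are hypothetical placeholders, not measured results.

```python
def parallel_speedup(wall_serial, wall_parallel, n_threads):
    """Return (speedup, efficiency) from the wall-times of two runs.

    wall_serial   -- wall-time of the run on one thread, in seconds
    wall_parallel -- wall-time of the same job on n_threads threads
    """
    speedup = wall_serial / wall_parallel
    efficiency = speedup / n_threads
    return speedup, efficiency

# Hypothetical wall-times in seconds, for illustration only.
speedup, efficiency = parallel_speedup(36000.0, 12000.0, 4)
print(f"speedup = {speedup:.2f}, efficiency = {efficiency:.0%}")
```

Both runs should use the same input and the same scratch setup, otherwise I/O differences will show up as apparent parallel speedup or slowdown.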

If the CPU usage is too often 10% or lower, it might be an indication that the setup of your scratch disk is not optimal. E.g. always make sure not to run on NFS partitions.
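One way to check which filesystem a directory lives on is to look it up in the mount table. The sketch below is a rough illustration in Python: it assumes the Linux `/proc/mounts` layout (device, mount point, filesystem type, ...), and the function name, the sample mount table and the paths are all made up.

```python
import posixpath

def filesystem_type(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    path = posixpath.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Match whole path components only, so /scratch does not match /scratchy.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type

# Made-up mount table; on Linux one could pass open("/proc/mounts").read().
sample_mounts = """\
/dev/sda1 / ext4 rw 0 0
server:/export/home /home nfs4 rw 0 0
/dev/sdb1 /scratch xfs rw 0 0
"""

for directory in ("/scratch/job1", "/home/user/calc"):
    print(directory, "->", filesystem_type(directory, sample_mounts))
```

A result of `nfs` or `nfs4` for the directory holding the scratch files would point to the NFS case mentioned above.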

Hope that helps,

Arnim
Title: Re: Efficiency of RICC2_smp
Post by: mradon on July 29, 2011, 02:16:44 PM: > The wall-time should be ok, though. But for real speedup, you have
> to run the program on one and then on four threads.

We will do these tests (1 core vs 4 cores, etc) of course.
I wonder if you have any results of such parallel benchmarks? It would be very nice to see them on the Turbomole webpage.

> If the CPU usage is too often 10% or lower, it might be an indication that the setup
> of your scratch disk is not optimal. E.g. always make sure not to run on NFS partitions.

No NFS for sure ;) It's Lustre connected through InfiniBand, so quite a fast configuration. But I suspect that the I/O rate is still not optimal, so we will perhaps try a local disk or try to optimize the filesystem further.

Thank you for your answer.