Author Topic: When to set $ricore 0 and what about $ricore_slave? (Read 9942 times)

Hauke · « **on:** February 07, 2013, 09:50:37 PM »

I know that when doing DFT calculations by using hybrid functionals, the SCF step is the most time consuming step and therefore the benefit from using RI-J should be smaller than the huge speed up one gets when using RI-J in combination with non-hybrid functionals.
To see the benefit and to decide if it is better to hybrid functionals with or without RI-J (of course MA-RI-J was used then as well) I did some test-calculations. I used TM 6.3.1 to optimize a molecule having in total 847 basis function, C1-symmetrie on b3-lyp+Disp/def2-TZVP level of theory on 8 cpu cores, each having 2.5 GB of memory available (2.66 GB to be precise but only 2.5 GB where requested by the queing system). For parallelization I've set PARA_ARCH=SMP.
The results can be found in the following table (all times are given in minutes):

module\|	$ricore/MB\|	$ricore_slave/MB\|	cycles\|	cpu-time total\|	wall-time total\|	cpu-time per cycle\|	wall-time per cycle\|
grad	---	---	29	96.0	96.1	3.3	3.3
dscf	---	---	30	3110.6	486.6	103.7	16.2
jobex	---	---	29		584.3		20.1
rdgrad	1875	---	24	554.5	554.7	23.1	23.1
ridft	1875	---	25	313.2	313.7	12.5	12.5
jobex -ri	1875	---	24		872.0		36.3
rdgrad	1875	1875	24	557.3	557.5	23.2	23.2
ridft	1875	1875	25	315.8	316.8	12.6	12.7
jobex -ri	1875	1875	24		879.5		36.6
rdgrad	200	---	24	554.3	554.4	23.1	23.1
ridft	200	---	25	315.6	316.0	12.6	12.6
jobex -ri	200	---	24		873.7		36.4
rdgrad	0	---	25	53.5	53.6	2.1	2.1
ridft	0	---	26	324.6	325.1	12.5	12.5
jobex -ri	0	---	25		381.9		15.3
rdgrad	0	1	25	50.9	50.9	2.0	2.0
ridft	0	1	26	327.2	328.7	12.6	12.6
jobex -ri	0	1	25		383.2		15.3

I think that the high ratio ratio between cpu-time and wall-time in the case of dscf is due to dscf running as one process using all n cores whereas ridft is running as n separate processes and therefore nothing to worry about. In all other cases the cpu- and wall-time are very close to each other.
I don't know why dscf/grad and also the examples with $ricore 0 needed more cycles to fulfill the convergence criteria, but I'll focus on the timings per cycle anyway.

ridft is some percent faster than dscf (in my case over 20%, which is more than I expected). But when assigning the recommended 2/3 to 3/4 of the memory for $ricore (1875 MB) this advantage is overcompensated by the low speed of rdgrad in comparison with grad making the total jobex -ri optimisation in these cases in total significantly (almost factor 2) slower than just jobex Timing does not change much when $ricore_slave 1875 is specified as well (but I'm not sure if my settings make sense anyway, see below). Also when just using the define default 200 MB for $ricore is used the timings don't change.
Only when I set $ricore to 0 also rdgrad speeds up (more then 10 times (!) in comparison with rdgrad with $ricore different from 0 and also significantly in comparison with grad). Then also jobex -ri in total only needs 3/4 of the time of jobex.

I understand that for molecules of a certain size increasing the size of $ricore has little effect (if the molecule is so small that all RI matrices and
RI-integrals can be stored already or if it is so big that the percentage of stored RI stuff is small anyway). But I don't understand that setting $ricore to 0 has such an enormous effect (In the 200 MB case only 8% of the available memory are used for $ricore, so there should be enough memory remaining for other stuff. According to the output rdgrad allcocated only 4 MiB of local memory in all cases.) Or are the turbomole routines that fast that it is faster to recalculate an integral then getting it out of the memory?

Can somebody give an explanation and/or a rule of thumb when setting $ricore to 0 is recommend?

When doing this testing I of course came across $ricore_slave but I don't understand exactly what is does.
In the manual section about " Keywords for Parallel Runs " it is written:

Quote

In the parallel version of ridft, the first client reads in the keyword $ricore from the control file and uses the given memory for the additional RI matrices and for RI-integral storage. All other clients use the same amount of memory as the first client does, although they do not need to store any of those matrices. This leads to a better usage of the available memory per node. But in the case of a big number of auxiliary basis functions, the RI matrices may become bigger than the specified $ricore and all clients will use as much memory as those matrices would allocate even if that amount is much larger than the given memory. To omit this behaviour one can use:
$ricore_slave integer
specifying the number of MBs that shall be used on each client.

In my understanding this is different to the description in the thread about how to
Improve the efficiency and scaling behaviour

Quote

$ricore N
is the memory N in MB used for RI-J integral storage only. In the parallel version, the actual memory requirements are:
slave1: N minus the memory needed to keep the (P|Q) matrix, see output for the values that are needed during runtime.
slave2-slaveX: N plus the memory needed for (P|Q). The memory can also be set explicitely by using $ricore_slaves in addition to $ricore.

By the way, according to the manual the keyword seems to be $ricore_slave whereas in the linked post $ricore_slaves is used. As I didn't see any difference even when $ricore_slave(s) (I tried both ways) is not specified at all I can't tell which way is correct but it should be unified.

In my test case the "Memory core needed for (P|Q) and Cholesky" is always "7 MByte on GA"
And the "Memory allocated for RI-J " is only 1-2 MB
+ integrals (more then 3800 Mbyte if $ricore 1875, exact 1600 MByte if $ricore 200; 0 Mbyte if $ricore 0)
How can I bring these numbers in agreement with the written information?

As 3800 Mbyte/client is much more than I've available in the cluster, according to the manual I was expecting a crash due to every client using this amount of memory?

Can somebody explain me how $ricore_slave should work and give cases where $ricore_slave has an effect (is it only working when using the mpi parallel version?)? Which are reccomendet settings for $ricore_slave? (does it make sens to set it different from $ricore?)

Sorry that this post became a bit long but I hope you get what I mean and can help me to speed up my turbomole calculations even more.
[/list]

antti_karttunen · « **Reply #1 on:** February 08, 2013, 07:56:14 AM »

Hi Hauke,

This is an interesting test set. You can find some previous discussions about ricore_slave if you search the forum using this keyword. But few comments:

1) Perhaps you could do one more test using $ricore 15000 and $ricore_slave 1? Sometimes when all processes are running on a single computer, it's more useful to give all memory to the master process (big ricore) and avoid giving any extra memory to the slave processes (ricore_slave 1, like you used. Setting ricore_slave 0 does not work or at least it did not work previously). At least for my calculations using pure functionals, this has been a great way to speed them up. But then again, I was using the "classic" MPI version, not the SMP (Global arrays) version (but hybrids are not parallelized for classic MPI, at least not when I last checked).

2) Did you also turn on $marij for ridft/rdgrad? This is usually the reasonable thing to do when usin the RI-J approximation. Although in this case your system is not very big and I'm not sure whether the speed gain is that good for hybrids. But I'll mention it anyway since it's definitely something you will want to use for pure functionals.

But looking at your timings it seems that rdgrad has some problems with the memory allocation. It's really strange how it spends so much time doing the gradients with larger ricore. So, in this case ricore 0 would definitely be the most reasonable choice. In any case, ricore 0 is always the simplest choice that should deliver good performance.

Antti

uwe · « **Reply #2 on:** February 08, 2013, 04:16:21 PM »

Hello,

there are several interesting points to comment on, but I will try to limit myself to the most important ones:

you are right, the exact exchange used by hybrid functionals is by far the most time demanding part, so RI-J is not expected to speed up the job significantly. In ridft, however, a (so called) linear scaling algorithm for the exchange is being used similar to the linK method of Ochsenfeld and Head-Gordon (http://link.aip.org/link/doi/10.1063/1.476741). This explains why ridft is faster than dscf also for hybrid functionals.
$ricore_slave (without an s at the end) is only used in the (old) MPI version, it is ignored in the SMP case.
the default for the RI memory settings of the clients has been changed in the MPI version in Turbomole 6.2 (June 2010), so please ignore my posts to the forum related to this keyword which are older than that ...
now to the strange timings for rdgrad - $ricore for the gradient has an effect only if RI-K is being used. Did you switch on RI-K in some of the calculations?

Regards

Uwe

Hauke · « **Reply #3 on:** February 16, 2013, 03:15:27 AM »

Thanks for your replies

Quote from: antti_karttunen on February 08, 2013, 07:56:14 AM

You can find some previous discussions about ricore_slave if you search the forum using this keyword

yes, I've looked for them before, but all of them I've found are about $ricore_slave 1 in combination with $ricore 0 (or 1).

Quote from: antti_karttunen on February 08, 2013, 07:56:14 AM

1) Perhaps you could do one more test using $ricore 15000 and $ricore_slave 1?

If I do so I get the following error message

Quote

GA heap raised to: -1837367296

Seems to be an integer overflow (similar to this fixed bug here)
($ricore 13100 works, it is raising the GA heap to 2146304000 which is very close to the long integer limit of 2^31 -1 = 2147483647)
A bad side effect of this overflow is that the control file after the crashed run only consist of $title and the last part in control.1 is missing.

Quote from: antti_karttunen on February 08, 2013, 07:56:14 AM

2) Did you also turn on $marij for ridft/rdgrad?

In the RI-cases MA-RI-J was used then as well.

Quote from: antti_karttunen on February 08, 2013, 07:56:14 AM

In any case, ricore 0 is always the simplest choice that should deliver good performance.

I agree (although this setting is quite in contrast to the manual). Still I wanted to know the reason for the differences shown above between $ricore 0 or a non 0 value.

Quote from: uwe on February 08, 2013, 04:16:21 PM

$ricore_slave (without an s at the end) is only used in the (old) MPI version, it is ignored in the SMP case.

This explains why I see no effect and also why even $ricore_slave 0 worked (even when it should not). So I'll not further worry about values for $ricore_slave

I did some further investigation and I think I found the reason (but not the explanation) for my problem. I wanted to do an aoforce run after jobex (also to test this problem described here) and had therefore already set $maxcor to 1875 MB for the non-ridft case and 312 MB in the other cases.

If I delete the $maxcor line the timing is the following (all time values in minutes):

module\|	$ricore/MB\|	cycles\|	cpu-time total\|	wall-time total\|	cpu-time per cycle\|	wall-time per cycle\|
grad	---	29	96.0	96.0	3.3	3.3
dscf	---	30	3103.1	493.6	103.4	16.5
jobex	---	29		591.3		20.4
rdgrad	1875	28	59.4	59.5	2.1	2.1
ridft	1875	29	334.5	336.0	11.5	11.6
jobex -ri	1875	28		399.9		14.3
rdgrad	200	29	61.5	61.6	2.1	2.1
ridft	200	30	357.0	357.9	11.9	11.9
jobex -ri	200	29		424.3		14.6
rdgrad	0	29	61.5	61.5	2.1	2.1
ridft	0	30	353.2	354.4	11.8	11.8
jobex -ri	0	29		420.5		14.5

Because for the runs without ri $maxcor in the previous post was set to 1875 MB - which should be more then enough- the timings of grad/dscf didn't change in comparison with the data in this table (without any $maxcor). The runs with ri (without any $maxcor) needed a few more cycles then in the table of the last post ($maxcor 312 MB). I don't know the reason for the different number of cycles but this explains a bit why the time/cycle of ridft is lowered a bit, as later ridft cycles tend to need less time. For the rdgrad steps the variation of the timings within the different cycles of one run is much smaller, so the effect of the increased number of cycles is not important. One can clearly see that the rdgrad steps of whatever $ricore now all take 2.1 minutes/cycle which is equal to the fast value of the run with $ricore 0 from the previous post. Therefore also jobex -ri is now fast for all settings of $ricore (the run with $ricore 1875 seems to be a bit faster then the runs with less $ricore (even if one accounts their extra cycle), but as the difference is only some seconds per cycle, this can also be by coincidence and is not significant)

I didn't expect $maxcor to have an effect on rdgrad (it is not mentioned in this section of the manual and there is for example also -in contrast to aoforce or escf- no warning that $maxcor might be to small). And also I don't get why in the case of $ricore 0 also the $maxcor settings seemed to be ignored?

uwe · « **Reply #4 on:** February 18, 2013, 04:15:10 PM »

Hello,

I used Turbomole 6.4 and could not reproduce the parallel rdgrad timings, but after switching back to version 6.3.1 I also saw that rdgrad became very slow in some cases - although the measured timings in those parts of rdgrad which do calculate something are short, the total wall time is about a factor of 8 longer than it should be. If $ricore and $maxcor have an influence, this looks like a severe memory access problem to me.

Just update to a newer Turbomole version (at least 6.4) and the bad timings will disappear...

Regards,

Uwe

TURBOMOLE Users Forum

Author Topic: When to set $ricore 0 and what about $ricore_slave? (Read 9942 times)

Hauke

When to set $ricore 0 and what about $ricore_slave?

antti_karttunen

Re: When to set $ricore 0 and what about $ricore_slave?

uwe

Re: When to set $ricore 0 and what about $ricore_slave?

Hauke

Re: When to set $ricore 0 and what about $ricore_slave?

uwe

Re: When to set $ricore 0 and what about $ricore_slave?