I know that when doing DFT calculations with hybrid functionals, the SCF step is the most time-consuming one, so the benefit of RI-J should be smaller than the huge speed-up one gets when combining RI-J with non-hybrid functionals.
To quantify the benefit and to decide whether it is better to use hybrid functionals with or without RI-J (of course MA-RI-J was used then as well), I ran some test calculations. I used TURBOMOLE 6.3.1 to optimize a molecule with 847 basis functions in total, C1 symmetry, at the B3-LYP+disp/def2-TZVP level of theory on 8 CPU cores, each with 2.5 GB of memory available (2.66 GB to be precise, but only 2.5 GB were requested from the queuing system). For parallelization I set PARA_ARCH=SMP.
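For reference, this is roughly how each run was set up (a minimal sketch, not my actual job script; the TURBODIR path and the output file names are placeholders):

```
# Environment for the SMP-parallel runs (TURBODIR path is a placeholder):
export TURBODIR=/path/to/TURBOMOLE-6.3.1
export PARA_ARCH=SMP                       # shared-memory parallelization
export PARNODES=8                          # 8 CPU cores
export PATH=$TURBODIR/scripts:$PATH
export PATH=$TURBODIR/bin/`sysname`:$PATH  # sysname picks the _smp binaries

jobex     > jobex.out      # conventional optimization (dscf + grad)
jobex -ri > jobex_ri.out   # RI-J optimization (ridft + rdgrad)
```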
The results can be found in the following table (all times are given in minutes):
| module | $ricore / MB | $ricore_slave / MB | cycles | CPU time total | wall time total | CPU time per cycle | wall time per cycle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| grad | n/a | n/a | 29 | 96.0 | 96.1 | 3.3 | 3.3 |
| dscf | n/a | n/a | 30 | 3110.6 | 486.6 | 103.7 | 16.2 |
| jobex | n/a | n/a | 29 | | 584.3 | | 20.1 |
| rdgrad | 1875 | n/a | 24 | 554.5 | 554.7 | 23.1 | 23.1 |
| ridft | 1875 | n/a | 25 | 313.2 | 313.7 | 12.5 | 12.5 |
| jobex -ri | 1875 | n/a | 24 | | 872.0 | | 36.3 |
| rdgrad | 1875 | 1875 | 24 | 557.3 | 557.5 | 23.2 | 23.2 |
| ridft | 1875 | 1875 | 25 | 315.8 | 316.8 | 12.6 | 12.7 |
| jobex -ri | 1875 | 1875 | 24 | | 879.5 | | 36.6 |
| rdgrad | 200 | n/a | 24 | 554.3 | 554.4 | 23.1 | 23.1 |
| ridft | 200 | n/a | 25 | 315.6 | 316.0 | 12.6 | 12.6 |
| jobex -ri | 200 | n/a | 24 | | 873.7 | | 36.4 |
| rdgrad | 0 | n/a | 25 | 53.5 | 53.6 | 2.1 | 2.1 |
| ridft | 0 | n/a | 26 | 324.6 | 325.1 | 12.5 | 12.5 |
| jobex -ri | 0 | n/a | 25 | | 381.9 | | 15.3 |
| rdgrad | 0 | 1 | 25 | 50.9 | 50.9 | 2.0 | 2.0 |
| ridft | 0 | 1 | 26 | 327.2 | 328.7 | 12.6 | 12.6 |
| jobex -ri | 0 | 1 | 25 | | 383.2 | | 15.3 |
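The per-cycle columns are simply the totals divided by the number of cycles; the totals were taken from the timing summaries of the individual module outputs. A hedged one-liner for extracting them (ridft.out is a placeholder for whichever file the output was redirected to):

```
# Sketch: pull the timing summary out of a module's output file,
# assuming the usual "total cpu-time" / "total wall-time" lines
# that dscf/grad/ridft/rdgrad print at the end of a run:
grep -E "total +(cpu|wall)-time" ridft.out
```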
I think the high ratio between CPU time and wall time in the case of dscf is due to dscf running as one process using all n cores, whereas ridft runs as n separate processes, so that is nothing to worry about. In all other cases the CPU and wall times are very close to each other.
I don't know why dscf/grad and also the examples with $ricore 0 needed more cycles to fulfill the convergence criteria, but I'll focus on the timings per cycle anyway.
ridft is some percent faster than dscf (in my case over 20%, which is more than I expected). But when assigning the recommended 2/3 to 3/4 of the memory to $ricore (1875 MB), this advantage is overcompensated by the low speed of rdgrad compared with grad, making the whole jobex -ri optimization in these cases significantly (almost a factor of 2) slower than plain jobex. The timings hardly change when $ricore_slave 1875 is specified as well (though I'm not sure my settings make sense anyway, see below), and they also do not change when the define default of 200 MB for $ricore is used.

Only when I set $ricore to 0 does rdgrad speed up as well: more than 10 times (!) compared with rdgrad at non-zero $ricore, and significantly compared with grad. Then jobex -ri in total needs only 3/4 of the time of jobex.
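For completeness, this is how I switched the setting between runs (a sketch using sed, since $ricore sits on a single line in the control file; the output file name is a placeholder):

```
cp control control.bak                      # keep a backup of the control file
sed -i 's/^\$ricore .*/$ricore 0/' control  # set $ricore to 0 (MB)
jobex -ri > jobex_ri_ricore0.out            # re-run the RI-J optimization
```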
I understand that for molecules of a certain size, increasing $ricore has little effect (if the molecule is so small that all RI matrices and RI integrals can be stored anyway, or if it is so big that only a small fraction of the RI data fits into memory). But I don't understand why setting $ricore to 0 has such an enormous effect. In the 200 MB case only 8% of the available memory is used for $ricore, so there should be enough memory left for everything else; according to the output, rdgrad allocated only 4 MiB of local memory in all cases. Or are the TURBOMOLE routines so fast that recalculating an integral is quicker than fetching it from memory?

Can somebody give an explanation and/or a rule of thumb for when setting $ricore to 0 is recommended?
While doing these tests I of course came across $ricore_slave, but I don't understand exactly what it does. In the manual section about "Keywords for Parallel Runs" it is written:
> In the parallel version of ridft, the first client reads in the keyword $ricore from the control file and uses the given memory for the additional RI matrices and for RI-integral storage. All other clients use the same amount of memory as the first client does, although they do not need to store any of those matrices. This leads to a better usage of the available memory per node. But in the case of a big number of auxiliary basis functions, the RI matrices may become bigger than the specified $ricore and all clients will use as much memory as those matrices would allocate even if that amount is much larger than the given memory. To omit this behaviour one can use:
>
> $ricore_slave integer
>
> specifying the number of MBs that shall be used on each client.
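For reference, in the third set of runs the two data groups in my control file looked like this (exact whitespace from my file, values as in the table above):

```
$ricore       1875
$ricore_slave 1875
```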
In my understanding this differs from the description in the thread about how to "Improve the efficiency and scaling behaviour":

> $ricore N is the memory N in MB used for RI-J integral storage only. In the parallel version, the actual memory requirements are:
>
> - slave1: N minus the memory needed to keep the (P|Q) matrix, see output for the values that are needed during runtime.
> - slave2-slaveX: N plus the memory needed for (P|Q). The memory can also be set explicitly by using $ricore_slaves in addition to $ricore.
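If I take that literally, with $ricore 200 and the 7 MByte (P|Q) matrix reported in my output (see below), the rule would predict roughly 193 MB on the first client and about 207 MB on each of the others.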
By the way, according to the manual the keyword is $ricore_slave, whereas the linked post uses $ricore_slaves. Since I didn't see any difference even when $ricore_slave(s) (I tried both spellings) was not specified at all, I can't tell which one is correct, but it should be unified.
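One crude way to check which spelling the program actually knows (just an idea, assuming unstripped binaries and that the keyword string is stored literally in the executable):

```
# Search the ridft binaries for the keyword string:
strings $TURBODIR/bin/*/ridft* | grep -i ricore_slave
```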
In my test case the "Memory core needed for (P|Q) and Cholesky" is always "7 MByte on GA", and the "Memory allocated for RI-J" is only 1-2 MB plus integrals (more than 3800 MByte with $ricore 1875, exactly 1600 MByte with $ricore 200, and 0 MByte with $ricore 0).
How can I bring these numbers into agreement with the written information? Since 3800 MByte per client is much more than I have available on the cluster, according to the manual I was expecting a crash due to every client using this amount of memory.
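Just to put numbers on it: 8 clients at 3800 MByte each would be roughly 30 GB, while the node only provides 8 × 2.5 GB = 20 GB.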
Can somebody explain to me how $ricore_slave is supposed to work and give cases where it actually has an effect (does it only work with the MPI-parallel version)? What are the recommended settings for $ricore_slave, and does it make sense to set it to a value different from $ricore?
Sorry that this post became a bit long, but I hope you see what I mean and can help me speed up my TURBOMOLE calculations even more.