Author Topic: Improve the efficiency and scaling behaviour

uwe

  • Global Moderator
Improve the efficiency and scaling behaviour
« on: October 26, 2007, 02:35:50 PM »
Hi,

to run the parallel version of Turbomole efficiently, some keywords and settings should be used that have a great influence on the scaling behaviour and the wall time.

The optimal settings depend very much on the method and the kind of job.

General

First of all, the parallelization of Turbomole was designed with large systems in mind. All modules (with the possible exception of ricc2) will give little or no speed-up when used for small systems. Even worse: running Turbomole in parallel for a very small system with only a few dozen basis functions, or running it massively parallel for a small input, can prevent the job from finishing at all. If one of the nodes does not get a single task to compute, the communication gets confused and the job will die.

There is no general rule of thumb, since Hartree-Fock, DFT, MP2, and CC2 ground- or excited-state calculations all have different demands. But if the serial run needs just one or a few minutes to complete, do not try to run it in parallel - or try it, but do not expect it to be faster than the serial version.

While the input has to be on a global disk, i.e. one that can be accessed from all nodes (like an NFS directory), the scratch files should be written to local disks. This is achieved by adding the paths to the local directories to the control file - it is not done automatically! See below for the keywords needed by each module.

The parallelization is based on data replication, so each client holds the complete set of arrays locally. The programs do not benefit from shared memory, and hence running jobs on an SMP node is usually less efficient than running the clients on different nodes.
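
As a small illustration of the global-input/local-scratch split, a shell sketch of what a job script could do on each node before the run starts. The paths and the fallback user name are examples only, not Turbomole defaults; adapt them to your cluster:

```shell
#!/bin/sh
# Example only: prepare a node-local scratch directory and print the
# matching twoint entry for the control file. Run once per node.
LOCAL_SCRATCH="/tmp/turbo_scratch_${USER:-turbouser}"
mkdir -p "$LOCAL_SCRATCH"
# This line is what would go below $scfintunit in the control file:
echo "unit=30       size=0        file=$LOCAL_SCRATCH/twoint.${USER:-turbouser}"
```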

dscf and grad

The default setting is to use disk space as scratch; the 'optimal' size of the scratch file (twoint) is determined automatically. The keywords $thize and $thime decide which integrals are stored and therefore affect the file size.

It is highly recommended to set the following keywords:

  • $scfintunit
     unit=30       size=0        file=/path/to/local/disk/twoint.username


    The filename itself can be chosen arbitrarily, but one has to make sure that the path to the local scratch directory is valid on all nodes.

    It does not make sense to change the file size itself (the size given is the sum of all twoint files on all nodes), since the parallel statistics run that determines the task distribution - done automatically by dscf - will overwrite it again. In this context, dscf is a script in $TURBODIR/mpirun_scripts; the dscf entries in the parallel binary directories are just links to that script.

  • If you want to set the file size by hand, you will have to run the parallel statistics by hand:

    • stati pdscf
    • $TURBODIR/bin/`sysname -s`/dscf

    after that, you can change the size of the twoint file to any reasonable size you like.

  • Scratch files of DIIS, etc., can also be written to a local disk by using the keyword
    $scratch files
      dscf dens /localscratch/dens_user
      dscf fock /localscratch/fock_user
      dscf ddens /localscratch/ddens_user
      dscf errvec /localscratch/errvec_user
      dscf oldfock /localscratch/oldfock_user
      grad ddens /localscratch/ddens


  • Make sure that the keyword $scfdump is not present in the control file. That keyword causes dscf to write the molecular orbitals to the NFS disk in every SCF iteration, which should be avoided if the file is large and/or the disk is slow. The orbitals will of course still be written to disk at the end of the run; note, however, that without the intermediate dumps you will not be able to restart your job if the calculation or the machines crash.



ridft and rdgrad

ridft and rdgrad do not need much disk space, only the scratch files for DIIS and the molecular orbitals are written to disk during run time.

The last two points of the previous section about dscf and grad also apply to ridft and rdgrad; the keywords are identical.

NOTE: The DFT quadrature is distributed over the clients with a simple approach: the grid and the properties of the functional on the grid are generated for each atom in a separate task. This avoids communication, BUT the DFT grid ordering introduced later in the serial program, which speeds up the serial DFT part by about 30%, cannot be used in the parallel version. The serial DFT code is therefore faster than the parallel one on a single CPU. Please keep this effect in mind when comparing timings of parallel runs to the serial one!

To improve efficiency:

  • Always switch on $marij !!

    Unless you are using very diffuse or highly augmented basis functions (and provided your input structure is reasonable), the multipole approximation for RI-J introduces numerical errors that are below the convergence criteria, and far below the RI error. MARI-J also does not slow down calculations on small molecules, and for larger systems (which is what one should use the parallel code for) it is a factor of 3 to 6 (or more) faster than plain RI-DFT.

  • $ricore N

    sets the memory N (in MB) that is used for RI-J integral storage only. In the parallel version, the actual memory requirements are:

    • slave1: N minus the memory needed to keep the (P|Q) matrix; see the output for the values needed during runtime.
    • slave2-slaveX: N plus the memory needed for (P|Q). The memory of the slaves can also be set explicitly by using $ricore_slaves in addition to $ricore.

    There is a range of system sizes (numbers of basis functions) for which using more $ricore speeds up the calculation significantly (superlinear speed-up for RI-J). Distributing the clients over different nodes, with each one using $ricore or $ricore_slaves of memory independently, can have a big effect.

  • Scratch files of DIIS, etc., can also be written to a local disk by using the keyword
    $scratch files
      dscf dens /localscratch/dens_user
      dscf fock /localscratch/fock_user
      dscf ddens /localscratch/ddens_user
      dscf errvec /localscratch/errvec_user
      dscf oldfock /localscratch/oldfock_user
      grad ddens /localscratch/ddens


    In Turbomole 5.10 this is done automatically if $TURBOTMPDIR is set to a local scratch directory. Otherwise, ridft checks whether a /work, /scr, or /tmp directory exists on the system; if so, it creates a subdirectory named after the user there and uses it as scratch for the parallel calculation.

  • Make sure that the keyword $scfdump is not present in the control file. That keyword causes dscf to write the molecular orbitals to the NFS disk in every SCF iteration, which should be avoided if the file is large and/or the disk is slow. The orbitals will of course still be written to disk at the end of the run; note, however, that without the intermediate dumps you will not be able to restart your job if the calculation or the machines crash.
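
Taken together, the control-file entries recommended above for ridft might look like this (the 500 MB value is only an illustration; choose it to fit the memory of your nodes):

```
$marij
$ricore        500
$ricore_slaves 500
```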

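For the $TURBOTMPDIR mechanism mentioned above, a shell sketch of what the job script would do before starting the parallel run (the path is an example only):

```shell
#!/bin/sh
# Example only: point TURBOTMPDIR at a node-local scratch directory
# before launching the parallel job; Turbomole 5.10 picks it up.
TURBOTMPDIR="/tmp/turbotmp_${USER:-turbouser}"
export TURBOTMPDIR
mkdir -p "$TURBOTMPDIR"
```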

ricc2

christof.haettig is the expert, so please also check the ricc2 part of this forum.

ricc2, as a coupled cluster program, writes quite a lot of different files to disk, and these files can get huge. Hence, the most important setting is a local scratch directory for all of them:

$tmpdir /scratch/mydir

Note that mydir in the example above does not have to exist; the directories /scratch/mydir-001/, /scratch/mydir-002/, etc. will be created and used as local scratch directories.


In addition to $tmpdir, one should always add

$sharedtmpdir

to make sure that two clients running on the same node write their scratch files to different directories.
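
A typical ricc2 setup therefore combines both keywords in the control file (the path is the example from above):

```
$tmpdir /scratch/mydir
$sharedtmpdir
```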



Hope that all this helps a bit.

Regards,

Uwe

« Last Edit: February 18, 2008, 04:37:50 PM by uwe »