TURBOMOLE Users Forum

Title: Improve the efficiency and scaling behaviour
Post by: uwe on October 26, 2007, 02:35:50 PM
Hi,

To run the parallel version of Turbomole efficiently, a few keywords and settings should be used that have a great influence on the scaling behaviour and the wall time.

The settings depend very much on the method and on the kind of job.

General

First of all, the parallelization of Turbomole has been done assuming that users will apply it to large systems. All modules (except perhaps ricc2) will give very small or no speed-up if used for small systems. Even worse: running Turbomole in parallel for very small systems with only a few dozen basis functions, or running a small input massively parallel, will prevent the job from finishing. If one of the nodes does not get a single task to compute, the communication gets confused and the job will die.

There is no general rule of thumb, since Hartree-Fock, DFT, MP2 and CC2 ground- or excited-state calculations all have different demands. But if the serial run needs only one or a few minutes to complete, do not try to run it in parallel - or try it, but do not expect it to be faster than the serial version.

While the input has to be on a global disk, i.e. one that can be accessed from all nodes such as an NFS directory, the scratch files should be written to local disks. This is achieved by adding the paths of the local directories to the control file - it is not done automatically! See below for the keywords needed by each module.

The parallelization is based on data replication, so each client holds the complete set of arrays locally. The programs do not benefit from shared memory, and hence running jobs on an SMP node is usually less efficient than running the clients on different nodes.

dscf and grad

The default setting is to use disk space as scratch; the 'optimal' size of the scratch file (twoint) is determined automatically. The keywords $thize and $thime decide which integrals will be stored and therefore affect the file size.
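
For illustration only (the values and the path /local/scratch are placeholders, not recommendations), a semi-direct setup that keeps the twoint file on a local disk might look like this in the control file, with $thize as the size threshold and $thime as the time threshold for deciding which integrals are stored:

$thize 0.10000000E-04
$thime 5
$scfintunit
 unit=30       size=1000      file=/local/scratch/twoint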

It is highly recommended to set the following keywords:




ridft and rdgrad

ridft and rdgrad do not need much disk space; only the scratch files for DIIS and the molecular orbitals are written to disk during run time.

The last two points of the previous section on dscf and grad also hold for ridft and rdgrad; the keywords are identical.

NOTE: The DFT quadrature is distributed over the clients with a simple approach: the grid and the values of the functional on the grid are generated for each atom in a separate task. This avoids communication, BUT the DFT grid ordering introduced later in the serial program, which speeds up the serial DFT part by about 30%, cannot be used in the parallel version. The serial DFT code is therefore faster than the parallel one on a single CPU. Please keep this effect in mind when comparing timings of parallel runs with the serial one!

To improve efficiency:



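Which settings pay off depends on the system, but as a rough sketch, two knobs that are often relevant for ridft are the memory made available for the RI integrals ($ricore, in MB) and, for large systems, the multipole-accelerated RI-J approximation ($marij); the value below is only a placeholder:

$ricore 2000
$marij
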
ricc2

christof.haettig is the expert, so please also check the ricc2 part of this forum.

ricc2, as a coupled cluster program, writes quite a lot of different files to disk, and these files can become huge. Hence the most important thing is to set a local scratch directory for all of them:

$tmpdir /scratch/mydir

Note that mydir in the example above does not have to exist; it will be created (actually the directories /scratch/mydir-001/, /scratch/mydir-002/, ... will be created) and used as local scratch directories.


In addition to $tmpdir, one should always add

$sharedtmpdir

to make sure that if two clients run on the same node, they write their scratch files to different directories.
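
Taken together, the two entries in the control file then read (with /scratch/mydir replaced by a path on the local disks of your nodes):

$tmpdir /scratch/mydir
$sharedtmpdir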



Hope that all this helps a bit.

Regards,

Uwe