TURBOMOLE Users Forum

TURBOMOLE Modules => Aoforce and Numforce => Topic started by: Pfister on January 28, 2009, 01:31:42 PM

Title: another problem with parallel NumForce 5.10
Post by: Pfister on January 28, 2009, 01:31:42 PM

Hi!

I am trying to get frequencies with NumForce for benzene, optimized with RICC2 (SVP basis).
First I used a non-parallel run which took about 2h computing time. Fine so far. Output files are as they should be.
Then I tried to treat paracyclophane the same way. It took a long time and then crashed after >200 h due to the walltime limit.
Then I thought about doing a parallel run to speed it up. Well I tried it for benzene as a start.
The result: of the 72 single points to be calculated, only 4 are done, and that already took 450 h.
If I log on to the node to which the calculation is directed by our PBS/Torque queueing system and type "top", I can watch the RICC2 step being calculated in parallel. BUT then the NumForce command is displayed, and that process takes 0.0% CPU and 0.1% memory on only one core (I watched for about 30 min to see whether it stays at that processor load).

The question is: Did I miss something? Every other module or script works fine; it is only NumForce that is not performing as I want it to.

Best regards

                            Johannes
Title: Re: another problem with parallel NumForce 5.10
Post by: Pfister on January 28, 2009, 05:29:19 PM
I have not found an edit button, so I have to add it with another post.

I heard there should be an option like -np available for the NumForce script, which distributes the single-point calculations of the job over the separate cores. But I cannot find any further information on it in the NumForce script itself or in the manual.

Anyone know something that might help?
Title: Re: another problem with parallel NumForce 5.10
Post by: uwe on January 28, 2009, 09:10:48 PM
Hi,

Numforce -mfile hostfile [...]

first of all, NumForce runs a lot of independent single-point jobs which can be run simultaneously. This is much more efficient than any parallel version of the binaries, since the speed-up is almost perfect. ricc2 in particular does not run well on SMP systems with its current parallelization, because it benefits from independent local scratch disks and memory distributed over different nodes.

Just create a file, call it e.g. hostfile, and enter the names of the nodes that should participate in the calculation. If you have a multi-core system, add the same name multiple times. Unset PARA_ARCH and start NumForce with the -mfile option, as in the example below.
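
For example, for two dual-core nodes (the node names here are just placeholders):

Code: [Select]
# create the hostfile: one line per single-point job that may run concurrently
cat > hostfile <<EOF
node01
node01
node02
node02
EOF
unset PARA_ARCH
NumForce -mfile hostfile > numforce.out &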

Hope this helps,

Uwe

Title: Re: another problem with parallel NumForce 5.10
Post by: Pfister on January 29, 2009, 03:00:54 PM
Yeah, that -mfile is exactly what I was looking for, but...

I tried several things, but it is still not working.

"Unset PARA_ARCH" -> done
I created a hostfile (called hostfile) containing the following two lines:
node28
node28
I logged in to node28, created a new subfolder on scratch, copied all data relevant for a structure optimization, went through define, and started jobex -c 200 -gcart 4 (ground state, HF level, just as a test).
Everything was fine so far. Then I gave the command NumForce -mfile hostfile -central > NumForce.out, again at HF level only.
Now I expected to see via "top" that 2 of the 4 cores were running the NumForce command, but again only 1 CPU showed this command, and the processor load stayed at 0.0% (as did the memory). dscf and grad ended normally, and I checked the NumForce.out file:

execute parallel run using NODEFILE: hostfile
There are 2 free nodes:  node28-1 node28-2
all nodes will be used for calculation
login-TEST on node28-1
login-TEST on node28-2

This tells me that something is correct now, but not everything. Could it be that there is a mismatch between the actual names of the CPUs on this node and those used by our computer architecture? In our queue the CPUs are called node28/1, node28/2, ..., but I don't really believe this, since according to the NumForce.out file the script did find the 2 free CPUs.
So where exactly should the single-point calculations be located? Will there be any automatically created subfolders? If so, I assumed they would be created in the folder where I started the calculation, but they are not.
Title: Re: another problem with parallel NumForce 5.10
Post by: acapobia on February 02, 2009, 09:27:29 PM
Hi all,

I have no problem with the -mfile machinefile option, which I use when I need dispersion-corrected or excited-state DFT vibrational frequencies, but I have two questions because I am encountering problems when I run

NumForce -level cc2 -mfile machinefile

In fact some of the SP/gradient calculations go fine, others go wrong; attached are lines from the coordinatenumber.log output files taken from the KraftWerk subdirectory of numforce/.

Of course I switch to the serial version of TM first, then I delete
$parallel_platform MMP
$numprocs XX


and related keywords from control, unset the $HOSTS_FILE variable, and finally run NumForce with the above-mentioned options.
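
For reference, these preparation steps as shell commands (just a sketch of what I do; GNU sed assumed):

Code: [Select]
unset PARA_ARCH      # fall back to the serial binaries
unset HOSTS_FILE
# strip the parallel keywords from control
sed -i '/^\$parallel_platform/d;/^\$numprocs/d' control
NumForce -level cc2 -mfile machinefile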

Question #1

I wonder if errors could originate from

$tmpdir /scratch/myscratchdir/
and
$sharedtmpdir

left in control from a previous parallel optimization.

My second question is about twoint and the scratch files: do I have to put them in local scratch directories in the NumForce -mfile step as well, or is it better to leave them in the current NFS working directory?

Thanks in advance.

Amedeo

Excerpt from xm10.log
did not find file:"/scratch/amedeo/temp/CCYBREVEAIA"
 Error while opening file: /scratch/amedeo/temp/CCYBREVEAIA
 ================================== I/O error =================================
 I/O module:    open_da
 last action:   open direct access file /scratch/amedeo/temp/CCYBREVEAIA
 unit:                     9
 I/O status:       167696096
 internal status: used
 internal filename:/scratch/amedeo/temp/CCYBREVEAIA
 file is not yet open
 intended record length:      5208192


Excerpt from xm13.log
========================
  internal module stack:
 ------------------------
    ricc2
    cc_rspdrv
    cc_rspden
    cc_rirelden
    cc_relden
 ========================

 cc_relden> unacceptable error in traces!
 ricc2 ended abnormally
Title: Re: another problem with parallel NumForce 5.10
Post by: christof.haettig on February 05, 2009, 04:07:45 PM
To Pfister:
  NumForce itself is only a script and will never use a significant amount of CPU time and memory.
  Only the energy and gradient codes called from it will use significant resources. Did you get
  any results for the force constants from this last calculation you mentioned?

To acapobia:
  If NumForce is run in parallel mode, the directories given in $tmpdir must be chosen
  such that none of the ricc2 calculations started in parallel will use identical directories.
  If two ricc2 processes simultaneously access the same scratch directories and write and
  read the same files, they will give undefined results or crash!
  The same holds for the scratch files specified for twoint. In addition, you usually don't
  save any time in semi-direct SCF if twoint is located on an NFS file system.
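
If the shared scratch directories are not needed, one simple way to avoid the problem is to remove these keywords from control before starting NumForce (a sketch, assuming GNU sed):

Code: [Select]
# remove shared scratch redirections left over from a parallel run
sed -i '/^\$tmpdir/d;/^\$sharedtmpdir/d' control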
Title: Re: another problem with parallel NumForce 5.10
Post by: Pfister on February 09, 2009, 12:20:47 PM
It is clear to me that NumForce is a script like jobex, which executes other programs. But I thought I should see those other programs running on the node, which I don't.
Well, a friend of mine looked through the NumForce script and changed some things related to a bash/tcsh problem.
At the moment I can start it without -mfile, but now I get more than 20 processes where I expected only 4. Well, we're still working on it... If we get it done I will let you know.

And by the way: no, I didn't get any force constants. My attempt started to calculate the first 4 single points but never finished any of them, so there are no force constants in NumForce.out.
Title: Re: another problem with parallel NumForce 5.10
Post by: basklau on February 11, 2009, 04:26:37 PM
I actually run into the same problem whenever I try to run parallel NumForce. It seems to get stuck in the '# wait until there are free hosts' while loop. I really cannot see why this is the case.

Thanks for any helpful comments or solutions

basklau
Title: Re: another problem with parallel NumForce 5.10
Post by: uwe on February 25, 2009, 03:42:32 PM
Hi,

former versions of Turbomole (before 6.0) used a minus sign to append numbers to host names if more than one job is started on a node. So if the node names already contain a dash (like node-1, node-2, etc.), NumForce no longer recognizes the correct node name. This has meanwhile been changed to a colon (:).

You can try to patch NumForce yourself: search the NumForce script for 'cut' and 'machine'. Change the lines to

 loginname=`echo $machine | cut -f 1 -d ":"`

and

cpu=`echo $machine | cut -f 2 -d ":"`

And finally:

                machinename=$host":"$i
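
To see what the patched lines do, take a node name that itself contains a dash (example values are made up):

Code: [Select]
machine="node-1:2"                   # host node-1, second job slot
echo $machine | cut -f 1 -d ":"      # -> node-1  (the login name stays intact)
echo $machine | cut -f 2 -d ":"      # -> 2       (the job/cpu index)

With the old minus separator, cut -d "-" would have split the host name node-1 itself.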

Regards,

Uwe

Title: Re: another problem with parallel NumForce 5.10
Post by: basklau on February 26, 2009, 12:41:11 PM
Unfortunately, the same problem I mentioned also occurs with TM 6.0. Are there any problems when starting NumForce in a tcsh shell?

Regards basklau
Title: Re: another problem with parallel NumForce 5.10
Post by: uwe on February 26, 2009, 03:02:12 PM
Hi,

that was a very good idea! Indeed, NumForce starts the job on the external machine with:

Code: [Select]
$remoteshell -v $loginname "cd $WORKDIR ;\
                          export PATH=\$PATH:. ;\
                          . $ENVFILE ; \
                          $SCRIPTS/runsingle $dir $machine $scrpath" \
                            2> $dir.$machine.err &

If the default shell on your target node is csh or tcsh, it will not be able to source the $ENVFILE, and export etc. will not work. As far as I can see in the ssh manual, there is no ssh option to switch the login shell type.

I guess one has to add an 'sh' somehow to this call. I have tried it with:

Code: [Select]
sh -c "export var1=1 ; echo \$var1 "
but it seems that a separate sh is started for each command, since the echo does not print the previously assigned var1 variable.

Is there an sh expert out there who can help? I would like to know how to start an sh or bash via ssh and have that shell execute several commands.

Currently the only workaround I see is to use tcsh syntax for this $remoteshell call, and also change the $ENVFILE to tcsh syntax locally on your system.

Regards,

Uwe
Title: Re: another problem with parallel NumForce 5.10
Post by: antti_karttunen on February 27, 2009, 09:37:58 AM
Hello,

We are using the tcsh shell with parallel NumForce, and the following modification has worked nicely for us for several years:
Code: [Select]
                    $remoteshell -v $loginname "/bin/sh -c '\
                          cd $WORKDIR ;\
                          export PATH=\$PATH:. ;\
                          . $ENVFILE ; \
                          $SCRIPTS/runsingle $dir $machine $scrpath '" \
                            2> $dir.$machine.err &
So the command that is sent to sh needs to be enclosed in single quotes. I hope this helps.
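
You can test the quoting principle outside NumForce like this (the node name is just a placeholder):

Code: [Select]
# the inner '...' keeps both commands in one remote /bin/sh,
# so the exported variable is still set when echo runs
ssh node17 "/bin/sh -c 'export VAR=1 ; echo \$VAR'"    # prints 1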

Antti
Title: Re: another problem with parallel NumForce 5.10
Post by: basklau on February 27, 2009, 10:25:51 AM
Thank you! That works great with TM 5.10 as well as with TM 6.0!

basklau
Title: Re: another problem with parallel NumForce 5.10
Post by: uwe on February 27, 2009, 10:48:47 AM
Hi,

thanks, Antti!

I guess this patch will be added to the next official Turbomole version :-)

Regards,

Uwe
Title: Re: another problem with parallel NumForce 5.10
Post by: marand on February 10, 2011, 10:28:23 AM
Hello All!

I have the same problem with NumForce as acapobia. Strangely, everything goes fine when I ask Turbomole (6.1) to compute ground-state normal modes at the CC2/cc-pVDZ level of theory. The error occurs only when I use a larger basis set (aug-cc-pVTZ) and request the normal modes of the lowest excited state.

I then receive messages identical to those reported by acapobia. I use NumForce with the -mfile option and the serial version of Turbomole.

Some help on this would be great! If the problem has already been solved, please point me to the right post.
Yours

Marcin
Title: Re: another problem with parallel NumForce 5.10
Post by: christof.haettig on August 12, 2011, 10:19:02 AM
See my reply to the post by acapobia:

You have to make sure that every ricc2 process has its own directories for its files. There are two ways you can ensure this in a
parallel NumForce run:
   1) Use only CPUs or CPU cores of a single SMP machine (node): in this case you should, before NumForce is started, place
        all input files in a directory on the local file system.
   2) Distribute the calculation over many cluster nodes: in this case you have to start NumForce in a file system which is
        accessible from all nodes. To ensure that the individual single-point calculations started by NumForce run in the local
        file systems of the cluster nodes, use the -scrpath option of NumForce (see the sketch below)! ricc2, as well as all
        other programs which do some I/O (escf, egrad, rimp2, dscf in semi-direct mode), will only run efficiently if started
        in directories on a local file system.

It is important (in both cases!) to make sure that the control file does not contain any $tmpdir, $TMPDIR, $SHAREDTMPDIR, or $sharedtmpdir entries, since they would redirect the scratch files of several processes to the same (!) directories/files, and the calculations would crash or go wrong!
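
For case 2), a minimal sketch of such a run (the scratch path is a placeholder for a directory that exists locally on every node):

Code: [Select]
# check that control is free of scratch redirections
grep -iE '^\$(tmpdir|sharedtmpdir)' control && echo "remove these lines first!"
# start from an NFS directory visible on all nodes;
# -scrpath points to the node-local scratch area
NumForce -level cc2 -mfile hostfile -scrpath /scratch/$USER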

Christof 
Title: Re: another problem with parallel NumForce 5.10
Post by: martijn on August 25, 2011, 11:48:21 PM
Hi,

I seem to have a related problem. When I try to run a parallel NumForce calculation (TM 6.3.1) on our new machine using nohup "NumForce -ex 1 -mfile hostsfile > Freq.out &", with the hostsfile consisting of eight lines all saying "node17" (I'm trying to use all 8 cores of the node), I get the following output:

############################################################
#                                                           #
#                      N u m F o r c e                      #
#                Numerical Second Derivatives               #
#                                                           #
#############################################################

 running on node17
 date: Thu Aug 25 22:04:32 BST 2011

NumForce has been started with the -mfile option, hence it can be
    run in parallel. This is most efficient if the serial binaries
    are used and started independently at a time.

    -> Starting several serial single-point jobs for optimal speed up...
execute parallel run using NODEFILE: hostsfile
[: 290: node17:1: unexpected operator (repeated another 27 times)

There are 8 free nodes:  node17:1 node17:1 node17:1 node17:1 node17:1 node17:1 node17:1 node17:1
all nodes will be used for calculation
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
#############################################################

So somehow NumForce generates the error message multiple times and doesn't give the cores individual names. If I leave the script running, NumForce keeps starting batches of 8 jobs until, according to NumForce, all 200+ single points are running at the same time.

I'm confused about what I'm doing wrong.

Thanks,

Martijn
Title: Re: another problem with parallel NumForce 5.10
Post by: uwe on August 29, 2011, 12:56:05 PM
Hi,

Quote
execute parallel run using NODEFILE: hostsfile
[: 290: node17:1: unexpected operator (repeated another 27 times)

if you use the name of the node without a colon, it will work. So your hostsfile should look like this:

node17
node17
node17

etc.

It seems that currently the names are given as node17:1.

Regards,

Uwe
Title: Re: another problem with parallel NumForce 5.10
Post by: martijn on August 30, 2011, 04:51:04 PM
Hmmm, this is where it gets properly interesting. My machine file is exactly what you propose, i.e. just

node17
node17
node17
node17

There are no colons in the hostfile. It thus seems that the colons get attached somehow by the script itself.

M
Title: Re: another problem with parallel NumForce 5.10
Post by: uwe on August 30, 2011, 06:35:30 PM
Hi,

you are right, sorry for the stupid post!

NumForce indeed does find 8 nodes, but they are not named correctly.

I assume that there is a problem with the NumForce script itself on your Linux system. The same settings usually do work.

Please contact the Turbomole support! You will get help there. Finding out where the problem lies without being able to reproduce it would take a lot of posts here...

Regards,

Uwe
Title: Re: another problem with parallel NumForce 5.10
Post by: martijn on August 31, 2011, 02:37:47 PM
Solved it (with thanks to Uwe). The nodes on our cluster run the latest release of Debian, and in this release the shell interpreter for scripts starting with #! /bin/sh is dash (the Debian Almquist shell), not bash as TM expects. Both shells are very similar but have small differences (see e.g. this Ubuntu page: https://wiki.ubuntu.com/DashAsBinSh) which appear to cause the problem.

Changing "#! /bin/sh" to "#! /bin/bash" in the NumForce header means (as long as bash is installed) that the script is interpreted correctly by bash, and the problem disappears. From what I can gather, this problem might appear on Ubuntu, Debian, and perhaps other Linux distributions.
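
To check whether a system is affected and to apply the fix (a sketch; the script's location depends on your installation, $TURBODIR/scripts is an assumption):

Code: [Select]
ls -l /bin/sh                        # on recent Debian/Ubuntu this often links to dash
head -1 $TURBODIR/scripts/NumForce   # show the current interpreter line
# switch the interpreter line to bash
sed -i '1s|^#! */bin/sh$|#!/bin/bash|' $TURBODIR/scripts/NumForce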