Topic: another problem with parallel NumForce 5.10

christof.haettig

Re: another problem with parallel NumForce 5.10
« Reply #15 on: August 12, 2011, 10:19:02 AM »
See my reply to the post by acaphopia:

You have to make sure that every ricc2 process has its own directory for its files. There are two ways to ensure this in a
parallel NumForce run:
   1) use only CPUs or CPU cores of a single SMP machine (or node): in this case you should (before NumForce is started) place
        all input files in a directory on a local file system.
   2) distribute the calculation over many cluster nodes: in this case you have to start NumForce in a file system that is accessible
        from all nodes. To ensure that the individual single-point calculations started by NumForce run in the local file systems of the
        cluster nodes, use the -scrpath option of NumForce! ricc2, as well as all other programs that do some I/O (escf, egrad, rimp2,
        dscf in semi-direct mode), will only run efficiently if started in directories on a local file system.
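
As a rough sketch of the two variants, the small wrapper below builds the NumForce command line with or without -scrpath. The wrapper name run_numforce, the DRY_RUN switch and the scratch path in the usage line are made up for illustration; -ex 1, -mfile and the Freq.out redirect are simply taken from the command quoted further down in this thread, so adapt them to your own job.

```shell
# Sketch only: run_numforce and DRY_RUN are hypothetical, not part of Turbomole.
# $1 = hostsfile, $2 = optional scratch directory on a node-local file system.
run_numforce() {
    hostsfile=$1
    scrpath=${2:-}
    # Build the command as positional parameters so arguments stay intact.
    set -- NumForce -ex 1 -mfile "$hostsfile"
    if [ -n "$scrpath" ]; then
        # Variant 2: single points run in a local scratch directory.
        set -- "$@" -scrpath "$scrpath"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"                       # only show what would be started
    else
        nohup "$@" > Freq.out 2>&1 &    # start in the background
    fi
}
```

For example, `DRY_RUN=1 run_numforce hostsfile /scratch/myjob` prints the command for variant 2 without starting anything (the /scratch/myjob path is an assumption; use whatever local scratch your nodes provide).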

It is important (in both cases!) that you make sure that the control file does not contain any $tmpdir, $TMPDIR, $SHAREDTMPDIR or $sharedtmpdir entries, since they would redirect the scratch files of several processes to the same (!) directory/files and the calculations would crash or go wrong!
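
A quick way to check for such entries before starting NumForce might look like the following. This is a minimal sketch: the helper name has_shared_tmpdir is made up, and it assumes the keywords start a line of the control file (possibly after whitespace).

```shell
# Sketch only: has_shared_tmpdir is a hypothetical helper, not a Turbomole tool.
# Prints matching lines with line numbers; exit status 0 means "entries found".
has_shared_tmpdir() {
    control_file=${1:-control}
    # -i covers $tmpdir/$TMPDIR and $sharedtmpdir/$SHAREDTMPDIR alike;
    # \$ matches the literal dollar sign that starts the keyword.
    grep -n -i -E '^[[:space:]]*\$(shared)?tmpdir' "$control_file"
}

# Usage, run in the directory that holds the control file:
#   if has_shared_tmpdir control; then
#       echo "remove the entries listed above before running NumForce" >&2
#   fi
```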

Christof 

martijn

Re: another problem with parallel NumForce 5.10
« Reply #16 on: August 25, 2011, 11:48:21 PM »
Hi,

I seem to have a related problem. When I try to run a parallel NumForce calc (TM 6.3.1) on our new machine using nohup  "NumForce -ex 1 -mfile hostsfile > Freq.out &", with the hostsfile consisting of eight lines all saying "node17" (I'm trying to use all 8 cores of the node), I get the following output:

############################################################
#                                                           #
#                      N u m F o r c e                      #
#                Numerical Second Derivatives               #
#                                                           #
#############################################################

 running on node17
 date: Thu Aug 25 22:04:32 BST 2011

NumForce has been started with the -mfile option, hence it can be
    run in parallel. This is most efficient if the serial binaries
    are used and started independently at a time.

    -> Starting several serial single-point jobs for optimal speed up...
execute parallel run using NODEFILE: hostsfile
[: 290: node17:1: unexpected operator (repeated another 27 times)

There are 8 free nodes:  node17:1 node17:1 node17:1 node17:1 node17:1 node17:1 node17:1 node17:1
all nodes will be used for calculation
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
login test on node17:1
#############################################################

So somehow NumForce generates an error message multiple times and doesn't give the cores individual names. If I leave the script running, NumForce keeps starting batches of 8 jobs until, according to NumForce, all 200+ single points are running at the same time.

I'm confused about what I'm doing wrong.

Thanks,

Martijn

uwe

Re: another problem with parallel NumForce 5.10
« Reply #17 on: August 29, 2011, 12:56:05 PM »
Hi,

Quote
execute parallel run using NODEFILE: hostsfile
[: 290: node17:1: unexpected operator (repeated another 27 times)

If you use the name of the node without a colon, it will work. So your hostsfile should look like this:

node17
node17
node17

etc.

Seems that currently the names are given as node17:1
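
For reference, a hostsfile of this shape can be written and checked for stray colons with a few lines of shell. This is only a sketch: make_hostsfile is a made-up helper name, and node17 and the core count are just the values used in this thread.

```shell
# Sketch only: make_hostsfile is a hypothetical helper.
# $1 = host name, $2 = number of cores, $3 = output file (default: hostsfile).
make_hostsfile() {
    host=$1
    ncores=$2
    out=${3:-hostsfile}
    : > "$out"                          # create or truncate the file
    i=0
    while [ "$i" -lt "$ncores" ]; do
        printf '%s\n' "$host" >> "$out" # one line per core, host name only
        i=$((i + 1))
    done
}

# Usage:
#   make_hostsfile node17 8
#   grep -n ':' hostsfile || echo "no colons in hostsfile"
```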

Regards,

Uwe

martijn

Re: another problem with parallel NumForce 5.10
« Reply #18 on: August 30, 2011, 04:51:04 PM »
Hmmm, this is where it gets properly interesting. My machine file is exactly what you propose, i.e. just

node17
node17
node17
node17

There are no colons in the hostsfile. It thus seems that the colons get attached somehow by the script.

M

uwe

Re: another problem with parallel NumForce 5.10
« Reply #19 on: August 30, 2011, 06:35:30 PM »
Hi,

you are right, sorry for the stupid post!

NumForce indeed does find 8 nodes, but they are not named correctly.

I assume that there is a problem with the NumForce script itself on your Linux system. The same settings usually do work.

Please contact Turbomole support! You will get help. Finding out where the problem is without being able to reproduce it would take a lot of posts here...

Regards,

Uwe

martijn

Re: another problem with parallel NumForce 5.10
« Reply #20 on: August 31, 2011, 02:37:47 PM »
Solved it (with thanks to Uwe). The nodes on our cluster run the latest release of Debian, and in this release the shell interpreter for scripts starting with #! /bin/sh is dash (the Debian Almquist Shell) and not bash, as TM expects. Both shells are very similar but have small differences (see e.g. this Ubuntu page: https://wiki.ubuntu.com/DashAsBinSh), which appear to cause the problem.

Changing "#! /bin/sh" to "#! /bin/bash" in the NumForce header means that (as long as the bash shell is installed) the script is correctly interpreted by bash and the problem disappears. From what I can gather, this problem might appear on Ubuntu, Debian and perhaps other Linux distributions.
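
For anyone who wants to script the change, something along these lines should do it. Treat it as a sketch: the helper names are made up, readlink -f assumes GNU coreutils, and the path to NumForce in the usage line is a placeholder for wherever the script lives in your installation. The Ubuntu page above lists `==` inside `[ ... ]` as one construct that dash rejects with an "unexpected operator" message; whether that is what sits at line 290 of NumForce has not been confirmed in this thread.

```shell
# Sketch only: both helper names are hypothetical.

# Show what /bin/sh really is; on Debian and Ubuntu it usually resolves to dash.
show_sh_target() {
    readlink -f /bin/sh
}

# Rewrite the first line of a script from /bin/sh to /bin/bash.
# A copy of the original is kept as "<script>.orig".
use_bash_shebang() {
    script=$1
    cp -p "$script" "$script.orig" || return 1
    # Only line 1 is edited. The first expression handles a bare shebang,
    # the second one a shebang followed by options (e.g. "#!/bin/sh -e").
    sed -e '1s|^\(#![[:space:]]*/bin/\)sh$|\1bash|' \
        -e '1s|^\(#![[:space:]]*/bin/\)sh\([[:space:]]\)|\1bash\2|' \
        "$script.orig" > "$script"
}

# Usage:
#   show_sh_target
#   use_bash_shebang /path/to/NumForce
```

Writing the result back into the existing file keeps its permissions, so the script stays executable.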