Author Topic: another problem with parallel NumForce 5.10  (Read 20975 times)

Pfister

  • Jr. Member
  • **
  • Posts: 11
  • Karma: +0/-0
another problem with parallel NumForce 5.10
« on: January 28, 2009, 01:31:42 PM »

Hi!

I am trying to get frequencies with NumForce for benzene, optimized with RICC2 (SVP basis).
First I used a non-parallel run which took about 2h computing time. Fine so far. Output files are as they should be.
Then I was trying to get paracyclophane the same way. Well, it took a long time and then it crashed after >200h due to the walltime limit.
Then I thought about doing a parallel run to speed it up. Well I tried it for benzene as a start.
The result: of the 72 single points to be calculated I have 4, and that took the computer 450h.
If I log on to the node to which our PBS Torque queueing system directs the calculation and type "top", I can watch the RICC2 step being calculated in parallel. BUT then the NumForce command is displayed, and that process takes 0.0% CPU and 0.1% memory on only one core (I watched for about 30 min to see whether it stays at that load).

The question is: did I miss something? I mean, every other module or script works fine, too. It is only NumForce which is not performing as I want it to.

Best regards

                            Johannes

Pfister

  • Jr. Member
  • **
  • Posts: 11
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #1 on: January 28, 2009, 05:29:19 PM »
I have not found an edit button, so I have to add it with another post.

I heard there should be an option -np or something like this for the NumForce script, which distributes the single-point calculations within the job over the separate cores. But I cannot find further information in the NumForce script itself or in the manual.

Anyone know something that might help?

uwe

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 487
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #2 on: January 28, 2009, 09:10:48 PM »
Hi,

NumForce -mfile hostfile [...]

first of all, NumForce does a lot of independent single-point jobs which can be run simultaneously. This is much more efficient than any parallel version of the binaries, since the speed up is almost perfect. Especially ricc2 in its current parallelization does not run well on SMP systems, because it benefits from independent local scratch disks and distributed memory over different nodes.

Just create a file, call it e.g. hostfile and enter the names of the nodes that should participate in the calculation. If you have a multi-core system, add the same name multiple times. Unset PARA_ARCH and start NumForce with the -mfile option.
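
A minimal sketch of these steps, assuming a single node called node28 with 4 job slots (both values are placeholders). The helper name write_hostfile is made up for illustration, and NumForce is only started if it is found on the PATH:

```shell
#!/bin/sh
# Sketch: build a hostfile with one line per job slot, then start NumForce.
# "node28" and the slot count 4 are placeholder values.

# write_hostfile FILE NODE SLOTS: write NODE on SLOTS separate lines of FILE
write_hostfile() {
    file=$1; node=$2; slots=$3
    : > "$file"                      # create or truncate the file
    i=0
    while [ "$i" -lt "$slots" ]; do
        echo "$node" >> "$file"
        i=$((i + 1))
    done
}

write_hostfile hostfile node28 4

# The parallel binaries should not be selected for the single points.
unset PARA_ARCH

# Only start the run if NumForce is actually available.
if command -v NumForce >/dev/null 2>&1; then
    NumForce -mfile hostfile > NumForce.out
else
    echo "NumForce not found; hostfile has $(wc -l < hostfile | tr -d ' ') entries"
fi
```

Repeating the node name once per slot is what lets several single-point jobs run on the same multi-core machine.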

Hope this helps,

Uwe


Pfister

  • Jr. Member
  • **
  • Posts: 11
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #3 on: January 29, 2009, 03:00:54 PM »
Yeah, that -mfile is exactly what I was looking for, but...

I tried several things, it is still not working.

"Unset PARA_ARCH" -> done
I created a hostfile (called hostfile) containing the following two lines:
node28
node28
I logged in to node28, created a new subfolder on scratch, copied any data relevant for a structure optimization, went through define and started jobex -c 200 -gcart 4 (groundstate, HF level, just as a test).
Everything fine so far. Then I gave the command: NumForce -mfile hostfile -central > NumForce.out, again only HF level.
Now I expected to see in "top" that 2 of the 4 cores were running the NumForce command, but again only 1 CPU showed this command, and the processor load stayed at 0.0% (as did the memory). dscf and grad ended normally, and I checked the NumForce.out file:

execute parallel run using NODEFILE: hostfile
There are 2 free nodes:  node28-1 node28-2
all nodes will be used for calculation
login-TEST on node28-1
login-TEST on node28-2

This tells me that something is correct now, but not everything. Could it be that there is a mismatch between the actual names of the CPUs on this node and those of our computer architecture? I mean, in our queue the CPUs are called node28/1, node28/2, ... But I don't believe this, since according to the NumForce.out file the script seems to have found the 2 free CPUs.
So where exactly should the single-point calculations be located? Will any subfolders be created automatically? If so, I assumed they would be created in the folder where I started the calculation, but they are not.

acapobia

  • Full Member
  • ***
  • Posts: 27
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #4 on: February 02, 2009, 09:27:29 PM »
Hi all,

I have no problem with the -mfile machinefile option, which I use when I need dispersion-corrected or excited-state DFT vibrational frequencies, but I have two questions because I am encountering problems when I run

NumForce -level cc2 -mfile machinefile

In fact some of the SP/gradient calculations go fine, others go wrong; attached are lines from the coordinatenumber.log output files taken from the /KraftWerk subdir of numforce/

Of course I switch to the serial version of TM first, then I delete
$parallel_platform MMP
$numprocs XX


and related keys from control, unset the $HOSTS_FILE variable, and finally run NumForce with the above-mentioned options.
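
The clean-up described above could be scripted roughly as follows. This is only a sketch: it assumes each of the two data groups occupies a single line in control, it does not touch any of the "related keys", and the function name strip_parallel_keys as well as the sample control file are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: remove the two parallel-run data groups from a control file.
# Assumption: each group sits on one line. A backup copy is kept.

strip_parallel_keys() {
    ctrl=$1
    cp "$ctrl" "$ctrl.bak"
    grep -v -e '^\$parallel_platform' -e '^\$numprocs' "$ctrl.bak" > "$ctrl"
}

# Demonstration on a throwaway control file (contents are illustrative only).
demo_dir=$(mktemp -d)
cat > "$demo_dir/control" <<'EOF'
$title
$parallel_platform MMP
$numprocs 4
$end
EOF

strip_parallel_keys "$demo_dir/control"
cat "$demo_dir/control"

# The hosts file variable of the parallel setup is cleared as well.
unset HOSTS_FILE
```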

Question #1

I wonder if errors could originate from

$tmpdir /scratch/myscratchdir/
and
$sharedtmpdir

left in control from previous parallel optimization.

My second question is about twoint and scratch files: do I have to put them in local scratch directories even in the NumForce -mfile step, or is it better to leave them in the current NFS working directory?

Thanks in advance.

Amedeo

Excerpt from xm10.log
did not find file:"/scratch/amedeo/temp/CCYBREVEAIA"
 Error while opening file: /scratch/amedeo/temp/CCYBREVEAIA
 ================================== I/O error =================================
 I/O module:    open_da
 last action:   open direct access file /scratch/amedeo/temp/CCYBREVEAIA
 unit:                     9
 I/O status:       167696096
 internal status: used
 internal filename:/scratch/amedeo/temp/CCYBREVEAIA
 file is not yet open
 intended record length:      5208192


Excerpt from xm13.log
========================
  internal module stack:
 ------------------------
    ricc2
    cc_rspdrv
    cc_rspden
    cc_rirelden
    cc_relden
 ========================

 cc_relden> unacceptable error in traces!
 ricc2 ended abnormally

christof.haettig

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 262
  • Karma: +0/-0
    • Hattig's Group at the RUB
Re: another problem with parallel NumForce 5.10
« Reply #5 on: February 05, 2009, 04:07:45 PM »
To Pfister:
  NumForce itself is only a script and will never use a significant amount of CPU time and memory.
  Only the energy and gradient codes called from it will use significant resources. Did you get
  any results for the force constants from this last calculation you mentioned?
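
  A small sketch for watching the modules rather than the script itself. It assumes a Linux-style ps where the fourth column of "ps -eo pid,pcpu,pmem,comm" is the command name; the helper name tm_procs is made up, and the list of modules is just the three mentioned in this thread:

```shell
#!/bin/sh
# tm_procs: read ps-style output on stdin and keep the header line plus
# every row whose command name is dscf, grad or ricc2.
tm_procs() {
    awk 'NR == 1 || $4 ~ /^(dscf|grad|ricc2)$/'
}

# Show CPU and memory share of the modules currently running on this node.
ps -eo pid,pcpu,pmem,comm 2>/dev/null | tm_procs
```

  If only the header line is printed while NumForce is running, none of the single-point jobs has actually been started on that node.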

To acapobia:
  If NumForce is run in parallel mode, the directories given in $tmpdir must be chosen
  such that no two ricc2 calculations started in parallel use the same directory.
  If two ricc2 processes simultaneously access the same scratch directory and write and
  read the same file, they will give undefined results or crash!
  The same holds for the scratch files specified for twoint. In addition, you usually
  don't save any time in semi-direct SCF if twoint is located on an NFS file system.
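
  One way to guarantee distinct scratch directories is to let mktemp create them. The following is a sketch under assumptions: SCRATCH_BASE is a placeholder for a node-local disk, the helper name new_scratch_dir is made up, and how the resulting path gets into each job's control file is left open here.

```shell
#!/bin/sh
# Sketch: give every concurrently running job its own scratch directory.
# SCRATCH_BASE is a placeholder; on a cluster it would be a node-local disk.
SCRATCH_BASE=${SCRATCH_BASE:-${TMPDIR:-/tmp}}

# new_scratch_dir LABEL: create a fresh, uniquely named directory, print its path
new_scratch_dir() {
    mktemp -d "$SCRATCH_BASE/ricc2_$1.XXXXXX"
}

dir_a=$(new_scratch_dir xm10)
dir_b=$(new_scratch_dir xm13)

# Each job's control file would then carry its own line of this form:
printf '$tmpdir %s\n' "$dir_a"
printf '$tmpdir %s\n' "$dir_b"
```

  Because mktemp never returns an existing name, two jobs cannot end up reading and writing the same scratch files.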

Pfister

  • Jr. Member
  • **
  • Posts: 11
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #6 on: February 09, 2009, 12:20:47 PM »
It is clear to me that NumForce is a script like jobex, which executes other programs. But I thought I should see those other programs running on the node, which I don't.
Well, a friend of mine looked through the NumForce script and changed some things regarding a problem due to bash/tcsh.
At the moment I can start it without -mfile, but I now have more than 20 processes where I expected only 4. Well, we're still working on it... If we get it done, I'll let you know.

And btw.: no, I didn't get any force constants, since my run started to calculate the first 4 single points but never finished any; therefore there are no force constants in NumForce.out.

basklau

  • Jr. Member
  • **
  • Posts: 18
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #7 on: February 11, 2009, 04:26:37 PM »
I actually run into the same problem whenever I try to run parallel NumForce. It seems to get stuck in the ' # wait until there are free hosts' while-loop. I really cannot see why this is the case.

Thanks for any helpful comments or solutions

basklau

uwe

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 487
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #8 on: February 25, 2009, 03:42:32 PM »
Hi,

some earlier versions of Turbomole (before 6.0) used a minus sign to append numbers to host names if more than one job is started on a node. So if the node names already contain a dash (like node-1, node-2, etc.), NumForce no longer recognizes the correct node name. This has since been changed to a colon (:).

You can try to patch NumForce yourself: search the NumForce script for 'cut' and 'machine'. Change the lines to

 loginname=`echo $machine | cut -f 1 -d ":"`

and

cpu=`echo $machine | cut -f 2 -d ":"`

And finally:

                machinename=$host":"$i
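
A short sketch of why the separator matters. It reuses the two cut calls from above inside a made-up helper, split_machine, and the host name node-1 with job slot 2 is just an example:

```shell
#!/bin/sh
# split_machine NAME SEP: print "login cpu" the way the two cut calls would
split_machine() {
    loginname=$(echo "$1" | cut -f 1 -d "$2")
    cpu=$(echo "$1" | cut -f 2 -d "$2")
    echo "$loginname $cpu"
}

# Old scheme: host "node-1", job slot 2, joined with a dash
split_machine "node-1-2" "-"     # prints "node 1"   (wrong host, wrong slot)

# New scheme: same host and slot, joined with a colon
split_machine "node-1:2" ":"     # prints "node-1 2"
```

With the dash, cut splits at the first dash of the host name itself, so the login name is truncated and the slot number is lost.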

Regards,

Uwe





basklau

  • Jr. Member
  • **
  • Posts: 18
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #9 on: February 26, 2009, 12:41:11 PM »
Unfortunately the same problem I mentioned also occurs with TM 6.0. Are there any problems when starting NumForce from a tcsh shell?

Regards basklau
« Last Edit: February 26, 2009, 01:22:52 PM by basklau »

uwe

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 487
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #10 on: February 26, 2009, 03:02:12 PM »
Hi,

that was a very good idea! Indeed, NumForce starts the job on the external machine with:

Code:
$remoteshell -v $loginname "cd $WORKDIR ;\
                          export PATH=\$PATH:. ;\
                          . $ENVFILE ; \
                          $SCRIPTS/runsingle $dir $machine $scrpath" \
                            2> $dir.$machine.err &

If on your target node your default shell is csh or tcsh, it will not be able to source the $ENVFILE, and export, etc. will not work. As far as I can see in the ssh manual, there is no option for ssh to switch the login shell type.

I guess one has to add a 'sh' somehow to this call, but I have tried it with:

Code:
sh -c "export var1=1 ; echo \$var1 "
but it seems that the sh is started for each command separately, since the echo does not work for the previously assigned var1 variable.

Is there a sh expert out there who can help here? I would like to know how to start a sh or a bash by ssh and let the shell do several commands.

Currently the only workaround I see is to use tcsh syntax for this $remoteshell call, and also change the $ENVFILE to tcsh syntax locally on your system.

Regards,

Uwe

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 216
  • Karma: +1/-0
Re: another problem with parallel NumForce 5.10
« Reply #11 on: February 27, 2009, 09:37:58 AM »
Hello,

We are using the tcsh shell with parallel NumForce, and the following modification has worked nicely for us for several years:
Code:
                    $remoteshell -v $loginname "/bin/sh -c '\
                          cd $WORKDIR ;\
                          export PATH=\$PATH:. ;\
                          . $ENVFILE ; \
                          $SCRIPTS/runsingle $dir $machine $scrpath '" \
                            2> $dir.$machine.err &
So the command that is sent to sh needs to be enclosed in single quotes ('). I hope this helps.
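
The effect of the extra quotes can be reproduced without a second machine. In the sketch below, fake_remote is a made-up stand-in for the $remoteshell call: like ssh, it joins its arguments with spaces and hands the resulting string to a shell. As an assumption for the demonstration, that outer shell is sh rather than tcsh, so the failing case prints an empty line instead of a tcsh error.

```shell
#!/bin/sh
# fake_remote stands in for "$remoteshell host ...": like ssh, it joins its
# arguments with spaces and hands the resulting string to a shell to parse.
fake_remote() {
    sh -c "$*"
}

unset var1

# Not quoted for the remote side: the inner sh receives only "var1=1";
# "echo $var1" is run by the outer (login) shell, where var1 is not set.
fake_remote sh -c "var1=1 ; echo \$var1"        # prints an empty line

# Whole command list wrapped in single quotes: the inner sh runs all of it.
fake_remote "sh -c 'var1=1 ; echo \$var1'"      # prints 1
```

So sh is not started once per command; without the single quotes it simply never sees anything after the first word, and the rest is left to the login shell.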

Antti

basklau

  • Jr. Member
  • **
  • Posts: 18
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #12 on: February 27, 2009, 10:25:51 AM »
Thank you! That works great with TM 5.10 as well as TM 6.0!

basklau

uwe

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 487
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #13 on: February 27, 2009, 10:48:47 AM »
Hi,

thanks, Antti!

I guess this patch will be added to the next official Turbomole version :-)

Regards,

Uwe

marand

  • Jr. Member
  • **
  • Posts: 18
  • Karma: +0/-0
Re: another problem with parallel NumForce 5.10
« Reply #14 on: February 10, 2011, 10:28:23 AM »
Hello All!

I have got the same problem with NumForce as acapobia. Strangely, everything goes fine when I ask Turbomole (6.1) to compute ground-state normal modes at the CC2/cc-pVDZ level of theory. The error occurs only when I use a larger basis set (aug-cc-pVTZ) and request the normal modes for the lowest excited state.

Then I receive messages identical to those reported by acapobia. I use NumForce with the -mfile option, and the serial version of Turbomole.

Some help on this would be great! If the problem has already been solved, pls point me to the right post.
Yours

Marcin