TURBOMOLE Users Forum
TURBOMOLE Modules => Jobex: Structure Optimization and Molecular Dynamics => Topic started by: evgeniy on April 29, 2014, 11:33:55 AM
-
Hello,
I encountered strange behavior of jobex with the latest release of TM (6.5) and slurm.
When running a geometry optimization at the CC2 level, the calculation crashes at the
zeroth step, i.e. in job.0, at the very beginning of the CC2 step (the SCF finishes without
any problem), with a very strange-looking error:
ricc2 ended abnormally
general input file <control> is empty
The control file is of course there and not empty. Furthermore, when I restart the calculation
from the SCF step by removing the nextstep file, without changing anything
else, the whole jobex calculation runs fine. Of course I can do this restart after the crash
of job.0, but it is clearly annoying.
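A minimal sketch of this restart workaround as shell commands (the jobex call is left as a comment since it needs a TURBOMOLE installation; its flags are taken from a typical CC2 run):

```shell
# Remove jobex's bookkeeping file so the next run restarts from the SCF
# step instead of resuming where the crashed cycle left off:
rm -f nextstep
# Then rerun the optimization in the same directory, e.g.:
#   jobex -ri -level cc2
```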
It is not clear whether the problem relates to release 6.5 or to the use of slurm.
Best regards,
Evgeniy
-
Hi,
Does the job run in a directory that is NFS-mounted? This sounds somewhat NFS-related. I don't know why it would affect only ricc2, though. Does a HF optimization run normally?
Antti
-
Hi Antti,
Thanks for your response. Yes, the directory containing the files needed for
the calculation is NFS-mounted and can be seen by all the nodes. The HF
optimization runs all right.
The problem with the CC2 optimization seems to be related to the present release, TM 6.5,
because it works absolutely fine with TM 6.4 and presumably earlier ones. I mean, it could
be that there were changes in the "input format". I use a relatively old input format, which
works fine with the previous versions of TM, 6.4 or older. Here is part of
the CC2 input I have:
$TMPDIR /tmp/turbo
$SHAREDTMPDIR
$ricc2
cc2
geoopt model=CC2
maxiter=100
$excitations
irrep=a nexc=1 npre=3
xgrad states=(a 1)
$operating system unix
$symmetry c1
$redundant file=coord
$coord file=coord
Best regards,
Evgeniy
-
Hi,
OK, since it works fine with version 6.4, it's probably not an NFS issue (although I've seen some really strange NFS <-> Turbomole issues on some clusters; on RHEL-based systems I haven't observed such problems). I ran some tests on one cluster and all CC2 optimizations ran OK with version 6.5 (serial, OpenMP, and MPI). I used the inputs created by "define", without any modifications. So maybe it's worth testing whether you can run some simple molecule like methane by creating the input just with define and submitting it directly. As far as I know, Turbomole should not really care about the order of the keywords in the control file, so in principle there should be no problem with your input. But for testing purposes, running a "standard" job without modifying the control file could be helpful. If this crashes in a similar way, I recommend contacting Turbomole support.
Best,
Antti
-
Hi Antti,
Just a short question, and a guess as to why it ran fine in your case.
The problem actually occurred in the MPI run. You
wrote that you had checked the MPI case. Did you run your test on
one node or on more than one? When I run my calculation on
one node (with MPI), everything is fine. If more than one node
is used, the above problem occurs. Could you try to run
a test on more than one node?
Best regards,
Evgeniy
-
Hi Evgeniy,
If ricc2 prints an error about an empty input file, the processes on the remote node(s) might be running in the wrong working directory.
Did you set the working directory for this job in the SLURM submit script? Something like
#SBATCH -D /home/me/Turbomole-job/here-is-my-input/
If yes: could you please start such a parallel MPI run on more than one node and send all the generated files to Turbomole support?
Regards,
Uwe
-
Hi Uwe,
Thanks for joining in the discussion of the problem.
I have just sent a tar with the files to the TM support.
Best regards,
Evgeniy
-
Hi Evgeniy,
Just to follow up on your question: I also tested a case where a 32-CPU MPI job is distributed over two 16-CPU machines (a CC2 geometry optimization). The job ran fine. My SLURM script does not enforce a directory with #SBATCH -D (the default configuration on the cluster appears to be to use the original submit directory anyway). How does your SLURM script look, by the way? (Or maybe Uwe has already solved the case based on the data you sent him.)
Antti
-
Hi Antti,
It's still unsolved :( , which is a pity, as 6.5 seems to be faster than 6.4 or any earlier version of TM.
Normally I do not use #SBATCH -D; it does not seem to matter. Still, I used it when I ran the test for Uwe.
Normally, my slurm script looks like the following:
#!/bin/bash
#SBATCH -N 4
#SBATCH -n 64
export PARA_ARCH=MPI
export TURBODIR=$HOME/Helics3a/prog/turbo-6.5/TURBOMOLE
export TURBOMOLE_SYSNAME=x86_64-unknown-linux-gnu
export PATH=$TURBODIR/bin/${TURBOMOLE_SYSNAME}_mpi:$TURBODIR/mpirun_scripts:$TURBODIR/scripts:/bin:/usr/bin:$PATH
srun -l /bin/hostname | sort -n | awk '{print $2}' > $PWD/hosts
export HOSTS_FILE=$PWD/hosts
export PARNODES=$SLURM_NTASKS
jobex -ri -gcart 4 -c 500 -level cc2 -keep
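As an aside, what the hosts-file line does can be illustrated in isolation: srun -l prefixes each output line with its task number, so sorting numerically and printing the second field gives one hostname per MPI rank (the hostnames below are made up):

```shell
# Simulated output of "srun -l /bin/hostname": task number, colon, hostname.
printf '1: node02\n0: node01\n3: node02\n2: node01\n' \
    | sort -n | awk '{print $2}' > hosts
cat hosts   # prints node01, node02, node01, node02 -- one line per rank
```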
Best regards,
Evgeniy
-
Hi Evgeniy,
Your script is very straightforward, so it's really difficult to say what could go wrong. At this point some strange NFS issue is the only idea I have left. Do all the nodes have proper access to "control" if you check this before ricc2 with e.g.
srun head -n 1 control
(you should then get the first row of the control file 64 times)?
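The idea behind this check can also be tried without SLURM; the sketch below simulates four ranks reading a shared file (a stand-in control file is created first, and a simple loop replaces srun):

```shell
printf '$title\ntest\n$end\n' > control   # stand-in control file
# Each simulated rank reads the first line; a single unique line of output
# means every rank sees the same, non-empty file:
for rank in 0 1 2 3; do head -n 1 control; done | sort -u
```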
Antti
-
Hi Antti,
I am not sure I can do this while jobex is running, because ricc2 is called
from jobex.
Well, I think the whole problem has to do with some "asynchronization" of the processes.
I would like to insert a "sleep" of a few seconds in jobex before ricc2, to let all the processes finish
dscf. Do you happen to know where in jobex ricc2 is called during an optimization cycle?
Best regards,
Evgeniy
-
Hi Evgeniy,
You could try the "srun head -n 1 control" even before running jobex, since this would show that everything is OK with the shared disk. I'm not sure, but it could be that for dscf only the master process accesses the control file, while for ricc2 every process wants to access it; so dscf could work even if there is some disk issue. But this is just a guess. In any case, if you want to tweak jobex, you should have a look at two subroutines at the beginning of the script: scf_energy() and gradient(). In fact, the scf_energy subroutine includes the following lines right after executing dscf:
sleep 1
sync
So, you could just increase the sleeping time to test your hypothesis. But I would expect the "sync" command to also take care of the file synchronization.
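If you prefer to script that change rather than edit jobex by hand, here is a hedged sketch; it operates on a stand-in excerpt instead of the real script, and 10 seconds is an arbitrary test value:

```shell
# Stand-in for the scf_energy() lines quoted above:
printf 'dscf > dscf.out\nsleep 1\nsync\n' > jobex.excerpt
# Raise the post-dscf delay from 1 s to 10 s:
sed -i 's/^sleep 1$/sleep 10/' jobex.excerpt
grep '^sleep' jobex.excerpt   # now shows "sleep 10"
```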
Antti
-
Hi Antti,
Thanks for your suggestions. I put "srun head -n 1 control" in the script file
and it gave 64 outputs, so the shared disk seems fine at the start.
Increasing the sleep time, however, didn't help.
So it seems inexplicable at the moment; it may be related to the hardware of the cluster.
Best regards,
Evgeniy