TURBOMOLE Users Forum
TURBOMOLE Modules => Jobex: Structure Optimization and Molecular Dynamics => Topic started by: evgeniy on April 29, 2014, 11:33:55 AM
-
Hello,
I encountered strange behavior of jobex with the latest release of TM (6.5) and slurm.
When running a geometry optimization at the CC2 level, the calculation crashes at the
zeroth step, i.e. in job.0, at the very beginning of the CC2 step (the SCF finishes without
any problem), with a very strange-looking error:
ricc2 ended abnormally
general input file <control> is empty
The control file is of course there and not empty. Furthermore, when I restart the calculation
from the SCF step by removing the nextstep file, without changing anything
else, the whole jobex calculation runs fine. Of course I can do this restart after the crash
of job.0, but it is clearly annoying.
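A minimal sketch of this restart workaround as shell commands (the jobex call is left as a comment since it needs a TURBOMOLE installation; its flags are taken from a typical CC2 run):

```shell
# Remove jobex's bookkeeping file so the next run restarts from the SCF
# step instead of resuming where the crashed cycle left off:
rm -f nextstep
# Then rerun the optimization in the same directory, e.g.:
#   jobex -ri -level cc2
```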
It is not clear whether the problem relates to release 6.5 or to the use of slurm.
Best regards,
Evgeniy
-
Hi,
Does the job run in a directory that is NFS-mounted? This sounds somewhat NFS-related. I don't know why it would affect only ricc2, though. Does a HF optimization run normally?
Antti
-
Hi Antti,
Thanks for your response. Yes, the directory containing the files needed for
the calculation is NFS-mounted and can be seen by all the nodes. The HF
optimization runs all right.
The problem with the CC2 optimization seems to be related to the present release, TM 6.5,
because it works absolutely fine with TM 6.4 and presumably earlier ones. I mean, it could
be that there were changes in the "input format". I use a relatively old input format, which
works fine with the previous versions of TM, 6.4 or older. Here is part of
the CC2 input I have:
$TMPDIR /tmp/turbo
$SHAREDTMPDIR
$ricc2
cc2
geoopt model=CC2
maxiter=100
$excitations
irrep=a nexc=1 npre=3
xgrad states=(a 1)
$operating system unix
$symmetry c1
$redundant file=coord
$coord file=coord
Best regards,
Evgeniy
-
Hi,
OK, since it works fine with version 6.4, it's probably not an NFS issue (although I've seen some really strange NFS <-> Turbomole issues on some clusters; on RHEL-based systems I haven't observed such problems). I ran some tests on one cluster and all CC2 optimizations ran OK with version 6.5 (serial, OpenMP, and MPI). I used the inputs created by "define", without any modifications. So maybe it's worth testing whether you can run some simple molecule like methane by creating the input just with define and submitting it directly. As far as I know, Turbomole should not really care about the order of the keywords in the control file, so in principle there should be no problem with your input. But for testing purposes, running a "standard" job without modifying the control file could be helpful. If this crashes in a similar way, I recommend contacting Turbomole support.
Best,
Antti
-
Hi Antti,
Just a short question, and a guess as to why it ran fine in your case.
The problem actually occurred in the MPI run. You
wrote that you had checked the MPI case. Did you run your test on
one node or on more than one? When I run my calculation on
one node (with MPI), everything is fine. If more than one node
is used, the above problem occurs. Could you try to run
a test on more than one node?
Best regards,
Evgeniy
-
Hi Evgeniy,
If ricc2 prints an error about an empty input file, the processes on the remote node(s) might be running in the wrong working directory.
Did you set the working directory for this job in the SLURM submit script? Something like
#SBATCH -D /home/me/Turbomole-job/here-is-my-input/
If yes: could you please start such a parallel MPI run on more than one node and send all the generated files to Turbomole support?
Regards,
Uwe
-
Hi Uwe,
Thanks for joining in the discussion of the problem.
I have just sent a tar with the files to the TM support.
Best regards,
Evgeniy
-
Hi Evgeniy,
Just to follow up on your question: I also tested a case where a 32-CPU MPI job is distributed over two 16-CPU machines (a CC2 geometry optimization). The job ran fine. My SLURM script does not enforce a directory with #SBATCH -D (the default configuration on the cluster appears to be to use the original submit directory anyway). How does your SLURM script look, by the way? (Or maybe Uwe has already solved the case based on the data you sent him.)
Antti
-
Hi Antti,
It's still unsolved :( , which is a pity, as 6.5 seems to be faster than 6.4 or any earlier version of TM.
Normally I do not use #SBATCH -D; it does not seem to matter. Still, I used it when I ran the test for Uwe.
Normally, my slurm script looks like the following:
#!/bin/bash
#SBATCH -N 4
#SBATCH -n 64
export PARA_ARCH=MPI
export TURBODIR=$HOME/Helics3a/prog/turbo-6.5/TURBOMOLE
export TURBOMOLE_SYSNAME=x86_64-unknown-linux-gnu
export PATH=$TURBODIR/bin/${TURBOMOLE_SYSNAME}_mpi:$TURBODIR/mpirun_scripts:$TURBODIR/scripts:/bin:/usr/bin:$PATH
srun -l /bin/hostname | sort -n | awk '{print $2}' > $PWD/hosts
export HOSTS_FILE=$PWD/hosts
export PARNODES=$SLURM_NTASKS
jobex -ri -gcart 4 -c 500 -level cc2 -keep
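As an aside, what the hosts-file line does can be illustrated in isolation: srun -l prefixes each output line with its task number, so sorting numerically and printing the second field gives one hostname per MPI rank (the hostnames below are made up):

```shell
# Simulated output of "srun -l /bin/hostname": task number, colon, hostname.
printf '1: node02\n0: node01\n3: node02\n2: node01\n' \
    | sort -n | awk '{print $2}' > hosts
cat hosts   # prints node01, node02, node01, node02 -- one line per rank
```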
Best regards,
Evgeniy
-
Hi Evgeniy,
Your script is very straightforward, so it's really difficult to say what could go wrong. At this point some strange NFS issue is the only idea I have left. Do all the nodes have proper access to "control" if you check this before ricc2 with e.g.
srun head -n 1 control
(you should then get the first row of the control file 64 times)?
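The idea behind this check can also be tried without SLURM; the sketch below simulates four ranks reading a shared file (a stand-in control file is created first, and a simple loop replaces srun):

```shell
printf '$title\ntest\n$end\n' > control   # stand-in control file
# Each simulated rank reads the first line; a single unique line of output
# means every rank sees the same, non-empty file:
for rank in 0 1 2 3; do head -n 1 control; done | sort -u
```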
Antti
-
Hi Antti,
I am not sure I can do this while jobex is running, because ricc2 is called
from jobex.
Well, I think the whole problem has to do with some "asynchronization" of the processes.
I would like to insert a "sleep" of a few seconds in jobex before ricc2, to let all the processes finish
dscf. Do you happen to know where in jobex ricc2 is called during an optimization cycle?
Best regards,
Evgeniy
-
Hi Evgeniy,
You could try the "srun head -n 1 control" even before running jobex, since this would show that everything is OK with the shared disk. I'm not sure, but it could be that for dscf only the master process accesses the control file, while for ricc2 every process wants to access it; so dscf could work even if there is some disk issue. But this is just a guess. In any case, if you want to tweak jobex, you should have a look at two subroutines at the beginning of the script: scf_energy() and gradient(). In fact, the scf_energy subroutine includes the following lines right after executing dscf:
sleep 1
sync
So, you could just increase the sleeping time to test your hypothesis. But I would expect the "sync" command to also take care of the file synchronization.
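If you prefer to script that change rather than edit jobex by hand, here is a hedged sketch; it operates on a stand-in excerpt instead of the real script, and 10 seconds is an arbitrary test value:

```shell
# Stand-in for the scf_energy() lines quoted above:
printf 'dscf > dscf.out\nsleep 1\nsync\n' > jobex.excerpt
# Raise the post-dscf delay from 1 s to 10 s:
sed -i 's/^sleep 1$/sleep 10/' jobex.excerpt
grep '^sleep' jobex.excerpt   # now shows "sleep 10"
```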
Antti
-
Hi Antti,
Thanks for your suggestions. I put "srun head -n 1 control" in the script file
and it gave 64 outputs, so the shared disk seems fine at the start.
Increasing the sleep time, however, didn't help.
So it seems inexplicable at the moment; it may be related to the hardware of the cluster.
Best regards,
Evgeniy