TURBOMOLE Users Forum
TURBOMOLE Modules => Ricc2 => Topic started by: antti_karttunen on April 10, 2007, 09:30:27 AM
-
We have encountered some strange crashes while running ricc2 in parallel (TURBOMOLE version 5.9, architecture em64t). I'm attaching output from one such case. The system had about 1800 basis functions in C1 symmetry and was run on a 4-processor workstation with "$numprocs 4".
Here is the output before crash:
======== CC DENSITY MODULE ========
current wave-function model: MP2
calculating CC ground state density
a semicanonical algorithm will be used
density nr. cpu/min wall/min L R
------------------------------------------------------
rank 0 in job 1 tremaine.joensuu.fi_33101 caused collective abort of all ranks
exit status of rank 0: return code 1
Program ricc2_mpi has ended.
Shutting down unused mpd ring.
Here is data from stderr (batch system log):
[cli_0]: aborting job:
Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(217): MPI_Sendrecv(sbuf=0x2ac9879010, scount=-99655065, dtype=0x4c000829, dest=0, stag=88, rbuf=0x2acd7fa010, rcount=-99655065, dtype=0x4c000829, src=0, rtag=88, MPI_COMM_WORLD, status=0x142f210) failed
MPI_Sendrecv(108): Negative count, value is -99655065
What might cause this error? dscf ran fine in parallel, and so did ricc2 until the CC density module.
By the way, I noticed that the control file included the parameter "$parallel_platform cluster". It seems that in all the mpirun_scripts the em64t architecture is given the value "$parallel_platform cluster". I guess "$parallel_platform MPP" would be more appropriate? Can the value of $parallel_platform cause problems in ricc2?
Update: A similar calculation that employed Ci symmetry did not crash like the C1-symmetric calculations did. We will now try whether our calculations work better with the simplified C1 algorithm turned off.
Update 2: OK, with the simplified C1 algorithm turned off, ricc2 could calculate the MP2 energy, but it still crashed in the "LINEAR CC RESPONSE SOLVER", giving a similar MPI error message (Fatal error in MPI_Sendrecv).
-
Hi,
the negative values for scount and rcount look a bit like an integer overflow...
how much memory did you specify in the $maxcor data group?
The $parallel_platform data group has no influence on the ricc2 calculation.
Christof
-
Hello,
originally the job was run with $maxcor 2000 (and $numprocs 4, as I mentioned earlier). The workstation has 8 GB of memory, but according to the OS, ricc2 actually uses clearly less than 2000 MB per process. I also tried lowering $maxcor to 1000, but got the same MPI error with negative scount and rcount. So should I use a higher $maxcor and fewer processes?
We have now been running into this error with all kinds of systems, but if the point group symmetry is higher, ricc2 can handle larger systems before it crashes.
-
well, with $maxcor 2000 the program should not get integer overflows... then it is probably a bug in the communication of an intermediate between the processes... :-[
if that's the case, the calculation should go through if you switch to the 'minimal communication' mode (described in the manual) by adding to the control file:
$mpi_param
min_comm
Christof
-
Hello again,
we seem to be having quite bad luck with this particular system. With min_comm (still $numprocs 4), ricc2 went through the linear CC response solver twice (the equations converged in 3 and 5 iterations, respectively). After this it calculated the ECP gradient and started printing out the MP2 occupation numbers, but crashed after printing the unrelaxed occupations. The last messages after the ECP gradient were:
rdiag: n=1778, sequential Div&Con
MP2 unrelaxed natural orbital occupation numbers (I'm not pasting them here)
.
.
.
Maximum change in occupation number:
occupied : -4.60 % ( 305 a )
virtual : 0.00 % ( 0 )
rdiag: n=1778, sequential Div&Con
The gradient file is empty, but just before quitting ricc2 seems to have written the file restartg. The stdout error log revealed that ricc2 had segfaulted:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
ricc2_mpi 0000000000EBF30E Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
ricc2_mpi 0000000000EBF30E Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
ricc2_mpi 0000000000EBF30E Unknown Unknown Unknown
So, what could be the problem here? It almost got through the calculation...
-
Hi,
my guess is that ricc2 segfaults in the linear algebra routines, i.e. in the BLAS library that comes with Intel's MKL. That often happens if the stack size is too small. Did you check your limits (ulimit -a for sh/bash/ksh, or limit for csh/tcsh)? And if the limits are OK (unlimited or as big as your physical memory), do you also have those limits when you ssh directly to the machine? That is what MPICH does when starting a parallel application...
ssh <hostname> ulimit -a
If the ssh command gives a lower stack size, you have to edit the file
/etc/security/limits.conf
and add the line (example for a 4 GB limit)
* soft stack 4194303
and then redo ssh <hostname> ulimit -a
You should then get a 4 GB stack size limit, as it is now set in limits.conf.
Regards,
Uwe
-
Thanks for the tip! We had the stack size set to unlimited for serial jobs in our batch system scripts, but the parallel processes launched by MPICH were not covered by this setting. I will try to run the job again. In addition, this might solve several other strange problems we have had with large parallel calculations!
Maybe the details about parallel runs would be worth adding to the FAQ answer that covers the stacksize problems?
-
Hi,
I tried using both a very large stack value in /etc/security/limits.conf and an unlimited stacksize in the .cshrc file, but the calculation still segfaulted. ssh <hostname> limit gives the following results, so the stack size should be OK for processes launched by MPICH:
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 8388606 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 32 kbytes
maxproc 71680
The other parallel problems I mentioned in my previous message were not solved by increasing the stack size for MPICH-launched processes either. The problem is that parallel dscf systematically segfaults with larger systems. I ran some additional tests to see whether this is an architecture-dependent problem: I took a moderately large system with 3000 basis functions and ran parallel dscf test calculations on all the different architectures we have (i786-, x86-64-, and em64t-unknown-linux-gnu). All segfault before the first SCF iteration, but i786 is the only one that gives more detailed information in stdout:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
dscf_mpi 0855E454 pa_snd_long_ 506 pa_mp.f
dscf_mpi 080C39D0 shlupf_.J 180 shlupf.f
dscf_mpi 080764E3 scf_.H 886 scf.f
dscf_mpi 0804D18B MAIN__.J 1544 dscf.f
dscf_mpi 080482A1 Unknown Unknown Unknown
dscf_mpi 08A455E9 Unknown Unknown Unknown
dscf_mpi 08048181 Unknown Unknown Unknown
This system was Ih-symmetric, but a C1-symmetric run segfaulted, too. It seems that I can run systems with 1500 basis functions with parallel dscf, but anything over 2000 is too much and results in a segfault. In comparison, ridft runs these calculations successfully in parallel mode.
Our machines have Red Hat-based Linux distributions (Fedora and CentOS). Is there something else I could have forgotten or misconfigured? I can also send test inputs, if necessary.
-
Hi Antti,
I would recommend waiting for Turbomole 5.9.1, which will be out this week. It has some bug fixes, and the parallel version is now based on HP-MPI rather than MPICH. If your input still fails with the new version, the problem will be an issue for Turbomole support. Send an email to turbomole at cosmologic dot de with the input file attached and some details about the machine you are using.
The dscf problem on i786 might come from some internal 32-bit limits, so this dscf version perhaps segfaults for a different reason than the other 64-bit binaries on AMD64 and EM64T systems. And dscf obviously crashes when calling an MPI routine (pa_snd_long sends a large array from one node to another), so perhaps the HP-MPI environment does a better job here.
Regards,
Uwe
-
Hello,
the parallel ricc2 crashes are still causing us some problems. We are successfully using the min_comm option to avoid problems with the communication of the intermediates, but the segfault after printing out the unrelaxed MP2 natural orbital occupations still occurs in several cases (one case was reported above). The segfault happened after ricc2 had printed out the lines
Maximum change in occupation number:
occupied : -9.00 % ( 200 a )
This time we were trying to optimize a system with C1 symmetry, 80 atoms, and 1520 basis functions with Turbomole 5.9.1 (em64t). The stack size is set to unlimited for MPICH-launched processes, so the segfault should not be due to stack limits.
ricc2 wrote the file restartg, but the gradient file is empty. Data from stdout:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 00000034DF10C430 Unknown Unknown Unknown
Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 00000034DF10C430 Unknown Unknown Unknown
Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 00000034DF10C430 Unknown Unknown Unknown
Stack trace terminated abnormally.
MPI Application rank 1 exited before MPI_Finalize() with status 174
I can send an input file, if necessary.
-
Hi Antti,
please send the input that causes the crash to Turbomole Support.
Regards,
Uwe