Author Topic: Parallel ricc2 crash  (Read 21027 times)

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Parallel ricc2 crash
« on: April 10, 2007, 09:30:27 AM »
We have encountered some strange crashes while running ricc2 in parallel (TURBOMOLE version 5.9, architecture em64t). I'm attaching output from one such case. The system had about 1800 basis functions in C1 symmetry and was run on a 4-processor workstation with "$numprocs 4".

Here is the output just before the crash:

                    ========   CC DENSITY MODULE   ========

                      current wave-function model: MP2

  calculating CC ground state density

   a semicanonical algorithm will be used

    density nr.      cpu/min        wall/min    L     R
   ------------------------------------------------------
rank 0 in job 1  tremaine.joensuu.fi_33101   caused collective abort of all ranks
  exit status of rank 0: return code 1
Program ricc2_mpi has ended.
Shutting down unused mpd ring.


Here is data from stderr (batch system log):

[cli_0]: aborting job:
Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(217): MPI_Sendrecv(sbuf=0x2ac9879010, scount=-99655065, dtype=0x4c000829, dest=0, stag=88, rbuf=0x2acd7fa010, rcount=-99655065, dtype=0x4c000829, src=0, rtag=88, MPI_COMM_WORLD, status=0x142f210) failed
MPI_Sendrecv(108): Negative count, value is -99655065


What might cause this error? dscf ran fine in parallel, and so did ricc2 until the CC density module.

By the way, I noticed that the control file included the parameter "$parallel_platform cluster". It seems that in all the mpirun_scripts the em64t architecture is given the value "$parallel_platform cluster". Would "$parallel_platform MPP" be more appropriate? Can the value of $parallel_platform cause problems in ricc2?

Update: A similar calculation that used Ci symmetry did not crash the way the C1-symmetric calculations did, so we will now test whether our calculations work better with the simplified C1 algorithm turned off.

Update 2: With the simplified C1 algorithm turned off, ricc2 could calculate the MP2 energy, but it still crashed in the "LINEAR CC RESPONSE SOLVER" with a similar MPI error message (Fatal error in MPI_Sendrecv).
« Last Edit: April 11, 2007, 07:42:14 AM by antti_karttunen »

christof.haettig

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 291
  • Karma: +0/-0
    • Hattig's Group at the RUB
Re: Parallel ricc2 crash
« Reply #1 on: April 13, 2007, 10:55:03 PM »
Hi,

the negative values for scount and rcount look a bit like an integer overflow...
how much memory did you specify in the $maxcor data group?
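
Just to illustrate the mechanism (a minimal C sketch, not TURBOMOLE code; the value 4195312231 is simply one 64-bit count that a 32-bit wrap-around would map to the reported -99655065): if a count is computed in 64-bit arithmetic and then squeezed into the 32-bit int count argument of the MPI C bindings, anything above 2^31-1 comes out negative on the usual two's-complement platforms:

#include <stdio.h>

int main(void)
{
    long long nelem = 4195312231LL;  /* hypothetical 64-bit count, larger than 2^31-1 */
    int count = (int) nelem;         /* truncated to a 32-bit count, wraps to a negative value */
    printf("64-bit value %lld becomes 32-bit count %d\n", nelem, count);
    return 0;
}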

The $parallel_platform data group has no influence on the ricc2 calculation.

Christof

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Parallel ricc2 crash
« Reply #2 on: April 15, 2007, 10:00:24 AM »
Hello,

originally the job was run with $maxcor 2000 (and $numprocs 4, as I mentioned earlier). The workstation has 8 GB of memory, but according to the OS, ricc2 actually uses clearly less than 2000 MB per process. I also tried lowering $maxcor to 1000, but got the same MPI error with negative scount and rcount. So should I use a higher $maxcor and fewer processors?

We have now been running into this error with all kinds of systems, but if the point group symmetry is higher, ricc2 can handle larger systems before it crashes.

christof.haettig

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 291
  • Karma: +0/-0
    • Hattig's Group at the RUB
Re: Parallel ricc2 crash
« Reply #3 on: April 16, 2007, 09:02:41 AM »
Well, with $maxcor 2000 the program should not get integer overflows... then it is probably a bug in the communication of an intermediate between the processes. :-[
If that is the case, the calculation should go through if you switch to the 'minimal communication' mode (described in the manual) by adding the following to the control file:
$mpi_param
  min_comm

Christof

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Parallel ricc2 crash
« Reply #4 on: April 19, 2007, 07:42:12 AM »
Hello again,

we seem to be having quite bad luck with this particular system. With min_comm (still $numprocs 4), ricc2 went through the linear CC response solver twice (the equations converged in 3 and 5 iterations, respectively). After this it calculated the ECP gradient and started printing the MP2 occupation numbers, but crashed after printing the unrelaxed occupations. The last messages after the ECP gradient were:

rdiag: n=1778, sequential Div&Con

MP2 unrelaxed natural orbital occupation numbers (I'm not pasting them here)
.
.
.
      Maximum change in occupation number:
        occupied         :    -4.60 % (  305 a    )
        virtual          :     0.00 % (    0      )
rdiag: n=1778, sequential Div&Con


The gradient file is empty, but just before quitting, ricc2 seems to have written the file restartg. The error log revealed that ricc2 had segfaulted:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
ricc2_mpi          0000000000EBF30E  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
ricc2_mpi          0000000000EBF30E  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
ricc2_mpi          0000000000EBF30E  Unknown               Unknown  Unknown


So, what could be the problem here? It almost got through the calculation...

uwe

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 558
  • Karma: +0/-0
Re: Parallel ricc2 crash
« Reply #5 on: April 19, 2007, 09:41:29 AM »
Hi,

my guess is that ricc2 segfaults in the linear algebra routines, i.e. in the BLAS library that comes with Intel's MKL. That often happens if the stack size is too small. Did you check your limits (ulimit -a for sh/bash/ksh, or limit for csh/tcsh)? And if the limits are OK (unlimited or as big as your physical memory), do you also have those limits when you ssh directly to the machine? That is what MPICH does when starting a parallel application:

ssh <hostname> ulimit -a

If the ssh command gives a lower stack size, you have to change the file

/etc/security/limits.conf

and add the following line (example for a 4 GB limit):

*                soft    stack           4194303

and then redo ssh <hostname> ulimit -a.
You should now get a 4 GB stack size limit, as set in limits.conf.
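
To double-check from inside the job itself, a tiny test program can print the limit that a launched process actually sees (just a sketch, not part of TURBOMOLE; the idea is to start it through the same ssh/mpirun path as the real job):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* query the soft stack limit of this very process */
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("stack size: unlimited\n");
    else
        printf("stack size: %lu kbytes\n", (unsigned long)(rl.rlim_cur / 1024));
    return 0;
}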

Regards,

Uwe

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Parallel ricc2 crash
« Reply #6 on: April 19, 2007, 10:54:12 AM »
Thanks for the tip! We had the stack size set to unlimited for serial jobs in our batch system scripts, but the parallel processes launched by MPICH were not covered by this setting. I will try to run the job again. In addition, this might solve several other strange problems we have had with large parallel calculations!

Maybe the details about parallel runs would be worth adding to the FAQ answer that covers the stacksize problems?
« Last Edit: April 19, 2007, 11:08:00 AM by antti_karttunen »

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Parallel ricc2 crash
« Reply #7 on: April 20, 2007, 10:03:33 PM »
Hi,

I tried both a very large stack value in /etc/security/limits.conf and an unlimited stacksize in the .cshrc file, but the calculation still segfaulted. ssh <hostname> limit gives the following results, so the stack size should be OK for processes launched by MPICH:

cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    8388606 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked 32 kbytes
maxproc      71680


The other parallel problems I mentioned in my previous message were not solved by increasing the stack size for MPICH-launched processes either. The problem is that parallel dscf systematically segfaults with larger systems. I made some additional tests to see whether this is an architecture-dependent problem: I took a moderately large system with 3000 basis functions and ran parallel dscf test calculations on all the architectures we have (i786-, x86-64-, and em64t-unknown-linux-gnu). All of them segfault before the first SCF iteration, but i786 is the only one that gives more detailed information in stdout:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC        Routine            Line        Source             
dscf_mpi           0855E454  pa_snd_long_              506  pa_mp.f
dscf_mpi           080C39D0  shlupf_.J                 180  shlupf.f
dscf_mpi           080764E3  scf_.H                    886  scf.f
dscf_mpi           0804D18B  MAIN__.J                 1544  dscf.f
dscf_mpi           080482A1  Unknown               Unknown  Unknown
dscf_mpi           08A455E9  Unknown               Unknown  Unknown
dscf_mpi           08048181  Unknown               Unknown  Unknown

This system was Ih-symmetric, but a C1-symmetric run segfaulted, too. It seems that I can run systems with 1500 basis functions with parallel dscf, but anything over 2000 basis functions results in a segfault. In comparison, ridft runs these calculations successfully in parallel mode.

Our machines run Red Hat-based Linux distributions (Fedora and CentOS). Is there something else I could have forgotten or misconfigured? I can also send test inputs if necessary.

uwe

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 558
  • Karma: +0/-0
Re: Parallel ricc2 crash
« Reply #8 on: April 23, 2007, 05:06:05 PM »
Hi Antti,

I would recommend waiting for Turbomole 5.9.1, which will be out this week. It has some bug fixes, and the parallel version is now based on HP-MPI rather than MPICH. If your input still fails with the new version, the problem becomes an issue for Turbomole support: send an email to turbomole at cosmologic dot de with the input file attached and some details about the machine you are using.

The dscf problem on i786 might come from some internal 32-bit limits, so that dscf version perhaps segfaults for a different reason than the 64-bit binaries on the AMD64 and EM64T systems. And dscf obviously crashes when calling an MPI routine (pa_snd_long sends a large array from one node to another), so perhaps the HP-MPI environment does a better job here.
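
As a generic illustration of that kind of 32-bit limit (this is not how pa_snd_long works internally, just a sketch of the usual workaround): the count argument of the MPI C routines is a plain int, so shipping a very large array in a single call can overflow it, while splitting the transfer into chunks keeps every message count far below INT_MAX:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helpers, not TURBOMOLE code: transfer a large double array
   in chunks so that each individual MPI count stays far below INT_MAX. */
static void send_large(double *buf, long long n, int dest, int tag, MPI_Comm comm)
{
    const long long chunk = 1LL << 27;               /* 2^27 doubles = 1 GiB per message */
    for (long long off = 0; off < n; off += chunk) {
        int count = (int)((n - off < chunk) ? n - off : chunk);
        MPI_Send(buf + off, count, MPI_DOUBLE, dest, tag, comm);
    }
}

static void recv_large(double *buf, long long n, int src, int tag, MPI_Comm comm)
{
    const long long chunk = 1LL << 27;               /* must match the sender's chunking */
    for (long long off = 0; off < n; off += chunk) {
        int count = (int)((n - off < chunk) ? n - off : chunk);
        MPI_Recv(buf + off, count, MPI_DOUBLE, src, tag, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank;
    long long n = 1000000;                           /* small demo size */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *buf = calloc(n, sizeof(double));
    if (rank == 0)
        send_large(buf, n, 1, 88, MPI_COMM_WORLD);   /* rank 0 sends to rank 1 */
    else if (rank == 1)
        recv_large(buf, n, 0, 88, MPI_COMM_WORLD);
    free(buf);
    MPI_Finalize();
    return 0;
}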

Regards,

Uwe

antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: Parallel ricc2 crash
« Reply #9 on: May 25, 2007, 09:54:44 AM »
Hello,

the parallel ricc2 crashes are still causing us some problems. We are successfully using min_comm to avoid the problems with the communication of intermediates, but the segfault after printing the unrelaxed MP2 natural orbital occupation numbers still occurs in several cases (one was reported above). The segfault happened after ricc2 had printed the lines
Maximum change in occupation number:
        occupied         :    -9.00 % (  200 a    )

This time we were trying to optimize a system with C1 symmetry, 80 atoms, and 1520 basis functions with Turbomole 5.9.1 (em64t). The stack size is set to unlimited for MPICH-launched processes, so the segfault should not be due to stack limits.

ricc2 wrote the file restartg, but the gradient file is empty. Data from stdout:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread.so.0    00000034DF10C430  Unknown               Unknown  Unknown
Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread.so.0    00000034DF10C430  Unknown               Unknown  Unknown
Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread.so.0    00000034DF10C430  Unknown               Unknown  Unknown
Stack trace terminated abnormally.
MPI Application rank 1 exited before MPI_Finalize() with status 174

I can send an input file, if necessary.

uwe

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 558
  • Karma: +0/-0
Re: Parallel ricc2 crash
« Reply #10 on: May 31, 2007, 11:06:55 AM »
Hi Antti,

please send the input that causes the crash to the Turbomole Support.

Regards,

Uwe