Author Topic: Parallel ricc2 crash (Read 21858 times)

antti_karttunen · « **on:** April 10, 2007, 09:30:27 AM »

We have encountered some strange crashes while running ricc2 in parallel (TURBOMOLE version 5.9, architecture em64t). I'm attaching output from one such case. This system had about 1800 basis functions in C1 symmetry. It was run on a 4-processor workstation with "$numprocs 4".

Here is the output before crash:
    ======== CC DENSITY MODULE ========
    current wave-function model: MP2
    calculating CC ground state density
    a semicanonical algorithm will be used
    density nr. cpu/min wall/min L R
    ------------------------------------------------------
    rank 0 in job 1 tremaine.joensuu.fi_33101 caused collective abort of all ranks
    exit status of rank 0: return code 1
    Program ricc2_mpi has ended.
    Shutting down unused mpd ring.

Here is data from stderr (batch system log):
    [cli_0]: aborting job:
    Fatal error in MPI_Sendrecv: Invalid count, error stack:
    MPI_Sendrecv(217): MPI_Sendrecv(sbuf=0x2ac9879010, scount=-99655065, dtype=0x4c000829, dest=0, stag=88, rbuf=0x2acd7fa010, rcount=-99655065, dtype=0x4c000829, src=0, rtag=88, MPI_COMM_WORLD, status=0x142f210) failed
    MPI_Sendrecv(108): Negative count, value is -99655065

What might cause this error? dscf ran fine in parallel, and so did ricc2 until the CC density module.

By the way, I noticed that the control file included the parameter "$parallel_platform cluster". It seems that in all the mpirun_scripts the em64t architecture is given the value "$parallel_platform cluster". I guess "$parallel_platform MPP" would be more appropriate? Can the value of $parallel_platform cause problems in ricc2?

Update: A similar calculation that employed Ci symmetry did not crash like the C1-symmetric calculations. We will now test whether our calculations work better with the simplified C1 algorithm turned off.

Update2: OK, with the simplified C1 algorithm turned off ricc2 could calculate the MP2 energy, but it still crashed in the "LINEAR CC RESPONSE SOLVER". It gave a similar MPI error message (Fatal error in MPI_Sendrecv).

christof.haettig · « **Reply #1 on:** April 13, 2007, 10:55:03 PM »

Hi,

the negative values for scount and rcount look a bit like an integer overflow...
how much memory did you specify in the $maxcor data group?
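To illustrate the overflow suspicion above, here is a minimal Python sketch of how a count larger than 2^31 - 1 turns negative when it is stored in a 32-bit signed integer. The helper names are made up for illustration, and the idea that the count wrapped around exactly once is only an assumption; the error log does not show the intended value.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def wrap_int32(n: int) -> int:
    """Return the value n takes when truncated to a 32-bit signed integer."""
    return (n + 2**31) % 2**32 - 2**31


def unwrap_once(wrapped: int) -> int:
    """Candidate original value, ASSUMING exactly one wrap-around occurred."""
    return wrapped + 2**32


reported = -99655065               # scount/rcount from the MPI error message
candidate = unwrap_once(reported)  # hypothetical intended element count

# Print the candidate, whether it exceeds the 32-bit range, and whether
# truncating it reproduces the reported negative count.
print(candidate, candidate > INT32_MAX, wrap_int32(candidate) == reported)
```

If the assumption holds, the requested count was larger than any value a default 32-bit integer argument can carry, which would fit the "Negative count" message.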

The $parallel_platform data group has no influence on the ricc2 calculation.

Christof

antti_karttunen · « **Reply #2 on:** April 15, 2007, 10:00:24 AM »

Hello,

originally the job was run with $maxcor 2000 (and $numprocs 4 as I mentioned earlier). The workstation has 8 GB of memory, but ricc2 actually uses clearly less than 2000 MB per process according to the OS. I also tried to lower $maxcor to 1000, but got the same MPI error with negative scount and rcount. So should I use a higher $maxcor and fewer processors?

We have now been running into this error with all kinds of systems, but if the point group symmetry is higher, ricc2 can handle larger systems before it crashes.

christof.haettig · « **Reply #3 on:** April 16, 2007, 09:02:41 AM »

well, with $maxcor 2000, the program should not get integer overflows... then it is probably a bug in the communication of an intermediate between the processes...
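A rough check of this reasoning, as a Python sketch. It assumes (this is not stated in the thread) that $maxcor is given in MB per process and that the work array is indexed in 8-byte double-precision words with 32-bit integers; the function name is hypothetical.

```python
INT32_MAX = 2**31 - 1   # largest index a 32-bit signed integer can hold
BYTES_PER_WORD = 8      # assumption: double-precision words


def words_for_maxcor(maxcor_mb: int) -> int:
    """Number of 8-byte words in maxcor_mb megabytes (1 MB taken as 2**20 bytes)."""
    return maxcor_mb * 2**20 // BYTES_PER_WORD


# The two $maxcor values tried in this thread.
for maxcor in (1000, 2000):
    words = words_for_maxcor(maxcor)
    print(maxcor, words, words <= INT32_MAX)
```

Under these assumptions both values stay well below the 32-bit limit, which is consistent with looking for the negative count in the communication step rather than in the memory setting.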

if that's the case, the calculation should go through if you switch to the 'minimal communication' mode (described in the manual) by adding in the control file:
$mpi_param
min_comm

Christof

antti_karttunen · « **Reply #4 on:** April 19, 2007, 07:42:12 AM »

Hello again,

we seem to be having rather bad luck with this particular system. With min_comm (still $numprocs 4) ricc2 went through the linear CC response solver twice (the equations converged in 3 and 5 iterations, respectively). After this it calculated the ECP gradient and started printing the MP2 occupation numbers, but crashed after printing the unrelaxed occupations. The last messages after the ECP gradient were:
    rdiag: n=1778, sequential Div&Con
    MP2 unrelaxed natural orbital occupation numbers
    (I'm not pasting them here)
    . . .
    Maximum change in occupation number:
    occupied : -4.60 % ( 305 a )
    virtual : 0.00 % ( 0 )
    rdiag: n=1778, sequential Div&Con

The gradient file is empty, but just before quitting ricc2 seems to have written the file restartg. The stdout error log revealed that ricc2 had segfaulted:
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image      PC                Routine  Line     Source
    ricc2_mpi  0000000000EBF30E  Unknown  Unknown  Unknown
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image      PC                Routine  Line     Source
    ricc2_mpi  0000000000EBF30E  Unknown  Unknown  Unknown
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image      PC                Routine  Line     Source
    ricc2_mpi  0000000000EBF30E  Unknown  Unknown  Unknown

So, what could be the problem here? It almost got through the calculation...

uwe · « **Reply #5 on:** April 19, 2007, 09:41:29 AM »

Hi,

my guess is that ricc2 segfaults in the linear algebra routines, i.e. in the BLAS library that comes with Intel's MKL. That often happens if the stack size is too small. Did you check your limits (ulimit -a for sh/bash/ksh or limit for csh/tcsh)? And if the limits are o.k. (unlimited or as big as your physical memory), do you also have those limits when you ssh directly to the machine? That is what MPICH does when starting a parallel application...

ssh <hostname> ulimit -a

If the ssh command gives a lower stack size, you have to change the file

/etc/security/limits.conf

and add there the line (example for 4GB limit)

* soft stack 4194303

and then redo ssh <hostname> ulimit -a
Then you should get 4GB stack size limit, as it is set in limits.conf now.
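The comparison described above can be scripted. This is a minimal sketch in Python, assuming the remote login shell is sh-compatible so that `ulimit -s` works (under csh/tcsh the command would be `limit stacksize` and the parsing would need adapting). The function names and the host name are placeholders and are not part of TURBOMOLE or MPICH.

```python
import subprocess
from typing import Optional


def parse_stack_limit(text: str) -> Optional[int]:
    """Parse `ulimit -s` output: None means unlimited, otherwise kilobytes."""
    value = text.strip()
    return None if value == "unlimited" else int(value)


def stack_limit(host: Optional[str] = None) -> Optional[int]:
    """Soft stack limit of a local shell, or of a shell started via ssh on host."""
    if host is None:
        cmd = ["sh", "-c", "ulimit -s"]
    else:
        cmd = ["ssh", host, "ulimit -s"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_stack_limit(result.stdout)


def is_sufficient(limit_kb: Optional[int], required_kb: int) -> bool:
    """True if the limit is unlimited or at least required_kb kilobytes."""
    return limit_kb is None or limit_kb >= required_kb


# Example use (not executed here; "node01" is a placeholder host name):
#   local = stack_limit()
#   remote = stack_limit("node01")
#   print(is_sufficient(remote, 4194303))
```

If the local value is fine but the value obtained through ssh is lower, the processes started by the MPI launcher would inherit the lower limit, which is the situation the limits.conf change is meant to fix.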

Regards,

Uwe

antti_karttunen · « **Reply #6 on:** April 19, 2007, 10:54:12 AM »

Thanks for the tip! We had the stacksize set to unlimited for serial jobs in our batch system scripts, but the parallel processes launched by MPICH were not covered by this setting. I will try to run the job again. In addition, this might solve several other strange problems we have had with large parallel calculations!

Maybe the details about parallel runs would be worth adding to the FAQ answer that covers the stacksize problems?

antti_karttunen · « **Reply #7 on:** April 20, 2007, 10:03:33 PM »

Hi,

I tried using both a very large stack value in /etc/security/limits.conf and an unlimited stacksize in the .cshrc file, but the calculation still segfaulted. ssh <hostname> limit gives the following results, so the stack size should be OK for processes launched by MPICH:
    cputime       unlimited
    filesize      unlimited
    datasize      unlimited
    stacksize     8388606 kbytes
    coredumpsize  0 kbytes
    memoryuse     unlimited
    vmemoryuse    unlimited
    descriptors   1024
    memorylocked  32 kbytes
    maxproc       71680

The other parallel problems I mentioned in my previous message were not solved by increasing the stack size for MPICH-launched processes either. The problem is that parallel dscf systematically segfaults with larger systems. I made some additional tests to see whether this is an architecture-dependent problem. I took a moderately large system with 3000 basis functions and ran parallel dscf test calculations on all the architectures we have (i786-, x86-64-, and em64t-unknown-linux-gnu). All segfault before the first SCF iteration, but i786 is the only one to give more detailed information in stdout:
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image     PC        Routine       Line     Source
    dscf_mpi  0855E454  pa_snd_long_  506      pa_mp.f
    dscf_mpi  080C39D0  shlupf_.J     180      shlupf.f
    dscf_mpi  080764E3  scf_.H        886      scf.f
    dscf_mpi  0804D18B  MAIN__.J      1544     dscf.f
    dscf_mpi  080482A1  Unknown       Unknown  Unknown
    dscf_mpi  08A455E9  Unknown       Unknown  Unknown
    dscf_mpi  08048181  Unknown       Unknown  Unknown
This system was I_h-symmetric, but a C1 symmetric run segfaulted, too. It seems that I can run systems with 1500 basis functions with parallel dscf, but anything over 2000 is too much and results in segfault. In comparison, ridft runs these calculations successfully in parallel mode.

Our machines run Red Hat-based Linux distributions (Fedora and CentOS). Is there something else I could have forgotten or misconfigured? I can also send test inputs, if necessary.

uwe · « **Reply #8 on:** April 23, 2007, 05:06:05 PM »

Hi Antti,

I would recommend waiting for Turbomole 5.9.1, which will be out this week. It has some bug fixes, and the parallel version is now based on HP-MPI rather than MPICH. If your input still fails with the new version, it will be a matter for Turbomole support. Send an email to turbomole at cosmologic dot de with the input file attached and some details about the machine you are using.

The dscf problem on i786 might come from some internal 32-bit limits, so this dscf version perhaps segfaults for a different reason than the 64-bit binaries on AMD64 and EM64T systems. And dscf obviously crashes when calling an MPI routine (pa_snd_long sends a large array from one node to another), so perhaps the HP-MPI environment does a better job here.

Regards,

Uwe

antti_karttunen · « **Reply #9 on:** May 25, 2007, 09:54:44 AM »

Hello,

the parallel ricc2 crashes are still causing us some problems. We are successfully using min_comm to avoid problems with communication of the intermediates, but the segfault after printing out the unrelaxed MP2 natural orbital occupations still occurs in several cases (one case was reported above). The segfault happened after ricc2 had printed out the lines
    Maximum change in occupation number:
    occupied : -9.00 % ( 200 a )
This time we were trying to optimize a system with C1 symmetry, 80 atoms and 1520 basis functions with Turbomole 5.9.1, em64t. Stacksize is set to unlimited for MPICH launched processes, so the segfault should not be due to stack limits.

ricc2 wrote the file restartg, but the gradient file is empty. Data from stdout:
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image            PC                Routine  Line     Source
    libpthread.so.0  00000034DF10C430  Unknown  Unknown  Unknown
    Stack trace terminated abnormally.
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image            PC                Routine  Line     Source
    libpthread.so.0  00000034DF10C430  Unknown  Unknown  Unknown
    Stack trace terminated abnormally.
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image            PC                Routine  Line     Source
    libpthread.so.0  00000034DF10C430  Unknown  Unknown  Unknown
    Stack trace terminated abnormally.
    MPI Application rank 1 exited before MPI_Finalize() with status 174
I can send an input file, if necessary.

uwe · « **Reply #10 on:** May 31, 2007, 11:06:55 AM »

Hi Antti,

please send the input that causes the crash to the Turbomole Support.

Regards,

Uwe

TURBOMOLE Users Forum
