PDGBTRF possible bug

Post here if you want to report a bug to the LAPACK team

PDGBTRF possible bug

Postby themos » Thu Mar 10, 2011 10:11 am

Hello,

This is Themos Tsikas from NAG, Oxford, UK. I am dealing with a user of HECToR, the UK academic Cray machine, who has a problem with Scalapack routine PDGBTRF (banded LU factorization with pivoting). Specifically, the code crashes sometimes with a memory corruption in malloc. Using valgrind, I have tracked it down to IPIV array being written beyond its documented bounds:

* IPIV (local output) INTEGER array, dimension >= DESCA( NB ).
* Pivot indices for local factorizations.
* Users *should not* alter the contents between
* factorization and solve.

In fact, it seems to be treated as if it were of length DESCA(NB)+BWL+BWU, where

* BWL (global input) INTEGER
* Number of subdiagonals. 0 <= BWL <= N-1
*
* BWU (global input) INTEGER
* Number of superdiagonals. 0 <= BWU <= N-1

I've noticed a few posts on the forum reporting bad behaviour from this routine so I am wondering if I've come across a documentation bug or an algorithmic bug. I would appreciate your thoughts on this.

Best Regards
Themos Tsikas
themos
 
Posts: 6
Joined: Thu Mar 23, 2006 7:54 am

Re: PDGBTRF possible bug

Postby Julien Langou » Thu Mar 10, 2011 8:50 pm

Hi Themos,
I am not too familiar with the code.
1) Did you trace where the execution crashes in the PDGBTRF? Does it happen line 1045? Does setting IPIV of size ( DESCA(NB)+BWL+BWU ) fixes the problem?
2) There is some restrictions given from lines 134 to 160 of the code.
Cheers,
Julien.
Julien Langou
 
Posts: 733
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Re: PDGBTRF possible bug

Postby themos » Fri Mar 11, 2011 10:31 am

Hello Julien!

Thanks for responding. I made a slightly different version of PDGBTRF (just added LIPIV as a last integer argument and declared INTEGER IPIV(LIPIV).

Valgrind reports


==9075== Invalid write of size 4
==9075== at 0x44B1A8: DGETRF (in /home/amd/loanamd29/themos/75940/a.out)
==9075== by 0x4405FF: mypdgbtrf_ (mypdgbtrf.f:1050)
==9075== by 0x4345D8: MAIN__ (penta_mpi3a.f90:274)
==9075== by 0x42BCDB: main (in /home/amd/loanamd29/themos/75940/a.out)
==9075== Address 0x55adf78 is 0 bytes after a block of size 4,000 alloc'd
==9075== at 0x4A200CB: malloc (vg_replace_malloc.c:207)
==9075== by 0xA03C0A: for_alloc_allocatable (in /home/amd/loanamd29/themos/75940/a.out)
==9075== by 0x433B4A: MAIN__ (penta_mpi3a.f90:242)
==9075== by 0x42BCDB: main (in /home/amd/loanamd29/themos/75940/a.out)

Line 242 of penta_mpi3a.f90 reads
allocate(ipiv(lipiv))

Line 1050 of mypdgbtrf.f reads
CALL DGETRF(BM+BMN,BM+BMN, AF(BBPTR+BW*LDBB), LDBB,
$ IPIV(LN+1), INFO)

LIPIV was set to 1000

on line 1050
LN+1 = 883
BM+BMN = 236

In the PDGBTRF call:

DESCA was given as
501 0 2000 1000 0 237
BWL was set = 59
BWU was set = 59
N was set = 2000
LAF = 250278
LWORK = 250278

and the grid was 1x2 (1 row, 2 columns)

Dimensioning IPIV(DESCA(4)+BWL+BWU) does indeed make the problem disappear. And that's a hard limit, dimensioning by one less results in the same valgrind message.

I think the restrictions are met...

Themos
themos
 
Posts: 6
Joined: Thu Mar 23, 2006 7:54 am

Re: PDGBTRF possible bug

Postby sgildea » Mon May 28, 2012 1:42 pm

Hello,

I had been attempting to debug a "simple" program calling ScaLAPACK routines in C for several days. I set up everything to partition the matrix stored in general banded format on a 1xP process grid. I even got past the tricky bit about each sub-matrix of B (the RHS) having a leading dimension of NB_A (DESCA(4), with Fortran indexing), even if NB_A*P>M_B - which is the case when the order of the matrix (N_A) is not evenly divided by the number of processes. Even then, when I invoked PDGBSV, my program would either segfault when PDGBSV called PDGBTRF, or get the correct answer and then segfault toward the end of my code when it called BLACS_GRIDEXIT.

However, when I changed the length of IPIV from NB_A (again, DESCA(4), with Fortran indexing) to NB_A + BWU + BWL, my test program now appears to work properly, without segfaults, and the output matches the solutions obtained from similar LAPACK routines I wrote using DGBSV and DGBTRS. I am not enough of an expert to know if this is a true bug, or is just covering up another mistake I made, but I will update this post if I find that bugs have persisted.

THANK YOU for not only finding this bug, but also reporting it here so others like me could find it. It has made my weekend.

Steve
sgildea
 
Posts: 3
Joined: Mon May 28, 2012 1:25 pm

Re: PDGBTRF possible bug

Postby tikitaki12 » Thu Jan 03, 2013 6:16 pm

I was having the same problem as Steve with the segfaults using PDGBSV. I was also finding that the code was overwriting matrices that it shouldn't have had access too. Changing the length of IPIV from NB_A to NB_A + BWU + BWL also seems to have solved my problem. Steve, did you ever find any other bugs after doing that?

~BCL
tikitaki12
 
Posts: 3
Joined: Wed Jan 02, 2013 5:43 pm


Return to Bug report

Who is online

Users browsing this forum: Bing [Bot] and 1 guest

cron