bug in p?trsm

Post here if you want to report a bug to the LAPACK team

bug in p?trsm

Postby naromero » Mon Aug 01, 2011 10:50 am

While porting the GPAW code to BG/Q, a team at IBM discovered a bug in the p?trsm. See bug
report below. Basically 'Var' is uinitialized if ForceTop is False, which causes problems later on
in PB_CptrsmAB().

Can someone provide a patch?

Nichols A. Romero, Ph.D.
Argonne Leadership Computing Facility
Argonne National Laboratory
Building 240 Room 2-127
9700 South Cass Avenue
Argonne, IL 60490
(630) 252-3441


From: "Paul Coffman" <pkcoff@us.ibm.com>
To: "Nichols A. Romero" <naromero@alcf.anl.gov>
Sent: Friday, July 29, 2011 1:09:07 AM
Subject: Re: success with print statements --- GPAW for BGQ results on Q32


Nick -

The messaging team ended up finding a bug in PBLAS - here is their analysis - please let me know what you think:


The cause of the MPI truncation error when running GPAW is a bug in the PBLAS (Parallel Basic Linear Algebra Subprograms) library, which is part of the application.
There is a codepath where a "char" variable is declared on the stack, and passed uninitialized into a subroutine.
The subroutine takes a different path than it should, causing an MPI_Recv() to be called with a length that is half the size of what it should be. It appears that this bug is found in four modules, although there could be others.
pctrsm_.c
pdtrsm_.c
pstrsm_.c
pztrsm_.c
The pdtrsm_.c module is the one causing the problem in this issue.

When a WRITE() statement is added to a fortran module higher up the stack, the problem seems to go away. This is because the stack changes, so the value in the uninitialized variable changes.

When GPAW fails with the truncation error, variable "Var" is uninitialized. It just happens to be set to "R" on the 70th iteration (which is a valid value, but not the right one). Here is output showing this.
The boolean ForceTop is calculated to be 0. This skips the codepath that sets "Var". So, when PB_CptrsmAB() is called, "Var" is uninitialized.
PB_CptrsmAB() documents that "Var" is only used if transposing. We see that "notran" is 0, which means it is transposing, so Var will be used. The T form of transpose (TRANSA=T) is being used.
PB_CptrsmAB() ends up calling the 11th occurrence of PB_CInV() in the PB_CptrsmAB.c module.

stdout[46]: PDTRSM: M=64, N=20, IA=4417, JA=4417, DESCA=1, IB=4417, JB=4481, DESCB=1
stdout[46]: ForceTop: *M=64, nb=32, *N=20, ForceTop=0
stdout[46]: PDTRSM: 7
stdout[46]: PB_CptrsmAB: M=64, N=20, Ai=4416, Aj=4416, Bi=4416, Bj=4480, Var=R, Var=0x52, CRIGHT=R, CLEFT=L, notran=0
stdout[46]: PB_CptrsmAB: 1. TRANSA=T, VARIANT=R
stdout[46]: PB_CptrsmAB: S 2.5.1
stdout[46]: Calling PB_CInV 11
stdout[46]: PB_Cpaxpby 2: AisR=0, BisR=0, BmyprocR=6, BprocR=6, BnpD=64, BnR=32, size=8, ma=32, na=64
stdout[46]: MPI_Type_vector: count=64, blocklength=32, stride=32, oldtype=1275070475, newtype=-1946157049
stdout[46]: dgerv2d: Mpval(m)=32, Mpval(n)=64, tlda=32, MatTyp=-1946157049
stdout[46]: BI_Srecv: buf=0x1e3e45f320, N=1, dtype=-1946157049, src=45, msgid=9976
stdout[46]: MPI_Recv calling MPID_Recv with buf=0x1e3e45f320, count=1, source=45, tag=9976, datatype=-1946157049
stderr[46]: Abort(1) on node 46 (rank 46 in comm 1140850688): Fatal error in MPI_Recv: Message truncated, error stack:

When GPAW does not fail with the truncation error because the WRITE() statement was added up the stack, Var happens to be 0xEF (not valid values R or L....just garbage), as seen below.
PB_CptrsmAB() ends up calling the 17th occurrence of PB_CInV() in the PB_CptrsmAB.c module.

stdout[46]: IFB: LC, K, KB, N 70 4417 64 4500
stdout[46]: PDTRSM: M=64, N=20, IA=4417, JA=4417, DESCA=1, IB=4417, JB=4481, DESCB=1
stdout[46]: ForceTop: *M=64, nb=32, *N=20, ForceTop=0
stdout[46]: PDTRSM: 7
stdout[46]: PB_CptrsmAB: M=64, N=20, Ai=4416, Aj=4416, Bi=4416, Bj=4480, Var=<EF>, Var=0xef, CRIGHT=R, CLEFT=L, notran=0
stdout[46]: PB_CptrsmAB: 1. TRANSA=T, VARIANT=<EF>
stdout[46]: Calling PB_CInV 17
stdout[46]: MPI_Type_vector: count=64, blocklength=64, stride=64, oldtype=1275070475, newtype=-1946157049
stdout[46]: dgerv2d: Mpval(m)=64, Mpval(n)=64, tlda=64, MatTyp=-1946157049
stdout[46]: BI_Srecv: buf=0x1e3e558ac0, N=1, dtype=-1946157049, src=45, msgid=9976
stdout[46]: MPI_Recv calling MPID_Recv with buf=0x1e3e558ac0, count=1, source=45, tag=9976, datatype=-1946157049
stdout[46]: PDTRSM: 8

It is interesting that when GPAW failed with the truncation error, Var was actually a good value (R). However, it needed to be the other value (L).
And, when GPAW did not fail with the truncation error, Var was actually a garbage value (0xEF). Because it was not R, it went down the L path that it needed to.

I don't know what Var should be set to when ForceTop is 0 and the if statement is skipped.
I changed all 4 modules to hard-code ForceTop=1 so it always runs through the if statement and sets Var to a valid value. Not sure if this is correct, but at least Var will be valid.

Here is the code in pdtrsm_.c:
My comments are in red.

/*
* BLACS topologies are enforced iff M and N are strictly greater than the
* logical block size returned by pilaenv_. Otherwise, it is assumed that the
* routine calling this routine has already selected an adequate topology.
*/
ForceTop = ( ( *M > nb ) && ( *N > nb ) );
if (grank==46) printf("ForceTop: *M=%d, nb=%d, *N=%d, ForceTop=%d\n",*M,nb,*N,ForceTop);
ForceTop=1; // This is my "fix" to force it down the if-statement so Var gets set, but certainly not the correct fix. The author will need to determine a correct fix.
if( ForceTop )
{
if( lside )
{
if( notran )
{
OpR = CBCAST; OpC = CBCAST; Var = CRIGHT; // This is where Var gets set.
if( upper ) { TopR = TopC = CTOP_DRING; }
else { TopR = TopC = CTOP_IRING; }
}
else
{
if( ABestL <= ABestR )
{
OpR = CBCAST; OpC = CCOMBINE; Var = CLEFT; // This is where Var gets set.
if( upper ) { TopR = TopC = CTOP_IRING; }
else { TopR = TopC = CTOP_DRING; }
}
else
{
OpR = CBCAST; OpC = CBCAST; Var = CRIGHT; // This is where Var gets set.
if( upper ) { TopR = TopC = CTOP_IRING; }
else { TopR = TopC = CTOP_DRING; }
}
}
}
else
{
if( notran )
{
OpR = CBCAST; OpC = CBCAST; Var = CRIGHT; // This is where Var gets set.
if( upper ) { TopR = TopC = CTOP_IRING; }
else { TopR = TopC = CTOP_DRING; }
}
else
{
if( ABestL <= ABestR )
{
OpR = CCOMBINE; OpC = CBCAST; Var = CLEFT; // This is where Var gets set.
if( upper ) { TopR = TopC = CTOP_DRING; }
else { TopR = TopC = CTOP_IRING; }
}
else
{
OpR = CBCAST; OpC = CBCAST; Var = CRIGHT; // This is where Var gets set.
if( upper ) { TopR = TopC = CTOP_DRING; }
else { TopR = TopC = CTOP_IRING; }
}
}
}

rtop = *PB_Ctop( &ctxt, &OpR, ROW, TOP_GET );
ctop = *PB_Ctop( &ctxt, &OpC, COLUMN, TOP_GET );

if( ( rtopsave = rtop ) != TopR )
rtop = *PB_Ctop( &ctxt, &OpR, ROW, &TopR );
if( ( ctopsave = ctop ) != TopC )
ctop = *PB_Ctop( &ctxt, &OpC, COLUMN, &TopC );
/*
* Remove the next 4 lines when the BLACS combine operations support ring
* topologies
*/
if( OpR == CCOMBINE )
rtop = *PB_Ctop( &ctxt, &OpR, ROW, TOP_DEFAULT );
if( OpC == CCOMBINE )
ctop = *PB_Ctop( &ctxt, &OpC, COLUMN, TOP_DEFAULT );
}
// If we get to here without executing the if-statement logic above, Var is uninitialized.
printf("PDTRSM: 7\n");
printf("PB_CptrsmAB: M=%d, N=%d, Ai=%d, Aj=%d, Bi=%d, Bj=%d, Var=%c, Var=0x%x, CRIGHT=%c, CLEFT=%c, notran=%d\n",*M,*N,Ai,Aj,Bi,Bj,Var,Var,CRIGHT,CLEFT,notran);
PB_CptrsmAB( type, &Var, &SideOp, &UploA, ( notran ? NOTRAN : TRAN ),
&DiagA, *M, *N, ((char *)ALPHA), ((char *)A), Ai, Aj, Ad,
((char *)B), Bi, Bj, Bd );
printf("PDTRSM: 8\n");
/*
* Restore the BLACS topologies when necessary.
*/
if( ForceTop )
{
rtopsave = *PB_Ctop( &ctxt, &OpR, ROW, &rtopsave );
ctopsave = *PB_Ctop( &ctxt, &OpC, COLUMN, &ctopsave );
}
}
naromero
 
Posts: 11
Joined: Wed Jan 14, 2009 7:20 pm

Re: bug in p?trsm

Postby naromero » Thu Aug 11, 2011 10:24 am

It looks like this problem was reported two years ago and a patch was even proposed, but
no ScaLAPACK developer ever validated the patch or responded to this thread
viewtopic.php?f=2&t=588&p=1911&hilit=trsm#p1911
naromero
 
Posts: 11
Joined: Wed Jan 14, 2009 7:20 pm

Re: bug in p?trsm

Postby admin » Tue Aug 16, 2011 10:46 am

Sorry for not responding.
Yes the patch works fine.
It just been integrate into our repository.
ScaLAPACK will undergo a major release this coming year and this will be included in the release.
Thank you very much for pointing it out.
admin
Site Admin
 
Posts: 504
Joined: Wed Dec 08, 2004 7:07 pm


Return to Bug report

Who is online

Users browsing this forum: No registered users and 1 guest