Any suggestions to help solve these problems are appreciated:

1) Compiling and running with an AMD Sempron CPU:

testing_Xgesv_gpu, with N not an integer multiple of 4, yields a relative compute error of about 5x10^-3, regardless of X=z/d/c/s. (All have the appropriate error, 10^-17 or 10^-8, when N is a multiple of 4.) I added a LAPACK sgesv solution to the testing code and found that the sgesv_gpu solution differs substantially from the LAPACK sgesv solution when N is not a multiple of 4 - hence the 5x10^-3 relative error.

My clMAGMA build is detailed in this post: viewtopic.php?f=2&t=727

2) Compiling and running with an Intel Celeron CPU (everything else otherwise supposedly the same):

testing_Xgesvd (X=z,c,d,s) all run fine, with relative compute error 10^-16 (z/d) or 10^-7 (c/s).

"testing_zgesv_gpu.exe -N n -R r", with n = 1, 2, 4, 6, 8, ..., 30, 32, or 64, 128, 256, 512, 1024, etc., runs fine, with a relative compute error of 10^-16.

But "testing_zgesv_gpu.exe -N n -R r", with n = 3, 5, 7, 9, ..., 33, or 34-63, or 65-127, or 129-511, etc., yields a relative compute error of 1.#Re+000 (not a printable number).

"testing_Xgesv_gpu.exe -N n -R r", with X=c/d/s and n = 1, 2, or 3, all yield the appropriate 10^-8/10^-17/10^-8, but with n > 3 they report a relative compute error of "1.#Re+000". I added some diagnostics to the testing_sgesv_gpu program, and it appears that only the 4th row of x (Ax=b) is transferred from the GPU back to the host when N is from 4 to 7, only the 8th row when N is from 8 to 11, etc. There's a pattern there. When only one row of the solution matrix has valid data, the final result involving the entire solution matrix is nonsense and the result unprintable (1.#Re+000).

"testing_zgetrf_gpu.exe -M m -N n", with any m and n, yields a relative compute error of 3e-2, but

"testing_Xgetrf_gpu.exe -M m -N n", with X=c/d/s and any m and n, yields a good relative compute error of 10^-9/10^-18/10^-9, or zero.

Also, "testing_Xpotrf_gpu.exe -N n", with any n and X=z/d/c/s, yields relative compute errors ranging from about 1 to 100, up to QNAN. (This might be the same problem mentioned above, i.e., the transfer of the GPU solution back from GPU to host.)

Update Sept 2013:

I resolved the problems with the above-mentioned "z" routines by doing the following three things:

1) Restored the missing "|" symbol (bitwise OR) on line 43 of ztranspose2.cpp. [This also applies to all Xtranspose2.cpp files.]

2) In ztranspose-v2.cl [and all other X types as well], I limited the nesting of the IF statements to no more than four deep; it was five deep. (This may be a hardware-dependent issue: it mattered when compiling the OpenCL on a system with an Intel CPU, but not on another with an AMD CPU.)

3) In ztranspose.cl I found that, on some systems (Intel), the use of integer multiplication in OpenCL to compute the array indices appears to cause corrupt memory references. Instead of "A[ 24 * lda ]" I used "lda8 = lda * 8; lda16 = lda8 * 2; lda24 = lda16 + lda8;" followed by use of A[ lda24 ], and a similar replacement for A[ 16 * lda ]. Strange, but it works.

I've traced the problems with the "s" routines to an apparent bug in the clAmdBlas STRSM routine, which sgetrf_gpu.cpp and sgetrs_gpu.cpp call via magma_strsm(), which is a wrapper for clAmdBlasStrsmEx(). I've posted my observations on the AMD Developer Central forum: http://devgurus.amd.com/message/1299829#1299829