I have done some more testing, following on from my last reports. Since adding the extra lines that zero the d_X arrays in strsm and dtrsm I have seen very few problems, and I am keeping that modification.
Unfortunately I am still seeing an isolated error in testing_dsgesv_gpu when running your standard test case. On the first run after starting the computer, and usually only on that run, I see the following, with extra prints from dgetrf_gpu:
fletcher@fletcher-desktop:~/magma_1.0.0-rc3/testing$ ./testing_dsgesv_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory
Usage:
testing_dsgesv_gpu -nrhs 1 -N 1024
Epsilon(double): 1.110223e-16
Epsilon(single): 5.960464e-08
N DP-Factor DP-Solve SP-Factor SP-Solve MP-Solve ||b-Ax||/||A|| NumIter
==================================================================================
1024 magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
22.66 magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
20.02 40.15 37.42 28.17 2.315249e-16 3
2048 magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
45.92 magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
41.30 115.99 114.52 94.27 3.430431e-15 3
3072 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
59.98 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
55.33 171.94 167.13 144.36 9.016338e-16 3
4032 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
64.36 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
60.58 250.98 240.78 200.07 4.856945e-13 4
5184 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
67.31 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
64.36 283.99 282.36 246.75 5.693667e-16 3
6016 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
68.65 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
66.00 303.16 296.35 259.64 5.912366e-15 4
7040 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
69.60 magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
67.33 314.65 309.90 278.66 8.323081e-17 4
7520 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
69.38 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
67.35 316.30 312.29 304.60 0.000000e+00 0
8064 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
70.48 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
68.56 326.11 321.86 292.30 3.983149e-15 4
8192 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
70.68 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
68.76 319.02 315.18 292.73 2.570672e-15 3
The critical entry is the one for N = 7520, where the error is reported as 0.000000e+00 with 0 iterations:
7520 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
69.38 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
67.35 316.30 312.29 304.60 0.000000e+00 0
I have dug into the code of dsgesv_gpu.cpp and put in some extra prints. With those in place, the offending result looks like this:
7520 dsgesv_gpu has Rnrm = nan from SP calculation
Rnorm = 0.000000e+00
magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
69.37 magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
67.40 317.50 313.56 304.91 0.000000e+00 0
The critical print comes from inside dsgesv_gpu.cpp at line 238, where I have added the printf shown here, just ahead of the *iter = 0 and return statements:
printf("dsgesv_gpu has Rnrm = %e from SP calculation\n",Rnrm);
*iter = 0;
return MAGMA_SUCCESS;
I have not yet dug further to find out why Rnrm comes out as nan. The calling level then reports an erroneous error of 0.
I hope this helps. The failure shows up reliably on the first run after start-up.
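One possible reason the nan goes unnoticed, offered as a guess rather than something I have confirmed in the source: if the convergence test compares Rnrm against a threshold with ">", then a nan residual always fails the test, because every ordered comparison with nan is false, and the routine falls through to the "converged with 0 iterations" return. The sketch below shows that behaviour in plain C, together with a guarded variant. The function names and the arguments xnrm and cte are hypothetical and are not taken from MAGMA.

```c
#include <math.h>

/* Hypothetical stand-in for an unguarded convergence test.
   Returns 1 if refinement is needed, 0 if the solution is accepted.
   A nan residual makes the comparison false, so nan is accepted. */
int needs_refinement_unguarded(double rnrm, double xnrm, double cte)
{
    return rnrm > xnrm * cte;
}

/* Hypothetical guarded variant.
   Returns -1 if the residual norm is not finite,
            1 if refinement is needed,
            0 if the solution is accepted. */
int residual_status(double rnrm, double xnrm, double cte)
{
    if (isnan(rnrm) || isinf(rnrm))
        return -1;              /* report the bad residual instead of accepting it */
    return (rnrm > xnrm * cte) ? 1 : 0;
}
```

With a guard of this kind the caller could treat -1 as a failure of the single precision step and fall back to the double precision factorization, instead of returning success with iter = 0.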
John