Further NaN problem with dsgesv_gpu

Postby fletchjp » Sat Feb 05, 2011 2:17 pm

Stan

I have been doing some more testing following on from my last reports. I have seen very few problems since putting in the extra lines to zero the d_X arrays in strsm and dtrsm and I am using that modification.

Unfortunately I am still seeing an isolated error in testing_dsgesv_gpu when running your standard test case. For the first run on starting the computer, and usually only that run, I am seeing the following, with extra prints from dgetrf_gpu:

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc3/testing$ ./testing_dsgesv_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dsgesv_gpu -nrhs 1 -N 1024

Epsilon(double): 1.110223e-16
Epsilon(single): 5.960464e-08

  N   DP-Factor  DP-Solve  SP-Factor  SP-Solve  MP-Solve  ||b-Ax||/||A||  NumIter
==================================================================================
 1024  magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
 22.66    magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 20.02     40.15      37.42      28.17     2.315249e-16      3
 2048  magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
 45.92    magma dgetrf_gpu block size is 64 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 41.30    115.99     114.52      94.27     3.430431e-15      3
 3072  magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
 59.98    magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 55.33    171.94     167.13     144.36     9.016338e-16      3
 4032  magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
 64.36    magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 60.58    250.98     240.78     200.07     4.856945e-13      4
 5184  magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
 67.31    magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 64.36    283.99     282.36     246.75     5.693667e-16      3
 6016  magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
 68.65    magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 66.00    303.16     296.35     259.64     5.912366e-15      4
 7040  magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
 69.60    magma dgetrf_gpu block size is 192 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 67.33    314.65     309.90     278.66     8.323081e-17      4
 7520  magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
 69.38    magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 67.35    316.30     312.29     304.60     0.000000e+00      0
 8064  magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
 70.48    magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 68.56    326.11     321.86     292.30     3.983149e-15      4
 8192  magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
 70.68    magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 68.76    319.02     315.18     292.73     2.570672e-15      3


The critical line is the following, where the error is reported as 0.0 and 0 iterations.

Code:
7520  magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
 69.38    magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 67.35    316.30     312.29     304.60     0.000000e+00      0


I have dug into the code of dsgesv_gpu.cpp and put in some extra prints. With those in place, the offending result looks like this:

Code:
7520  dsgesv_gpu has Rnrm = nan from SP calculation
Rnorm = 0.000000e+00
magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
 69.37    magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 67.40    317.50     313.56     304.91     0.000000e+00      0


The critical print is from inside dsgesv_gpu.cpp at line 238 where I have added this line:

Code:
    printf("dsgesv_gpu has Rnrm = %e from SP calculation\n", Rnrm);
    *iter = 0;
    return MAGMA_SUCCESS;


I have not yet dug further to find out why this is happening. The calling level then reports an erroneous error of 0.

I hope this helps. It happens on start up reliably.

John
fletchjp
 
Posts: 170
Joined: Mon Dec 27, 2010 7:29 pm

Re: Further NaN problem with dsgesv_gpu

Postby Stan Tomov » Mon Feb 14, 2011 5:15 pm

John,
Thanks. I filed an internal bug report to look into this case. I'll post about the resolution when we get it.
Regards,
Stan
Stan Tomov
 
Posts: 247
Joined: Fri Aug 21, 2009 10:39 pm

Re: Further NaN problem with dsgesv_gpu

Postby fletchjp » Wed Feb 16, 2011 1:21 pm

Stan

Thank you. I am very interested in the mixed precision strategy. I think I said once before that I would like a version with the option of operating on the transpose of the matrix (as dgetrs does).

John

Re: Further NaN problem with dsgesv_gpu

Postby Stan Tomov » Fri Feb 18, 2011 2:35 am

John,
Yes, we will add it to the current release.
Thanks,
Stan

Re: Further NaN problem with dsgesv_gpu

Postby fletchjp » Fri Feb 18, 2011 3:17 am

Thank you.

I shall want to try it out through a FORTRAN interface. If you haven't done one yet I will adapt the one that is there already for other routines.

John

Re: Further NaN problem with dsgesv_gpu

Postby fletchjp » Fri Feb 25, 2011 7:12 pm

Stan

I think this may be the same problem I reported for dgetrf: at root, the problem with dtrsm and strsm, since I had not patched strsm_tesla.cu. I have done so now and will check it out. The problem only happens on the first run, so it is difficult to check.

No, it still has the error even with the strsm patched and setting d_X to zero. It has to be something else.

I have dug a bit further and put some more prints into dsgesv_gpu.cpp, as follows:

Code:
    for (i = 0; i < NRHS; i++)
    {
        /* largest-magnitude entry of solution column i (cublasIdamax is 1-based) */
        j = cublasIdamax( N, dX+i*N, 1 );
        cublasGetMatrix( 1, 1, sizeof(double), dX+i*N+j-1, 1, &Xnrmv, 1 );
        Xnrm = lapackf77_dlange( "F", &ione, &ione, &Xnrmv, &ione, NULL );
        printf("dsgesv_gpu for %d max at %d gives Xnrm = %e\n", i, j, Xnrm);

        /* same for column i of dworkd */
        j = cublasIdamax( N, dworkd+i*N, 1 );
        cublasGetMatrix( 1, 1, sizeof(double), dworkd+i*N+j-1, 1, &Rnrmv, 1 );
        Rnrm = lapackf77_dlange( "F", &ione, &ione, &Rnrmv, &ione, NULL );
        printf("dsgesv_gpu for %d max at %d gives Rnrm = %e\n", i, j, Rnrm);

        /* residual too large: go on to the refinement iterations */
        if ( Rnrm > (Xnrm*cte) ) {
            goto L10;
        }
    }
    /* reached only when no column triggered the test above */
    printf("dsgesv_gpu has Rnrm = %e from SP calculation\n", Rnrm);
    *iter = 0;
    return MAGMA_SUCCESS;
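
A note on the test in this loop (an aside, not something confirmed against the MAGMA source): under IEEE 754 arithmetic every ordered comparison involving NaN is false, so `Rnrm > Xnrm*cte` is false when Rnrm is NaN. The loop then falls through to `*iter = 0` and returns success, which would be consistent with the 0 iterations reported for N = 7520. The C sketch below isolates that effect. The function names are hypothetical, and the second function is only one possible NaN-aware variant, not a proposed patch.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* The test in the same form as in the loop: refine only if
   Rnrm > Xnrm*cte. A NaN Rnrm makes the comparison false, so the
   single precision result looks converged. */
bool needs_refinement_naive(double Rnrm, double Xnrm, double cte)
{
    return Rnrm > Xnrm * cte;
}

/* Hypothetical NaN-aware variant: anything that is not provably small
   enough (including NaN) is treated as needing more work. */
bool needs_refinement_safe(double Rnrm, double Xnrm, double cte)
{
    assert(cte >= 0.0);  /* the tolerance is expected to be a valid number */
    return !(Rnrm <= Xnrm * cte);
}
```

With Rnrm = NAN the first function returns false and the second returns true, so a NaN residual would be passed on to refinement (or to a full double precision fallback) instead of being reported as success.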


When I do the first run after starting the computer, I get this, for the 7520 case:

Code:
 7520  magma sgetrf_gpu block size is 192 (magmablas_strsm)
dsgesv_gpu for 0 max at 7518 gives Xnrm = 3.332330e+00
dsgesv_gpu for 0 max at 1 gives Rnrm = nan
dsgesv_gpu has Rnrm = nan from SP calculation
Rnorm = 0.000000e+00
magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
 69.35    magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 67.39    magma sgetrf_gpu block size is 192 (magmablas_strsm)
317.09     magma sgetrf_gpu block size is 192 (magmablas_strsm)
311.13     303.43     0.000000e+00      0


For other runs the output changes to this:

Code:
 7520  magma sgetrf_gpu block size is 192 (magmablas_strsm)
dsgesv_gpu for 0 max at 7518 gives Xnrm = 3.332330e+00
dsgesv_gpu for 0 max at 4498 gives Rnrm = 4.940893e-03
dsgesv_gpu about to start iterations
 1 0 292 9.900568e-06 7518 3.329078e+00
 2 0 4339 3.029361e-08 7518 3.329078e+00
 3 0 705 2.178491e-11 7518 3.329078e+00
Rnorm = 2.188272e-11
magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
 69.35    magma dgetrf_gpu block size is 256 (magmablas_dtrsm)
magma dgetrs_gpu magmablas_dtrsm
 67.40    magma sgetrf_gpu block size is 192 (magmablas_strsm)
316.88     magma sgetrf_gpu block size is 192 (magmablas_strsm)
311.87     286.66     5.686871e-15      3
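
The iteration lines in this log show the residual dropping by a factor of a few hundred per step. For readers new to the mixed precision idea, here is a minimal CPU-only sketch in C: solve in single precision, compute the residual in double, then correct the solution. It is not MAGMA's implementation. The names refine_mixed, solve_single, residual and NMAX are made up for this note, and the sketch redoes the elimination on every solve for brevity, whereas a real routine would factor once and reuse the factors.

```c
#include <assert.h>
#include <math.h>

#define NMAX 8  /* hypothetical size limit, keeps the sketch free of malloc */

/* Solve A x = b (A is n-by-n, row-major) with all arithmetic in float,
   by Gaussian elimination with partial pivoting.
   Returns 0 on success, -1 if a zero pivot is met. */
int solve_single(int n, const double *A, const double *b, double *x)
{
    float a[NMAX][NMAX], r[NMAX], s[NMAX];
    assert(n >= 1 && n <= NMAX);
    for (int i = 0; i < n; i++) {
        r[i] = (float)b[i];
        for (int j = 0; j < n; j++) a[i][j] = (float)A[i * n + j];
    }
    for (int k = 0; k < n; k++) {
        int p = k;                            /* find the pivot row */
        for (int i = k + 1; i < n; i++)
            if (fabsf(a[i][k]) > fabsf(a[p][k])) p = i;
        if (a[p][k] == 0.0f) return -1;
        for (int j = 0; j < n; j++) {         /* swap rows k and p */
            float t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
        }
        { float t = r[k]; r[k] = r[p]; r[p] = t; }
        for (int i = k + 1; i < n; i++) {     /* eliminate below the pivot */
            float m = a[i][k] / a[k][k];
            for (int j = k; j < n; j++) a[i][j] -= m * a[k][j];
            r[i] -= m * r[k];
        }
    }
    for (int i = n - 1; i >= 0; i--) {        /* back substitution */
        float acc = r[i];
        for (int j = i + 1; j < n; j++) acc -= a[i][j] * s[j];
        s[i] = acc / a[i][i];
    }
    for (int i = 0; i < n; i++) x[i] = (double)s[i];
    return 0;
}

/* Store r = b - A x (computed in double) and return its largest
   magnitude entry; a NaN entry makes the result NaN. */
double residual(int n, const double *A, const double *b,
                const double *x, double *r)
{
    double nrm = 0.0;
    for (int i = 0; i < n; i++) {
        double acc = b[i];
        for (int j = 0; j < n; j++) acc -= A[i * n + j] * x[j];
        r[i] = acc;
        if (fabs(acc) > nrm || isnan(acc)) nrm = fabs(acc);
    }
    return nrm;
}

/* Returns the number of refinement steps taken, or -1 on failure
   (zero pivot, NaN residual, or no convergence within max_iter steps). */
int refine_mixed(int n, const double *A, const double *b, double *x,
                 double tol, int max_iter)
{
    double r[NMAX], d[NMAX];
    if (solve_single(n, A, b, x) != 0) return -1;
    for (int it = 0; it <= max_iter; it++) {
        double rn = residual(n, A, b, x, r);
        if (isnan(rn)) return -1;             /* never report NaN as converged */
        if (rn <= tol) return it;
        if (it == max_iter) break;
        if (solve_single(n, A, r, d) != 0) return -1;  /* correction in float */
        for (int i = 0; i < n; i++) x[i] += d[i];      /* update in double */
    }
    return -1;
}
```

Note that this sketch returns a failure code when the residual is NaN rather than treating it as converged.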


Something is getting messed up in dworkd.
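
One way to narrow this down (a suggestion, not something tried in this thread) would be to copy dworkd back to the host right after the single precision solve and look for the first non-finite entry. The helper below is a plain C sketch. The name first_nonfinite is hypothetical, and the host buffer is assumed to have been filled by a device-to-host copy such as cublasGetVector.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Return the 0-based index of the first NaN or Inf in v[0..n-1],
   or -1 if every entry is finite. */
long first_nonfinite(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!isfinite(v[i])) return (long)i;
    return -1;
}
```

In the first-run log the value fetched at (1-based) index 1 is NaN, so the first entry of dworkd is itself NaN. A scan like this would show whether the rest of the vector is affected too.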

John

Re: Further NaN problem with dsgesv_gpu

Postby fletchjp » Tue Mar 08, 2011 6:04 pm

This seems not to have been fixed in RC4?

I have the following results with RC4. The clue is the 0 iterations at 7520.

John

Code:
fletcher@fletcher-desktop:~/magma_1.0.0-rc4/testing$ ./testing_dsgesv_gpu
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dsgesv_gpu -nrhs 1 -N 1024

Epsilon(double): 1.110223e-16
Epsilon(single): 5.960464e-08

  N   DP-Factor  DP-Solve  SP-Factor  SP-Solve  MP-Solve  ||b-Ax||/||A||  NumIter
==================================================================================
 1024   26.11     21.07     43.88      40.13      20.45     2.315249e-16      3
 2048   47.89     43.07    125.63     117.58      97.53     3.430431e-15      3
 3072   60.54     56.08    177.66     171.72     145.87     9.016338e-16      3
 4032   64.75     60.98    244.94     236.53     196.64     4.856945e-13      4
 5184   67.23     64.51    281.01     273.39     241.51     5.693667e-16      3
 6016   68.65     66.11    293.47     287.88     255.15     5.912366e-15      4
 7040   69.65     67.39    304.17     299.48     270.32     8.323081e-17      4
 7520   69.55     67.54    306.82     302.86     294.03     0.000000e+00      0
 8064   70.57     68.70    316.72     312.45     284.81     3.983149e-15      4
 8192   70.54     68.65    310.01     306.66     286.27     2.570672e-15      3

Re: Further NaN problem with dsgesv_gpu

Postby mateo70 » Tue Mar 08, 2011 8:23 pm

Hi,

Noted. I will try to get it fixed.

Mathieu
mateo70
 
Posts: 41
Joined: Tue Mar 08, 2011 12:38 pm

Re: Further NaN problem with dsgesv_gpu

Postby fletchjp » Wed Mar 09, 2011 5:53 am

I should have said that it only occurs on the first run. I think it is unset memory somewhere which, once set (to zero?), no longer matters.
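
To illustrate why unset memory could produce this symptom (an illustration of the hypothesis, not a diagnosis): a fresh allocation holds whatever bits were left there, and some bit patterns decode as NaN under IEEE 754, while all-zero bytes decode as 0.0. The C sketch below imitates that on the host, assuming IEEE 754 doubles. fill_with_byte and count_nan are hypothetical helpers. On the device, the analogous zeroing step would be a cudaMemset on the workspace right after allocation.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Overwrite n doubles with a repeated byte, standing in for whatever
   bits a fresh allocation might happen to contain. */
void fill_with_byte(double *v, size_t n, unsigned char byte)
{
    assert(v != NULL || n == 0);
    if (n == 0) return;
    memset(v, byte, n * sizeof(double));
}

/* Count the NaN entries in v[0..n-1]. */
size_t count_nan(const double *v, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (isnan(v[i])) c++;
    return c;
}
```

Filling a buffer with 0xFF bytes makes every entry a NaN, while filling it with 0x00 bytes leaves none, which fits a problem that disappears once the memory has been written.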

Re: Further NaN problem with dsgesv_gpu

Postby fletchjp » Mon Apr 18, 2011 7:10 am

This problem still exists in RC5.

John
