Different checksum with >1 gpu (testing_dgetrf_mgpu)

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
roalmar2
Posts: 20
Joined: Thu Jul 03, 2014 6:06 am

Post by roalmar2 » Tue Jul 29, 2014 6:03 am

Hello,

I modified testing_dgetrf_mgpu.cpp to compute a checksum, in order to check that I am using the routine correctly.

Code:

...
// checksum
// LAPACK and MAGMA
    double max_double = 1073741824; // 2^30 (2^52)
    double auxl= 0.0;
    double auxm= 0.0;
    double res_lap = 0.0;
    double res_mag = 0.0;
    double *aux = NULL;
    double *h_lA[ngpu];
    //aux_ldn_local

for (int k = 0; k < ngpu; k++)
{
    TESTING_MALLOC_CPU(h_lA[k], double, ldda*aux_ldn_local[k]);
    cudaMemcpy(h_lA[k], d_lA[k], ldda*aux_ldn_local[k]*sizeof(double), cudaMemcpyDeviceToHost);

    for (int i = 0; i < ldda; i++)
    {
        for (int j = 0; j < (int)aux_ldn_local[k]; j++)
        {
            if (auxl > max_double)
            {
                res_lap /= 100;
            }
            res_lap += res_lap + h_A[(int)aux_ldn_local[k]*i+j];
            auxl = res_lap;

            if (auxm > max_double)
            {
                res_mag /= 100;
            }
            res_mag += res_mag + h_lA[k][(int)aux_ldn_local[k]*i+j]; // NOTE: watch out
            auxm = res_mag;
        }
    }
    TESTING_FREE_CPU(h_lA[k]);
}
    printf("---- Checksum lapack: %f  Checsum magma: %f.\n", res_lap, res_mag);
...
When I run it with 1 GPU, the values are the same:

Code:

Usage: ./testing_dgetrf_mgpu [options] [-h|--help]

ngpu 1
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   |PA-LU|/(N*|A|)
=========================================================================
---- Checksum lapack: 175129168.103157  Checsum magma: 175129168.103157.
 1088  1088    ---   (  ---  )     32.23 (   0.03)     ---
---- Checksum lapack: 1907947604.779820  Checsum magma: 1907947604.779820.
 2112  2112    ---   (  ---  )    164.75 (   0.04)     ---
---- Checksum lapack: 136711177.014838  Checsum magma: 136711177.014838.
 3136  3136    ---   (  ---  )    268.83 (   0.08)     ---
---- Checksum lapack: 344441255.846596  Checsum magma: 344441255.846596.
 4160  4160    ---   (  ---  )    392.56 (   0.12)     ---
---- Checksum lapack: 370879087.794721  Checsum magma: 370879087.794721.
 5184  5184    ---   (  ---  )    494.81 (   0.19)     ---
---- Checksum lapack: 164429408.520023  Checsum magma: 164429408.520023.
 6208  6208    ---   (  ---  )    574.89 (   0.28)     ---
---- Checksum lapack: 39557278.528352  Checsum magma: 39557278.528352.
 7232  7232    ---   (  ---  )    635.84 (   0.40)     ---
---- Checksum lapack: 461207990.882663  Checsum magma: 461207990.882663.
 8256  8256    ---   (  ---  )    667.69 (   0.56)     ---
---- Checksum lapack: 2081259705.950725  Checsum magma: 2081259705.950725.
 9280  9280    ---   (  ---  )    713.96 (   0.75)     ---
---- Checksum lapack: 46592603.134620  Checsum magma: 46592603.134620.
10304 10304    ---   (  ---  )    742.24 (   0.98)     ---
But with 4 GPUs the values differ a lot:

Code:

  M=N    Checksum lapack      Checksum magma
 1088    175147069.830752     175122108.075274
 2112    1908554751.7443      1907798223.11236
 3136    136765029.786001     136716184.294073
 4160    344726468.832647     344439123.707522
 5184    371342616.55389      370977669.367096
 6208    164683471.849087     164421714.434309
 7232    39635677.068431      39541368.874261
 8256    461691038.783903     460646834.479622
 9280    2088480297.71815     2080143772.2736
10304    46703953.877662      46585143.384757
It could be due to round-off in the additions, but the values are very different.

Any idea? Can I trust these values as correct? Why are there such differences? What accuracy should I expect?

Thanks a lot!

mgates3
Posts: 916
Joined: Fri Jan 06, 2012 2:13 pm

Post by mgates3 » Mon Aug 04, 2014 3:38 pm

Due to rounding differences, it is possible that MAGMA uses different pivots than LAPACK. In that case, the L and U factors will be different, but both pass the check, PA - LU.
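
To illustrate the point about pivots, here is a minimal, self-contained sketch in plain C++ (not MAGMA or LAPACK code). It factors a small matrix twice with row pivoting. The `prefer_last_tie` flag is a hypothetical knob that breaks a tie between equal-magnitude pivot candidates differently, standing in for a rounding-induced pivot change. The packed L and U factors, and therefore any checksum over them, come out different, yet both satisfy PA = LU to rounding error. All function names here are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// LU with row pivoting. Returns L (unit diagonal, stored below the diagonal)
// and U (on and above the diagonal) packed in one matrix.
// perm[i] is the original row of A that ends up in row i of PA.
// prefer_last_tie (hypothetical): among equal-magnitude pivot candidates,
// pick the last one instead of the first.
Matrix lu_factor(Matrix a, std::vector<int>& perm, bool prefer_last_tie) {
    const int n = static_cast<int>(a.size());
    perm.resize(n);
    for (int i = 0; i < n; ++i) perm[i] = i;
    for (int k = 0; k < n; ++k) {
        int p = k;
        for (int i = k + 1; i < n; ++i) {
            const double cand = std::fabs(a[i][k]);
            const double best = std::fabs(a[p][k]);
            if (cand > best || (prefer_last_tie && cand == best)) p = i;
        }
        assert(a[p][k] != 0.0);  // sketch only: singular matrices are not handled
        std::swap(a[k], a[p]);
        std::swap(perm[k], perm[p]);
        for (int i = k + 1; i < n; ++i) {
            a[i][k] /= a[k][k];  // multiplier, stored in the L part
            for (int j = k + 1; j < n; ++j) a[i][j] -= a[i][k] * a[k][j];
        }
    }
    return a;
}

// Largest entry of |PA - LU|, rebuilt from the packed factors.
double residual(const Matrix& a, const Matrix& lu, const std::vector<int>& perm) {
    const int n = static_cast<int>(a.size());
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k <= std::min(i, j); ++k)
                sum += (k == i ? 1.0 : lu[i][k]) * lu[k][j];
            worst = std::max(worst, std::fabs(a[perm[i]][j] - sum));
        }
    }
    return worst;
}

// Plain sum of all entries of the packed factors.
double checksum(const Matrix& m) {
    double s = 0.0;
    for (const auto& row : m)
        for (double v : row) s += v;
    return s;
}
```

For example, with A = {{4, 3, 2}, {-4, 1, 5}, {2, 7, 1}} the first column has two candidates of magnitude 4. Calling `lu_factor` with `prefer_last_tie` set to false and then true picks different first pivots, gives different checksums, and `residual` stays at rounding level in both cases.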

Please run with the -c flag to check the answers. (I'm not saying there is or is not a bug, just that -c is the correct way to verify the code.)

Please post the complete command line and complete output. This provides valuable information when trying to debug issues. For instance:

Code:

magma-trunk/testing> ./testing_dgetrf_mgpu -N 5000 --ngpu 3 -c
MAGMA 1.4.0 svn compiled for CUDA capability >= 3.5
CUDA runtime 6000, driver 6050. OpenMP threads 8. MKL 11.1.2, MKL threads 8. 
device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 2: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
Usage: ./testing_dgetrf_mgpu [options] [-h|--help]

ngpu 3
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   |PA-LU|/(N*|A|)
=========================================================================
 5000  5000    ---   (  ---  )    467.70 (   0.18)   2.29e-18   ok
-mark

roalmar2
Posts: 20
Joined: Thu Jul 03, 2014 6:06 am

Post by roalmar2 » Tue Aug 05, 2014 3:46 am

Ok, the output is this one:

Code:


./testing_dgetrf_mgpu -c -N 5000 --ngpu 8 
MAGMA 1.4.1 , compiled for CUDA capability >= 1.0
device 0: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 1: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 2: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 3: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 4: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 5: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 6: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 7: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
Usage: ./testing_dgetrf_mgpu [options] [-h|--help]

ngpu 8
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   |PA-LU|/(N*|A|)
=========================================================================
 5000  5000    ---   (  ---  )     28.49 (   2.92)   2.29e-18

Code:


[roalmar2@mlxc11 testing]$ ./testing_dgetrf_mgpu -c2 -N 5000 --ngpu 8 
MAGMA 1.4.1 , compiled for CUDA capability >= 1.0
device 0: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 1: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 2: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 3: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 4: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 5: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 6: Tesla K20m, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
device 7: Tesla K20c, 705.5 MHz clock, 4799.6 MB memory, capability 3.5
Usage: ./testing_dgetrf_mgpu [options] [-h|--help]

ngpu 8
    M     N   CPU GFlop/s (sec)   GPU GFlop/s (sec)   |Ax-b|/(N*|A|*|x|)
=========================================================================
 5000  5000    ---   (  ---  )     29.55 (   2.82)   2.38e-19

But my output does not print "ok" or "failed". Why? Does that mean it failed?

Thanks

roalmar2
Posts: 20
Joined: Thu Jul 03, 2014 6:06 am

Post by roalmar2 » Tue Aug 05, 2014 9:03 am

One question: when used with the multi-GPU (mgpu) testers, does the reported GFlop/s value show the GFlop/s of 1 GPU, or the total over all GPUs?

Thanks

mgates3
Posts: 916
Joined: Fri Jan 06, 2012 2:13 pm

Post by mgates3 » Thu Aug 07, 2014 4:45 pm

Those results are accurate, being near machine precision for double (1e-16). The newer versions of MAGMA print "ok" or "failed", though this is based on a tolerance, so occasionally a test can "fail" and still be acceptable. In this case, both -c and -c2 are fine ways to check results; -c2 is faster and needs less memory.
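
As a rough sketch of what a tolerance-based "ok"/"failed" decision can look like: the scaled residual is compared against a small multiple of the unit roundoff for double. The function name and the `factor` parameter are assumptions made for this example; they are not taken from the MAGMA source, so check the tester of your MAGMA version for the rule it actually applies.

```cpp
#include <cassert>
#include <limits>

// Unit roundoff for double, about 1.1e-16 (half of numeric_limits epsilon).
constexpr double kUnitRoundoff = std::numeric_limits<double>::epsilon() / 2;

// Hypothetical pass/fail rule: the scaled residual must be below
// factor * unit roundoff. The factor is an illustrative assumption.
bool residual_ok(double scaled_residual, double factor) {
    assert(scaled_residual >= 0.0 && factor > 0.0);
    return scaled_residual < factor * kUnitRoundoff;
}
```

With any factor of order 1 to 100, the residuals posted above (2.29e-18 and 2.38e-19) pass comfortably, while a residual such as 1e-10 would not.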

The Gflop/s is the total performance of all CPUs + GPUs that MAGMA uses. It is simply (2/3) n^3 / time. (Approximately; the actual flops formula is a bit more complicated, but that is the dominant term.)
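
A small sketch of that arithmetic, using only the dominant (2/3) n^3 term mentioned above. The function names are made up for the example, and the lower-order terms of the real flop count are deliberately left out.

```cpp
#include <cassert>

// Dominant-term flop count for the LU factorization of an n x n matrix.
double getrf_flops_approx(double n) {
    return 2.0 / 3.0 * n * n * n;
}

// GFlop/s from problem size and wall-clock time in seconds.
double gflops_per_sec(double n, double seconds) {
    assert(seconds > 0.0);
    return getrf_flops_approx(n) / seconds / 1e9;
}
```

For N = 5000 and the 2.92 s reported in the 8-GPU run above, this gives roughly 28.5 GFlop/s, close to the 28.49 printed by the tester. The small gap is consistent with the omitted lower-order terms and with the printed time being rounded to two decimals.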

A 5000 x 5000 problem size is not sufficiently large to be efficiently split across 8 GPUs. You should try scaling with 1, 2, ..., 8 GPUs to see how many can efficiently solve the problem.

-mark

roalmar2
Posts: 20
Joined: Thu Jul 03, 2014 6:06 am

Post by roalmar2 » Wed Sep 10, 2014 10:33 am

Hello,

Mark, when you said:
The Gflop/s is the total performance of all CPUs + GPUs that MAGMA uses. It is simply (2/3) n^3 / time.
what exactly did you mean by "time"?

Options:
- time initialize cuda driver + data transfer host-device + kernel + data transfer device-host
- time initialize cuda driver + kernel
- time data transfer host-device + kernel + data transfer device-host
- time kernel
- another option (which one?)

Thank you very much!! ^_^

mgates3
Posts: 916
Joined: Fri Jan 06, 2012 2:13 pm

Post by mgates3 » Thu Sep 11, 2014 4:59 pm

Wall-clock time for the MAGMA function itself. For example, from testing_zgetrf.cpp:

Code:

            gpu_time = magma_wtime();
            magma_zgetrf( M, N, h_A, lda, ipiv, &info);
            gpu_time = magma_wtime() - gpu_time;
            gpu_perf = gflops / gpu_time;
In this case, since that is a CPU interface, the time includes allocating GPU memory and transferring the matrix to the GPU, which both occur inside magma_zgetrf. The GPU interface (magma_zgetrf_gpu) has an advantage that the matrix is already allocated on the GPU, so there is no upfront transfer time.

For BLAS and auxiliary routines, which are normally asynchronous, we do a sync before and after. From testing_zgemm.cpp:

Code:

            magma_time = magma_sync_wtime( NULL );
            magmablas_zgemm( opts.transA, opts.transB, M, N, K, alpha, d_A, ldda, d_B, lddb, beta,  d_C, lddc );
            magma_time = magma_sync_wtime( NULL ) - magma_time;
            magma_perf = gflops / magma_time;
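
To see why the sync matters when timing asynchronous work, here is a CPU-only analogy in standard C++ (no MAGMA or CUDA calls). A worker thread that sleeps plays the role of an asynchronous GPU routine, and `join()` plays the role of the sync. The names `Timings` and `time_async_work` are invented for this sketch. Depending on the toolchain, linking may need `-pthread`.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Seconds elapsed since t0 on a monotonic wall clock.
double seconds_since(Clock::time_point t0) {
    return std::chrono::duration<double>(Clock::now() - t0).count();
}

struct Timings {
    double without_sync;  // timer stopped right after the launch returns
    double with_sync;     // timer stopped after waiting for completion
};

// Launch "asynchronous work" (a thread sleeping for work_ms) and time it both ways.
Timings time_async_work(int work_ms) {
    assert(work_ms >= 0);
    const auto t0 = Clock::now();
    std::thread worker([work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
    });
    const double launch_only = seconds_since(t0);  // measures only the launch
    worker.join();                                 // stands in for the sync
    const double synced = seconds_since(t0);       // measures the real work
    return {launch_only, synced};
}
```

Calling `time_async_work(100)` typically reports a `without_sync` time far below 0.1 s and a `with_sync` time of at least 0.1 s, which is the gap a missing sync would hide in a GFlop/s figure.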
-mark
