[Bug?] spotrf_batched fails for batch size > 524280 and dpotrf_batched fails for batch size > 262140

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
vishwakftw
Posts: 11
Joined: Mon Dec 10, 2018 2:32 am


Post by vishwakftw » Sat Jul 13, 2019 11:59 pm

Hi,

I found that the spotrf_batched routine fails for batch sizes > 524280, and the dpotrf_batched routine fails for batch sizes > 262140. I confirmed this issue with the test suite using the following commands:

Code:

./testing_spotrf_batched --batch 524281 -n 2 --check --matrix rand_dominant
and

Code:

./testing_dpotrf_batched --batch 262141 -n 2 --check --matrix rand_dominant

Code:

% MAGMA 2.5.0 svn compiled for CUDA capability >= 5.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10000, driver 10010. OpenMP threads 4. 
% device 0: GeForce 940M, 1176.0 MHz clock, 2004.5 MiB memory, capability 5.0
% Sun Jul 14 09:15:41 2019
% Usage: ./testing_spotrf_batched [options] [-h|--help]

% BatchCount   N    CPU Gflop/s (ms)    GPU Gflop/s (ms)   ||R_magma - R_lapack||_F / ||R_lapack||_F
%===================================================================================================
    524281     2      0.01 ( 207.07)    119.51 (   0.02)   7.76e-01   failed
The tests pass for batch size 524280, however.

Code:

% MAGMA 2.5.0 svn compiled for CUDA capability >= 5.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10000, driver 10010. OpenMP threads 4. 
% device 0: GeForce 940M, 1176.0 MHz clock, 2004.5 MiB memory, capability 5.0
% Sun Jul 14 09:15:27 2019
% Usage: ./testing_spotrf_batched [options] [-h|--help]

% BatchCount   N    CPU Gflop/s (ms)    GPU Gflop/s (ms)   ||R_magma - R_lapack||_F / ||R_lapack||_F
%===================================================================================================
    524280     2      0.01 ( 207.92)      0.42 (   6.24)   1.49e-07   ok
Similarly, for dpotrf_batched:

Code:

% MAGMA 2.5.0 svn compiled for CUDA capability >= 5.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10000, driver 10010. OpenMP threads 4. 
% device 0: GeForce 940M, 1176.0 MHz clock, 2004.5 MiB memory, capability 5.0
% Sun Jul 14 09:25:34 2019
% Usage: ./testing_dpotrf_batched [options] [-h|--help]

% BatchCount   N    CPU Gflop/s (ms)    GPU Gflop/s (ms)   ||R_magma - R_lapack||_F / ||R_lapack||_F
%===================================================================================================
    262141     2      0.01 (  98.74)     65.45 (   0.02)   7.76e-01   failed
The tests pass for batch size 262140, however.

Code:

% MAGMA 2.5.0 svn compiled for CUDA capability >= 5.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10000, driver 10010. OpenMP threads 4. 
% device 0: GeForce 940M, 1176.0 MHz clock, 2004.5 MiB memory, capability 5.0
% Sun Jul 14 09:25:28 2019
% Usage: ./testing_dpotrf_batched [options] [-h|--help]

% BatchCount   N    CPU Gflop/s (ms)    GPU Gflop/s (ms)   ||R_magma - R_lapack||_F / ||R_lapack||_F
%===================================================================================================
    262140     2      0.01 ( 103.36)      0.13 (   9.93)   2.71e-16   ok

abdelfattah83
Posts: 8
Joined: Mon Dec 10, 2018 3:02 pm

Re: [Bug?] spotrf_batched fails for batch size > 524280 and dpotrf_batched fails for batch size > 262140

Post by abdelfattah83 » Sat Jul 20, 2019 8:34 pm

Most of the MAGMA batched kernels use the z-dimension of the kernel grid for batching across different problems. The maximum value of this dimension is 65535 (a hardware limit of the GPU itself). Depending on the kernel configuration, a large enough batch size can exceed this limit.

If you profile the tests you posted with nvprof using --print-gpu-trace, you will see that the successful runs use the maximum value for the z-dimension of the grid. The failed runs exceed this limit, so the batched kernel is never launched.

We have started fixing this issue for some kernels (e.g. batched GEMM), but the fix has not been propagated to the other routines yet.

--Ahmad
