(d/s)getri_outofplace_batched can't handle many matrices

vishwakftw
Posts: 9
Joined: Mon Dec 10, 2018 2:32 am

Post by vishwakftw » Mon Dec 10, 2018 2:43 am

Hi,

Sorry about the ambiguous title.

I filed an issue a month ago about (d/s)getri_outofplace_batched not being able to handle more than 65535 matrices in a batch. This is the link to the issue:
https://bitbucket.org/icl/magma/issues/ ... fails-when

I will present the details in the issue here as well:

Batched getri fails when the batch count (the number of matrices in the batch) is greater than or equal to 65536.

Below are the outputs from the tests:

Single Precision:

% MAGMA 2.3.0 compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 9010. OpenMP threads 40.
% device 0: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 2: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 3: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Nov 3 11:13:45 2018
% Usage: ./testing/testing_sgetri_batched [options] [-h|--help]

% batchCount   N   CPU Gflop/s (ms)   GPU Gflop/s (ms)   ||I - A*A^{-1}||_1 / (N*cond(A))
%=========================================================================================
       65535   2    ---  (  ---  )     0.03 (  41.40)    6.15e-08   ok
       65536   2    ---  (  ---  )     0.03 (  32.24)    1.68e+07   failed
       68523   2    ---  (  ---  )     0.03 (  43.34)    1.68e+07   failed

Double Precision:

% MAGMA 2.3.0 compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 9010. OpenMP threads 40.
% device 0: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 2: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 3: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Nov 3 11:15:12 2018
% Usage: ./testing/testing_dgetri_batched [options] [-h|--help]

% batchCount   N   CPU Gflop/s (ms)   GPU Gflop/s (ms)   ||I - A*A^{-1}||_1 / (N*cond(A))
%=========================================================================================
       65535   2    ---  (  ---  )     0.01 (  81.26)    1.14e-16   ok
       65536   2    ---  (  ---  )     0.02 (  58.56)    9.01e+15   failed
       68523   2    ---  (  ---  )     0.02 (  66.03)    9.01e+15   failed

I passed the option --matrix rand_dominant to ensure that the random matrices generated are not singular by chance.

It would be great if you could provide a solution for this issue or indicate if this is expected behavior. Thank you.

mgates3
Posts: 879
Joined: Fri Jan 06, 2012 2:13 pm

Re: (d/s)getri_outofplace_batched can't handle many matrices

Post by mgates3 » Mon Dec 10, 2018 10:04 am

We'll track this on the bitbucket issue.
-mark

abdelfattah83
Posts: 7
Joined: Mon Dec 10, 2018 3:02 pm

Re: (d/s)getri_outofplace_batched can't handle many matrices

Post by abdelfattah83 » Mon Dec 10, 2018 3:43 pm

This is a known issue for most of the batch routines, not only getri_outofplace_batched.

The explanation is a little low-level. Most of the MAGMA batch kernels use the z dimension of the CUDA grid to get a "batch-ID". This dimension has a maximum value of 65535 on all NVIDIA GPUs (a hardware limitation); beyond that, the kernel fails to launch. A simple solution is to loop over your batch in strides of a fixed size (say, 50,000 problems at a time). We have already started implementing this for the GEMM kernel, but it will take some time before it is available across all routines.

You can write a simple driver that does this for now. As an example, check out the routine "magmablas_zgemm_batched_strided" under magmablas/zgemm_batched.cpp
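
For illustration, here is a minimal sketch of the chunking logic such a driver could use. It is not taken from MAGMA: for_each_chunk and plan_chunks are hypothetical helper names, and the callback stands in for whichever batched routine you call. In a real driver, the callback would presumably invoke the batched routine on the sub-batch by offsetting the device pointer arrays and the info array by "offset" and passing "count" as the batch count; check the routine's actual signature before relying on that.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper (not part of MAGMA): split a batch of `batch_count`
// problems into consecutive chunks of at most `max_chunk` problems and call
// `run_chunk(offset, count)` once per chunk.
// Returns the first nonzero status reported by run_chunk, or 0 on success.
int for_each_chunk(std::int64_t batch_count, std::int64_t max_chunk,
                   const std::function<int(std::int64_t, std::int64_t)>& run_chunk)
{
    assert(max_chunk > 0);
    for (std::int64_t offset = 0; offset < batch_count; offset += max_chunk) {
        // The last chunk may be smaller than max_chunk.
        const std::int64_t count = std::min(max_chunk, batch_count - offset);
        const int status = run_chunk(offset, count);
        if (status != 0) {
            return status;  // stop at the first failing chunk
        }
    }
    return 0;
}

// Demo: record the (offset, count) pairs instead of launching any kernels.
std::vector<std::pair<std::int64_t, std::int64_t>>
plan_chunks(std::int64_t batch_count, std::int64_t max_chunk)
{
    std::vector<std::pair<std::int64_t, std::int64_t>> chunks;
    for_each_chunk(batch_count, max_chunk,
                   [&chunks](std::int64_t offset, std::int64_t count) {
                       chunks.emplace_back(offset, count);
                       return 0;  // a real callback would return the routine's status
                   });
    return chunks;
}
```

With the failing batch size from the first post, plan_chunks(68523, 50000) yields two chunks: one of 50000 problems starting at offset 0, and one of 18523 problems starting at offset 50000. Any chunk size of at most 65535 stays within the grid limit described above.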

--Ahmad
