Best solution for solving hundreds of small linear systems

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
al12rs
Posts: 2
Joined: Wed Oct 02, 2019 7:09 am

Best solution for solving hundreds of small linear systems

Post by al12rs » Wed Oct 02, 2019 7:30 am

Hi,
I would like to ask what would be the best approach for my particular problem:

I need to solve 100-1000 small (2x2 to 100x100) linear systems as efficiently as possible, as part of a bigger HPC problem.
The most common cases will be small 2x2, 3x3, or 4x4 linear systems; bigger systems are possible but increasingly unlikely.
The number of linear systems, on the other hand, should stay consistent on a given installation: typically around 500, somewhere in the 100-1000 range.

I was looking for a high-performance GPU-based solution to compare against traditional CPU-cluster solutions, but I was not sure whether MAGMA offers anything efficient for batches of very small matrices under 100x100.

abdelfattah83
Posts: 8
Joined: Mon Dec 10, 2018 3:02 pm

Re: Best solution for solving hundreds of small linear systems

Post by abdelfattah83 » Thu Oct 03, 2019 12:54 pm

There is a batch routine for solving many small linear systems (magma_Xgesv_batched), where X specifies the precision (s, d, c, z).

The routine applies LU factorization with partial pivoting to the input matrices, then applies the row interchanges to the right-hand sides and performs the triangular solves. For very small matrices, though, there would definitely be a benefit in fusing all of these steps into one GPU kernel, which MAGMA does not currently implement.
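For prototyping or a quick CPU reference, the effect of a batched solve (one LU factorize-and-solve per matrix in the batch) can be sketched with NumPy, whose `numpy.linalg.solve` broadcasts over a stacked array of matrices. This is only an illustration of the batched-solve idea, not the MAGMA API:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n = 500, 4

# Stack of 500 small systems A_i x_i = b_i; the diagonal shift keeps
# the random matrices comfortably non-singular for this illustration.
A = rng.standard_normal((batch, n, n)) + n * np.eye(n)
b = rng.standard_normal((batch, n, 1))

# numpy.linalg.solve broadcasts over the leading batch dimension,
# performing one LU solve per matrix -- the same work a GPU batched
# routine distributes across thread blocks.
x = np.linalg.solve(A, b)

# Each solution satisfies its own system.
print(np.allclose(A @ x, b))  # True
```

The same batch/n/nrhs layout maps directly onto the arguments of magma_dgesv_batched, with the matrices and right-hand sides living in GPU memory.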

Here are sample results for magma_dgesv_batched against a CPU-based solution (MKL dgesv + OpenMP). The CPU is a 20-core Haswell architecture. The GPU is a Tesla V100 PCIe.

Code:

./testing_dgesv_batched -N 10:100:10 --nrhs 1 --batch 500 --niter 1 -l
% MAGMA 2.5.0 svn compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10010, driver 10010. OpenMP threads 40. MKL 2018.0.1, MKL threads 20. 
% device 0: Tesla V100-PCIE-16GB, 1380.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Thu Oct  3 12:44:41 2019
% Usage: ./testing_dgesv_batched [options] [-h|--help]

% BatchCount   N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
%=======================================================================
       500     10     1      0.21 (   0.00)      2.61 (   0.00)   1.03e-17   ok
       500     20     1      4.24 (   0.00)     25.75 (   0.00)   5.10e-18   ok
       500     30     1      5.35 (   0.00)     78.02 (   0.00)   3.38e-18   ok
       500     40     1     28.56 (   0.00)     53.75 (   0.00)   3.08e-18   ok
       500     50     1     38.36 (   0.00)     84.90 (   0.00)   2.21e-18   ok
       500     60     1     55.46 (   0.00)    129.25 (   0.00)   2.23e-18   ok
       500     70     1     68.17 (   0.00)    100.43 (   0.00)   2.12e-18   ok
       500     80     1     60.25 (   0.00)    114.92 (   0.00)   1.69e-18   ok
       500     90     1     68.59 (   0.00)    128.80 (   0.00)   1.63e-18   ok
       500    100     1     84.74 (   0.00)    158.36 (   0.00)   1.65e-18   ok

al12rs
Posts: 2
Joined: Wed Oct 02, 2019 7:09 am

Re: Best solution for solving hundreds of small linear systems

Post by al12rs » Fri Oct 04, 2019 7:55 pm

Thank you very much for the detailed reply. Do I see correctly that all the timings are 0.00 on both GPU and CPU? If so, maybe increasing the batch size to 10^3 - 10^5 would give a more meaningful workload?

mgates3
Posts: 902
Joined: Fri Jan 06, 2012 2:13 pm

Re: Best solution for solving hundreds of small linear systems

Post by mgates3 » Mon Oct 07, 2019 4:01 pm

Yes, unfortunately the time gets rounded down here. But the performance is reflected in the Gflop/s rate. You can compute the approximate time from the standard gesv flop count:

time ≈ (2/3 * n^3 * batch_count) / (Gflop/s rate * 10^9)

For instance

2/3 * 100^3 * 500 / 158.36e9 = 0.0021 sec.
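That back-of-the-envelope arithmetic for the last row of the table can be written out as:

```python
# Approximate gesv flop count: (2/3) n^3 per matrix (the LU
# factorization dominates; triangular solves add O(n^2 * nrhs)).
def batched_gesv_time(n, batch_count, gflops):
    """Estimated seconds for batch_count n-by-n solves at a given Gflop/s."""
    return (2.0 / 3.0) * n**3 * batch_count / (gflops * 1e9)

# N=100, batch 500, 158.36 Gflop/s (the GPU rate in the last row).
print(round(batched_gesv_time(100, 500, 158.36), 4))  # 0.0021
```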

-mark
