Best solution for solving hundreds of small linear systems

Hi,
I would like to ask what would be the best approach for my particular problem:
I need to solve 100-1000 small (2x2 to 100x100) linear systems in the most efficient way possible, as part of a bigger HPC problem.
The most common case will be small 2x2, 3x3 or 4x4 linear systems, while bigger systems are possible but increasingly improbable.
The number of linear systems, on the other hand, can be expected to be around 500, staying consistently within the 100-1000 range on a specific installation.
I was looking for a high-performing GPU-based solution to compare to traditional CPU-cluster-based solutions, but I was not sure whether MAGMA offers something efficient for batches of very small matrices under 100x100.

Re: Best solution for solving hundreds of small linear systems
There is a batch routine for solving many small linear systems (magma_Xgesv_batched), where X specifies the precision (s, d, c, z).
The routine applies LU factorization with partial pivoting to the input matrices, followed by a row-interchange step and triangular solves. For very small matrices, though, there is definitely a benefit in fusing all of these steps into one GPU kernel, which MAGMA does not currently implement.
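For reference, here is a minimal sketch of how such a batched call might be set up. It is not taken from the MAGMA testers; the sizes, leading dimensions, example data, and pointer-array setup are illustrative, so check the batched headers of your MAGMA version for the exact prototype of magma_dgesv_batched.

Code:
// Minimal sketch (not from the MAGMA distribution) of calling magma_dgesv_batched
// on `batch` independent n-by-n systems in double precision.  The leading
// dimensions, the pointer-array setup, and the example data are illustrative;
// consult the batched headers of your MAGMA version for the exact prototype.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "magma_v2.h"

int main(void)
{
    magma_init();

    magma_int_t n = 4, nrhs = 1, batch = 500;
    magma_int_t ldda = ((n + 31) / 32) * 32;   // padded leading dimension
    magma_int_t lddb = ldda;

    magma_queue_t queue;
    magma_queue_create(0, &queue);

    // One contiguous slab per operand; matrix i starts at offset i*ldda*n.
    double *dA, *dB;
    magma_int_t *dipiv, *dinfo;
    magma_dmalloc(&dA, (size_t)ldda * n    * batch);
    magma_dmalloc(&dB, (size_t)lddb * nrhs * batch);
    magma_imalloc(&dipiv, (size_t)n * batch);
    magma_imalloc(&dinfo, batch);

    // Batched routines take device arrays of per-matrix pointers.
    double **dA_array, **dB_array;
    magma_int_t **dipiv_array;
    magma_malloc((void**)&dA_array,    batch * sizeof(double*));
    magma_malloc((void**)&dB_array,    batch * sizeof(double*));
    magma_malloc((void**)&dipiv_array, batch * sizeof(magma_int_t*));

    double **hA_array = malloc(batch * sizeof(double*));
    double **hB_array = malloc(batch * sizeof(double*));
    magma_int_t **hipiv_array = malloc(batch * sizeof(magma_int_t*));
    for (magma_int_t i = 0; i < batch; ++i) {
        hA_array[i]    = dA    + (size_t)i * ldda * n;
        hB_array[i]    = dB    + (size_t)i * lddb * nrhs;
        hipiv_array[i] = dipiv + (size_t)i * n;
    }
    cudaMemcpy(dA_array,    hA_array,    batch * sizeof(double*),      cudaMemcpyHostToDevice);
    cudaMemcpy(dB_array,    hB_array,    batch * sizeof(double*),      cudaMemcpyHostToDevice);
    cudaMemcpy(dipiv_array, hipiv_array, batch * sizeof(magma_int_t*), cudaMemcpyHostToDevice);

    // Illustrative data: every A_i is diagonally dominant, every b_i is all ones.
    double *hA = malloc((size_t)n * n * sizeof(double));
    double *hB = malloc((size_t)n * sizeof(double));
    for (magma_int_t j = 0; j < n*n; ++j) hA[j] = 1.0;
    for (magma_int_t j = 0; j < n;   ++j) { hA[j + j*n] = n + 1.0; hB[j] = 1.0; }
    for (magma_int_t i = 0; i < batch; ++i) {
        magma_dsetmatrix(n, n,    hA, n, hA_array[i], ldda, queue);
        magma_dsetmatrix(n, nrhs, hB, n, hB_array[i], lddb, queue);
    }

    // Factorize and solve all systems in one call; dB is overwritten with the solutions.
    magma_int_t arg_err = magma_dgesv_batched(n, nrhs, dA_array, ldda, dipiv_array,
                                              dB_array, lddb, dinfo, batch, queue);
    magma_queue_sync(queue);

    magma_dgetmatrix(n, nrhs, hB_array[0], lddb, hB, n, queue);
    printf("arg_err = %d, x of system 0:", (int)arg_err);
    for (magma_int_t j = 0; j < n; ++j) printf(" %.4f", hB[j]);
    printf("\n");

    // (Free the buffers, destroy the queue, and check dinfo in real code.)
    magma_finalize();
    return 0;
}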
Here are sample results for magma_dgesv_batched against a CPU-based solution (MKL dgesv + OpenMP over the batch; a sketch of that kind of baseline follows the results). The CPU is a 20-core Haswell; the GPU is a Tesla V100 PCIe.
Code:
./testing_dgesv_batched -N 10:100:10 --nrhs 1 --batch 500 --niter 1 -l
% MAGMA 2.5.0 svn compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10010, driver 10010. OpenMP threads 40. MKL 2018.0.1, MKL threads 20.
% device 0: Tesla V100-PCIE-16GB, 1380.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Thu Oct 3 12:44:41 2019
% Usage: ./testing_dgesv_batched [options] [-h|--help]
% BatchCount     N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - A*X|| / (N*||A||*||X||)
%=======================================================================
       500     10     1      0.21 (   0.00)      2.61 (   0.00)   1.03e-17   ok
       500     20     1      4.24 (   0.00)     25.75 (   0.00)   5.10e-18   ok
       500     30     1      5.35 (   0.00)     78.02 (   0.00)   3.38e-18   ok
       500     40     1     28.56 (   0.00)     53.75 (   0.00)   3.08e-18   ok
       500     50     1     38.36 (   0.00)     84.90 (   0.00)   2.21e-18   ok
       500     60     1     55.46 (   0.00)    129.25 (   0.00)   2.23e-18   ok
       500     70     1     68.17 (   0.00)    100.43 (   0.00)   2.12e-18   ok
       500     80     1     60.25 (   0.00)    114.92 (   0.00)   1.69e-18   ok
       500     90     1     68.59 (   0.00)    128.80 (   0.00)   1.63e-18   ok
       500    100     1     84.74 (   0.00)    158.36 (   0.00)   1.65e-18   ok
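For completeness, the kind of CPU baseline referred to above (one MKL dgesv per system, with OpenMP parallelism over the batch) can be sketched as follows. This is not the tester's code, just an illustration assuming an MKL/LAPACKE installation.

Code:
/* Minimal sketch (not the MAGMA tester's code) of a CPU baseline: one MKL
   dgesv per system, with OpenMP parallelism over the batch.  Assumes MKL's
   LAPACKE interface; mkl_set_num_threads(1) keeps MKL sequential inside the
   parallel loop. */
#include <stdlib.h>
#include <mkl.h>

/* A holds `batch` column-major n-by-n matrices back to back; B holds the
   corresponding right-hand sides and is overwritten with the solutions. */
void cpu_gesv_batch(lapack_int n, lapack_int nrhs,
                    double *A, double *B, lapack_int batch)
{
    mkl_set_num_threads(1);   /* one thread per dgesv; OpenMP covers the batch */
    #pragma omp parallel for schedule(static)
    for (lapack_int i = 0; i < batch; ++i) {
        lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
        LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs,
                      A + (size_t)i * n * n,    n, ipiv,
                      B + (size_t)i * n * nrhs, n);
        free(ipiv);
    }
}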
Re: Best solution for solving hundreds of small linear systems
Thank you very much for the detailed reply. Do I see correctly that all the timings are 0.00 on both GPU and CPU? If so, maybe increasing the batch size to 10^3 - 10^5 would make for a more meaningful workload?
Re: Best solution for solving hundreds of small linear systems
Yes, unfortunately the displayed time gets rounded down to 0.00 here. But the performance is reflected in the Gflop/s rate. You can compute the approximate time using the formula:
time ≈ 2/3 * n^3 * batch_count / (Gflop/s rate * 1e9)
For instance:
2/3 * 100^3 * 500 / 158.36e9 ≈ 0.0021 sec.
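If it helps, the same arithmetic as a small C snippet (the 2/3 * n^3 term counts only the factorization flops, which dominate for nrhs = 1, so this is approximate):

Code:
#include <stdio.h>

int main(void)
{
    /* Values taken from the N = 100 row of the run above. */
    double n = 100.0, batch_count = 500.0, gflops = 158.36;
    double flops   = (2.0 / 3.0) * n * n * n * batch_count;  /* LU flops for the whole batch */
    double seconds = flops / (gflops * 1e9);                 /* rate is in Gflop/s */
    printf("estimated time: %.4f sec\n", seconds);           /* ~0.0021 sec */
    return 0;
}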
Mark