low performance running mixed precision lu factorization

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
shengyushen
Posts: 6
Joined: Thu Nov 28, 2019 6:14 am

low performance running mixed precision lu factorization

Post by shengyushen » Tue Dec 03, 2019 10:10 pm

Dear all:

I ran the mixed precision LU factorization described in the paper "Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers".
The paper reports that the mixed precision solver reaches about 20 Tflops for almost all types of matrices.
But in my run, it reaches only 4 Tflops.
I checked and made sure that the magma_xshgetrf_gpu function in src/xshgetrf_gpu.cpp is actually called to do the mixed precision factorization.
But I found that there is a lot of switching between the CPU and the GPU in the magma_xshgetrf_gpu function in src/xshgetrf_gpu.cpp:

===========================================================================
for( j=0; j < minmn; j += jb ) {
    jb   = min(nb, minmn-j);
    rows = m - j;
    if (j == 0)
    {
        // transpose the panel and send it to CPU
        magmablas_stranspose( jb, m-j, dAT(j,j), lddat, dAP(0,0), maxm, queues[1] );
        magma_queue_sync( queues[1] );  // wait for transpose
        magma_sgetmatrix_async( m-j, jb, dAP(0,0), maxm, work, ldwork, queues[0] );
    }
    // SSY: why do we need to go to the CPU?
    // do the cpu part
    magma_queue_sync( queues[0] );  // wait to get work
    // SSY: now on CPU
    lapackf77_sgetrf( &rows, &jb, work, &ldwork, ipiv+j, &iinfo );
    if ( *info == 0 && iinfo > 0 ) {
        *info = iinfo + j;
        printf("error sgetrf inside xshgetrf voici info %d\n", (int)*info);
        goto cleanup;
    }

    magma_ssetmatrix_async( m-j, jb, work, ldwork, dAP, maxm, queues[0] );
    for( i=j; i < j + jb; ++i ) {
        ipiv[i] += j;
    }
    magmablas_slaswp( n, dAT(0,0), lddat, j + 1, j + jb, ipiv, 1, queues[1] );
===========================================================================


Following is the log of my run:
============================================================================
% MAGMA 2.5.1 compiled for CUDA capability >= 3.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 9000, driver 10000. OpenMP threads 56.
% device 0: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 4: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 5: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 6: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 7: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Tue Dec 3 03:44:07 2019
% Epsilon(double): 1.110223e-16
% Epsilon(single): 5.960464e-08

% Usage: ./build/testing/testing_dxgesv_gpu [options] [-h|--help]

ntest 1
niter 1
cond 0.000000
msize 30016
% trans = No transpose
% N NRHS DP-Factor DP-Solve HP-Factor HP-Solve MP: FP16->FP64-Solve Iter |b-Ax|/N|A|
%=========================================================================================
30016 1 2796.79 2791.48 4012.39 4005.21 3699.01 2 4.55e-19 ok

haidar
Posts: 22
Joined: Fri Sep 19, 2014 3:43 pm

Re: low performance running mixed precision lu factorization

Post by haidar » Wed Dec 04, 2019 12:12 am

For this size you should easily be able to get above 20 Tflops with your machine.
It might be that the CPU thread binding is the issue. I can see 56 OpenMP threads, but I believe you have a 2x14 = 28 core Intel machine, so it is better to export OMP_NUM_THREADS=28 or even 14 and try again. You can also look at htop and make sure that all threads are bound to different cores. I assume you are using MKL for the CPU side; if you are using netlib LAPACK instead, then that is clearly the issue.
If you have MKL and you use 28 cores, try also with numactl --interleave=all ./testing_XXXX
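
For example, a minimal sketch of such a run (the tester path and the N=30016 size are taken from your log above; any other flags are whatever you normally pass):

Code: Select all

export OMP_NUM_THREADS=28    # one thread per physical core
numactl --interleave=all ./build/testing/testing_dxgesv_gpu -N 30016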

Another alternative is to check the mixed precision solver in cuSolver 10.2, which is exactly the same as the one in MAGMA and was released in Nov 2019 (dhgesv, or Xgesv for the expert interface; see https://docs.nvidia.com/cuda/cuda-toolk ... w-features and https://docs.nvidia.com/cuda/cusolver/i ... verDNXgesv).
cuSolver does not use the CPU, so it should be independent of the CPU; however, you might get about 5% less performance than MAGMA, because MAGMA benefits from the CPUs.

Please let us know if you have any questions; if you run into issues in your solution, we might be able to provide some hints about tuning parameters.
Also, I am wondering if you can share some info about your application, for our records of how many people are using the mixed precision solver.
Thanks
Azzam

shengyushen
Posts: 6
Joined: Thu Nov 28, 2019 6:14 am

Re: low performance running mixed precision lu factorization

Post by shengyushen » Wed Dec 11, 2019 11:33 pm

Thank you.
I tried your first suggestion, and the result is:
===================================================================
export OMP_NUM_THREADS=14
30016 1 3556.64 3548.72 6339.13 6321.45 6229.81 2 4.55e-19 ok
export OMP_NUM_THREADS=28
30016 1 3912.91 3903.08 6441.10 6422.82 6240.85 2 4.55e-19 ok
export OMP_NUM_THREADS=56
30016 1 2893.41 2887.92 3948.07 3941.15 3734.30 2 4.55e-19 ok
===================================================================
The command I ran is:
===================================================================
numactl --interleave=all ./build/testing/testing_dxgesv_gpu -N 30016 --matrix diag_rand --version3
===================================================================
It seems that properly setting OMP_NUM_THREADS improves the result by nearly 2x,
but it is still far from 20 Tflops.

And what is the MKL you mentioned? I have NOT modified the default makefile.

I will try cuSolver this afternoon

Stan Tomov
Posts: 279
Joined: Fri Aug 21, 2009 10:39 pm

Re: low performance running mixed precision lu factorization

Post by Stan Tomov » Thu Dec 12, 2019 2:08 am

MKL is the Intel Math Kernel Library. It provides highly optimized routines that MAGMA uses on the CPU. You can download it from here:
https://software.intel.com/en-us/mkl/ch ... load/linux
After you install it, set the environment variable MKLROOT to where MKL is installed, go to the main magma directory, and do:

Code: Select all

cp make.inc-examples/make.inc.mkl-gcc-ilp64 make.inc
make lib -j
cd testing
make testing_dxgesv_gpu
This should set MAGMA to use MKL on the CPU and hopefully improve the performance.
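
As a minimal sketch of the environment setup (the install path below is an assumption; point it at wherever MKL actually lives on your system):

Code: Select all

# hypothetical default install location; adjust to your MKL installation
export MKLROOT=/opt/intel/mkl
# or, if your installation provides it, let Intel's script set everything up:
# source /opt/intel/mkl/bin/mklvars.sh intel64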

shengyushen
Posts: 6
Joined: Thu Nov 28, 2019 6:14 am

Re: low performance running mixed precision lu factorization

Post by shengyushen » Thu Dec 12, 2019 8:50 am

Hi all
I tried MKL, and the result is:

MKL version: l_mkl_2019.5.281.tgz
cuda version: release 9.0, V9.0.176

Command to run:
numactl --interleave=all ./testing_dxgesv_gpu -N 30016 --matrix diag_rand --version 3

Result is:
export OMP_NUM_THREADS=14
30016 1 3612.25 3603.93 7057.53 7036.49 7064.88 2 4.56e-19 ok
export OMP_NUM_THREADS=28
30016 1 1936.75 1934.41 8437.05 8405.39 7892.45 2 4.56e-19 ok

MKL does improve the performance by about 25%, but it is still far from 20 Tflops.

Stan Tomov
Posts: 279
Joined: Fri Aug 21, 2009 10:39 pm

Re: low performance running mixed precision lu factorization

Post by Stan Tomov » Thu Dec 12, 2019 10:36 am

Now I see MAGMA is not compiled for Volta, e.g., the tester above prints
% MAGMA 2.5.1 compiled for CUDA capability >= 3.0, 32-bit magma_int_t, 64-bit pointer.
Can you please modify your make.inc file, and in particular, add the GPU_TARGET. After
#GPU_TARGET ?= Kepler Maxwell Pascal
add

Code: Select all

GPU_TARGET = Volta
Then do a
make clean
and regenerate the library and the tester to see if this improves performance.
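
A minimal sketch of that rebuild, assuming the MKL make.inc from the previous post is already in place and GPU_TARGET = Volta has been added to it:

Code: Select all

# from the main magma directory
make clean
make lib -j
cd testing
make testing_dxgesv_gpu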

shengyushen
Posts: 6
Joined: Thu Nov 28, 2019 6:14 am

Re: low performance running mixed precision lu factorization

Post by shengyushen » Thu Dec 12, 2019 9:20 pm

I modified it and now it is still around 8 Tflops:


****************************************************************************************************************************************************************************
root@GPU2:/home/nfs1/ssy/mgm251/testing# numactl --interleave=all ./testing_dxgesv_gpu -N 30016 --matrix diag_rand --version 3
% MAGMA 2.5.1 compiled for CUDA capability >= 7.0, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 9000, driver 10000. OpenMP threads 28. MKL 2019.0.5, MKL threads 28.
% device 0: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 4: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 5: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 6: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 7: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Thu Dec 12 20:21:20 2019
% Epsilon(double): 1.110223e-16
% Epsilon(single): 5.960464e-08

% Usage: ./testing_dxgesv_gpu [options] [-h|--help]
% trans = No transpose
% N NRHS DP-Factor DP-Solve HP-Factor HP-Solve MP: FP16->FP64-Solve Iter |b-Ax|/N|A|
%=========================================================================================
30016 1 1935.18 1932.82 8587.24 8554.55 8188.05 2 4.56e-19 ok

****************************************************************************************************************************************************************************

haidar
Posts: 22
Joined: Fri Sep 19, 2014 3:43 pm

Re: low performance running mixed precision lu factorization

Post by haidar » Fri Dec 13, 2019 12:22 am

First of all, it looks like there are several issues.
The goal of the Tensor Cores Accelerated Iterative Refinement Solver (TCAIRS) in both MAGMA and cuSolver is to provide around a 4x speedup over the basic double precision solver (dgesv), and this is true for your run as well (2 Tflops versus 8 Tflops).
Note also that you are running CUDA 9.0; it is preferable to use CUDA 10.2.

Now, the issue is that your double precision dgesv is already performing slowly: it is at 2 Tflops instead of about 4-5 Tflops, so I am not surprised that the tensor core TCAIRS is at 8 Tflops. It should be about 4 Tflops for dgesv and around 20 Tflops for the TCAIRS.
I tried to compile exactly the same way you did and it looks fine on my side, even with 8 threads. I am using MKL 2018 and CUDA 10.1.

Code: Select all

 
./testing_dxgesv_gpu -N 30016 --matrix rand_dominant --version 3 --niter 2 
% MAGMA 2.5.2 svn compiled for CUDA capability >= 3.0, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10010, driver XXXXX. OpenMP threads 8. MKL 2018.0.0, MKL threads 8. 
% device 0: Quadro GV100, 1627.0 MHz clock, 32508.2 MiB memory, capability 7.0
% device 1: Quadro GV100, 1627.0 MHz clock, 32508.2 MiB memory, capability 7.0
% device 2: Quadro RTX 8000, 1770.0 MHz clock, 48601.3 MiB memory, capability 7.5
% device 3: Quadro P1000, 1480.5 MHz clock, 4036.9 MiB memory, capability 6.1
% Thu Dec 12 22:48:39 2019
% Epsilon(double): 1.110223e-16
% Epsilon(single): 5.960464e-08

% Usage: ./testing_dxgesv_gpu [options] [-h|--help]

% trans = No transpose
%   N  NRHS   DP-Factor  DP-Solve  HP-Factor  HP-Solve  MP: FP16->FP64-Solve  Iter   |b-Ax|/N|A|
%=========================================================================================
30016     1   5402.52    5381.89   19632.26    19459.61   17826.99               2   5.17e-20   ok
30016     1   5360.77    5340.43   19562.77    19391.24   17754.07               2   4.41e-20   ok
 


So I would suggest that we first investigate why the dgesv is slow; I think if we find that, the TCAIRS will most likely reach 20 Tflops.
First, let's avoid generating the shared library, because sometimes loading it is expensive, so try to comment out -fPIC in the make.inc.
Also try to compile with the 32-bit magma_int_t makefile (cp make.inc-examples/make.inc.mkl-gcc make.inc).
So I would suggest some checks to perform:

1- can you check the "top" command while you are running and verify that it is using 14/28 threads, and not running sequentially on 1 thread?
2- can you run testing_dgesv_gpu to figure out what performance the dgesv reaches and show us the output?
3- can you run numactl --interleave=all ./testing_sgetrf_gpu --range 5000:30000:5000,256 --range 5000:30000:5000,384 --range 5000:30000:5000,512 -l -c --niter 2
4- can you run with the profiler and send us the profiler output (see the sketch below): nvprof -o dxgesv_N30016.nvvp numactl --interleave=all ./testing_dxgesv_gpu....
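
A sketch of that profiling run, reusing the matrix size and flags from your earlier runs (adjust the thread count and flags to your setup):

Code: Select all

export OMP_NUM_THREADS=28
nvprof -o dxgesv_N30016.nvvp \
    numactl --interleave=all ./testing_dxgesv_gpu \
    -N 30016 --matrix diag_rand --version 3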

Regarding the way to run the profiler for testing_dxgesv_gpu: I would suggest minor modifications to the tester, to avoid running all the LU variants and only run magma_dhgesv_iteref_gpu. Go to just after the magma_dhgesv_iteref_gpu call (let's say line 124) and add "status = 0; goto cleanup;":

Code: Select all

 
             //=====================================================================
            //              MIXED - GPU
            //=====================================================================
            gpu_time = magma_wtime();
            if ( opts.version == 1 ) {
                // fallback to the FP16 TC
            }
            else if ( opts.version == 2 ) {
                magma_dsgesv_iteref_gpu( opts.transA, N, nrhs,
                        d_A, ldda, h_ipiv, d_ipiv,
                        d_B, lddb, d_X, lddx,
                        d_WORKD, d_WORKS, &gesv_iter, &info);
            }
            else if ( opts.version == 3 ) {
                magma_dhgesv_iteref_gpu( opts.transA, N, nrhs,
                        d_A, ldda, h_ipiv, d_ipiv,
                        d_B, lddb, d_X, lddx,
                        d_WORKD, d_WORKS, &gesv_iter, &info);
            }
            gpu_time = magma_wtime() - gpu_time;
            gpu_perf = gflopsS / gpu_time;
            if (info != 0) {
                printf("magma_dxgesv returned error %lld: %s.\n",
                       (long long) info, magma_strerror( info ));
            }
 status = 0;
 goto cleanup;
 ....
 ....
 ....
 // then at line 218 add the cleanup label:
 cleanup:
    opts.cleanup();
    TESTING_CHECK( magma_finalize() );
    return status; 
 


Azzam

shengyushen
Posts: 6
Joined: Thu Nov 28, 2019 6:14 am

Re: low performance running mixed precision lu factorization

Post by shengyushen » Fri Dec 13, 2019 11:47 am

I tried the same code on an AWS instance with 1 GPU and 4 cores; it is still about 8 Tflops.
BUT when I switch to another instance with 4 GPUs and 16 cores, it reaches 17 Tflops!!!
#####################################################################################
ubuntu@ip-172-31-46-17:~/mgm251/testing$ export OMP_NUM_THREADS=16
ubuntu@ip-172-31-46-17:~/mgm251/testing$ numactl --interleave=all ./testing_dxgesv_gpu -N 30016 --matrix diag_rand --version 3
% MAGMA 2.5.1 compiled for CUDA capability >= 3.0, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10000, driver 10010. OpenMP threads 16. MKL 2019.0.5, MKL threads 16.
% device 0: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Fri Dec 13 15:30:36 2019
% Epsilon(double): 1.110223e-16
% Epsilon(single): 5.960464e-08
% trans = No transpose
% N NRHS DP-Factor DP-Solve HP-Factor HP-Solve MP: FP16->FP64-Solve Iter |b-Ax|/N|A|
%=========================================================================================
30016 1 6173.57 6151.40 18206.42 18075.53 16462.07 2 4.56e-19 ok
#####################################################################################

The result from lscpu is:
###################################################################
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.714
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.10
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
###################################################################
I find that this CPU's frequency is quite high; it can reach 3 GHz.

So I checked back on my own GPU server and found that its CPU is somewhat slow at 2 GHz, and it can only reach ~4 Tflops with 14 or so cores.
##############################################################
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5117M CPU @ 2.00GHz
Stepping: 4
CPU MHz: 2000.027
BogoMIPS: 4001.38
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
############################################################


So, will this affect my performance?
Will the dgesv still need to invoke the CPU?

Stan Tomov
Posts: 279
Joined: Fri Aug 21, 2009 10:39 pm

Re: low performance running mixed precision lu factorization

Post by Stan Tomov » Wed Dec 18, 2019 12:56 am

The slow CPU will affect performance, since MAGMA still uses the CPUs for part of the computation. We can tune for this case or use other codes that are GPU-only, but those are not yet connected to the mixed-precision solvers.
