## Limitations on precision

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
generalzod

### Limitations on precision

Hello,

I want to compute the eigenvalues and eigenvectors of a very large, dense matrix (n = 100k).
The error (|A-USU^H|) seems to accumulate as the matrix gets larger.
Is this an inherent limitation of double-precision arithmetic, or can it be mitigated with some other iterative method?

Thank you

mgates3

### Re: Limitations on precision

Are you using MAGMA's testers to test these, e.g., testing/testing_zheevd?
Which specific routine are you using?
If using MAGMA's tester, can you share the complete input & output that is concerning you?

We generally check the relative backwards error,
|| A - U S U^H ||_1 / ( || A ||_1 N )

MAGMA's tester abbreviates that as |A-USU^H| in the output header, but actually computes the above quantity.

The absolute error || A - U S U^H ||_1 does grow with the matrix size, since more values are accumulated into the norm. E.g., if every element of a vector x of length n has some small error tau, then the 1-norm of the error over the whole vector is n*tau.

Mark
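
To make the difference between the absolute and the relative error concrete, here is a minimal NumPy sketch. It is not MAGMA's tester code: `eig_checks` is a hypothetical helper name, `np.linalg.eigh` stands in for dsyevd, and the division by n in the orthogonality check is an assumption about the scaling.

```python
import numpy as np

def eig_checks(A, w, U):
    """Return (absolute residual, relative backward error, orthogonality error)."""
    n = A.shape[0]
    # (U * w) scales column j of U by w[j], so (U * w) @ U.T equals U S U^T.
    abs_res = np.linalg.norm(A - (U * w) @ U.T, 1)      # || A - U S U^T ||_1
    rel_res = abs_res / (np.linalg.norm(A, 1) * n)      # divided by || A ||_1 * N
    orth = np.linalg.norm(np.eye(n) - U.T @ U, 1) / n   # assumed scaling by N
    return abs_res, rel_res, orth

rng = np.random.default_rng(0)
for n in (100, 200, 400):
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2                  # random symmetric test matrix
    w, U = np.linalg.eigh(A)           # stand-in for dsyevd
    abs_res, rel_res, orth = eig_checks(A, w, U)
    print(f"{n:5d}  abs={abs_res:.2e}  rel={rel_res:.2e}  orth={orth:.2e}")
```

For a healthy double-precision solver, the relative quantity should stay near machine epsilon or below as n grows, even though the absolute residual increases.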

generalzod

### Re: Limitations on precision

Hi

This is what I get with testing_dsyevd

``````
% MAGMA 2.5.2  compiled for CUDA capability >= 6.0, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 10020. OpenMP threads 32.
% device 0: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% device 1: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% device 2: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% device 3: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% Tue Mar  3 18:43:12 2020
% Usage: ./testing_dsyevd [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
1088      ---              3.2039           ---         2.06e-17    6.48e-17   ok
1088      ---              0.2568           ---         1.23e-17    6.68e-17   ok
1088      ---              0.2568           ---         6.84e-08    6.70e-06   failed
1088      ---              0.2593           ---         1.53e-17    6.51e-17   ok
1088      ---              0.2554           ---         1.44e-17    6.87e-17   ok

2112      ---              0.6574           ---         7.33e-18    6.04e-17   ok
2112      ---              0.6583           ---         1.59e-17    6.76e-17   ok
2112      ---              0.6598           ---         2.77e-18    6.42e-17   ok
2112      ---              0.7249           ---         4.79e-18    6.50e-17   ok
2112      ---              0.6582           ---         2.22e-18    6.14e-17   ok

3136      ---              1.3959           ---         2.54e-17    5.45e-17   ok
3136      ---              1.3857           ---         1.01e-17    5.70e-17   ok
3136      ---              1.3820           ---         2.37e-17    6.19e-17   ok
3136      ---              1.4300           ---         7.85e-18    5.51e-17   ok
3136      ---              1.3825           ---         3.57e-18    5.59e-17   ok

4160      ---              1.9709           ---         4.23e-09    1.04e-06   failed
4160      ---              1.9827           ---         1.12e-17    5.23e-17   ok
4160      ---              1.9705           ---         1.75e-17    5.57e-17   ok
4160      ---              1.9741           ---         2.12e-17    5.72e-17   ok
4160      ---              1.9651           ---         4.50e-18    5.36e-17   ok

5184      ---              3.0915           ---         1.88e-17    5.66e-17   ok
5184      ---              2.6725           ---         1.28e-17    5.83e-17   ok
5184      ---              2.6832           ---         1.85e-17    5.31e-17   ok
5184      ---              2.8734           ---         2.62e-09    2.13e-07   failed
5184      ---              2.6751           ---         7.31e-07    6.00e-05   failed

6208      ---              3.5111           ---         2.44e-08    2.99e-06   failed
6208      ---              3.5439           ---         5.87e-18    6.15e-17   ok
6208      ---              3.6768           ---         1.36e-17    5.46e-17   ok
6208      ---              3.7247           ---         1.86e-17    5.37e-17   ok
6208      ---              3.6468           ---         7.55e-08    3.33e-05   failed

7232      ---              4.9060           ---         1.08e-10    1.57e-08   failed
7232      ---              4.6172           ---         2.42e-17    5.82e-17   ok
7232      ---              4.5373           ---         1.54e-17    5.51e-17   ok
7232      ---              4.5184           ---         5.64e-18    5.55e-17   ok
7232      ---              4.5125           ---         1.32e-17    5.80e-17   ok

8256      ---              5.1735           ---         1.16e-17    6.23e-17   ok
8256      ---              5.4694           ---         3.05e-09    3.75e-07   failed
8256      ---              5.3396           ---         3.98e-10    2.29e-07   failed
8256      ---              5.7306           ---         4.29e-07    7.17e-05   failed
8256      ---              5.4754           ---         5.26e-10    1.59e-07   failed

9280      ---              6.4114           ---         3.92e-09    7.72e-07   failed
9280      ---              6.7893           ---         2.25e-08    2.22e-06   failed
9280      ---              6.1874           ---         2.92e-10    7.02e-08   failed
9280      ---              6.2186           ---         3.26e-08    1.51e-05   failed
9280      ---              6.4704           ---         4.46e-10    6.61e-08   failed

10304      ---              7.5922           ---         1.18e-07    4.28e-05   failed
10304      ---              7.3870           ---         3.73e-09    5.19e-07   failed
10304      ---              7.6400           ---         1.01e-07    2.91e-05   failed
10304      ---              7.2780           ---         4.08e-08    7.18e-06   failed
10304      ---              7.3946           ---         5.36e-10    1.48e-07   failed

30000      ---             66.6994           ---         2.76e-11    1.57e-08   failed
30000      ---             67.4788           ---         5.08e-18    6.51e-17   ok
30000      ---             63.7498           ---         7.73e-08    1.96e-05   failed

50000      ---            213.1231           ---         1.73e-12    4.78e-09   failed
50000      ---            210.9395           ---         1.36e-08    3.70e-06   failed
50000      ---            207.4701           ---         2.51e-10    9.76e-08   failed

70000      ---            488.2336           ---         7.23e-08    2.34e-05   failed
70000      ---            478.4960           ---         9.65e-09    2.84e-06   failed
``````
Interestingly, solving the eigenproblem takes only a few minutes, but checking its error takes a few hours.

Why is lapackf77_dsyt21 so slow?

Also, are there any methods to refine the result (reduce the error) after running dsyevd?

Thank you

Stan Tomov

### Re: Limitations on precision

Some of these errors seem to be large and inconsistent. This is what I get on one of our systems with V100 and Intel CPU.

``````
[tomov@a04 testing]$ ./testing_dsyevd -JV --niter 5 -c -l -n 7000
% MAGMA 2.5.2 svn compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 9020, driver 10010. OpenMP threads 20. MKL 2017.0.1, MKL threads 20.
% device 0: Tesla V100-PCIE-16GB, 1380.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Wed Mar 11 12:47:04 2020
% Usage: ./testing_dsyevd [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, ngpu = 1
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
7000     12.2292           4.6496         4.83e-19      4.08e-18    4.39e-17   ok
7000     12.2834           4.6451         1.11e-19      6.59e-18    4.20e-17   ok
``````
These errors are what we expect in double precision. Did you by any chance modify the code, e.g., by removing the scaling or changing the input matrices?

Here lapackf77_dsyt21 took 40 seconds. It is slower than the CPU dsyevd because of the way the norms are computed: if you look at the code, the computation is done through rank-1 and rank-2 updates. Your times seem to be considerably larger. In this experiment I used MKL on the CPU. What BLAS/LAPACK are you using on the CPU?
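
As a rough illustration of why a check built from rank-1 updates is slow, the NumPy sketch below forms the same residual A - U S U^T two ways: one rank-1 update per eigenpair, and a single matrix-matrix product. This is not the dsyt21 source; the function names are made up for the example, and the timings depend entirely on the machine and the BLAS in use.

```python
import time
import numpy as np

def residual_rank1(A, w, U):
    """A - U S U^T built from n rank-1 updates (memory-bound, BLAS-2 style)."""
    R = A.copy()
    for j in range(A.shape[0]):
        R -= w[j] * np.outer(U[:, j], U[:, j])
    return R

def residual_gemm(A, w, U):
    """The same residual from one matrix-matrix product (BLAS-3 style)."""
    return A - (U * w) @ U.T

rng = np.random.default_rng(0)
n = 300
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
w, U = np.linalg.eigh(A)

t0 = time.perf_counter()
R1 = residual_rank1(A, w, U)
t1 = time.perf_counter()
R2 = residual_gemm(A, w, U)
t2 = time.perf_counter()

print(f"rank-1 updates: {t1 - t0:.4f} s, single product: {t2 - t1:.4f} s")
print("max difference between the two residuals:", np.abs(R1 - R2).max())
```

Both variants do O(n^3) work, but the rank-1 form sweeps the whole n-by-n matrix once per eigenpair, so it tends to be limited by memory bandwidth rather than arithmetic.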

Related to refinement, here are some relevant papers:
http://www.netlib.org/utk/people/JackDo ... sicedr.pdf
http://www.netlib.org/utk/people/JackDo ... values.pdf
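
For a feel of what refining a single eigenpair can look like, here is a minimal sketch using Rayleigh quotient iteration. This is a generic textbook technique, not necessarily the method of the papers linked above; `refine_eigenpair` and `residual` are hypothetical helper names. Each step does a dense O(n^3) solve, so this illustrates the idea rather than giving a practical recipe for n = 100k.

```python
import numpy as np

def residual(A, lam, x):
    """2-norm of A x - lam x for an approximate eigenpair (lam, x)."""
    return np.linalg.norm(A @ x - lam * x)

def refine_eigenpair(A, x, steps=2):
    """Refine an approximate eigenvector x of symmetric A by Rayleigh quotient iteration."""
    n = A.shape[0]
    x = x / np.linalg.norm(x)
    for _ in range(steps):
        lam = x @ A @ x                      # Rayleigh quotient as the shift
        try:
            y = np.linalg.solve(A - lam * np.eye(n), x)
        except np.linalg.LinAlgError:
            break                            # shift is numerically an exact eigenvalue
        if not np.all(np.isfinite(y)):
            break                            # solve overflowed; keep the current x
        x = y / np.linalg.norm(y)
    return x @ A @ x, x

rng = np.random.default_rng(0)
n = 200
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
w, U = np.linalg.eigh(A)

# Simulate an inaccurate eigenvector by perturbing an accurate one.
x0 = U[:, 0] + 1e-6 * rng.standard_normal(n)
x0 /= np.linalg.norm(x0)
lam0 = x0 @ A @ x0

lam1, x1 = refine_eigenpair(A, x0)
print(f"residual before: {residual(A, lam0, x0):.2e}")
print(f"residual after:  {residual(A, lam1, x1):.2e}")
```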

generalzod

### Re: Limitations on precision

Thank you for sharing your results, Mr. Tomov.

I did not alter the testing code (testing_dsyevd.cpp).

For a matrix with N=7000, lapackf77_dsyt21 doesn't take that long; I guess it's similar to your timing. But with N=70k, I think it takes around 10 hours to finish.

I am using P100 GPUs in an IBM POWER8 system.
My MAGMA build uses the latest OpenBLAS, without IBM MASS or the IBM XL compiler; it is compiled with just gcc and gfortran.

Stan Tomov

### Re: Limitations on precision

You may also want to try the 2-stage reduction algorithms, e.g.

``````
./testing_dsyevdx_2stage -JV --niter 2 -n 7000
``````
These are much faster, especially for the large sizes that you are targeting.

Using multiple GPUs may also help (add the "--ngpu 4" option).
You can also try ESSL. There is a make.inc example for that ("make.inc.power9-essl"), which you may have to modify.