Magma in Windows with ILP64

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

Postby Manuel__ » Sun Mar 26, 2017 11:25 am

Hello,
I have been able to build Magma on Windows 10 with Visual Studio 2015 and MKL (for both BLAS and LAPACK).
I entered the following into the CMake GUI:

set LAPACK_LIBRARIES to: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_intel_lp64_dll.lib;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_intel_thread_dll.lib;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_core_dll.lib;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\compiler\lib\intel64_win\libiomp5md.lib
set MKLROOT to: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl

Then I compiled my test code (using nvcc), linking against mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib.

I have tested magma_dgesv: it works fine up to m=32000, n=1 (i.e. AX=B with A of size 32000*32000 and B of size 32000). Beyond that, it crashes due to LP64 (32-bit integers).
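
For reference, my test is essentially of the following form (a minimal sketch, not my actual code: matrix contents, sizes and error handling are simplified here):

Code:
// Minimal sketch of the magma_dgesv test (CPU interface: A and B in host memory).
// The matrix contents are placeholders, not my actual data.
#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main( void )
{
    magma_int_t n = 32000, nrhs = 1, info = 0;
    magma_init();

    double *A = (double*) malloc( (size_t)n * n    * sizeof(double) );
    double *B = (double*) malloc( (size_t)n * nrhs * sizeof(double) );
    magma_int_t *ipiv = (magma_int_t*) malloc( (size_t)n * sizeof(magma_int_t) );

    // Fill A and B with some data (placeholder values).
    for (size_t i = 0; i < (size_t)n * n;    ++i) A[i] = rand() / (double)RAND_MAX;
    for (size_t i = 0; i < (size_t)n * nrhs; ++i) B[i] = 1.0;

    // Solve A*X = B; the solution overwrites B.
    magma_dgesv( n, nrhs, A, n, ipiv, B, n, &info );
    printf( "info = %lld, B[0] = %f\n", (long long) info, B[0] );

    free( A ); free( B ); free( ipiv );
    magma_finalize();
    return 0;
}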

Now, I am trying to build magma with ILP64 and there are serious problems.

I followed the instructions in viewtopic.php?f=2&t=1352:
--> I set LAPACK_LIBRARIES to: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_intel_ilp64_dll.lib;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_intel_thread_dll.lib;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_core_dll.lib;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\compiler\lib\intel64_win\libiomp5md.lib
that is, I just replaced mkl_intel_lp64_dll.lib with mkl_intel_ilp64_dll.lib.

Also, I added add_definitions( -DMKL_ILP64 ) and add_definitions( -DMAGMA_ILP64 ) as the second and third lines of magma-2.2.0\CMakeLists.txt.

There were warnings during the build about magma_int_t, but it succeeded, apparently.

After the build, I compiled my test code (using nvcc), linking against mkl_intel_ilp64.lib mkl_sequential.lib mkl_core.lib and defining -DMKL_ILP64 -DMAGMA_ILP64.
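
As a quick sanity check that the ILP64 defines actually reach both the MKL and the Magma headers, a small diagnostic like the following can be compiled with the same -DMKL_ILP64 -DMAGMA_ILP64 options (illustrative sketch only):

Code:
// Diagnostic sketch: check that both MKL and MAGMA headers see 64-bit integers
// when compiled with -DMKL_ILP64 -DMAGMA_ILP64.
#include <stdio.h>
#include <mkl.h>        // MKL_INT should be 8 bytes under MKL_ILP64
#include "magma_v2.h"   // magma_int_t should be 8 bytes under MAGMA_ILP64 (or MKL_ILP64)

int main( void )
{
    printf( "sizeof(MKL_INT)     = %d\n", (int) sizeof(MKL_INT) );
    printf( "sizeof(magma_int_t) = %d\n", (int) sizeof(magma_int_t) );
    // Both should print 8; if one prints 4, the corresponding define did not
    // reach this translation unit (or the library was built without it).
    return 0;
}

Of course, this only checks my own compilation unit; it cannot tell whether the magma.lib I built was really compiled with 64-bit magma_int_t throughout.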

Unfortunately, several runs, even with small matrices (around 1000*1000), give totally unpredictable results: either the program terminates properly but with wrong solution values, or it crashes with "Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.", or it crashes with no error message but a prompt to launch the debugger. Once (so far) it even terminated properly with correct solution values, which is even more demoralizing...

Based on the discussion in viewtopic.php?f=2&t=1352, I have come to understand that there are some serious problems in the Magma source code when it comes to ILP64 on Windows.

Since it has been reported that Magma does work properly with ILP64, I surmise that it may do so only in Linux.

Is there anybody who has managed to make Magma work properly with ILP64 on Windows?

I admit that my morale is taking quite a dive, especially since the whole process of finding out how to build Magma in Windows has been... well, difficult.

I would appreciate any insight into the matter.

Many thanks in advance
Manuel

Re: Magma in Windows with ILP64

Postby Manuel__ » Mon Mar 27, 2017 5:37 am

Hello again,
the situation has improved a bit: I fixed a bug in my magma_dgesv test code with ILP64, and it now works fine up to m=30000, just like with LP64.

However, beyond that, things are uncertain: usually, a run terminates prematurely with the solution vector filled with -nan(ind)'s.
But if I keep relaunching the run, it sometimes works fine. So far, I have managed to get occasional good runs with m=40000 and m=50000.
I have enough RAM to go up to m=120000, so this is of course very frustrating.
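
To detect these failed runs programmatically rather than by eyeballing the printed values, a small check along the following lines is enough (a sketch of the idea, not my exact test code):

Code:
// Sketch: count NaN entries in the solution returned by magma_dgesv
// (the failed runs show up as -nan(ind) under Visual Studio's printf).
#include <math.h>

// B is n-by-nrhs, column major, with leading dimension ldb.
static long long count_nans( const double *B, long long n, long long nrhs, long long ldb )
{
    long long bad = 0;
    for (long long j = 0; j < nrhs; ++j)
        for (long long i = 0; i < n; ++i)
            if (isnan( B[i + j*ldb] ))
                ++bad;
    return bad;
}

Calling this right after magma_dgesv returns (even with info=0) makes it easy to tally good and bad runs automatically.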

It may well be that there are bugs in Magma with ILP64 that sometimes terminate the execution prematurely.
If so, are these bugs general issues, or do they result from compiling with VC++ code that was, I assume, primarily developed with gcc?

Has anybody already used Magma on Windows with MKL on very big matrices (50000*50000 and more)?

By the way, is it possible to use Magma on Windows with MKL in multithreaded mode?

I have made some preliminary tests with mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib instead of mkl_intel_ilp64.lib mkl_sequential.lib mkl_core.lib.
But so far it crashes at runtime.

Manuel

Re: Magma in Windows with ILP64

Postby mgates3 » Mon Mar 27, 2017 11:38 am

We have reports that MAGMA works with ILP64 on all platforms (e.g., Linux, MacOS, Windows). I haven't personally tested the ILP64 functionality on Windows, though. It should also work with multi-threaded MKL on all platforms. There are commercial products that use MAGMA with multi-threaded MKL on all platforms. However, they may use a different build system than CMake, so that could be one source of issues. I can look into adding an ILP64 option in CMake.

You should be able to go over 40000 before requiring ILP64: even assuming a signed 32-bit integer offset, (2^31)^0.5 = 46340. You can test LAPACK or MKL up to around that size to check.
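
To illustrate the arithmetic (this is just the index calculation, not MAGMA's code): a column offset like lda*j computed in 32-bit integers wraps around once the dimension passes 46340.

Code:
// Illustration only: 32-bit vs 64-bit column offsets near the LP64 limit.
#include <stdio.h>

int main( void )
{
    long long n    = 46341;              // smallest n with n*n > 2^31 - 1
    long long good = n * n;              // 2147488281, needs 64 bits
    int       bad  = (int)( n * n );     // truncated to 32 bits: wraps negative
    printf( "64-bit offset: %lld\n32-bit offset: %d\n", good, bad );
    return 0;
}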

What kind of CUDA GPU do you have? How much CPU and GPU memory? Particularly the GPU memory affects the algorithm, as it may need to run an out-of-GPU-memory algorithm. What CUDA / cuBLAS version?

In the past, we have seen some issues with cuBLAS at very large sizes. This was a couple years ago, though, and the issues should be resolved now.

-mark

Re: Magma in Windows with ILP64

Postby mgates3 » Mon Mar 27, 2017 11:45 am

Also, when you say it "terminates prematurely", what do you mean? Does it return an error code in info? Does the same matrix work fine in LAPACK / MKL? I.e., the matrix isn't overflowing or singular?

I'm a bit confused what you mean by "-nan(ind)'s". Do you mean just "nan"? nan isn't signed.

-mark

Re: Magma in Windows with ILP64

Postby Manuel__ » Mon Mar 27, 2017 1:47 pm

Hello,
By "terminates prematurely" I simply mean that it finishes obviously way too soon, for example faster than a successful run with a smaller size. It seems to work fine for some time and then suddenly terminates. It then returns info=0, which seems normal, but when I print the solution values in vector B with printf(" %f" , B[i]); , then I get only some "-nan(ind)", which confuses me as well since I had never seen that "thing" before (but I have just made a Google search with "-nan(ind)" -between double quotes- and I do get some hits).

Actually, Magma on Windows with ILP64 does work beyond 40000: I have now had successful runs at 40000, 50000 and 70000. But all my attempts at 60000 have failed, which is odd, and a number of runs at 40000 failed as well (I finally got at least one good one by persevering). Are you sure that the people who reported using Magma on Windows with ILP64 actually tested it on very large matrices?

Curiously, with LP64 I could only ever reach 32000, never beyond.

I am using a GTX 1080 (8GB) and a PC with 128GB of RAM, Windows 10, CUDA 8.0, Visual Studio 2015 and MKL 2017.2.187. That's pretty much the most up-to-date configuration, as far as I know.

I have also applied the MKL function LAPACKE_dgesv, with ILP64, to the very same matrices, and it always runs without a glitch. I have reached 100000 with it so far.
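
That reference check is essentially of this form (a sketch, compiled with -DMKL_ILP64 so that lapack_int is 64-bit; the matrix setup, identical to the Magma test, is omitted):

Code:
// Sketch of the MKL reference check with ILP64.
#include <stdio.h>
#include <stdlib.h>
#include <mkl_lapacke.h>

int main( void )
{
    lapack_int n = 100000, nrhs = 1;
    double *A = (double*) malloc( (size_t)n * n    * sizeof(double) );
    double *B = (double*) malloc( (size_t)n * nrhs * sizeof(double) );
    lapack_int *ipiv = (lapack_int*) malloc( (size_t)n * sizeof(lapack_int) );

    // ... fill A and B with the same matrices as in the Magma test ...

    lapack_int info = LAPACKE_dgesv( LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, B, n );
    printf( "LAPACKE_dgesv info = %lld\n", (long long) info );

    free( A ); free( B ); free( ipiv );
    return 0;
}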

It may well be that the problem is with CMake, but that's a black box for me.

Manuel

Re: Magma in Windows with ILP64

Postby mgates3 » Mon Mar 27, 2017 3:43 pm

Thanks for the info. Seems to be some issue with the out-of-GPU-memory algorithm then. 8 GiB = 32768 x 32768 doubles, so a 32000 x 32000 matrix probably fits in the GPU memory (if nothing else is taking up significant memory), while a 40000 x 40000 matrix will not.

I'll poke around and see if I can replicate the problem.

Ah, Windows prints nan funny. The "ind" means indeterminate; it just indicates what flavor of nan it is.
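
For example, something like this, built with the Microsoft compiler, typically prints that same string:

Code:
// Tiny example: MSVC's printf renders the default quiet NaN as "-nan(ind)".
#include <stdio.h>

int main( void )
{
    volatile double zero = 0.0;
    printf( "%f\n", zero / zero );   // prints -nan(ind) with the Microsoft CRT
    return 0;
}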

-mark

Re: Magma in Windows with ILP64

Postby Manuel__ » Tue Mar 28, 2017 7:23 am

Hello
I would be delighted if my experiments could help you smash a bug or two!

By the way, the CMake GUI currently only allows specifying the architectures Fermi, Kepler and Maxwell. Would it be possible to have Pascal as well? (From my experience, it should not make any noticeable difference, but still.)

I have been trying again to use MKL both in multithreaded mode and with ILP64: I updated the inputs to the CMake GUI a bit and redid the build.
Before, I was specifying mkl_intel_ilp64_dll.lib, mkl_intel_thread_dll.lib, mkl_core_dll.lib, libiomp5md.lib for LAPACK_LIBRARIES.
Now, I set LAPACK_LIBRARIES to mkl_intel_ilp64.lib, mkl_intel_thread.lib, mkl_core.lib, libiomp5md.lib, which is cleaner since the libraries are supposed to be linked statically, both during the build and when I compile my code (and yet it did work with the DLL import libraries in LAPACK_LIBRARIES...).

Anyway, the new result is exactly the same: my code with magma_dgesv works (though not all the time above 32000, as I reported) when linked with mkl_intel_ilp64.lib mkl_sequential.lib mkl_core.lib, and it crashes at runtime when linked with mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib.
An error message appears several times: Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.

Actually, I have just checked that the same error message appears when I build and compile my code with multithreaded MKL and LP64...

I specify in detail below how I build and compile, both for sequential MKL with ILP64 and for multithreaded MKL with ILP64 (the only difference is the final compilation line).
Could you please have a look?

Many thanks in advance
Manuel



********************
******************** Build magma.lib and magma_sparse.lib:
********************

extract magma-2.2.0 from magma-2.2.0.tar.gz into C:\MAGMA and rename it as ilp64-magma-2.2.0

insert add_definitions( -DMKL_ILP64 ) and add_definitions( -DMAGMA_ILP64 ) as the second and third lines of C:\MAGMA\ilp64-magma-2.2.0\CMakeLists.txt

open CMake GUI
Where is the source code: C:\MAGMA\ilp64-magma-2.2.0
Where to build the binaries: C:\MAGMA\build-ilp64-magma-2.2.0
Configure
Specify the generator for this project: Visual Studio 14 2015 Win64
Use default native compilers
Finish
Untick USE_FORTRAN
Configure --> wait until Configuring done

set GPU_TARGET to: Fermi Kepler Maxwell

set LAPACK_LIBRARIES to:
C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_intel_ilp64.lib
;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_intel_thread.lib
;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\mkl_core.lib
;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\compiler\lib\intel64_win\libiomp5md.lib
(it should be entered all on one line, with no blanks around the semicolons)

set MKLROOT to: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl

Configure --> wait until Configuring done
Generate --> wait until Generating done
close CMake


open C:\MAGMA\build-ilp64-magma-2.2.0\MAGMA.sln --> wait until Ready

select in toolbar: Release

View --> Solution Explorer

select both magma and magma_sparse, then right click on the selection:
--> Properties

--> C/C++ (click arrow to unfold) --> Code Generation --> set Runtime Library to Multithreaded (/MT)

--> Configuration Properties (click arrow to unfold) --> VC++ Directories

--> add to Library Directories:
C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\lib\intel64_win\
C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\compiler\lib\intel64_win\

--> add to Include Directories:
C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\include
OK

select both magma and magma_sparse then select in Build menu: Build Selection --> wait about 4.5 hr


********************
******************** Batch file to compile magma-dense-solver.cu:
********************

set CONFIG_VC="C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\amd64\vcvars64.bat"

set CONFIG_MKL="C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.2.187\windows\mkl\bin\mklvars"

set ARCH_GPU="sm_61"

set CUDA_DIR="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0"

set MAGMA_LIB_DIR="C:\MAGMA\build-ilp64-magma-2.2.0\lib\Release"

set MAGMA_LIB_INCLUDE="C:\MAGMA\ilp64-magma-2.2.0\include"

call %CONFIG_VC%
call %CONFIG_MKL% intel64

:: for multithreaded MKL:
%CUDA_DIR%\bin\nvcc -O -DMKL_ILP64 -DMAGMA_ILP64 -arch %ARCH_GPU% -Xcompiler /I%MAGMA_LIB_INCLUDE% -Xcompiler -F300000000 -Xcompiler -MT -Xcompiler -O2 magma-dense-solver.cu mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib %CUDA_DIR%\lib\x64\cublas.lib %CUDA_DIR%\lib\x64\cusparse.lib %MAGMA_LIB_DIR%\magma.lib -o ..\magma-dense-solver.exe

:: for sequential MKL:
%CUDA_DIR%\bin\nvcc -O -DMKL_ILP64 -DMAGMA_ILP64 -arch %ARCH_GPU% -Xcompiler /I%MAGMA_LIB_INCLUDE% -Xcompiler -F300000000 -Xcompiler -MT -Xcompiler -O2 magma-dense-solver.cu mkl_intel_ilp64.lib mkl_sequential.lib mkl_core.lib %CUDA_DIR%\lib\x64\cublas.lib %CUDA_DIR%\lib\x64\cusparse.lib %MAGMA_LIB_DIR%\magma.lib -o ..\magma-dense-solver.exe

Re: Magma in Windows with ILP64

Postby Manuel__ » Fri Mar 31, 2017 2:33 am

Hi
I have done further experiments with my build of Magma in Windows with ILP64 and sequential MKL.

I called magma_dgesv for 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000 and I repeated this series five times.

For 20000, 30000, 70000, 80000, 90000, 100000, 110000, 120000, the calls were always successful.

But, for 40000, 50000, 60000, only 20% to 30% or so of the calls were successful; the other calls terminated prematurely with -nan(ind) solution values.

This "island of instability" around 40000, 50000, 60000 seems to indicate the presence of a bug. I hope these informations will help you fix it.

For the time being, I can make do with this bug. But I really need to be able to run Magma with multithreaded MKL, in addition to ILP64. I would greatly appreciate your help on this.

Have you figured out whether the error "Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP." comes from the way I build Magma with CMake?

Actually, the fact that simply linking with mkl_intel_thread.lib instead of mkl_sequential.lib produces such an error may itself indicate another bug.

Manuel

Re: Magma in Windows with ILP64

Postby Manuel__ » Mon Apr 10, 2017 1:14 pm

Hello,
In order to be able to specify Pascal for the build, I added the following at the appropriate places in CMakeLists.txt:

.................................

if ( ${GPU_TARGET} MATCHES Pascal )
    set( GPU_TARGET "${GPU_TARGET} sm61" )
endif()

.................................

if ( ${GPU_TARGET} MATCHES sm61 )
    if ( NOT MIN_ARCH )
        set( MIN_ARCH 600 )
    endif()
    set( NV_SM "${NV_SM} -gencode arch=compute_61,code=sm_61" )
    set( NV_COMP "-gencode arch=compute_61,code=compute_61" )
    message( STATUS " compile for CUDA arch 6.1 (Pascal)" )
endif()

.................................

After redoing the CMake build with Fermi Kepler Maxwell Pascal instead of just Fermi Kepler Maxwell, I verified that there is no change in performance (as I had suspected in a previous post).

Is there any news about that "Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP." error that occurs when the code is linked with mkl_intel_thread.lib instead of mkl_sequential.lib?

That's quite a problem for me since I need to measure on my configuration the performance of magma_dgesv as a function of n.

I imagine that it should run faster with multithreaded MKL than with sequential MKL. However, since (as I understand it) MAGMA uses MKL only for the critical path, while the bulk of the matrix is handled by the GPU, the relative gain should become quite small once the overall size of the linear system becomes very large (say n=120,000) and dwarfs the critical-path bottleneck.

I would really like to be able to test these assumptions.

Manuel

Re: Magma in Windows with ILP64

Postby mgates3 » Mon Apr 17, 2017 11:04 am

I haven't been able to reproduce the issues that you are observing. But I also don't have access to the same kind of machine. Here's what I used:

Linux, Magma 2.2.0, Intel MKL 11.3.3, NVIDIA Kepler K40 GPU, compiled with make.inc-examples/make.inc.mkl-icc-ilp64 and $GPU_TARGET = sm35, so it links with:

Code:
-lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lpthread -lstdc++ -lm
-lcublas -lcusparse -lcudart -lcudadevrt


I ran with $MKL_NUM_THREADS = 16, $OMP_NUM_THREADS = 16.

Everything runs fine up to 90k, with multiple runs per size. That's as large as would fit on this machine. For most of these problems, I was running other GPU jobs at the same time, to try to force any race conditions to show themselves. (Which invalidates the performance results, so ignore those.)

I also ran testing_dgesv up to 65k, with similar results. That tester allocates twice as much memory, so that was as large as I could run on this machine.

Were you testing using MAGMA's tester? It would be helpful to have your exact input & output, as I've shown below.

Code:
magma-2.2.0/testing> ./testing_dgetrf -n 100 -n 1000 -n 10000:90000:10000 --niter 5 -c2
% MAGMA 2.2.0  compiled for CUDA capability >= 3.5, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 7050. OpenMP threads 16. MKL 11.3.3, MKL threads 16.
% device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MiB memory, capability 3.5
% Sun Apr 16 22:19:18 2017
% Usage: ./testing_dgetrf [options] [-h|--help]

% ngpu 1, version 1
%   M     N   CPU Gflop/s (sec)   GPU Gflop/s (sec)   |Ax-b|/(N*|A|*|x|)
%========================================================================
  100   100     ---   (  ---  )      0.71 (   0.00)   6.17e-19   ok
  100   100     ---   (  ---  )      2.36 (   0.00)   8.80e-19   ok
  100   100     ---   (  ---  )      2.56 (   0.00)   7.38e-19   ok
  100   100     ---   (  ---  )      2.69 (   0.00)   6.06e-19   ok
  100   100     ---   (  ---  )      2.57 (   0.00)   9.46e-19   ok
 1000  1000     ---   (  ---  )     32.85 (   0.02)   3.31e-19   ok
 1000  1000     ---   (  ---  )     32.84 (   0.02)   3.08e-19   ok
 1000  1000     ---   (  ---  )     36.26 (   0.02)   2.64e-19   ok
 1000  1000     ---   (  ---  )     36.19 (   0.02)   3.27e-19   ok
 1000  1000     ---   (  ---  )     36.26 (   0.02)   3.21e-19   ok
10000 10000     ---   (  ---  )    388.00 (   1.72)   2.01e-19   ok
10000 10000     ---   (  ---  )    357.94 (   1.86)   2.00e-19   ok
10000 10000     ---   (  ---  )    381.89 (   1.75)   1.99e-19   ok
10000 10000     ---   (  ---  )    362.70 (   1.84)   2.05e-19   ok
10000 10000     ---   (  ---  )    316.49 (   2.11)   1.92e-19   ok
20000 20000     ---   (  ---  )    646.07 (   8.25)   1.90e-19   ok
20000 20000     ---   (  ---  )    627.25 (   8.50)   1.80e-19   ok
20000 20000     ---   (  ---  )    613.16 (   8.70)   1.89e-19   ok
20000 20000     ---   (  ---  )    595.11 (   8.96)   1.91e-19   ok
20000 20000     ---   (  ---  )    610.02 (   8.74)   1.84e-19   ok
30000 30000     ---   (  ---  )    688.05 (  26.16)   1.77e-19   ok
30000 30000     ---   (  ---  )    681.08 (  26.43)   1.79e-19   ok
30000 30000     ---   (  ---  )    764.14 (  23.56)   1.79e-19   ok
30000 30000     ---   (  ---  )    758.71 (  23.72)   1.76e-19   ok
30000 30000     ---   (  ---  )    758.23 (  23.74)   1.72e-19   ok
40000 40000     ---   (  ---  )    821.34 (  51.95)   1.74e-19   ok
40000 40000     ---   (  ---  )    791.50 (  53.90)   1.70e-19   ok
40000 40000     ---   (  ---  )    748.07 (  57.03)   1.72e-19   ok
40000 40000     ---   (  ---  )    729.40 (  58.49)   1.71e-19   ok
40000 40000     ---   (  ---  )    719.56 (  59.29)   1.75e-19   ok
50000 50000     ---   (  ---  )    732.70 ( 113.73)   1.62e-19   ok
50000 50000     ---   (  ---  )    800.52 ( 104.10)   1.63e-19   ok
50000 50000     ---   (  ---  )    760.77 ( 109.54)   1.69e-19   ok
50000 50000     ---   (  ---  )    745.74 ( 111.74)   1.65e-19   ok
50000 50000     ---   (  ---  )    757.25 ( 110.05)   1.59e-19   ok
60000 60000     ---   (  ---  )    763.39 ( 188.63)   1.65e-19   ok
60000 60000     ---   (  ---  )    773.26 ( 186.22)   1.68e-19   ok
60000 60000     ---   (  ---  )    764.97 ( 188.24)   1.63e-19   ok
60000 60000     ---   (  ---  )    800.87 ( 179.80)   1.63e-19   ok
60000 60000     ---   (  ---  )    760.76 ( 189.28)   1.63e-19   ok
70000 70000     ---   (  ---  )    824.27 ( 277.41)   1.59e-19   ok
70000 70000     ---   (  ---  )    871.85 ( 262.27)   1.61e-19   ok
70000 70000     ---   (  ---  )    874.31 ( 261.54)   1.62e-19   ok
70000 70000     ---   (  ---  )    882.29 ( 259.17)   1.62e-19   ok
70000 70000     ---   (  ---  )    776.57 ( 294.45)   1.61e-19   ok
80000 80000     ---   (  ---  )    844.85 ( 404.01)   1.56e-19   ok
80000 80000     ---   (  ---  )    902.68 ( 378.13)   1.63e-19   ok
80000 80000     ---   (  ---  )    815.12 ( 418.75)   1.59e-19   ok
80000 80000     ---   (  ---  )    958.85 ( 355.98)   1.62e-19   ok
80000 80000     ---   (  ---  )    886.05 ( 385.23)   1.59e-19   ok
90000 90000     ---   (  ---  )    840.81 ( 578.01)   1.54e-19   ok
90000 90000     ---   (  ---  )    959.19 ( 506.68)   1.56e-19   ok
90000 90000     ---   (  ---  )    904.13 ( 537.53)   1.58e-19   ok
90000 90000     ---   (  ---  )    957.62 ( 507.50)   1.59e-19   ok
90000 90000     ---   (  ---  )    909.66 ( 534.26)   1.60e-19   ok


Code:
magma-2.2.0/testing> ./testing_dgesv -n 100 -n 1000 -n 10000:60000:10000 --niter 5 -c2
% MAGMA 2.2.0  compiled for CUDA capability >= 3.5, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 7050. OpenMP threads 16. MKL 11.3.3, MKL threads 16.
% device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MiB memory, capability 3.5
% Sun Apr 16 11:54:21 2017
% Usage: ./testing_dgesv [options] [-h|--help]

% ngpu 1
%   N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
%===============================================================================
  100     1     ---   (  ---  )      0.25 (   0.00)   4.51e-19   ok
  100     1     ---   (  ---  )      0.43 (   0.00)   6.85e-19   ok
  100     1     ---   (  ---  )      0.45 (   0.00)   1.07e-18   ok
  100     1     ---   (  ---  )      0.45 (   0.00)   7.41e-19   ok
  100     1     ---   (  ---  )      0.55 (   0.00)   8.62e-19   ok
 1000     1     ---   (  ---  )     37.11 (   0.02)   3.17e-19   ok
 1000     1     ---   (  ---  )     38.17 (   0.02)   2.89e-19   ok
 1000     1     ---   (  ---  )     38.42 (   0.02)   2.60e-19   ok
 1000     1     ---   (  ---  )     38.47 (   0.02)   2.85e-19   ok
 1000     1     ---   (  ---  )     38.43 (   0.02)   3.13e-19   ok
10000     1     ---   (  ---  )    523.38 (   1.27)   2.29e-19   ok
10000     1     ---   (  ---  )    525.57 (   1.27)   1.86e-19   ok
10000     1     ---   (  ---  )    523.51 (   1.27)   2.30e-19   ok
10000     1     ---   (  ---  )    525.72 (   1.27)   2.40e-19   ok
10000     1     ---   (  ---  )    519.88 (   1.28)   2.30e-19   ok
20000     1     ---   (  ---  )    724.71 (   7.36)   2.11e-19   ok
20000     1     ---   (  ---  )    724.95 (   7.36)   2.35e-19   ok
20000     1     ---   (  ---  )    725.05 (   7.36)   1.97e-19   ok
20000     1     ---   (  ---  )    724.78 (   7.36)   1.98e-19   ok
20000     1     ---   (  ---  )    725.19 (   7.36)   2.31e-19   ok
30000     1     ---   (  ---  )    800.79 (  22.48)   1.76e-19   ok
30000     1     ---   (  ---  )    796.78 (  22.59)   2.16e-19   ok
30000     1     ---   (  ---  )    795.85 (  22.62)   1.86e-19   ok
30000     1     ---   (  ---  )    796.13 (  22.61)   1.90e-19   ok
30000     1     ---   (  ---  )    796.89 (  22.59)   1.68e-19   ok
40000     1     ---   (  ---  )    841.67 (  50.70)   2.82e-19   ok
40000     1     ---   (  ---  )    815.48 (  52.32)   2.21e-19   ok
40000     1     ---   (  ---  )    811.93 (  52.55)   2.45e-19   ok
40000     1     ---   (  ---  )    812.78 (  52.50)   2.48e-19   ok
40000     1     ---   (  ---  )    812.00 (  52.55)   2.14e-19   ok
50000     1     ---   (  ---  )    878.03 (  94.91)   2.37e-19   ok
50000     1     ---   (  ---  )    857.82 (  97.15)   2.27e-19   ok
50000     1     ---   (  ---  )    858.44 (  97.08)   2.60e-19   ok
50000     1     ---   (  ---  )    854.84 (  97.49)   2.62e-19   ok
50000     1     ---   (  ---  )    855.76 (  97.38)   2.33e-19   ok
60000     1     ---   (  ---  )    851.21 ( 169.18)   2.27e-19   ok
60000     1     ---   (  ---  )    832.86 ( 172.91)   2.20e-19   ok
60000     1     ---   (  ---  )    840.30 ( 171.37)   2.03e-19   ok
60000     1     ---   (  ---  )    910.66 ( 158.13)   2.56e-19   ok
60000     1     ---   (  ---  )    875.94 ( 164.40)   1.98e-19   ok


Incidentally, when I set MKL_NUM_THREADS = 1, it does run somewhat slower. Your understanding is mostly correct regarding the critical path, but as the trailing matrix gets smaller during a factorization, it can no longer overlap and hide the CPU panel operations completely. The faster the GPU is compared to the CPU, the more this will affect results.

Code:
magma-2.2.0/testing> ./testing_dgetrf -n 100 -n 1000 -n 10000:50000:10000 -c2 > getrf-1core.txt
% MAGMA 2.2.0  compiled for CUDA capability >= 3.5, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 7050. OpenMP threads 1. MKL 11.3.3, MKL threads 1.
% device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MiB memory, capability 3.5
% Mon Apr 17 10:51:52 2017
% Usage: ./testing_dgetrf [options] [-h|--help]

% ngpu 1, version 1
%   M     N   CPU Gflop/s (sec)   GPU Gflop/s (sec)   |Ax-b|/(N*|A|*|x|)
%========================================================================
  100   100     ---   (  ---  )      1.37 (   0.00)   6.05e-19   ok
 1000  1000     ---   (  ---  )     32.27 (   0.02)   2.78e-19   ok
10000 10000     ---   (  ---  )    243.91 (   2.73)   2.11e-19   ok
20000 20000     ---   (  ---  )    489.33 (  10.90)   1.86e-19   ok
30000 30000     ---   (  ---  )    679.76 (  26.48)   1.78e-19   ok
40000 40000     ---   (  ---  )    691.24 (  61.72)   1.68e-19   ok
50000 50000     ---   (  ---  )    706.35 ( 117.97)   1.67e-19   ok
