Magma 0.2 with Nvidia Fermi GTX470 and GotoBLAS2 results

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
Post by Allan Menezes » Sat Apr 17, 2010 1:52 pm

Dear All,
I tried MAGMA 0.2, which I recompiled for Fedora Core 12 x86_64, with the NVIDIA Fermi GTX 470. Below are the make.inc.goto I used and the results from the testing directory.
Note that the location of libgoto2.a is hard-coded in the following make.inc.goto (the -L/bummer/GotoBLAS2 entry in LIBDIR). You will have to modify that line for your own installation. I used GotoBLAS2 version 1.08,
and not the fatal prime revision level 1.13!

Code:

#//////////////////////////////////////////////////////////////////////////////
#   -- MAGMA (version 0.2) --
#      Univ. of Tennessee, Knoxville
#      Univ. of California, Berkeley
#      Univ. of Colorado, Denver
#      November 2009
#
#      Contributed by: Allan Menezes (Ontario, Canada)
#//////////////////////////////////////////////////////////////////////////////

CC        = gcc
NVCC      = nvcc
FORT      = gfortran

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

OPTS      = -O3 -DADD_
# -arch sm_20 targets Fermi-class GPUs such as the GTX 470
NVOPTS    = --compiler-options -fno-inline \
            --compiler-options -fno-strict-aliasing \
            -arch sm_20 -DUNIX -O3
LDOPTS    = -fPIC

LIB       = -lgoto2 -lpthread -lcublas -lcudart -llapack -lm

CUDADIR   = /usr/local/cuda

# The GotoBLAS2 location (/bummer/GotoBLAS2) is hard-coded; change it for your installation
LIBDIR    = -L/bummer/GotoBLAS2 -L$(CUDADIR)/lib64 -L/usr/lib64
INC       = -I../include -I$(CUDADIR)/include

LIBMAGMA     = ../lib/libmagma.a
LIBMAGMABLAS = ../lib/libmagmablas.a
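Since the library locations above are specific to my machine, it may help to check that each -l entry in LIB can actually be found in one of the LIBDIR directories before running make. The following is only a rough sketch: the helper names find_library and check_libraries are made up for illustration, the paths are simply the ones from the make.inc.goto above, and the lookup is a simplified imitation of how a linker resolves -l<name> (lib<name>.so first, then lib<name>.a, in each -L directory in order). System libraries such as pthread and m are left out.

```python
import os


def find_library(name, libdirs):
    """Return the first lib<name>.so or lib<name>.a found in libdirs, else None.

    Simplified imitation of linker lookup for -l<name>; real linkers also
    consult default system directories, which this sketch ignores.
    """
    for directory in libdirs:
        for suffix in (".so", ".a"):
            candidate = os.path.join(directory, "lib" + name + suffix)
            if os.path.isfile(candidate):
                return candidate
    return None


def check_libraries(names, libdirs):
    """Map each library name to the path where it was found (or None)."""
    return {name: find_library(name, libdirs) for name in names}


if __name__ == "__main__":
    # Paths copied from the make.inc.goto above; adjust for your installation.
    libdirs = ["/bummer/GotoBLAS2", "/usr/local/cuda/lib64", "/usr/lib64"]
    names = ["goto2", "cublas", "cudart", "lapack"]
    for name, path in check_libraries(names, libdirs).items():
        print("-l{}: {}".format(name, path or "NOT FOUND"))
```

Any entry reported as NOT FOUND points at a LIBDIR path that needs changing, most likely the GotoBLAS2 one.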
The results from the testing directory for the GTX 470 are displayed below:

Code:

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_cgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     36.50         111.93        2.181499e-06
 2048     46.43         178.75        2.763979e-06
 3072     51.56         218.36        3.224755e-06
 4032     55.16         234.06        4.556965e-06
 5184     55.70         245.28        5.060306e-06
 6016     55.33         251.24        4.582116e-06
 7040     55.39         254.75        4.619145e-06
 8064     55.40         256.17        5.499504e-06
 9088     55.64         261.69        5.515256e-06
10112     56.14         266.43        5.179223e-06
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_cgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     37.80         128.12        1.023109e-06
 2048     46.48         190.43        2.401207e-06
 3072     51.28         228.74        2.559615e-06
 4032     55.21         243.32        1.957078e-06
 5184     55.77         250.11        2.122840e-06
 6016     55.77         255.67        2.449219e-06
 7040     55.66         258.14        2.591782e-06
 8064     55.77         259.01        2.737253e-06
 9088     56.02         266.85        2.923932e-06
10112     56.54         270.95        3.040652e-06
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_cgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     31.42          83.22         6.999727e-09
 2048     54.95         140.59         7.347308e-09
 3072     63.76         170.82         7.411954e-09
 4032     66.36         192.51         7.398736e-09
 5184     68.36         209.48         7.359929e-09
 6016     69.48         217.40         8.188616e-09
 7040     70.65         227.44         9.392204e-09
 8064     71.62         234.23         1.037497e-08
 9088     72.45         239.14         1.169262e-08
10112     73.21         244.07         1.234344e-08
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_cgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     31.59          99.54         6.999727e-09
 2048     54.92         162.54         7.347308e-09
 3072     63.80         192.46         7.411954e-09
 4032     66.34         211.74         7.398736e-09
 5184     68.20         226.98         7.359929e-09
 6016     58.58         195.69         8.188616e-09
 7040     70.66         243.19         9.392204e-09
 8064     71.65         248.47         1.037497e-08
 9088     72.47         252.31         1.169262e-08
10112     73.19         256.31         1.234344e-08
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_cpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     42.01          55.58        3.142203e-08
 2048     54.50         100.90        3.059146e-08
 3072     62.13         130.76        2.495757e-08
 4032     66.81         152.90        2.337698e-08
 5184     69.53         172.24        3.669203e-08
 6048     71.10         183.21        3.505888e-08
 7200     71.81         194.28        3.032790e-08
 8064     73.13         202.82        2.819979e-08
 8928     74.01         209.44        3.803236e-08
10080     74.68         217.62        3.449032e-08
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_cpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     42.14          63.60        2.622518e-08
 2048     54.32         112.59        2.492308e-08
 3072     62.17         145.93        2.481157e-08
 4032     66.76         169.86        2.620765e-08
 5184     69.55         190.25        2.439569e-08
 6048     71.08         202.95        2.558894e-08
 7200     29.30         200.59        2.507422e-08
 8064     22.38         198.57        2.604528e-08
 8928     12.26         181.75        2.590413e-08
10080      7.67         171.67        2.688114e-08
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dgehrd -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||A-QHQ'|| / ||A||
========================================================
 1024      4.44          13.01        1.033019e-14
 2048      5.24          28.14        2.041184e-14
 3072      5.47          37.48        3.014136e-14
 4032      5.76          43.04        3.655350e-14
 5184      5.87          47.42        4.374540e-14
 6016      5.97          49.41        5.527769e-14
 7040      6.06          52.31        6.843701e-14
 8064      6.09          50.92        8.011122e-14
 9088      6.13          52.28        9.053372e-14
10112      6.15          52.86        1.020150e-13
This is an Experimental Release of GEMM Routine without Padding

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage:
		./testing_dgemm N

    N		magmablas0.2 GFLops/s	cudablas-2.3 GFlops/s    error
=============================================================================
  512		87.495259452412		88.621807857379		0.000000e+00
  513		98.905272527473		75.845897191011		0.000000e+00
 1024		129.506914003136		130.095332162113		0.000000e+00
 1025		115.372897471609		78.648210699288		0.000000e+00
 1536		131.796576083794		124.845097874393		0.000000e+00
 1537		118.730642806926		80.076772922249		0.000000e+00
 2048		132.322823812128		132.793312236711		0.000000e+00
 2049		120.588236970479		81.342356997646		0.000000e+00
 2560		132.439855381361		132.915685940527		0.000000e+00
 2561		121.516228543524		80.982599883807		0.000000e+00
 3072		132.623483212220		126.676101485846		0.000000e+00
 3073		122.216840851325		81.282141104140		0.000000e+00
 3584		132.827109486398		133.288593808177		0.000000e+00
 3585		122.623658664786		81.031689917914		0.000000e+00
 4096		132.759448433612		133.246615217415		0.000000e+00
 4097		122.888285506489		81.512556336204		0.000000e+00
 4608		132.815062477348		126.900248836469		0.000000e+00
 4609		123.219745840591		81.119433330585		0.000000e+00
 5120		132.878051484929		113.454911855825		0.000000e+00
 5121		123.434462451430		81.831506989988		0.000000e+00
SYMV Double Precision

Usage
		 testing_dgemv N

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.34	       0.34		        0
  128        1.02	       0.96		        0
  192        1.76	       1.68		        0
  256        2.43	       2.38		        0
  320        2.28	       2.18		        0
  384        2.76	       2.68		        0
  448        3.26	       3.19		        0
  512        3.74	       3.67		        0
  576        4.23	       4.12		        0
  704        5.27	       5.14		        0
  832        6.29	       6.13		        0
  960        7.29	       7.09		        0
 1088        8.19	       8.05		        0
 1216        9.18	       9.04		        0
 1408       10.57	      10.60		        0
 1600       11.77	      11.91		        0
 1792       13.24	      13.24		        0
 1984       14.58	      14.69		        0
 2240       16.29	      16.34		        0
 2496       18.03	      17.90		        0
 2816       19.82	      19.82		        0
 3136       20.86	      21.38		        0
 3520       22.80	      23.27		        0
 3904       24.52	      24.88		        0
 4352       25.96	      26.42		        0
 4800       27.12	      27.44		        0
 5312       28.12	      28.09		        0
 5888       28.23	      28.97		        0
 6528       29.31	      29.52		        0
 7232       28.89	      19.77		        0
 8000       29.77	      21.48		        0
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dgeqlf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     11.89          44.90        1.692476e-15
 2048     14.29          67.30        2.412781e-15
 3072     16.39          74.41        2.873908e-15
 4032     18.16          76.12        2.933770e-15
 5184     18.64          79.86        3.183846e-15
 6016     18.76          81.25        3.638615e-15
 7040     19.38          81.90        4.039368e-15
 8064     19.43          82.65        4.212756e-15
 9088     19.65          84.56        4.495418e-15
10112     19.92          84.70        4.804705e-15
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024      9.37          50.11        1.959699e-15
 2048     14.76          72.04        2.642956e-15
 3072     17.10          78.26        3.271786e-15
 4032     18.78          79.97        3.356442e-15
 5184     19.13          83.80        3.752684e-15
 6016     19.11          84.36        4.070131e-15
 7040     18.91          85.17        4.403128e-15
 8064     19.48          85.00        8.071775e-14
 9088     19.91          86.54        5.335508e-15
10112     20.19          86.81        5.304265e-15
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dgeqrs_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    || b-Ax || / ||A||
========================================================
 1024     11.46          18.10        8.123717e-16
 2048     14.57          68.70        9.930360e-15
 3072     16.75          76.42        1.639680e-14
 4032     18.91          77.09        3.220842e-15
 5184     19.19          82.33        2.035707e-15
 6016     19.19          82.20        5.951416e-15
 7040     19.58          83.66        4.714261e-15
 8064     19.72          83.30        1.597581e-14
 9088     20.03          85.85        3.800731e-15
10112     20.27          85.73        2.706913e-15
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dgesv -N 1024



  N            GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024              17.89        4.674082e-16
 2048              73.42        4.412873e-15
 3072              91.04        9.000011e-15
 4032             100.14        1.354061e-15
 5184             107.72        1.173067e-15
 6016             111.54        3.439470e-15
 7040             115.01        2.687958e-15
 8064             117.33        5.400273e-15
 9088             119.30        1.662879e-15
10112             120.82        1.550303e-15
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     21.52          30.99         3.514640e-18
 2048     28.01          58.60         3.258964e-18
 3072     30.92          75.15         2.966111e-18
 4032     32.42          85.09         3.348630e-18
 5184     33.55          93.63         3.333262e-18
 6016     34.10          98.19         2.826022e-18
 7040     34.67         102.60         2.802706e-18
 8064     35.03         105.88         2.761636e-18
 9088     35.41         108.70         2.752465e-18
10112     35.76         110.93         2.726653e-18
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     21.62          40.91         3.514640e-18
 2048     28.07          75.82         3.258964e-18
 3072     30.92          93.30         2.966111e-18
 4032     32.42         101.83         3.348630e-18
 5184     33.58         108.97         3.333262e-18
 6016     34.15         112.71         2.826022e-18
 7040     34.73         115.95         2.802706e-18
 8064     35.10         118.14         2.761636e-18
 9088     35.44         120.06         2.752465e-18
10112     35.81         121.47         2.726653e-18
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     22.24          32.39        4.368765e-17
 2048     29.41          54.35        5.255033e-17
 3072     32.94          68.23        6.129227e-17
 4032     34.63          76.24        6.249455e-17
 5184     35.85          86.17        6.400078e-17
 6144     36.54          91.53        6.514027e-17
 6912     36.86          95.21        6.548325e-17
 8192     37.46          98.36        6.854160e-17
 8960     37.52          99.77        6.936968e-17
 9984     37.78         102.23        7.147590e-17
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     22.07          44.01        4.368765e-17
 2048     29.02          73.15        5.255033e-17
 3072     30.23          83.53        6.129227e-17
 4032     34.49          90.55        6.249455e-17
 5184     35.61         100.20        6.400078e-17
 6144     36.41         104.49        6.514027e-17
 6912     36.70         106.63        6.548325e-17
 8192     37.32         108.58        6.854160e-17
 8960     37.34         110.63        6.936968e-17
 9984     37.55         111.84        7.147590e-17
Iterative Refinement- QR 

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_dsgeqrsv_gpu -N 1024



           CPU GFlop/s                 GPU GFlop/s   
  N          Doule           Double	Single	 Mixed    || b-Ax || / ||A||
=========================================================================================
 1024 	   11.45	    44.34	 60.64	 12.02  	 5.555543e-16  2 
 2048 	   14.62	    68.22	118.94	 95.54  	 6.616966e-15  3 
 3072 	   16.86	    73.67	176.15	154.06  	 7.447985e-14  3 
 4032 	   18.96	    77.00	183.96	158.79  	 1.186839e-15  5 
 5184 	   19.35	    79.62	207.41	199.02  	 8.647975e-15  2 
 6016 	   19.20	    80.35	215.42	207.75  	 9.182751e-14  2 
 7040 	   19.78	    83.38	220.00	212.85  	 1.021954e-13  2 
 8000 	   19.87	    84.63	225.21	220.63  	 nan  1 
Iterative Refinement- LU 

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage:
		 ./testing_dsgesv N

Epsilon(Double): 0.00000000000000011102 
Epsilon(Single): 0.00000005960464477539


N	Double-Factor	Double-Solve	Single-Factor	Sigle-Solve	Mixed Precision Solver	 || b-Ax || / ||A||  	 NumIter
===========================================================================================================================================================
 1024	 40.63		 38.34		 60.97		 57.57			 43.68			4.854485e-16	  2
 2048	 75.88		 73.28		126.34		122.18			 99.83			3.058204e-15	  3
 3072	 93.23		 91.24		169.73		165.92			143.95			9.219986e-15	  3
 4032	101.77		100.07		196.95		193.43			174.12			1.260598e-14	  3
 5184	108.92		107.64		221.36		218.80			201.54			2.280617e-16	  3
 6016	112.75		111.61		235.93		233.47			217.56			4.468209e-15	  3
 7040	115.93		114.99		248.62		246.45			229.51			1.037881e-15	  4
 8064	118.15		117.31		257.57		255.65			237.37			4.916595e-16	  4
Iterative Refinement- Cholesky 

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage:
		 ./testing_dsposv -N 1024

Epsilon(Double): 0.00000000000000011102 
Epsilon(Single): 0.00000005960464477539


N	Double-Factor	Double-Solve	Single-Factor	Sigle-Solve	Mixed Precision Solver	 || b-Ax || / ||A||  	 NumIter
===============================================================================================================================================================================
 1024 	 43.55		 39.24		 86.06 		 72.60		 46.66			6.319456e-19	  2
 2048 	 69.10		 65.04		160.43 		146.90		114.21			6.472202e-19	  2
 3072 	 84.07		 80.60		203.81 		192.97		159.96			7.352729e-19	  2
 4032 	 90.49		 87.98		236.35 		223.93		193.59			7.035593e-19	  2
 5184 	 99.87		 98.28		251.13 		243.51		216.65			7.499680e-19	  2
 6016 	102.90		101.58		263.54 		256.40		230.72			8.151097e-19	  2
 7040 	105.75		104.06		271.00 		265.25		243.64			6.619553e-19	  2
 8064 	109.31		107.68		278.78 		272.46		252.88			8.569304e-19	  2
SYMV Double Precision

Usage
		 testing_dsymv N

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.30	       0.41		        0
  128        0.76	       1.42		        0
  192        1.13	       2.84		        0
  256        1.56	       4.23		        0
  320        1.97	       5.69		        0
  384        2.22	       7.19		        0
  448        2.48	       8.54		        0
  512        2.76	       9.36		        0
  576        2.99	      11.44		        0
  704        3.26	      14.58		        0
  832        3.75	      15.73		        0
  960        3.86	      18.43		        0
 1088        4.27	      20.06		        0
 1216        4.39	      22.07		        0
 1408        3.65	      21.43		        0
 1600        3.55	      24.15		        0
 1792        3.78	      26.00		        0
 1984        3.97	      25.15		        0
 2240        4.26	      27.72		        0
 2496        4.54	      22.74		        0
 2816        4.42	      25.17		        0
 3136        4.40	      25.54		        0
 3520        4.59	      26.22		        0
 3904        4.87	      26.19		        0
 4352        4.66	      28.52		        0
 4800        4.78	      26.20		        0
 5312        5.00	      27.38		        0
 5888        4.84	      28.32		        0
 6528        4.96	      28.97		        0
 7232        4.89	      26.88		        0
 8000        4.98	      28.22		        0
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgehrd -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||A-QHQ'|| / ||A||
========================================================
 1024      9.26          23.06        5.627424e-06
 2048     10.38          51.78        1.071559e-05
 3072     11.27          74.07        1.585271e-05
 4032     12.32          87.28        1.977225e-05
 5184     12.56         103.49        2.388178e-05
 6016     12.69         110.56        3.007145e-05
 7040     12.87         117.60        3.626539e-05
 8064     13.05         113.27        4.308553e-05
 9088     13.27         115.97        4.882232e-05
10112     13.35         116.96        5.359977e-05
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgelqf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024      9.67          59.95        1.187897e-06
 2048     13.02         113.57        2.131274e-06
 3072     16.50         164.73        1.629596e-06
 4032     25.37         173.60        1.885469e-06
 5184     27.76         196.56        2.058025e-06
 6016     25.68         204.90        2.222242e-06
 7040     26.05         210.13        2.459177e-06
 8064     26.02         209.25        2.677387e-06
 9088     27.00         219.53        2.847114e-06
10112     27.36         223.11        3.564045e-06
This is an Experimental Release of GEMM Routine without Padding

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage:
		./testing_sgemm N

    N		magmablas0.2 GFLops/s	cudablas-2.3 GFlops/s    error
=============================================================================
  512		282.8614		291.4609		0.000000e+00
  513		231.7694		201.5010		0.000000e+00
 1024		324.9332		338.6664		0.000000e+00
 1025		278.3742		233.0932		0.000000e+00
 1536		335.5443		346.4345		0.000000e+00
 1537		289.9318		239.0678		0.000000e+00
 2048		337.7808		348.3135		0.000000e+00
 2049		294.3298		239.4412		0.000000e+00
 2560		338.1004		481.6126		0.000000e+00
 2561		297.3452		243.6591		0.000000e+00
 3072		338.0602		349.8797		0.000000e+00
 3073		297.8834		244.5856		0.000000e+00
 3584		337.9794		349.5612		0.000000e+00
 3585		299.4270		241.4427		0.000000e+00
 4096		337.8256		350.2049		0.000000e+00
 4097		299.8963		245.0743		0.000000e+00
 4608		337.7648		349.3987		0.000000e+00
 4609		300.5507		241.5184		0.000000e+00
 5120		337.5137		475.7731		0.000000e+00
 5121		300.4340		241.6148		0.000000e+00
SYMV Sinlge Precision

Usage
		 testing_sgemv N

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.39       0.43		        0
  128        1.26       1.37		        0
  192        2.23       2.46		        0
  256        3.20       3.64		        0
  320        4.27       5.00		        0
  384        5.36       6.41		        0
  448        4.27       5.65		        0
  512        4.95       6.55		        0
  576        5.58       7.46		        0
  704        6.93       9.26		        0
  832        8.34      11.16		        0
  960        9.70      12.98		        0
 1088       10.91      14.80		        0
 1216       12.43      16.71		        0
 1408       14.47      19.44		        0
 1600       16.36      22.07		        0
 1792       18.46      24.42		        0
 1984       20.40      27.05		        0
 2240       22.81      30.14		        0
 2496       25.53      33.32		        0
 2816       28.22      36.88		        0
 3136       30.97      39.82		        0
 3520       34.42      43.63		        0
 3904       37.82      45.56		        0
 4352       41.13      50.04		        0
 4800       43.93      51.37		        0
 5312       47.03      53.39		        0
 5888       49.78      56.01		        0
 6528       53.00      57.05		        0
 7232       55.00      38.61		        0
 8000       56.44      41.60		        0
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgeqlf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     19.44          64.44        1.125170e-06
 2048     24.77         144.90        1.338669e-06
 3072     28.32         175.35        1.474377e-06
 4032     34.71         180.29        1.622605e-06
 5184     35.88         206.65        1.740285e-06
 6016     34.23         213.08        1.913730e-06
 7040     34.64         217.16        2.637886e-06
 8064     34.41         215.06        2.256273e-06
 9088     35.28         223.63        2.377706e-06
10112     35.38         228.77        2.507514e-06
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     12.71          68.69        1.017220e-06
 2048     24.77         106.50        1.456536e-06
 3072     29.14         187.60        1.708441e-06
 4032     35.66         188.93        1.863337e-06
 5184     36.54         214.57        2.029180e-06
 6016     35.08         220.63        1.398288e-05
 7040     35.55         224.67        2.542552e-06
 8064     35.81         221.56        4.327869e-05
 9088     36.47         231.36        2.765864e-06
10112     36.66         234.06        2.806814e-06
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     20.31          67.15        3.451768e-01
 2048     25.58         129.45        3.397885e-01
 3072     29.29         183.56        3.547670e-01
 4032     36.16         190.09        3.540365e-01
 5184     37.12         211.29        3.732369e-01
 6016     35.47         218.60        3.738632e-01
 7040     35.51         766.10        1.049284e+00
 8064     35.76         911.69        1.359362e+00
 9088     36.31         931.08        1.576229e+00
10112     36.62         784.19        1.747851e+00
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgeqrs_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    || b-Ax || / ||A||
========================================================
 1024     18.67          60.44        9.912942e-07
 2048     24.25         119.77        7.115094e-06
 3072     28.42         177.13        1.680525e-05
 4032     34.89         184.41        4.287619e-05
 5184     36.44         208.36        1.129348e-06
 6016     34.89         215.91        3.104939e-06
 7040     35.32         220.43        2.191862e-06
 8064     35.46         218.31        2.264476e-05
 9088     36.16         228.50        2.229614e-06
10112     36.46         231.59        1.558051e-06
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgesv -N 1024



  N            GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024              21.97        2.606234e-07
 2048             132.57        2.100937e-06
 3072             190.06        4.839033e-06
 4032             220.26        8.308236e-07
 5184             246.60        6.308150e-07
 6016             261.09        1.857649e-06
 7040             272.15        1.403984e-06
 8064             277.46        2.878860e-06
 9088             285.19        1.073657e-06
10112             289.90        7.553145e-07
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.48          47.34         1.895988e-09
 2048     47.20         110.54         1.774122e-09
 3072     55.86         157.02         1.715500e-09
 4032     61.06         185.45         1.804801e-09
 5184     63.95         211.50         1.798921e-09
 6016     65.40         225.40         1.675016e-09
 7040     66.91         238.37         1.659101e-09
 8064     68.19         246.43         1.770623e-09
 9088     68.90         255.79         1.981117e-09
10112     69.75         262.25         2.168543e-09
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     27.34          58.41         1.895988e-09
 2048     47.18         139.03         1.774122e-09
 3072     55.94         196.13         1.715500e-09
 4032     61.11         225.97         1.804801e-09
 5184     64.11         250.61         1.798921e-09
 6016     65.45         264.62         1.675016e-09
 7040     67.00         275.35         1.659101e-09
 8064     68.20         280.27         1.770623e-09
 9088     68.95         287.74         1.981117e-09
10112     69.84         292.34         2.168543e-09
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_sgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     27.28          61.22         1.838853e-09
 2048     47.26         126.08         1.749864e-09
 3072     55.99         169.39         1.699750e-09
 4032     61.11         196.44         1.793962e-09
 5184     63.61         221.62         1.803271e-09
 6016     65.45         235.98         1.655473e-09
 7040     66.87         248.56         1.652063e-09
 8064     68.20         257.26         1.761585e-09
 9088     68.94         265.47         1.984430e-09
10112     69.79         272.07         2.142862e-09
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_spotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     34.39          62.78        1.773349e-08
 2048     45.09         121.45        2.295336e-08
 3072     56.14         159.63        2.736194e-08
 4032     61.52         190.96        3.472694e-08
 5184     65.33         211.82        3.655633e-08
 6048     67.36         223.59        3.691891e-08
 7200     69.34         237.23        3.886026e-08
 8064     70.61         244.79        3.935199e-08
 8928     71.50         251.59        4.046260e-08
10080     72.24         258.25        4.227979e-08
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_spotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     34.52          83.72        1.773349e-08
 2048     45.08         159.27        2.295336e-08
 3072     56.24         206.71        2.736194e-08
 4032     61.54         237.60        3.472694e-08
 5184     65.44         251.24        3.655633e-08
 6048     67.42         262.63        3.691891e-08
 7200     69.35         274.40        3.886026e-08
 8064     70.67         278.48        3.935199e-08
 8928     71.56         285.03        4.046260e-08
10080     72.27         289.72        4.227979e-08
SYMV Sinlge Precision

Usage
		 testing_ssymv N

device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.11       0.43		        0
  128        0.23       1.56		        0
  192        0.34       3.07		        0
  256        0.47       5.04		        0
  320        0.55       7.06		        0
  384        0.67       9.51		        0
  448        0.77      11.81		        0
  512        0.95      14.17		        0
  576        0.96      15.80		        0
  704        1.19      19.82		        0
  832        1.39      22.70		        0
  960        1.64      27.11		        0
 1088        1.83      29.97		        0
 1216        1.97      32.86		        0
 1408        2.17      35.40		        0
 1600        2.27      40.00		        0
 1792        2.46      43.99		        0
 1984        2.60      47.42		        0
 2240        2.82      51.20		        0
 2496        3.04      38.58		        0
 2816        3.25      42.86		        0
 3136        3.35      45.11		        0
 3520        3.50      49.66		        0
 3904        3.65      51.15		        0
 4352        3.88      54.74		        0
 4800        3.92      48.00		        0
 5312        3.93      49.81		        0
 5888        4.04      54.51		        0
 6528        4.14      57.05		        0
 7232        4.15      51.48		        0
 8000        4.16      53.02		        0
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_zgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     22.86          56.20        3.392327e-15
 2048     27.36          72.03        4.522000e-15
 3072     29.05          75.70        5.990881e-15
 4032     29.67          75.66        8.269346e-15
 5184     30.27          80.70        9.103060e-15
 6016     30.66          83.31        8.365426e-15
 7040     30.89          86.44        9.857029e-15
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_zgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     22.44          59.00        2.610599e-15
 2048     27.06          73.10        3.576645e-15
 3072     29.12          77.27        4.165347e-15
 4032     29.68          76.74        4.900768e-15
 5184     30.29          81.51        5.629466e-15
 6016     30.61          84.73        8.115494e-15
 7040     30.83          87.30        6.216535e-15
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_zgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.54          39.44         1.074309e-17
 2048     33.16          68.37         1.088811e-17
 3072     35.20          84.17         1.080715e-17
 4032     36.20          93.07         1.072353e-17
 5184     36.93         100.58         1.056042e-17
 6016     37.41         104.18         1.019869e-17
 7040     37.73         107.92         1.018500e-17
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_zgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.73          46.89         1.927441e-17
 2048     33.10          78.82         1.740228e-17
 3072     35.31          94.59         1.553913e-17
 4032     36.30         102.69         1.492815e-17
 5184     36.91         108.89         1.416735e-17
 6016     37.43         112.13         1.359233e-17
 7040     37.77         114.76         1.304601e-17
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_zpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     26.57          25.74        5.541044e-17
 2048     34.20          41.33        5.350120e-17
 3072     36.19          50.52        4.189603e-17
 4032     37.04          56.24        3.490686e-17
 5184     37.80          61.01        5.823441e-17
 6048     38.13          63.37        5.342887e-17
 7200     38.43          66.12        4.425605e-17
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.2 MB memory

Usage: 
  testing_zpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     28.13          27.90        4.521316e-17
 2048     34.14          45.24        4.509872e-17
 3072     36.29          54.87        4.351074e-17
 4032     37.33          60.65        4.324125e-17
 5184     37.87          64.95        4.082688e-17
 6048     38.30          67.04        3.924860e-17
 7200     38.48          69.45        3.899940e-17
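For anyone who wants to compare the CPU and GPU columns without eyeballing every row, here is a minimal Python sketch. It is not part of MAGMA: the function names (`parse_rows`, `speedups`, `crossover`) are made up for illustration, and the data is simply the testing_zpotrf_gpu table above pasted in as a string.

```python
# Rows copied from the testing_zpotrf_gpu table above (GTX 470).
# Columns: N, CPU GFlop/s, GPU GFlop/s, residual.
TABLE = """\
 1024     28.13          27.90        4.521316e-17
 2048     34.14          45.24        4.509872e-17
 3072     36.29          54.87        4.351074e-17
 4032     37.33          60.65        4.324125e-17
 5184     37.87          64.95        4.082688e-17
 6048     38.30          67.04        3.924860e-17
 7200     38.48          69.45        3.899940e-17
"""


def parse_rows(text):
    """Turn whitespace-separated table lines into (n, cpu, gpu, resid) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip headers, rulers and blank lines
        try:
            rows.append((int(fields[0]), float(fields[1]),
                         float(fields[2]), float(fields[3])))
        except ValueError:
            continue  # a 4-field line that is not numeric data
    return rows


def speedups(rows):
    """Return (n, gpu_rate / cpu_rate) for every row."""
    return [(n, gpu / cpu) for n, cpu, gpu, _ in rows]


def crossover(rows):
    """Smallest N at which the GPU rate exceeds the CPU rate, or None."""
    for n, cpu, gpu, _ in rows:
        if gpu > cpu:
            return n
    return None


if __name__ == "__main__":
    rows = parse_rows(TABLE)
    for n, s in speedups(rows):
        print(f"N={n:5d}  GPU/CPU = {s:.2f}x")
    print("GPU overtakes CPU at N =", crossover(rows))
```

With these rows the GPU first beats the CPU at N = 2048 and reaches roughly 1.8x at N = 7200. Any of the other four-column tables in this post can be pasted into `TABLE` the same way.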
The Makefile in the libmagmablas directory of the source release magma-0.2s.tar.gz (copyright Dr. Stan Tomov) has been modified by me to target CUDA devices of compute capability 2.0 and below, as has the make.inc.goto above.
In addition, make.inc.goto now includes --compiler-options -fno-inline in NVOPTS, as suggested on an NVIDIA forum, so that magma can be compiled with gcc versions >= 4.3, including 4.4.3, with just make!
Here is the modified Makefile:

Code: Select all

#//////////////////////////////////////////////////////////////////////////////
#   -- MAGMA (version 0.2) --
#      Univ. of Tennessee
#      Univ. of California Berkeley
#      November 2009
#//////////////////////////////////////////////////////////////////////////////

    include ../make.inc

    ALLSRC = sinplace_transpose.cu \
             stranspose.cu \
             spermute.cu \
             sdlaswp.cu \
             \
             sauxiliary.cu \
             dauxiliary.cu \
             \
             dinplace_transpose.cu \
             dtranspose.cu \
             dpermute.cu \
             \
             cinplace_transpose.cu \
             ctranspose.cu \
             cpermute.cu \
             \
             zinplace_transpose.cu \
             ztranspose.cu \
             zpermute.cu \
             ztrsm.cu \
             ztrmm.cu \
             zherk.cu \
             \
             ctrsm.cu \
             ctrmm.cu \
             csyrk.cu \
             cherk.cu \
             \
             sgemv.cu \
             dgemv.cu \
             gemv32.cu \
             \
             magma_dlacpy.cu \
             magma_dgemv_MLU.cu \
             magma_dlag2s.cu \
             magma_dlange.cu \
             magma_dlansy.cu \
             magma_dlat2s.cu \
             magma_dsymv.cu \
             magma_ssymv.cu \
             magma_sdaxpycp.cu \
             magma_slag2d.cu \
             magma_strsm.cu \
             magma_dtrsm.cu \
             \
             dgemm_kernel_a_0.cu \
             dgemm_kernel_N_N_64_16_16_16_4_special.cu \
             dgemm_kernel_T_N_32_32_8_8_8.cu \
             dgemm_kernel_T_T_64_16_16_16_4_v2.cu \
             dgemm_kernel_ab_0.cu \
             dgemm_kernel_N_N_64_16_16_16_4.cu \
             dgemm_kernel_N_T_64_16_4_16_4.cu \
             dgemm_kernel_T_T_64_16_16_16_4.cu \
             \
             sgemm_kernel_a_0.cu \
             sgemm_kernel_N_N_64_16_16_16_4_special.cu \
             sgemm_kernel_T_N_32_32_8_8_8.cu \
             sgemm_kernel_T_T_64_16_16_16_4_v2.cu \
             sgemm_kernel_ab_0.cu \
             sgemm_kernel_N_N_64_16_16_16_4.cu \
             sgemm_kernel_N_T_64_16_4_16_4.cu \
             sgemm_kernel_T_T_64_16_16_16_4.cu \

    ALLOBJ = $(ALLSRC:.cu=.cu_o) 

all: $(LIBMAGMABLAS)

$(LIBMAGMABLAS): $(ALLOBJ)
	$(ARCH) $(ARCHFLAGS) $@ $(ALLOBJ)
	$(RANLIB) $@

clean:
	rm -f *.cu_o *~ *.a *.linkinfo ../lib/libmagmablas.a

%.cu_o: %.cu
	$(NVCC) $(NVOPTS) -gencode arch=compute_20,code=compute_20 -gencode arch=compute_13,code=compute_13 -gencode arch=compute_10,code=compute_10 $(INC) -c $< -o $@
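A note on the compile rule: nvcc expects every arch/code pair to be introduced by its own -gencode option. The small Python sketch below builds such a flag string from a list of compute capabilities. The helper name `gencode_flags` is hypothetical and nothing here is part of the MAGMA build; it only illustrates the shape of the option. Also keep in mind that compute 1.0, 1.3 and 2.0 are targets from the toolkits of that time, and current CUDA toolkits no longer accept them.

```python
def gencode_flags(capabilities):
    """Build one '-gencode' option per compute capability.

    Each capability is a string such as '2.0', which becomes
    '-gencode arch=compute_20,code=compute_20' (PTX for that virtual arch).
    """
    flags = []
    for cap in capabilities:
        major, minor = cap.split(".")
        arch = f"compute_{major}{minor}"
        flags.append(f"-gencode arch={arch},code={arch}")
    return " ".join(flags)


if __name__ == "__main__":
    # The three targets used in the rule above.
    print(gencode_flags(["2.0", "1.3", "1.0"]))
```

The printed string can be pasted into the %.cu_o rule, or kept in a make variable so the list of targets lives in one place.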
Enjoy,
Thanks to Dr. Tomov and Mr. Goto!
Cheers,
Allan MeneZes!!!
Last edited by Allan Menezes on Tue Apr 20, 2010 1:34 am, edited 2 times in total.

Allan Menezes
Posts: 14
Joined: Wed Aug 05, 2009 10:01 pm

Re: Magma 0.2 with the Nvidia GTX260 and GotoBLAS2 results

Post by Allan Menezes » Sat Apr 17, 2010 5:31 pm

Dear All,
The makefiles are the same as in my previous post with the GTX470 results, and the CUDA toolkit and drivers are at revision level 3.0.
I think the magma library has to be retuned for CUDA 3.0 and the Fermi, but that is just my opinion.
Below is the performance of magma 0.2 on the NVIDIA GTX260 (compute capability 1.3) with CUDA 3.0 and everything else the same as above. The output is presented as is, exactly as it was run:

Code: Select all

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_cgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     45.30          40.29        2.156754e-06
 2048     48.75          54.85        2.745986e-06
 3072     53.66          61.26        3.310481e-06
 4032     60.33          62.30        4.521514e-06
 5184     61.92          63.65        5.078698e-06
 6016     62.64          63.98        4.598488e-06
 7040     62.99          64.26        4.612075e-06
 8064     62.81          64.49        5.544844e-06
 9088     63.33          64.66        5.639552e-06
10112     63.65          64.79        5.191767e-06
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_cgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     45.21          46.49        1.055917e-06
 2048     48.32          55.17        2.425854e-06
 3072     53.97          61.58        2.807738e-06
 4032     60.05          62.58        1.976121e-06
 5184     61.16          63.77        2.120793e-06
 6016     62.00          64.11        2.458057e-06
 7040     62.43          64.40        2.604019e-06
 8064     62.44          64.60        2.730853e-06
 9088     62.87          64.78        2.901967e-06
10112     63.22          64.90        3.038448e-06
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_cgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     35.88          48.40         7.096947e-09
 2048     61.03          57.09         7.424769e-09
 3072     69.97          61.14         7.476792e-09
 4032     73.08          61.75         7.481112e-09
 5184     75.20          63.06         7.463139e-09
 6016     76.56          63.40         8.369748e-09
 7040     77.81          63.69         9.469118e-09
 8064     78.78          63.59         1.062781e-08
 9088     79.68          64.05         1.169151e-08
10112     80.42          64.17         1.231068e-08
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_cgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     36.00          51.20         7.096947e-09
 2048     61.06          58.90         7.424769e-09
 3072     70.02          62.47         7.476792e-09
 4032     73.04          62.76         7.481112e-09
 5184     75.22          63.87         7.463139e-09
 6016     76.50          64.12         8.369748e-09
 7040     77.80          64.31         9.469118e-09
 8064     78.81          64.13         1.062781e-08
 9088     79.67          64.53         1.169151e-08
10112     80.41          64.60         1.231068e-08
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_cpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     50.83          38.15        3.142212e-08
 2048     61.92          49.53        3.059151e-08
 3072     69.53          54.45        2.495760e-08
 4032     73.43          57.02        2.337699e-08
 5184     76.14          58.96        3.669203e-08
 6048     77.75          59.91        3.505888e-08
 7200     79.21          60.91        3.032790e-08
 8064     80.13          61.49        2.819979e-08
 8928     80.87          61.92        3.803236e-08
10080     81.61          62.41        3.449032e-08
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_cpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     50.84          39.81        2.622518e-08
 2048     62.19          51.11        2.492307e-08
 3072     69.51          55.67        2.481159e-08
 4032     73.38          58.04        2.620767e-08
 5184     75.87          59.80        2.439569e-08
 6048     77.70          60.65        2.558891e-08
 7200     31.41          60.38        2.507425e-08
 8064     23.91          60.16        2.604544e-08
 8928     13.06          58.56        2.589235e-08
10080      8.15          57.47        2.689157e-08
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgehrd -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||A-QHQ'|| / ||A||
========================================================
 1024      8.20           8.35        1.033019e-14
 2048      5.51          15.62        2.041184e-14
 3072      5.72          19.24        3.014136e-14
 4032      6.05          20.86        3.655350e-14
 5184      6.04          21.86        4.374540e-14
 6016      6.21          22.25        5.527769e-14
 7040      6.27          23.40        6.843701e-14
 8064      6.31          24.47        8.011122e-14
 9088      6.33          25.29        9.053372e-14
10112      6.35          25.80        1.020150e-13
This is an Experimental Release of GEMM Routine without Padding

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage:
		./testing_dgemm N

    N		magmablas0.2 GFLops/s	cudablas-2.3 GFlops/s    error
=============================================================================
  512		22.945162492521		23.035738093195		0.000000e+00
  513		56.725082773109		26.219789667897		0.000000e+00
 1024		66.939423584053		66.756307252324		0.000000e+00
 1025		61.200876619686		26.802992309224		0.000000e+00
 1536		66.906286632142		66.868015315207		0.000000e+00
 1537		63.915807546406		27.205954901021		0.000000e+00
 2048		67.342722124879		67.550847081490		0.000000e+00
 2049		64.889201372834		27.309035365937		0.000000e+00
 2560		67.493577391129		67.413573351549		0.000000e+00
 2561		65.370372314404		27.397295289156		0.000000e+00
 3072		67.659142698778		67.573510091963		0.000000e+00
 3073		65.919808680620		27.434136996027		0.000000e+00
 3584		67.719972586438		67.650163926814		0.000000e+00
 3585		66.321150585336		27.472588885984		0.000000e+00
 4096		67.779644189666		67.749039494284		0.000000e+00
 4097		66.481173050162		27.495522289543		0.000000e+00
 4608		67.809374839390		67.783772799541		0.000000e+00
 4609		66.596814191177		27.512382252868		0.000000e+00
 5120		67.797054955590		67.713050622441		0.000000e+00
 5121		66.785879882139		27.528184671534		0.000000e+00
SYMV Double Precision

Usage
		 testing_dgemv N

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.26	       0.24		        0
  128        0.74	       0.66		        0
  192        1.37	       1.10		        0
  256        1.93	       1.56		        0
  320        2.53	       1.99		        0
  384        3.14	       2.46		        0
  448        3.82	       2.89		        0
  512        4.41	       3.36		        0
  576        4.92	       3.86		        0
  704        6.16	       4.74		        0
  832        7.40	       5.63		        0
  960        8.65	       6.49		        0
 1088        9.70	       7.40		        0
 1216       10.75	       8.24		        0
 1408       12.05	       9.58		        0
 1600       13.51	      10.85		        0
 1792       15.01	      11.83		        0
 1984       16.71	      12.99		        0
 2240       18.86	      14.59		        0
 2496       15.99	      16.04		        0
 2816       18.06	      17.92		        0
 3136       19.63	      19.04		        0
 3520       19.53	      19.03		        0
 3904       18.86	      21.12		        0
 4352       20.99	      21.00		        0
 4800       18.79	      20.69		        0
 5312       20.82	      20.63		        0
 5888       19.34	      20.88		        0
 6528       21.40	      21.25		        0
 7232       20.32	      21.77		        0
 8000       22.50	      22.36		        0
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgeqlf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     16.07          19.73        1.692476e-15
 2048     15.86          28.51        2.412781e-15
 3072     18.60          30.41        2.873908e-15
 4032     22.19          31.20        2.933770e-15
 5184     22.45          32.26        3.183846e-15
 6016     22.69          32.62        3.638615e-15
 7040     22.60          32.92        4.039368e-15
 8064     22.49          33.04        4.212756e-15
 9088     22.51          33.27        4.495418e-15
10112     22.63          33.39        4.804705e-15
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     16.51          22.24        2.312062e-15
 2048     16.00          28.51        3.177654e-15
 3072     18.77          30.27        4.233690e-15
 4032     22.38          31.34        5.220670e-15
 5184     22.57          32.28        5.788689e-15
 6016     22.75          32.62        5.355434e-15
 7040     22.64          32.91        6.691719e-15
 8064     22.52          33.05        7.198218e-15
 9088     22.59          33.29        8.789219e-15
10112     22.65          33.39        9.029942e-15
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     16.43          22.58        1.959699e-15
 2048     15.87          28.96        2.642956e-15
 3072     18.78          30.48        3.271786e-15
 4032     22.38          31.50        3.356442e-15
 5184     22.60          32.45        3.752684e-15
 6016     22.79          32.81        4.070131e-15
 7040     22.68          33.10        4.403128e-15
 8064     22.60          33.21        8.071775e-14
 9088     22.63          33.43        5.335508e-15
10112     22.74          33.54        5.304265e-15
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgeqrs_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    || b-Ax || / ||A||
========================================================
 1024     15.28          15.45        8.123717e-16
 2048     15.42          28.16        9.930360e-15
 3072     18.34          29.94        1.639680e-14
 4032     21.95          31.12        3.220842e-15
 5184     22.14          32.18        2.035707e-15
 6016     22.45          32.61        5.951416e-15
 7040     22.42          32.93        4.714261e-15
 8064     22.35          33.08        1.597581e-14
 9088     22.42          33.31        3.800731e-15
10112     22.55          33.44        2.706913e-15
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgesv -N 1024



  N            GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024              14.84        4.674082e-16
 2048              40.82        4.412873e-15
 3072              48.34        9.000011e-15
 4032              51.14        1.354061e-15
 5184              55.96        1.173067e-15
 6016              57.48        3.439470e-15
 7040              59.38        2.687958e-15
 8064              59.67        5.400273e-15
 9088              61.77        1.662879e-15
10112              62.48        1.550303e-15
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     24.08          21.24         3.514640e-18
 2048     30.85          38.91         3.258964e-18
 3072     33.65          46.43         2.966111e-18
 4032     35.12          49.31         3.348630e-18
 5184     36.46          54.14         3.333262e-18
 6016     37.17          55.77         2.826022e-18
 7040     37.88          57.79         2.802706e-18
 8064     38.37          58.23         2.761636e-18
 9088     38.82          60.40         2.752465e-18
10112     39.21          61.20         2.726653e-18
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     24.56          23.30         3.514640e-18
 2048     30.86          42.37         3.258964e-18
 3072     33.64          49.60         2.966111e-18
 4032     35.28          51.95         3.348630e-18
 5184     36.46          56.61         3.333262e-18
 6016     37.18          58.04         2.826022e-18
 7040     37.90          59.86         2.802706e-18
 8064     38.36          60.06         2.761636e-18
 9088     38.80          62.13         2.752465e-18
10112     39.24          62.80         2.726653e-18
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     27.77          13.93        4.368765e-17
 2048     33.58          27.25        5.255033e-17
 3072     36.63          35.87        6.129227e-17
 4032     38.12          40.44        6.249455e-17
 5184     39.30          45.21        6.400078e-17
 6144     39.84          47.70        6.514027e-17
 6912     40.32          49.31        6.548325e-17
 8192     40.91          51.50        6.854160e-17
 8960     41.06          52.52        6.936968e-17
 9984     41.34          53.76        7.147590e-17
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     28.05          14.82        4.368765e-17
 2048     33.77          29.05        5.255033e-17
 3072     36.65          37.90        6.129227e-17
 4032     38.13          42.37        6.249455e-17
 5184     39.30          47.04        6.400078e-17
 6144     39.83          49.39        6.514027e-17
 6912     40.32          50.94        6.548325e-17
 8192     40.89          52.95        6.854160e-17
 8960     41.05          53.90        6.936968e-17
 9984     41.33          55.05        7.147590e-17
Iterative Refinement- QR 

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_dsgeqrsv_gpu -N 1024



           CPU GFlop/s                 GPU GFlop/s   
  N          Double          Double	Single	 Mixed    || b-Ax || / ||A||
=========================================================================================
 1024 	   14.72	    21.02	 24.95	 10.24  	 5.778380e-16  2 
 2048 	   15.33	    28.16	 44.82	 40.62  	 1.117881e-13  2 
 3072 	   18.26	    29.95	 72.75	 66.35  	 3.773353e-14  3 
 4032 	   21.93	    31.12	 75.76	 71.11  	 3.448697e-14  3 
 5184 	   22.22	    32.19	 89.81	 85.75  	 1.832827e-15  3 
 6016 	   22.46	    32.59	 91.65	 89.24  	 4.666973e-14  2 
 7040 	   22.40	    32.93	 93.16	 91.14  	 5.616475e-14  2 
 8000 	   22.84	    33.14	 94.13	 93.20  	 nan  1 
Iterative Refinement- LU 

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage:
		 ./testing_dsgesv N

Epsilon(Double): 0.00000000000000011102 
Epsilon(Single): 0.00000005960464477539


N	Double-Factor	Double-Solve	Single-Factor	Single-Solve	Mixed Precision Solver	 || b-Ax || / ||A||  	 NumIter
===========================================================================================================================================================
 1024	 23.24		 21.67		 39.34		 36.17			 27.83			3.837667e-16	  2
 2048	 42.38		 40.83		103.23		 96.48			 74.50			1.370072e-15	  3
 3072	 49.54		 48.47		148.88		142.22			118.26			7.932516e-15	  3
 4032	 51.96		 51.14		158.59		154.76			131.78			3.134133e-15	  4
 5184	 56.56		 55.92		205.04		200.42			180.18			4.900909e-16	  3
 6016	 58.00		 57.45		219.81		215.74			196.78			5.078789e-15	  3
 7040	 59.83		 59.36		233.47		229.78			212.29			5.907898e-15	  3
 8064	 60.05		 59.66		200.39		198.76			188.10			6.657144e-14	  3
Iterative Refinement- Cholesky 

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage:
		 ./testing_dsposv -N 1024

Epsilon(Double): 0.00000000000000011102 
Epsilon(Single): 0.00000005960464477539


N	Double-Factor	Double-Solve	Single-Factor	Single-Solve	Mixed Precision Solver	 || b-Ax || / ||A||  	 NumIter
===============================================================================================================================================================================
 1024 	 14.88		 13.61		 45.86 		 37.94		 24.22			6.742573e-19	  2
 2048 	 29.09		 27.60		 87.56 		 78.02		 58.38			5.442533e-19	  2
 3072 	 37.94		 36.67		121.39 		112.21		 89.67			6.928162e-19	  2
 4032 	 42.43		 41.15		130.96 		122.37		102.46			7.975089e-19	  2
 5184 	 47.15		 46.15		158.82 		152.07		131.93			7.317486e-19	  2
 6016 	 49.02		 48.19		171.14 		165.00		145.30			7.616320e-19	  2
 7040 	 51.03		 50.27		183.15 		177.39		158.08			7.310733e-19	  2
 8064 	 52.66		 51.84		140.02 		136.18		125.18			7.318053e-19	  2
SYMV Double Precision

Usage
		 testing_dsymv N

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.22	       0.27		        0
  128        0.60	       0.84		        0
  192        0.97	       1.68		        0
  256        1.32	       2.52		        0
  320        1.69	       3.47		        0
  384        1.99	       4.34		        0
  448        2.03	       4.90		        0
  512        2.56	       6.17		        0
  576        2.92	       7.13		        0
  704        3.33	       8.33		        0
  832        3.74	      10.11		        0
  960        2.55	       6.83		        0
 1088        2.85	       7.79		        0
 1216        3.17	       8.78		        0
 1408        3.46	      10.27		        0
 1600        3.81	      11.08		        0
 1792        2.79	       8.97		        0
 1984        3.34	      10.05		        0
 2240        3.01	      10.91		        0
 2496        4.01	      11.96		        0
 2816        3.57	      10.81		        0
 3136        2.33	      11.67		        0
 3520        3.61	      10.88		        0
 3904        3.86	      11.89		        0
 4352        3.66	      11.13		        0
 4800        3.90	      12.16		        0
 5312        3.81	      11.70		        0
 5888        4.08	      12.61		        0
 6528        3.96	      12.44		        0
 7232        3.97	      12.31		        0
 8000        3.98	      12.25		        0
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgehrd -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||A-QHQ'|| / ||A||
========================================================
 1024     39.75          17.80        5.826909e-06
 2048     12.74          43.90        1.069172e-05
 3072     14.19          58.20        1.593784e-05
 4032     13.67          65.00        1.991340e-05
 5184     13.23          71.69        2.377779e-05
 6016     13.36          74.08        3.033170e-05
 7040     13.40          76.25        3.642902e-05
 8064     13.53          75.36        4.314292e-05
 9088     13.66          78.50        4.871135e-05
10112     13.69          79.28        5.384498e-05
This is an Experimental Release of GEMM Routine without Padding

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage:
		./testing_sgemm N

    N		magmablas0.2 GFLops/s	cudablas-2.3 GFlops/s    error
=============================================================================
  512		268.7042		280.4968		0.000000e+00
  513		234.9969		75.9312		0.000000e+00
 1024		322.4450		325.7712		0.000000e+00
 1025		290.8157		84.4388		0.000000e+00
 1536		326.0496		330.9629		0.000000e+00
 1537		302.8830		87.2785		0.000000e+00
 2048		329.9063		331.2932		0.000000e+00
 2049		311.6348		88.2790		0.000000e+00
 2560		331.5557		334.3141		0.000000e+00
 2561		315.5441		89.0356		0.000000e+00
 3072		331.6615		334.3755		0.000000e+00
 3073		315.9033		89.0125		0.000000e+00
 3584		331.8461		334.2131		0.000000e+00
 3585		316.5181		88.7893		0.000000e+00
 4096		331.9613		335.1197		0.000000e+00
 4097		319.2345		89.3308		0.000000e+00
 4608		332.5727		334.7665		0.000000e+00
 4609		320.8550		89.7136		0.000000e+00
 5120		332.0470		335.3657		0.000000e+00
 5121		321.2648		89.5617		0.000000e+00
SYMV Single Precision

Usage
		 testing_sgemv N

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.19       0.36		        0
  128        0.49       1.09		        0
  192        0.81       1.99		        0
  256        1.11       2.98		        0
  320        1.45       4.10		        0
  384        1.78       5.00		        0
  448        2.09       6.18		        0
  512        2.45       7.28		        0
  576        2.73       8.40		        0
  704        3.38      10.66		        0
  832        4.02      12.82		        0
  960        4.69      15.23		        0
 1088        5.30      17.28		        0
 1216        5.97      19.46		        0
 1408        6.94      22.92		        0
 1600        7.89      26.26		        0
 1792        8.85      29.46		        0
 1984        9.84      32.26		        0
 2240       11.09      35.97		        0
 2496       12.40      39.43		        0
 2816       13.99      42.75		        0
 3136       15.54      45.42		        0
 3520       17.48      47.47		        0
 3904       19.28      49.48		        0
 4352       21.34      49.97		        0
 4800       23.56      50.58		        0
 5312       25.78      50.84		        0
 5888       28.48      51.44		        0
 6528       31.29      51.84		        0
 7232       34.20      52.43		        0
 8000       37.43      52.37		        0
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgeqlf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     30.89          27.05        9.666388e-07
 2048     27.05          64.02        1.334392e-06
 3072     33.72          74.37        1.481030e-06
 4032     40.13          76.48        1.687627e-06
 5184     40.48          90.95        1.738137e-06
 6016     40.84          92.41        1.905932e-06
 7040     41.41          93.62        2.432561e-06
 8064     42.37          87.66        2.263390e-06
 9088     42.57          95.35        2.369967e-06
10112     43.11          95.93        2.525132e-06
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     31.79          27.05        1.229550e-06
 2048     28.79          46.49        1.786729e-06
 3072     34.05          73.91        2.330011e-06
 4032     40.62          76.55        2.672245e-06
 5184     40.98          91.02        2.771203e-06
 6016     41.22          92.48        2.934374e-06
 7040     41.67          93.70        3.694396e-06
 8064     42.50          87.71        3.689634e-06
 9088     42.49          95.42        4.426598e-06
10112     42.77          95.90        4.644951e-06
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     32.08          27.28        1.053004e-06
 2048     28.42          47.16        1.462645e-06
 3072     34.20          75.51        1.724498e-06
 4032     40.76          77.49        1.858337e-06
 5184     41.26          91.68        2.030928e-06
 6016     41.85          93.18        4.095723e-05
 7040     42.19          94.41        2.534200e-06
 8064     42.84          88.21        3.232946e-05
 9088     43.03          96.00        2.873736e-06
10112     43.46          96.53        2.804058e-06
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     32.06          20.33        3.451768e-01
 2048     28.80          46.95        3.397885e-01
 3072     34.10          75.11        3.547670e-01
 4032     41.11          77.23        3.540365e-01
 5184     41.31          91.24        3.732369e-01
 6016     41.90          92.81        3.738632e-01
 7040     42.26         824.21        1.049284e+00
 8064     42.89         1021.03        1.359362e+00
 9088     43.12         1162.84        1.576229e+00
10112     43.51         1300.09        1.747851e+00
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgeqrs_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    || b-Ax || / ||A||
========================================================
 1024     28.76          25.01        8.207032e-07
 2048     27.42          44.82        6.674477e-06
 3072     33.16          72.85        1.770681e-05
 4032     39.89          75.79        1.405714e-06
 5184     40.52          89.81        1.114624e-06
 6016     41.10          91.65        3.413012e-06
 7040     41.42          93.14        2.135102e-06
 8064     42.23          87.19        6.177356e-06
 9088     42.46          95.10        2.338076e-06
10112     42.93          95.69        1.505736e-06
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgesv -N 1024



  N            GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024              19.05        2.799940e-07
 2048             118.27        2.496092e-06
 3072             175.62        5.860553e-06
 4032             195.52        8.414269e-07
 5184             246.70        5.947454e-07
 6016             260.79        2.153073e-06
 7040             272.09        1.590908e-06
 8064             241.80        3.047496e-06
 9088             285.74        9.667521e-07
10112             290.24        8.300990e-07
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     29.27          42.94         2.046097e-09
 2048     52.07         113.66         1.899421e-09
 3072     61.25         169.75         1.829476e-09
 4032     66.23         183.84         1.905313e-09
 5184     69.12         230.72         1.893457e-09
 6016     70.77         244.45         1.777387e-09
 7040     72.42         256.12         1.763151e-09
 8064     73.73         230.25         1.879329e-09
 9088     74.57         271.24         2.100665e-09
10112     75.55         276.52         2.276886e-09
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     29.65          47.63         2.046097e-09
 2048     52.44         128.99         1.899421e-09
 3072     61.44         192.78         1.829476e-09
 4032     66.39         203.64         1.905313e-09
 5184     68.93         254.45         1.893457e-09
 6016     70.67         267.33         1.777387e-09
 7040     72.51         277.32         1.763151e-09
 8064     73.72         245.77         1.879329e-09
 9088     74.56         288.82         2.100665e-09
10112     75.55         293.41         2.276886e-09
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_sgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     29.58          39.11         1.989466e-09
 2048     52.34         103.18         1.871970e-09
 3072     61.46         148.62         1.815089e-09
 4032     65.90         158.65         1.894077e-09
 5184     69.12         204.83         1.885447e-09
 6016     70.89         219.73         1.780790e-09
 7040     72.43         233.37         1.761823e-09
 8064     73.79         201.14         1.854722e-09
 9088     74.70         252.18         2.101642e-09
10112     75.56         259.08         2.280880e-09
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_spotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     38.52          40.83        1.773533e-08
 2048     54.73          79.12        2.295516e-08
 3072     63.82         110.50        2.736352e-08
 4032     68.32         121.26        3.472725e-08
 5184     72.11         148.20        3.655660e-08
 6048     73.80         147.70        3.690762e-08
 7200     76.01         175.74        3.887012e-08
 8064     77.18         134.51        3.935232e-08
 8928     78.05         192.72        4.046280e-08
10080     78.96         173.80        4.228006e-08
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_spotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     37.41          44.99        1.773533e-08
 2048     55.31          87.54        2.295516e-08
 3072     63.95         121.44        2.736352e-08
 4032     68.45         131.07        3.472725e-08
 5184     72.14         158.75        3.655660e-08
 6048     74.14         156.30        3.690762e-08
 7200     76.09         186.43        3.887012e-08
 8064     77.25         139.79        3.935232e-08
 8928     78.12         202.89        4.046280e-08
10080     78.98         181.14        4.228006e-08
SYMV Sinlge Precision

Usage
		 testing_ssymv N

device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.11       0.32		        0
  128        0.27       1.09		        0
  192        0.44       2.11		        0
  256        0.62       3.45		        0
  320        0.79       4.88		        0
  384        0.97       6.27		        0
  448        1.09       7.30		        0
  512        1.31       9.53		        0
  576        1.48      11.25		        0
  704        1.79      14.58		        0
  832        2.03      17.52		        0
  960        2.10      20.48		        0
 1088        2.16      23.21		        0
 1216        2.26      25.06		        0
 1408        2.35      27.92		        0
 1600        2.42      29.94		        0
 1792        3.70      31.95		        0
 1984        2.50      32.53		        0
 2240        4.22      36.36		        0
 2496        2.57      35.40		        0
 2816        2.59      27.06		        0
 3136        3.90      32.03		        0
 3520        2.62      32.74		        0
 3904        2.62      35.08		        0
 4352        2.63      36.56		        0
 4800        2.64      37.89		        0
 5312        2.64      32.21		        0
 5888        2.64      35.43		        0
 6528        2.65      37.22		        0
 7232        2.64      39.09		        0
 8000        2.64      35.04		        0
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_zgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     23.18          33.75        3.392327e-15
 2048     27.72          40.05        4.522000e-15
 3072     29.48          42.19        5.990881e-15
 4032     31.81          43.79        8.269346e-15
 5184     32.49          43.83        9.103060e-15
 6016     32.54          44.11        8.365426e-15
 7040     32.85          44.32        9.857029e-15
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_zgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     23.30          34.70        2.610599e-15
 2048     27.73          40.28        3.576645e-15
 3072     29.53          42.37        4.165347e-15
 4032     31.88          43.90        4.900768e-15
 5184     32.57          43.97        5.629466e-15
 6016     32.59          44.26        8.115494e-15
 7040     32.94          44.46        6.216535e-15
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_zgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     29.42          37.56         1.074309e-17
 2048     36.02          51.28         1.088811e-17
 3072     38.27          56.61         1.080715e-17
 4032     39.49          58.38         1.072353e-17
 5184     40.30          60.88         1.056042e-17
 6016     40.73          61.72         1.019869e-17
 7040     41.14          62.49         1.018500e-17
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_zgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     29.48          41.10         1.927441e-17
 2048     35.87          54.21         1.740228e-17
 3072     38.20          58.94         1.553913e-17
 4032     39.37          60.23         1.492815e-17
 5184     40.20          62.41         1.416735e-17
 6016     40.64          63.09         1.359233e-17
 7040     41.05          63.69         1.304601e-17
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_zpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     33.41          26.39        5.541044e-17
 2048     37.69          37.69        5.350120e-17
 3072     39.68          43.34        4.189603e-17
 4032     40.64          46.82        3.490686e-17
 5184     41.33          49.31        5.823441e-17
 6048     41.70          50.61        5.342887e-17
 7200     42.01          52.03        4.425605e-17
device 0: GeForce GTX 260, 1296.0 MHz clock, 895.3 MB memory

Usage: 
  testing_zpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     33.23          27.60        4.521316e-17
 2048     37.60          39.43        4.509872e-17
 3072     39.65          44.88        4.351074e-17
 4032     40.62          48.08        4.324125e-17
 5184     41.33          50.47        4.082688e-17
 6048     41.69          51.66        3.924860e-17
 7200     42.02          52.94        3.899940e-17
Enjoy! And thanks Dr. Tomov and Mr. Goto!
Cheers,
Allan MeneZes!!!

Magma 0.2 with Fermi GTX480, GTX470 and GotoBLAS2 results

Post by Allan Menezes » Tue Apr 20, 2010 1:42 am

Dear All,
The makefiles are the same as in my previous post with the GTX470 and GTX260 results, and the CUDA toolkit and drivers are again at revision level 3.0.
Below is the performance with the NVIDIA Fermi GTX480 as device 0 and the GTX470 as device 1, using CUDA 3.0 and everything else as above. The output is presented as is, exactly as it was run.
I think the MAGMA library has to be retuned for CUDA 3.0 and Fermi, but that is just my opinion.
Maybe I should add -m64 to NVOPTS in make.inc.goto and try again?
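
For anyone who wants to try that, here is a minimal sketch of what the change might look like. It assumes the NVOPTS block from the make.inc.goto in the earlier post and adds only -m64 (nvcc's 64-bit target flag); it has not been tested here, and whether it changes any of the numbers below is unknown.

```make
# Hypothetical variant of NVOPTS from make.inc.goto (untested sketch).
# Only -m64 is new; every other flag is copied from the earlier post.
NVOPTS    = --compiler-options -fno-inline \
            --compiler-options -fno-strict-aliasing \
            -arch sm_20 -m64 -DUNIX -O3
```

After editing make.inc.goto, the library and the testing directory would need to be rebuilt from clean so that every .cu file is recompiled with the new flag.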

Code: Select all

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_cgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     37.87         126.07        2.181499e-06
 2048     46.69         205.58        2.763979e-06
 3072     51.67         254.72        3.224755e-06
 4032     55.26         276.98        4.556965e-06
 5184     56.11         283.95        5.060306e-06
 6016     56.15         289.49        4.582116e-06
 7040     56.02         286.46        4.619145e-06
 8064     56.17         276.91        5.499504e-06
 9088     56.45         290.56        5.515256e-06
10112     56.93         294.31        5.179223e-06
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_cgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     37.86         138.77        1.023109e-06
 2048     46.67         219.86        2.401207e-06
 3072     51.66         265.40        2.559615e-06
 4032     54.91         288.85        1.957078e-06
 5184     55.83         294.25        2.122840e-06
 6016     54.73         260.86        2.449219e-06
 7040     55.71         292.76        2.591782e-06
 8064     55.73         282.24        2.737253e-06
 9088     56.14         295.65        2.923932e-06
10112     56.68         302.07        3.040652e-06
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_cgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     31.29          86.64         6.999727e-09
 2048     54.96         155.25         7.347308e-09
 3072     63.68         191.02         7.411954e-09
 4032     66.15         218.56         7.398736e-09
 5184     68.00         240.63         7.359929e-09
 6016     69.05         254.88         8.188616e-09
 7040     70.19         266.22         9.392204e-09
 8064     70.88         275.66         1.037497e-08
 9088     72.03         283.62         1.169262e-08
10112     71.51         288.15         1.234344e-08
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_cgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     30.67         104.48         6.999727e-09
 2048     54.11         182.87         7.347308e-09
 3072     62.84         214.97         7.411954e-09
 4032     64.58         244.17         7.398736e-09
 5184     67.04         265.62         7.359929e-09
 6016     65.24         275.33         8.188616e-09
 7040     68.93         287.85         9.392204e-09
 8064     68.90         295.37         1.037497e-08
 9088     71.26         301.68         1.169262e-08
10112     72.06         306.73         1.234344e-08
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_cpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     42.10          58.96        3.142203e-08
 2048     54.55         107.57        3.059146e-08
 3072     61.60         143.09        2.495757e-08
 4032     66.51         169.52        2.337698e-08
 5184     69.68         193.98        3.669203e-08
 6048     71.06         207.86        3.505888e-08
 7200     72.51         223.70        3.032790e-08
 8064     73.13         233.24        2.819979e-08
 8928     73.97         242.44        3.803236e-08
10080     74.62         253.36        3.449032e-08
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_cpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     42.12          65.26        2.622518e-08
 2048     54.44         120.37        2.492308e-08
 3072     62.19         160.29        2.481157e-08
 4032     66.45         189.88        2.620765e-08
 5184     69.62         215.56        2.439569e-08
 6048     71.07         230.99        2.558894e-08
 7200     29.29         229.09        2.507422e-08
 8064     22.37         226.41        2.604528e-08
 8928     12.22         204.45        2.590413e-08
10080      7.67         191.58        2.688114e-08
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgehrd -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||A-QHQ'|| / ||A||
========================================================
 1024      4.49          14.36        1.033019e-14
 2048      5.17          32.44        2.041184e-14
 3072      5.47          43.56        3.014136e-14
 4032      5.77          50.53        3.655350e-14
 5184      5.87          55.33        4.374540e-14
 6016      5.97          57.82        5.527769e-14
 7040      6.02          61.69        6.843701e-14
 8064      6.10          63.27        8.011122e-14
 9088      6.13          64.21        9.053372e-14
10112      6.15          63.39        1.020150e-13
This is an Experimental Release of GEMM Routine without Padding

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage:
		./testing_dgemm N

    N		magmablas0.2 GFLops/s	cudablas-2.3 GFlops/s    error
=============================================================================
  512		107.805404016064		109.431494496535		0.000000e+00
  513		123.518478499543		93.559041580042		0.000000e+00
 1024		160.104648326251		160.860198352060		0.000000e+00
 1025		142.004433968484		97.139691953816		0.000000e+00
 1536		162.761224163485		154.417873529913		0.000000e+00
 1537		146.966775398689		99.107751914075		0.000000e+00
 2048		163.156302496747		163.842500038148		0.000000e+00
 2049		149.231486395295		100.640207410094		0.000000e+00
 2560		163.861602848032		164.550265795720		0.000000e+00
 2561		150.429515455470		100.127176044899		0.000000e+00
 3072		164.008877594546		156.179366889965		0.000000e+00
 3073		151.127492680412		100.540303696030		0.000000e+00
 3584		164.078305295328		164.679616044185		0.000000e+00
 3585		151.541062656947		100.342188864522		0.000000e+00
 4096		164.130710221824		164.739205055365		0.000000e+00
 4097		152.020170684369		100.847787447739		0.000000e+00
 4608		164.254709608537		156.809729477812		0.000000e+00
 4609		152.301575507226		100.397596129847		0.000000e+00
 5120		164.314972968738		164.903458215027		0.000000e+00
 5121		152.492648033979		101.209296403883		0.000000e+00
SYMV Double Precision

Usage
		 testing_dgemv N

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.37	       0.37		        0
  128        1.13	       1.06		        0
  192        1.94	       1.89		        0
  256        2.73	       2.73		        0
  320        2.66	       2.56		        0
  384        3.10	       3.04		        0
  448        3.65	       3.58		        0
  512        4.23	       4.16		        0
  576        4.74	       4.74		        0
  704        5.97	       5.90		        0
  832        7.10	       6.96		        0
  960        8.23	       8.16		        0
 1088        9.28	       9.14		        0
 1216       10.52	      10.23		        0
 1408       12.28	      11.98		        0
 1600       13.80	      13.58		        0
 1792       15.22	      15.18		        0
 1984       16.79	      16.75		        0
 2240       18.72	      18.83		        0
 2496       20.70	      20.73		        0
 2816       22.95	      23.12		        0
 3136       25.12	      24.90		        0
 3520       27.47	      27.17		        0
 3904       29.00	      29.59		        0
 4352       30.90	      31.10		        0
 4800       32.82	      32.82		        0
 5312       34.35	      34.33		        0
 5888       35.07	      35.38		        0
 6528       35.69	      35.87		        0
 7232       37.16	      36.88		        0
 8000       36.53	      25.09		        0
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgeqlf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024      8.40          45.99        1.692476e-15
 2048     15.05          80.03        2.412781e-15
 3072     17.24          92.18        2.873908e-15
 4032     17.87          95.00        2.933770e-15
 5184     19.22         101.38        3.183846e-15
 6016     19.35          99.75        3.638615e-15
 7040     19.66         103.48        4.039368e-15
 8064     19.82         102.35        4.212756e-15
 9088     20.08         106.65        4.495418e-15
10112     20.42         107.66        4.804705e-15
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     12.17          44.40        2.312062e-15
 2048     14.64          80.77        3.177654e-15
 3072     16.99          92.21        4.233690e-15
 4032     19.05          93.46        5.220670e-15
 5184     19.36          93.94        5.788689e-15
 6016     19.17         100.49        5.355434e-15
 7040     19.69         102.00        6.691719e-15
 8064     19.88         103.61        7.198218e-15
 9088     20.22         104.40        8.789219e-15
10112     20.46         107.00        9.029942e-15
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     12.13          55.22        1.959699e-15
 2048     14.98          88.11        2.642956e-15
 3072     17.08          95.84        3.271786e-15
 4032     19.03          98.98        3.356442e-15
 5184     19.25         103.44        3.752684e-15
 6016     18.51         105.40        4.070131e-15
 7040     19.05         102.64        4.403128e-15
 8064     19.39         105.86        8.071775e-14
 9088     19.60         104.60        5.335508e-15
10112     20.17         109.77        5.304265e-15
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgeqrs_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    || b-Ax || / ||A||
========================================================
 1024     11.38          29.11        8.123717e-16
 2048     14.53          84.03        9.930360e-15
 3072     16.58          92.41        1.639680e-14
 4032     18.48          97.01        3.220842e-15
 5184     18.76         102.29        2.035707e-15
 6016     18.82         103.82        5.951416e-15
 7040     19.03         105.85        4.714261e-15
 8064     19.26         106.07        1.597581e-14
 9088     19.60         108.65        3.800731e-15
10112     19.84         108.49        2.706913e-15
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgesv -N 1024



  N            GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024              40.96        4.674082e-16
 2048              85.39        4.412873e-15
 3072             106.75        9.000011e-15
 4032             118.72        1.354061e-15
 5184             129.24        1.173067e-15
 6016             134.78        3.439470e-15
 7040             139.33        2.687958e-15
 8064             142.17        5.400273e-15
 9088             145.20        1.662879e-15
10112             147.15        1.550303e-15
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     21.36          32.95         3.514640e-18
 2048     27.94          65.80         3.258964e-18
 3072     30.90          85.34         2.966111e-18
 4032     32.41          98.08         3.348630e-18
 5184     33.54         108.35         3.333262e-18
 6016     34.11         115.47         2.826022e-18
 7040     34.72         121.06         2.802706e-18
 8064     35.09         125.61         2.761636e-18
 9088     35.46         129.62         2.752465e-18
10112     35.83         132.77         2.726653e-18
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     21.54          44.61         3.514640e-18
 2048     27.99          88.01         3.258964e-18
 3072     30.55         109.11         2.966111e-18
 4032     32.40         120.78         3.348630e-18
 5184     33.51         130.57         3.333262e-18
 6016     34.04         136.08         2.826022e-18
 7040     34.71         140.44         2.802706e-18
 8064     35.06         143.17         2.761636e-18
 9088     35.41         146.12         2.752465e-18
10112     35.81         148.12         2.726653e-18
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     22.02          35.79        4.368765e-17
 2048     29.17          63.48        5.255033e-17
 3072     32.75          80.55        6.129227e-17
 4032     34.44          92.11        6.249455e-17
 5184     35.41         101.94        6.400078e-17
 6144     36.41         107.85        6.514027e-17
 6912     36.62         112.18        6.548325e-17
 8192     37.24         116.60        6.854160e-17
 8960     37.29         119.99        6.936968e-17
 9984     37.56         122.87        7.147590e-17
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     22.17          50.49        4.368765e-17
 2048     29.39          88.48        5.255033e-17
 3072     32.72         102.50        6.129227e-17
 4032     34.39         113.61        6.249455e-17
 5184     35.49         120.33        6.400078e-17
 6144     36.28         125.00        6.514027e-17
 6912     36.33         127.62        6.548325e-17
 8192     37.30         131.73        6.854160e-17
 8960     37.26         135.18        6.936968e-17
 9984     37.52         136.91        7.147590e-17
Iterative Refinement- QR 

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_dsgeqrsv_gpu -N 1024



           CPU GFlop/s                 GPU GFlop/s   
  N          Doule           Double	Single	 Mixed    || b-Ax || / ||A||
=========================================================================================
 1024 	   11.46	    45.72	 68.58	 13.32  	 5.555543e-16  2 
 2048 	   14.59	    84.23	140.68	112.28  	 6.616966e-15  3 
 3072 	   16.59	    88.61	208.52	181.35  	 7.447985e-14  3 
 4032 	   18.61	    94.15	222.38	194.96  	 4.621670e-14  4 
 5184 	   18.86	   102.50	247.79	237.85  	 8.647975e-15  2 
 6016 	   18.87	    99.57	259.92	250.42  	 9.182751e-14  2 
 7040 	   19.24	   101.22	265.76	258.14  	 1.021954e-13  2 
 8000 	   19.45	   102.82	274.09	267.93  	 nan  1 
Iterative Refinement- LU 

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage:
		 ./testing_dsgesv N

Epsilon(Double): 0.00000000000000011102 
Epsilon(Single): 0.00000005960464477539


N	Double-Factor	Double-Solve	Single-Factor	Sigle-Solve	Mixed Precision Solver	 || b-Ax || / ||A||  	 NumIter
===========================================================================================================================================================
 1024	 44.20		 41.83		 66.42		 64.41			 48.45			4.854485e-16	  2
 2048	 87.77		 84.82		148.72		142.73			116.77			3.058204e-15	  3
 3072	108.93		106.60		197.88		193.73			168.31			9.219986e-15	  3
 4032	120.72		118.68		232.89		228.57			205.02			1.260598e-14	  3
 5184	130.51		128.92		264.09		260.76			240.56			2.280617e-16	  3
 6016	135.95		134.61		283.07		280.02			260.61			4.468209e-15	  3
 7040	140.08		139.25		299.62		297.19			276.01			1.037881e-15	  4
 8064	142.88		142.22		310.40		308.08			285.56			4.916595e-16	  4
Iterative Refinement- Cholesky 

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage:
		 ./testing_dsposv -N 1024

Epsilon(Double): 0.00000000000000011102 
Epsilon(Single): 0.00000005960464477539


N	Double-Factor	Double-Solve	Single-Factor	Sigle-Solve	Mixed Precision Solver	 || b-Ax || / ||A||  	 NumIter
===============================================================================================================================================================================
 1024 	 50.76		 45.77		 96.24 		 81.69		 52.56			6.319456e-19	  2
 2048 	 81.39		 76.66		187.19 		170.54		131.90			6.472202e-19	  2
 3072 	 99.77		 96.86		240.47 		227.93		186.99			7.352729e-19	  2
 4032 	111.21		108.43		284.48 		268.02		230.48			7.035593e-19	  2
 5184 	120.63		118.50		311.40 		300.43		267.66			7.499680e-19	  2
 6016 	125.13		123.04		327.52 		317.46		287.60			8.151097e-19	  2
 7040 	128.69		126.75		338.06 		329.49		302.01			6.619553e-19	  2
 8064 	132.76		131.15		347.52 		338.93		312.63			8.569304e-19	  2
SYMV Double Precision

Usage
		 testing_dsymv N

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.33	       0.43		        0
  128        0.84	       1.56		        0
  192        1.39	       2.95		        0
  256        1.82	       4.85		        0
  320        2.30	       6.83		        0
  384        2.76	       8.94		        0
  448        2.97	       9.79		        0
  512        3.26	      10.70		        0
  576        3.59	      13.27		        0
  704        4.11	      16.52		        0
  832        4.58	      19.50		        0
  960        4.84	      20.25		        0
 1088        5.19	      22.76		        0
 1216        5.39	      25.28		        0
 1408        5.84	      29.15		        0
 1600        4.35	      27.68		        0
 1792        4.51	      30.58		        0
 1984        4.73	      29.71		        0
 2240        5.04	      32.69		        0
 2496        5.33	      27.21		        0
 2816        5.82	      29.21		        0
 3136        5.28	      31.98		        0
 3520        5.45	      33.13		        0
 3904        5.76	      33.76		        0
 4352        5.59	      34.03		        0
 4800        5.83	      36.57		        0
 5312        6.01	      33.22		        0
 5888        5.83	      35.30		        0
 6528        6.01	      34.34		        0
 7232        5.86	      33.33		        0
 8000        6.07	      34.59		        0
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgehrd -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||A-QHQ'|| / ||A||
========================================================
 1024      9.36          22.54        5.627424e-06
 2048     10.35          57.99        1.071559e-05
 3072     11.35          84.35        1.585271e-05
 4032     12.21         100.76        1.977225e-05
 5184     12.25         116.21        2.388178e-05
 6016     12.37         126.67        3.007145e-05
 7040     12.52         138.99        3.626539e-05
 8064     12.81         136.13        4.308553e-05
 9088     13.22         141.17        4.882232e-05
10112     13.31         141.47        5.359977e-05
This is an Experimental Release of GEMM Routine without Padding

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage:
		./testing_sgemm N

    N		magmablas0.2 GFLops/s	cudablas-2.3 GFlops/s    error
=============================================================================
  512		323.0270		349.9810		0.000000e+00
  513		299.0159		247.2632		0.000000e+00
 1024		399.6806		415.7761		0.000000e+00
 1025		341.9786		287.0560		0.000000e+00
 1536		412.5310		427.8235		0.000000e+00
 1537		357.6773		293.4229		0.000000e+00
 2048		415.3941		429.3894		0.000000e+00
 2049		362.1810		298.3155		0.000000e+00
 2560		416.1945		607.8481		0.000000e+00
 2561		367.0046		296.9090		0.000000e+00
 3072		416.8402		431.3564		0.000000e+00
 3073		368.1981		297.0327		0.000000e+00
 3584		416.4991		431.0450		0.000000e+00
 3585		370.3915		297.0114		0.000000e+00
 4096		416.7924		431.5290		0.000000e+00
 4097		370.4532		297.7351		0.000000e+00
 4608		416.5830		431.5473		0.000000e+00
 4609		371.3548		297.6357		0.000000e+00
 5120		416.6590		606.7517		0.000000e+00
 5121		371.7655		297.8487		0.000000e+00
SYMV Sinlge Precision

Usage
		 testing_sgemv N

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.43       0.46		        0
  128        1.37       1.49		        0
  192        2.46       2.73		        0
  256        3.64       4.10		        0
  320        4.76       5.54		        0
  384        6.02       7.02		        0
  448        5.58       6.58		        0
  512        5.52       7.28		        0
  576        6.20       8.40		        0
  704        7.87      10.66		        0
  832        9.35      12.70		        0
  960       10.97      14.75		        0
 1088       12.27      16.79		        0
 1216       14.08      18.96		        0
 1408       16.38      22.03		        0
 1600       18.62      25.10		        0
 1792       20.85      27.92		        0
 1984       23.15      30.87		        0
 2240       25.93      34.49		        0
 2496       29.04      38.70		        0
 2816       32.43      42.86		        0
 3136       35.83      45.74		        0
 3520       39.97      50.99		        0
 3904       43.42      55.32		        0
 4352       47.59      59.00		        0
 4800       51.49      61.69		        0
 5312       55.33      64.42		        0
 5888       59.41      67.65		        0
 6528       62.99      69.35		        0
 7232       66.80      71.26		        0
 8000       69.83      49.31		        0
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgeqlf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     19.49          71.86        1.125170e-06
 2048     24.76         163.45        1.338669e-06
 3072     28.51         203.53        1.474377e-06
 4032     35.09         216.08        1.622605e-06
 5184     35.78         245.90        1.740285e-06
 6016     34.68         254.77        1.913730e-06
 7040     35.02         259.93        2.637886e-06
 8064     34.99         258.69        2.256273e-06
 9088     35.65         273.32        2.377706e-06
10112     35.70         276.88        2.507514e-06
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     19.66          77.09        1.017220e-06
 2048     24.84         153.60        1.456536e-06
 3072     28.77         219.82        1.708441e-06
 4032     35.41         230.72        1.863337e-06
 5184     36.44         256.35        2.029180e-06
 6016     34.67         267.00        1.398288e-05
 7040     34.86         272.41        2.542552e-06
 8064     34.91         267.76        4.327869e-05
 9088     35.23         281.39        2.765864e-06
10112     35.37         284.05        2.806814e-06
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     18.87          73.63        3.451768e-01
 2048     24.26         151.00        3.397885e-01
 3072     28.83         216.26        3.547670e-01
 4032     35.44         229.60        3.540365e-01
 5184     36.40         251.06        3.732369e-01
 6016     34.74         262.16        3.738632e-01
 7040     34.70         753.86        1.049284e+00
 8064     34.71         860.42        1.359362e+00
 9088     35.26         879.52        1.576229e+00
10112     35.50         764.46        1.747851e+00
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgeqrs_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    || b-Ax || / ||A||
========================================================
 1024     17.66          67.88        9.912942e-07
 2048     23.93         140.40        7.115094e-06
 3072     28.08         205.30        1.680525e-05
 4032     34.54         222.22        4.822237e-05
 5184     35.82         249.15        1.129348e-06
 6016     34.33         260.40        3.104939e-06
 7040     34.69         266.96        2.191862e-06
 8064     34.71         263.67        2.028972e-05
 9088     35.46         277.82        2.229614e-06
10112     35.50         282.01        1.558051e-06
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgesv -N 1024



  N            GPU GFlop/s      || b-Ax || / ||A||
========================================================
 1024              56.88        2.606234e-07
 2048             148.73        2.100937e-06
 3072             211.27        4.839033e-06
 4032             255.33        8.308236e-07
 5184             290.89        6.308150e-07
 6016             308.85        1.857649e-06
 7040             325.72        1.403984e-06
 8064             332.70        2.878860e-06
 9088             343.58        1.073657e-06
10112             350.77        7.553145e-07
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.57          49.09         1.895988e-09
 2048     46.86         119.73         1.774122e-09
 3072     55.61         172.12         1.715500e-09
 4032     60.05         209.17         1.804801e-09
 5184     62.96         242.62         1.798921e-09
 6016     64.15         260.85         1.675016e-09
 7040     65.97         277.64         1.659101e-09
 8064     67.09         288.50         1.770623e-09
 9088     67.88         301.55         1.981117e-09
10112     68.72         310.45         2.168543e-09
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     27.14          60.99         1.895988e-09
 2048     46.63         157.01         1.774122e-09
 3072     55.06         221.47         1.715500e-09
 4032     60.07         253.08         1.804801e-09
 5184     62.89         296.81         1.798921e-09
 6016     64.07         315.18         1.675016e-09
 7040     65.71         329.13         1.659101e-09
 8064     66.71         336.04         1.770623e-09
 9088     67.46         346.92         1.981117e-09
10112     68.23         353.69         2.168543e-09
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_sgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     27.42          66.98         1.838853e-09
 2048     47.45         149.19         1.749864e-09
 3072     56.15         197.77         1.699750e-09
 4032     60.83         231.84         1.793962e-09
 5184     64.01         264.88         1.803271e-09
 6016     65.55         281.93         1.655473e-09
 7040     66.84         299.18         1.652063e-09
 8064     68.23         310.16         1.761585e-09
 9088     68.94         321.47         1.984430e-09
10112     69.76         330.21         2.142862e-09
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_spotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     32.64          66.42        1.773349e-08
 2048     43.81         135.27        2.295336e-08
 3072     55.40         182.38        2.736194e-08
 4032     59.45         219.77        3.472694e-08
 5184     64.72         251.28        3.655633e-08
 6048     66.34         267.55        3.691891e-08
 7200     68.75         285.43        3.886026e-08
 8064     69.11         296.99        3.935199e-08
 8928     70.41         306.81        4.046260e-08
10080     71.16         315.42        4.227979e-08
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_spotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     32.61          88.99        1.773349e-08
 2048     38.49         189.57        2.295336e-08
 3072     53.14         243.80        2.736194e-08
 4032     57.03         278.24        3.472694e-08
 5184     59.89         313.39        3.655633e-08
 6048     61.65         326.03        3.691891e-08
 7200     63.52         341.57        3.886026e-08
 8064     63.40         346.74        3.935199e-08
 8928     64.26         354.59        4.046260e-08
10080     65.69         360.65        4.227979e-08
SYMV Sinlge Precision

Usage
		 testing_ssymv N

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

   n   CUBLAS,Gflop/s   MAGMABLAS0.2,Gflop/s   "error"
==============================================================
   64        0.13       0.46		        0
  128        0.27       1.64		        0
  192        0.39       3.51		        0
  256        0.54       5.46		        0
  320        0.64       7.88		        0
  384        0.77      10.17		        0
  448        0.89      13.38		        0
  512        1.10      15.42		        0
  576        1.12      17.93		        0
  704        1.38      22.03		        0
  832        1.60      27.15		        0
  960        1.88      30.22		        0
 1088        2.13      33.82		        0
 1216        2.37      39.43		        0
 1408        2.79      40.88		        0
 1600        3.09      48.30		        0
 1792        3.36      52.22		        0
 1984        3.59      54.67		        0
 2240        3.82      61.57		        0
 2496        4.09      44.98		        0
 2816        4.44      50.51		        0
 3136        4.77      54.64		        0
 3520        5.08      59.00		        0
 3904        5.32      62.59		        0
 4352        5.55      65.09		        0
 4800        5.64      70.35		        0
 5312        5.99      61.28		        0
 5888        6.08      64.44		        0
 6528        5.94      67.16		        0
 7232        6.26      61.57		        0
 8000        6.22      66.22		        0
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgeqrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R - Q'A|| / ||A||
========================================================
 1024     20.35          61.53        3.392327e-15
 2048     23.62          78.72        4.522000e-15
 3072     28.45          78.06        5.990881e-15
 4032     28.49          78.13        8.269346e-15
 5184     29.17          85.36        9.103060e-15
 6016     30.36          87.82        8.365426e-15
 7040     30.49          94.55        9.857029e-15
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgeqrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     22.29          64.76        2.610599e-15
 2048     26.90          83.28        3.576645e-15
 3072     28.72          86.31        4.165347e-15
 4032     29.16          79.68        4.900768e-15
 5184     30.01          87.42        5.629466e-15
 6016     30.34          90.05        8.115494e-15
 7040     30.46          95.61        6.216535e-15
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     25.98          39.36         1.074309e-17
 2048     31.57          74.94         1.088811e-17
 3072     34.04          95.24         1.080715e-17
 4032     34.97         107.43         1.072353e-17
 5184     35.92         117.15         1.056042e-17
 6016     36.35         122.15         1.019869e-17
 7040     36.78         127.60         1.018500e-17
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.68          48.05         1.927441e-17
 2048     32.47          87.67         1.740228e-17
 3072     35.02         108.58         1.553913e-17
 4032     35.97         119.30         1.492815e-17
 5184     36.78         128.57         1.416735e-17
 6016     37.15         133.39         1.359233e-17
 7040     37.51         137.32         1.304601e-17
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zpotrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     27.64          26.96        5.541044e-17
 2048     32.25          45.55        5.350120e-17
 3072     35.86          57.12        4.189603e-17
 4032     37.03          64.50        3.490686e-17
 5184     37.83          71.11        5.823441e-17
 6048     38.08          74.54        5.342887e-17
 7200     38.43          78.57        4.425605e-17
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zpotrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F
========================================================
 1024     27.83          29.49        4.521316e-17
 2048     33.79          48.56        4.509872e-17
 3072     36.20          61.30        4.351074e-17
 4032     37.06          70.02        4.324125e-17
 5184     37.87          75.98        4.082688e-17
 6048     38.01          79.60        3.924860e-17
 7200     38.44          82.97        3.899940e-17
Here is the bash script to automate testing of results:

Code: Select all

#!/bin/sh
./testing_cgeqrf > results_cgeqrf-gtx480.txt
./testing_cgeqrf_gpu > results_cgeqrf_gpu-gtx480.txt
./testing_cgetrf > results_cgetrf-gtx480.txt
./testing_cgetrf_gpu > results_cgetrf_gpu-gtx480.txt
./testing_cpotrf > results_cpotrf-gtx480.txt
./testing_cpotrf_gpu > results_cpotrf_gpu-gtx480.txt
./testing_dgehrd > results_dgehrd-gtx480.txt
##./testing_dgelqf > results_dgelqf-gtx480.txt
./testing_dgemm > results_dgemm-gtx480.txt
./testing_dgemv > results_dgemv-gtx480.txt
./testing_dgeqlf > results_dgeqlf-gtx480.txt
./testing_dgeqrf > results_dgeqrf-gtx480.txt
./testing_dgeqrf_gpu > results_dgeqrf_gpu-gtx480.txt
./testing_dgeqrs_gpu > results_dgeqrs_gpu-gtx480.txt
./testing_dgesv_gpu > results_dgesv_gpu-gtx480.txt
./testing_dgetrf > results_dgetrf-gtx480.txt
./testing_dgetrf_gpu > results_dgetrf_gpu-gtx480.txt
./testing_dpotrf > results_dpotrf-gtx480.txt
./testing_dpotrf_gpu > results_dpotrf_gpu-gtx480.txt
./testing_dsgeqrsv_gpu > results_dsgeqrsv_gpu-gtx480.txt
./testing_dsgesv_gpu > results_dsgesv_gpu-gtx480.txt
./testing_dsposv_gpu > results_dsposv_gpu-gtx480.txt
./testing_dsymv > results_dsymv-gtx480.txt
./testing_sgehrd > results_sgehrd-gtx480.txt
##./testing_sgelqf > results_sgelqf-gtx480.txt
./testing_sgemm > results_sgemm-gtx480.txt
./testing_sgemv > results_sgemv-gtx480.txt
./testing_sgeqlf > results_sgeqlf-gtx480.txt
##./testing_sgeqrf > results_sgeqrf-gtx480.txt
./testing_sgeqrf_gpu > results_sgeqrf_gpu-gtx480.txt
./testing_sgeqrf_gpu-v2 > results_sgeqrf_gpu-v2-gtx480.txt
./testing_sgeqrs_gpu > results_sgeqrs_gpu-gtx480.txt
./testing_sgesv_gpu  > results_sgesv_gpu-gtx480.txt
./testing_sgetrf > results_sgetrf-gtx480.txt
./testing_sgetrf_gpu > results_sgetrf_gpu-gtx480.txt
./testing_sgetrf_gpu-v2 > results_sgetrf_gpu-v2-gtx480.txt
./testing_spotrf > results_spotrf-gtx480.txt
./testing_spotrf_gpu > results_spotrf_gpu-gtx480.txt
./testing_ssymv > results_ssymv-gtx480.txt
./testing_zgeqrf > results_zgeqrf-gtx480.txt
./testing_zgeqrf_gpu > results_zgeqrf_gpu-gtx480.txt
./testing_zgetrf > results_zgetrf-gtx480.txt
./testing_zgetrf_gpu > results_zgetrf_gpu-gtx480.txt
./testing_zpotrf > results_zpotrf-gtx480.txt
./testing_zpotrf_gpu > results_zpotrf_gpu-gtx480.txt
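The same run can also be written as a loop, so that new testers are picked up without editing the list. This is only a sketch and not part of MAGMA: the function name run_all_tests and its three arguments (directory, output tag, space-separated list of routines to skip) are assumptions. It also assumes the test programs are the executable files named testing_* and that they run with their default sizes when given no arguments, as in the script above.

```shell
# Run every executable named testing_* in a directory and save each program's
# output to results_<routine>-<tag>.txt in the current directory.
# Hypothetical helper: the name and arguments are illustrative only.
run_all_tests() {
    dir=${1:-.}          # directory holding the testing_* programs
    tag=${2:-gtx480}     # suffix used in the result file names
    skip=${3:-}          # space-separated routine names to leave out
    for exe in "$dir"/testing_*; do
        # Ignore sources, objects, and anything that is not an executable file.
        [ -f "$exe" ] && [ -x "$exe" ] || continue
        name=${exe##*/testing_}
        # Leave out routines named in the skip list.
        case " $skip " in
            *" $name "*) continue ;;
        esac
        "$exe" > "results_${name}-${tag}.txt" 2>&1
    done
}
```

For example, run_all_tests . gtx480 "dgelqf sgelqf sgeqrf" would leave out the three testers that are commented out in the script above.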
Enjoy! And thanks Dr. Tomov and Mr. Goto!
Cheers,
Allan MeneZes!!!

Stan Tomov
Posts: 263
Joined: Fri Aug 21, 2009 10:39 pm

Re: Magma 0.2 with Nvidia Fermi GTX470 and GotoBLAS2 results

Post by Stan Tomov » Tue Apr 20, 2010 5:28 pm

Hi Allan,
This is very interesting. Thanks for sharing these first experiences with the new Fermi with everyone. We also got one on Friday, but I went to IPDPS in Atlanta (this week) and didn't have enough time to play with it. I can still offer two comments, though.

First, MAGMA 0.2 was released before most of the complex arithmetic routines were available in CUBLAS, so we used wrappers for the missing routines (copy the needed data to the CPU, call the CPU BLAS, and copy the result back to the GPU). In the next release we will remove the wrappers. You can also try this yourself: e.g., in zgetrf.cpp you can add at the beginning

Code: Select all

#define magmablas_ztrsm cublasZtrsm
to quickly improve performance from

Code: Select all

  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
1024     25.98          39.36         1.074309e-17
2048     31.57          74.94         1.088811e-17
3072     34.04          95.24         1.080715e-17
4032     34.97         107.43         1.072353e-17
5184     35.92         117.15         1.056042e-17
6016     36.35         122.15         1.019869e-17
7040     36.78         127.60         1.018500e-17
to

Code: Select all

device 0: Tesla C2050, 1147.0 MHz clock, 2687.4 MB memory
device 1: Quadro NVS 290, 918.0 MHz clock, 255.7 MB memory

Usage: 
  testing_zgetrf -N 1024

  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     22.49          55.33         8.831272e-18
 2048     30.91         105.98         8.668125e-18
 3072     32.74         135.30         8.470229e-18
 4032     34.28         153.83         8.267812e-18
 5184     34.96         165.87         8.157549e-18
 6016     35.24         173.98         8.000245e-18
 7040     35.41         179.99         8.000857e-18
Second, as you point out, magma would need tuning for the Fermi. Magma depends on BLAS to get high performance. Higher-performance BLAS would require changing the kernels, e.g., increasing the sizes of the inner blocking (to get increased computational intensity, which is possible now because of the increased amount of shared memory). Currently, we use gemm implementations whose performance is only slightly better than that of the one-sided factorizations (from your experiments). In double precision, for example, we use a dgemm that gets up to 170 GFlop/s. CUBLAS has peaks (at particular sizes) of up to 230 GFlop/s. Also, at IPDPS I learned that NVIDIA is looking at a dgemm that gets above 300 GFlop/s (written in assembly, though).
The bottom line is that these are preliminary results, and there is a lot of room for improvement (which would come mostly from improved BLAS and from tuning the higher-level routines).
Stan

Allan Menezes
Posts: 14
Joined: Wed Aug 05, 2009 10:01 pm

Re: Magma 0.2 with Nvidia Fermi GTX470 and GotoBLAS2 results

Post by Allan Menezes » Tue Apr 20, 2010 8:18 pm

Dear Stan,
Thank you for your response! Wow, 300 GFlop/s sounds interesting. But one stupid question: is it double precision? And you probably would not get it with the GTX480, only with the Tesla versions of Fermi.
As you suggested, I tried adding #define magmablas_ztrsm cublasZtrsm at the beginning of zgetrf.cpp and zgetrf_gpu.cpp, and here are the results for the NVIDIA GTX480:

Before adding that define to zgetrf.cpp:

Code: Select all

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     25.98          39.36         1.074309e-17
 2048     31.57          74.94         1.088811e-17
 3072     34.04          95.24         1.080715e-17
 4032     34.97         107.43         1.072353e-17
 5184     35.92         117.15         1.056042e-17
 6016     36.35         122.15         1.019869e-17
 7040     36.78         127.60         1.018500e-17
And after adding that define to zgetrf.cpp, the new results:

Code: Select all

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgetrf -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.78          54.86         1.085451e-17
 2048     32.31          92.82         1.093588e-17
 3072     34.93         112.67         1.084504e-17
 4032     36.14         122.59         1.069110e-17
 5184     36.89         130.84         1.053841e-17
 6016     37.28         135.10         1.021357e-17
 7040     37.61         138.86         1.019910e-17
A significant improvement of approximately 11 GFlop/s at N=7040.
And before adding the define to zgetrf_gpu.cpp:

Code: Select all

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.68          48.05         1.927441e-17
 2048     32.47          87.67         1.740228e-17
 3072     35.02         108.58         1.553913e-17
 4032     35.97         119.30         1.492815e-17
 5184     36.78         128.57         1.416735e-17
 6016     37.15         133.39         1.359233e-17
 7040     37.51         137.32         1.304601e-17
And after adding the same line to zgetrf_gpu.cpp, the new results:

Code: Select all

device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.2 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

Usage: 
  testing_zgetrf_gpu -N 1024



  N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
 1024     26.59          68.16         1.947900e-17
 2048     32.67         114.31         1.748238e-17
 3072     34.85         132.05         1.561828e-17
 4032     35.79         139.84         1.487934e-17
 5184     33.97         144.19         1.412475e-17
 6016     36.65         146.15         1.362567e-17
 7040     37.58         150.23         1.307237e-17
Here we see a performance gain of approximately 13 GFlop/s for N=7040.
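The before/after gains quoted here can be read off the saved result files with a few lines of awk. A rough sketch, assuming the four-column layout shown in these tables (N, CPU GFlop/s, GPU GFlop/s, residual); the function name compare_gflops and the file names in the example are hypothetical.

```shell
# compare_gflops BEFORE AFTER
# For every size N present in both result files, print:
#   N  GPU-GFlop/s-before  GPU-GFlop/s-after  gain
# Assumes four-column data rows with GPU GFlop/s in column 3.
compare_gflops() {
    awk '
        # Data rows start with an integer N; banners and rulers are skipped.
        $1 ~ /^[0-9]+$/ && NF == 4 {
            if (FNR == NR)
                before[$1] = $3          # first file: remember the old rate
            else if ($1 in before)
                printf "%d %.2f %.2f %+.2f\n", $1, before[$1], $3, $3 - before[$1]
        }
    ' "$1" "$2"
}
```

For example, if the two testing_zgetrf outputs above were saved as zgetrf_before.txt and zgetrf_after.txt, compare_gflops zgetrf_before.txt zgetrf_after.txt would print one line per size, ending with 7040 127.60 138.86 +11.26.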
So, when is the estimated release date (if you can guess) of the next magma version?
Also, for all of the above tests I used a word size of 64 bits.
Thank you very much,
Allan

Stan Tomov
Posts: 263
Joined: Fri Aug 21, 2009 10:39 pm

Re: Magma 0.2 with Nvidia Fermi GTX470 and GotoBLAS2 results

Post by Stan Tomov » Fri Apr 23, 2010 10:58 pm

Allan,
Yes, 300 GFlop/s is very interesting; actually (even more interesting) it is 360 GFlop/s, and it is in double precision (the kernel is not available yet; I think the GTX470 should get that speed). We hoped to get a new magma release out by the end of the month but are running late on this deadline.
Regards,
Stan
