Thanks for the new software toys - it's also nice to have MAGMA working in MATLAB!
Some assorted comments, some of which may be sorted out in the formal release...or mentioned in the documentation.
It turns out that calls to magma_sgesv_gpu seem to need matrices padded with zeros to sizes that are multiples of 32 - with any other size I seem to get wrong answers. This is with the Fermi version of the code.
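For anyone wanting to try the same workaround, here is a minimal sketch of the padding step in plain Python. It only shows the buffer layout: a column-major matrix is copied into a zero-filled buffer whose dimensions are rounded up to the next multiple of 32. The helper names `round_up` and `pad_column_major` are made up for this illustration and are not part of MAGMA, and which size arguments are then passed to magma_sgesv_gpu is not covered here.

```python
def round_up(n, multiple=32):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_column_major(a, rows, cols, multiple=32):
    """Copy a rows x cols column-major matrix (flat list) into a zero-filled
    buffer whose row and column counts are rounded up to `multiple`.

    Hypothetical helper for illustration; returns (buffer, padded_rows, padded_cols).
    """
    prows = round_up(rows, multiple)
    pcols = round_up(cols, multiple)
    out = [0.0] * (prows * pcols)
    for j in range(cols):
        # Column j starts at offset j * prows in the padded buffer.
        out[j * prows : j * prows + rows] = a[j * rows : (j + 1) * rows]
    return out, prows, pcols


if __name__ == "__main__":
    n = 1000
    a = [1.0] * (n * n)
    buf, prows, pcols = pad_column_major(a, n, n)
    print(prows, pcols, len(buf))
```

The padded row count plays the role of the leading dimension of the device array; the extra rows and columns stay at zero.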
Although the MAGMA 1.0/RC2 tarball came with the source code for sgemv for fermi, the Makefile does not seem to want to compile it - at least it is not set up for it. I'm using the cublas version for now. Perhaps this is an oversight, perhaps sgemv is still a work in progress.
The efficiency of the computation seems to depend critically on the host BLAS/LAPACK employed. Intel's MKL seems to run fastest, with AMD's ACML a close second. The native BLAS on my Suse 11.1 system and a downloaded and compiled ATLAS performed poorly, as did Sun's, I mean Oracle's, sunperf.a library. With these latter libraries my test sgesv routine runs some 30-45% slower than with the Intel or AMD libraries. I find this sensitivity a little odd, inasmuch as the GPU is supposed to be doing the work, but no doubt the technical details/requirements of these routines escape me.
Prompted by a recent post, I just tried GotoBLAS2, which is indeed the fastest of all the BLAS libraries I have tried so far.
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/

[Figure: Magma_Sgesv_Performance.gif - Comparison of the performance of sgesv using various versions of BLAS]
Here is a comparison of what I get for the various BLAS versions for a call to SGESV with NRHS=5000. The CPU is an AMD Phenom II X4 965, and Tcpu is calculated using MATLAB R2010b, single processor. The GPU is a GTX 480, "Fermi". ATLAS is version 3.8.3, downloaded and compiled by me following the directions on the ATLAS website. Suse 11.1 refers to the generic version of BLAS (probably ATLAS) available by rpm for that version of Linux. MKL is version 10.2.5.035. ACML is version 4.4.0, ifort. GotoBLAS2 is version 1.13, compiled by me using gfortran. Sunperf is from version 12.2 of Oracle/Sun solstudio. (I gather Intel i7s perform quite a bit faster in MATLAB than the AMD processors.)
In for a penny, in for a pound: here is the result of MAGMA's testing_sgetrf using GotoBLAS2 on a single processor (export OMP_NUM_THREADS=1), to give a benchmark for the AMD 965:
./testing_sgetrf
device 0: GeForce GTX 480, 1401.0 MHz clock, 1535.7 MB memory
   M      N    CPU GFlop/s    GPU GFlop/s    ||PA-LU||/(||A||*N)
==================================================================
1024   1024       19.17          39.09         2.251748e-09
2048   2048       20.77         115.83         1.963081e-09
3072   3072       21.59         197.02         1.827424e-09
4032   4032       22.05         254.75         2.083052e-09
5184   5184       22.41         338.94         1.983713e-09
6016   6016       22.64         389.84         1.922557e-09
7040   7040       22.82         439.81         1.863838e-09
8064   8064       22.97         478.04         1.977259e-09
9088   9088       23.11         511.14         2.192004e-09
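To put those numbers in perspective, here is a small Python sketch that computes the GPU/CPU ratio for each matrix size. The figures are copied straight from the testing_sgetrf output above; the `speedups` helper is just an illustrative name. The ratio grows from roughly 2x at N=1024 to roughly 22x at N=9088.

```python
# (N, CPU GFlop/s, GPU GFlop/s), copied from the testing_sgetrf output above.
RESULTS = [
    (1024, 19.17, 39.09),
    (2048, 20.77, 115.83),
    (3072, 21.59, 197.02),
    (4032, 22.05, 254.75),
    (5184, 22.41, 338.94),
    (6016, 22.64, 389.84),
    (7040, 22.82, 439.81),
    (8064, 22.97, 478.04),
    (9088, 23.11, 511.14),
]


def speedups(results):
    """Return (N, GPU GFlop/s divided by CPU GFlop/s) for each row."""
    return [(n, gpu / cpu) for n, cpu, gpu in results]


if __name__ == "__main__":
    for n, ratio in speedups(RESULTS):
        print(f"N={n:5d}  GPU/CPU = {ratio:5.1f}x")
```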