Turns out it was just the default test settings causing the test errors. Apparently 2GB of GPU RAM isn't enough to run testing_dgemm under the default settings...
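For anyone wondering why 2GB runs out: a rough back-of-envelope sketch (my own arithmetic, not from the MAGMA docs). A square dgemm keeps three N×N double matrices on the GPU, so the footprint is at least 3·N²·8 bytes before any workspace or overhead. The sizes below are illustrative, not the actual defaults:

```shell
# Rough GPU memory needed for a square dgemm: A, B and C are N x N doubles,
# so the footprint is at least 3 * N^2 * 8 bytes (ignoring workspace/overhead).
N=8192
BYTES=$((3 * N * N * 8))
echo "N=$N needs $BYTES bytes"    # ~1.5 GiB: fits in 2 GB
N=12000
BYTES=$((3 * N * N * 8))
echo "N=$N needs $BYTES bytes"    # ~3.2 GiB: well past 2 GB
```

So once the default test sweep pushes N into five digits, a 2GB card has no chance.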
Found the solution here on these forums, at viewtopic.php?f=2&t=146
All looking very promising, I must say. Looking forward to getting some results, and fast!
Still, I had played with make.inc and Makefile.internal quite a lot, as I'd assumed that's where the problem lay.
I thought I might get slightly more optimised binaries (and quicker build times) by editing the nvcc flags in Makefile.internal so it doesn't build the compute capability 1.0 CUDA code as well as the compute capability 1.3 code.
Specifically, I changed the TESLA_OPT line to: "TESLAOPT = -arch compute_13 -code sm_13 -DGPUSHMEM=130"
My final make.inc looked like this:
GPU_TARGET = Tesla
CC = icc
NVCC = nvcc
FORT = ifort
ARCH = xiar
ARCHFLAGS = cr
RANLIB = ranlib
OPTS = -DADD_ -O3 -m64 -fPIC -openmp -mkl=sequential
FOPTS = -DADD_ -O3 -m64 -fPIC -cpp -nofor-main -mkl=sequential
NVOPTS = --compiler-options "-fPIC -O3 -fno-strict-aliasing -DUNIX -DADD_"
LDOPTS = -fPIC -Xlinker -zmuldefs
LIB = -lirc -limf -lmkl_rt -lpthread -lcublas -lcudart
CUDADIR = /usr/local/cuda
LIBDIR = -L$(MKLROOT)/lib/intel64
INC = -I$(CUDADIR)/include
LIBMAGMA = $(MAGMA_DIR)/lib/libmagma.a
LIBMAGMABLAS = $(MAGMA_DIR)/lib/libmagmablas.a
N.B. I added the "lib" prefix to LIBMAGMA and LIBMAGMABLAS. I switched to mkl=sequential after the suggestion in the README. If you use mkl=parallel (the default), you also need to add "-liomf5"-sorry, "-liomp5" to the link flags, and "-openmp" to OPTS and FOPTS.
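For reference, the mkl=parallel variant of the relevant lines might look something like this. I haven't tested this combination myself, so treat it as a sketch based on the note above:

```make
# Hypothetical mkl=parallel variant (untested) - per the N.B. above,
# -openmp goes in OPTS/FOPTS and -liomp5 joins the link line.
OPTS  = -DADD_ -O3 -m64 -fPIC -openmp -mkl=parallel
FOPTS = -DADD_ -O3 -m64 -fPIC -cpp -nofor-main -openmp -mkl=parallel
LIB   = -lirc -limf -lmkl_rt -liomp5 -lpthread -lcublas -lcudart
```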
N.B.2. nvcc sometimes invokes gcc as its host compiler, but this can be overridden. Both Nvidia and Intel have had employees post on their forums (albeit over a year ago) discussing this: Nvidia's stance was to patch Intel's math.h, and Intel's was to recompile your entire kernel with icc. Both methods seem a bit drastic.
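If you'd rather sidestep the whole math.h business, nvcc lets you point at a specific host compiler via --compiler-bindir (alias -ccbin). Something along these lines in make.inc should do it; the gcc path here is just an example, substitute whatever version works on your system:

```make
# Pin nvcc's host compiler explicitly so it doesn't pick one up by accident.
# /usr/bin/gcc-4.4 is a placeholder path - point it at your own gcc.
NVCC = nvcc --compiler-bindir /usr/bin/gcc-4.4
```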
Either way, got it working now! Awesome! :)