Dear All,

I am fairly new to GPU computing with CUDA but have gathered a lot of information from the internet over the last month or so. I am an academic researcher who has mostly programmed computational routines in Python, for the simplicity of scripting languages (of course I know the runtime speed kind of sucks for more complex problems...). I have identified CUDA GPGPU (and thus potentially also MAGMA) as a great way of "outsourcing" certain hotspot code fragments to the GPU. In particular I am thinking of cases such as bootstrapping techniques in econometrics and Monte Carlo integration in finance, where in the former case I may have to solve a linear algebra problem many thousands of times, which I want to do in parallel rather than serially.
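To make the workload concrete, here is a plain-NumPy sketch (shapes and seed are just illustrative) of the bootstrap-style pattern I mean: the same small linear system solved thousands of times, which I currently do serially on the CPU and would like to batch on the GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n = 5000, 8  # e.g. 5000 bootstrap replications of an 8x8 system

# One stack of (n x n) systems and right-hand sides, one per replication.
# Adding n*I keeps the random matrices well-conditioned for this sketch.
A = rng.standard_normal((n_reps, n, n)) + n * np.eye(n)
b = rng.standard_normal((n_reps, n))

# NumPy broadcasts linalg.solve over the leading axis, so this one call
# is the CPU version of the batched solve I want on the GPU.
x = np.linalg.solve(A, b[..., None])[..., 0]
print(x.shape)  # (5000, 8)
```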

There is one problem I have been experiencing: many CUDA-optimized routines, such as SGEMM in CUBLAS, appear to work well only with square matrices. Is the same true of MAGMA's SGEMM? All I want is a general, "fire-and-forget", but very fast GPU SGEMM for a GENERAL matrix multiplication, where the matrices are [p,q] x [q,r] with p ≠ q ≠ r. The best option in my case would probably be PyCUDA, since Python is my main language, but to my surprise I still have not found working kernel source code that delivers a fool-proof solution to this problem. Could anybody give me a hint as to what approach I ought to take? I have played around a lot with Vasily Volkov's fairly new SGEMM kernel source code, but so far that too has only worked for me in the special case of square matrices. I am 100% certain that the alternative "solution" of zero-padding general problems to make them square cannot be the answer to what I am looking for :-)
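For concreteness, here is a minimal NumPy reference of the general non-square case I am asking about (the particular shapes are hypothetical, chosen so that none of the dimensions are equal or multiples of 16). I use this as the ground truth against which I would validate any GPU SGEMM kernel.

```python
import numpy as np

# General case: C[p, r] = A[p, q] @ B[q, r] with p != q != r.
p, q, r = 96, 33, 65  # deliberately non-square example shapes
A = np.random.rand(p, q).astype(np.float32)
B = np.random.rand(q, r).astype(np.float32)

C_ref = A @ B  # reference result, shape (p, r)

# Any GPU kernel result C_gpu should then satisfy, up to single precision:
#   np.allclose(C_gpu, C_ref, rtol=1e-4)
print(C_ref.shape)  # (96, 65)
```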

Thanks,

Eric