Using sgemm for rectangular (nonsquare) matrix multiply

 Posts: 1
 Joined: Mon Apr 09, 2012 2:48 am
Using sgemm for rectangular (nonsquare) matrix multiply
Dear All,
I am fairly new to GPU computing using CUDA but have gathered a lot of information from the internet over the last month or so. I am an academic researcher who has mostly been programming computational routines in Python, due to the simplicity of scripting languages (of course I know, runtime speed kind of sucks for more complex problems...). I have identified CUDA GPGPU (and thus potentially also MAGMA) as a great way of "outsourcing" certain hotspot code fragments to the GPU. In particular I am thinking about cases such as bootstrapping techniques in econometrics and Monte Carlo integration in Finance, where in the former case I may have to solve a linear algebra problem many thousand times over, which I want to do in parallel rather than serially.
There is one problem I have been experiencing, which is that many CUDAoptimized routines  such as SGEMM in CUBLAS  appear only to work well with square matrices. Is this the same with MAGMA SGEMM as well? All I want is a general, "fireandforget", but very fast GPU SGEMM version for a GENERAL matrix multiplication problem, where the matrices are [p,q] x [q,r] where p neq q neq r. I think the best option in my case would be to use PyCUDA actually, because I employ Python as my main language, but to my surprise I also still have not found a working source code for a kernel which delivers a foolproof solution to my problem. Could anybody provide me with a hint as to what approach I ought to take? I have played around a lot with Vasily Volkov's fairly new SGEMM kernel source code, but again, for me so far this has only worked for the special case of square matrices. I am 100% certain that the alternative "solution" of applying padding of zeros to general problems to make them all square cannot be answer to what I am looking for :)
Thanks,
Eric
I am fairly new to GPU computing using CUDA but have gathered a lot of information from the internet over the last month or so. I am an academic researcher who has mostly been programming computational routines in Python, due to the simplicity of scripting languages (of course I know, runtime speed kind of sucks for more complex problems...). I have identified CUDA GPGPU (and thus potentially also MAGMA) as a great way of "outsourcing" certain hotspot code fragments to the GPU. In particular I am thinking about cases such as bootstrapping techniques in econometrics and Monte Carlo integration in Finance, where in the former case I may have to solve a linear algebra problem many thousand times over, which I want to do in parallel rather than serially.
There is one problem I have been experiencing, which is that many CUDAoptimized routines  such as SGEMM in CUBLAS  appear only to work well with square matrices. Is this the same with MAGMA SGEMM as well? All I want is a general, "fireandforget", but very fast GPU SGEMM version for a GENERAL matrix multiplication problem, where the matrices are [p,q] x [q,r] where p neq q neq r. I think the best option in my case would be to use PyCUDA actually, because I employ Python as my main language, but to my surprise I also still have not found a working source code for a kernel which delivers a foolproof solution to my problem. Could anybody provide me with a hint as to what approach I ought to take? I have played around a lot with Vasily Volkov's fairly new SGEMM kernel source code, but again, for me so far this has only worked for the special case of square matrices. I am 100% certain that the alternative "solution" of applying padding of zeros to general problems to make them all square cannot be answer to what I am looking for :)
Thanks,
Eric
Re: Using sgemm for rectangular (nonsquare) matrix multiply
The gemm in CUBLAS (and in MAGMA BLAS) works for rectangular matrices. As far as I know, all the CUBLAS work for all matrix sizes. If you are seeing some problems, please report the specific case. A short sample code would be helpful.
mark
mark
Re: Using sgemm for rectangular (nonsquare) matrix multiply
Hello,
I did non  square matrix multiplication by using cublasSgemm, I am getting the wrong output. I m very new to use those library functions. Please any one help me.
I used "cublasSgemm('n','n',N1,M2,M1,alpha,d_a,M1,d_b,M1,beta,d_c,N1)"
int N1 = 2; // row of matrix1
int M1 = 3; // col of matrix1
int N2 = 3; // row of matrix2
int M2 = 2; // col of matrix2
Thanks in advance..
I did non  square matrix multiplication by using cublasSgemm, I am getting the wrong output. I m very new to use those library functions. Please any one help me.
I used "cublasSgemm('n','n',N1,M2,M1,alpha,d_a,M1,d_b,M1,beta,d_c,N1)"
int N1 = 2; // row of matrix1
int M1 = 3; // col of matrix1
int N2 = 3; // row of matrix2
int M2 = 2; // col of matrix2
Thanks in advance..
Re: Using sgemm for rectangular (nonsquare) matrix multiply
Hello,
I m very new to use these library files. I did non  square matrix multiplication using cublasSgemm function, but am getting the wrong ouput. Please any one help me.
My sample code :
"cublasSgemm('n','n',N1,M2,M1,alpha,d_a,M1,d_b,M1,beta,d_c,N1)"
int N1 = 2; // row of matrix1
int M1 = 3; // col of matrix1
int N2 = 3; // row of matrix2
int M2 = 2; // col of matrix2
Thanks in advance..
I m very new to use these library files. I did non  square matrix multiplication using cublasSgemm function, but am getting the wrong ouput. Please any one help me.
My sample code :
"cublasSgemm('n','n',N1,M2,M1,alpha,d_a,M1,d_b,M1,beta,d_c,N1)"
int N1 = 2; // row of matrix1
int M1 = 3; // col of matrix1
int N2 = 3; // row of matrix2
int M2 = 2; // col of matrix2
Thanks in advance..
Re: Using sgemm for rectangular (nonsquare) matrix multiply
It will help if you use the BLAS m, n, k convention. For C := alpha A B + beta C
C is m x n
A is m x k
B is k x n
It appears your leading dimensions may be wrong. The leading dimension (lda, ldb, ldc) is the number of rows allocated for the matrix, which may be greater than the number of rows in the gemm. If the gemm is the whole matrix without padding, then lda=m, ldb=k, and ldc=m. You can of course add padding (extra unused rows on bottom), so lda >= m, ldb >= k, ldc >= m. We usually pad the number of rows to a multiple of 32 on the GPU. See magma/testing/testing_sgemm.cpp for sample code.
cublasSgemm( 'n', 'n', m, n, k, alpha, dA, lda, dB, ldb, beta, dC, ldc );
You used N for rows and M for columns, which is the opposite of the normal matrix convention (m rows x n cols). In your notation,
lda = N1;
ldb = N2; // == M1, the inner, k, dimension
ldc = N1;
cublasSgemm( 'n', 'n', N1, M2, M1, alpha, dA, lda /*N1*/, dB, ldb /*N2*/, dC, ldc /*N1*/ );
In particular, the lda you passed was M1, not N1.
Hope that helps. If you still have problems, please post a more complete sample code. I can't tell how you are allocating the matrices, putting data onto the GPU, getting results from the GPU, or what your expected and actual outputs are.
mark
C is m x n
A is m x k
B is k x n
It appears your leading dimensions may be wrong. The leading dimension (lda, ldb, ldc) is the number of rows allocated for the matrix, which may be greater than the number of rows in the gemm. If the gemm is the whole matrix without padding, then lda=m, ldb=k, and ldc=m. You can of course add padding (extra unused rows on bottom), so lda >= m, ldb >= k, ldc >= m. We usually pad the number of rows to a multiple of 32 on the GPU. See magma/testing/testing_sgemm.cpp for sample code.
cublasSgemm( 'n', 'n', m, n, k, alpha, dA, lda, dB, ldb, beta, dC, ldc );
You used N for rows and M for columns, which is the opposite of the normal matrix convention (m rows x n cols). In your notation,
lda = N1;
ldb = N2; // == M1, the inner, k, dimension
ldc = N1;
cublasSgemm( 'n', 'n', N1, M2, M1, alpha, dA, lda /*N1*/, dB, ldb /*N2*/, dC, ldc /*N1*/ );
In particular, the lda you passed was M1, not N1.
Hope that helps. If you still have problems, please post a more complete sample code. I can't tell how you are allocating the matrices, putting data onto the GPU, getting results from the GPU, or what your expected and actual outputs are.
mark
Re: Using sgemm for rectangular (nonsquare) matrix multiply
Thanks for your reply Sir, I made a mistake. Now i got correct output.

 Posts: 1
 Joined: Thu Aug 15, 2019 1:23 am
Re: Using sgemm for rectangular (nonsquare) matrix multiply
Hi all, I have a below code. In my code, I want to replace cublasSgemm function by my code. But when I ran, I got a wrong results. If you know how to solve my problem, please show me. Thanks very much.
Diem
Code: Select all
void caffe_gpu_gemm<float>(const CBLAS_TRANSPOSE TransA,
const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
const float alpha, const float* A, const float* B, const float beta,
float* C) {
// Note that cublas follows fortran order.
int lda = (TransA == CblasNoTrans) ? K : M;
int ldb = (TransB == CblasNoTrans) ? N : K;
cublasOperation_t cuTransA =
(TransA == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
cublasOperation_t cuTransB =
(TransB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
// CUBLAS_CHECK(cublasSgemm(Caffe::cublas_handle(), cuTransB, cuTransA,
// N, M, K, &alpha, B, ldb, A, lda, &beta, C, N));
////
cudaDeviceSynchronize();
float* A_cpu = (float *)malloc(K*M*sizeof(float));
float* B_cpu = (float *)malloc(N*K*sizeof (float));
float* C_cpu = (float *)malloc (N*M*sizeof (float));
float * D_cpu = (float *)malloc (N*M*sizeof (float));
cudaMemcpy( A_cpu, A, K*M, cudaMemcpyDeviceToHost);
cudaMemcpy( B_cpu, B, N*K, cudaMemcpyDeviceToHost);
cudaMemcpy( D_cpu, C, N*M, cudaMemcpyDeviceToHost);
for (int row = 0; row < N; row++) {
for (int col = 0; col < M; col++) {
for (int i = 0; i < K; i++) {
if (i == 0)
C_cpu[row*M + col] = alpha* (B_cpu[row*K + i] * A_cpu[i*M+ col]) +beta* D_cpu[row*M+col] ;
else C_cpu[row*M + col] += alpha* (B_cpu[row*K + i] * A_cpu[i*M+ col]) +beta* D_cpu[row*M+col] ;
}
}
}
cudaMemcpy(C,C_cpu, N*M, cudaMemcpyHostToDevice);
printf ("da lam xong");
/////
}
Re: Using sgemm for rectangular (nonsquare) matrix multiply
First, in the future, please start a new topic, rather than replying to a 7 year old topic.
Your variable names seem mixed up: there are M rows and N cols in C. You are adding beta*C multiple times. Normally I use i for rows, j for cols. Unfortunately k is used for dimension, so we need something else for that loop counter. Also, use lda, ldb, ldc for matrix strides for clarity and flexibility. I recommend looking at the reference BLAS (http://www.netlib.org/lapack/) for naive, but correct, dgemm implementations, and at magmablas for more optimized, complicated implementations.
mark
Your variable names seem mixed up: there are M rows and N cols in C. You are adding beta*C multiple times. Normally I use i for rows, j for cols. Unfortunately k is used for dimension, so we need something else for that loop counter. Also, use lda, ldb, ldc for matrix strides for clarity and flexibility. I recommend looking at the reference BLAS (http://www.netlib.org/lapack/) for naive, but correct, dgemm implementations, and at magmablas for more optimized, complicated implementations.
Code: Select all
// Naive, slow implementation.
// NoTrans, NoTrans case.
for (j = 0; j < n; ++j) {
for (i = 0; i < m; ++i) {
tmp = 0;
for (kk = 0; kk < k; ++kk) {
tmp += A[ i + kk*lda ] * B[ kk + j*ldb ];
}
C[ i + j*ldc ] = tmp + beta*C[ i + j*ldc ];
}
}