by **mgates3** » Mon Nov 19, 2012 12:00 pm

Yes, you can divide the matrix in half and compute the gemm in parallel. For instance,

A [B0 B1] = [C0 C1]

becomes

A B0 = C0 and A B1 = C1.

Note that while B and C are split, the matrix A must be duplicated on the two GPUs. Or you can split the other way,

[ A0 ] B = [ C0 ]
[ A1 ]     [ C1 ]

in which case A and C are split by rows and B must be duplicated on both GPUs.

For most linear algebra algorithms, distributing the matrix in a block column cyclic fashion, or sometimes block row cyclic, is efficient for a small number of GPUs.

You can use cublas gemm or magmablas gemm to achieve this, in both cases using cudaSetDevice to switch between GPUs. For example, with the cuBLAS v2 API (note that alpha and beta are passed by pointer, and each handle must have been created while its device was current):

cudaSetDevice( 0 );

cublasSgemm( handle0, CUBLAS_OP_N, CUBLAS_OP_N, m, n0, k, &alpha, A0, lda, B0, ldb, &beta, C0, ldc );

cudaSetDevice( 1 );

cublasSgemm( handle1, CUBLAS_OP_N, CUBLAS_OP_N, m, n1, k, &alpha, A1, lda, B1, ldb, &beta, C1, ldc );

Here I assumed B and C were split; A is duplicated in A0 and A1 on GPU 0 and 1, respectively. The calls are asynchronous with respect to the host, so synchronize each device (e.g., cudaDeviceSynchronize) before reading the results.

-mark