Yes, you can split the matrices in half and compute the GEMM in parallel, one piece per GPU. For instance,
A [B0 B1] = [C0 C1]
so that A B0 = C0 and A B1 = C1.
Note that while B and C are split, the matrix A must be duplicated on the two GPUs. Or you can split the other way:
[ A0 ] B = [ C0 ]
[ A1 ]     [ C1 ]
so that A0 B = C0 and A1 B = C1; here A and C are split and B must be duplicated instead.
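Both block identities are easy to check numerically. A minimal NumPy sketch (this only illustrates the math of the splits, not multi-GPU execution):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 5, 6
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = A @ B

# Column split: B = [B0 B1], C = [C0 C1]; A would be duplicated on both GPUs.
B0, B1 = B[:, : n // 2], B[:, n // 2 :]
assert np.allclose(np.hstack([A @ B0, A @ B1]), C)

# Row split: A = [A0; A1], C = [C0; C1]; B would be duplicated instead.
A0, A1 = A[: m // 2, :], A[m // 2 :, :]
assert np.allclose(np.vstack([A0 @ B, A1 @ B]), C)
```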
For most linear algebra algorithms, distributing the matrix in a block column cyclic fashion, or sometimes block row cyclic, is efficient for a small number of GPUs.
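To make the block column cyclic idea concrete, here is a small sketch of the ownership rule (the helper name is mine, not MAGMA's API): columns are grouped into blocks of width nb, and block j is assigned to GPU j mod ngpu, so work stays balanced as the factorization sweeps across the matrix.

```python
def block_column_owner(j, nb, ngpu):
    """GPU that owns global column j under a 1-D block column cyclic
    distribution with block width nb over ngpu GPUs."""
    return (j // nb) % ngpu

# With nb = 2 and 2 GPUs: columns 0-1 -> GPU 0, 2-3 -> GPU 1, 4-5 -> GPU 0, ...
owners = [block_column_owner(j, nb=2, ngpu=2) for j in range(8)]
# owners == [0, 0, 1, 1, 0, 0, 1, 1]
```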
You can use cublas gemm or magmablas gemm to achieve this, in both cases using cudaSetDevice to switch between GPUs. Note that each cuBLAS handle must be created while its GPU is the current device (i.e., call cudaSetDevice before cublasCreate). For example,
// C0 = alpha*A*B0 + beta*C0 on GPU 0 (A0 is the copy of A on GPU 0)
cudaSetDevice( 0 );
cublasSgemm( handle0, CUBLAS_OP_N, CUBLAS_OP_N, m, n0, k, &alpha, A0, lda, B0, ldb, &beta, C0, ldc );
// C1 = alpha*A*B1 + beta*C1 on GPU 1 (A1 is the copy of A on GPU 1)
cudaSetDevice( 1 );
cublasSgemm( handle1, CUBLAS_OP_N, CUBLAS_OP_N, m, n1, k, &alpha, A1, lda, B1, ldb, &beta, C1, ldc );
Here I assumed B and C are split; A is duplicated as A0 and A1 on GPUs 0 and 1, respectively. Since cublasSgemm returns to the host asynchronously, the two GEMMs run concurrently; synchronize each device (e.g., cudaDeviceSynchronize) before using the results.