It will help if you use the BLAS m, n, k convention. For C := alpha A B + beta C:
C is m x n
A is m x k
B is k x n
It appears your leading dimensions may be wrong. The leading dimension (lda, ldb, ldc) is the number of rows allocated for the matrix, which may be greater than the number of rows in the gemm. If the gemm is the whole matrix without padding, then lda=m, ldb=k, and ldc=m. You can of course add padding (extra unused rows on bottom), so lda >= m, ldb >= k, ldc >= m. We usually pad the number of rows to a multiple of 32 on the GPU. See magma/testing/testing_sgemm.cpp for sample code.
cublasSgemm( 'n', 'n', m, n, k, alpha, dA, lda, dB, ldb, beta, dC, ldc );
You used N for rows and M for columns, which is the opposite of the normal matrix convention (m rows x n cols). In your notation,
lda = N1;
ldb = N2; // == M1, the inner, k, dimension
ldc = N1;
cublasSgemm( 'n', 'n', N1, M2, M1, alpha, dA, lda /*N1*/, dB, ldb /*N2*/, beta, dC, ldc /*N1*/ );
In particular, the lda you passed was M1, not N1.
Hope that helps. If you still have problems, please post a more complete code sample. I can't tell how you are allocating the matrices, putting data onto the GPU, getting results from the GPU, or what your expected and actual outputs are.