graphicsRat wrote:
flexo1234 wrote: But now if I compare the time it takes to do a matrix-vector multiplication with this optimized BLAS library and with Matlab, Matlab is still a bit faster. Is this normal? Did anyone else compare the speed between these?

Did you use the BLAS subroutines SGEMV and DGEMV (GotoBLAS implements both)? They should outperform Matlab by many orders of magnitude. I would be very surprised if they didn't.

Sorry for jumping into this thread; I don't even use a Windows machine.

What caught my eye was the claim that the routines sgemv and dgemv are faster than Matlab's matrix-vector multiplication. Is that so?

Today a friend of mine was running some Matlab code that was using 7 cores on a 12-core machine. I was wondering why Matlab would be running in parallel, and then it hit me: the matrix multiplication and other built-in routines must run in parallel.

In any case, I'm rewriting my friend's code in C in order to avoid the many loops that the program has. Since I'm using the LAPACK library, I guess it would be useful to know whether the routine dgemm would actually outperform Matlab. I haven't had much time to test this, but I guess what I want to know is:

Do the routines dgemv and dgemm run in parallel? If they do not, how does calling the BLAS routine compare with coding my own loops in C to do the multiplication?
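To make the comparison concrete, here is a minimal sketch of what "my own loops" might look like for the matrix-vector case. It computes y := alpha*A*x + beta*y, the same operation dgemv performs for a non-transposed matrix. The function name naive_gemv and the row-major storage are assumptions made for this illustration; neither is part of any BLAS interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written loops for y := alpha*A*x + beta*y.
 * A is an m-by-n matrix stored row-major (an assumption of this sketch).
 * naive_gemv is a hypothetical name, not a BLAS routine. */
void naive_gemv(size_t m, size_t n, double alpha, const double *A,
                const double *x, double beta, double *y)
{
    assert(A && x && y);
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;               /* dot product of row i with x */
        for (size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}
```

An optimized BLAS typically adds cache blocking, vectorization and possibly threading on top of this basic loop nest, so timing both on the target machine is the only reliable way to see the gap.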

If I want to compute A1 = B1*C1; A2 = B2*C2; A3 = B3*C3; ... ; AN = BN*CN, given that all the matrices are n-by-n, then I would think it would be more efficient to take advantage of loop jamming, as in the pseudo-code below:

Code:

    for i=1:n
        for j=1:n
            A1(i, j) = 0;
            A2(i, j) = 0;
            ...
            AN(i, j) = 0;
            for k=1:n
                A1(i, j) = A1(i, j) + B1(i, k)*C1(k, j);
                A2(i, j) = A2(i, j) + B2(i, k)*C2(k, j);
                ...
                AN(i, j) = AN(i, j) + BN(i, k)*CN(k, j);
            end
        end
    end
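For anyone who wants to time the idea, the jammed loops above might translate to C roughly as follows. This is only a sketch under stated assumptions: the N matrices are packed back to back in one contiguous row-major array per operand, and the name fused_batch_matmul is made up for this example. It indexes C by (k, j), as the usual matrix product requires. Whether jamming is actually faster than N separate products depends on cache behavior and has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-jammed batch product: A_b = B_b * C_b for b = 0 .. batch-1.
 * Assumed layout: matrix b of each operand starts at offset b*n*n and is
 * stored row-major. fused_batch_matmul is a hypothetical name. */
void fused_batch_matmul(size_t batch, size_t n,
                        double *A, const double *B, const double *C)
{
    assert(A && B && C);
    const size_t stride = n * n;        /* elements per matrix */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Clear entry (i, j) of every result matrix. */
            for (size_t b = 0; b < batch; ++b)
                A[b * stride + i * n + j] = 0.0;
            /* Accumulate entry (i, j) of all products in one k loop. */
            for (size_t k = 0; k < n; ++k)
                for (size_t b = 0; b < batch; ++b)
                    A[b * stride + i * n + j] +=
                        B[b * stride + i * n + k] * C[b * stride + k * n + j];
        }
    }
}
```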

Now, if I were to code this using, say, pthreads, then I could divide the matrix into sections so that each thread could take care of the computation of a single entry. All I'm wondering is whether the BLAS routine will be better than the one I wrote above. Forgive my pseudo-code; I may have made a mistake somewhere, but I hope you can get the idea. Anyway, does anyone think I should code my own matrix multiplication with pthreads in C in order to take advantage of a computer with multiple cores and the loop jamming shown in the sample code above? Thanks
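A rough sketch of the pthreads variant, for a single product A = B*C, could look like the code below. Note one deliberate change from the description above: each thread gets a contiguous block of rows rather than a single entry, since creating one thread per entry would likely cost far more than the arithmetic it performs. The names row_task, multiply_rows and threaded_matmul, the MAX_THREADS limit and the row-major layout are all assumptions of this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64                  /* arbitrary cap for this sketch */

typedef struct {
    size_t n, row_begin, row_end;       /* rows [row_begin, row_end) */
    const double *B;
    const double *C;
    double *A;
} row_task;

/* Thread body: compute the assigned rows of A = B * C (row-major, n-by-n). */
static void *multiply_rows(void *arg)
{
    const row_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; ++i)
        for (size_t j = 0; j < t->n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < t->n; ++k)
                sum += t->B[i * t->n + k] * t->C[k * t->n + j];
            t->A[i * t->n + j] = sum;
        }
    return NULL;
}

/* Split the rows of A evenly over nthreads threads.
 * Returns 0 on success, otherwise the first pthread error code seen. */
int threaded_matmul(size_t n, size_t nthreads,
                    double *A, const double *B, const double *C)
{
    assert(A && B && C);
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    row_task task[MAX_THREADS];
    size_t created = 0;
    int err = 0;

    for (size_t t = 0; t < nthreads; ++t) {
        task[t] = (row_task){ .n = n,
                              .row_begin = t * n / nthreads,
                              .row_end = (t + 1) * n / nthreads,
                              .B = B, .C = C, .A = A };
        err = pthread_create(&tid[t], NULL, multiply_rows, &task[t]);
        if (err)
            break;                      /* stop spawning, join what exists */
        ++created;
    }
    for (size_t t = 0; t < created; ++t) {
        int join_err = pthread_join(tid[t], NULL);
        if (!err)
            err = join_err;
    }
    return err;
}
```

Because the threads write disjoint rows of A and only read B and C, no locking is needed. On some toolchains the program must be built with the -pthread option.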

-Manuel