Batched GEMV with float4

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
Genji
Posts: 7
Joined: Mon May 29, 2017 8:58 pm

Batched GEMV with float4

Post by Genji » Wed Jun 21, 2017 6:28 am

I want to use batched GEMV with a vector of float4 elements. However, I am unsure how to proceed: do I break my float4 into its four components, do the GEMV four times, and then reassemble them, or do I just pass the size of float4 as the increment and let it sort itself out?
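
For concreteness, here is a rough sketch of the second option as I picture it (untested, and I use a plain cuBLAS SGEMV on a single matrix just to illustrate the strides, not the actual batched call):

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Sketch only: compute y_c = A * x_c for each of the four float4
 * components c by reinterpreting the float4 arrays as float arrays
 * with stride 4, instead of physically splitting them.
 * A is m x n column-major; x holds n float4 elements, y holds m. */
void gemv_float4(cublasHandle_t handle, int m, int n,
                 const float *A, int lda,
                 const float4 *x, float4 *y)
{
    const float alpha = 1.0f, beta = 0.0f;
    for (int c = 0; c < 4; ++c) {
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha,
                    A, lda,
                    (const float *)x + c, 4,   /* component c, incx = 4 */
                    &beta,
                    (float *)y + c, 4);        /* component c, incy = 4 */
    }
}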



Kind regards
Last edited by Genji on Tue Jul 04, 2017 6:28 am, edited 1 time in total.

haidar
Posts: 19
Joined: Fri Sep 19, 2014 3:43 pm

Re: Batched GEMV with float4

Post by haidar » Mon Jul 03, 2017 10:52 pm

Can you please elaborate in more detail on what you want to do?
Do you mean the float4 CUDA vector type?

I think it might be easiest to cast the type to float and use the single precision SGEMV.
In terms of performance, our GEMV routines reach the theoretical peak, which is bandwidth/2 for single precision SGEMV and bandwidth/4 for double precision DGEMV.
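
To spell out where those figures come from (my own back-of-the-envelope arithmetic, for large m and n where reading the matrix dominates the memory traffic): a GEMV does 2mn flops while moving roughly mn matrix elements, so

\[
\text{SGEMV:}\ \frac{2mn\ \text{flops}}{4mn\ \text{bytes}} = \frac{1}{2}\ \text{flop/byte}
\;\Rightarrow\; \text{peak} \approx \frac{\text{bandwidth}}{2},
\qquad
\text{DGEMV:}\ \frac{2mn}{8mn} = \frac{1}{4}
\;\Rightarrow\; \text{peak} \approx \frac{\text{bandwidth}}{4}.
\]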
Thanks
Azzam

Genji
Posts: 7
Joined: Mon May 29, 2017 8:58 pm

Re: Batched GEMV with float4

Post by Genji » Tue Jul 04, 2017 6:26 am

To clarify: do I break up my float4 into 4 GEMVs, or use a GEMM with a 4-column matrix?

haidar
Posts: 19
Joined: Fri Sep 19, 2014 3:43 pm

Re: Batched GEMV with float4

Post by haidar » Tue Aug 01, 2017 9:13 pm

I think both should provide similar performance, since a GEMM with 4 columns will look like 4 GEMVs.
This is a memory-bound operation, so its performance will behave like GEMV performance.
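
As an illustration of the GEMM variant (a sketch only and untested; I show plain cuBLAS for a single matrix, the batched MAGMA call being analogous): if you view the float4 vector x as a 4 x n column-major matrix X with leading dimension 4, then C = X * A^T is 4 x m, and stored with ldc = 4 its memory layout is exactly the interleaved float4 output y.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Sketch only: apply A (m x n) to each component of x (n float4
 * elements) in one GEMM.  C = X * A^T, where X is x viewed as a
 * 4 x n matrix (ld = 4); the 4 x m result, written with ldc = 4,
 * is already the interleaved float4 vector y (m elements). */
void gemv_float4_via_gemm(cublasHandle_t handle, int m, int n,
                          const float *A, int lda,
                          const float4 *x, float4 *y)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                4, m, n, &alpha,
                (const float *)x, 4,   /* X: 4 x n */
                A, lda,                /* op(B) = A^T: n x m */
                &beta,
                (float *)y, 4);        /* C: 4 x m, interleaved float4 */
}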
Azzam
