
Batched GEMV with float4

Posted: Wed Jun 21, 2017 6:28 am
by Genji
I want to use batched GEMV with a vector of float4 elements, but I am unsure how to proceed: do I split the float4 into its components, run the GEMV four times, and then reassemble them, or can I just pass the size of float4 as inc and let it sort itself out?



Kind regards

Re: Batched GEMV with float4

Posted: Mon Jul 03, 2017 10:52 pm
by haidar
Can you please explain in more detail what you want to do?
Do you mean the CUDA float4 vector type?

I think it might be easy to cast the type to float and use the single-precision sgemv.
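To make the cast-to-float idea concrete, here is a minimal CPU sketch. It is not the MAGMA or cuBLAS API: `gemv_ref` and `gemv_float4` are hypothetical names, and the code assumes the float4 vector is stored as interleaved floats (x0 y0 z0 w0 x1 y1 ...), which is how an array of CUDA float4 lies in memory. BLAS-style increments are counted in elements, not bytes, so each component is reached with a pointer offset of c and an increment of 4 rather than sizeof(float4).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference column-major GEMV with BLAS-style strides: y = A * x.
// A is m x n with leading dimension lda; incx and incy count floats, not bytes.
void gemv_ref(int m, int n, const float* A, int lda,
              const float* x, int incx, float* y, int incy) {
    assert(lda >= m && incx > 0 && incy > 0);
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) {
            acc += A[i + static_cast<std::size_t>(j) * lda] *
                   x[static_cast<std::size_t>(j) * incx];
        }
        y[static_cast<std::size_t>(i) * incy] = acc;
    }
}

// Apply A to each of the four components of an interleaved "float4" vector.
// x holds n float4 elements viewed as 4*n floats; the result holds m float4
// elements viewed as 4*m floats, in the same interleaved layout.
std::vector<float> gemv_float4(int m, int n, const std::vector<float>& A,
                               const std::vector<float>& x) {
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(x.size() == static_cast<std::size_t>(4) * n);
    std::vector<float> y(static_cast<std::size_t>(4) * m, 0.0f);
    for (int c = 0; c < 4; ++c) {
        // Component c starts at offset c and advances by 4 floats per element.
        gemv_ref(m, n, A.data(), m, x.data() + c, 4, y.data() + c, 4);
    }
    return y;
}
```

Calling `gemv_float4(2, 2, A, x)` with the column-major matrix `A = {1, 3, 2, 4}` and `x = {1, 2, 3, 4, 5, 6, 7, 8}` (two float4 elements) gives one float4 per matrix row. On the GPU the same four calls would presumably go to a strided sgemv on the float4 pointer reinterpreted as a float pointer.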
In terms of performance, our GEMV routine reaches the theoretical peak, which is bandwidth/2 for single-precision sgemv and bandwidth/4 for double-precision dgemv.
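The bandwidth/2 and bandwidth/4 figures follow from a simple count: an m x n GEMV performs about 2mn flops while reading mn matrix elements, so with a bandwidth of B bytes per second and s bytes per element the flop rate is bounded by roughly 2B/s. The sketch below only encodes that arithmetic; `gemv_peak_gflops` is a hypothetical helper, and it assumes the traffic for x and y is negligible next to the matrix.

```cpp
#include <cassert>
#include <cstddef>

// Bandwidth-bound GEMV peak in Gflop/s, given memory bandwidth in GB/s.
// Assumes 2 flops (one multiply, one add) per matrix element read and
// ignores the much smaller traffic for the vectors x and y.
double gemv_peak_gflops(double bandwidth_gb_per_s, std::size_t bytes_per_element) {
    assert(bytes_per_element > 0);
    return 2.0 * bandwidth_gb_per_s / static_cast<double>(bytes_per_element);
}
```

For an illustrative (made-up) bandwidth of 480 GB/s, `gemv_peak_gflops(480.0, sizeof(float))` is 480/2 and `gemv_peak_gflops(480.0, sizeof(double))` is 480/4, matching the two ratios quoted above.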
Thanks
Azzam

Re: Batched GEMV with float4

Posted: Tue Jul 04, 2017 6:26 am
by Genji
To clarify, do I break up my float4 into 4 gemvs, or use gemm with a 4-column matrix?

Re: Batched GEMV with float4

Posted: Tue Aug 01, 2017 9:13 pm
by haidar
I think both should provide similar performance, since a gemm with 4 columns will look like 4 gemvs.
This is a memory-bound operation, so its performance will behave like gemv performance.
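The equivalence between the two options can be checked on the CPU with a small sketch. It is a plain reference, not a library call: `four_gemvs` and `one_gemm` are hypothetical names, A is assumed column-major m x n, and the float4 vector is assumed to be interleaved floats, so that it can also be read as a column-major 4 x n matrix Xt (the transpose of the 4-column matrix X). The GEMM form then computes Yt = Xt * A^T, whose memory layout is again an interleaved float4 vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Option 1: four strided GEMVs, one per float4 component c.
std::vector<float> four_gemvs(int m, int n, const std::vector<float>& A,
                              const std::vector<float>& x) {
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(x.size() == static_cast<std::size_t>(4) * n);
    std::vector<float> y(static_cast<std::size_t>(4) * m, 0.0f);
    for (int c = 0; c < 4; ++c)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                y[4 * i + c] += A[i + j * m] * x[4 * j + c];
    return y;
}

// Option 2: one GEMM, Yt (4 x m) = Xt (4 x n) * A^T (n x m), all column-major.
std::vector<float> one_gemm(int m, int n, const std::vector<float>& A,
                            const std::vector<float>& x) {
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(x.size() == static_cast<std::size_t>(4) * n);
    std::vector<float> yt(static_cast<std::size_t>(4) * m, 0.0f);
    for (int i = 0; i < m; ++i) {          // column i of Yt
        for (int j = 0; j < n; ++j) {      // inner dimension
            const float a = A[i + j * m];  // A^T(j, i) == A(i, j), read once here
            for (int c = 0; c < 4; ++c)    // reused for all four components
                yt[c + 4 * i] += x[c + 4 * j] * a;
        }
    }
    return yt;
}
```

Both functions accumulate each output entry in the same order, so for the same inputs `four_gemvs(m, n, A, x) == one_gemm(m, n, A, x)` holds exactly; the only difference in this sketch is how often each entry of A is read.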
Azzam