Hi,
I have been using the dgemv function on a machine with two Tesla 1070 cards.
I have a matrix of dimension d x d.
In my first call I use only one Tesla card and call the function like this:
magmablas_dgemv('N',d,d,1.0,matrix,d,vector_in,1,0.0,vector_out,1);
This works fine.
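For context, the surrounding host code looks roughly like this (only a sketch of my setup; the allocation and copy calls are a summary rather than my exact code, and the magmablas_dgemv call is the one quoted above):

#include <cuda_runtime.h>
#include "magma.h"

/* rough sketch of the single-GPU path: allocate device memory,
   copy the inputs over, run dgemv, copy the result back */
void full_gemv(int d, const double *h_A, const double *h_x, double *h_y)
{
    double *matrix, *vector_in, *vector_out;

    cudaMalloc((void**)&matrix,     (size_t)d * d * sizeof(double));
    cudaMalloc((void**)&vector_in,  (size_t)d * sizeof(double));
    cudaMalloc((void**)&vector_out, (size_t)d * sizeof(double));

    cudaMemcpy(matrix,    h_A, (size_t)d * d * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(vector_in, h_x, (size_t)d * sizeof(double),     cudaMemcpyHostToDevice);

    /* y = 1.0 * op(A) * x + 0.0 * y on the full d x d matrix */
    magmablas_dgemv('N', d, d, 1.0, matrix, d, vector_in, 1, 0.0, vector_out, 1);

    cudaMemcpy(h_y, vector_out, (size_t)d * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(matrix);
    cudaFree(vector_in);
    cudaFree(vector_out);
}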
After this I launch two threads, each of which sets the device to 0 or 1 and runs the same routine, but now the matrix is divided into two parts and so is the vector.
So now the matrix is (d/2 x d) and the vector is of length d/2, so on each GPU thread my call becomes:
magmablas_dgemv('N',d/2,d,1.0,matrixhalf,d,vector_in_half,1,0.0,vector_out_half,1);
Both threads are passed different halves of the vector and the matrix.
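The threaded part looks roughly like this (a sketch only; gemv_job and thread_func are placeholder names I use here for illustration, and the device buffers are assumed to have already been allocated and filled on the corresponding GPU, which is not shown):

#include <pthread.h>
#include <cuda_runtime.h>
#include "magma.h"

typedef struct {
    int device;              /* 0 or 1 */
    int d;                   /* full matrix dimension */
    double *matrixhalf;      /* device pointer to this GPU's (d/2 x d) block */
    double *vector_in_half;  /* device pointer to this GPU's half of the input vector */
    double *vector_out_half; /* device pointer for this GPU's half of the result */
} gemv_job;

static void *thread_func(void *arg)
{
    gemv_job *job = (gemv_job*)arg;

    /* bind this host thread to GPU 0 or GPU 1 before the MAGMA call */
    cudaSetDevice(job->device);

    /* same call as quoted above, on the half-sized problem */
    magmablas_dgemv('N', job->d / 2, job->d, 1.0,
                    job->matrixhalf, job->d,
                    job->vector_in_half, 1, 0.0,
                    job->vector_out_half, 1);

    return NULL;
}

/* the two threads are started as:
   pthread_create(&tid[i], NULL, thread_func, &job[i]) for i = 0, 1 */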
The matrix is stored in row-major format, so the result vectors from the two GPU calls must be concatenated to obtain the same vector that the original call (complete matrix and complete vector) produces.
The problem is that with the split calls on separate GPUs the final vector has different values. The difference between corresponding indices of the final vector keeps increasing: if the original (correct) vector has values ranging from 1 to 256, then the error at index(1) < index(2) < ... < index(256).
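This is roughly how I compare the two results on the host (illustrative code only; v_full stands for the single-GPU result and v_half0/v_half1 for the two per-GPU halves):

#include <math.h>
#include <stdio.h>

/* compare the concatenated per-GPU results against the single-GPU result
   and report the largest absolute difference and where it occurs */
void compare_results(int d, const double *v_full,
                     const double *v_half0, const double *v_half1)
{
    double max_err = 0.0;
    int max_idx = -1;

    for (int i = 0; i < d; ++i) {
        double split = (i < d / 2) ? v_half0[i] : v_half1[i - d / 2];
        double err = fabs(v_full[i] - split);
        if (err > max_err) { max_err = err; max_idx = i; }
    }
    printf("max |full - concatenated| = %g at index %d\n", max_err, max_idx);
}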
I cannot understand why. I have checked the matrix and the vector passed to the magmablas_dgemv function: I stored them in a file, imported them into MATLAB, concatenated the matrix halves and the vector halves there, and the result is the same as in the original case. However, from MAGMA it is not correct.
Am I doing something wrong in my implementation, or is this a known error?
My MAGMA version is 1.0.0-rc5.
CUDA is 3.2.
Kindly help. Thanks in advance,
rohit