by fletchjp » Thu Feb 10, 2011 8:37 pm
Stan
I have just started some testing on the zgesvd example. It looks as though quite a lot of work is needed: it does not seem any faster on the GPU than on the CPU, and something is keeping it mostly on a single thread on the CPU for much of the run.
I am running single cases, because a whole run crashes with a seg fault, and the larger cases are in any event very slow.
I will report some numbers.
John
I have now done some testing. For square matrices, each doubling of the dimension takes about 10 times as long (roughly consistent with the O(N^3) cost of the SVD, which predicts a factor of 8), and there is little difference between the CPU and GPU times.
Square matrices (times in seconds):

M=N        CPU       GPU
 512      2.35      2.44
1000     18.40     18.49
1024     23.65     23.62
2048    257.25    254.97
4096   2858.04   2832.85
I have also done some work on rectangular matrices. I changed the JOBU and JOBVT arguments in the calls from 'A','A' to 'S','S' so that only the relevant rectangles are computed (the first min(M,N) columns of U and rows of V**H rather than the full square factors). There are some cases where the supplied work size is smaller than the maxwork value the routine returns, and increasing it makes a big difference to the timings. In most cases there is little difference between the CPU and GPU timings. However, in one case, M = 1000 and N = 1024, the CPU run gives a result but the GPU run fails with INFO=999.
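For reference, this is the calling pattern I mean, in a minimal sketch written against the plain LAPACK zgesvd interface (magma_zgesvd takes the same argument list as far as I can tell; the function name svd_economy is just for illustration):

    // A minimal sketch of the two-stage call, against the plain LAPACK
    // zgesvd interface; magma_zgesvd mirrors it as far as I can tell.
    // jobu = jobvt = "S" computes only the first min(M,N) columns of U
    // and rows of V**H, so U is M x min(M,N) and VT is min(M,N) x N
    // instead of the full M x M and N x N factors.
    #include <algorithm>
    #include <complex>
    #include <vector>

    extern "C" void zgesvd_(const char* jobu, const char* jobvt,
                            const int* m, const int* n,
                            std::complex<double>* a, const int* lda,
                            double* s,
                            std::complex<double>* u, const int* ldu,
                            std::complex<double>* vt, const int* ldvt,
                            std::complex<double>* work, const int* lwork,
                            double* rwork, int* info);

    void svd_economy(int m, int n, std::complex<double>* A, int lda, double* s)
    {
        int minmn = std::min(m, n), info = 0, lwork = -1;
        std::vector<std::complex<double> > U((size_t) m * minmn);
        std::vector<std::complex<double> > VT((size_t) minmn * n);
        std::vector<double> rwork(5 * (size_t) minmn);  // zgesvd needs 5*min(M,N)

        // Stage 1: workspace query.  lwork = -1 makes the routine return
        // its optimal workspace size in work[0] without factorizing.
        std::complex<double> query;
        zgesvd_("S", "S", &m, &n, A, &lda, s, &U[0], &m, &VT[0], &minmn,
                &query, &lwork, &rwork[0], &info);

        // Stage 2: allocate at least the reported optimum and factorize.
        // Allocating less can still run, but is what produced the very
        // slow rectangular timings reported below.
        lwork = (int) query.real();
        std::vector<std::complex<double> > work((size_t) lwork);
        zgesvd_("S", "S", &m, &n, A, &lda, s, &U[0], &m, &VT[0], &minmn,
                &work[0], &lwork, &rwork[0], &info);
        // info == 0 on success; the failing GPU case returns info = 999.
    }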
Experiments show that the failing case takes the branch labelled Path 10t (N greater than M, but not much larger) in zgesvd.cpp.
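As I read it, the path selection amounts to the following (a paraphrase, not verbatim from zgesvd.cpp; mnthr is the ILAENV crossover, which I believe is about 1.6 * min(M,N), and the helper name is mine):

    #include <algorithm>

    // Rough paraphrase of the LAPACK-style path selection for N > M;
    // not verbatim from zgesvd.cpp.
    bool takes_path_10t(int m, int n)
    {
        int mnthr = (int) (1.6 * std::min(m, n));
        // Paths 1t-9t (LQ-factor A first) require N >= mnthr; below the
        // crossover the matrix is bidiagonalized directly, i.e. Path 10t.
        return n > m && n < mnthr;
    }

    // takes_path_10t(1000, 1024) is true: mnthr = 1600 and 1024 < 1600,
    // which matches the failing case.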
Maximum worksize reported (GPU):

M \ N      512     1000     1024     2048     4096     8192    10240
  512    33792   295936   295936   295936   295936   295936   295936
 1000   295936    66000    66768  1066000  1066000  1066000  1066000
 1024   295936    66768    67584  1116160  1116160  1116160
 2048   295936  1066000  1116160   135168  4329472
 4096   295936  1066000  4329472
 8192   295936  1066000
10240   295936
Note that the square and near-square workspace sizes reported (M = N, and the 1000/1024 pairs) are much smaller than the rest; elsewhere the size depends only on the smaller dimension. When the smaller dimension is 512 the cases remain very quick. I may only be reporting things already known about the CPU algorithm, but I hope it helps you understand what is going on.
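For comparison, the documented LAPACK minimum workspace for complex zgesvd is only max(1, 2*min(M,N) + max(M,N)), far below the optimal sizes in the table above; for example:

    #include <algorithm>
    #include <cstdio>

    // Documented LAPACK minimum lwork for complex zgesvd:
    // max(1, 2*min(M,N) + max(M,N)).  The optimal sizes the routine
    // reports (tabulated above) are far larger for the rectangular
    // cases, which is why padding the workspace matters so much.
    long lwork_min(long m, long n)
    {
        return std::max(1L, 2 * std::min(m, n) + std::max(m, n));
    }

    int main()
    {
        // Prints 8192, versus the reported optimum of 4329472
        // for the 2048 x 4096 case.
        std::printf("%ld\n", lwork_min(2048, 4096));
        return 0;
    }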
GPU times in seconds:

M \ N     512    1000    1024    2048    4096    8192   10240
  512    2.57    3.57    3.65    4.26    5.14    7.12    7.99
 1000    3.49   18.44   (999)   25.53   29.58   34.97   40.19
 1024    3.50   19.77   23.73   60.27   70.11   79.85   81.19
 2048    3.94   27.48   38.56  255.82  914.03
 4096    4.73   30.42   42.89  483.05
 8192    5.91   35.17   44.59
10240    6.77   36.35   47.82

((999) marks the M = 1000, N = 1024 case that failed with INFO=999.)
Note some lack of symmetry: N > M is slower than N < M for the same pair of dimensions. The 2048 by 4096 value may come down, as it needed more workspace; indeed it does, from 914.03 to 523.02 seconds once the workspace is large enough.
The failing case has N > M by only a small amount. I have not yet tried other similar cases.
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory