While running the supplied 'testing_cgetrf_gpu' test, I observed that a parallel CPU build achieves higher GFlop/s than the GPU does. Part of the result tables follows:
a) Sequential CPU vs GPU
   N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
1024       13.74          36.66       4.693558e-09
2048       17.67          50.95       4.678659e-09
3072       18.55          54.45       4.583328e-09
4032       19.20          58.67       4.651947e-09
b) Parallel CPU vs GPU
   N    CPU GFlop/s    GPU GFlop/s    ||PA-LU|| / (||A||*N)
==========================================================
1024       45.28          38.50       4.691306e-09
2048      116.63          57.24       4.679362e-09
3072      128.99          62.58       4.616083e-09
4032      131.83          64.67       4.631509e-09
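To put a number on the gap, here is a small Python sketch that computes the CPU/GPU throughput ratio for each row of the two tables. The helper name `cpu_over_gpu` and the list layout are my own for illustration and are not part of MAGMA; the figures are copied from the tables above.

```python
# Rows copied from the two tables above: (N, CPU GFlop/s, GPU GFlop/s).
sequential = [
    (1024, 13.74, 36.66),
    (2048, 17.67, 50.95),
    (3072, 18.55, 54.45),
    (4032, 19.20, 58.67),
]
parallel = [
    (1024, 45.28, 38.50),
    (2048, 116.63, 57.24),
    (3072, 128.99, 62.58),
    (4032, 131.83, 64.67),
]


def cpu_over_gpu(rows):
    """Return {N: CPU GFlop/s divided by GPU GFlop/s} for each row."""
    return {n: cpu / gpu for n, cpu, gpu in rows}


seq_ratio = cpu_over_gpu(sequential)
par_ratio = cpu_over_gpu(parallel)

# A ratio below 1 means the GPU is faster; above 1 means the CPU is faster.
for n in seq_ratio:
    print(f"N={n}: sequential CPU/GPU = {seq_ratio[n]:.2f}, "
          f"parallel CPU/GPU = {par_ratio[n]:.2f}")
```

With the sequential build the ratio stays below 1 at every size, while with the parallel build it is above 1 at every size and reaches roughly 2 at N = 4032.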
Is this the expected behaviour, or am I doing something wrong?
