low performance running mixed precision lu factorization

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
Posts: 6
Joined: Thu Nov 28, 2019 6:14 am

Re: low performance running mixed precision lu factorization

Post by shengyushen » Wed Dec 18, 2019 5:34 am

Yes, that is the point that confuses me the most.
It seems the GPU has enough functionality and performance to do the whole job, so why do you still hand part of the work over to the CPU?
That CPU-side code may incur significant overhead on both the PCIe bus and the CPU.

Posts: 22
Joined: Fri Sep 19, 2014 3:43 pm

Re: low performance running mixed precision lu factorization

Post by haidar » Fri Dec 20, 2019 11:30 pm

When you say
"I tries the same code on an AWS instance with 1GPU and 4 cores. it is still about 8TFlops
BUT when I turn to another instance with 4GPU and 16 cores, it reach 17TFLOPs!!!!!!"
it is not the number of GPUs that matters here but rather the number of CPU cores, since the iterative refinement code uses a single GPU and does not use multiple GPUs. So if you set OMP_NUM_THREADS=16 you should get the 17 Tflop/s.
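A minimal sketch of how one might pin the OpenMP thread count before launching a run (the tester name `testing_dsgesv_gpu` is a hypothetical placeholder; substitute whatever MAGMA driver or application you actually run):

```shell
# Give MAGMA's CPU-side panel work 16 OpenMP threads before launching.
export OMP_NUM_THREADS=16
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# ./testing_dsgesv_gpu -n 40000   # hypothetical tester name; requires MAGMA + GPU
```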

Now, on your machine I still believe you should be able to reach 17 Tflop/s as well, since you have a recent CPU (Skylake) with 28 cores.
In my paper I tested MAGMA using 16 Haswell cores (so older cores than the ones you have) and was able to reach 24 Tflop/s, so something in your system configuration must be limiting performance.
I also suggest trying the same functionality with the cuSolver routines cusolverDnDHgesv_bufferSize and cusolverDnDHgesv, or the expert API cusolverDnXgesv, from CUDA 10.2, which is GPU-only code and does not depend on the CPU.
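For reference, a hedged sketch of that cuSolver call sequence (FP16 LU factorization with iterative refinement back to FP64). It requires a GPU and the CUDA 10.2+ toolkit, so treat it as an outline rather than a drop-in program: allocation and filling of the device arrays dA, dB, dX, dipiv, plus all error checking, are omitted.

```c
#include <cuda_runtime.h>
#include <cusolverDn.h>

/* Sketch: solve A*X = B to FP64 accuracy via FP16 LU + iterative refinement.
   dA, dB, dX, dipiv are device pointers the caller must allocate and fill. */
void solve_with_dhgesv(int n, int nrhs,
                       double *dA, int ldda, int *dipiv,
                       double *dB, int lddb, double *dX, int lddx)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    /* Query the workspace size needed by the mixed-precision solver. */
    size_t lwork_bytes = 0;
    cusolverDnDHgesv_bufferSize(handle, n, nrhs, dA, ldda, dipiv,
                                dB, lddb, dX, lddx, NULL, &lwork_bytes);

    void *d_work = NULL;
    cudaMalloc(&d_work, lwork_bytes);

    int niter = 0;   /* number of refinement iterations performed */
    int *d_info;     /* device-side status flag */
    cudaMalloc((void **)&d_info, sizeof(int));

    /* Entire factorization and refinement run on the GPU. */
    cusolverDnDHgesv(handle, n, nrhs, dA, ldda, dipiv,
                     dB, lddb, dX, lddx, d_work, lwork_bytes,
                     &niter, d_info);

    cudaFree(d_work);
    cudaFree(d_info);
    cusolverDnDestroy(handle);
}
```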

Post Reply