Search found 22 matches

by haidar
Fri Dec 20, 2019 11:30 pm
Forum: User discussion
Topic: low performance running mixed precision lu factorization
Replies: 11
Views: 296

Re: low performance running mixed precision lu factorization

When you say "I tries the same code on an AWS instance with 1GPU and 4 cores. it is still about 8TFlops BUT when I turn to another instance with 4GPU and 16 cores, it reach 17TFLOPs!!!!!!", it is not the number of GPUs that matters here but rather the number of cores, since the iterative refineme...
by haidar
Fri Dec 13, 2019 12:22 am
Forum: User discussion
Topic: low performance running mixed precision lu factorization
Replies: 11
Views: 296

Re: low performance running mixed precision lu factorization

First of all, there seem to be several issues at play. The goal of the Tensor Cores Accelerated Iterative Refinement Solver (TCAIRS) in both MAGMA and cuSOLVER is to provide around a 4X speedup over the basic double-precision solver (dgesv), and this holds for your run as well (2 Tflop/s versus 8 Tflop/s). Note also that ...
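For reference, below is a minimal sketch of how this solver family is called from C. It uses magma_dsgesv_gpu (the single/double iterative refinement path, whose interface follows LAPACK's dsgesv); the workspace sizes and the matrix setup are my own illustrative assumptions, so double-check the MAGMA headers and testers, which also contain the FP16 tensor-core variant.

/* Minimal sketch (not a tuned benchmark): solve A x = b in double precision
 * with MAGMA's mixed-precision iterative refinement solver magma_dsgesv_gpu.
 * The tensor-core (FP16) TCAIRS variant has a similar calling sequence;
 * check the MAGMA headers/testers for its exact name and arguments. */
#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main(void)
{
    magma_init();
    magma_queue_t queue;
    magma_queue_create(0, &queue);                /* device 0 */

    magma_int_t n = 10000, nrhs = 1, lda = n, ldda = magma_roundup(n, 32);
    magma_int_t *ipiv, iter, info;
    double *hA, *hB, *hX;
    magmaDouble_ptr dA, dB, dX, dworkd;
    magmaFloat_ptr  dworks;
    magmaInt_ptr    dipiv;

    /* host data: random, diagonally dominant A and b = 1 (illustrative only) */
    magma_dmalloc_cpu(&hA, lda*n);
    magma_dmalloc_cpu(&hB, n*nrhs);
    magma_dmalloc_cpu(&hX, n*nrhs);
    ipiv = malloc(n * sizeof(magma_int_t));
    for (magma_int_t i = 0; i < lda*n; ++i) hA[i] = rand() / (double)RAND_MAX;
    for (magma_int_t i = 0; i < n;     ++i) { hA[i + i*lda] += n; hB[i] = 1.0; }

    /* device buffers; workspace sizes follow the LAPACK dsgesv convention */
    magma_dmalloc(&dA, ldda*n);
    magma_dmalloc(&dB, ldda*nrhs);
    magma_dmalloc(&dX, ldda*nrhs);
    magma_dmalloc(&dworkd, n*nrhs);
    magma_smalloc(&dworks, n*(n + nrhs));
    magma_imalloc(&dipiv, n);

    magma_dsetmatrix(n, n,    hA, lda, dA, ldda, queue);
    magma_dsetmatrix(n, nrhs, hB, n,   dB, ldda, queue);

    magma_dsgesv_gpu(MagmaNoTrans, n, nrhs,
                     dA, ldda, ipiv, dipiv,
                     dB, ldda, dX, ldda,
                     dworkd, dworks, &iter, &info);

    printf("info = %lld, refinement iterations = %lld (negative => fell back to FP64)\n",
           (long long) info, (long long) iter);

    magma_dgetmatrix(n, nrhs, dX, ldda, hX, n, queue);

    magma_free(dA); magma_free(dB); magma_free(dX);
    magma_free(dworkd); magma_free(dworks); magma_free(dipiv);
    magma_free_cpu(hA); magma_free_cpu(hB); magma_free_cpu(hX);
    free(ipiv);
    magma_queue_destroy(queue);
    magma_finalize();
    return 0;
}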
by haidar
Wed Dec 04, 2019 12:12 am
Forum: User discussion
Topic: low performance running mixed precision lu factorization
Replies: 11
Views: 296

Re: low performance running mixed precision lu factorization

For this size you should easily be able to get above 20 Tflop/s on your machine. It might be that the CPU binding is the issue: I can see 56 threads, but I believe you have a 2x14 = 28-core Intel machine, so it is better to export OMP_NUM_THREADS=28 (or even 14) and try again. You can also look at the...
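If it helps, here is a tiny standalone check (my own toy program, not part of MAGMA) to confirm how many OpenMP threads the CPU side will actually pick up after you export OMP_NUM_THREADS:

/* Toy sanity check: print the OpenMP thread count the CPU side of the
 * solver will see.  Build with: gcc -fopenmp check_threads.c
 * and run after exporting OMP_NUM_THREADS=28 (or 14). */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        #pragma omp single
        printf("parallel region actually spawned %d threads\n", omp_get_num_threads());
    }
    return 0;
}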
by haidar
Fri Dec 08, 2017 11:47 am
Forum: User discussion
Topic: Sequential SVD computation for Big Data using MAGMA
Replies: 2
Views: 2694

Re: Sequential SVD computation for Big Data using MAGMA

Dear B-C, The paper you refer to describes experimental code for multicore CPUs; it does not use the GPU. Since your matrix is square, you might look at this paper: https://link.springer.com/chapter/10.1007/978-3-319-58667-0_9 The paper includes a formula to estimate the expected time for the computation, so you can fir...
by haidar
Tue Aug 15, 2017 11:04 am
Forum: User discussion
Topic: Is autotuning a offline process?
Replies: 1
Views: 1764

Re: Is autotuning a offline process?

Hi,
The current autotuning process is performed offline, since it is done once per GPU type.
Thus we generate all acceptable kernel configurations, run them, analyze the performance, and choose the best for the target architecture. A toy sketch of that pattern follows below.
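The sketch below is only an illustration of the offline-sweep idea (enumerate candidate configurations, time each, keep the fastest), not the actual MAGMA tuner; the "configuration" here is just a blocking size for a simple triad loop.

/* Toy offline autotuning sweep: enumerate candidate configurations,
 * time each one, and keep the best.  NOT the MAGMA tuner, only the pattern.
 * Build with: gcc -fopenmp sweep.c  (omp.h is used only for the timer). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)

static double run_candidate(double *x, double *y, int block)
{
    double t = omp_get_wtime();
    for (int ib = 0; ib < N; ib += block)
        for (int i = ib; i < ib + block && i < N; ++i)
            y[i] = 2.0 * x[i] + y[i];
    return omp_get_wtime() - t;
}

int main(void)
{
    double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    int candidates[] = { 64, 128, 256, 512, 1024 };
    int best = -1; double best_t = 1e30;
    for (int c = 0; c < 5; ++c) {
        double t = run_candidate(x, y, candidates[c]);
        printf("block %4d : %.4f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best = candidates[c]; }
    }
    printf("best configuration for this machine: block = %d\n", best);
    free(x); free(y);
    return 0;
}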
Azzam
by haidar
Tue Aug 01, 2017 9:13 pm
Forum: User discussion
Topic: Batched GEMV with float4
Replies: 3
Views: 1418

Re: Batched GEMV with float4

I think both should provide similar performance, since a GEMM with 4 columns will look like 4 GEMVs.
This is considered a memory-bound operation, and its performance will behave like GEMV performance.
Azzam
by haidar
Thu Jul 20, 2017 11:10 am
Forum: User discussion
Topic: Multiple queues and sgemv_batched
Replies: 2
Views: 1157

Re: Multiple queues and sgemv_batched

Hi, if you create different queues and launch a different sgemv_batched on each, you are telling the GPU that whenever it has a slot available for work, it can launch work from queue 2, 3, 4, etc. Now two questions: 1- if you dispatch them over 9 queues that can run in parallel, why didn't you make ...
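For what it's worth, the queue mechanics look roughly like the sketch below. It is shown with plain magma_sgemv for brevity (the batched call takes the same trailing queue argument), the sizes are arbitrary, and whether the launches actually overlap depends on how much of the GPU each one occupies.

/* Sketch: dispatch independent GEMVs over several MAGMA queues (CUDA streams). */
#include <stdio.h>
#include "magma_v2.h"

#define NQUEUES 4

int main(void)
{
    magma_init();
    magma_queue_t queues[NQUEUES];
    for (int q = 0; q < NQUEUES; ++q)
        magma_queue_create(0, &queues[q]);       /* device 0 */

    magma_int_t m = 2048, n = 2048, ldda = magma_roundup(m, 32);
    magmaFloat_ptr dA[NQUEUES], dx[NQUEUES], dy[NQUEUES];
    for (int q = 0; q < NQUEUES; ++q) {
        magma_smalloc(&dA[q], ldda*n);
        magma_smalloc(&dx[q], n);
        magma_smalloc(&dy[q], m);
        /* device buffers left uninitialized; a real code would fill them */
    }

    /* each queue gets its own independent GEMV */
    for (int q = 0; q < NQUEUES; ++q)
        magma_sgemv(MagmaNoTrans, m, n, 1.0f,
                    dA[q], ldda, dx[q], 1, 0.0f, dy[q], 1, queues[q]);

    for (int q = 0; q < NQUEUES; ++q) {
        magma_queue_sync(queues[q]);
        magma_free(dA[q]); magma_free(dx[q]); magma_free(dy[q]);
        magma_queue_destroy(queues[q]);
    }
    magma_finalize();
    return 0;
}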
by haidar
Thu Jul 20, 2017 10:49 am
Forum: User discussion
Topic: Trouble Compiling Magma on an AMD Cray
Replies: 2
Views: 1156

Re: Trouble Compiling Magma on an AMD Cray

Ray,
Thank you very much for sharing the solution, that's interesting to know.
Azzam
by haidar
Thu Jul 20, 2017 10:48 am
Forum: User discussion
Topic: Toeplitz Matrix Batch
Replies: 2
Views: 1195

Re: Toeplitz Matrix Batch

Dear user, As of today we do not have any routine specifically suited to Toeplitz matrices, and we would be happy to help if you would like to contribute such a routine to the MAGMA library. However, I did not quite understand the batch: are you solving one system of equations Ax=b, meaning one ma...
by haidar
Mon Jul 03, 2017 10:52 pm
Forum: User discussion
Topic: Batched GEMV with float4
Replies: 3
Views: 1418

Re: Batched GEMV with float4

Can you please elaborate in more detail on what you want to do? Do you mean the float4 CUDA vector type? I think it might be easy to cast the type to float and use the single-precision gemv. In terms of performance, our GEMV routine reaches the theoretical peak, which is bandwidth/2 for single...
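(Back-of-envelope illustration of that bound: an n x n single-precision GEMV performs about 2*n^2 flops while reading roughly 4*n^2 bytes of A, so with memory bandwidth B bytes/s the roofline is 2*n^2 * B / (4*n^2) = B/2 flop/s, e.g. about 450 Gflop/s on a card with 900 GB/s of bandwidth; in double precision each element is 8 bytes, so the bound drops to B/4.)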