magma_dsytrd questions

Open discussion for MAGMA

Re: magma_dsytrd questions

Post by mateo70 » Tue Mar 08, 2011 8:13 pm

Hi,

I will check, because I haven't worked on this paper, but I think the performance difference you see in these routines is due to the cublasAlloc call that we added inside the function to make it user-friendly. Allocating the workspace once, outside the function, can change performance considerably. That is why the next version will add an interface that lets the user provide the workspace.
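
The difference between the two interface styles described above can be sketched on the host side. This is only an illustration under stated assumptions: the function names are hypothetical, std::vector stands in for a device allocation such as cublasAlloc, and the "kernel" is a trivial sum of squares rather than anything from MAGMA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Workspace interface (hypothetical): the caller provides scratch space,
// so repeated calls reuse one buffer and allocate nothing.
double sum_of_squares_ws(const std::vector<double>& x, std::vector<double>& work) {
    assert(work.size() >= x.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        work[i] = x[i] * x[i];   // scratch use, standing in for real kernel work
        s += work[i];
    }
    return s;
}

// Convenience interface (hypothetical): allocates its own scratch on every call.
// alloc_count is incremented once per call to make the cost visible.
double sum_of_squares_alloc(const std::vector<double>& x, int& alloc_count) {
    std::vector<double> work(x.size());   // stand-in for a per-call cublasAlloc
    ++alloc_count;
    return sum_of_squares_ws(x, work);
}
```

Calling the convenience form 100 times performs 100 allocations; the workspace form needs a single allocation made by the caller before the loop. On a GPU the per-call allocation is typically far more expensive than on the host, which is the effect discussed here.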

Mathieu

Re: magma_dsytrd questions

Post by mtacconi » Tue Mar 15, 2011 9:14 am

I don't believe it is all about the cublasAlloc. As far as I saw from my tests on magma_dsytrd, you can obtain about 40 GFlop/s (asymptotic) from a Tesla C2050 if you:
1- call magmablas_dsymv6_fermi instead of cublasDsymv
2- use magmablas_dsyr2k instead of cublasDsyr2k (you can obtain this routine with basically no effort from the single-precision version, magmablas_ssyr2k, included in magma1.0RC3 and RC4).

Consider that magma_dsytrd out of the box performs at around 20 GFlop/s... ;)
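
To put the two rates in terms of run time, here is a small sketch that converts between matrix size, time, and GFlop/s. It assumes the usual leading-order operation count for tridiagonal reduction, (4/3) n^3, and ignores lower-order terms; the function names are hypothetical and not part of MAGMA.

```cpp
#include <cassert>
#include <cmath>

// Leading-order flop count for reducing a symmetric n x n matrix
// to tridiagonal form: (4/3) n^3. Lower-order terms are ignored.
double sytrd_flops(double n) {
    return 4.0 / 3.0 * n * n * n;
}

// Achieved rate in GFlop/s for a run that took `seconds`.
double gflops(double n, double seconds) {
    assert(seconds > 0.0);
    return sytrd_flops(n) / seconds / 1e9;
}

// Predicted run time in seconds at a sustained rate given in GFlop/s.
double seconds_at_rate(double n, double gflops_rate) {
    assert(gflops_rate > 0.0);
    return sytrd_flops(n) / (gflops_rate * 1e9);
}
```

For n = 3000 the count is about 36 GFlop, so the reduction would take roughly 1.8 s at 20 GFlop/s and 0.9 s at 40 GFlop/s. Both rates are asymptotic figures, so smaller matrices will generally fall short of them.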

Edit: BTW, I should say that I also allocate all the device workspace outside magma_dsytrd. Allocating and deallocating device memory is particularly costly in magmablas_dsymv, which is called many times within the driver routine.

Mario

Re: magma_dsytrd questions

Post by mateo70 » Tue Mar 15, 2011 5:03 pm

Hi,

Now I see. These routines are among the functions we still need to port to all four precisions. Everything has been done in the src directory (LAPACK functions), but there is still a lot of work left in the magmablas directory, and most of it will probably be included by NVIDIA.
We will add defines for the correct functions in the xlatrd and x[sy|he]trd functions in the final release.
Thanks for the report.

Mathieu

Re: magma_dsytrd questions

Post by arom » Wed Jun 27, 2012 10:11 am

Hi!

I'm a little confused about the performance results on the first page.

In my application I replaced the LAPACK function with the MAGMA one:
Code:
#ifdef MAGMA
      CALL  MAGMAF_DSYTRD( UPLO, N, A, LDA, W, WORK(INDE), WORK(INDTAU),
     $             Z, LLWORK, IINFO )
#else
      CALL DSYTRD( UPLO, N, A, LDA, W, WORK(INDE), WORK(INDTAU),
     $             Z, LLWORK, IINFO )
#endif


UPLO='U', N=1895
In this case MAGMAF_DSYTRD() copies the data to the GPU, copies it back, and performs all calculations on the CPU only.

How is it possible to get a 10x speedup for such a matrix size?

Re: magma_dsytrd questions

Post by mgates3 » Wed Jun 27, 2012 12:04 pm

Which "first page" are you referring to?
Why do you think that MAGMAF_DSYTRD performs all calculations on the CPU only? It calls magma_dsytrd, which does a magma_dsymv (in latrd) and a magma_dsyr2k on the GPU.
-mark

Re: magma_dsytrd questions

Post by arom » Thu Jun 28, 2012 1:46 am

Hi!

I mean the first page of this topic (brom » Wed Feb 23, 2011 2:28 am).

If you look at the following code, you will find that the for loop is never entered if n (the matrix size) is less than 2048.
magma-1.2.0/src/dsytrd.cpp:209-223
Code:
    if (n < 2048)
        nx = n;
    else
        nx = 512;

    if (upper) {

        /* Copy the matrix to the GPU */
        magma_dsetmatrix( n, n, A(0, 0), lda, dA(0, 0), ldda );

        /*  Reduce the upper triangle of A.
            Columns 1:kk are handled by the unblocked method. */
        kk = n - (n - nx + nb - 1) / nb * nb;
        for (i = n - nb; i >= kk; i -= nb)
          {
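
The arithmetic behind this observation can be checked with a short sketch that reproduces only the index computation from the quoted upper-triangle branch. The block size nb is chosen internally by MAGMA and is not shown in the excerpt, so it is a plain parameter here; the function names are hypothetical.

```cpp
#include <cassert>

// Index kk from the quoted code: columns below kk are left to the
// unblocked method. nb is the block size (an assumed input here).
int unblocked_start(int n, int nb) {
    assert(n > 0 && nb > 0);
    int nx = (n < 2048) ? n : 512;            // crossover from the quoted code
    return n - (n - nx + nb - 1) / nb * nb;   // integer division, as in the source
}

// Number of passes through: for (i = n - nb; i >= kk; i -= nb)
int blocked_iterations(int n, int nb) {
    int kk = unblocked_start(n, nb);
    int count = 0;
    for (int i = n - nb; i >= kk; i -= nb) {
        ++count;
    }
    return count;
}
```

When n < 2048, nx equals n, so (n - nx + nb - 1) / nb is (nb - 1) / nb, which is 0 in integer arithmetic. That gives kk = n, and the loop condition n - nb >= n is false from the start, whatever the block size. For N=1895 the blocked loop therefore runs zero times. With n = 4096 and an assumed nb = 32, kk is 512 and the loop runs 112 times.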
