magma_dsytrd questions

Open discussion for MAGMA

Postby jeremiahpalmer » Mon Feb 07, 2011 3:35 pm

I have been quite pleased and impressed with MAGMA. Thanks for the great work!

While looking over the code for magma_dsytrd (src/dsytrd.cpp), I found a few things that I don't quite understand:

(1) In the comment section, line 103 mentions a workspace item called "DA", but this item is not listed in the argument list. Do I need to include a "DA" in my argument list for magma_dsytrd?

(2) Why do the "upper" and "lower" code segments operate differently? Perhaps the code is incomplete?

(3) It looks like magma_dlatrd, which is called by magma_dsytrd, calls some CPU BLAS functions. Does anyone know whether the performance benefit of doing these operations on the CPU is significant? (I know that most of the time is spent in the big symv and syr2k operations. Aside from those, is the saving from doing the little dgemvs on the CPU significant?)

Thanks!
-Jeremiah

Re: magma_dsytrd questions

Postby jeremiahpalmer » Mon Feb 14, 2011 12:28 pm

Does anyone know the answers to these questions? The MAGMA authors could answer them very quickly.

Re: magma_dsytrd questions

Postby fletchjp » Mon Feb 14, 2011 1:34 pm

I don't have specific answers - I am not part of the development team.

Arguments starting with D are usually for memory on the GPU. Functions that explicitly work on GPU memory usually have names ending in _gpu, and there are sometimes versions with and without that suffix. Sometimes the comments and names have been carried over from one to the other and not consistently corrected.
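
For example, the convention looks like this (just an illustration using the dgetrf pair, assuming the MAGMA 1.0-era prototypes; I have not checked this against the dsytrd sources):

Code: Select all
#include "magma.h"

void factor_both_ways(magma_int_t n, double *h_A, magma_int_t lda,
                      double *d_A, magma_int_t ldda, magma_int_t *ipiv)
{
    magma_int_t info;

    /* CPU interface: h_A resides in host memory */
    magma_dgetrf(n, n, h_A, lda, ipiv, &info);

    /* GPU interface (note the _gpu suffix): d_A resides in device memory,
       i.e. the "dA"/"DA" of the documentation */
    magma_dgetrf_gpu(n, n, d_A, ldda, ipiv, &info);
}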

John

Re: magma_dsytrd questions

Postby Stan Tomov » Mon Feb 14, 2011 5:58 pm

Related to question (1), DA should not be passed as an argument. This is a bug in the documentation that is now fixed. The usage is as in testing_dsytrd.cpp. We previously had an interface where the user provided the workspace, which we changed to internal allocation (and forgot to remove it from the argument description).
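
Roughly, a call without DA looks like this (a minimal sketch following the tester's conventions; check magma.h for the exact prototype in your version, e.g. whether lwork is passed by pointer):

Code: Select all
#include <stdlib.h>
#include "magma.h"

void example_dsytrd(magma_int_t n)
{
    char uplo = 'L';
    magma_int_t lda = n, info;
    magma_int_t nb    = magma_get_dsytrd_nb(n);
    magma_int_t lwork = n * nb;

    double *A    = (double*) malloc(lda * n * sizeof(double));
    double *d    = (double*) malloc(n       * sizeof(double));
    double *e    = (double*) malloc((n - 1) * sizeof(double));
    double *tau  = (double*) malloc((n - 1) * sizeof(double));
    double *work = (double*) malloc(lwork   * sizeof(double));

    /* fill A with symmetric data; only the lower triangle is referenced */
    for (magma_int_t j = 0; j < n; ++j)
        for (magma_int_t i = 0; i < n; ++i)
            A[i + j*lda] = (i == j) ? n : 1.0 / (1.0 + i + j);

    /* no "DA" argument: the device workspace is allocated internally */
    magma_dsytrd(uplo, n, A, lda, d, e, tau, work, &lwork, &info);

    free(A); free(d); free(e); free(tau); free(work);
}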

Related to (2), the "upper" case is not yet implemented to use GPU acceleration. It still has to be done, e.g., similarly to the lower case.

Finally, related to question (3), magma_dlatrd is the LAPACK dlatrd where the most time-consuming dsymvs (accounting for 50% of the flops of the entire dsytrd algorithm) are offloaded to the GPU. Note that the matrix always stays on the GPU and we only transfer vectors, i.e., for every O(n) of communication we still do O(n^2) of computation on the GPU. I cannot quantify how much slower the algorithm would be if everything were ported to the GPU, but I expect the difference would be significant.
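
The idea, as a rough sketch (not the actual magma_dlatrd code; legacy CUBLAS calls), is:

Code: Select all
#include <cublas.h>

/* The n x n matrix dA stays resident on the GPU. Per panel column we move
   only O(n) vector data over PCIe, while the O(n^2) dsymv runs on the GPU;
   the small dgemv-type updates are then done on the CPU inside dlatrd. */
void hybrid_symv_step(int n, const double *dA, int ldda,
                      const double *h_v, double *h_w,
                      double *d_v, double *d_w)
{
    /* O(n) host-to-device transfer of the Householder vector */
    cublasSetVector(n, sizeof(double), h_v, 1, d_v, 1);

    /* O(n^2) symmetric matrix-vector product on the GPU: w = A*v */
    cublasDsymv('L', n, 1.0, dA, ldda, d_v, 1, 0.0, d_w, 1);

    /* O(n) device-to-host transfer of the result */
    cublasGetVector(n, sizeof(double), d_w, 1, h_w, 1);
}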

Re: magma_dsytrd questions

Postby jeremiahpalmer » Wed Feb 16, 2011 12:26 pm

Thanks a lot!

Re: magma_dsytrd questions

Postby mtacconi » Tue Feb 22, 2011 11:38 am

Stan Tomov wrote: The usage is as in testing_dsytrd.cpp


I could not find this file in the testing directory.

However, I ran some tests on the dsytrd subroutine using the following code:

Code: Select all
/**
 * The testing_dgetrf source code has been used as a template.
 **/
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cublas.h>

// includes, project
#include "flops.h"
#include "magma.h"
#include "testings.h"

// Flops formula
#define PRECISION_d
#define FLOPS(n) (      FMULS_SYTRD(n) +      FADDS_SYTRD(n) )


/* ////////////////////////////////////////////////////////////////////////////
   -- Testing dsytrd
*/

int main( int argc, char** argv)
{
    TESTING_CUDA_INIT();

    TimeStruct       start, end;
    double           flops, gpu_perf, cpu_perf, error;
    double *h_A, *h_R;
    double *h_DA, *h_EA, *h_DR, *h_ER;
    double *h_TAUA, *h_TAUR;
    double *h_WORKA, *h_WORKR;
    double etime_cpu, etime_gpu;
    cublasStatus status;
    double *d_work;
    int LWORKA, LWORKR;
   
    char uplo='L';
    /* Matrix size */
    magma_int_t N = 0, n2, lda;
    magma_int_t size[10] = {1024,2048,3072,4032,5184,6016,7040,8064,9088,10112};

    magma_int_t i, info, min_mn, nb;
    magma_int_t ione     = 1;
    magma_int_t ISEED[4] = {0,0,0,1};

    if (argc != 1){
        for(i = 1; i < argc; i++){
            if (strcmp("-N", argv[i]) == 0)
                N = atoi(argv[++i]);
        }
        if (N > 0)
            printf("  testing_dsytrd -N %d\n\n", N);
        else {
            printf("\nUsage: \n");
            printf("  testing_dsytrd -N %d\n\n", 1024);
            exit(1);
        }
    }
    else {
        printf("\nUsage: \n");
        printf("  testing_dsytrd_gpu -N %d\n\n", 1024);
        N = size[9];
    }
   
    n2     = N * N;
    min_mn = N;
    nb     = magma_get_dsytrd_nb(N);

    /* Allocate host memory for the matrix */
    TESTING_MALLOC(    h_A, double, n2     );
    TESTING_HOSTALLOC( h_R, double, n2     );

    /* Allocate memory for the tridiagonal factorization result */
    /* Diagonal of the tridiagonal matrices */
    TESTING_MALLOC(h_DA, double, N);
    TESTING_MALLOC(h_DR, double, N);
    /* Off-diagonal */
    TESTING_MALLOC(h_EA, double, N-1);
    TESTING_MALLOC(h_ER, double, N-1);
    /* scalar factors */
    TESTING_MALLOC(h_TAUA, double, N-1);
    TESTING_MALLOC(h_TAUR, double, N-1);
    /* workspaces */
    LWORKA=N*nb;
    TESTING_MALLOC(h_WORKA, double, LWORKA);
    LWORKR=N*nb;
    TESTING_MALLOC(h_WORKR, double, LWORKR);


   
    printf("\n\n");
    printf("  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)\n");
    printf("============================================================\n");
    for(i=0; i<10; i++){
        if (argc == 1){
            N = size[i];
        }
        min_mn = N;
        lda    = N;
        n2     = lda*N;
        /* FLOPS()/1e6 divided by the elapsed time in ms below gives GFlop/s */
        flops  = FLOPS( (double)N ) / 1000000;

        /* Allocate memory on the GPU */
        status = cublasAlloc(N*lda+2*N*nb, sizeof(double), (void**)&d_work);
        if (status != CUBLAS_STATUS_SUCCESS) {
            fprintf (stderr, "!!!! device memory allocation error (magma_dsytrd)\n");
            return 0;
        }


        /* Initialize the matrices */
        dlarnv_( &ione, ISEED, &n2, h_A );
        dlacpy_( MagmaUpperLowerStr, &N, &N, h_A, &lda, h_R, &lda );

        /* =====================================================================
           Performs operation using LAPACK
           =================================================================== */
        start = get_current_time();
        dsytrd_(&uplo, &N, h_A, &lda, h_DA, h_EA, h_TAUA, h_WORKA, &LWORKA, &info);
        end = get_current_time();
        if (info < 0)
            printf("Argument %d of dsytrd had an illegal value.\n", -info);

        etime_cpu = GetTimerValue(start, end);
        cpu_perf = flops / etime_cpu;
     
        /* ====================================================================
           Performs operation using MAGMA
           =================================================================== */
        start = get_current_time();
        magma_dsytrd(uplo, N, h_R, lda, h_DR, h_ER, h_TAUR, h_WORKR, &LWORKR, d_work, &info);
        end = get_current_time();

        if (info < 0)
            printf("Argument %d of magma_dsytrd had an illegal value.\n", -info);
        etime_gpu = GetTimerValue(start, end);
        gpu_perf = flops / etime_gpu;

        cublasFree(d_work);

        /* =====================================================================
           Check the factorization (TODO)
           =================================================================== */
        error = 0.0;

        printf("%5d %5d  %6.2f         %6.2f     %6.2f      %6.2f     %e\n",
               N, N, cpu_perf,     etime_cpu, gpu_perf, etime_gpu, error);

        if (argc != 1)
            break;
    }

    /* Memory clean up */
    TESTING_FREE( h_A );
    TESTING_HOSTFREE( h_R );
    TESTING_FREE( h_DA );
    TESTING_FREE( h_DR );
    TESTING_FREE( h_EA );
    TESTING_FREE( h_ER );
    TESTING_FREE( h_TAUA );
    TESTING_FREE( h_TAUR );
    TESTING_FREE( h_WORKA );
    TESTING_FREE( h_WORKR );

    /* Shutdown */
    TESTING_CUDA_FINALIZE();
}


The measured performance of the above code (GFlop/s and elapsed time in ms) on a Tesla M2050 and a Xeon X5650 @ 2.67 GHz follows
(the multithreaded MKL 10.2 library was used throughout):
Code: Select all
MKL_NUM_THREADS=1
M     N   CPU GFlop/s  CPU etime      GPU GFlop/s   GPU etime
==============================================================
 1024  1024    2.55         563.33       7.27        197.51   
 2048  2048    6.89         1664.27      12.80       895.77   
 3072  3072    5.91         6546.29      15.97       2423.07   
 4032  4032    5.77         15163.73     17.20       5084.59   
 5184  5184    6.16         30185.97     19.04       9761.71   
 6016  6016    6.32         45922.65     19.64       14787.11 

MKL_NUM_THREADS=2
M     N   CPU GFlop/s  CPU etime      GPU GFlop/s   GPU etime
============================================================
 1024  1024    7.18         199.84       10.58         135.64 
 2048  2048   14.06         815.36       13.51         848.70 
 3072  3072   13.35         2897.68      16.22        2384.71 
 4032  4032   12.32         7100.76      17.05        5130.04 
 5184  5184   12.17         15275.06     18.87        9847.06 
 6016  6016   12.29         23641.25     19.61       14808.18 
 8064  8064   11.04         63346.21     19.69       35525.21 

MKL_NUM_THREADS=4
M     N   CPU GFlop/s  CPU etime      GPU GFlop/s   GPU etime
============================================================
 1024  1024   12.23         117.38       13.30      107.96     
 2048  2048   25.61         447.73       13.08      876.94     
 3072  3072   23.21         1666.62      16.54      2339.50   
 4032  4032   19.07         4586.34      17.15      5100.60   
 5184  5184   16.84         11038.70     19.11      9725.92   
 6016  6016   16.11         18023.51     19.70      14741.81   
 7040  7040   15.52         29980.80     20.14      23107.40   

MKL_NUM_THREADS=6
M     N   CPU GFlop/s  CPU etime      GPU GFlop/s   GPU etime
============================================================
 1024  1024   13.50         106.30       18.53       77.47     
 2048  2048   30.59         374.86       13.54      847.26     
 3072  3072   28.96         1335.93      16.27      2377.54   
 4032  4032   22.31         3920.85      16.98      5148.98   
 5184  5184   19.73         9418.17      18.76      9904.61   
 6016  6016   19.07         15231.36     19.58      14831.17   
 7040  7040   18.26         25488.31     20.08      23171.77


As you can see, I was not able to reproduce the speedup reported in the paper "Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing" (that paper shows an asymptotic 80 GFlop/s for GPU/MAGMA in Fig. 6, with the GPU consistently outperforming the CPU/MKL).
Clearly I am doing something wrong. Any suggestions?
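
As a sanity check that my GFlop/s numbers are at least computed consistently with the elapsed times (using the usual 4/3*n^3 flop count for the tridiagonal reduction), e.g. for N=7040 with 4 MKL threads:

Code: Select all
#include <stdio.h>

int main(void)
{
    double n      = 7040.0;
    double gpu_ms = 23107.40;                 /* GPU elapsed time from the table */
    double flops  = (4.0 / 3.0) * n * n * n;  /* approximate flop count of dsytrd */
    printf("GPU: %.2f GFlop/s\n", flops / (gpu_ms * 1e6));  /* prints ~20.1 */
    return 0;
}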

edit: Actually, the 80 GFlop/s refers to the single-precision tridiagonalization routine, so it is possible that the double-precision version has much lower performance. Still, it seems rather odd to me that the multi-threaded CPU dsytrd routine can match (if not outperform...) the GPU performance of magma_dsytrd.

Re: magma_dsytrd questions

Postby brom » Tue Feb 22, 2011 6:28 pm

I'm seeing similar results.

I had to edit your example, though, removing the d_work parameter.

Code: Select all
magma_dsytrd(uplo, N, h_R, lda, h_DR, h_ER, h_TAUR, h_WORKR, &LWORKR, d_work, &info);

to
Code: Select all
magma_dsytrd(uplo, N, h_R, lda, h_DR, h_ER, h_TAUR, h_WORKR, &LWORKR, &info);


DSYTRD Results:
Code: Select all
  M     N   CPU GFlop/s  CPU time (ms)  GPU GFlop/s  GPU time (ms)    error
============================================================================
 1024  1024   11.27         127.32      10.04      143.03     0.000000e+00
 2048  2048   10.79         1062.71      15.11      759.06     0.000000e+00
 3072  3072   11.86         3263.32      18.90      2046.59     0.000000e+00
 4032  4032   12.18         7179.38      19.28      4535.24     0.000000e+00
 5184  5184   11.26         16511.49      20.51      9059.73     0.000000e+00
 6016  6016   10.66         27242.04      21.15      13733.36     0.000000e+00
 7040  7040   10.47         44438.88      21.77      21373.30     0.000000e+00


I also made a single-precision version, converting all the doubles to floats and running ssytrd instead. The results still don't come close to the paper!

SSYTRD
Code: Select all
  M     N   CPU GFlop/s  CPU time (ms)  GPU GFlop/s  GPU time (ms)    error
============================================================================
 1024  1024   11.17         128.53      16.50       86.99     0.000000e+00
 2048  2048   22.20         516.55      24.00      477.73     0.000000e+00
 3072  3072   21.79         1775.45      30.58      1265.00     0.000000e+00
 4032  4032   25.08         3487.32      33.14      2638.83     0.000000e+00
 5184  5184   22.32         8327.24      33.94      5476.13     0.000000e+00
 6016  6016   21.83         13304.58      34.05      8529.20     0.000000e+00
 7040  7040   20.96         22207.10      34.62      13442.23     0.000000e+00


The paper shows an 8x speedup reaching about 80 GFLOPS. I'm only seeing about a 40% speed-up with a peak of about 40 GFLOPS.

I'm not sure what I'm doing wrong either! I also have a better GPU (a C2050) than the GTX280 used in the paper!

Has anyone run the other routines in that paper to see if the speedups match for those?

Thanks.

Re: magma_dsytrd questions

Postby fletchjp » Tue Feb 22, 2011 7:32 pm

I have ported the file testing_dsytrd.cpp and get the following results, which seem comparable. My system has an 8-core CPU and I am running GotoBLAS2 compiled for CORE2, which uses 4 cores. Just for comparison, I get the following values for the last line of testing_dgetrf:

10112 10112 26.86 69.73 2.537635e-18

Code: Select all
fletcher@fletcher-desktop:~/magma_1.0.0-rc3/testing$ ./testing_dsytrd
device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

Usage:
  testing_dsytrd_gpu -N 1024



  M     N   CPU GFlop/s  CPU time (ms)  GPU GFlop/s  GPU time (ms)    error
============================================================================
 1024  1024   13.89         103.33      17.26       83.14     0.000000e+00
 2048  2048   12.77         898.26      13.77      832.64     0.000000e+00
 3072  3072   12.27         3153.21      14.22      2719.84     0.000000e+00
 4032  4032   12.09         7233.66      13.90      6291.65     0.000000e+00
 5184  5184   12.02         15462.91      14.71      12631.96     0.000000e+00
 6016  6016   12.30         23609.58      14.86      19540.01     0.000000e+00
 7040  7040   12.29         37861.68      14.76      31520.74     0.000000e+00
 8064  8064   11.33         61708.06      14.50      48224.16     0.000000e+00
 9088  9088   12.21         81974.06      14.78      67725.84     0.000000e+00
10112 10112   12.41         111154.08      14.81      93141.57     0.000000e+00

Re: magma_dsytrd questions

Postby mtacconi » Wed Feb 23, 2011 7:51 am

brom wrote:I'm seeing similar results.

I had to edit your example, though, removing the d_work parameter.

Code: Select all
magma_dsytrd(uplo, N, h_R, lda, h_DR, h_ER, h_TAUR, h_WORKR, &LWORKR, d_work, &info);

to
Code: Select all
magma_dsytrd(uplo, N, h_R, lda, h_DR, h_ER, h_TAUR, h_WORKR, &LWORKR, &info);



I had modified the magma_dsytrd routine to allocate the GPU memory in the main program. Initially I thought that the "GPU warm-up" and/or cublasAlloc overhead was the cause of the poor performance, because the allocation in the original magma_dsytrd was the first call to the CUDA subsystem. However, I noticed that the two versions of magma_dsytrd perform almost the same :(
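
The warm-up I tried was roughly like this (a sketch, not my exact code; legacy CUBLAS/CUDA runtime calls, assuming cublasInit() has already been done):

Code: Select all
#include <cublas.h>
#include <cuda_runtime_api.h>

/* Touch the CUDA/CUBLAS context once before starting the timer, so that
   context creation and the first allocation are not charged to the first
   timed magma_dsytrd call. */
static void gpu_warmup(void)
{
    double *d_dummy;
    cublasAlloc(1024, sizeof(double), (void**)&d_dummy);
    cublasFree(d_dummy);
    cudaThreadSynchronize();
}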

brom wrote: I also made a single-precision version, converting all the doubles to floats and running ssytrd instead. The results still don't come close to the paper! [...]

Good job on the single-precision version. Still disappointing performance on the Fermi GPU...

Re: magma_dsytrd questions

Postby brom » Wed Feb 23, 2011 11:00 am

I ran the bidiagonalization (SGEBRD) described in that paper as well. I'm seeing a similar discrepancy.

Code: Select all
  M    N    CPU GFlop/s    GPU GFlop/s   |A-QHQ'|/N|A|  |I-QQ'|/N
==================================================================
 1024  1024     8.42          9.09
 2048  2048    12.82         14.43
 3072  3072    14.09         24.53
 4032  4032    13.39         30.04
 5184  5184    12.73         34.82
 6016  6016    11.01         37.09
 7040  7040    11.63         38.26


Their GPU numbers from the paper are about 2x faster than my results (my 40 GFLOPs vs their 80 GFLOPs) and the numbers from their CPU are about 5x slower (my 13 GFLOPs vs their 2.5 GFLOPs).

The paper also used a worse GPU (their GTX280 vs. my C2050) and a better CPU (their Xeon vs. my desktop PC). So I'm not sure what I'm doing wrong! Perhaps the numbers in the paper were theoretical throughput mistakenly presented as observed results?
