Page 1 of 2

Cholesky decomposition - magma_dpotrf

Posted: Fri Aug 06, 2010 12:14 pm
Hello,
I find strange thing:
When I use dportf and decompose the matrix and then I do the multiplication I don't get the previous matrix. What going on?
Maybe I have bad calling. Can anybody confirm that decomposition works fine? How should look the call of this method for lower triangular square matrix?

Re: Cholesky decomposition - magma_dpotrf

Posted: Fri Aug 06, 2010 5:19 pm
Yes, I can confirm it works. You can look in testing_dpotrf.cpp. The examples there compares the
results from LAPACK and MAGMA. If you want to do the test by computing A - L L^T, using the notations
in testing_dpotrf.cpp, you can do something like
Code: Select all
`      double alpha = 1., beta = 0.;      double *C = (double*)malloc(N*N * sizeof(double));      for( j = 0; j < N; j++ )        for( i = 0; i < j; i++ )           // here I assume h_R holds the lower triangular matrix from the           // Cholesky decomposition, as returned by magma_dpotrf           h_R[i+j*N] = 0.;      // make C = the original matrix      for( j = 0; j < N*N; j++ )        C[j] = h_A[j];      dsyrk_("L", "N", &N, &N, &alpha, h_R, &N, &beta, C, &N);      daxpy_(&n2, &mone, C, &one, h_A, &one);      printf("%5d    %6.2f         %6.2f        %e\n",             size[i], cpu_perf, gpu_perf,             dlange_("f", &N, &N, h_A, &N, work) );`

Re: Cholesky decomposition - magma_dpotrf

Posted: Sat Aug 07, 2010 8:11 am
oh, i find out what is wrong by viewing testing_dportf.cpp and showing the input matrix. If i have lower triangular matrix i must pass the
"U" argument instead of "L". I don't know why, but it works. Is it possible that there is such a bug in magma library?

Re: Cholesky decomposition - magma_dpotrf

Posted: Sat Aug 07, 2010 11:02 am
Stan Tomov wrote:Yes, I can confirm it works. You can look in testing_dpotrf.cpp. The examples there compares the
results from LAPACK and MAGMA. If you want to do the test by computing A - L L^T, using the notations
in testing_dpotrf.cpp, you can do something like
Code: Select all
`      double alpha = 1., beta = 0.;      double *C = (double*)malloc(N*N * sizeof(double));      for( j = 0; j < N; j++ )        for( i = 0; i < j; i++ )           // here I assume h_R holds the lower triangular matrix from the           // Cholesky decomposition, as returned by magma_dpotrf           h_R[i+j*N] = 0.;      // make C = the original matrix      for( j = 0; j < N*N; j++ )        C[j] = h_A[j];      dsyrk_("L", "N", &N, &N, &alpha, h_R, &N, &beta, C, &N);      daxpy_(&n2, &mone, C, &one, h_A, &one);      printf("%5d    %6.2f         %6.2f        %e\n",             size[i], cpu_perf, gpu_perf,             dlange_("f", &N, &N, h_A, &N, work) );`

funny thing, I run Your code (without making C = original matrix , h_R[j+i*N] = 0. because this is lower matrix so upper should be 0 and dsyrk_ with "U" argument) .
Code: Select all
`        for(int i=N-10;i<N;i++){                for(int j=N-10;j<N;j++)                        printf("%lf ",C[i*N+j]);                printf("\n");        }`

because the begin of matrix is good, but the end of matrix isn't the same as the input matrix before decomposition.

If You have time please give me the whole code with printf statements where I can see that decomposition is done correctly

also I tried my own code for matrix multiplication and it shows that the decomposition isn't correctly
Code: Select all
`double *temp,*h_a_u,*h_a_l;        double sum;        temp = new double[n2];        h_a_u = new double[n2];        h_a_l = new double[n2];        for(int i=0;i<N;i++)            for(int j=0;j<N;j++){                if(j>=i) h_a_u[i*N+j]=h_R[j*N+i];                if(j<=i)                    h_a_l[i*N+j]=h_R[i*N+j];            }for(int j=N-20;j<N;j++)            for(int k=N-20;k<N;k++){                sum=0;                for(int i=0;i<N;i++)                    sum+=h_a_l[j*N+i]*h_a_u[i*N+k];                    temp[j*N+k]=sum;            }`

Re: Cholesky decomposition - magma_dpotrf

Posted: Tue Aug 10, 2010 1:10 pm
Here is a code that would test the factorization.
Code: Select all
`/*    -- MAGMA (version 0.2) --       Univ. of Tennessee, Knoxville       Univ. of California, Berkeley       Univ. of Colorado, Denver       November 2009*/// includes, system#include <stdlib.h>#include <stdio.h>#include <string.h>#include <math.h>// includes, project#include "cuda.h"#include "cuda_runtime_api.h"#include "cublas.h"#include "magma.h"/* ////////////////////////////////////////////////////////////////////////////   -- Testing dpotrf*/int main( int argc, char** argv){    cuInit( 0 );    cublasInit( );    printout_devices( );    double *h_A, *h_R;    double *d_A;    double gpu_perf, cpu_perf;    TimeStruct start, end;    /* Matrix size */    int N=0, n2, lda;    int size[10] = {1024,2048,3072,4032,5184,6144,6912,8192,8960,9984};    cublasStatus status;    int i, j, info[1];    if (argc != 1){      for(i = 1; i<argc; i++){        if (strcmp("-N", argv[i])==0)          N = atoi(argv[++i]);      }      if (N>0) size[0] = size[9] = N;      else exit(1);    }    else {      printf("\nUsage: \n");      printf("  testing_dpotrf -N %d\n\n", 1024);    }    /* Initialize CUBLAS */    status = cublasInit();    if (status != CUBLAS_STATUS_SUCCESS) {        fprintf (stderr, "!!!! CUBLAS initialization error\n");    }    lda = N;    n2 = size[9] * size[9];    /* Allocate host memory for the matrix */    h_A = (double*)malloc(n2 * sizeof(h_A[0]));    if (h_A == 0) {        fprintf (stderr, "!!!! host memory allocation error (A)\n");    }    cudaMallocHost( (void**)&h_R,  n2*sizeof(double) );    if (h_R == 0) {        fprintf (stderr, "!!!! host memory allocation error (R)\n");    }    status = cublasAlloc(n2, sizeof(double), (void**)&d_A);    if (status != CUBLAS_STATUS_SUCCESS) {      fprintf (stderr, "!!!! device memory allocation error (d_A)\n");    }    printf("\n\n");    printf("  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F\n");    printf("========================================================\n");    for(i=0; i<10; i++){      N = lda = size[i];      n2 = N*N;      for(j = 0; j < n2; j++)        h_R[j] = h_A[j] = rand() / (double)RAND_MAX;     for(j=0; j<n2; j+=(lda+1))        h_R[j] = (h_A[j]+=2000);      magma_dpotrf("L", &N, h_R, &lda, d_A, info);      //magma_dpotrf("U", &N, h_R, &lda, d_A, info);      for(j=0; j<n2; j++)        h_R[j] = h_A[j];      /* ====================================================================         Performs operation using MAGMA         =================================================================== */      start = get_current_time();      magma_dpotrf("L", &N, h_R, &lda, d_A, info);      //magma_dpotrf("U", &N, h_R, &lda, d_A, info);      end = get_current_time();      gpu_perf = 1.*N*N*N/(3.*1000000*GetTimerValue(start,end));      // printf("GPU Processing time: %f (ms) \n", GetTimerValue(start,end));      // printf("Speed: %f GFlops \n", gpu_perf);      /* =====================================================================         Performs operation using LAPACK         =================================================================== */      start = get_current_time();      //dpotrf_("L", &N, h_A, &lda, info);      //dpotrf_("U", &N, h_A, &lda, info);      end = get_current_time();      if (info[0] < 0)        printf("Argument %d of dpotrf had an illegal value.\n", -info[0]);     cpu_perf = 1.*N*N*N/(3.*1000000*GetTimerValue(start,end));      // printf("CPU Processing time: %f (ms) \n", GetTimerValue(start,end));      // printf("Speed: %f GFlops \n", cpu_perf);      double alpha = 1., work[1], mone = -1.;      int one = 1;      double *C = (double*)malloc(N*N * sizeof(double));      // make C = the original matrix      for( j = 0; j < N*N; j++ )        C[j] = h_A[j];      for( j = 0; j < N; j++ )        for(int k = 0; k < j; k++ )          // here I assume h_R holds the lower triangular matrix from the          // Cholesky decomposition, as returned by magma_dpotrf          C[k+j*N] = h_R[k+j*N] = 0.;      dsyrk_("L", "N", &N, &N, &alpha, h_R, &N, &mone, C, &N);      printf("%5d    xxxxxxx      %6.2f       %e\n",             size[i], gpu_perf,             dlange_("f", &N, &N, C, &N, work) );      /* =====================================================================         Check the result compared to LAPACK         =================================================================== */      /*      double work[1], matnorm, mone = -1.;      int one = 1;      matnorm = dlange_("f", &N, &N, h_A, &N, work);      daxpy_(&n2, &mone, h_A, &one, h_R, &one);      printf("%5d    %6.2f         %6.2f        %e\n",            size[i], cpu_perf, gpu_perf,             dlange_("f", &N, &N, h_R, &N, work) / matnorm);      */      if (argc != 1)        break;    }    /* Memory clean up */    free(h_A);    cublasFree(h_R);    cublasFree(d_A);    /* Shutdown */    status = cublasShutdown();    if (status != CUBLAS_STATUS_SUCCESS) {        fprintf (stderr, "!!!! shutdown error (A)\n");    }}`

Re: Cholesky decomposition - magma_dpotrf

Posted: Tue Aug 10, 2010 3:14 pm
First thing:
This code has a bug, because h_R matrix cannot be allocated via cudaMalloc function. h_R should be allocated on host not on device.

Second thing:
You don't understand me. Give me code that i can see that last 10x10 matrix of input matrix and matrix from multiplication after decomposition are the same. Try to printf some elements from end of matrix. My point is that the decomposition has bug or very small precision. I find out that about 3000th element of matrix after multiplication differ from the input about more then e-15. At the end the difference is about e-3. If You want to prove me that magma works fine give me code that shows that these matrices are the same.

Re: Cholesky decomposition - magma_dpotrf

Posted: Tue Sep 07, 2010 1:23 pm
Regarding the first remark, note that the allocation is done using cudaMallocHost (not cudaMalloc), so h_R is allocated on the CPU (host) as intended.
Regarding the second remark, can you please post the code that you think gives you wrong result and we will look into it. All our tests are passing without problem. The code that I posted above gives you the residual so you can print the last 10x10 block directly from there.

Re: Cholesky decomposition - magma_dpotrf

Posted: Wed Sep 08, 2010 11:56 am
Code: Select all
`/*    -- MAGMA (version 0.2) --       Univ. of Tennessee, Knoxville       Univ. of California, Berkeley       Univ. of Colorado, Denver       November 2009*/// includes, system#include <stdlib.h>#include <stdio.h>#include <string.h>#include <math.h>#include <cutil_inline.h>// includes, project#include "cuda.h"#include "cuda_runtime_api.h"#include "cublas.h"#include "magma.h"/* ////////////////////////////////////////////////////////////////////////////   -- Testing dpotrf*/int main( int argc, char** argv) {    cudaSetDevice(3);    cuInit( 0 );    cublasInit( );    printout_devices( );        double *h_A, *h_R, *h_R2;    double *d_A;    double gpu_perf, cpu_perf;    TimeStruct start, end;    /* Matrix size */    int N=0, n2, lda;    int size[10] = {1024,2048,3072,4032,5184,6144,6912,8192,8960,9984};        cublasStatus status;    int i, j, info[1];    if (argc != 1){      for(i = 1; i<argc; i++){      if (strcmp("-N", argv[i])==0)     N = atoi(argv[++i]);      }      if (N>0) size[0] = size[9] = N;      else exit(1);    }    else {      printf("\nUsage: \n");      printf("  testing_dpotrf -N %d\n\n", 1024);    }    /* Initialize CUBLAS */    status = cublasInit();    if (status != CUBLAS_STATUS_SUCCESS) {        fprintf (stderr, "!!!! CUBLAS initialization error\n");    }    lda = N;    n2 = size[9] * size[9];    /* Allocate host memory for the matrix */    /*h_A = (double*)malloc(n2 * sizeof(h_A[0]));    if (h_A == 0) {        fprintf (stderr, "!!!! host memory allocation error (A)\n");    }  */    h_A = new double[n2];    /*cudaMallocHost( (void**)&h_R,  n2*sizeof(double) );    if (h_R == 0) {        fprintf (stderr, "!!!! host memory allocation error (R)\n");    }*/    h_R = new double[n2];    h_R2 = new double[n2];    //status = cublasAlloc(n2, sizeof(double), (void**)&d_A);    //if (status != CUBLAS_STATUS_SUCCESS) {     // fprintf (stderr, "!!!! device memory allocation error (d_A)\n");    //}    cutilSafeCall(cudaMalloc((void**)&d_A,n2*sizeof(double)));        printf("\n\n");    printf("  N    CPU GFlop/s    GPU GFlop/s    ||R||_F / ||A||_F\n");    printf("========================================================\n");    for(i=0; i<1; i++){      N = lda = size[i];      n2 = N*N;      for(j = 0; j < n2; j++)   h_A[j] = rand() / (double)RAND_MAX;            for(j=0; j<n2; j+=(lda+1))         h_R[j] = (h_A[j]+=2000);         //magma_dpotrf("L", &N, h_R, &lda, d_A, info);      magma_dpotrf("U", &N, h_R, &lda, d_A, info);      for(j=0; j<n2; j++){        h_R[j] = h_A[j];     h_R2[j] = h_R[j];      }            for(int i=0;i<N;i++)      for(int j=0;j<N;j++)         if(j>i)            h_R[i*N+j] = 0;      /* ====================================================================         Performs operation using MAGMA     =================================================================== */   printf("---------input h_R ------\n");   for(int i=N-10;i<N;i++){      for(int j=N-10;j<N;j++)         printf("%lf ",h_R[i*N+j]);      printf("\n");          }      start = get_current_time();      //magma_dpotrf("L", &N, h_R, &lda, d_A, info);      magma_dpotrf("U", &N, h_R, &lda, d_A, info);      end = get_current_time();      printf("----------- h_R after decomposition ----------------------------\n");   for(int i=N-10;i<N;i++){      for(int j=N-10;j<N;j++)         printf("%lf ",h_R[i*N+j]);      printf("\n");          }   double *temp,*h_r_u,*h_r_l;   double sum;   temp = new double[n2];   h_r_u = new double[n2];//uper triangular part of h_R   h_r_l = new double[n2];//lower triangular part of h_R   for(int i=0;i<N;i++)       for(int j=0;j<N;j++){      if(j>=i) h_r_u[i*N+j]=h_R[j*N+i];      if(j<=i) h_r_l[i*N+j]=h_R[i*N+j];       }   /*   printf("---- h_r_u -----------\n");   for(int j=N-10;j<N;j++){       for(int k=N-10;k<N;k++)      printf("%lf ",h_r_u[j*N+k]);       printf("\n");   }   printf("---- h_r_l -----------\n");   for(int j=N-10;j<N;j++){       for(int k=N-10;k<N;k++)      printf("%lf ",h_r_l[j*N+k]);       printf("\n");   }*/   for(int j=N-20;j<N;j++)       for(int k=N-20;k<N;k++){      sum=0;      for(int i=0;i<N;i++)          sum+=h_r_l[j*N+i]*h_r_u[i*N+k];          temp[j*N+k]=sum;       }   printf("---- temp (after manual multiplication)-----------\n");   for(int i=N-10;i<N;i++){       for(int j=N-10;j<N;j++)      printf("%lf ",temp[i*N+j]);       printf("\n");   }                              gpu_perf = 1.*N*N*N/(3.*1000000*GetTimerValue(start,end));      // printf("GPU Processing time: %f (ms) \n", GetTimerValue(start,end));      // printf("Speed: %f GFlops \n", gpu_perf);      /* =====================================================================         Performs operation using LAPACK     =================================================================== */      start = get_current_time();      dpotrf_("L", &N, h_A, &lda, info);      //dpotrf_("U", &N, h_A, &lda, info);      end = get_current_time();      if (info[0] < 0)     printf("Argument %d of dpotrf had an illegal value.\n", -info[0]);             cpu_perf = 1.*N*N*N/(3.*1000000*GetTimerValue(start,end));      // printf("CPU Processing time: %f (ms) \n", GetTimerValue(start,end));      // printf("Speed: %f GFlops \n", cpu_perf);            /* =====================================================================         Check the result compared to LAPACK         =================================================================== */      double work[1], matnorm, mone = -1.;      int one = 1;      matnorm = dlange_("f", &N, &N, h_A, &N, work);      double alpha = 1., beta = 0.;     double *C = (double*)malloc(N*N * sizeof(double));     for( j = 0; j < N; j++ )         for( i = 0; i < j; i++ )            // here I assume h_R holds the lower triangular matrix from the            // Cholesky decomposition, as returned by magma_dpotrf            h_R[j+i*N] = 0.;           // make C = the original matrix          // for( j = 0; j < N*N; j++ )          //     C[j] = h_A[j];   /*printf("=====================- C ----------------------\n");   for(int i=N-10;i<N;i++){      for(int j=N-10;j<N;j++)         printf("%lf ",C[i*N+j]);      printf("\n");          }*/           dsyrk_("L", "N", &N, &N, &alpha, h_R, &N, &beta, C, &N);   printf("================= C (multiplication by function dsyrk_)====\n");   for(int i=N-10;i<N;i++){      for(int j=N-10;j<N;j++)         printf("%lf ",C[i*N+j]);      printf("\n");          }   for(int i=0;i<n2;i++)          if(C[i]-h_R2[i]>0.000000000000001){             printf("First difference at: %d %d %d\n",i,i/N,i%N);             break;          }            daxpy_(&n2, &mone, h_A, &one, h_R, &one);      printf("%5d    %6.2f         %6.2f        %e\n",         size[i], cpu_perf, gpu_perf,        dlange_("f", &N, &N, h_R, &N, work) / matnorm);      if (argc != 1)   break;    }       /* Memory clean up */    free(h_A);    cublasFree(h_R);    cublasFree(d_A);    /* Shutdown */    status = cublasShutdown();    if (status != CUBLAS_STATUS_SUCCESS) {        fprintf (stderr, "!!!! shutdown error (A)\n");    }}`

on my machine code with cudaMallosHost doesn't work (segmentation fault), but i think it isn't a reason.
i'm doing standard make from testing folder provided with magma.
Please point out my mistake in code
thx

Re: Cholesky decomposition - magma_dpotrf

Posted: Mon Sep 13, 2010 7:13 pm
I get the correct result with your code, just replacing h_R to be allocated with cudaMallocHost. This is needed because the code assumes the memory is pinned (and uses some asynchronous communications that work only with pinned memory). What card is device 3 (in your case)?
Stan

Re: Cholesky decomposition - magma_dpotrf

Posted: Sat Sep 25, 2010 10:48 am
Device 3: "Tesla T10 Processor"
CUDA Driver Version: 3.10
CUDA Runtime Version: 3.10
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294770688 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 4, Device = Tesla T10 Processor, Device = Tesla T10 Processor

Tesla S870

tell me if this problem is caused by hardware

or maybe i can use the magma without this specific type of memory??