## Parallel LU factorization 'pdgetrf', problem with 'descinit'


### Parallel LU factorization 'pdgetrf', problem with 'descinit'

Dear all,

I would like to use ScaLAPACK's parallel LU factorization routine 'pdgetrf'. At the moment I experience difficulties with the creation of matrix descriptor 'descinit' and as a result of that the program crashes with a memory corruption.

Short description of the example setup:

I create a 10x10 matrix A (MxN) and use 2x2 computation blocks (MBxNB). The matrix is distributed amongst a 2x2 processor grid (npRow x npCol), i.e. each processor gets a 5x5 matrix block AA (RPP x CPP). The master process (0) allocates the matrix A, initializes it with consecutive numbers from 1 to 100, and sends the 5x5 matrix blocks AA to the slave processes (1, 2, 3). The LU factorization routine 'pdgetrf' is called. The matrix blocks AA with the solution are sent back to the master process and collected into a new matrix R. Please correct me if my understanding of the ScaLAPACK logic is wrong or my usage of the ScaLAPACK routines is inappropriate.

Source code of the example:
```fortran
      PROGRAM LU_FACT_PAR_MINI

      IMPLICIT NONE

      integer, parameter :: M = 10, N = 10, MB = 2, NB = 2
      integer, parameter :: RPP = M/2, CPP = M/2  ! rows/columns per processor
      integer :: npRow = 2, npCol = 2
      integer :: iCtxt, iRow, iCol, me, err
      integer, parameter :: IA = 1, JA = 1
      integer :: info, descA(9), iPiv(N)
      double precision, allocatable :: A(:,:), AA(:,:), R(:,:)
      integer :: i, j

      ! external functions
      integer, external :: blacs_pnum

      ! initialize process grid
      call sl_init(iCtxt, npRow, npCol)
      call blacs_gridinfo(iCtxt, npRow, npCol, iRow, iCol)

      ! get process ID
      me = blacs_pnum(iCtxt, iRow, iCol)

      allocate(AA(RPP,CPP), stat=err)

      call descinit(descA, M, N, MB, NB, 0, 0, iCtxt, RPP, info)

      ! master: initialize matrix, send matrix blocks to slaves
      if (me == 0) then
        allocate(A(M,N), stat=err)
        do i = 1, M
          do j = 1, N
            A(i,j) = (i-1)*N+j
          end do
        end do

        ! copy own matrix block
        AA = A(1:RPP,1:CPP)

        ! send matrix blocks to slaves
        call dgesd2d(iCtxt, RPP, CPP, A(1:RPP,6:N), RPP, 0, 1)
        call dgesd2d(iCtxt, RPP, CPP, A(6:M,1:CPP), RPP, 1, 0)
        call dgesd2d(iCtxt, RPP, CPP, A(6:M,6:N),   RPP, 1, 1)

      ! slave: receive matrix block
      else
        call dgerv2d(iCtxt, RPP, CPP, AA, RPP, 0, 0)
      end if

      call pdgetrf(M, N, AA, IA, JA, descA, iPiv, info)
      if (me == 0) print *, "pdgetrf info:", info

      ! master: collect result blocks of LU factorization
      if (me == 0) then
        allocate(R(M,N), stat=err)
        R = 0.0

        ! copy own result block
        R(1:RPP,1:CPP) = AA

        ! receive result blocks from slaves
        call dgerv2d(iCtxt, RPP, CPP, R(1:RPP,6:N), RPP, 0, 1)
        call dgerv2d(iCtxt, RPP, CPP, R(6:M,1:CPP), RPP, 1, 0)
        call dgerv2d(iCtxt, RPP, CPP, R(6:M,6:N),   RPP, 1, 1)

        ! deallocate matrices
        deallocate(A, R, stat=err)

      ! slave: send result block to master
      else
        call dgesd2d(iCtxt, RPP, CPP, AA, RPP, 0, 0)
      end if

      deallocate(AA, stat=err)

      ! finalize
      call blacs_gridexit(iCtxt)
      call blacs_exit(0)

      END PROGRAM LU_FACT_PAR_MINI
```

Compilers:
```
MPICH Version:          3.2
MPICH Release date:     Wed Nov 11 22:06:48 CST 2015
MPICH Device:           ch3:nemesis
MPICH configure:        --prefix=/opt/mpich-gnu --enable-fast=O3,ndebug --disable-error-checking --without-timing --without-mpit-pvars --enable-fortran=all --enable-romio
MPICH CC:       gcc -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -DNDEBUG -DNVALGRIND -O3
MPICH CXX:      g++ -DNDEBUG -DNVALGRIND -O3
MPICH F77:      gfortran -O3
MPICH FC:       gfortran -O3
```

Compilation procedure:
```
mpif90 -g -O0 -fcheck=all -fbacktrace -fcheck=bounds -Wall -c lu_fact_par_mini.f90
mpif90 -g -O0 -fcheck=all -fbacktrace -fcheck=bounds -Wall -L /opt/scalapack/lib -L /opt/lapack/lib -L /opt/blas/lib -L ./ -o lu_fact_par_mini lu_fact_par_mini.o -lscalapack -llapack -lblas -lnagprint
```

Error message:
```
mpiexec -np 4 ./lu_fact_par_mini
{    0,    0}:  On entry to DESCINIT parameter number    9 had an illegal value
At line 60 of file lu_fact_par_mini.f90
Fortran runtime warning: An array temporary was created
At line 61 of file lu_fact_par_mini.f90
Fortran runtime warning: An array temporary was created
{    0,    1}:  On entry to DESCINIT parameter number    9 had an illegal value
At line 62 of file lu_fact_par_mini.f90
Fortran runtime warning: An array temporary was created
*** Error in `./lu_fact_par_mini': malloc(): memory corruption (fast): 0x0000000001a76220 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x6f364)[0x7fd76869d364]
/usr/lib/libc.so.6(+0x74d96)[0x7fd7686a2d96]
/usr/lib/libc.so.6(+0x77344)[0x7fd7686a5344]
/usr/lib/libc.so.6(__libc_malloc+0x54)[0x7fd7686a6924]
/opt/mpich-gnu/lib/libmpi.so.12(MPID_Dataloop_alloc_and_copy+0x6a)[0x7fd769579ada]
/opt/mpich-gnu/lib/libmpi.so.12(MPID_Dataloop_create_contiguous+0x407)[0x7fd76957c977]
/opt/mpich-gnu/lib/libmpi.so.12(MPID_Dataloop_create_vector+0x417)[0x7fd76957ebf7]
/opt/mpich-gnu/lib/libmpi.so.12(MPID_Dataloop_create+0xbbd)[0x7fd76957aced]
/opt/mpich-gnu/lib/libmpi.so.12(MPID_Type_commit+0x94)[0x7fd769589004]
/opt/mpich-gnu/lib/libmpi.so.12(MPI_Type_commit+0x5d)[0x7fd7694cebbd]
./lu_fact_par_mini[0x405636]
./lu_fact_par_mini[0x40e4d6]
./lu_fact_par_mini[0x40cccc]
./lu_fact_par_mini[0x4047f6]
./lu_fact_par_mini[0x402942]
./lu_fact_par_mini[0x403ae8]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7fd76864e710]
./lu_fact_par_mini[0x4017e9]
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
```
mabalenk


### Re: Parallel LU factorization 'pdgetrf', problem with 'descinit'

RPP is set to the wrong value. You can compute it with NUMROC:

```fortran
RPP = NUMROC( M, MB, iRow, 0, npRow )
CPP = NUMROC( N, NB, iCol, 0, npCol )
```
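For reference, the counting NUMROC performs can be sketched in a few lines. This is an illustrative Python transcription of the documented block-cyclic formula, not the ScaLAPACK routine itself:

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows/columns of a global dimension n, distributed
    block-cyclically with block size nb, owned by process iproc."""
    # distance of this process from the source (owning-first-block) process
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                 # number of whole nb-sized blocks
    num = (nblocks // nprocs) * nb    # whole blocks every process receives
    extrablks = nblocks % nprocs      # leftover whole blocks
    if mydist < extrablks:
        num += nb                     # this process gets one extra whole block
    elif mydist == extrablks:
        num += n % nb                 # this process gets the final partial block
    return num

# 10x10 matrix, 2x2 blocks, 2x2 grid: the local sizes are 6 or 4, never 5
print([numroc(10, 2, p, 0, 2) for p in (0, 1)])   # [6, 4]
```

This is why lld = 5 is rejected by descinit: the local array on process row 0 must hold 6 rows, so the leading dimension must be at least numroc(M, MB, iRow, 0, npRow).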

I would strongly advise you to go through
- the ScaLAPACK user manual http://netlib.org/scalapack/slug/index.html
- the routine comments of descinit: http://www.netlib.org/scalapack/explore ... ource.html
- and also the scalapack examples: http://www.netlib.org/scalapack/examples/


### Re: Parallel LU factorization 'pdgetrf', problem with 'descinit'

Thank you very much for pointing out the mistake to me. I have modified the code to use the 'numroc' function and the correct data ranges in the BLACS send/receive routines.

Updated source code of the example:
```fortran
      PROGRAM LU_FACT_PAR_MINI

      IMPLICIT NONE

      integer, parameter :: M = 10, N = 10, MB = 2, NB = 2
      integer, parameter :: IA = 1, JA = 1
      integer :: rpp, cpp              ! rows/cols per processor
      integer :: npRow = 2, npCol = 2  ! processor grid (npRow x npCol)
      integer :: rSrc  = 0, cSrc  = 0  ! origin process for matrix distribution
      integer :: iCtxt, iRow, iCol, me, err
      integer :: info, descA(9), iPiv(N)
      double precision, allocatable :: A(:,:), AA(:,:), R(:,:)
      integer :: i, j

      ! external functions
      integer, external :: blacs_pnum, numroc

      ! initialize process grid
      call sl_init(iCtxt, npRow, npCol)
      call blacs_gridinfo(iCtxt, npRow, npCol, iRow, iCol)

      ! get process ID
      me = blacs_pnum(iCtxt, iRow, iCol)

      ! determine no. of rows/cols per processor
      rpp = numroc(M, MB, iRow, rSrc, npRow)
      cpp = numroc(N, NB, iCol, cSrc, npCol)

      ! allocate matrix block AA
      allocate(AA(rpp,cpp), stat=err)

      ! create matrix descriptor
      call descinit(descA, M, N, MB, NB, rSrc, cSrc, iCtxt, rpp, info)

      ! master: initialize matrix, send matrix blocks to slaves
      if (me == 0) then
        allocate(A(M,N), stat=err)
        do i = 1, M
          do j = 1, N
            A(i,j) = (i-1)*N+j
          end do
        end do

        ! copy own matrix block
        AA = A(1:rpp,1:cpp)

        ! send matrix blocks to slaves
        call dgesd2d(iCtxt, rpp, N-cpp, A(1:rpp,cpp+1:N), rpp, 0, 1)
        call dgesd2d(iCtxt, M-rpp, cpp, A(rpp+1:M,1:cpp), M-rpp, 1, 0)
        call dgesd2d(iCtxt, M-rpp, N-cpp, A(rpp+1:M,cpp+1:N), M-rpp, 1, 1)

      ! slave: receive matrix block
      else
        call dgerv2d(iCtxt, rpp, cpp, AA, rpp, rSrc, cSrc)
      end if

      call pdgetrf(M, N, AA, IA, JA, descA, iPiv, info)
      if (me == 0) print *, "pdgetrf info:", info

      ! master: collect result blocks of LU factorization
      if (me == 0) then
        allocate(R(M,N), stat=err)
        R = 0.0

        ! copy own result block
        R(1:rpp,1:cpp) = AA

        ! receive result blocks from slaves
        call dgerv2d(iCtxt, rpp, N-cpp, R(1:rpp,cpp+1:N), rpp, 0, 1)
        call dgerv2d(iCtxt, M-rpp, cpp, R(rpp+1:M,1:cpp), M-rpp, 1, 0)
        call dgerv2d(iCtxt, M-rpp, N-cpp, R(rpp+1:M,cpp+1:N), M-rpp, 1, 1)

        ! deallocate matrices
        deallocate(A, R, stat=err)

      ! slave: send result block to master
      else
        call dgesd2d(iCtxt, rpp, cpp, AA, rpp, rSrc, cSrc)
      end if

      deallocate(AA, stat=err)

      ! finalize
      call blacs_gridexit(iCtxt)
      call blacs_exit(0)

      END PROGRAM LU_FACT_PAR_MINI
```

Since I use 2x2 blocks to process a 10x10 matrix, the matrix is divided into 4 submatrices of sizes (0,0): 6x6, (0,1): 6x4, (1,0): 4x6 and (1,1): 4x4, where the numbers in parentheses are the (iRow,iCol) coordinates of the given processor. Please see the attached picture (matrix-a.png) of the matrix distribution.

I apply parallel LU decomposition with 'pdgetrf', collect the results, and compare them with a sequential LU decomposition run on the master processor. The results are different. Please see the attached picture of the differences: the parallel result is on the left and the sequential one on the right. Interestingly, when I use MBxNB blocks of 5x5, i.e. the matrix blocks coincide with the processor grid and each processor gets exactly one block, the parallel and sequential results are identical.

To check that I collect the data correctly on the master processor I have commented out the LU decomposition routine, and I can reassemble the original matrix, which suggests that my BLACS communication of the matrix is correct. Can you please shed any light on what might be causing the difference in the results?
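For reference, the ownership pattern implied by the descriptor can be sketched as follows; `owner` is a hypothetical helper for illustration, not a ScaLAPACK routine. With MB = 2 over two process rows the block-cyclic layout interleaves rows, whereas with MB = 5 it degenerates to contiguous halves:

```python
def owner(i, nb, nprocs, isrc=0):
    """Process coordinate owning global index i (1-based) under a
    block-cyclic distribution with block size nb over nprocs processes."""
    return ((i - 1) // nb + isrc) % nprocs

# MB = 2 over 2 process rows: ownership alternates in blocks of two rows
print([i for i in range(1, 11) if owner(i, 2, 2) == 0])
# [1, 2, 5, 6, 9, 10] -- process row 0 does NOT own contiguous rows 1..6

# MB = 5: the cyclic layout happens to coincide with contiguous halves
print([i for i in range(1, 11) if owner(i, 5, 2) == 0])
# [1, 2, 3, 4, 5]
```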

Another question that I have: is there any routine that automatically distributes my matrix data over the available processor grid, or is the best I can do to use a BLACS send/receive routine pair for each matrix block?
Attachments:
- pdgetrf-par-seq-10x10-5x5.png: 10x10 matrix with 5x5 blocks (LU: parallel vs sequential)
- pdgetrf-par-seq-10x10-2x2.png: 10x10 matrix with 2x2 blocks (LU: parallel vs sequential)
- matrix-a.png: 10x10 matrix with 2x2 blocks distributed over a 2x2 processor grid
mabalenk


### Re: Parallel LU factorization 'pdgetrf', problem with 'descinit'

Look here: viewtopic.php?t=139. There is example code for LU there; it is in C, but the logic is the same.

I also found this document on the web: https://notendur.hi.is/jonasson/fob/scalapack-notes.doc . It is a pretty good starting point:

There are now several ways to create and distribute the matrix. To name a few:
(a) Each local matrix may be created in its process from scratch
(b) The global matrix may be read from file by one process and then distributed in pieces
(c) The global matrix may be created in one of the processes, and then distributed to all the rest of them
(d) Each process can read its portion of the matrix from file
(e) The matrix may be assembled in small pieces (as in finite element updating), each time sending updates to the relevant process.
One reason for avoiding (b) or (c) would be that the global matrix is too big to fit into the memory of a single process. ScaLapack comes with two example programs; no. 1 uses (a) and no. 2 uses (b). There are also several programs available to distribute the matrix. Maybe the simplest is to follow (c) and distribute with the routine pdgemr2d, described below, and in more detail in the ScaLapack User's Guide (see the section Matrix Redistribution/Copy Routines; once there select Next a few times).
Other possibilities include broadcasting the whole matrix to all the processors with dgebs2d/dgebr2d, or sending each local matrix (or block) to its process with dgesd2d/dgerv2d (the routine names are mnemonic for double general-matrix broadcast-send 2-dimensional, ...broadcast-receive..., ...send... and ...receive...). More information on these routines is in the Reference to BLACS routines on netlib.org/blacs. Some other useful routines are pdelset (that finds where aij of the global matrix is in the local one) and numroc (that calculates dimensions of the local matrix). A summary of these distribution routines is given below.
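Following option (c), the effect of such a block-cyclic distribution can be imitated outside ScaLAPACK. The sketch below is illustrative Python, where `scatter` is a hypothetical helper mirroring the layout that pdgemr2d would produce, not pdgemr2d itself; it extracts the local piece each process would own from the thread's 10x10 example matrix:

```python
def scatter(A, mb, nb, prow, pcol, nprow, npcol):
    """Local block-cyclic piece of global matrix A (list of rows,
    0-based indices) owned by process (prow, pcol) on an
    nprow x npcol grid with mb x nb blocks."""
    rows = [r for i, r in enumerate(A) if (i // mb) % nprow == prow]
    return [[x for j, x in enumerate(r) if (j // nb) % npcol == pcol]
            for r in rows]

# the 10x10 example matrix filled with consecutive numbers 1..100
A = [[(i - 1) * 10 + j for j in range(1, 11)] for i in range(1, 11)]

local = scatter(A, 2, 2, 0, 0, 2, 2)
print(len(local), len(local[0]))   # 6 6 -- matches numroc's local sizes
print(local[0])                    # [1, 2, 5, 6, 9, 10]
```

Note that the local piece interleaves global rows and columns in blocks of two, which is exactly the layout that sending contiguous quadrants with dgesd2d/dgerv2d does not reproduce.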