expected performance and fortran interface

Open forum for general discussions relating to PLASMA.

expected performance and fortran interface

Postby faircdl » Tue Jun 15, 2010 1:46 pm

First off let me say I enjoyed the NVIDIA webinar yesterday. It got me motivated to try out PLASMA/MAGMA.

Two issues thus far...

1) I have PLASMA compiled on 3 machines: a Core i5 running Ubuntu 10.04, a Core i7 running RHEL 5, and a dual quad-core Xeon running RHEL 5. I built PLASMA using gfortran/gcc 4.4 with the -O3 flag, linking in ATLAS BLAS (unthreaded). My main interest is sgesv performance (called from Fortran 90). When I compare wall-clock times, threaded sgesv from ATLAS usually beats PLASMA by a factor of 2X, sometimes even 3X. I noticed in the webinar and in the user guide that comparisons are usually done on many-core machines. Is this performance expected in my case? If not, what tricks can I play to get PLASMA running optimally? I have played with setting the tile size and inner block size to no avail.

2) I understand (I think) that PLASMA operates on tiled matrices. Would it be better to build my matrices directly in tiled storage? Does it cause a performance drag if I fill my matrix the "normal" way as compared to building a tiled matrix directly? I ask because I notice a couple of things. One is that memory usage goes way up upon entry to plasma_sgesv (presumably due to the plasma_shared_alloc calls). The other is that there are then two conversion calls, plasma_lapack_to_tile and plasma_tile_to_lapack. Is this where the speed hit in question 1 is coming from?

Thanks.
faircdl
 
Posts: 7
Joined: Tue Jun 15, 2010 1:33 pm

Re: expected performance and fortran interface

Postby mateo70 » Tue Jun 15, 2010 3:35 pm

Hi,

Yes, part of your performance problem comes from the layout transformation. Some explanations are in this recent thread: viewtopic.php?f=2&t=36

Mathieu
mateo70
 
Posts: 94
Joined: Fri May 07, 2010 3:48 pm

Re: expected performance and fortran interface

Postby luszczek » Tue Jun 15, 2010 3:38 pm

Did you link PLASMA against single-threaded ATLAS? If you did, then the problem lies elsewhere.
If you link PLASMA against threaded ATLAS, its threads and PLASMA's fight over the cores, and PLASMA will be slower, by the factor of 2 or 3 you describe.
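For reference, the difference is just which ATLAS libraries appear on the link line (where $DIR is your ATLAS library directory):

Code: Select all
# serial ATLAS -- the right choice under PLASMA
-L$DIR -lf77blas -lcblas -latlas
# threaded ATLAS -- its threads and PLASMA's fight over the cores
-L$DIR -lptf77blas -lptcblas -latlas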
luszczek
 
Posts: 15
Joined: Tue Jul 14, 2009 2:10 pm

Re: expected performance and fortran interface

Postby faircdl » Wed Jun 16, 2010 6:54 am

luszczek - I did link PLASMA against single-threaded ATLAS because that's what I thought I was supposed to do. For the BLAS lib I linked "-L$DIR -lf77blas -lcblas -latlas" instead of "-L$DIR -lptf77blas -lptcblas -latlas". I thought using the single-threaded BLAS was the correct way.

mateo70 - I did see that post. I am calling from Fortran 90, so I was having a little trouble figuring out what exactly I was supposed to do at the interface. Here's where I show my ignorance... In the reference guide, the Fortran interface is given as

PLASMA_LAPACK_TO_TILE(INTEGER*4 Af77, INTEGER LDA, INTEGER*4 A, INTEGER INFO)

I have tried this:

Code: Select all
real, allocatable :: af(:,:)
integer :: a, info   ! 'a' intended to hold the C-side pointer
allocate(af(1000,1000))
! some code here to fill the matrix and init plasma, etc.
call plasma_lapack_to_tile(af,1000,a,info)

In the past when mixing Fortran and C, 'a' has just been an integer to hold the pointer. When I do this, however, I get a seg fault while execution is inside this function. In a moment of frustration I tried several other things which, in hindsight, were dumb and should never have worked. So I guess I'm not sure what I'm doing wrong here. Any ideas?
faircdl
 
Posts: 7
Joined: Tue Jun 15, 2010 1:33 pm

Re: expected performance and fortran interface

Postby luszczek » Wed Jun 16, 2010 10:03 am

Could you please tell me which version of ATLAS you're running?
Is it built from source, or the prepackaged one that ships with Ubuntu?
Also, which version of PLASMA are you using?
I'll try to reproduce your problem once I have this information.
luszczek
 
Posts: 15
Joined: Tue Jul 14, 2009 2:10 pm

Re: expected performance and fortran interface

Postby faircdl » Wed Jun 16, 2010 10:26 am

I'm using ATLAS 3.8.3 compiled from source using gfortran/gcc 4.4. I'm using PLASMA 2.1.0.
faircdl
 
Posts: 7
Joined: Tue Jun 15, 2010 1:33 pm

Re: expected performance and fortran interface

Postby luszczek » Thu Jun 17, 2010 12:50 am

I did try ATLAS 3.8.1 with gcc 4.1.2. My machine was a dual quad-core Xeon.
I know this is not exactly what you have, but I believe it's close enough.

One thing is certain: the default blocking parameters are not optimal.

I decided to find the optimal values by asking ATLAS. I wrote a short C file
to print some information about ATLAS:
Code: Select all
#include <stdio.h>

/* Hand-declared prototypes for ATLAS's internal query routines. */
void ATL_buildinfo(void);
int ATL_sGetNB(void), ATL_dGetNB(void), ATL_cGetNB(void), ATL_zGetNB(void);
int ATL_sGetNCNB(void), ATL_dGetNCNB(void), ATL_cGetNCNB(void), ATL_zGetNCNB(void);

int
main(void) {
  ATL_buildinfo();  /* prints how this ATLAS was configured and built */

  /* blocking factors for the four precisions (s, d, c, z) */
  printf( "%d %d %d %d %d %d %d %d\n",
    ATL_sGetNB(), ATL_dGetNB(),  ATL_cGetNB(), ATL_zGetNB(),
    ATL_sGetNCNB(),  ATL_dGetNCNB(), ATL_cGetNCNB(), ATL_zGetNCNB() );

  return 0;
}
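To compile and run it (assuming the file is saved as atlas_nb.c, a name I made up; adjust the library path to your ATLAS install):

Code: Select all
gcc -o atlas_nb atlas_nb.c -L$DIR -latlas
./atlas_nb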

The most important call here is ATL_dGetNB(), which on my machine returns 56.

So my timing code has the following sequence of calls:

Code: Select all
  PLASMA_Init( thread_count );

  /* Disable autotuning so the explicit values below are honored. */
  PLASMA_Disable( PLASMA_AUTOTUNING );
  PLASMA_Set( PLASMA_TILE_SIZE, 448 );        /* NB = 8 * 56       */
  PLASMA_Set( PLASMA_INNER_BLOCK_SIZE, 56 );  /* IB = ATL_dGetNB() */

  PLASMA_Alloc_Workspace_dgesv( n, &L, &piv );
  PLASMA_dgesv( n, nrhs, a, n, L, piv, x, n );

  PLASMA_Finalize();


So 56 became my inner-blocking factor and 448 (56 multiplied by 8) became the tile size.
With these parameters I can get within 10% of ATLAS in a sequential run for matrices of size 2000
and below. The remaining gap is explained by the overhead of translating the matrix to tile storage:
if you keep your matrices in tile storage, that overhead disappears.
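For example, if you solve repeatedly, you can pay the conversion once and stay in tile layout. Here is a minimal sketch assuming the tile interface from the reference guide (PLASMA_Desc_Create, PLASMA_Lapack_to_Tile, PLASMA_Tile_to_Lapack and a PLASMA_dgesv_Tile entry point); I am writing the signatures from memory, so verify them against your release:

Code: Select all
/* Sketch only: keep A in tile layout so each solve skips the LAPACK<->tile
 * conversion.  Descriptor/solver signatures are assumed, not quoted from
 * the reference guide -- check your PLASMA version.                      */
PLASMA_desc *descA;
double *tiles = malloc( (size_t)n * n * sizeof(double) );

PLASMA_Desc_Create( &descA, tiles, PlasmaRealDouble,
                    nb, nb, nb*nb, n, n, 0, 0, n, n );
PLASMA_Lapack_to_Tile( a, n, descA );      /* pay the conversion once */

/* descB for the right-hand sides is built the same way; every solve then
 * works directly on the tiles:                                          */
PLASMA_dgesv_Tile( descA, L, piv, descB );

PLASMA_Tile_to_Lapack( descA, a, n );      /* convert back only if needed */

Building the matrix directly in tile storage, as asked about in the first post, removes even the one-time conversion.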

Overall, ATLAS seems to be much more sensitive to a wrong choice of the tuning parameters.
Maybe it's because ATLAS is at the mercy of the C compiler, while other BLAS implementations such as
Intel MKL and GotoBLAS use assembly language and almost always get it right as far as performance
is concerned.
luszczek
 
Posts: 15
Joined: Tue Jul 14, 2009 2:10 pm

Re: expected performance and fortran interface

Postby faircdl » Thu Jun 17, 2010 11:20 am

Thanks for the info. I did code up a script to loop over a wide range of tile and inner block sizes, but I didn't try your method. I will give that a shot.
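In case it's useful to anyone else, the sweep amounts to roughly this (problem setup, workspace allocation and timing elided; parameter names as in luszczek's snippet above):

Code: Select all
PLASMA_Init( thread_count );
PLASMA_Disable( PLASMA_AUTOTUNING );  /* or PLASMA overrides the values below */

for (int nb = 64; nb <= 512; nb += 32) {      /* tile size        */
    for (int ib = 8; ib <= nb; ib += 8) {     /* inner block size */
        if (nb % ib != 0) continue;           /* keep IB dividing NB, as in 56/448 */
        PLASMA_Set( PLASMA_TILE_SIZE, nb );
        PLASMA_Set( PLASMA_INNER_BLOCK_SIZE, ib );
        /* ... allocate workspace, time PLASMA_sgesv, record best (nb, ib) ... */
    }
}

PLASMA_Finalize();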
faircdl
 
Posts: 7
Joined: Tue Jun 15, 2010 1:33 pm

Re: expected performance and fortran interface

Postby luszczek » Thu Jun 17, 2010 12:02 pm

Good luck. Don't forget to disable autotuning:

Code: Select all
PLASMA_Disable( PLASMA_AUTOTUNING );


This is crucial, and it might be why your previous sweep didn't work:
even if you set various blocking parameters, autotuning will override them.
luszczek
 
Posts: 15
Joined: Tue Jul 14, 2009 2:10 pm

Re: expected performance and fortran interface

Postby faircdl » Thu Jun 17, 2010 12:10 pm

Yes, I was doing that already. Thanks.
faircdl
 
Posts: 7
Joined: Tue Jun 15, 2010 1:33 pm
