thread control

Open forum for general discussions relating to PLASMA.

Postby katayama » Mon Nov 22, 2010 7:58 pm

Dear experts,

I am a new user of PLASMA (and of linear algebra packages in general). I am trying to compute the inverse (and log(det)) of large positive definite matrices as fast as possible on one computer with (eventually multiple) GPUs.

I work with a 24576 x 24576 matrix split into 16 tiles of 6144 x 6144. I currently use MKL to run dpotrf, dtrsm, dgemm, etc. on the 6144 x 6144 tiles, and I would eventually offload those calls to GPUs using CUBLAS/CULA/MAGMA.

I am testing plasma_dpotrf_tile_async and plasma_dpotri_tile_async for this. For now I want to use only one PLASMA thread, but let the MKL routines use multiple cores, so I can test the idea.

When I set the number of PLASMA cores to one, MKL does not seem to use threads either (I observe it with top, which only reaches 100%).
When I set PLASMA cores to 6, top reaches 600% during the first dpotrf call, but once the three trsm calls start it drops to 300%; I take it each trsm is using 100% of one core.
I looked around the affinity code...

I wonder how I can control the number of PLASMA threads and the number of MKL threads independently, to achieve what I want to do.

Thank you for your help.
Best,

Nobu
katayama
 
Posts: 4
Joined: Thu Nov 18, 2010 3:07 pm

Re: thread control

Postby admin » Mon Nov 22, 2010 10:44 pm

We are happy to hear that our new Cholesky inversion routine has a user.
About thread control:
Due to the inner workings of PLASMA and MKL, you don't have full control over thread assignment.
First, you can just use multithreaded MKL to do the inversion (no PLASMA). With a matrix this large, you should get decent performance.
Second, you can use single-threaded PLASMA with multithreaded MKL. In principle it should work; if it does not, I am not sure why.
Third, you can use single-threaded MKL with multithreaded PLASMA.
None of these scenarios will give you perfect speedups, but I would expect the Cholesky inversion in PLASMA to be the fastest. That is the third scenario (sequential MKL with multithreaded PLASMA).
Jakub
admin
Site Admin
 
Posts: 79
Joined: Wed May 13, 2009 1:27 pm

Re: thread control

Postby katayama » Tue Nov 23, 2010 2:31 am

Dear Jakub,

Thanks for the quick reply. Yes, I hope to use the PLASMA Cholesky routines; sorry for not explaining this well.

I have been using multithreaded MKL to do the inversion of the 24576 x 24576 matrix, which takes quite some time: about 72 GFLOPS on a 6-core machine.

With a single GPU board (C2050) and the CULA package, I can invert a 12288 x 12288 matrix at around 190 GFLOPS. (An unpacked 24576 x 24576 matrix does not fit in GPU local memory.)

The matrix I would like to invert is 24576 x 24576, and eventually 98304 x 98304, so I was hoping to use both the CPUs and multiple GPUs on the same machine (3 GPU boards are easy to fit). I think this is possible with PLASMA, and I was testing it.

You say,
> Then you can use single-threaded PLASMA and multithreaded MKL. In principle it should work. If it does not, I am not sure why.
but I somehow can't make it work.

[katayama@btesla1 plasma]$ export OMP_NUM_THREADS=6
[katayama@btesla1 plasma]$ export MKL_NUM_THREADS=6
[katayama@btesla1 plasma]$ /usr/bin/nohup time ./plasma_dpotri --n_range=24576:24577:2 --nb=6144 --threads=1 --dyn

only gets up to 100% in top. If I use --threads=6, as I said in the previous post, I get 600% during core_dpotrf but not afterwards.

I would eventually like to control the number of PLASMA threads independently of OMP_NUM_THREADS/MKL_NUM_THREADS because, from the PLASMA threads, I would like to call GPU gemm routines, for example, while the CPUs are doing, say, dtrtri.

Thanks,

Nobu
katayama
 
Posts: 4
Joined: Thu Nov 18, 2010 3:07 pm

Re: thread control

Postby mateo70 » Tue Nov 23, 2010 10:44 am

Dear Nobu,

There are different ways to control the number of threads in PLASMA. The first, classical one is to use:
PLASMA_Init( numthread );


Another solution, to test your problem, is to pass numthread = 0 in that call and use the following environment setup:
export PLASMA_NUM_THREADS=4


By default, if PLASMA_NUM_THREADS is not set, PLASMA will use all the cores of your system.
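Putting the two controls together, the configuration Nobu described (one PLASMA thread, multithreaded MKL) might then look like the following environment setup; the binary name and flags are taken from his earlier post and are only illustrative:

```shell
# one PLASMA worker thread; MKL kernels may each use up to 6 threads
export PLASMA_NUM_THREADS=1
export MKL_NUM_THREADS=6
export OMP_NUM_THREADS=6
# then run, e.g. (binary and flags as in the earlier post):
#   ./plasma_dpotri --n_range=24576:24577:2 --nb=6144 --dyn
```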
Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: thread control

Postby admin » Tue Nov 23, 2010 10:58 am

Have you looked into MAGMA?
http://icl.cs.utk.edu/magma/
Jakub
admin
Site Admin
 
Posts: 79
Joined: Wed May 13, 2009 1:27 pm

Re: thread control

Postby admin » Tue Nov 23, 2010 4:55 pm

As I mentioned before, we are very happy to have a user for our Cholesky inversion routine.
Would you please share some more details about the application?
Perhaps you could provide some pointers to the literature.
Are you calling the Cholesky inversion routine from a larger software package? Which package is it?
You mentioned interest in inverting matrices as large as 100K x 100K elements. How large could the matrix potentially be?
Thanks,
Jakub
admin
Site Admin
 
Posts: 79
Joined: Wed May 13, 2009 1:27 pm

Re: thread control

Postby katayama » Tue Nov 23, 2010 9:30 pm

Dear Jakub,

Thanks for the replies. I will try the environment variable. I have tried MAGMA and hope to call magma_dgemm from inside PLASMA's core_dgemm.

I should also report that with ATLAS I could use multiple cores in the ATLAS routines with --threads=6.
(I switched to MKL because ATLAS's dpotrf fails on a 24576 x 24576 matrix.)

My application is in cosmology. We are planning a future space experiment to observe the B-mode polarization of the cosmic microwave background (CMB). It would be evidence of gravitational waves from the inflationary period.

We observe the sky using a camera with a field of view of about 30 arcminutes (roughly the size of the moon). We then pixelize the sky into 12*Nside**2 pixels (Nside = 2**n) using a package called HEALPix, so the number of pixels is 3072 for n=4, 12288 for n=5, and so on. For each pixel we have two or three measurements (the total power and two polarizations). Let's start with the two-measurements-per-pixel case. The measurements in the pixels are all correlated, and we can compute their covariance matrix C, whose size is then 6144 x 6144, 24576 x 24576, and so on.

We would then like to perform a likelihood fit, varying cosmological and other parameters. During the fit I need to compute m^T C^{-1} m and log(det(C)) many times, since C is a function of the cosmological parameters and m is the measurement vector.
(Actually, I probably don't need the final lauum that forms C^{-1} explicitly; it should be enough to compute L^{-1} m and take its inner product with itself.)
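That last remark can be made concrete. The following is not PLASMA code, just a plain-Python sketch with a toy 3x3 matrix and a hand-rolled Cholesky: once a dpotrf-style factorization has produced L with C = L L^T, log det C falls out of the diagonal of L, and the quadratic form needs only one triangular solve, so the trtri/lauum steps that build the explicit inverse can indeed be skipped.

```python
import math

def cholesky(C):
    """Cholesky factorization C = L L^T (lower-triangular L), on plain lists."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = C[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

def logdet_from_chol(L):
    """log det C = 2 * sum_i log L[i][i], since det C = (prod_i L[i][i])^2."""
    return 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))

def quadform_from_chol(L, m):
    """Solve L y = m by forward substitution; then m^T C^{-1} m = y^T y."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (m[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return sum(v * v for v in y)

# Toy SPD covariance (det C = 44) and measurement vector.
C = [[4.0, 2.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
m = [1.0, 2.0, 3.0]
L = cholesky(C)
print(logdet_from_chol(L))     # log(44), about 3.7842
print(quadform_from_chol(L, m))  # about 3.3182
```

In the real application the same identities apply tile by tile: dpotrf supplies L, and one dtrsm-style solve per measurement vector replaces the full inversion.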

Best,

Nobu
katayama
 
Posts: 4
Joined: Thu Nov 18, 2010 3:07 pm

Re: thread control

Postby adnan » Tue May 03, 2011 11:49 am

Dear Dr.,

I am interested in testing QUARK in some benchmarks on a multicore CPU.
If possible, I would like to get the developer guide for QUARK.
Could you please help me obtain the developer guide or the user guide for QUARK?

Best regards,

adnan,
adnan
 
Posts: 2
Joined: Tue May 03, 2011 11:31 am

Re: thread control

Postby yarkhan » Thu May 05, 2011 1:14 pm

We are currently preparing the QUARK Users' Guide for release, and it should be ready shortly. However, I will send a draft (work-in-progress) copy of this guide directly to your email address. Please feel free to provide feedback, and make suggestions for improvements and corrections. I will also send a copy of the reference guide that is automatically generated using Doxygen from the QUARK code.
Regards,
Asim
yarkhan
 
Posts: 15
Joined: Thu Oct 01, 2009 10:38 am

Re: thread control

Postby adnan » Thu May 12, 2011 7:11 am

Dear Dr. Asim,

Thank you. I have a question about QUARK:

Is LWN 220 the report on QUARK?

Best regards,

adnan,
adnan
 
Posts: 2
Joined: Tue May 03, 2011 11:31 am
