
Lapack/BLAS on dual processor machines?

PostPosted: Wed Apr 27, 2005 7:53 pm
by ashtonaut

I realise that this may not be the most appropriate forum to ask this question, but this seems to be a place where more experienced people can help newcomers without resorting to technobabble....

We currently run a cluster with ten nodes. Each node has two Opteron processors onboard. The O/S is Rocks Linux (based on RHEL3).

At the moment we are running Fortran code compiled with g77 with calls to the AMD ACML libraries (AMD-tuned versions of LAPACK and BLAS). We noticed that the processor usage never gets above 50% of the total when running this code (i.e. one processor might be at 75%, the other at 25%, but the total is never more than 50%).

If we move to a multithreaded BLAS library (such as goto_blas), will the processor utilization improve (i.e. will we get closer to 100% processor usage)? Are there any other libraries out there that are optimised for multiprocessor machines?

It just seems like at present we are wasting a lot of the capacity of our cluster.

Any thoughts on this would be most appreciated. If I am completely on the wrong track, please let me know.



PostPosted: Thu Apr 28, 2005 12:39 pm
by whaley
You should definitely try other libs. Last time I checked the overall speed ranking on the Opteron was [goto,ATLAS,ACML], but which is actually faster will probably vary by application, so if you are really looking for maximal performance, try them all and choose between them based on actual application runs. Having them all around is a good idea, as it can aid in later debugging as well.

PostPosted: Tue May 03, 2005 6:52 pm
by ashtonaut
Thanks for the reply. What I was really interested in knowing though was if using this goto_blas library would actually use both processors on my machine.

Is there some BLAS test problem I can try that will clearly demonstrate if I am in fact getting multithreaded operation or not?


PostPosted: Tue May 03, 2005 9:59 pm
by Julien Langou
Both ATLAS and GOTO BLAS will give you performance close to the peak performance of your two CPUs in multithreaded mode.

If you want to give ATLAS a try, download it. The installation should be easy, and it should automatically offer to build an optimized threaded library. During the installation you will see that both CPUs are working close to their peak simultaneously.


PostPosted: Thu May 05, 2005 11:03 pm
by ashtonaut
Thanks for the reply Julien,

Are you saying that I can achieve close to full load on both CPUs with ATLAS and goto BLAS, but not with LAPACK and goto BLAS?

We are trying to get this cluster running at full capacity, so I really appreciate the help from you guys.

If we did switch to ATLAS, would this mean current Fortran code that calls LAPACK would need to be re-written?



PostPosted: Tue May 10, 2005 3:42 pm
by Julien Langou
"Are you saying that I can achieve close to full load on both CPUs with ATLAS and goto BLAS, but not with LAPACK and goto BLAS?"

Not exactly.
LAPACK uses the BLAS, and any correct BLAS is fine: Mr Goto's, the ATLAS one, a vendor's BLAS library, or the reference BLAS (not optimized, provided in the LAPACK/BLAS directory).

The performance of LAPACK comes mainly from the BLAS beneath it. LAPACK is not itself multithreaded. But if your BLAS is multithreaded (ATLAS, for example, provides a multithreaded BLAS in its library libptf77blas.a), LAPACK will appear to run on both CPUs, since most of the time is spent in the multithreaded BLAS routines.
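How much a multithreaded BLAS helps depends on how much of the run time is actually spent inside BLAS routines. A rough way to see this is Amdahl's law. The sketch below is only an illustration: it uses Python rather than Fortran for brevity, the function name is hypothetical, and the fractions are made-up values, not measurements from this thread.

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Ideal speedup when only `parallel_fraction` of the run time scales with CPUs.

    parallel_fraction: share of single-CPU run time spent in the threaded part
                       (here: inside multithreaded BLAS calls), between 0 and 1.
    n_cpus:            number of CPUs the threaded part can use.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


# Illustrative fractions only (assumed, not measured).
for fraction in (0.5, 0.8, 0.95):
    speedup = amdahl_speedup(fraction, 2)
    print(f"{fraction:.0%} of time in BLAS -> at most {speedup:.2f}x on 2 CPUs")
```

The closer the BLAS share is to 100%, the closer a dual-processor node can get to a 2x speedup. Code that spends much of its time outside the BLAS will see less.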

"If we did switch to ATLAS, would this mean current Fortran code that calls LAPACK would need to be re-written?"

No, you can keep your Fortran code as it is.
There is no problem with Fortran code using LAPACK and the ATLAS BLAS.

So, for example, with ATLAS, linking with something like:
 /LAPACK_DIR/lapack.a -L${ATLAS_DIR} -lptf77blas -latlas

will give you a multithreaded code that uses both CPUs.
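One generic way to check whether a run really uses more than one CPU, as asked earlier in the thread, is to compare the CPU time a process consumes with the wall-clock time that elapses. The sketch below shows the idea in Python rather than Fortran. The function names are hypothetical, and the busy loop is only a single-threaded stand-in for a real BLAS-heavy computation.

```python
import time


def cpu_to_wall_ratio(workload):
    """Run workload() once and return CPU seconds divided by wall-clock seconds.

    time.process_time() counts CPU time summed over all threads of this
    process, so a ratio near 1.0 suggests one busy CPU and a ratio near 2.0
    suggests two busy CPUs.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used


def busy_loop():
    # Hypothetical single-threaded stand-in for a real computation.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    print(f"CPU/wall ratio: {cpu_to_wall_ratio(busy_loop):.2f}")
```

To test a BLAS, the workload would be a call that ends up in a large matrix-matrix product (DGEMM) in the library being examined. With a working multithreaded BLAS on a dual-processor node, the ratio should approach 2.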

PostPosted: Sat Jun 25, 2005 4:14 pm
by mbibby
Hi Ashton.

I am contemplating building a cluster of nodes using dual processors, quite possibly of the Opteron type.

Did you ever resolve your system performance to get closer to your expectations?

Would you provide an update on your experience(s) with this issue please?


PostPosted: Sun Jun 26, 2005 6:56 pm
by ashtonaut
Hi Malcolm,

We have been fortunate enough to have Julien Langou help us with this problem last week. I don't think Julien will mind me posting some of the notes he left for me about the solution (note: this is all work done by Julien, so credit for this goes to him entirely). We were using the ACML versions of Lapack and BLAS, but will now use the standard netlib version of LAPACK and the Goto BLAS.

multithreaded BLAS:

- the problem comes from the ACML library.

1- The ACML library provides LAPACK and BLAS, but their BLAS does not seem to be multithreaded (note: this is strange, there should be a multithreaded version somewhere on the net)
2- If you want to use LAPACK from ACML, and BLAS from Goto BLAS, you need to link first to Goto BLAS and then the ACML library
3- However, this does not work. ACML LAPACK calls BLAS routines by unusual names, so that they can be sure you are using their BLAS. It doesn't appear to be possible to use ACML LAPACK with the Goto BLAS.
4- I installed LAPACK from netlib and linked with Goto BLAS; the results are below :)

So type
> g77 -m64 -Wall -Wimplicit -ffixed-line-length-0 -lm -o \
fwd_3D_harmonic_banded_serial.opt fwd_3D_harmonic_banded_serial.f \
LAPACK/lapack_LINUX.a /usr/local/goto_blas/ -lpthread
> ./fwd_3D_harmonic_banded_serial.opt

17131 ape20 25 0 958M 957M 816 R 194.4 48.4 0:43 1 fwd_3D_harmonic

+ Elapsed wall time (s): 49.2200012

The major change is that processor usage is now 194.4% (i.e. both processors are running at almost full speed), which means the multithreaded BLAS is now working in our case. I'm sorry the text showing this is not aligned correctly. Note the elapsed wall time: this is almost 50% faster than it was with the single-threaded BLAS from the ACML.
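The 194.4% figure is on a scale where 100% means one fully busy CPU, whereas the 50% ceiling mentioned at the start of the thread was measured against the whole node. A small sketch of the conversion, in Python for illustration: the function names are hypothetical, and the column position of %CPU is an assumption that matches the line quoted above but may differ between versions of top.

```python
def parse_top_cpu_percent(top_line, cpu_column=8):
    """Return the %CPU field from one process line of top output.

    cpu_column=8 (zero-based) is an assumption that fits the line quoted in
    this thread; other versions of top may order their columns differently.
    """
    fields = top_line.split()
    return float(fields[cpu_column])


def share_of_machine(cpu_percent, n_cpus):
    """Convert per-process %CPU (100% = one full CPU) to % of the whole node."""
    return cpu_percent / n_cpus


line = "17131 ape20 25 0 958M 957M 816 R 194.4 48.4 0:43 1 fwd_3D_harmonic"
cpu = parse_top_cpu_percent(line)
print(f"{cpu:.1f}% on top's scale = {share_of_machine(cpu, 2):.1f}% of a 2-CPU node")
# prints: 194.4% on top's scale = 97.2% of a 2-CPU node
```

On this reading, the job went from at most 50% of the node to roughly 97% of it.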

So, in summary, if you link properly to a multithreaded BLAS library (we needed help with this), you can get a large code speedup and near full utilization of both processors on the machine.

I must thank Julien again for his assistance in working through this problem.

I hope this helps, all the best with your cluster.


Re: Lapack/BLAS on dual processor machines?

PostPosted: Sun Jan 16, 2011 2:28 pm
by jsrassa
Hello, I would be grateful to anyone who can help me with this problem, which is very similar to the previous one. I built a Beowulf cluster from two computers running Ubuntu and I'm trying to use ScaLAPACK, but the CPU usage is really low (25% most of the time), so it is taking more time than just using LAPACK. I tried linking with the ATLAS library, but that still does not solve the problem. The way I am compiling is:
mpif90 -lscalapack-openmpi -lblacsF77init-openmpi -lblacsCinit-openmpi -lblas-openmpi -llapack_atlas -lf77-blas -lcblas -latlas -lpthread
Thank you very much