Hi Malcolm,
We were fortunate enough to have Julien Langou help us with this problem last week. I don't think Julien will mind me posting some of the notes he left for me about the solution (note: this is all Julien's work, so the credit goes entirely to him). We were using the ACML versions of LAPACK and BLAS, but will now use the standard netlib version of LAPACK and the Goto BLAS.
--------------------
multithreaded BLAS:
--------------------
- The problem comes from the ACML library.
1- The ACML library provides LAPACK and BLAS, but its BLAS does not seem to be multithreaded (note: this is strange; there should be a multithreaded version somewhere on the net).
2- If you want to use LAPACK from ACML and BLAS from Goto BLAS, you need to link first to Goto BLAS and then to the ACML library.
3- However, this does not work. ACML LAPACK calls some strange BLAS names, so that they are sure you are using their BLAS. It doesn't appear to be possible to use ACML LAPACK with the Goto BLAS.
4- I installed LAPACK from netlib and linked it with Goto BLAS; the results are below.

So type
> g77 -m64 -Wall -Wimplicit -ffixed-line-length-0 -lm -o \
fwd_3D_harmonic_banded_serial.opt fwd_3D_harmonic_banded_serial.f \
LAPACK/lapack_LINUX.a /usr/local/goto_blas/libgoto_opt-64_1024p_old-r0.97.so -lpthread
> ./fwd_3D_harmonic_banded_serial.opt
-------------------------------------------------------------------------------------
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
17131 ape20 25 0 958M 957M 816 R 194.4 48.4 0:43 1 fwd_3D_harmonic
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
+ Elapsed wall time (s): 49.2200012
-------------------------------------------------------------------------------------
The major change is that the processor usage is now 194.4% (i.e. both processors are running at almost full speed), which means the multithreaded BLAS is now working in our case. I'm sorry the text showing this is not aligned correctly. Note the elapsed wall time: this is almost 50% faster than it was with the single-threaded BLAS from the ACML.
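As a quick sanity check on that %CPU figure, here is a small sketch of the arithmetic. It assumes (as on our machine) that top reports %CPU summed over all processors, so the per-processor average is just the total divided by the processor count. The function name per_cpu_util is only something I made up for illustration.

```shell
# Average per-processor utilization from an aggregate %CPU figure.
# Assumption: the %CPU value is summed over all processors.
# Usage: per_cpu_util <total_percent> <num_processors>
per_cpu_util() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.1f\n", total / n }'
}

# 194.4% spread over our 2 processors
per_cpu_util 194.4 2
```

For our run this gives 97.2, i.e. each processor is busy about 97% of the time on average.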
So, in summary, if you link properly to a multithreaded BLAS library (we needed help with this), you can get a large code speedup and near full utilization of both processors on the machine.
I must thank Julien again for his assistance in working through this problem.
I hope this helps. All the best with your cluster.
Ashton