Lapack/BLAS on dual processor machines?

Hi,
I realise that this may not be the most appropriate forum to ask this question, but this seems to be a place where more experienced people can help newcomers without resorting to technobabble....
We currently run a cluster with ten nodes. Each node has two Opteron processors onboard. The O/S is Rocks Linux (based on RHEL3).
At the moment we are running Fortran code compiled with g77, with calls to the AMD ACML libraries (AMD-tuned versions of Lapack and BLAS). We noticed that the combined processor usage on a node never gets above 50% when running this code (i.e. one processor might be at 75% and the other at 25%, but together they never use more than 50% of the node's capacity).
If we move to a multithreaded BLAS library (such as goto_blas, http://www.cs.utexas.edu/users/kgoto/), will the processor utilization improve (i.e. will we get closer to 100% processor usage)? Are there any other libraries out there that are optimised for multiprocessor machines?
It just seems like at present we are wasting a lot of the capacity of our cluster.
Any thoughts on this would be most appreciated. If I am completely on the wrong track, please let me know.
Thanks,
Ashton