I am trying to use the ScaLAPACK routine pdgetrf to do an LU decomposition. I wrote a simple Fortran test program, and it worked well with a 2*2 process grid on the cluster. However, when I tried to use more processes, e.g. 'mpiexec -n 16', the program hung.
One possible reason might be that the BLAS spawns too many threads, which leads to a performance disaster as you mentioned before. So I tried exporting OMP_NUM_THREADS=1 and setting different combinations of pbs -l select=:ncpu:mpiprocs: in the PBS scripts, but none of them solved the problem.
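A minimal sketch of the kind of launch setup described above, for anyone trying to reproduce it. Several things here are assumptions rather than what was actually run: the 4*4 grid values, the executable name ./test_pdgetrf (a placeholder), and the extra MKL_NUM_THREADS setting, which is MKL's own thread-count variable. The script only prints the launch command rather than executing it.

```shell
#!/bin/sh
# Force single-threaded BLAS so each MPI rank uses one core.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1   # assumption: MKL-specific thread control, not in the original post

# Hypothetical 4*4 process grid; the number of MPI ranks should match NPROW * NPCOL.
NPROW=4
NPCOL=4
NPROCS=$((NPROW * NPCOL))

# ./test_pdgetrf is a placeholder name for the test executable.
LAUNCH_CMD="mpiexec -n $NPROCS ./test_pdgetrf"
echo "$LAUNCH_CMD"
```

Keeping the rank count derived from the grid dimensions avoids a mismatch between the -n value and the grid the program sets up.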
Now I have no idea why it is fine with 2*2 processes but fails with 4*4 or more. Could you explain in more technical detail how these internal environment variables work? Or, if the cause is not the one mentioned above, where could the problem be?
Thanks a lot,
Problem solved. It seems the MKL ScaLAPACK bundled with Composer 13.0.1 doesn't get along with MPICH 3.0. After switching to Intel MPI, the pdgetrf routine works fine with more processes.
I haven't tried MPICH2 or a newer release of MKL. Hopefully the issue has already been fixed there.