I have Fortran code that fits kriging response surfaces. The most
time-consuming steps are the Cholesky factorization and later determination
of the inverse for square symmetric positive definite matrices with n =
100-6000 (using DPOTRF and DPOTRI). My users have powerful Windows XP
machines with lots of memory and 4 processors. Note that this is not a
cluster, it is a single machine (box) with 4 processors. I currently use
the Compaq Visual Fortran compiler, and I have access to the Intel Fortran compiler.
Do you have any experience and can you give any guidance on how to get
faster DPOTRF performance in such a hardware environment? What compiler
would I have to use? Is PLAPACK perhaps better for this environment? Do
you know any gurus on parallel linear algebra under Windows?
I also use the GOTO BLAS code for the Pentium. This speeds things up by a
factor of 2 or so. I would not want to lose this advantage.
With LAPACK, all the performance is in the BLAS, both on a single processor
and on a shared memory machine like your 4-processor boxes. The compiler
does not matter much as long as you are using a fast implementation of the BLAS.
Also, on a shared memory machine all parallelization is in the BLAS. Just make sure
the number of threads is equal to the number of processors. This should be the
default behavior, but you can also control it with environment variables
(OMP_NUM_THREADS or GOTO_NUM_THREADS).
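To illustrate that the calling code itself does not change, here is a minimal sketch of the DPOTRF/DPOTRI sequence. The program name and the 3x3 test matrix are made up for illustration, and it assumes you link against LAPACK and whatever threaded BLAS you are using.

```fortran
! Sketch: factor a small SPD matrix with DPOTRF, then invert it with DPOTRI.
! Assumes LAPACK and a BLAS library are linked in; the data is illustrative only.
program chol_inverse_sketch
  implicit none
  integer, parameter :: n = 3
  double precision :: a(n, n)
  integer :: info, i, j

  ! Symmetric positive definite test matrix (hypothetical values).
  a = reshape((/ 4.0d0, 2.0d0, 2.0d0,  &
                 2.0d0, 5.0d0, 3.0d0,  &
                 2.0d0, 3.0d0, 6.0d0 /), (/ n, n /))

  ! Cholesky factorization A = L * L**T; only the lower triangle is referenced.
  call dpotrf('L', n, a, n, info)
  if (info /= 0) then
     print *, 'DPOTRF failed, info = ', info
     stop
  end if

  ! Inverse of A computed from its Cholesky factor, overwriting the lower triangle.
  call dpotri('L', n, a, n, info)
  if (info /= 0) then
     print *, 'DPOTRI failed, info = ', info
     stop
  end if

  ! DPOTRI fills only the requested triangle, so mirror it to get the full inverse.
  do j = 1, n
     do i = j + 1, n
        a(j, i) = a(i, j)
     end do
  end do

  do i = 1, n
     print '(3f12.6)', (a(i, j), j = 1, n)
  end do
end program chol_inverse_sketch
```

Nothing in this code refers to threads. With a threaded BLAS, the thread count is picked up outside the program, for example from the environment variables mentioned above.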
Finally, if I were to put together a server environment that would do the
computations for Windows clients, what would be the optimal
hardware/software setup for my problem? Would it be a Linux cluster? We
have a high performance computing center that has HP Unix workstations with
up to 32 processors. Would that be a good hardware setup?
On a cluster you have to use ScaLAPACK. ScaLAPACK will use message passing
(MPI) for communication between the nodes. Your performance will also depend
on the interconnect. The faster the processors and the interconnect,
the better the performance. However, factorization of small matrices will not scale
to large numbers of processors. You are probably better off staying on small
shared memory boxes.