by **Julien Langou** » Mon Nov 21, 2016 3:47 pm

Algorithm

You have P-by-Q tiles.

(step 1) Set a vector of size P*Q to store the Frobenius norms of each tile.

(step 2) Call LANGE on each tile in parallel

(step 3) Use NRM2 (in sequential) on the vector of norms to get the results

Note:

(1) step #3 done with NRM2 is a reduction, the reduction operation is LAPY2, z = sqrt( x^2 +y^2 ). You can parallelize the reduction using LAPY2 (or LAPY3 or smaller NRM2()), I do not think this is worth it in the shared memory case, just mentioning it. Using a reduction with LAPY2 is the way to go in parallel distributed.

(2) step #3 uses NRM2. You may use LANGE instead of NRM2 if you prefer to work with a P-by-Q `matrix` of norms instead of a P*Q `vector` of norms.

(3) it is not a good idea to use something like DOT() or SUM( A(I,J)^2 ) to directly compute the squared norm of each tile (and then summing these squares, and then square-rooting). Doing so may lead to what is called `unnecessary` underflow with some specific input data. (See Note 5 below.)

(4) LAPY2, LAPY3, LANGE, and NRM2 guarantee no `unnecessary` underflow

(5) By `unnecessary` underflow we mean that if we want to compute the norm of [ 1e200 1e200 1e200 1e200], then the answer is 2e200. If we square and use the basic approach, we get `Inf` as result (in double precision arithmetic). (Because (1e200)^2=1e400=overflow in double precision, so (1e200)^2= Inf.) If you use NRM2, you get 2e200. (NRM2 `simply` scales appropriately the data.) So NRM2 returns the correct results, while a `basic` approach does not. Note that the input is made of valid floating point numbers, the desired output is a valid floating point numbers. So this problem deserves to be correctly computed.

(6) As far as note #3, Fred Gustavson once proposed the `fast path` approach. The fast path approach applied to LANGE would be to indeed go ahead with DOT() on the tiles to get their squared norms. If ever we see Inf as a result, well, then we will use LANGE. So we use LANGE (the careful but slow implementation) as needed on case-by-case basis. DOT is several order faster than LANGE. LANGE is really careful to not overflow, but this leads to a really slow code. In `most cases`, the `fast path` approach is good enough and does not overflow and so is a big win in term of time. So the idea to mix them. It would be interesting (and not too hard) to implement this in PLASMA.

(7) Fred Gustavson also proposed to use `fast path` approach to speedup DGECON by using DLATRS instead of DTRSM. Same thing. DLATRS is much much slower than DTRSM. But is safer. However being safe is not really needed for most of the input data, so the idea is to be careless. If NaNs or Infs, then redo the operation again carefully.

Julien.