Hi,
Firstly, thanks for your continued work on LAPACK!
We are using LAPACK in a highly multithreaded context and we appear to have
found a race condition in DLAMCH.
We use g77 on x86_64 (both AMD and Intel processors) under Linux. We have
built our own version of LAPACK 3.1.1, although the problem also occurs under
the vendor (Red Hat/Fedora) supplied version of LAPACK 3.1.0.
We are using DGELSD in one of our algorithms. This algorithm calls DGELSD
tens of thousands of times from 8 different threads with relatively small
matrices (in the order of 2-10x10-100). Often, the matrices passed are quite
badly behaved (very large and/or very small elements, dernormal numbers,
etc).
We have observed an infinite loop in the DLASCL routine used internally from
DGELSD. The loop that attempts to scale the vector several times without
under/overflow if scaling it all at once would cause under/overflow never
terminates (the condition DONE never reaches true).
The bug is not reproducible on a single threaded machine, and is very rare
even with 8 threads (on an 8 core machine). It seems to be that the call
SMLNUM = DLAMCH( 'S' )
is not giving the same value all the time (I don't have the exact values; it
is based upon second hand reports).
When I look at the call to DLAMCH, it contains a guarded initialization:
IF( FIRST ) THEN
CALL DLAMC2( BETA, IT, LRND, EPS, IMIN, RMIN, IMAX, RMAX )
BASE = BETA
T = IT
IF( LRND ) THEN
RND = ONE
EPS = ( BASE**( 1-IT ) ) / 2
ELSE
RND = ZERO
EPS = BASE**( 1-IT )
END IF
PREC = EPS*BASE
EMIN = IMIN
EMAX = IMAX
SFMIN = RMIN
SMALL = ONE / RMAX
IF( SMALL.GE.SFMIN ) THEN
*
* Use SMALL plus a bit, to avoid the possibility of rounding
* causing overflow when computing 1/sfmin.
*
SFMIN = SMALL*( ONE+EPS )
END IF
END IF
...
FIRST = .FALSE.
The race condition is in the setting of SFMIN. What I believe is happening is
that two threads arrive in DLAMCH with FIRST set to true, and so both proceed
to initialize the saved versions of the parameters.
Note that SFMIN is inialized once:
SFMIN = RMIN
and then potentially modified:
SFMIN = SMALL*( ONE+EPS )
If the threads execute in the following sequence:
1. thread #1 sets SFMIN to RMIN
2. thread #1 completes the initialization (so that SFMIN = SMALL * (ONE+EPS))
3. thread #2 resets SFMIN to RMIN
4. thread #1 returns SFMIN (which is currently RMIN)
5. thread #2 completes the initialization (so that SFMIN = SMALL * (ONE+EPS))
6. thread #2 returns SFMIN (which is SMALL * (ONE+EPS) as required)
then thread 1 will get an invalid value for SFMIN which will overflow when 1 /
SFMIN is calculated, which leads to the infinite loop.
We are fixing it for our application by calling both slamch and dlamch at
program initialization time before the threads have been created, to ensure
that FIRST is not set in a multithreaded context (this will allow us to
function correctly without requiring changes on top of LAPACK 3.1.0).
A fix for LAPACK would be to do something like:
SMALL = ONE / RMAX
IF( SMALL.GE.RMIN ) THEN
*
* Use SMALL plus a bit, to avoid the possibility of rounding
* causing overflow when computing 1/sfmin.
*
SFMIN = SMALL*( ONE+EPS )
ELSE
SFMIN = RMIN
END IF
With this fix, SFMIN only gets set once and will only have one value (due to
the idempotence of assignment).
There are potentially other problems in the *lamch.f functions of a similar
nature (I have not checked). Wherever a saved value is written twice with
different values, there is a potential race condition.
Cheers,
Jeremy Barnes
Research and Development
Idilia Inc
Montreal, Quebec, Canada
|