LAPACK Archives

[Lapack] Race condition in *LAMCH causes infinite loop in *LASCL in LAPA


Firstly, thanks for your continued work on LAPACK!

We are using LAPACK in a highly multithreaded context and we appear to have 
found a race condition in DLAMCH.

We use g77 on x86_64 (both AMD and Intel processors) under Linux.  We have 
built our own version of LAPACK 3.1.1, although the problem also occurs under 
the vendor (Red Hat/Fedora) supplied version of LAPACK 3.1.0.

We are using DGELSD in one of our algorithms.  This algorithm calls DGELSD 
tens of thousands of times from 8 different threads with relatively small 
matrices (in the order of 2-10x10-100).  Often, the matrices passed are quite 
badly behaved (very large and/or very small elements, dernormal numbers, 

We have observed an infinite loop in the DLASCL routine used internally from 
DGELSD.  The loop that attempts to scale the vector several times without 
under/overflow if scaling it all at once would cause under/overflow never 
terminates (the condition DONE never reaches true).

The bug is not reproducible on a single threaded machine, and is very rare 
even with 8 threads (on an 8 core machine).  It seems to be that the call

      SMLNUM = DLAMCH( 'S' )

is not giving the same value all the time (I don't have the exact values; it 
is based upon second hand reports).

When I look at the call to DLAMCH, it contains a guarded initialization:

      IF( FIRST ) THEN
         BASE = BETA
         T = IT
         IF( LRND ) THEN
            RND = ONE
            EPS = ( BASE**( 1-IT ) ) / 2
            RND = ZERO
            EPS = BASE**( 1-IT )
         END IF
         PREC = EPS*BASE
         EMIN = IMIN
         EMAX = IMAX
         SFMIN = RMIN
         SMALL = ONE / RMAX
*           Use SMALL plus a bit, to avoid the possibility of rounding
*           causing overflow when computing  1/sfmin.
            SFMIN = SMALL*( ONE+EPS )
         END IF
      END IF


      FIRST  = .FALSE.

The race condition is in the setting of SFMIN.  What I believe is happening is 
that two threads arrive in DLAMCH with FIRST set to true, and so both proceed 
to initialize the saved versions of the parameters.

Note that SFMIN is inialized once:

         SFMIN = RMIN

and then potentially modified:

            SFMIN = SMALL*( ONE+EPS )

If the threads execute in the following sequence:

1.  thread #1 sets SFMIN to RMIN
2.  thread #1 completes the initialization (so that SFMIN = SMALL * (ONE+EPS))
3.  thread #2 resets SFMIN to RMIN
4.  thread #1 returns SFMIN (which is currently RMIN)
5.  thread #2 completes the initialization (so that SFMIN = SMALL * (ONE+EPS))
6.  thread #2 returns SFMIN (which is SMALL * (ONE+EPS) as required)

then thread 1 will get an invalid value for SFMIN which will overflow when 1 / 
SFMIN is calculated, which leads to the infinite loop.

We are fixing it for our application by calling both slamch and dlamch at 
program initialization time before the threads have been created, to ensure 
that FIRST is not set in a multithreaded context (this will allow us to 
function correctly without requiring changes on top of LAPACK 3.1.0).

A fix for LAPACK would be to do something like:

         SMALL = ONE / RMAX
*           Use SMALL plus a bit, to avoid the possibility of rounding
*           causing overflow when computing  1/sfmin.
            SFMIN = SMALL*( ONE+EPS )
            SFMIN = RMIN
         END IF

With this fix, SFMIN only gets set once and will only have one value (due to 
the idempotence of assignment).

There are potentially other problems in the *lamch.f functions of a similar 
nature (I have not checked).  Wherever a saved value is written twice with 
different values, there is a potential race condition.

Jeremy Barnes

Research and Development
Idilia Inc
Montreal, Quebec, Canada

<Prev in Thread] Current Thread [Next in Thread>

For additional information you may use the LAPACK/ScaLAPACK Forum.
Or one of the mailing lists, or