I tried to solve a least squares problem of size 1e6-by-1e3 on a machine with 12 Xeon cores. DGELSD took 152.8s to solve the problem, while DGELSY took 223.8s. I tested both ATLAS+LAPACK (3.3.0) and MATLAB's BLAS/LAPACK, and DGELSD was faster than DGELSY in both cases. However, according to the LAPACK benchmark (http://www.netlib.org/lapack/lug/node71.html), DGELSY is almost as fast as DGELS and significantly faster than DGELSD. Is this because DGELSD has better multi-threading support, or because of later improvements to DGELSD?
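For anyone who wants to reproduce the comparison on a smaller scale, here is a minimal sketch of how the two drivers could be timed side by side. It is not the setup I used: it assumes SciPy is available and relies on `scipy.linalg.lstsq`, which lets you pick the LAPACK driver through its `lapack_driver` argument. The helper name `time_lstsq_drivers`, the random test matrix, and the problem sizes are all made up for illustration.

```python
import time

import numpy as np
from scipy.linalg import lstsq


def time_lstsq_drivers(m, n, drivers=("gelsd", "gelsy"), seed=0):
    """Solve one random m-by-n least squares problem with each LAPACK driver.

    Hypothetical helper: returns {driver: (seconds, solution, rank)}.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))  # dense, full-rank test matrix (assumption)
    b = rng.standard_normal(m)

    results = {}
    for driver in drivers:
        start = time.perf_counter()
        # lstsq returns (x, residues, rank, singular_values);
        # singular_values is None for the 'gelsy' driver.
        x, _, rank, _ = lstsq(A, b, lapack_driver=driver)
        elapsed = time.perf_counter() - start
        results[driver] = (elapsed, x, rank)
    return results


if __name__ == "__main__":
    # Much smaller than 1e6-by-1e3 so that it finishes quickly.
    for driver, (elapsed, _, rank) in time_lstsq_drivers(20000, 200).items():
        print(f"{driver}: {elapsed:.3f}s (rank {rank})")
```

With many BLAS builds the number of threads can be limited through environment variables such as `OMP_NUM_THREADS`, so running the same script with one thread and with all cores might help separate threading effects from algorithmic differences between the two drivers.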