Lapack test failed in Magma 2.2

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)

Postby organicchemistry_01 » Mon Dec 26, 2016 3:14 am

I built MAGMA following the make.inc.mkl-gcc example on a system with:

Pascal GPU (GTX 1060)
MKL 2017
CUDA 8.0
GCC 4.9
Dual Intel Xeon (32 logical cores)

I ran the tests provided in the MAGMA source directory and got about 10x the performance with cuBLAS compared to the CPU BLAS. All good, but when it reached the LAPACK testing I got many failed results that seem to occur regardless of matrix size: some large matrices passed, but most failed. About 1/4 of the LAPACK tests fail.

So I thought maybe it was just an MKL problem and switched to OpenBLAS + GCC 4.9 + CUDA 8, using the OpenBLAS GCC make.inc of course. However, I got the same results: the LAPACK failures occur exactly where the MAGMA + MKL build fails.

I couldn't see any problem in the supplied make.inc examples, as all cuBLAS-related tests passed with flying colors, yet about 1/4 of the LAPACK tests fail with either MKL or OpenBLAS. How can this be resolved?
organicchemistry_01
 
Posts: 4
Joined: Mon Dec 26, 2016 2:21 am

Re: Lapack test failed in Magma 2.2

Postby mgates3 » Wed Dec 28, 2016 3:07 pm

Which routines passed & which failed? Can you post failures of some routines? Please include the complete input & output so we know what command line you used. Please also include your make.inc file and any environment variables you set (e.g., CUDADIR, GPU_TARGET).

I assume this is on Linux?

-mark
mgates3
 
Posts: 750
Joined: Fri Jan 06, 2012 2:13 pm

Re: Lapack test failed in Magma 2.2

Postby organicchemistry_01 » Thu Dec 29, 2016 5:03 pm

Hi,

Yes, this is on Linux.

The command line I used for testing is

Code: Select all
./run_tests.py


in the testing directory

The make.inc is make.inc.mkl-gcc. I did not change anything in it except the paths and GPU_TARGET. Here it is:

Code: Select all
GPU_TARGET ?= Pascal

CC        = gcc
CXX       = g++
NVCC      = /usr/local/cuda/bin/nvcc
FORT      = gfortran

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

FPIC      = -fPIC

CFLAGS    = -O3 $(FPIC) -fopenmp -DNDEBUG -DADD_ -Wall -Wshadow -DMAGMA_WITH_MKL
FFLAGS    = -O3 $(FPIC)          -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument
F90FLAGS  = -O3 $(FPIC)          -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument -x f95-cpp-input
NVCCFLAGS = -O3                  -DNDEBUG -DADD_ -Xcompiler "$(FPIC) -Wall -Wno-unused-function"
LDFLAGS   =     $(FPIC) -fopenmp

CXXFLAGS := $(CFLAGS) -std=c++11
CFLAGS   += -std=c99

LIB       = -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread -lstdc++ -lm -lgfortran
LIB      += -lcublas -lcusparse -lcudart -lcudadevrt

MKLROOT ?= /opt/intel/mkl
CUDADIR ?= /usr/local/cuda
-include make.check-mkl
-include make.check-cuda

LIBDIR    = -L$(CUDADIR)/lib64 \
            -L$(MKLROOT)/lib/intel64

INC       = -I$(CUDADIR)/include \
            -I$(MKLROOT)/include


I did the same for make.inc.openblas but updated the directories and nvcc path.

I would like to attach the ./run_tests.py output, but our workstations are down for December/January maintenance. There is nothing unusual in the table printed by ./run_tests.py: all the cuBLAS-related error checks passed, but when it reached the LAPACK error check (last column), too many tests failed.

As far as I can remember, the failures are in the tests that report a LAPACK error check in the last column, from the
Code: Select all
testing_c**** routine


However, not every step fails; the failures mostly involve larger matrices. Failed tests kept appearing once it reached those routines, which I find unusual given that every cuBLAS error check passed. I terminated the run before it could reach other routines.

I hope this helps in finding a possible solution. I suspect there is just something missing in the make.inc file?
organicchemistry_01
 
Posts: 4
Joined: Mon Dec 26, 2016 2:21 am

Re: Lapack test failed in Magma 2.2

Postby organicchemistry_01 » Fri Jan 06, 2017 11:21 pm

I am now attaching a more detailed output of ./run_tests.py

There were about 5k failed tests against 150k passed, which I would say is a remarkable pass rate.

Here is a short summary of failed tests:

1. testing_zgemv on 600x1 matrix
2. testing_*trmv on CUBLAS error
3. testing_*trsm on LAPACK error

However, I could not finish all the tests because they take too long. If you would like, I can continue the tests, but I don't know how to restart them where they left off.
Attachments
lapackerrors.tar.gz
Magma 2.2 failed tests
organicchemistry_01
 
Posts: 4
Joined: Mon Dec 26, 2016 2:21 am

Re: Lapack test failed in Magma 2.2

Postby mgates3 » Sat Jan 07, 2017 3:55 pm

Thanks.
The output is a bit garbled in places from mixing stdout and stderr. For future reference, you can redirect output into a file which should avoid that issue. You can also select smaller tests to run it faster. The default is --small --medium --large (-s -m -l).
Code: Select all
run_tests.py -s -m > results.txt
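
If the two streams still need to be kept apart, each can go to its own file. A minimal Python sketch using only the standard library; the helper name `run_to_files` and the output file names are hypothetical and not part of MAGMA:

```python
import subprocess
import sys


def run_to_files(cmd, out_path, err_path):
    """Run cmd, writing stdout and stderr to separate files.

    Hypothetical helper for illustration; not part of MAGMA.
    Returns the exit code of the command.
    """
    with open(out_path, "w") as out, open(err_path, "w") as err:
        return subprocess.run(cmd, stdout=out, stderr=err).returncode
```

From the testing directory, a call such as `run_to_files([sys.executable, "run_tests.py", "-s", "-m"], "results.txt", "errors.txt")` would then leave the results table in results.txt and any error messages in errors.txt.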


Mostly, the "failures" are caused by having the tolerance a bit too low. The default is 30. Using 100 will eliminate a lot of these issues. A few routines -- notably trsm -- don't have very tight error bounds yet, so may require a higher tolerance than that even.
Code: Select all
run_tests.py -s -m --tol 100 > results.txt


Fortunately, you can see what the results would be with a different tolerance without re-running them. Use
Code: Select all
run_summarize.py --tol 100 lapackerrors.txt > results100.txt
run_summarize.py --tol 200 lapackerrors.txt > results200.txt

This does several things:
  • Finds errors like "3.34e-06" and adds error/eps after it in { } braces, like "3.34e-06 { 56.0}". That (error/eps) number is what is tested against tolerance. So in this case, 56.0 > 30, the default tolerance, so it would fail, but it's less than 100.
  • Changes "failed" to "suspect" if all the (error/eps) are less than the new tolerance.
  • Sorts failures into categories: okay, errors (segfaults), failed, suspicious, known failures. Most of the failures that you observed are in the known failures, and come from 4 routines: trsm, gesv_rbt, geqr2x version 2 and 4, and gegqr. We need to fix the error check for trsm. See BUGS.txt about others.
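
The first two steps can be sketched in a few lines of Python. This is not the code in run_summarize.py, only an illustration. It assumes eps is the single-precision unit roundoff 2^-24 (about 5.96e-08), which reproduces the 56.0 in the example above; double-precision routines would need a different eps. The names `ratios`, `annotate`, and `relabel` are hypothetical.

```python
import re

# Assumption: single-precision unit roundoff, 2**-24 (about 5.96e-08).
EPS_SINGLE = 2.0 ** -24

# Matches numbers printed in exponent form, such as "3.34e-06".
ERROR_RE = re.compile(r"\d+\.\d+e[+-]\d+")


def ratios(line, eps=EPS_SINGLE):
    """Return error/eps for every exponent-form number in line."""
    return [float(m) / eps for m in ERROR_RE.findall(line)]


def annotate(line, eps=EPS_SINGLE):
    """Append "{error/eps}" after each error value in line."""
    return ERROR_RE.sub(
        lambda m: "%s {%5.1f}" % (m.group(0), float(m.group(0)) / eps), line
    )


def relabel(line, tol, eps=EPS_SINGLE):
    """Change "failed" to "suspect" when every error/eps is below tol."""
    found = ratios(line, eps)
    if found and all(r < tol for r in found):
        return line.replace("failed", "suspect")
    return line


line = "1024   3.34e-06   failed"
print(annotate(line))       # 1024   3.34e-06 { 56.0}   failed
print(relabel(line, 30))    # 1024   3.34e-06   failed
print(relabel(line, 100))   # 1024   3.34e-06   suspect
```

With the default tolerance of 30 the line stays "failed" because 56.0 > 30; with 100 it is downgraded to "suspect".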

There are a few errors to look into here. zheevd version 3, which is actually zheevr (MRRR), seems to have some issues. zgemv had one weird error.

-mark
mgates3
 
Posts: 750
Joined: Fri Jan 06, 2012 2:13 pm

Re: Lapack test failed in Magma 2.2

Postby mgates3 » Sat Jan 07, 2017 4:06 pm

Also, if you are interested, to restart it near where it left off, use the --start option.
Code: Select all
run_tests.py --start testing_zhetrd


I usually run smaller groups of routines together, e.g.,
Code: Select all
run_tests.py --blas > blas.txt
run_tests.py --aux > aux.txt
run_tests.py --chol > chol.txt

and so on.
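
Running the groups one after another can also be scripted. A minimal Python sketch, assuming it is started from the testing directory; only the three group flags shown above are used, and the helper names `group_command` and `run_groups` are hypothetical:

```python
import subprocess
import sys

# Group flags taken from the commands above; extend the list as needed.
GROUPS = ["blas", "aux", "chol"]


def group_command(group, script="run_tests.py"):
    """Build the command line for one group of routines."""
    return [sys.executable, script, "--" + group]


def run_groups(groups=GROUPS, script="run_tests.py"):
    """Run each group in turn, writing its stdout to <group>.txt.

    Returns a dict mapping group name to exit code.
    """
    codes = {}
    for group in groups:
        with open(group + ".txt", "w") as out:
            codes[group] = subprocess.run(
                group_command(group, script), stdout=out
            ).returncode
    return codes
```

Calling `run_groups()` would produce blas.txt, aux.txt, and chol.txt, matching the three commands above.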

-mark
mgates3
 
Posts: 750
Joined: Fri Jan 06, 2012 2:13 pm

