CUDA_VISIBLE_DEVICES limits which devices are available. It is not a count of the number of devices to use.
MAGMA_NUM_GPUS is a count of the number of devices to use for multi-GPU routines that don't take ngpu as an argument.
It's a bit unclear exactly what you did, so please post the exact settings you used. If you can reproduce the problem using one of our provided testers, that would also be helpful. Please see the web pages I linked to that describe CUDA_VISIBLE_DEVICES; the Acceleware one in particular gives some examples.
Here's an example using 2 GPUs, which are physical devices 0 and 2. To MAGMA, these appear as devices 0 and 1, but you can verify with nvidia-smi which physical devices it is actually running on.
magma-trunk/testing> setenv MAGMA_NUM_GPUS 2
magma-trunk/testing> setenv CUDA_VISIBLE_DEVICES 0,2
magma-trunk/testing> ./testing_cgesv -N 1000 -N 30000
MAGMA 1.6.2 svn compiled for CUDA capability >= 3.5
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.3, MKL threads 16.
device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]
ngpu 2
N NRHS CPU Gflop/s (sec) GPU GFlop/s (sec) ||B - AX|| / N*||A||*||X||
================================================================================
1000 1 --- ( --- ) 7.80 ( 0.34) 2.96e-10 ok
30000 1 --- ( --- ) 3393.90 ( 21.22) 2.25e-10 ok
[meanwhile, on another terminal]
magma-trunk/testing> nvidia-smi
...
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 51461 C ./testing_cgesv 96MiB |
| 2 51461 C ./testing_cgesv 96MiB |
+-----------------------------------------------------------------------------+
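The renumbering in the example above (physical devices 0 and 2 showing up in MAGMA as devices 0 and 1) can be sketched with a small shell helper. This is only an illustration, written in POSIX sh rather than the csh used in the session; `map_visible_devices` is a hypothetical name, not part of MAGMA or CUDA, and it assumes the list holds plain integer indices, all of which are valid devices.

```shell
#!/bin/sh
# Hypothetical helper (not part of MAGMA or CUDA): print the logical index
# a program sees for each physical device listed in a CUDA_VISIBLE_DEVICES
# style string. Assumes a comma-separated list of valid integer indices.
map_visible_devices() {
    logical=0
    old_ifs=$IFS
    IFS=,
    for physical in $1; do
        echo "logical $logical -> physical $physical"
        logical=$((logical + 1))
    done
    IFS=$old_ifs
}

# Same list as in the session above.
map_visible_devices "0,2"
```

For the list `0,2` this prints one line per visible device, with logical 1 mapped to physical 2, which matches the GPU column that nvidia-smi reports. In bash or sh, the equivalent of the `setenv` lines is `export CUDA_VISIBLE_DEVICES=0,2` and `export MAGMA_NUM_GPUS=2`.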
Because of the increased communication overhead when using multiple GPUs, the benefits aren't seen until the matrix size gets rather large, say, N=30000; for a small size such as N=1000, adding GPUs actually slows it down. Where the crossover falls depends a lot on your CPU, GPUs, and PCI bus. Here are results with 1 to 3 K40c GPUs and a 2x8 core Intel Sandy Bridge Xeon.
magma-trunk/testing> setenv MAGMA_NUM_GPUS 1
bunsen magma-trunk/testing> ./testing_cgesv -N 1000 -N 30000
MAGMA 1.6.2 svn compiled for CUDA capability >= 3.5
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.3, MKL threads 16.
device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 2: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]
ngpu 1
N NRHS CPU Gflop/s (sec) GPU GFlop/s (sec) ||B - AX|| / N*||A||*||X||
================================================================================
1000 1 --- ( --- ) 112.15 ( 0.02) 3.44e-10 ok
30000 1 --- ( --- ) 2250.70 ( 31.99) 2.76e-10 ok
magma-trunk/testing> setenv MAGMA_NUM_GPUS 2
magma-trunk/testing> ./testing_cgesv -N 1000 -N 30000
MAGMA 1.6.2 svn compiled for CUDA capability >= 3.5
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.3, MKL threads 16.
device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 2: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]
ngpu 2
N NRHS CPU Gflop/s (sec) GPU GFlop/s (sec) ||B - AX|| / N*||A||*||X||
================================================================================
1000 1 --- ( --- ) 7.82 ( 0.34) 2.96e-10 ok
30000 1 --- ( --- ) 3457.67 ( 20.83) 2.25e-10 ok
magma-trunk/testing> setenv MAGMA_NUM_GPUS 3
magma-trunk/testing> ./testing_cgesv -N 1000 -N 30000
MAGMA 1.6.2 svn compiled for CUDA capability >= 3.5
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.3, MKL threads 16.
device 0: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 1: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
device 2: Tesla K40c, 745.0 MHz clock, 11519.6 MB memory, capability 3.5
Usage: ./testing_cgesv [options] [-h|--help]
ngpu 3
N NRHS CPU Gflop/s (sec) GPU GFlop/s (sec) ||B - AX|| / N*||A||*||X||
================================================================================
1000 1 --- ( --- ) 3.67 ( 0.73) 2.96e-10 ok
30000 1 --- ( --- ) 3516.41 ( 20.48) 2.19e-10 ok
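To put numbers on the scaling, here is a rough sketch that computes the speedup over the 1-GPU run from the N=30000 times reported above (31.99, 20.83, and 20.48 seconds). The `speedup` function is a hypothetical helper written for this post, not something the testers provide, and it is in POSIX sh with awk rather than csh.

```shell
#!/bin/sh
# Hypothetical helper: speedup BASE_SECONDS SECONDS
# Prints BASE_SECONDS / SECONDS rounded to two decimals.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# ngpu:seconds pairs copied from the N=30000 rows of the runs above.
for entry in 1:31.99 2:20.83 3:20.48; do
    ngpu=${entry%%:*}
    secs=${entry#*:}
    echo "ngpu=$ngpu speedup=$(speedup 31.99 "$secs")"
done
```

On these timings the second GPU gives roughly a 1.5x speedup, and the third adds very little on top of that, so on this machine even N=30000 is not large enough to keep 3 GPUs busy.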