Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
Posts: 2
Joined: Sat Sep 20, 2014 12:04 pm


Post by RobReed » Sat Sep 20, 2014 12:11 pm

Hi All,

I am doing some development work on ARM based architectures for our group working at ATLAS, CERN. I was wondering if anyone has been able to get MAGMA to compile on ARM?

I don't want to dive straight into it without seeing what's been done first. I saw one other post about cross-compiling MAGMA for ARM, but it was for the CARMA kit. I can build directly on the board, no problem :)

Just for interest's sake, I have the Tegra K1 development board.

Thanks in advance!

Stan Tomov
Posts: 283
Joined: Fri Aug 21, 2009 10:39 pm


Post by Stan Tomov » Sat Sep 20, 2014 8:22 pm

We have been able to compile on ARM, and in particular on the TK1 development board that you mentioned. We now compile directly on the TK1 and everything works out of the box, but performance can be further optimized, and we are developing a MAGMA Embedded version to address that. The emphasis there is on entirely GPU implementations, in contrast to MAGMA's hybrid algorithms that use both GPUs and CPUs. We are interested to hear more about the applications you are targeting and the linear algebra you need for them.

Posts: 2
Joined: Sat Sep 20, 2014 12:04 pm


Post by RobReed » Tue Sep 23, 2014 3:52 am

Hi Stan,

That's great news. At least I know it's been done, so I am not walking down a dead end. The exact details are still not quite clear, but in general we are looking at using ARM/GPU systems to do out-of-band energy reconstruction for proton-proton collisions. This is a pseudo-live energy reconstruction check which will allow almost real-time adjustment of the in-band algorithm parameters. This hasn't been done before and would remove the need to recalibrate, since it's done on the fly.

Do you have any idea when you expect the embedded version to be available? A beta version, perhaps? Did you have any major difficulties getting MAGMA to compile on the TK1?


Stan Tomov
Posts: 283
Joined: Fri Aug 21, 2009 10:39 pm


Post by Stan Tomov » Tue Sep 23, 2014 3:03 pm

Hi Rob,
Thanks for the info. We are targeting a middle of November release, but can provide specific routines in advance if you want to test.
There were no problems with the compilation - we just built LAPACK with the reference BLAS for ARM, with a make.inc looking like this:


#   -- MAGMA (version 1.5.0-beta1) --
#      Univ. of Tennessee, Knoxville
#      Univ. of California, Berkeley
#      Univ. of Colorado, Denver
#      @date April 2014

# GPU_TARGET contains one or more of Tesla, Fermi, or Kepler,
# to specify for which GPUs you want to compile MAGMA:
#     Tesla  - NVIDIA compute capability 1.x cards
#     Fermi  - NVIDIA compute capability 2.x cards
#     Kepler - NVIDIA compute capability 3.x cards
# The default is all, "Tesla Fermi Kepler".
# See
#GPU_TARGET ?= Tesla Fermi Kepler
GPU_TARGET ?= Kepler

CC        = gcc
NVCC      = /usr/local/cuda-6.0/bin/nvcc
FORT      = gfortran

ARCH      = ar
RANLIB    = ranlib

F77OPTS   = -O3 -DADD_
FOPTS     = -O3 -DADD_ -x f95-cpp-input
NVOPTS    = -m32 -O3 -DADD_ -Xcompiler -fno-strict-aliasing
LDOPTS    = -fopenmp

# Depending on how ATLAS and LAPACK were compiled, you may need one or more of:
# -lifcore -ldl -lf2c -lgfortran
LIB       = -llapack -lrefblas -lcublas -lcudart -lstdc++ -lm -lgfortran

# define library directories here or in your environment
LAPACKDIR ?= /home/tomov/LIBS/lapack-3.4.2
ATLASDIR  ?= /home/tomov/LIBS/lapack-3.4.2
CUDADIR   ?= /usr/local/cuda-6.0
-include make.check-atlas
-include make.check-cuda

LIBDIR    = -L$(LAPACKDIR) \
            -L$(ATLASDIR)/lib \
            -L$(CUDADIR)/lib

INC       = -I$(CUDADIR)/include

but the performance is very low. Without modifications to the code, I think something happens with the unified memory feature of the GPU: explicit ARM-GPU data transfers are very slow, making the overall code, e.g. LU in single precision, run at less than 1 GFlop/s. The system is also probably making decisions about where to allocate memory.

In the testing_sgemm example, we allocate memory on the CPU and the GPU, initialize the matrices on the CPU, copy them to the GPU, and run sgemm on the GPU. It turns out performance is lower if the CPU memory is not pinned (~15-20 GFlop/s vs ~100 GFlop/s), while our expectation was that there should be no effect on performance for an entirely-GPU code like the gemm. We also tried entirely-GPU codes with no transfers, where the memory is allocated only on the GPU, and those can run at above 100 GFlop/s.
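The pinned-vs-pageable comparison Stan describes can be sketched roughly as follows. This is a minimal illustration, not MAGMA's testing_sgemm itself; sizes are arbitrary and error checking is omitted. It assumes the CUDA runtime and cuBLAS as shipped with CUDA 6.0.

```cuda
// Sketch: the host-to-device copy feeding a GPU-only sgemm is staged from
// either pageable or pinned (page-locked) host memory. On the TK1, the
// pageable path was observed to be much slower, even though the sgemm
// itself runs entirely on the GPU.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;                       // arbitrary example size
    const size_t bytes = (size_t)n * n * sizeof(float);

    float *h_pageable = (float*)malloc(bytes);  // ordinary host memory
    float *h_pinned;
    cudaMallocHost(&h_pinned, bytes);           // page-locked host memory

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Time these two copies separately to see the pinned/pageable gap.
    cudaMemcpy(d_A, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_pinned,   bytes, cudaMemcpyHostToDevice);

    // GPU-only gemm: no host involvement once the data is resident.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```

In MAGMA terms, pinned host buffers come from magma_smalloc_pinned rather than malloc, which is what the testing harness uses when pinning is enabled.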

Posts: 1
Joined: Thu Oct 02, 2014 12:58 pm


Post by c0g » Thu Oct 02, 2014 1:02 pm


I'm also using MAGMA on ARM to speed up some flying Gaussian-process robots I'm working on. Specifically, I'm using the TK1.
The two relatively heavyweight things I do are Cholesky decompositions and the attendant triangular solves (solve_chol), as my matrices are always PD. The rest is just vector-vector and matrix-vector stuff.
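For that Cholesky-then-solve pattern, the MAGMA call sequence looks roughly like this. A sketch only, assuming the MAGMA 1.x C API and that the SPD matrix A and right-hand sides B are already resident on the GPU; error handling via info is omitted.

```c
// Sketch: factor an SPD system and solve it entirely on the GPU with
// MAGMA's spotrf/spotrs pair. dA and dB are device pointers holding
// A (n x n, SPD) and B (n x nrhs); X overwrites dB.
#include "magma.h"

void chol_solve(magma_int_t n, magma_int_t nrhs,
                float *dA, magma_int_t ldda,
                float *dB, magma_int_t lddb)
{
    magma_int_t info;

    // A = L * L^T, lower-triangular factor computed in place on the GPU
    magma_spotrf_gpu(MagmaLower, n, dA, ldda, &info);

    // Two triangular solves against the factor: A * X = B
    magma_spotrs_gpu(MagmaLower, n, nrhs, dA, ldda, dB, lddb, &info);
}
```

Keeping both calls on the `_gpu` variants avoids the host round-trips that, per the earlier posts in this thread, dominate runtime on the TK1.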

Looking forward to the next release if this is being worked on. I'm happy to help test/profile since all my TK1 is doing at the moment is sitting there running benchmarks.

I had no problem getting it all compiled and linked using OpenBLAS as my BLAS/LAPACK library.
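For anyone replicating the OpenBLAS route, the relevant make.inc lines differ from Stan's ATLAS/reference-BLAS setup roughly as below. Paths are examples only; adapt them to wherever OpenBLAS and CUDA live on your board (MAGMA ships a make.inc.openblas template to start from).

```make
# Hypothetical paths - adjust for your installation.
OPENBLASDIR ?= /usr/local/openblas
CUDADIR     ?= /usr/local/cuda-6.0

# OpenBLAS provides both BLAS and LAPACK symbols in one library.
LIB    = -lopenblas -lcublas -lcudart -lstdc++ -lm -lgfortran
LIBDIR = -L$(OPENBLASDIR)/lib -L$(CUDADIR)/lib
INC    = -I$(CUDADIR)/include
```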


Posts: 1
Joined: Tue Dec 08, 2015 4:10 pm


Post by wbergerson » Tue Dec 08, 2015 4:19 pm

I'm looking into putting MAGMA on the new Jetson, the TX1, specifically to use the MAGMA versions of cheev_ and zheev_ from LAPACK, though I would probably use other calls as well. Those are the lead-offs, though.
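MAGMA's closest analogue of LAPACK's cheev_ is magma_cheevd, which follows the familiar LAPACK workspace-query pattern. A rough sketch under the assumption of the MAGMA 1.x C API (argument order mirrors LAPACK cheevd); error handling via info is omitted.

```c
// Sketch: eigenvalues/eigenvectors of a complex Hermitian matrix A with
// magma_cheevd. First call with lwork = -1 queries optimal workspace
// sizes; the second call does the actual computation. Eigenvalues land
// in w; eigenvectors overwrite A when MagmaVec is requested.
#include <stdlib.h>
#include "magma.h"

void hermitian_eig(magma_int_t n, magmaFloatComplex *A, magma_int_t lda,
                   float *w)
{
    magma_int_t info, lwork, lrwork, liwork;
    magmaFloatComplex wkopt;
    float rwkopt;
    magma_int_t iwkopt;

    // Workspace query
    magma_cheevd(MagmaVec, MagmaLower, n, A, lda, w,
                 &wkopt, -1, &rwkopt, -1, &iwkopt, -1, &info);

    lwork  = (magma_int_t) MAGMA_C_REAL(wkopt);
    lrwork = (magma_int_t) rwkopt;
    liwork = iwkopt;

    magmaFloatComplex *work  = malloc(lwork  * sizeof(*work));
    float             *rwork = malloc(lrwork * sizeof(*rwork));
    magma_int_t       *iwork = malloc(liwork * sizeof(*iwork));

    // Actual eigendecomposition
    magma_cheevd(MagmaVec, MagmaLower, n, A, lda, w,
                 work, lwork, rwork, lrwork, iwork, liwork, &info);

    free(work); free(rwork); free(iwork);
}
```

The zheev_ case is the same pattern with magma_zheevd and double-complex types.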

Does anybody have any updated guidance for doing an install on the Jetson beyond what's in this thread? Or has there been any progress on the MAGMA Embedded effort?

