To achieve reasonable performance, MAGMA must use a blocked algorithm, which in turn needs a workspace sized according to the block size. In some cases, MAGMA uses a larger block size than LAPACK or otherwise needs more workspace.
Incidentally, LAPACK performance will also increase greatly if the workspace is allocated based on the block size. Otherwise, LAPACK must fall back to an inefficient non-blocked version. Unfortunately, minimum workspace gives minimum performance.
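As a rough illustration of how the workspace grows with the block size, here is a small Python sketch. It assumes a QR factorization in the style of LAPACK's DGEQRF, which documents LWORK >= max(1,N) as the minimum and LWORK >= N*NB for best performance. The function name and the block size of 64 are made up for illustration, and MAGMA's own requirement may be larger than what this computes.

```python
def qr_workspace_sizes(n, nb):
    """Return (minimum, blocked) workspace lengths in elements.

    Assumes the DGEQRF-style rule: at least max(1, n) elements,
    and n * nb elements so the blocked code path can be used.
    n  = number of matrix columns
    nb = block size (hypothetical value chosen by the caller)
    """
    if n < 0 or nb < 1:
        raise ValueError("n must be >= 0 and nb must be >= 1")
    minimum = max(1, n)        # enough only for the non-blocked version
    blocked = max(1, n * nb)   # enough for the blocked version
    return minimum, blocked


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        lo, hi = qr_workspace_sizes(n, nb=64)
        # 8 bytes per double-precision element
        print(f"N={n}: minimum {lo * 8} bytes, blocked {hi * 8} bytes")
```

In practice you do not have to compute this by hand: LAPACK routines that take LWORK report their preferred size in WORK(1) when called with LWORK = -1 (a workspace query).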
Also, if your problems are really around size N=100, you will not see any improvement with MAGMA over LAPACK. There simply is not enough work for the GPU to do to overcome the cost of copying the matrix to the GPU. The matrix size needs to be greater than 1000 before you see real benefits from the GPU. You can use the executables in the testing directory to compare performance at different matrix sizes.
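To see why small matrices do not pay off, here is a back-of-the-envelope model in Python. All hardware numbers (transfer bandwidth, latency, CPU and GPU rates) are assumptions picked for illustration, not measurements, and the function name is hypothetical. It assumes an LU-like cost of (2/3)N^3 flops on a double-precision N x N matrix that is copied to the GPU and back.

```python
def offload_estimate(n, bus_gb_s=12.0, latency_s=20e-6,
                     cpu_gflops=100.0, gpu_gflops=1000.0):
    """Return (cpu_seconds, gpu_seconds) for an N x N factorization.

    All default rates are assumed values, not measured ones.
    """
    flops = (2.0 / 3.0) * n ** 3              # LU-like operation count
    bytes_moved = 2 * 8 * n * n               # copy in and out, 8 bytes each
    transfer_s = 2 * latency_s + bytes_moved / (bus_gb_s * 1e9)
    cpu_s = flops / (cpu_gflops * 1e9)
    gpu_s = flops / (gpu_gflops * 1e9) + transfer_s
    return cpu_s, gpu_s


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        cpu_s, gpu_s = offload_estimate(n)
        winner = "GPU" if gpu_s < cpu_s else "CPU"
        print(f"N={n}: cpu {cpu_s:.2e} s, gpu {gpu_s:.2e} s -> {winner}")
```

This model is optimistic for the GPU, since it assumes peak rates at every size and ignores kernel launch overhead, so the real break-even point sits higher than it suggests. The testing executables remain the way to find the actual crossover on your hardware.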