MAGMA has not targeted this type of problem so far, but we are looking into it, e.g., in connection with spectral element agglomeration AMG. I would be interested to know what your application is. Thanks.
I can think of several ways to organize the computation for these problems. One way, as you mentioned, is to use streams. This may require the matrices to be somewhat larger, though, so that each computation can be done efficiently on a multiprocessor. Another way may be to have a single thread handle each 5x5 matrix. In this case one has to think about which data structures to use, for example, interleaving 32 (or more) 5x5 matrices to ensure coalesced reads, etc.
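To make the interleaving idea concrete, here is a minimal host-side sketch in C++ of one possible layout. It is not MAGMA code: the names (interleaved_index, interleave), the group size of 32 (chosen to match a warp), and the column-major element order are all assumptions made for illustration. Element (i, j) of 32 consecutive matrices is stored in 32 consecutive memory locations, so if thread t of a warp works on matrix t, the threads read adjacent addresses whenever they access the same (i, j).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constants for this sketch (not taken from MAGMA).
constexpr int N = 5;       // each matrix is N x N
constexpr int GROUP = 32;  // matrices interleaved together, one per thread of a warp

// Offset of element (i, j) of matrix m in the interleaved array.
// Within a group, the same (i, j) of all GROUP matrices is contiguous.
inline std::size_t interleaved_index(std::size_t m, int i, int j) {
    assert(i >= 0 && i < N && j >= 0 && j < N);
    const std::size_t group = m / GROUP;  // which block of GROUP matrices
    const std::size_t lane  = m % GROUP;  // position inside the block (thread index)
    const std::size_t elem  = static_cast<std::size_t>(j) * N + i;  // column-major (i, j)
    return group * (static_cast<std::size_t>(N) * N * GROUP) + elem * GROUP + lane;
}

// Repack `count` matrices stored one after another (each column-major)
// into the interleaved layout. The last group is zero-padded if needed.
std::vector<double> interleave(const std::vector<double>& packed, std::size_t count) {
    assert(packed.size() >= count * N * N);
    const std::size_t groups = (count + GROUP - 1) / GROUP;
    std::vector<double> out(groups * N * N * GROUP, 0.0);
    for (std::size_t m = 0; m < count; ++m)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                out[interleaved_index(m, i, j)] = packed[m * N * N + j * N + i];
    return out;
}
```

In a kernel built on this layout, each thread would compute its own matrix index m from its block and thread indices and then use the same index formula for every access. Whether this beats the stream-based approach would have to be measured for the application at hand.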