Will,
This sounds good! I will be interested to hear more about it, especially since more and more users are requesting support for this type of problem.
How much work would be involved in modifying MAGMA to assign independent matrices to thread blocks? Would it be viable to give the problem to a graduate student as a thesis project? We might be able to find a willing student.
One way to modify MAGMA quickly would be to add interfaces in which the user passes in the computational streams that a routine should use. The user can then create several streams and run the independent factorizations through them. This would not involve changes in the MAGMA sources, but a single routine would not necessarily be constrained to a single multiprocessor: depending on the matrix size, it could use several. The remaining multiprocessors would not sit idle, though, as they would pick up work from the other streams.
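To make the stream idea concrete, here is a rough CPU-only sketch of what such an interface might look like. None of the names are MAGMA or CUDA API: `Stream`, `potrf_small_async`, `drain`, and `factor_batch` are hypothetical. A `Stream` here is just an in-order task queue standing in for a `cudaStream_t`, and the round-robin `drain` only mimics the device picking up work from whichever stream has some. I also assume Cholesky as the factorization, purely for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for cudaStream_t: an in-order queue of tasks.
struct Stream {
    std::deque<std::function<void()>> tasks;
};

// Unblocked lower Cholesky of a column-major n x n matrix, in place.
// Returns 0 on success, or k+1 if the leading minor of order k+1 is not positive.
int potrf_small(std::vector<double>& a, int n) {
    for (int k = 0; k < n; ++k) {
        double d = a[k + k * n];
        for (int j = 0; j < k; ++j) d -= a[k + j * n] * a[k + j * n];
        if (d <= 0.0) return k + 1;
        d = std::sqrt(d);
        a[k + k * n] = d;
        for (int i = k + 1; i < n; ++i) {
            double s = a[i + k * n];
            for (int j = 0; j < k; ++j) s -= a[i + j * n] * a[k + j * n];
            a[i + k * n] = s / d;
        }
    }
    return 0;
}

// Hypothetical interface: the caller chooses the stream; the call only enqueues work.
void potrf_small_async(std::vector<double>& a, int n, int& info, Stream& s) {
    s.tasks.push_back([&a, n, &info] { info = potrf_small(a, n); });
}

// Mimic the device: take one task at a time from each non-empty stream in turn.
void drain(std::vector<Stream>& streams) {
    bool busy = true;
    while (busy) {
        busy = false;
        for (Stream& s : streams) {
            if (s.tasks.empty()) continue;
            auto task = std::move(s.tasks.front());
            s.tasks.pop_front();
            task();
            busy = true;
        }
    }
}

// Spread a batch of independent n x n matrices over num_streams streams.
std::vector<int> factor_batch(std::vector<std::vector<double>>& batch,
                              int n, int num_streams) {
    std::vector<Stream> streams(num_streams);
    std::vector<int> info(batch.size(), -1);
    for (std::size_t i = 0; i < batch.size(); ++i)
        potrf_small_async(batch[i], n, info[i], streams[i % num_streams]);
    drain(streams);
    return info;
}
```

The point of the sketch is only the shape of the call: the routine takes the stream as an argument and returns after enqueueing, so the user decides how many streams to create and how to spread the matrices over them.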
Another way would be to enforce the use of a single multiprocessor for a single problem. This may require some coding. I guess the first approach would be better for "medium" size problems, and the second for "small" ones (small relative to the default blocking size for gemm, which is, for example, 64 for dgemm on Fermi; the problem that Jeff mentioned would probably be considered small). I think this would make an interesting thesis project - it can combine algorithm design, code optimization, and experimentation, and could probably expand into the application area the matrices come from.
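For the one-problem-per-multiprocessor variant, a CPU-only sketch of the data layout and work split might look like the following. It is not MAGMA code: `kBlockSize`, `is_small`, `lu_one_block`, and `lu_batched` are hypothetical names, the `n <= 64` cutoff is just the dgemm blocking size quoted above used as a crude rule of thumb, and I use LU without pivoting only to keep the example short. The loop index in `lu_batched` plays the role that `blockIdx.x` would play in a CUDA kernel launched with one thread block per matrix.

```cpp
#include <cstddef>
#include <vector>

// Assumed blocking size, taken from the dgemm-on-Fermi figure quoted above.
constexpr int kBlockSize = 64;

// Hypothetical rule of thumb: a problem is "small" if it fits in one tile.
bool is_small(int n) { return n <= kBlockSize; }

// Work done by one "thread block": unpivoted LU of a single column-major
// n x n matrix, in place (unit-lower L below the diagonal, U on and above).
// Returns 0 on success, or k+1 if a zero pivot is hit at step k.
int lu_one_block(double* a, int n) {
    for (int k = 0; k < n; ++k) {
        double pivot = a[k + k * n];
        if (pivot == 0.0) return k + 1;
        // Scale the column below the pivot to form the multipliers.
        for (int i = k + 1; i < n; ++i) a[i + k * n] /= pivot;
        // Rank-1 update of the trailing submatrix.
        for (int j = k + 1; j < n; ++j)
            for (int i = k + 1; i < n; ++i)
                a[i + j * n] -= a[i + k * n] * a[k + j * n];
    }
    return 0;
}

// Batched driver: matrices are stored back to back in one array, and each
// loop iteration handles exactly one matrix, as one thread block would.
std::vector<int> lu_batched(std::vector<double>& a, int n, int batch_count) {
    std::vector<int> info(batch_count, 0);
    const std::size_t stride = static_cast<std::size_t>(n) * n;
    for (int block = 0; block < batch_count; ++block)
        info[block] = lu_one_block(a.data() + block * stride, n);
    return info;
}
```

On a GPU the inner loops would be shared among the threads of the block and the matrix would ideally sit in shared memory; that is where the real coding and tuning effort for a thesis would go.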
Stan