I have a persistent but intermittent problem where code hangs forever in pdsygst for some matrix sizes on 64 cores (8x8), and perhaps on other, larger sizes. It does not occur every time, but it can happen. (I know this sounds strange, but this is a truly reproducibly irreproducible bug.) It went away for a while, but it has reappeared following some recent updates to OFED/mvapich on my cluster. I first reported it back in 2008 (viewtopic.php?f=2&t=795) without any useful responses.
Using TotalView, I can see a bit more. Some of the cores are in PDTRSM waiting for data, while others are in PDSYR2K. I wonder whether this could cause the hang, with some cores waiting for one set of data while others are waiting for something else.
N.B. I note that there is an unverified 1.7 bug reported by Joan:IBM, but I cannot find more information to see whether it is similar.