Hello.
Is there any further information or documentation about the scope of the data structures (icl_hash (for task and address lookup), bsd_tree (for FIFO and LIFO queues), icl_list, quark) beyond the Quark source code itself?
Something more detailed than the current working notes, which cover neither the hashing nor the other implementations (trees, queues), would be appreciated.
I have modified the matmul example to call DGEMM from BLAS under ACML (single threaded), and I was able to achieve slightly better performance (83%) than the multithreaded version of ACML (80%) on our latest AMD processors (Interlagos and Abu Dhabi). I believe, though, that I can get more performance (~90%) out of Quark if I better understand the tuning parameters, such as the sliding window and locality plus work stealing.
One comment that caught my attention in the documentation is the recommendation to use "numactl --interleave=all ./executable".
This makes no sense if you are really trying to leverage locality on a NUMA system. The whole point of work stealing that avoids stealing tasks which would incur remote memory accesses goes away if you run the threads on a system with an interleaved memory configuration.
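To make concrete what I mean by stealing without incurring remote memory accesses, here is a minimal sketch. It is only my own illustration, not Quark's actual implementation: the names `steal_order`, `try_steal`, `node_of` and `queues` are hypothetical, and the worker-to-node layout is made up.

```python
from collections import deque

def steal_order(thief, node_of, allow_remote=False):
    """Return candidate victims for `thief`, same-NUMA-node workers first."""
    home = node_of[thief]
    local = [w for w in node_of if w != thief and node_of[w] == home]
    remote = [w for w in node_of if node_of[w] != home]
    # Remote victims are only considered when the policy allows it.
    return local + (remote if allow_remote else [])

def try_steal(thief, queues, node_of, allow_remote=False):
    """Steal one task from the first non-empty victim queue, or return None."""
    for victim in steal_order(thief, node_of, allow_remote):
        if queues[victim]:
            # The owner pops from the right end of its deque (LIFO);
            # a thief takes the oldest task from the left end (FIFO).
            return queues[victim].popleft()
    return None

# Hypothetical layout: workers 0-1 on NUMA node 0, workers 2-3 on node 1.
node_of = {0: 0, 1: 0, 2: 1, 3: 1}
queues = {0: deque(), 1: deque(["a", "b"]), 2: deque(["c"]), 3: deque()}
```

With memory interleaved across all nodes, the distinction between `local` and `remote` victims in this sketch stops meaning anything, because every task's data is spread over every node anyway.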
If you run something like DGEMM kernels, it certainly does not make a big difference to run interleaved, but Level 1 and Level 2 BLAS kernels have much higher memory bandwidth requirements, so they start being penalized by interleaved configurations.
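A rough back-of-the-envelope estimate shows the gap. The sketch below is my own, with hypothetical function names, and it assumes double precision and that every operand moves between memory and the core exactly once (ideal cache reuse). It computes flops per byte for square DGEMM, DGEMV and DAXPY.

```python
BYTES_PER_DOUBLE = 8

def dgemm_intensity(n):
    """C = alpha*A*B + beta*C with n x n matrices: flops per byte moved."""
    flops = 2 * n**3
    # Read A, B and C, write C: four n x n matrices in total.
    bytes_moved = 4 * n**2 * BYTES_PER_DOUBLE
    return flops / bytes_moved

def dgemv_intensity(n):
    """y = alpha*A*x + beta*y with an n x n matrix: flops per byte moved."""
    flops = 2 * n**2
    # Read A, x and y, write y.
    bytes_moved = (n**2 + 3 * n) * BYTES_PER_DOUBLE
    return flops / bytes_moved

def daxpy_intensity(n):
    """y = alpha*x + y with vectors of length n: flops per byte moved."""
    flops = 2 * n
    # Read x and y, write y.
    bytes_moved = 3 * n * BYTES_PER_DOUBLE
    return flops / bytes_moved

if __name__ == "__main__":
    for name, fn in [("DGEMM", dgemm_intensity),
                     ("DGEMV", dgemv_intensity),
                     ("DAXPY", daxpy_intensity)]:
        print(name, round(fn(1024), 3))
```

Under these assumptions DGEMM intensity grows as n/16, while DGEMV stays just under 1/4 flop per byte and DAXPY is fixed at 1/12. That is why only the Level 1 and Level 2 kernels become sensitive to where the memory lives.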
I would like to understand how cache-hierarchy-aware work stealing is implemented in Quark.
It should be feasible, for instance, to apply work stealing with power policies as well as locality policies.
Thanks,
Joshua
