Understanding data structures behind the dynamic scheduler

Postby joshua_mora » Tue Nov 27, 2012 6:15 am

Hello.
Is there any further information/documentation on the scope of the data structures used by QUARK (icl_hash for task and address lookup, bsd_tree for the FIFO and LIFO queues, icl_list, quark) beyond the actual code?
More detail than the current working notes, which do not cover the hashing or the other implementations (trees, queues), would be appreciated.

I have modified the matmul example to call DGEMM from the BLAS in ACML (single threaded), and I was able to achieve slightly better efficiency (83%) than the multithreaded version of ACML (80%) on our latest AMD processors (Interlagos and Abu Dhabi). I believe, though, that I can get more performance (~90%) out of QUARK if I better understand the tuning parameters, such as the sliding window and locality + work stealing.
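For reference, the per-tile task I insert looks roughly like the sketch below. This is simplified and reconstructed from memory, following the argument-list conventions from the QUARK users guide; the helper names are mine, and I write the BLAS call as cblas_dgemm for clarity even though I actually link against the sequential ACML library.

    /* Simplified sketch, reconstructed from memory -- not the actual PLASMA
     * code.  One nb x nb tile update C = C + A*B is inserted as a QUARK task
     * whose body calls a single-threaded BLAS dgemm. */
    #include <quark.h>
    #include <cblas.h>

    static void tile_dgemm_task(Quark *quark)
    {
        int nb;
        double *A, *B, *C;
        quark_unpack_args_4(quark, nb, A, B, C);
        /* Sequential dgemm on one nb x nb tile: C = C + A*B */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    nb, nb, nb, 1.0, A, nb, B, nb, 1.0, C, nb);
    }

    static void insert_tile_dgemm(Quark *quark, int nb,
                                  double *A, double *B, double *C)
    {
        Quark_Task_Flags tflags = Quark_Task_Flags_Initializer;
        QUARK_Insert_Task(quark, tile_dgemm_task, &tflags,
            sizeof(int),          &nb, VALUE,
            sizeof(double)*nb*nb, A,   INPUT,
            sizeof(double)*nb*nb, B,   INPUT,
            sizeof(double)*nb*nb, C,   INOUT,
            0);
    }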

One comment in the documentation that caught my attention is the recommendation to use "numactl --interleave=all ./executable".
This makes no sense if you are really trying to exploit locality on a NUMA system. The whole idea of work stealing that avoids stealing tasks which would incur remote memory accesses goes away if you run the threads on a system with an interleaved memory configuration.
For something like DGEMM kernels, interleaving certainly does not make a big difference, but for Level 1/2 BLAS kernels the memory bandwidth requirements are much higher, so you start being penalized by an interleaved configuration.

I want to understand how the cache-hierarchy-aware work stealing is implemented in QUARK.
It should, for instance, be feasible to apply work stealing with not only locality policies but also power policies.

Thanks,
Joshua

Re: Understanding data structures behind the dynamic scheduler

Postby mateo70 » Tue Nov 27, 2012 6:45 am

Hello,

Asim will probably answer the question about the structures, because I don't know where they come from and won't be of any help on this topic.
But thank you for pointing out the numactl comment, because it is out of date and should be clarified. It is indeed useful in some of the testing routines that still use zlarnv to initialize the matrix. Since that function is sequential, interleaving can be useful to spread the memory over the NUMA nodes.

For all the other tests, which use PLASMA_zplrnt, zplghe or zplgsy and their equivalents in other precisions, the initialization is done in parallel to get a better distribution/locality during the subsequent computations. In this case, the numactl advice definitely does not apply and can decrease the performance of the algorithm.
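To make the locality point concrete, here is a tiny first-touch sketch of my own (not PLASMA code): when each thread writes its own tiles first, the Linux first-touch policy places those pages on that thread's NUMA node, which is what the parallel initializers give you and what "numactl --interleave=all" would override.

    /* Minimal first-touch illustration (my own sketch, not PLASMA code).
     * Each thread zeroes the tiles it will later compute on, so the pages
     * are allocated on that thread's local NUMA node.  Running under
     * "numactl --interleave=all" would override this placement. */
    #include <stddef.h>

    void init_tiles_first_touch(double *A, int nt, int nb)
    {
        /* A holds nt contiguous tiles of nb*nb doubles. */
        #pragma omp parallel for schedule(static)
        for (int t = 0; t < nt; t++) {
            double *tile = A + (size_t)t * nb * nb;
            for (int i = 0; i < nb * nb; i++)
                tile[i] = 0.0;   /* the first write decides the page placement */
        }
    }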

Mathieu

Re: Understanding data structures behind the dynamic scheduler

Postby joshua_mora » Tue Nov 27, 2012 12:36 pm

In that case, where the data has to be spread out because it does not fit in a single NUMA node, you may want to think about distributing it explicitly, as if the NUMA nodes were separate compute nodes, rather than still using interleaving. Remote access is roughly twice as slow as local access, in both latency and bandwidth, and that is a more or less architecture-independent statement.
Joshua

Re: Understanding data structures behind the dynamic scheduler

Postby mateo70 » Tue Nov 27, 2012 12:54 pm

I agree, and the statement is indeed more or less architecture independent; I have one machine where remote access can even be 4 to 10 times slower. However, the initialization of the data is usually something we have no control over, so it really depends on what the user does to initialize the matrix before giving it to PLASMA.
We could decide to redistribute the data as you suggest, but what if the user has already spread the data according to the needs of their own code? Moving it around could then be a bad choice.
I still have to do it, but I would also like to integrate a map function that the user could call to initialize the matrix. That way we could spread the data correctly, and more advanced users could still do their own placement.
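Something along these lines is what I have in mind -- purely hypothetical for now; the names below are made up for illustration and nothing like this exists in PLASMA yet.

    /* Hypothetical interface, for illustration only -- not in PLASMA today.
     * The user supplies a per-tile callback; the library runs it in parallel,
     * on the thread that owns each tile, so the placement comes out right. */
    typedef void (*plasma_tile_init_fn)(double *tile, int tile_m, int tile_n,
                                        int row_offset, int col_offset,
                                        void *user_data);

    /* The driver would walk all tiles and submit one initialization task per
     * tile, executed by the thread that will later work on that tile. */
    int PLASMA_map_tile_init(double *A, int mt, int nt, int nb,
                             plasma_tile_init_fn init, void *user_data);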

Mathieu

Re: Understanding data structures behind the dynamic scheduler

Postby yarkhan » Tue Nov 27, 2012 2:41 pm

Joshua,

The bsd_tree and bsd_queue data structures are header-only implementations taken from the BSD distribution. Descriptions of them can be found at:
http://www.unix.com/man-page/FreeBSD/3/queue/
http://www.unix.com/man-page/freebsd/3/tree/
The icl_hash and icl_list are fairly basic implementations, developed initially for an earlier project (GridSolve) and then reused here. The implementations are very simple and should be fairly easy to read from the code and headers.
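If it helps, the queue(3) macros are self-contained enough to play with outside of QUARK. Here is a standalone illustration (not QUARK code): a tail queue used as a FIFO, which becomes a LIFO if you insert at the head instead of the tail.

    /* Standalone illustration of the BSD queue(3) macros that bsd_queue.h is
     * based on: a tail queue used FIFO (insert at tail, remove from head). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct task {
        int id;
        TAILQ_ENTRY(task) entries;      /* the linkage lives inside the element */
    };

    TAILQ_HEAD(task_queue, task);

    int main(void)
    {
        struct task_queue q;
        TAILQ_INIT(&q);

        for (int i = 0; i < 3; i++) {
            struct task *t = malloc(sizeof *t);
            t->id = i;
            TAILQ_INSERT_TAIL(&q, t, entries);   /* producer appends at the tail... */
        }
        while (!TAILQ_EMPTY(&q)) {
            struct task *t = TAILQ_FIRST(&q);    /* ...consumer pops from the head */
            TAILQ_REMOVE(&q, t, entries);
            printf("task %d\n", t->id);
            free(t);
        }
        return 0;
    }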

There is a QUARK-specific site at http://icl.cs.utk.edu/quark/
There is also a QUARK users guide that will give you some idea of how to use QUARK:
http://icl.cs.utk.edu/projectsfiles/plasma/pubs/56-quark_users_guide.pdf

QUARK does not do well with small quantities of data, because the relative overhead of the runtime becomes large compared to the computation. So, for a double precision matrix multiplication, a tile size of 220x220 will perform much better than a small tile size. However, there is no single best tile size... it depends on the machine and the algorithm. We are planning to work on autotuning to determine the appropriate tile sizes automatically, but for the present, a user simply needs to guess/tune manually.
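As a rough back-of-the-envelope illustration (my numbers, purely illustrative, not a measurement): a 220x220 double precision tile multiply is about 2 * 220^3 ≈ 2.1e7 flops, i.e. milliseconds of work on a single core, so a per-task scheduling overhead of a few microseconds is lost in the noise; shrink the tile to 40x40 and the task is only about 1.3e5 flops, tens of microseconds of work, and the runtime overhead starts to dominate.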

The window size is not as important for performance. There is an environment variable (QUARK_UNROLL_TASKS_PER_THREAD) that can be set to change the window size from the default, but I doubt that you need to alter it.
There is very limited user control over the work stealing. You can prevent a task from being stolen by binding it to a specific thread or set of threads, but that is all.
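From memory, the binding goes through the task flags, something like the sketch below -- please check quark.h for the authoritative flag name, since I am quoting it without looking.

    /* Sketch from memory -- check quark.h for the exact flag name.  Locking a
     * task to one thread keeps the work stealer away from it. */
    #include <quark.h>

    void insert_pinned_task(Quark *quark, void (*fn)(Quark *), int thread_id,
                            int n, double *data)
    {
        Quark_Task_Flags tflags = Quark_Task_Flags_Initializer;
        QUARK_Task_Flag_Set(&tflags, TASK_LOCK_TO_THREAD, thread_id);
        QUARK_Insert_Task(quark, fn, &tflags,
            sizeof(int),      &n,   VALUE,
            sizeof(double)*n, data, INOUT,
            0);
    }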

As Mathieu said, the numactl comment is out of date.
However, if you have no better way to distribute your data, it is better than nothing!

QUARK work stealing is not really cache aware. There is an ordering to the threads determined by the machine layout (and discovered by the hwloc library if it is configured and installed). QUARK steals from the threads in that order, but this may not correspond to nearest neighbors/local caches. We need to fix that!
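To be clear about what "ordering determined by the machine layout" means, here is a standalone hwloc illustration (not QUARK internals): cores are enumerated in topology order, so neighboring indices tend to share caches or a NUMA node, and a scheduler can use that order as a crude steal-from-nearby-first list.

    /* Standalone hwloc example (not QUARK code): list the cores in topology
     * order, which is the kind of ordering a scheduler can steal along. */
    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (int i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            printf("logical core %d -> OS index %u\n", i, core->os_index);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }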

There has been some discussion about whether QUARK could do power-aware scheduling, but it has not gone very far yet. If you have an example of power policies that would be of interest, please let us know.

Hope this helps,
Regards,
Asim

Re: Understanding data structures behind the dynamic scheduler

Postby joshua_mora » Tue Nov 27, 2012 4:19 pm

I have used block (tile) sizes of 1000 for DGEMM and solved problems of size 30,000.
That is how I get good performance (>80% efficiency). DGEMM calls have some overhead, and I use the ACML_FAST_MALLOC environment variable with the single-threaded version of ACML.
I am interested in understanding how those structures (trees, queues, lists) are used for the different goals of dynamic scheduling, work stealing, and operations like reductions/agglomeration, and how one would implement speculation, checkpointing, resilience and power policies on top of them.
What I am really after is a high-level view of each of those structures: what its role is and how they are connected to each other, since there are several ways to implement these things.
For power policies, you could migrate tasks to cores within the same NUMA node and "shut down" sets of cores in a NUMA node that is not being used.
Additionally, you could throttle down some compute cores while they are waiting on dependencies that, due to data locality, will eventually run right there.
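As a concrete example of the mechanism I have in mind (a Linux-specific sketch of my own, requiring root; the paths are the standard cpu-hotplug and cpufreq sysfs files), a power-aware runtime could do something like:

    /* Rough illustration only: take an idle core offline, or throttle a core
     * that is waiting on a dependency, via the standard Linux sysfs files. */
    #include <stdio.h>

    static int write_sysfs(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) return -1;
        int ok = (fputs(value, f) >= 0);
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(void)
    {
        /* Take core 7 offline while its NUMA node has no runnable tasks... */
        write_sysfs("/sys/devices/system/cpu/cpu7/online", "0");
        /* ...or just throttle core 6 while it waits on a dependency. */
        write_sysfs("/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor",
                    "powersave");
        return 0;
    }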

Joshua

