The PULSAR Runtime (PRT) is a lightweight layer that maps the abstractions described in the previous section to a distributed memory system with multicore nodes, where the Message Passing Interface (MPI) is the mechanism for inter-node communication and POSIX Threads (Pthreads) is the mechanism for intra-node multithreading. In its 1.0 release of August 2013, the system supports only conventional multicore processors, and does not yet support GPU accelerators or Xeon Phi coprocessors.
The VSA is executed by a collection of processes, each process containing a number of worker threads and a thread dedicated to handling inter-node communication, referred to here as the proxy. Usually, one process is mapped to one distributed memory node, one thread is mapped to one core, and the proxy runs on a dedicated core. Other mappings are possible, such as placing one process in each socket of a multi-socket system, or launching more threads than cores (oversubscribing).
Each thread continuously sweeps its list of VDPs in search of VDPs that are ready to launch, i.e., have at least one packet in each active channel. Two scheduling modes are available: lazy and aggressive. In the lazy mode, a ready VDP is fired once, and the sweep moves on to other VDPs. In the aggressive mode, a VDP is repeatedly fired for as long as it is ready.
Each local (intra-node) channel, i.e., a channel within the same shared-memory node, is a simple First In First Out (FIFO) queue, connected to a source VDP on one end and a destination VDP on the other, with mutual exclusion enforced by a mutex. Each non-local (inter-node) channel is only connected to a VDP on one end, with the other end managed by the proxy.
The proxy contains one queue for all incoming packets from all nodes, and one queue per thread for outgoing packets. Upon reception of an incoming packet, the appropriate local channel is located and the packet is moved from the incoming queue to that channel. An outgoing packet is sent by moving the packet from its channel to the outgoing queue of the thread and posting a send. Upon completion of the send, the packet is removed from the outgoing queue.
The proxy follows the cycle of serving communication requests until no more requests are pending and no more VDPs are operating. Thanks to the simplicity of the underlying abstractions, the proxy is implemented using a total of six MPI functions, and spends most of its time cycling through the three basic non-blocking functions: MPI_Isend, MPI_Irecv, and MPI_Test. Also, because all MPI calls are confined to the single proxy thread, the runtime does not require the underlying MPI implementation to be thread safe.
In principle, the proxy implements the abstraction of VDP-to-VDP communication. The routing of packets to appropriate channels is accomplished by assigning consecutive numbers to all channels connecting a pair of processes. These numbers are placed in the message tag and combined with the sender rank to identify the destination channel on the receiving side. Since the channel numbering is applied independently to each pair of nodes, the minimum range of tag values guaranteed by the MPI standard (MPI_TAG_UB is required to be at least 32767) should be more than enough for the foreseeable future.
Overall, the PULSAR runtime provides the benefits of:
- data-driven runtime scheduling,
- overlapping of communication and computation by relying solely on non-blocking messaging,
- zero-copy shared-memory communication by relying solely on aliasing of local data, and
- hiding the combined complexity of multithreading and message-passing.
At the same time, the runtime introduces minimal scheduling overhead, and ensures portability by relying only on a rudimentary set of MPI calls and a small set of Pthreads functions.