Programming Model

PULSAR offers a straightforward programming model based on a few simple abstractions. The main abstraction in PULSAR is the Virtual Systolic Array (VSA), which is a collection of Virtual Data Processors (VDPs), identified by tuples, connected with channels, and communicating with data packets.

VDP Operation

The VDP is the main building block of the VSA, a direct descendant of the Processing Element (PE) in traditional systolic array nomenclature. A VDP contains executable code, read-only global parameters, a read/write persistent local store, a set of input channels, and a set of output channels. Each VDP is uniquely identified by a tuple (a string of integers) and has a counter defining its life span.

When all input channels of a VDP contain data packets, the VDP can be fired, i.e., scheduled for execution by the runtime. When fired, the VDP can pop packets from its input channels, invoke any kind of computational kernels, and push the results to its output channels. At each firing the counter is decremented, and when it reaches zero the VDP is destroyed.
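
The firing rule above can be sketched in a few lines of C. This is a minimal illustration, assuming a simplified VDP descriptor; vdp_t, vdp_ready, and vdp_fire are illustrative names, not the actual PULSAR API.

```c
/* Hypothetical, simplified VDP descriptor for illustration only. */
typedef struct vdp {
    int counter;        /* remaining firings; the VDP is destroyed at zero */
    int num_inputs;     /* number of input channels */
    int *input_depth;   /* packets currently waiting in each input channel */
} vdp_t;

/* A VDP is ready when every input channel holds at least one packet. */
static int vdp_ready(const vdp_t *v)
{
    for (int i = 0; i < v->num_inputs; i++)
        if (v->input_depth[i] == 0)
            return 0;
    return 1;
}

/* Fire once: consume one packet per input (standing in for popping and
   processing), decrement the life-span counter, and report whether the
   VDP should now be destroyed. */
static int vdp_fire(vdp_t *v)
{
    for (int i = 0; i < v->num_inputs; i++)
        v->input_depth[i]--;
    return --v->counter == 0;
}
```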

A channel is a static unidirectional connection between a pair of VDPs, in principle a First In First Out (FIFO) queue containing packets, where the source VDP pushes packets to the channel and the destination VDP pops packets from the channel. Input and output channels of each VDP are assigned slots, i.e., consecutive numbers starting at zero.

To implement an algorithm, one builds the VSA by defining all the VDPs and their connections, and then launches the VSA, i.e., passes control to the runtime, which propagates data through the VSA and dynamically schedules VDPs for execution.
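
The build phase might look like the following hedged sketch, assuming a toy vsa_t container; none of these names are the real PULSAR interface. A real program would end by handing the finished VSA to the runtime (the launch step), which is not modeled here.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative descriptors: a VDP identified by a tuple, with a
   life-span counter, collected into a growable VSA. */
typedef struct { int tuple[3]; int counter; } vdp_desc_t;
typedef struct { vdp_desc_t *vdps; int count, cap; } vsa_t;

/* Register one VDP, identified by its tuple, with a given life span. */
static void vsa_insert_vdp(vsa_t *vsa, const int tuple[3], int counter)
{
    if (vsa->count == vsa->cap) {
        vsa->cap = vsa->cap ? 2 * vsa->cap : 8;
        vsa->vdps = realloc(vsa->vdps, vsa->cap * sizeof *vsa->vdps);
    }
    memcpy(vsa->vdps[vsa->count].tuple, tuple, sizeof(int) * 3);
    vsa->vdps[vsa->count].counter = counter;
    vsa->count++;
}
```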

To define a VDP, one implements its processing cycle. The VDP code can invoke any kind of computational kernels, access global parameters in read-only mode, use the local store for persistent storage of data between firings, read from input channels, and write to output channels.
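
A processing cycle in that spirit can be sketched as follows. The channel, packet, and state types are toy stand-ins defined here for the example, not PULSAR's; the body shows the characteristic pop / kernel / local-store / push pattern.

```c
/* Toy packet and single-slot channel, defined only for this sketch. */
typedef struct { double *data; } packet_t;
typedef struct { packet_t *head; } channel_t;

static packet_t *channel_pop(channel_t *c) { packet_t *p = c->head; c->head = 0; return p; }
static void channel_push(channel_t *c, packet_t *p) { c->head = p; }

/* Per-VDP state: a read-only global parameter and a persistent
   local store that survives between firings. */
typedef struct {
    double alpha;      /* read-only global parameter */
    double acc;        /* persistent local store */
} vdp_state_t;

/* One firing: scale an incoming tile by alpha (the "kernel"),
   accumulate into the local store, and forward the tile. */
static void vdp_cycle(vdp_state_t *s, channel_t *in, channel_t *out, int n)
{
    packet_t *p = channel_pop(in);          /* read from input slot 0 */
    for (int i = 0; i < n; i++) {
        p->data[i] *= s->alpha;             /* invoke a computational kernel */
        s->acc += p->data[i];               /* update persistent local store */
    }
    channel_push(out, p);                   /* write to output slot 0 */
}
```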

Runtime System

The PULSAR Runtime (PRT) is a lightweight layer that maps the abstractions described in the previous section to a distributed memory system with multicore nodes, where the Message Passing Interface (MPI) is the mechanism for inter-node communication and POSIX Threads (Pthreads) is the mechanism for intra-node multithreading. In its 1.0 rendition, released in August 2013, the system only supports classic multicore processors, and does not yet provide support for GPU accelerators or Xeon Phi co-processors.

The VSA is executed by a collection of processes, each containing a number of worker threads and a thread dedicated to handling inter-node communication, referred to here as the proxy. Usually, one process is mapped to one distributed-memory node, one thread is mapped to one core, and the proxy runs on a dedicated core. Other mappings are possible, such as placing one process in each socket of a multi-socket system, or launching more threads than cores (oversubscribing).

Each thread continuously sweeps its list of VDPs in search of VDPs that are ready to launch, i.e., have at least one packet in each active channel. Two scheduling modes are available: lazy and aggressive. In the lazy mode, a ready VDP is fired once, and the sweep moves on to other VDPs. In the aggressive mode, a VDP is repeatedly fired for as long as it is ready.
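
The difference between the two modes amounts to one branch in the sweep loop, as in this sketch; toy_vdp_t, toy_ready, and toy_fire are illustrative stand-ins for the runtime's actual checks.

```c
typedef enum { LAZY, AGGRESSIVE } sched_mode_t;

/* Toy VDP: "ready" while packets remain; firing consumes one packet. */
typedef struct { int packets; int fired; } toy_vdp_t;

static int toy_ready(const toy_vdp_t *v) { return v->packets > 0; }
static void toy_fire(toy_vdp_t *v) { v->packets--; v->fired++; }

/* One pass over a thread's VDP list. */
static void sweep(toy_vdp_t *list, int n, sched_mode_t mode)
{
    for (int i = 0; i < n; i++) {
        if (mode == LAZY) {
            if (toy_ready(&list[i]))
                toy_fire(&list[i]);         /* fire once, move on */
        } else {
            while (toy_ready(&list[i]))
                toy_fire(&list[i]);         /* fire for as long as ready */
        }
    }
}
```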

Each local (intra-node) channel, i.e., a channel within the same shared-memory node, is a simple FIFO queue, connected to a source VDP on one end and a destination VDP on the other, with mutual exclusion enforced by a mutex. Each non-local (inter-node) channel is connected to a VDP on one end only, with the other end managed by the proxy.
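
A mutex-guarded FIFO of this kind can be sketched with Pthreads as follows; this is a minimal illustration, not the actual PULSAR channel structure.

```c
#include <pthread.h>
#include <stdlib.h>

/* Linked FIFO guarded by a mutex: source VDP pushes at the tail,
   destination VDP pops at the head. */
typedef struct node { void *packet; struct node *next; } node_t;

typedef struct {
    node_t *head, *tail;
    pthread_mutex_t lock;
} channel_t;

static void channel_init(channel_t *c)
{
    c->head = c->tail = 0;
    pthread_mutex_init(&c->lock, 0);
}

/* Source side: append a packet at the tail. */
static void channel_push(channel_t *c, void *packet)
{
    node_t *n = malloc(sizeof *n);
    n->packet = packet;
    n->next = 0;
    pthread_mutex_lock(&c->lock);
    if (c->tail) c->tail->next = n; else c->head = n;
    c->tail = n;
    pthread_mutex_unlock(&c->lock);
}

/* Destination side: remove the packet at the head, NULL if empty. */
static void *channel_pop(channel_t *c)
{
    pthread_mutex_lock(&c->lock);
    node_t *n = c->head;
    void *packet = 0;
    if (n) {
        packet = n->packet;
        c->head = n->next;
        if (!c->head) c->tail = 0;
        free(n);
    }
    pthread_mutex_unlock(&c->lock);
    return packet;
}
```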

The proxy contains one queue for all incoming packets from all nodes, and one queue per thread for outgoing packets. Upon reception of an incoming packet, the appropriate local channel is located and the packet is moved from the incoming queue to that channel. An outgoing packet is sent by moving it from its channel to the outgoing queue of its thread and posting a send. Upon completion of the send, the packet is removed from the outgoing queue.

The proxy follows a cycle of serving communication requests until no more requests are pending and no more VDPs are operating. Thanks to the simplicity of the underlying abstractions, the proxy is implemented using a total of six MPI functions, and spends most of its time cycling through three basic ones: MPI_Isend, MPI_Irecv, and MPI_Test. Also, because all MPI calls are funneled through the single proxy thread, the runtime does not require the underlying MPI implementation to be thread safe.

In principle, the proxy implements the abstraction of VDP-to-VDP communication. The routing of packets to appropriate channels is accomplished by assigning consecutive numbers to all channels connecting a pair of processes. These numbers are placed in the message tag and combined with the sender rank to identify the destination channel on the receiving side. Since the channel numbering is applied independently to each pair of nodes, the minimum guaranteed range of MPI tag values (the MPI standard requires MPI_TAG_UB to be at least 32767) should be more than enough for the foreseeable future.
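
The numbering scheme can be illustrated with plain integer bookkeeping; assign_tag and route are hypothetical names invented for this sketch, and the sender rank is assumed to be supplied by MPI itself on the receiving side.

```c
/* Smallest MPI_TAG_UB an MPI library is allowed to report. */
#define TAG_UB_MIN 32767

/* Build time, sender side: next_tag[] keeps one counter per peer rank,
   so numbering is independent for each process pair. */
static int assign_tag(int *next_tag, int dst_rank)
{
    int tag = next_tag[dst_rank]++;
    return tag <= TAG_UB_MIN ? tag : -1;   /* -1: out of tags for this pair */
}

/* Receive time: chan[src_rank][tag] holds the destination channel id,
   registered while the VSA was being built. */
static int route(int chan[][4], int src_rank, int tag)
{
    return chan[src_rank][tag];
}
```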

Overall, the PULSAR runtime provides the benefits of:

  • data-driven runtime scheduling,
  • overlapping of communication and computation by relying solely on non-blocking messaging,
  • zero-copy shared-memory communication by relying solely on aliasing of local data, and
  • hiding the combined complexity of multithreading and message-passing.

At the same time, the runtime introduces minimal scheduling overhead, and ensures portability by relying on only a handful of MPI calls and a small set of Pthreads functions.
