X10RT is a C library (implemented in C++) that a compiled X10 program uses to communicate between places. It can also be used by other languages or applications whose semantics are sufficiently similar. It should also be possible for third parties to reimplement certain X10RT components on top of new networking layers and HPC systems, allowing X10 programs to run in those environments. The API documentation describes what the X10 process requires from this layer in terms of semantics and performance.
The X10RT implementation defines the application launch process, e.g. by requiring the use of mpirun. The configuration of the system (i.e. the number of places) should be controlled by the end user of the X10 program. Determining the identity of a particular X10 process in the execution is the responsibility of the X10RT implementation. When an X10 process executes, the standard C entry point (int main) is invoked as normal; the application should then explicitly initialize and finalize X10RT like any other library.
An X10 program runs on more than one place. These places consist firstly of hosts (e.g. x86 or PPC machines, or some mixture of the two) connected by a network such as Ethernet or InfiniBand. There are also accelerator places that are children of these hosts. The topology of places is thus a two-level tree in which the hosts are peers on the network, and under each host there are optionally some accelerator cards. Currently the only supported accelerators are CUDA-capable GPUs; there are plans for OpenCL places, and some incomplete code for supporting QS21 blades. Note that in the current implementation, there is no distinction between hosts and processes.
During the execution of an X10 program, messages are exchanged between places. Messages have a type/length/value format, where the type is a number in the range 0-65535. Messages are registered by the X10 program, but X10RT chooses the type values that identify them. Messages can have arbitrary size, and two messages with the same type need not be the same size. Individual messages may be re-ordered, but delivery must be guaranteed.
There are three types of messages: 'Plain' messages carry data that has been specially packed into a buffer (allocated by the caller) and invoke a callback on the remote side to handle the message. 'Put' messages additionally carry a large amount of data that is transmitted directly from a given existing data structure; they have callbacks on the remote side to locate a home for the data and to indicate that the transfer is complete. 'Get' messages cause data to be retrieved from a place directly into a given data structure; they have a callback on the remote side to locate the source of the data and a callback on the destination to indicate that the transfer is complete. In X10, plain messages are used to implement asyncs, whereas put/get messages are used to implement the copyTo/copyFrom functionality.
All of these messages cause callbacks to be invoked on the far side, so all can be considered 'active', and all carry a packed buffer of serialised data; there is therefore considerable overlap in functionality. If the amount of data is small, the data is not contiguous, or the data formatting (endianness, alignment, etc.) differs between the two hosts, the 'plain' messages are appropriate. If the data is contiguous, the allocation and packing overhead of a large packed buffer can be avoided by using the get/put messages. The get/put messages still require a packed buffer, but in this case it is used only for metadata, so the overhead is negligible.
The X10RT library has various modules for implementing specific behavior. The highest-level API is accessed via the functions in x10rt_front.h; X10 and other languages should usually target this API. An auxiliary header, x10rt_types.h, defines some common types for the whole system. At a lower level than x10rt_front.h is the Logical Layer, found in x10rt_logical.h, which deals only with communication. The goal of the Logical Layer is to join together the various backend networking libraries, presenting a uniform set of places, and to export functions that allow communication to and from these places no matter where they sit in the topology.
Underneath the Logical Layer are the networking layers. The Core Networking Layer, x10rt_net.h, provides the links between hosts. There are currently several implementations of it: a standalone implementation that allows multiple places via inter-process communication on a single host; an MPI implementation that uses MPI for communication between hosts; and a proprietary implementation on top of the PGAS library, which internally supports many HPC transports and also has a sockets implementation. All of these define the symbols in x10rt_net.h, so they cannot currently be used simultaneously; however, one can link against whichever implementation is preferred for inter-host communication. Details on the available implementations of the Core Networking Layer are documented separately.
In addition to the Core Networking Layer there is a layer for CUDA, found in x10rt_cuda.h, which wraps the NVIDIA CUDA API behind an interface very similar to x10rt_net.h. Similarly, if the Cell / OpenCL layers were completed, they would be separate interfaces, x10rt_cell.h and x10rt_opencl.h, defining their own distinct symbols. It is the job of the Logical Layer to take the Core Networking Layer and allow crosstalk between it and the various accelerator places, by connecting these distinct APIs together and dispatching messages accordingly.
X10RT structural diagram
The X10RT API makes heavy use of callbacks at all levels for handling incoming messages and other events. These callbacks are fully re-entrant: they can call back into X10RT in arbitrarily recursive ways. This is necessary because in X10 everything is an async, and asyncs can execute arbitrary code, including sending other asyncs and waiting for remote termination. Callbacks can execute for as long as they need to, but if they become idle for any reason they should not block or busy-wait; instead they should call x10rt_probe() to trigger more callbacks.
The x10rt_send_* functions may block until some internal operation has completed, such as writing to a network buffer. However, blocking until the callback has completed on the other side of the transmission is not allowed, as it could cause deadlocks within the X10 program.
The X10RT implementation is permitted to handle messages (i.e. behave as if x10rt_probe() has been called) when it is called via one of the x10rt_send_* calls. This relaxation may be useful for increasing throughput.
The x10rt_probe() function should not block waiting for network traffic to arrive, as doing so would hurt performance. If a message has only partially arrived, x10rt_probe() should return, and the full receipt and associated action should be deferred to some future call of x10rt_probe() (which may be from another thread), i.e. to a time when enough data is available for the message to be handled immediately.