A much-needed solution to communication in embedded HPC applications
Abaco is leading an effort to create a new open standard for point-to-point communication in order to address a significant problem in the embedded HPC (eHPC) market.
In the eHPC market (a small niche group in the software world), developers are focused on domain-specific application development (e.g. radar processing, signal intelligence, vehicle autonomy). These domain-specific problems require the development of complex algorithms, and the resulting applications are often very compute-intensive. In order to meet real-time requirements, the algorithms must be distributed across a heterogeneous system, which requires moving large data sets around the system at high speeds. This is where point-to-point communication becomes very important. Unfortunately, it is very often a secondary focus, which creates development and maintenance problems.
There are three fundamental issues with existing point-to-point communication APIs.
1. Multiple communication APIs
Modern heterogeneous eHPC applications typically require multiple communication APIs, one for each interconnect type. This diagram shows the complexity:
If there were a single unified API, this is what you would have:
This would result in much less source code to implement, and a software architecture that is much simpler to maintain.
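To make the idea concrete, here is a minimal Python sketch of what a single unified interface might look like. The names `Transport`, `InProcessTransport`, and `Endpoint` are hypothetical and are not taken from any real API; the in-memory queue merely stands in for a real interconnect such as sockets, shared memory, or RDMA.

```python
from abc import ABC, abstractmethod
from collections import deque


class Transport(ABC):
    """Hypothetical backend interface; one subclass per interconnect type."""

    @abstractmethod
    def send(self, data: bytes) -> None: ...

    @abstractmethod
    def recv(self) -> bytes: ...


class InProcessTransport(Transport):
    """Stand-in backend that moves messages through an in-memory queue."""

    def __init__(self) -> None:
        self._queue = deque()

    def send(self, data: bytes) -> None:
        self._queue.append(bytes(data))

    def recv(self) -> bytes:
        if not self._queue:
            raise RuntimeError("no message available")
        return self._queue.popleft()


class Endpoint:
    """Application-facing object: the same two calls for every backend."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def send(self, data: bytes) -> None:
        self._transport.send(data)

    def recv(self) -> bytes:
        return self._transport.recv()


# The application code below would not change if the backend were swapped.
endpoint = Endpoint(InProcessTransport())
endpoint.send(b"radar frame 0")
received = endpoint.recv()
```

In this sketch, supporting another interconnect would mean adding one more `Transport` subclass, while the application keeps calling the same `send` and `recv`.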
2. Complexity or lack of functionality
Communication is typically thought of as sending and receiving data. Unfortunately, most communication APIs are designed to closely follow the underlying hardware or low-level kernel/protocol designs rather than to keep the API simple and intuitive. This leads to complex APIs. There are simpler, higher-level communication APIs such as MPI, but commercial and open-source implementations may not be suitable for applications that require determinism and extremely low latency. Here’s a table of the common communication APIs that a developer might consider using, and their drawbacks.
As can be seen, there is no fully featured API with a low API function count or low lines-of-code count.
3. Poor fit with modern interconnect hardware
Long-standing communication APIs may no longer fit well with modern interconnect hardware.
- Sockets, MCAPI, and MPI don’t fully utilize RDMA because they defer memory registration until the time of send/receive, which is costly.
- MPI isn’t well suited for fault tolerance since it doesn’t use explicit timeouts or provide explicit disconnect reporting.
- Many of the APIs have multiple ways to move data to try to compensate for the various types of interconnects.
- Some APIs like MPI require passing the data structure of the message, which may result in mysterious performance issues for noncontiguous data, or may even raise privacy concerns because the API knows the message structure.
- Some APIs like MPI are either 100% polling (good for latency) or 100% event-driven (good for size, weight, and power, or SWaP), but not a mix.
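The timeout and polling-versus-event points above can be illustrated with a short sketch. It assumes a hypothetical `hybrid_wait` helper built on Python's standard `threading.Event`: it polls briefly for low latency, then falls back to a blocking wait to save CPU, and reports a timeout explicitly instead of hanging. This only illustrates the idea and is not how any of the APIs listed above are implemented.

```python
import threading
import time


def hybrid_wait(event: threading.Event, spin_seconds: float, timeout_seconds: float) -> bool:
    """Return True if `event` is set within `timeout_seconds`, else False.

    Polls (busy-waits) for up to `spin_seconds`, then blocks for the remainder.
    """
    start = time.monotonic()
    deadline = start + timeout_seconds
    spin_deadline = min(start + spin_seconds, deadline)

    # Phase 1: polling, lowest latency but occupies a CPU core.
    while time.monotonic() < spin_deadline:
        if event.is_set():
            return True

    # Phase 2: event-driven, the thread sleeps until signaled or timed out.
    remaining = max(0.0, deadline - time.monotonic())
    return event.wait(timeout=remaining)


# Simulate a message arriving from another thread after 50 ms.
arrived = threading.Event()
timer = threading.Timer(0.05, arrived.set)
timer.start()
got_message = hybrid_wait(arrived, spin_seconds=0.001, timeout_seconds=2.0)
timer.join()

# Simulate a peer that never responds: the caller gets an explicit timeout.
never = threading.Event()
timed_out = not hybrid_wait(never, spin_seconds=0.001, timeout_seconds=0.05)
```

The `spin_seconds` parameter is the tuning knob in this sketch: a larger value favors latency, while a smaller value favors power consumption.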
Abaco has created Takyon, an open-source point-to-point communication specification and reference implementation designed to solve the above issues.
Takyon can be found on GitHub here.
A more extensive white paper can be found here.
We believe we have created a specification that is well suited as an open standard. Working with Khronos (a standards group), we have found that the eHPC industry has significant interest in formulating a new open point-to-point communication standard. In order to proceed with a Khronos working group, new members are needed who are willing to actively participate in creating the new standard.
If you would like to participate, please contact David Tetley, Khronos Exploratory Group Chair and Principal Software Engineer at Abaco Systems.