Not All Open Architecture Middleware Is Created Equal: Life of MPI
We now live in a world where embedded systems are expected to be built around Open Architectures (OA), from both hardware and software perspectives. The end customer is looking to reduce costs over the lifetime of programs, and views mandating non-proprietary components as a key element in constraining those costs.
That revives an enduring discussion about open standards vs. proprietary solutions: Does the use of OA stifle innovation and performance, or does it facilitate them?
It may be instructive to look at one OA middleware component to obtain a better understanding.
When it comes to moving data around a multi-node processing system, a systems engineer faces a plethora of choices: sockets, verbs, DDS, to name but a few. In some application areas, radar and ISR in particular, Message Passing Interface (MPI) is emerging as a common choice.
Do You Tuple?
MPI is certainly not new. It has been around for 20 years or so, initially mainly in the supercomputer world. Even then, it was not alone: in the same timeframe, we saw the emergence of many other models, such as PVM, BSP and Linda (who else remembers tuple space?), but MPI seems to have outlasted the others by a long way. The MPI reference sitting on my bookshelf dates from 2001, which probably frames my first involvement with MPI in the embedded space. At that time, it was a little bit of a square peg in a round hole. Most embedded multiprocessor systems of that era were built around PowerPC CPUs and many used proprietary interconnects (SKYchannel, RACEway, Myrinet, StarFabric and so on), but we did start to see the application of InfiniBand in such systems.
In some ways, this could be considered the genesis of what we now term high-performance embedded computing (HPEC). The square peg analogy came from the fact that MPI was really designed for big applications running on big machines to solve big problems without regard to efficiency, at least in the way those of us in the embedded space understand it. There were some attempts to improve matters: MPI/RT was all the rage for a while as a way to adapt MPI to the data-driven environment we typically see in embedded systems. Then, it seemed like there was a hiatus of several years where other (and often proprietary) APIs gathered momentum.
Skip forward to 2010 or so, and the baseline architectures for HPEC systems moved to Intel chipsets, standard fabrics such as Ethernet and InfiniBand, and operating systems such as Linux—and suddenly, the linkage from HPC to HPEC became much more obvious and easy (OK, easier). All of a sudden, we were all running OFED stacks that came along with several implementations of MPI rolled in, and its adoption in real, deployed applications was resumed with renewed vigor. It came with full operating system and stack support, supported a variety of interconnects and was accompanied by good performance (due to higher speed CPUs, faster buses, and wire-speed protocols like RDMA). Now, we started to see RFPs explicitly calling out support for MPI.
Performance: A Qualified “Maybe”
But (and there is always a "but," right?) did we really see the performance we were led to expect? The answer to that is a qualified “maybe.” As the title to this piece implies, there are many things to factor in when deciding if MPI is right for a given application—and if so, which MPI?
One key point is to establish that the MPI programming model is a fit for the application. As the name implies, it is oriented around passing messages between processes, which are organized into groups called ‘communicators’. Data is moved from the address space of one process to that of another via cooperative operations on each end (in most cases; later versions of MPI also allow one-sided operations). One or more processes send the data, one or more processes receive it. Operations range from simple point-to-point transfers to complex collective operations such as scatters and gathers. This is often a good fit for signal- and image-processing applications that need to be parallelized to meet real-time or data-driven constraints. Other models perhaps map better to other paradigms, such as the publish-subscribe methods of DDS and others. Over time, MPI evolved from a pure distributed memory method to one that supports shared and hybrid memory models.
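The scatter, compute, gather pattern described above can be sketched without an MPI installation. The example below is only an analogy, assuming Python threads and queues as stand-ins for MPI ranks and messages; in a real MPI program the distribution and collection steps would typically be calls such as MPI_Scatter and MPI_Gather. The names worker and scatter_gather are hypothetical and chosen for illustration.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """One stand-in 'rank': receive a chunk, process it, send a partial result."""
    chunk = inbox.get()  # cooperative receive: blocks until the chunk arrives
    partial = sum(x * x for x in chunk)  # local computation on this rank's data
    outbox.put((rank, partial))  # send the partial result back to the root


def scatter_gather(data, n_ranks):
    """Split data across n_ranks workers and combine their partial results."""
    inboxes = [queue.Queue() for _ in range(n_ranks)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(r, inboxes[r], results))
        for r in range(n_ranks)
    ]
    for t in threads:
        t.start()

    # Scatter: rank r receives every n_ranks-th element, starting at index r.
    for r in range(n_ranks):
        inboxes[r].put(data[r::n_ranks])

    # Gather: collect exactly one (rank, partial) message from each rank.
    partials = dict(results.get() for _ in range(n_ranks))
    for t in threads:
        t.join()

    # Reduce the gathered partial results on the root.
    return sum(partials[r] for r in range(n_ranks))


if __name__ == "__main__":
    # Sum of squares of 0..9, computed across four stand-in ranks.
    print(scatter_gather(list(range(10)), 4))
```

Both ends take part in every transfer here: the root puts a chunk and the worker gets it, which mirrors the cooperative send and receive pairing of the two-sided MPI model.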
It should be acknowledged that MPI is not a standard ratified by ANSI, ISO or any other of the standards bodies. It is a consensus-based API definition that is managed by the MPI Forum, a body of some 40+ contributing organizations. It has gone through several iterations, the major ones being MPI-1 (1994), MPI-2 (1997) and MPI-3 (2012), as well as significant intermediate versions. MPI is not a library per se; it is a specification for an API that is then the subject of various implementations, some open source, some closed source. This is a key point: an application can be written to the API of a specific release of the MPI specification, and can then be linked against any of the available libraries that implement that release. It may be found, for instance, that some perform better than others, either in general or when applied to specific system architectures.
It was this flexibility that drove the development of AXISmpi. AXISmpi allows developers to take advantage of the smarts of GE engineers to optimize under the covers, while maintaining application compatibility with other available libraries. OA API, OEM optimization—the best of both worlds. There are many MPI implementations out there, including but not limited to:
- MPICH, from Argonne National Laboratory (ANL) and Mississippi State University
- MVAPICH, from Ohio State University
- Intel MPI
So what would one want to optimize in an MPI library intended for HPEC systems? The answers are many and varied. Maybe memory space is at a premium, so you optimize for that; HPEC systems, particularly fully rugged ones, tend to have less memory in fewer banks than their server-based big brothers. Maybe power consumption (or its cousin, heat dissipation) is the most important factor. Many MPI implementations wait for data in spin-loops, which minimizes the latency of reacting when the data arrives.
All Burn, No Crunch
If you profile an application running on such a library, you will often see long periods of 100% CPU utilization for this reason. All those spin-loops burn CPU cycles without contributing to crunching the data, which is highly inefficient in terms of power and thermal dissipation. Some (mostly commercial) implementations allow the developer to select an event-driven mode where cores can sleep or perform other tasks while waiting for data. This can be much more efficient power-wise (trading off against that wake-up time, of course). Another question worth asking of any candidate library is whether hyper-threading helps or hurts performance and/or power efficiency.
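The spin-versus-sleep trade-off can be made concrete with a small experiment. The sketch below is not MPI code and does not reflect how any particular MPI library is implemented; it assumes a Python threading.Event as a stand-in for "data has arrived" and compares the CPU time consumed by a polling wait with that of a blocking wait. The function names and the 0.2 second delay are hypothetical, chosen for illustration.

```python
import threading
import time


def wait_spinning(flag):
    """Poll the flag in a tight loop, as a latency-optimized library might."""
    while not flag.is_set():
        pass  # burns CPU cycles until the "data" arrives


def wait_blocking(flag):
    """Sleep until signalled, as an event-driven mode would."""
    flag.wait()


def cpu_seconds_while_waiting(wait_fn, delay=0.2):
    """Return the CPU time this process consumes while wait_fn waits."""
    flag = threading.Event()
    # The timer thread plays the sender: the "data" arrives after `delay` seconds.
    producer = threading.Timer(delay, flag.set)
    start = time.process_time()  # counts CPU time only, not time spent asleep
    producer.start()
    wait_fn(flag)
    used = time.process_time() - start
    producer.join()
    return used


if __name__ == "__main__":
    spin = cpu_seconds_while_waiting(wait_spinning)
    block = cpu_seconds_while_waiting(wait_blocking)
    print(f"spin-wait CPU time:     {spin:.3f} s")
    print(f"blocking-wait CPU time: {block:.3f} s")
```

On a typical machine the spinning wait consumes CPU for most of the delay while the blocking wait consumes almost none. What the sketch does not measure is the other side of the trade: the extra latency of waking a sleeping thread once the data arrives.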
One frequent question is: “How do I tell how my MPI application is performing?” There are several approaches that can be taken here, depending on the degree of granularity desired for the analysis. At the fine-grained end of the spectrum are tools like AXIS EventView, a multiprocessor event analyzer that allows a developer to easily reference an event trace to a particular system task and processor while ensuring event traces are accurately time-aligned across the system. Several MPIs are available with pre-installed event-based profiling.
At the other extreme is Allinea’s MAP, an MPI system-wide application profiler that can help to identify application inefficiencies across the entire system. In between is AXIS RuntimeView, which shows a graphical view of MPI ranks and communication paths mapped to a system diagram, and shows statistics interactively.
Innovation Is Alive and Well
So, back to the original question: I would argue that, in the case of MPI and AXISmpi, Open Architecture does indeed facilitate desirable innovation and performance. MPI is certainly Open Architecture in that it is a published API that can claim to be a de facto standard. There are many implementations available, including open source versions. Because it is an API definition, it does not prevent the supplier of an implementation from optimizing, tuning and instrumenting it in order to improve the user experience. A systems integrator can take advantage of what a vendor may have to offer in terms of performance to reduce the size and cost of a system without feeling tied to that vendor in perpetuity. AXISmpi provides the application with the security of OA combined with the performance that comes from being implemented by engineers with many years of experience in the demands of embedded systems.
If you want to learn more:
See our White Paper on Tuning HPEC Linux Clusters for Real-Time Determinism