HPEC: 5 Ways the Boundaries Will Blur in 2015
I’ve been around the computer industry way too long. But I still get excited about embedded computing, and where I am at present, helping drive GE’s HPEC initiative, there are plenty of reasons to get excited: we keep being able to do so much more than we could once have dreamed of. 2015 will see the familiar landscape of HPEC stretch, flex and grab some serious attention. Here’s what I see happening:
- GPGPUs blur the CPU/GPU boundary
NVIDIA Tegra K1 started the trend by putting a quad-core ARM processor on the die with a GPGPU. Tegra X1 will continue the trend (dual quad-core ARMs and a new GPU). Now the benefits of GPGPU can be reaped without discrete host processors, paving the way to lower-power system designs. Outright performance may not match that of discrete GPGPUs, but the performance per watt is impressive, as is the small size of the designs its System-on-Chip nature makes possible. Combine that with the ability to run Linux and program in CUDA or OpenCL, and it’s a winner for some applications.
- CPUs blur the CPU/GPU boundary
Integrated graphics from both Intel and AMD continue to improve, and offer decent GPGPU performance with tight integration with the CPU. Mobile fourth-gen i7 devices now sport GT2 graphics with 20 execution units versus the 16 of third-gen devices. Intel has not announced some of the relevant fifth-gen Broadwell devices, but a linear progression to 24 execution units would be an educated guess. Raw GFLOPS and GFLOPS/W are not at the levels that discrete GPUs can offer, but they are certainly headed in the right direction, and the tight coupling helps to offset some of the performance gap. Now if we can only unlock those GFLOPS under Linux…
- FPGAs blur the CPU/FPGA boundary
Altera’s Stratix 10 incorporates a quad-core ARM (seems to be a common thread here…), dramatically increases floating point performance (to a claimed 10 TFLOPS), and leverages Intel’s cutting-edge 14nm process technology. Xilinx’s UltraScale FPGAs boast 5 TFLOPS and the ubiquitous ARM cores. We have come a long way from the initial baby steps of low-end PowerPCs hard-wired onto FPGAs. The ARM cores incorporated today are decent performers, capable of running well-featured embedded operating systems. Gone, too, are the days when FPGAs were limited to fixed point operation. Back then, the gate count that full floating point math required, and the relatively low performance that was achievable, generally ruled it out. Not so now… 5–10 TFLOPS should be enough to grab anyone’s interest.
- CPUs blur the CPU/FPGA boundary
Is Intel’s announcement of combining a Xeon CPU with an FPGA in the same package, with cache coherency, a sign of things to come? It’s probably a little early to say, but the world has changed since we saw the nascent use of FPGAs as attached accelerators to general purpose processors. Heck, even Microsoft uses FPGAs to accelerate Bing searches these days. How relevant is that to our niche of the embedded world? The answer is probably “somewhat,” and only indirectly. It’s OK to buddy up a CPU and FPGA. It’s even nicer to do that with cache coherency. What really makes the whole thing sing, though, is the toolchain. If Intel does a nice job on that (and when didn’t they?), things get interesting, and programming becomes easier and more productive. This might be another example of the embedded world riding on the coattails of commercial, volume markets.
- OpenCL blurs the CPU/GPU/FPGA boundary
And on the subject of ease of programming and productivity… Who likes having to program their heterogeneous radar processor in a mix of VHDL, CUDA and C++? Stove-piped development teams might, but anyone interested in cost of development doesn’t. There are a number of approaches out there, but the one that seems to hold the best prospects is OpenCL. It is an open standard, independent of vendor and of parallel architecture, and it is available today for CPUs, GPUs, FPGAs, and DSP chips. Now of course things are not that simple. You can’t take a piece of OpenCL code written for a GPU, move it to an FPGA and expect great performance out of the box. There are a number of architectural dials to turn to tune for different types of processors, but it sure beats a complete rewrite, provided you can get adequate performance. Maybe 2015 will be the year when OpenCL turns that corner and truly lives up to expectations.
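To make the "one source, many targets" idea a little more concrete, here is a minimal sketch of what an OpenCL kernel looks like. The kernel itself (`saxpy`) is ordinary OpenCL C and names no particular device. Everything else is an assumption made for illustration: the shim macros, `shim_work_item` and `run_saxpy` are hypothetical helpers, not part of OpenCL, and exist only so the same source compiles and runs as plain C on a CPU with no OpenCL runtime installed.

```c
#ifndef __OPENCL_VERSION__
/* Hypothetical shim, NOT part of OpenCL: lets the kernel below build as
   plain C. A real OpenCL compiler defines __OPENCL_VERSION__ and supplies
   the kernel/global qualifiers and get_global_id() itself. */
#include <assert.h>
#include <stddef.h>
#define kernel
#define global
static size_t shim_work_item;  /* index of the work-item being emulated */
static size_t get_global_id(unsigned int dim) { (void)dim; return shim_work_item; }
#endif

/* The portable part: each work-item updates one element, y[i] = a*x[i] + y[i].
   Nothing here says whether it runs on a CPU, GPU, FPGA or DSP. */
kernel void saxpy(const float a, global const float *x, global float *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

#ifndef __OPENCL_VERSION__
/* Hypothetical stand-in for an OpenCL kernel launch: runs the n work-items
   one after another on the host instead of in parallel on a device. */
static void run_saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (shim_work_item = 0; shim_work_item < n; ++shim_work_item)
        saxpy(a, x, y);
}
#endif
```

On a real platform the host program would hand the kernel source to the OpenCL runtime and launch `n` work-items rather than call `run_saxpy`. The dials mentioned above, such as work-group size, vector width and memory layout, are where the per-device tuning effort would typically go, while a simple kernel like this one can stay largely as it is.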