Improving glass-to-glass latency

8 January 2020

Imagine looking in a mirror and seeing your reflection delayed by a few seconds. It would be very disorienting, although it might be entertaining to watch someone try to swat an annoying mosquito using only that delayed reflection.

The same problem arises when cameras are used for tasks such as object tracking in combination with manual or autonomous drive commands. The lower the latency in obtaining the image, the more precise the drive control reactions can be. Glass-to-glass latency, the time from light entering the camera lens to the resulting image appearing on the display, is the quantity that needs to be minimized as much as possible. Minimizing it also leaves more time for better image processing, such as object detection using a convolutional neural network (CNN).

At first, it might be assumed that any modern low-cost camera, such as a USB camera, would have minimal latency. The fundamental issues are that these camera interfaces buffer multiple images and place the resulting ‘old’ image data into CPU memory. Since CPUs are not well suited for 2D image processing, the image is typically copied over to a GPU so it can be processed quickly, which further increases latency.

The above diagram shows how a camera coordinates with the CPU to deliver a buffered (i.e. ‘old’) image into CPU memory. Then, in a subsequent step, the CPU and GPU coordinate a transfer to GPU memory. Once the image is in GPU memory, processing and display can be very fast, but significant latency has already been incurred by that point.
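To make the cost of buffering concrete, the following Python sketch adds up the stages of the buffered path described above. It is a simplified model, not a measurement: the function name is hypothetical, and the frame rate, buffer depth, and copy time are illustrative assumptions rather than figures for any particular camera or system.

```python
def buffered_path_latency_ms(fps, buffered_frames, cpu_to_gpu_copy_ms):
    """Rough latency from start of capture until the frame is in GPU memory.

    Assumptions (illustrative only):
      - capturing one frame takes one full frame period
      - each buffered frame adds one more frame period of delay
      - the CPU-to-GPU copy is a fixed cost per frame
    """
    frame_period_ms = 1000.0 / fps
    capture_ms = frame_period_ms
    buffering_ms = buffered_frames * frame_period_ms
    return capture_ms + buffering_ms + cpu_to_gpu_copy_ms


# Assumed example: 60 Hz camera, 3 frames of buffering, 2 ms copy to the GPU.
latency = buffered_path_latency_ms(fps=60, buffered_frames=3, cpu_to_gpu_copy_ms=2.0)
print(f"{latency:.1f} ms before GPU processing can start")  # 68.7 ms
```

Under these assumed numbers, most of the delay comes from the buffered frames, not from the copy itself.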

Direct-to-GPU to the rescue

The ideal solution is to have the camera bypass the CPU entirely and move image data directly into GPU memory without any internal frame buffering. That way, as soon as the last pixel of the image has been captured, GPU processing can start immediately. There’s no need to wait for any buffering or copying.

This can be achieved via combinations of specialized technologies:

  • Camera: one with an interface such as HD-SDI, Camera Link, HDMI, or CSI.
  • Capture card: specifically designed for one or more of the above camera interfaces. This is typically an FPGA that communicates via PCIe to the GPU.
  • GPU: one that allows direct memory access (DMA) into GPU memory, for example through NVIDIA’s GPUDirect.

With this combination, the diagram of getting an image frame into the GPU changes dramatically:

This is very efficient, since the CPU and GPU can be busy doing other processing while the DMA is occurring in the background. This method minimizes the latency of the video stream. A great side effect is that it also provides the maximum amount of time to do high performance GPU-based image processing and displaying.
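The benefit of letting the DMA run in the background can be sketched with a simple steady-state model. The Python below is an illustration under stated assumptions, not a description of any specific hardware: the function names are hypothetical, the transfer and processing times are made up, and the model assumes the transfer and the GPU processing do not slow each other down.

```python
def frame_time_ms(transfer_ms, process_ms, overlapped):
    """Time spent per frame in steady state.

    overlapped=False: the transfer must finish before processing starts,
                      so the two costs add up.
    overlapped=True:  the next frame is transferred while the current one
                      is processed, so the slower stage sets the pace.
    """
    if overlapped:
        return max(transfer_ms, process_ms)
    return transfer_ms + process_ms


def max_fps(frame_time):
    """Highest sustainable frame rate for a given per-frame time."""
    return 1000.0 / frame_time


# Assumed example: 4 ms to move a frame, 14 ms of GPU processing.
serial = frame_time_ms(4.0, 14.0, overlapped=False)
pipelined = frame_time_ms(4.0, 14.0, overlapped=True)
print(f"serial:    {serial:.0f} ms/frame, {max_fps(serial):.1f} fps")
print(f"pipelined: {pipelined:.0f} ms/frame, {max_fps(pipelined):.1f} fps")
```

With these assumed numbers, the serial case takes 18 ms per frame and cannot keep up with a 60 Hz source, while the overlapped case takes 14 ms per frame and can.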

Abaco’s GR4 is a 3U VPX board containing both a capture card and a GPU for high-performance processing. It has four HD-SDI inputs and four HD-SDI outputs capable of full HD resolution at 60 Hz.
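For a sense of scale, the amount of pixel data involved can be estimated with simple arithmetic. The sketch below assumes 1920x1080 progressive frames at 60 frames per second and 2 bytes per pixel in memory. The pixel format and the function name are assumptions for illustration, not a specification of the GR4.

```python
def stream_bytes_per_second(width, height, fps, bytes_per_pixel):
    """Raw pixel payload of one uncompressed video stream."""
    return width * height * fps * bytes_per_pixel


# Assumed format: full HD, 60 frames per second, 2 bytes per pixel.
per_stream = stream_bytes_per_second(1920, 1080, 60, 2)
four_inputs = 4 * per_stream

print(f"{per_stream / 1e6:.1f} MB/s per stream")        # 248.8 MB/s
print(f"{four_inputs / 1e6:.1f} MB/s for four inputs")  # 995.3 MB/s
```

Under these assumptions, four inputs together move roughly 1 GB of pixel data per second, which is the kind of load that benefits from being moved by DMA without per-frame copies through CPU memory.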

Abaco’s ImageFlex software can be used to simplify GPU processing and display. It simplifies displaying with OpenGL, handling keyboard and mouse input, and interoperating with other GPU technologies such as CUDA, OpenCL, and CNNs. ImageFlex supports the GR4’s video input and processing.

Michael Both

As a software architect, Michael’s career has focused on innovations in visualization and communication software. At Abaco, he’s been responsible for the development of DataView, EventView, AXISFlow, and AXISView. As a hobby, he started developing software for Atari back in the 80s. When the iPhone came out, he created the official Rubik’s Cube app, which five years later was featured in an Apple TV commercial (https://www.youtube.com/watch?v=ajj2-KYQ0R0). And yes, he can solve it in under 30 seconds. Bernie refers to him as Rubik’s Mike.