LiteReconfig at EuroSys 2022: Cost and Content-Aware Video Object Detection for Mobile GPUs

Object detection is arguably one of the central problems in computer vision. Much progress has been made over the past few years in deep-learning-based object detectors. Despite their impressive accuracy on standard benchmarks, these models pay for it in complexity and computational cost. This is a major barrier to deploying them in resource-constrained settings with strict latency requirements, such as real-time detection in streaming videos on mobile or embedded devices.

Prior Advances

Several recent works have addressed this challenge by designing light-weight models for mobile devices [CVPR-18, ICCV-19, CVPR-19, CVPR-19-2, CVPR-20], in particular for video object detection [CVPR-18, MLSys-19, AAAI-19]. The common assumption is that object detectors optimized for accuracy, such as Faster R-CNN with a ResNet-50 backbone, are too expensive for mobile vision. Indeed, detectors optimized for accuracy are rather complex, often trained with different input resolutions, and equipped with multiple stages (e.g., proposal generation).

It is perhaps not surprising, then, that these detectors can adapt to different settings at inference time. Consider the example of Faster R-CNN [NeurIPS-15]: one can reduce the input resolution or the number of proposals to lower the latency while still maintaining reasonable accuracy. The combinations of such tunable parameters constitute a multi-branch execution kernel (MBEK). In our example, the Faster R-CNN detector with a specific input resolution and a particular number of proposals is one execution branch.
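
To make the notion of a branch concrete, here is a minimal sketch in Python. It assumes just the two knobs mentioned above; the `Branch` class, the `enumerate_branches` helper, and the knob values are hypothetical and are not taken from the LiteReconfig code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Branch:
    """One execution branch: a fixed setting of every tunable knob."""
    resolution: int      # input frame size in pixels (hypothetical values below)
    num_proposals: int   # number of region proposals kept by the detector


def enumerate_branches(resolutions, proposal_counts):
    """Return every combination of the knob values as a list of branches."""
    return [Branch(r, p) for r, p in product(resolutions, proposal_counts)]


# Hypothetical knob values, chosen only for illustration.
branches = enumerate_branches([224, 320, 448, 576], [1, 10, 100])
print(len(branches))  # 4 resolutions x 3 proposal counts = 12 branches
```

Even two knobs with a handful of values each give a dozen branches, which is why choosing among them at runtime becomes a scheduling problem.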

The Gap

We have shown in earlier work (ApproxDet, Sensys-20), as have others [MobiCom-18, MLSys-19], that one can expose a large number of execution branches in an existing kernel and schedule the appropriate branch at runtime to meet a user-provided accuracy-latency operating point. But a key point that all prior work overlooked is that the latency budget needs to be carefully apportioned between the adaptive (MBEK) vision system itself (the actual application) and the overhead of the scheduler.
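
The sketch below illustrates why that apportioning matters. It is a simplified model, not the scheduler from the paper: the function `pick_branch`, the branch names, and all accuracy and latency numbers are made up for illustration.

```python
def pick_branch(profiles, latency_budget_ms, scheduler_overhead_ms):
    """Pick the most accurate branch that fits in what the scheduler leaves over.

    profiles maps a branch name to (predicted_accuracy, predicted_latency_ms).
    """
    remaining_ms = latency_budget_ms - scheduler_overhead_ms
    feasible = {name: prof for name, prof in profiles.items()
                if prof[1] <= remaining_ms}
    if not feasible:
        # Nothing fits: fall back to the fastest branch.
        return min(profiles, key=lambda name: profiles[name][1])
    return max(feasible, key=lambda name: feasible[name][0])


# Hypothetical per-frame profiles: (accuracy, latency in ms).
profiles = {"light": (0.40, 15.0), "medium": (0.48, 28.0), "heavy": (0.55, 45.0)}

cheap_scheduler = pick_branch(profiles, latency_budget_ms=33.3, scheduler_overhead_ms=2.0)
costly_scheduler = pick_branch(profiles, latency_budget_ms=33.3, scheduler_overhead_ms=8.0)
print(cheap_scheduler, costly_scheduler)
```

With the same 33.3 ms budget, the costlier scheduler leaves room only for the lighter, less accurate branch.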

To skirt this problem, previous work kept the scheduler overhead low through approaches that turn out to be misguided in some situations. For example, it used only computationally light video features (e.g., height, width, number of objects, or intermediate results of the execution kernel) to decide which branch to run. Such features might not be sufficiently informative. Other features or models, such as motion and appearance features of the video, can improve decision making, but are typically too heavy-weight on mobile GPUs to extract and then use in prediction models.

The second key fact that all prior work overlooked is that if conditions change frequently (content characteristics, or contention on the device due to other co-located applications), the scheduler incurs high switching overhead between execution branches. Thus, a cost-aware scheduler should tamp down the frequency of reconfigurations based on the switching cost, which itself can vary depending on the execution branch.
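
One simple way to picture such a rule, under assumptions of our own choosing for this post: spread the one-time switching cost over the frames the new branch is expected to serve, and switch only if the branch still fits the budget and is more accurate. The function `should_switch` and every number below are hypothetical, not the policy used in LiteReconfig.

```python
def should_switch(current_acc, candidate_acc, candidate_latency_ms,
                  switch_cost_ms, frames_until_next_decision, latency_budget_ms):
    """Decide whether moving to the candidate branch is worth its switching cost."""
    # Amortize the one-time switching cost over the frames the branch will serve.
    amortized_ms = switch_cost_ms / frames_until_next_decision
    fits_budget = candidate_latency_ms + amortized_ms <= latency_budget_ms
    return fits_budget and candidate_acc > current_acc


# Hypothetical numbers: a 40 ms switch, a 30 ms candidate branch, a 33.3 ms budget.
short_horizon = should_switch(0.45, 0.50, 30.0, 40.0, 8, 33.3)
long_horizon = should_switch(0.45, 0.50, 30.0, 40.0, 20, 33.3)
print(short_horizon, long_horizon)
```

When conditions change quickly, the horizon is short, the amortized cost is high, and the rule declines to reconfigure.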

Our Solution: LiteReconfig

Our work, LiteReconfig, solves these problems to produce an adaptive object detection system for embedded boards with mobile GPUs. LiteReconfig performs a cost-benefit analysis that allows it to decide, at any point in a video stream, which execution branch to select. A schematic of LiteReconfig is shown in Figure 1. The cost-benefit analyzer factors in the latency cost and the benefit (in terms of accuracy) of using computationally heavy content features. By selectively enabling content features and models, the system characterizes the accuracy of the MBEK in a content-aware manner so as to select a more accurate branch, tailored to the video content. Furthermore, LiteReconfig weighs the cost and benefit of switching execution branches when conditions change. Through careful design, we ensure that the overhead of using a content feature extractor and the corresponding model is minimal, so as not to erase the gains from the optimization.
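
The trade-off can be sketched as a small search over (feature, branch) pairs. This is a toy model under stated assumptions: each content feature is summarized by an extraction cost and a flat accuracy gain, and `best_plan`, the feature names, and all numbers are hypothetical rather than taken from the paper.

```python
def best_plan(branches, features, budget_ms):
    """Return (expected_accuracy, feature_name, branch_name) with the best accuracy.

    branches maps a name to (accuracy, latency_ms).
    features maps a name to (extraction_cost_ms, accuracy_gain).
    """
    best = None
    for feature_name, (cost_ms, gain) in features.items():
        remaining_ms = budget_ms - cost_ms  # the feature eats into the budget
        feasible = [(acc + gain, branch_name)
                    for branch_name, (acc, latency_ms) in branches.items()
                    if latency_ms <= remaining_ms]
        if not feasible:
            continue
        accuracy, branch_name = max(feasible)
        if best is None or accuracy > best[0]:
            best = (accuracy, feature_name, branch_name)
    return best


# Hypothetical profiles for illustration only.
branches = {"light": (0.40, 15.0), "medium": (0.48, 28.0), "heavy": (0.55, 45.0)}
features = {
    "none": (0.0, 0.00),               # content-agnostic scheduling
    "detector_features": (3.0, 0.03),  # reuses what the detector already computes
    "external_cnn": (12.0, 0.04),      # separate network: more gain, more cost
}

plan = best_plan(branches, features, budget_ms=33.3)
print(plan)
```

In this toy setting the expensive feature forces a lighter branch and ends up worse than using no content feature at all, while the cheap feature wins.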

Figure 1: An illustration of our proposed cost-aware adaptive framework for video object detection. Our scheduler uses its cost-benefit analysis to decide which features to use and then decides which execution branch to run for detection. The multi-branch execution kernel (MBEK) can be provided by any adaptive vision algorithm for mobile devices, and we build on top of several mainstream object detection and tracking algorithms.

The Proof is In the Pudding

We evaluate our approach on the ImageNet VID 2015 benchmark and compare with SSD [ECCV-16] and YOLOv3 [arXiv-18], which we enhance with tuning knobs so that they can run at different points in the latency-accuracy spectrum (e.g., the shape of the video frame and the size of the Group-of-Frames (GoF)). We also compare to a recent adaptive model [ApproxDet, Sensys-20] with the Faster R-CNN backbone. The evaluation uses the Jetson TX2 and Jetson AGX Xavier boards with mobile GPUs. LiteReconfig improves accuracy by 1.8% to 3.5% mean average precision (mAP) over state-of-the-art (SOTA) adaptive object detection systems under an identical latency objective. Under contention for the GPU resource, the SSD and YOLOv3 baselines completely fail to meet the latency objective. Compared to three recent accuracy-focused object detection systems, SELSA [ICCV-19], MEGA [CVPR-20], and REPP [IROS-20], LiteReconfig is 74.9×, 30.5×, and 20.3× faster, respectively, on the Jetson TX2 board.

The full implementation of LiteReconfig has been evaluated, and all results reproduced, by the EuroSys Artifact Evaluation Committee. It is available from here: DOI.

LiteReconfig is able to satisfy even stringent latency objectives, 30 fps on the weaker TX2 board, and 50 fps on the higher performing AGX Xavier board.

Motivation for the cost-benefit analysis. We plot the accuracy vs. latency curve for three different strategies. Without a careful design, a content-aware strategy can be either better (e.g., ResNet) or worse (e.g., MobileNet) than a content-agnostic one. Here, the ResNet-50 features come from the object detector itself and thus have a lower cost than using an external MobileNet, making them the winning option.

The Human Story

Like all good stories, this one has an essential human element as well. The lead author, PhD student Ran Xu, came to me about a year before graduation and said he would be ready to graduate soon, say within 6 months. At that time, he had an inkling of this idea and had sketched out the beginnings of its design. But as we know, there are many late nights or early mornings of work involved, and many missteps. Ran said he would get it done. I thought to myself that this was the sunny and misplaced optimism of one not yet weathered by multiple rejections.

Ran marshalled his talent. He marshalled his troops, a committed group of two other, newer PhD students, Jay and Pengcheng. He asked the right questions of me, Somali, and our CV collaborator from Wisconsin, Yin. And in about nine months, we had a submission, the kind that makes you feel strong inside and fills you with quiet confidence. And that led to the EuroSys paper, and my first international travel in eons.

Ran (right), who tamed EuroSys and CVPR in the space of a year. And then graduated to go work for NVIDIA.
