Continuous computer vision algorithms that can run on mobile, embedded, edge, or (take your pick) any other resource-constrained platform have seen a great outpouring of work. This post is a look at how this field has been marching along, seen through the eyes of a computer systems person, as opposed to a computer vision person.
The tasks that are typically supported are object classification, object detection, activity recognition, and object re-identification, all running on resource-constrained platforms. The works in this realm are appearing in the top computer systems conferences as well as the top computer vision conferences. They are an exemplar of a very desirable outcome: two communities within Computer Science coming together to create progress at a rate that would not have been possible in isolation.
Lessons for other areas in CS?
I see two meta-lessons to derive from this line of work. Deriving lessons is in order, given how rapidly the area has progressed and how it has drawn a large and rapidly increasing number of adherents.
- It is possible for two communities to learn from each other without any seismic shift in policy or reward structure.
- Incrementalism is not a dirty word. The works in this area build on top of each other (some may argue rather incrementally; I do not), and the net result has been a rather impressive gain, measured in terms of the efficiency of the algorithms, the accuracy of the algorithms, and the maturity and robustness of the code base.
One external factor explains this desirable outcome: compelling application challenges motivate the technical work, and where we started 4 years back (one can argue with reason that the MCDNN work in MobiSys 2016 was the first credible effort in this space), there was a huge gap between the capability at the time and the goal. The challenges come from three applications:
- Video cameras: These are becoming ubiquitous (a nightmare for privacy advocates, but that is a topic for another day), and their bandwidth-heavy streams cannot all be brought back to the cloud and processed. Hence the need for streaming video analytics at the edge.
- Autonomous vehicles: The automotive industry has been moving slowly and steadily up the rungs of the levels of autonomy. Autonomy demands fast processing of sensor data, which includes camera data, and hence the technical challenge.
- Augmented reality/Virtual reality: These need very fast processing on the AR/VR equipment itself, and hence the unmet need.
A view of the current landscape
Image vs Video: Most of the work is done on images, with a small but growing line of work that leverages video characteristics, such as continuity between successive frames.
And within this scope, what are the tasks that are being attempted?
Object classification. This is the task of classifying an object in a frame into one of a preconfigured set of classes. Early on this was done with multiple models, each catering to a particular point in the accuracy vs. latency tradeoff (typically, the higher the accuracy, the higher the latency incurred). Nowadays, because storing and loading multiple models is expensive, people have been doing this by adapting within a single model.
Object detection. Here you want to detect where the objects are in the frame, usually in the form of a bounding box. An overwhelmingly popular detector is Faster R-CNN, which was introduced in a NeurIPS 2015 paper.
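To make the accuracy vs. latency tradeoff concrete, here is a minimal sketch of picking the most accurate model variant that still fits a per-frame latency budget. The `Variant` type, the `pick_variant` helper, and all the accuracy and latency numbers are made up for illustration; they do not come from any of the papers discussed here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Variant:
    """One operating point of a classifier (all values are hypothetical)."""
    name: str
    accuracy: float    # fraction of frames classified correctly, 0..1
    latency_ms: float  # per-frame latency on the target device


def pick_variant(variants: Sequence[Variant], budget_ms: float) -> Optional[Variant]:
    """Return the most accurate variant whose latency fits the budget, or None."""
    feasible = [v for v in variants if v.latency_ms <= budget_ms]
    if not feasible:
        return None  # nothing fits; the caller must drop frames or relax the budget
    return max(feasible, key=lambda v: v.accuracy)


# Illustrative operating points only, not measurements.
VARIANTS = [
    Variant("small", accuracy=0.60, latency_ms=8.0),
    Variant("medium", accuracy=0.70, latency_ms=20.0),
    Variant("large", accuracy=0.76, latency_ms=45.0),
]

if __name__ == "__main__":
    print(pick_variant(VARIANTS, budget_ms=33.3))
```

Whether the operating points are separate models or settings inside a single adaptive model, the selection logic looks the same; what changes is how cheap it is to switch between them.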
Object tracking. Here you want to track the objects as they move about in the frame. The trackers, like Median Flow, are typically lighter weight than the object detection algorithms.
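Because trackers are lighter than detectors, a common pattern is to run the heavy detector only on some frames and let the tracker carry the boxes forward in between. The sketch below shows that control flow only. The `detect` and `track` callables are placeholders that you would supply (they are not a real library API), and `detect_every` is an assumed tuning knob.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Box = Tuple[int, int, int, int]  # (x, y, width, height)


def detect_and_track(
    frames: Iterable[Frame],
    detect: Callable[[Frame], Sequence[Box]],
    track: Callable[[Frame, Frame, Sequence[Box]], Sequence[Box]],
    detect_every: int = 8,
) -> List[List[Box]]:
    """Run `detect` on every `detect_every`-th frame and `track` on the rest."""
    if detect_every < 1:
        raise ValueError("detect_every must be at least 1")
    results: List[List[Box]] = []
    prev_frame = None
    boxes: Sequence[Box] = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            boxes = detect(frame)  # expensive: fresh boxes from the detector
        else:
            boxes = track(prev_frame, frame, boxes)  # cheap: propagate old boxes
        results.append(list(boxes))
        prev_frame = frame
    return results
```

A larger `detect_every` lowers the average per-frame cost but lets tracking errors accumulate for longer before the detector corrects them.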
Object re-identification. This involves re-identifying a given object in a later frame, or in a frame from a different camera. A typical use case is the classic police investigation story where the car of a suspect must be found in reams of video footage.
The battle here is among accuracy, latency, and fitting within the resource budget. It is an unfolding battle with quick gains being made. We want to achieve the accuracy of the state-of-the-art video analytics algorithms, but we want them to run on embedded devices, which have GPUs that are far weaker than those on servers (think Pascals instead of the beefier V100s), and we want to keep up with the video streaming rate (30 frames per second).
The current state of the battle is that latency has been going down, and the desired frame rate has been met for video object classification and is close for video object detection, but there are caveats here too. For activity recognition, we are still miles away from reaching that goal. The caveat with respect to the first two is that if your device has any dynamic fluctuations in the amount of resources available to the task at hand (say another task starts running concurrently for an unrelated reason), then the latency guarantees get thrown out the window.
Here are three papers that capture the current state of the battle royale. With an admission of self-promotion, here they are.
- “ApproxDet: Content and Contention-Aware Approximate Object Detection for Mobiles,” Ran Xu, Chen-lin Zhang (Nanjing University), Pengcheng Wang, Jayoung Lee, Subrata Mitra (Adobe Research), Somali Chaterji, Yin Li (U of Wisconsin at Madison), and Saurabh Bagchi. SenSys 2020.
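The 30 frames per second target translates directly into a per-frame time budget of about 33 ms. The small helpers below (names are my own, purely illustrative) do that arithmetic and check whether a given per-frame latency keeps up, assuming frames are processed one at a time with no pipelining.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def keeps_up(latency_ms: float, fps: float = 30.0) -> bool:
    """True if processing one frame takes no longer than the frame interval."""
    return latency_ms <= frame_budget_ms(fps)


def max_sustainable_fps(latency_ms: float) -> float:
    """Highest frame rate a strictly sequential pipeline can sustain."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    print(f"budget at 30 fps: {frame_budget_ms(30.0):.1f} ms")
    print(f"50 ms per frame keeps up: {keeps_up(50.0)}")
    print(f"50 ms per frame sustains {max_sustainable_fps(50.0):.0f} fps")
```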
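One simple way to react to such fluctuations is to measure the latency of recent frames and step to a cheaper configuration when the budget is blown. The sketch below is a naive reactive heuristic of my own for illustration, not the method of any paper listed in this post. The function name, the level ordering, and the `headroom` threshold are all assumptions.

```python
def adapt_level(
    level: int,
    observed_ms: float,
    budget_ms: float,
    num_levels: int,
    headroom: float = 0.7,
) -> int:
    """Pick the next configuration level from the last observed latency.

    Levels are ordered from 0 (cheapest, least accurate) to
    num_levels - 1 (most expensive, most accurate).
    """
    if observed_ms > budget_ms and level > 0:
        return level - 1  # over budget, e.g. due to contention: step down
    if observed_ms < headroom * budget_ms and level < num_levels - 1:
        return level + 1  # plenty of slack: try a more accurate level
    return level  # within budget without much slack: stay put


if __name__ == "__main__":
    level = 2
    # Hypothetical latencies: a contending task appears, then goes away.
    for observed in [30.0, 50.0, 48.0, 20.0, 18.0]:
        level = adapt_level(level, observed, budget_ms=33.3, num_levels=3)
        print(observed, "->", level)
```

A purely reactive rule like this only responds after a deadline has already been missed, and it can oscillate between levels, which is part of why contention-aware approaches are an active topic.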
- “MCUNet: Tiny Deep Learning on IoT Devices,” Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, Song Han (MIT). NeurIPS 2020.
- “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen (Google). CVPR 2018.
Our SenSys paper is in object detection, and its claim to fame (or what passes for fame in our circles) is that it can tolerate bumps in the road, like unpredictable contention on the device or the video stream suddenly becoming too complicated, say chock-full of objects. For a video demo of our video object detection, take a look here. The comparison point is the state-of-the-art object detection algorithm called Faster R-CNN (23,000+ citations in 5 years cannot be wrong, can it?) coupled with an object tracker (called Median Flow).
This topic of video analytics on mobile or embedded devices will continue to see significant activity in both computer vision and systems conferences. This is one of those rare topics where three stars have lined up: there are compelling commercial applications (autonomous transportation), there are technology inflections in hardware (embedded and mobile technology), and the technical challenges are fit for a large number of smart people to work away at them.