ThunderNet: Towards Real-time Generic Object Detection

Abstract
Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which hinders them from real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks of previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representations, we design two efficient architecture blocks, Context Enhancement Module and Spatial Attention Module. Finally, we investigate the balance between the input resolution, the backbone, and the detection head. Compared with lightweight one-stage detectors, ThunderNet achieves superior performance with only 40% of the computational cost on PASCAL VOC and COCO benchmarks. Without bells and whistles, our model runs at 24.1 fps on an ARM-based device. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Code will be released for paper reproduction.

Introduction

Real-time generic object detection on mobile devices is a crucial but challenging task in computer vision. Compared with server-class GPUs, mobile devices are computation-constrained and impose stricter limits on the computational cost of detectors. However, modern CNN-based detectors are resource-hungry and require massive computation to achieve ideal detection accuracy, which hinders them from real-time inference in mobile scenarios.

From the perspective of network structure, CNN-based detectors can be divided into the backbone part, which extracts features from the image, and the detection part, which detects object instances in the image. In the backbone part, state-of-the-art detectors are inclined to exploit huge classification networks (e.g., ResNet-101 [10, 4, 16, 17]) and large input images (e.g., 800×1200 pixels), which incur massive computational cost. Recent progress in lightweight image classification networks [3, 33, 20, 11, 28] has facilitated real-time object detection [11, 28, 14, 20] on GPU. However, there are several differences between image classification and object detection; e.g., object detection needs a large receptive field and low-level features to improve localization ability, which are less crucial for image classification. The gap between the two tasks restricts the performance of these backbones on object detection and obstructs further compression without harming detection accuracy.
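
To make the receptive field point concrete, here is a minimal sketch of the standard recurrence for the receptive field of a stack of convolutions. The layer stacks below are hypothetical and chosen only for illustration; they are not taken from the paper, and padding and dilation are ignored.

```python
def receptive_field(layers):
    """Receptive field of a plain stack of conv layers.

    layers: list of (kernel_size, stride) pairs, applied in order.
    Returns (receptive_field, cumulative_stride), both in input pixels.
    """
    rf, jump = 1, 1  # a single input pixel sees itself; step between outputs is 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap widens the view by the current step
        jump *= stride             # striding spreads later taps further apart
    return rf, jump


# Hypothetical stacks: three stride-2 layers with 3x3 vs. 5x5 kernels.
print(receptive_field([(3, 2)] * 3))  # (15, 8)
print(receptive_field([(5, 2)] * 3))  # (29, 8)
```

Both stacks downsample by the same factor of 8, but the larger kernels almost double the receptive field, which is the kind of property a detection backbone cares about more than a classification backbone does.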

In the detection part, CNN-based detectors can be categorized into two-stage detectors [27, 4, 16, 14] and one-stage detectors [24, 19, 25, 17]. For two-stage detectors, the detection part usually consists of a Region Proposal Network (RPN) [27] and the detection head (including RoI warping and the R-CNN subnet). RPN first generates RoIs, and then the RoIs are further refined through the detection head. State-of-the-art two-stage detectors tend to utilize a heavy detection part (e.g., over 10 GFLOPs [27, 10, 4, 16, 2]) for better accuracy, which is too expensive for mobile devices. Light-Head R-CNN [14] adopts a lightweight detection head and achieves real-time detection on GPU. However, when coupled with a small backbone, Light-Head R-CNN still spends more computation on the detection part than on the backbone, which leads to a mismatch between a weak backbone and a strong detection part. This imbalance not only induces great redundancy but also makes the network prone to overfitting.
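
The two-stage control flow described above (RPN proposes RoIs, the detection head refines each one) can be sketched as follows. This is only an illustration of the data flow: `two_stage_detect`, `toy_rpn`, `toy_head`, and the `top_k` parameter are hypothetical names introduced here, and the toy functions return fixed values rather than running any network.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Proposal = Tuple[Box, float]             # (box, objectness score)
Detection = Tuple[Box, str, float]       # (refined box, class label, class score)


def two_stage_detect(
    features: object,
    rpn: Callable[[object], List[Proposal]],
    head: Callable[[object, Box], Detection],
    top_k: int,
) -> List[Detection]:
    # Stage 1: the RPN proposes class-agnostic candidate regions (RoIs).
    proposals = rpn(features)
    rois = sorted(proposals, key=lambda p: p[1], reverse=True)[:top_k]
    # Stage 2: the head classifies and refines each RoI individually,
    # so its cost grows with the number of RoIs kept.
    return [head(features, box) for box, _ in rois]


# Toy stand-ins (assumptions for illustration only).
def toy_rpn(features: object) -> List[Proposal]:
    return [((0.0, 0.0, 10.0, 10.0), 0.9),
            ((5.0, 5.0, 20.0, 20.0), 0.2),
            ((2.0, 2.0, 12.0, 12.0), 0.7)]


def toy_head(features: object, roi: Box) -> Detection:
    x1, y1, x2, y2 = roi
    # Pretend refinement: shrink the box by one pixel on every side.
    return ((x1 + 1, y1 + 1, x2 - 1, y2 - 1), "object", 0.8)


detections = two_stage_detect(None, toy_rpn, toy_head, top_k=2)
print(detections)
```

Because the head runs once per RoI, both its per-RoI cost and the number of RoIs kept contribute to the weight of the detection part.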

On the other hand, one-stage detectors directly predict bounding boxes and class probabilities. The detection part of this category is composed of additional layers that generate predictions, which usually involve little computation. For this reason, one-stage detectors are widely regarded as the key to real-time detection. However, as one-stage detectors do not conduct RoI-wise feature extraction and recognition, their results are coarser than those of two-stage detectors. The problem is aggravated for lightweight detectors. Prior lightweight one-stage detectors [11, 28, 31, 13] do not obtain an ideal accuracy/speed trade-off: there is a huge accuracy gap between them and the large detectors [19, 25], and they still fail to achieve real-time detection on mobile devices. This inspires us to rethink: can two-stage detectors surpass one-stage detectors in real-time detection?

In this paper, we propose a lightweight two-stage generic object detector named ThunderNet. The design of ThunderNet targets the computationally expensive structures in state-of-the-art two-stage detectors. In the backbone part, we investigate the drawbacks of previous lightweight backbones and present a lightweight backbone named SNet designed for object detection. In the detection part, we follow the detection head design in Light-Head R-CNN and further compress the RPN and the R-CNN subnet. To eliminate the performance degradation induced by small backbones and small feature maps, we design two efficient architecture blocks, Context Enhancement Module (CEM) and Spatial Attention Module (SAM). CEM combines the feature maps from multiple scales to leverage local and global context information, while SAM uses the information learned in RPN to refine the feature distribution in RoI warping. Finally, we investigate the balance between the input resolution, the backbone, and the detection head. Fig. 2 illustrates the overall architecture of ThunderNet.

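Fig. 2: The overall architecture of ThunderNet. ThunderNet uses an input resolution of 320×320 pixels. The SNet backbone is based on ShuffleNetV2 and is specifically designed for object detection. In the detection part, the RPN is compressed, and the R-CNN subnet uses a 1024-d fc layer for better efficiency. Context Enhancement Module leverages semantic and context information from multiple scales. Spatial Attention Module introduces information from RPN to refine the feature distribution.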
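
As a rough illustration of the two blocks, the NumPy sketch below shows one way multi-scale features could be merged (CEM) and one way RPN features could reweight them (SAM). The text above does not specify the exact operations, so everything here is an assumption made for the sketch: the 1×1 projections, nearest-neighbor upsampling, summation, the sigmoid gate, the random weights, and the tiny tensor shapes. The names `conv1x1`, `upsample2x`, `context_enhancement`, and `spatial_attention` are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W); a 1x1 conv without bias.
    return np.einsum("oc,chw->ohw", w, x)


def upsample2x(x):
    # Nearest-neighbor 2x upsampling over the two spatial axes.
    return x.repeat(2, axis=1).repeat(2, axis=2)


def context_enhancement(c4, c5, w4, w5, wg):
    # Assumed CEM: project three scales to a common width and sum them.
    local = conv1x1(c4, w4)                                 # finer map, (C, H, W)
    coarse = upsample2x(conv1x1(c5, w5))                    # coarser map brought to (C, H, W)
    glb = conv1x1(c5.mean(axis=(1, 2), keepdims=True), wg)  # global context, (C, 1, 1)
    return local + coarse + glb                             # glb broadcasts over H, W


def spatial_attention(f_cem, f_rpn, w_att):
    # Assumed SAM: a sigmoid gate computed from RPN features rescales f_cem.
    gate = 1.0 / (1.0 + np.exp(-conv1x1(f_rpn, w_att)))     # values in (0, 1)
    return f_cem * gate


rng = np.random.default_rng(0)
c4 = rng.standard_normal((8, 4, 4))     # finer backbone feature map (toy size)
c5 = rng.standard_normal((16, 2, 2))    # coarser backbone feature map (toy size)
f_rpn = rng.standard_normal((6, 4, 4))  # intermediate RPN features (toy size)

f_cem = context_enhancement(
    c4, c5,
    rng.standard_normal((5, 8)),
    rng.standard_normal((5, 16)),
    rng.standard_normal((5, 16)),
)
f_sam = spatial_attention(f_cem, f_rpn, rng.standard_normal((5, 6)))
print(f_cem.shape, f_sam.shape)
```

In this sketch the gate can only attenuate features, so locations the RPN scores low are suppressed before RoI warping while the spatial size of the feature map is unchanged.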

ThunderNet surpasses prior lightweight one-stage detectors with significantly less computational cost on PASCAL VOC [5] and COCO [18] benchmarks. ThunderNet outperforms Tiny-DSOD [13] with only 42% of the computational cost and obtains gains of 6.5 mAP on VOC and 4.8 AP on COCO under similar complexity. Without bells and whistles, ThunderNet runs in real time on ARM (24.1 fps) and x86 (47.3 fps) with MobileNet-SSD level accuracy. To the best of our knowledge, this is the first real-time detector and the fastest single-thread speed reported on ARM platforms. These results demonstrate the effectiveness of two-stage detectors in real-time object detection.

posted @ 2020-04-01 21:01 nannanZhang

My knowledge is limited; if you find any mistakes, please point them out. Many thanks!