Abstract
Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point — the center point of its bounding box. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors.
Introduction
In this paper, we provide a much simpler and more efficient alternative. We represent objects by a single point at their bounding box center (see Figure 2). Other properties, such as object size, dimension, 3D extent, orientation, and pose are then regressed directly from image features at the center location. Object detection is then a standard keypoint estimation problem [3,39,60]. We simply feed the input image to a fully convolutional network [37,40] that generates a heatmap. Peaks in this heatmap correspond to object centers. Image features at each peak predict the object's bounding box height and width. The model trains using standard dense supervised learning [39,60]. Inference is a single network forward-pass, without non-maximal suppression for post-processing.
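To make this pipeline concrete, here is a minimal decoding sketch in PyTorch. This is not the authors' released code: the tensor layout, the `top_k` cutoff, and the output conventions (a class heatmap plus per-pixel size and offset maps, introduced in the sections below) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def decode_detections(heatmap, size_map, offset_map, top_k=100, stride=4):
    """Decode boxes from CenterNet-style outputs.

    heatmap:    (C, H, W) per-class center scores in [0, 1]
    size_map:   (2, H, W) predicted box width/height (raw pixels)
    offset_map: (2, H, W) predicted sub-pixel center offsets
    Returns a (top_k, 6) tensor: x1, y1, x2, y2, score, class id.
    """
    C, H, W = heatmap.shape
    # Keep only local peaks: a 3x3 max-pool stands in for NMS.
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    heatmap = heatmap * (heatmap == pooled).float()

    # Top-k peaks over all classes and locations.
    scores, flat_idx = heatmap.reshape(-1).topk(top_k)
    cls = torch.div(flat_idx, H * W, rounding_mode='floor')
    ys = torch.div(flat_idx % (H * W), W, rounding_mode='floor')
    xs = flat_idx % W

    # Read size and offset at each peak, then map back to input pixels.
    w, h = size_map[0, ys, xs], size_map[1, ys, xs]
    dx, dy = offset_map[0, ys, xs], offset_map[1, ys, xs]
    cx, cy = (xs.float() + dx) * stride, (ys.float() + dy) * stride

    return torch.stack([cx - w / 2, cy - h / 2,
                        cx + w / 2, cy + h / 2,
                        scores, cls.float()], dim=1)
```

Keeping or dropping a detection is then just a matter of thresholding its peak score; there is no NMS pass.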

Related work 
Our approach is closely related to anchor-based one-stage approaches [33,36,43]. A center point can be seen as a single shape-agnostic anchor (see Figure 3). However, there are a few important differences. First, our CenterNet assigns the “anchor” based solely on location, not box overlap [18]. We have no manual thresholds [18] for foreground and background classification. Second, we only have one positive “anchor” per object, and hence do not need Non-Maximum Suppression (NMS) [2]. We simply extract local peaks in the keypoint heatmap [4,39]. Third, CenterNet uses a larger output resolution (output stride of 4) compared to traditional object detectors [21,22] (output stride of 16). This eliminates the need for multiple anchors [47].
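As a minimal illustration of the first two differences, the positive-"anchor" assignment can be written in a couple of lines: the only positive location for an object is the output-grid cell containing its center, with no overlap threshold involved. A sketch, assuming the output stride of 4 described above:

```python
def positive_location(box, stride=4):
    """Return the single positive 'anchor' cell for a ground-truth box.

    box: (x1, y1, x2, y2) in input-image pixels. The assignment depends
    on location alone; no IoU threshold separates foreground from
    background, and each object yields exactly one positive.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    return int(cx // stride), int(cy // stride)
```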
 
 
Object detection by keypoint estimation. We are not the first to use keypoint estimation for object detection. CornerNet [30] detects two bounding box corners as keypoints, while ExtremeNet [61] detects the top-, left-, bottom-, right-most, and center points of all objects. Both these methods build on the same robust keypoint estimation network as our CenterNet. However, they require a combinatorial grouping stage after keypoint detection, which significantly slows down each algorithm. Our CenterNet, on the other hand, simply extracts a single center point per object without the need for grouping or post-processing.
 
 
Preliminary

Let $I \in R^{W \times H \times 3}$ be an input image of width $W$ and height $H$. Our aim is to produce a keypoint heatmap $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $R$ is the output stride and $C$ is the number of keypoint types. Keypoint types include $C = 17$ human joints in human pose estimation [4,55], or $C = 80$ object categories in object detection [30,61]. We use the default output stride of $R = 4$ in literature [4,40,42]. The output stride downsamples the output prediction by a factor $R$. A prediction $\hat{Y}_{x,y,c} = 1$ corresponds to a detected keypoint, while $\hat{Y}_{x,y,c} = 0$ is background. We use several different fully-convolutional encoder-decoder networks to predict $\hat{Y}$ from an image $I$: a stacked hourglass network [30,40], up-convolutional residual networks (ResNet) [22,55], and deep layer aggregation (DLA) [58].
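For concreteness, the shapes involved might look as follows in PyTorch (channel-first NCHW layout, which is the usual tensor convention, rather than the W × H × C order of the notation above):

```python
import torch

R, C = 4, 80            # output stride and number of object categories
W, H = 512, 512         # input resolution
image = torch.rand(1, 3, H, W)                         # input image I
heatmap = torch.randn(1, C, H // R, W // R).sigmoid()  # predicted heatmap in [0, 1]
```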


We train the keypoint prediction network following Law and Deng [30]. For each ground truth keypoint $p \in R^2$ of class $c$, we compute a low-resolution equivalent $\tilde{p} = \lfloor p/R \rfloor$. We then splat all ground truth keypoints onto a heatmap $Y \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ using a Gaussian kernel $Y_{xyc} = \exp\left(-\frac{(x-\tilde{p}_x)^2+(y-\tilde{p}_y)^2}{2\sigma_p^2}\right)$, where $\sigma_p$ is an object size-adaptive standard deviation [30]. If two Gaussians of the same class overlap, we take the element-wise maximum [4]. The training objective is a penalty-reduced pixel-wise logistic regression with focal loss [33]:

$$L_k = \frac{-1}{N} \sum_{xyc} \begin{cases} (1-\hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1-Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1-\hat{Y}_{xyc}) & \text{otherwise} \end{cases} \tag{1}$$

where $\alpha$ and $\beta$ are hyper-parameters of the focal loss [33] and $N$ is the number of keypoints in image $I$.
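A sketch of both steps follows: splatting a keypoint with an unnormalized Gaussian (merging overlaps via an element-wise maximum) and the penalty-reduced focal loss of Equation 1. The choice of $\sigma_p$ is passed in as a parameter here; the size-adaptive radius computation from CornerNet is omitted for brevity.

```python
import torch

def splat_gaussian(heatmap, center, sigma):
    """Splat one keypoint onto an (H, W) class heatmap with an
    unnormalized Gaussian. Overlapping Gaussians of the same class
    are merged with an element-wise maximum, in place."""
    H, W = heatmap.shape
    ys = torch.arange(H, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, -1)
    g = torch.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2)
                  / (2 * sigma ** 2))
    torch.maximum(heatmap, g, out=heatmap)

def focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss of Equation 1.

    pred, gt: tensors of the same shape; gt is the splatted heatmap,
    equal to 1 exactly at ground-truth keypoints.
    """
    pos = gt.eq(1)
    pos_loss = ((1 - pred[pos]) ** alpha * torch.log(pred[pos] + eps)).sum()
    neg_loss = ((1 - gt[~pos]) ** beta * pred[~pos] ** alpha
                * torch.log(1 - pred[~pos] + eps)).sum()
    n = pos.sum().clamp(min=1).float()  # number of keypoints N
    return -(pos_loss + neg_loss) / n
```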


To recover the discretization error caused by the output stride, we additionally predict a local offset $\hat{O} \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}$ for each center point. All classes $c$ share the same offset prediction. The offset is trained with an L1 loss:

$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right) \right| \tag{2}$$


The supervision acts only at keypoint locations $\tilde{p}$; all other locations are ignored.
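A sketch of the offset target and a masked L1 loss consistent with Equation 2; `keypoints` here is a hypothetical list of ground-truth (x, y) image coordinates:

```python
import torch

def offset_target(p, stride=4):
    """Offset target for one keypoint: the sub-pixel remainder lost
    when the image coordinate p is mapped to the low-resolution grid."""
    p = torch.as_tensor(p, dtype=torch.float32)
    p_tilde = torch.floor(p / stride)
    return p_tilde.long(), p / stride - p_tilde  # grid cell, target in [0, 1)

def offset_loss(offset_pred, keypoints, stride=4):
    """L1 offset loss of Equation 2, supervised only at keypoint cells.

    offset_pred: (2, H, W) predicted offsets; keypoints: list of (x, y).
    """
    total = offset_pred.new_zeros(())
    for p in keypoints:
        (cx, cy), target = offset_target(p, stride)
        total = total + (offset_pred[:, cy, cx] - target).abs().sum()
    return total / max(len(keypoints), 1)
```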


Objects as Points

Let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bounding box of object $k$ with category $c_k$. Its center point lies at $p_k = \left(\frac{x_1^{(k)}+x_2^{(k)}}{2}, \frac{y_1^{(k)}+y_2^{(k)}}{2}\right)$. We use our keypoint estimator $\hat{Y}$ to predict all center points. In addition, we regress to the object size $s_k = (x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)})$ for each object $k$. To limit the computational burden, we use a single size prediction $\hat{S} \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}$ for all object categories. We use an L1 loss at the center point, similar to Objective 2:

$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right| \tag{3}$$
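The size branch can be supervised the same way; a sketch consistent with Equation 3, reading $\hat{S}$ out only at ground-truth center cells and regressing raw-pixel sizes:

```python
import torch

def size_loss(size_pred, boxes, stride=4):
    """L1 size loss of Equation 3.

    size_pred: (2, H, W) raw-pixel width/height predictions.
    boxes: list of (x1, y1, x2, y2) ground-truth boxes.
    """
    total = size_pred.new_zeros(())
    for x1, y1, x2, y2 in boxes:
        cx = int((x1 + x2) / 2 / stride)  # center cell on the output grid
        cy = int((y1 + y2) / 2 / stride)
        target = size_pred.new_tensor([x2 - x1, y2 - y1])  # raw pixels, not normalized
        total = total + (size_pred[:, cy, cx] - target).abs().sum()
    return total / max(len(boxes), 1)
```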


We do not normalize the scale and directly use the raw pixel coordinates. We instead scale the loss by a constant $\lambda_{size}$. The overall training objective is

$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off} \tag{4}$$


We set $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$ in all our experiments unless specified otherwise. We use a single network to predict the keypoints $\hat{Y}$, offset $\hat{O}$, and size $\hat{S}$. The network predicts a total of $C + 4$ outputs at each location. All outputs share a common fully-convolutional backbone network. For each modality, the features of the backbone are then passed through a separate 3 × 3 convolution, ReLU and another 1 × 1 convolution. Figure 4 shows an overview of the network output. Section 5 and supplementary material contain additional architectural details.
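A sketch of these heads in PyTorch is below. The hidden width of 256 and the backbone channel count are assumptions; the text defers such architectural details to Section 5 and the supplement.

```python
import torch.nn as nn

def make_head(in_channels, out_channels, hidden=256):
    """One output head: 3x3 conv -> ReLU -> 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, out_channels, kernel_size=1))

class CenterNetHeads(nn.Module):
    """C + 4 outputs per location: C-class heatmap, 2D size, 2D offset,
    all computed from the shared backbone features."""
    def __init__(self, in_channels=64, num_classes=80):
        super().__init__()
        self.heatmap = make_head(in_channels, num_classes)
        self.size = make_head(in_channels, 2)
        self.offset = make_head(in_channels, 2)

    def forward(self, features):
        return (self.heatmap(features).sigmoid(),
                self.size(features),
                self.offset(features))
```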

Training schedule By default, we train the keypoint estimation network for 140 epochs with a learning rate drop at 90 epochs. If we double the training epochs before dropping the learning rate, the performance further increases by 1.1 AP (Table 3d), at the cost of a much longer training schedule. To save computational resources (and polar bears), we use 140 epochs in ablation experiments, but stick with 230 epochs for DLA when comparing to other methods. Finally, we tried a multiple “anchor” version of CenterNet by regressing to more than one object size. The experiments did not yield any success. See supplement.
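In PyTorch terms, the 140-epoch schedule with a drop at epoch 90 could be sketched as follows; the optimizer type, base learning rate, and decay factor are assumptions not stated in this excerpt:

```python
import torch

model = torch.nn.Conv2d(3, 84, kernel_size=3, padding=1)   # stand-in for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # lr value assumed
# Multiply the learning rate by 0.1 at epoch 90 of a 140-epoch run.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90], gamma=0.1)

for epoch in range(140):
    # ... one pass over the training set, computing L_det and stepping the optimizer ...
    scheduler.step()
```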

Conclusion

In summary, we present a new representation for objects: as points. Our CenterNet object detector builds on successful keypoint estimation networks, finds object centers, and regresses to their size. The algorithm is simple, fast, accurate, and end-to-end differentiable without any NMS post-processing. The idea is general and has broad applications beyond simple two-dimensional detection. CenterNet can estimate a range of additional object properties, such as pose, 3D orientation, depth and extent, in one single forward pass. Our initial experiments are encouraging and open up a new direction for real-time object recognition and related tasks.

