R-CNN: Region-based Convolutional Networks (rich feature hierarchies)
Summary
0. History:
Central issue at the ILSVRC 2012 workshop:
To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?
1. Objective:
a. Bridging the gap between image classification and object detection
b. Showing that a CNN leads to dramatically higher object detection performance on PASCAL VOC
c. Localizing objects with a deep network
d. Training a high-capacity model with only a small quantity of annotated detection data
Attention:
Detection differs from classification in that it requires localizing objects within an image
2. Model adopted
System consists of 3 modules:
a. Generates around 2000 category-independent region proposals for the input image (selective search)
b. Extracts a fixed-length feature vector (4096 dimensions) from each proposal using an ImageNet-trained CNN (forward-propagating a mean-subtracted 227×227 RGB image through 5 convolutional and 2 fully connected layers)
c. Classifies each region with category-specific linear SVMs, then applies non-maximum suppression, which rejects a region if its intersection-over-union (IoU) overlap with a higher-scoring selected region is larger than a learned threshold
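The suppression step in module c can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the `(x1, y1, x2, y2)` box format, the function names, and the default `threshold=0.3` are assumptions made for the example (the notes only say the threshold is learned).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.3):
    """Greedy non-maximum suppression for one category.

    Visits boxes from highest to lowest score and keeps a box only if its
    IoU with every already-kept box is at most `threshold`.
    `threshold=0.3` is a placeholder value chosen for this sketch.
    Returns the indices of the kept boxes, in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

Because the SVMs are category-specific, a function like this would be run once per category on that category's scored regions.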
Attention:
Region proposals of different shapes are affine-warped to produce a fixed-size CNN input
The model is pre-trained with supervision on a large auxiliary dataset (ILSVRC), then fine-tuned on a small domain-specific dataset (PASCAL)
CNN parameters are shared across all categories
Feature vectors are comparatively low-dimensional
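To make the warping and mean-subtraction steps concrete, here is a rough sketch in plain Python. It is only an illustration under stated assumptions: the image is a nested list of `(r, g, b)` tuples, the box is `(x1, y1, x2, y2)` with exclusive right and bottom edges, resampling is nearest-neighbour, and the function name `warp_region` is made up for this example. It also omits the context padding around the proposal that the paper uses.

```python
def warp_region(image, box, size=227, mean=(0.0, 0.0, 0.0)):
    """Warp one region proposal to a fixed size x size input.

    image: list of rows, each row a list of (r, g, b) tuples (assumed layout).
    box:   (x1, y1, x2, y2), right/bottom edges exclusive (assumed convention).
    The region is stretched independently along x and y (anisotropic scaling),
    so its aspect ratio is not preserved, and `mean` is subtracted per channel.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    out = []
    for j in range(size):
        src_y = y1 + (j * h) // size      # nearest source row
        row = []
        for i in range(size):
            src_x = x1 + (i * w) // size  # nearest source column
            pixel = image[src_y][src_x]
            row.append(tuple(c - m for c, m in zip(pixel, mean)))
        out.append(row)
    return out
```

Whatever the shape of the proposal, the output always has `size` rows and `size` columns, which is what lets a single CNN with fixed-size fully connected layers process every region.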
3. Specialities of the system:
a. First to use a three-stage model to handle the object detection task
b. One of the first to use a CNN, rather than hand-engineered features, to extract region features for detection
c. SVMs still play an essential role in the system
d. First to use supervised pre-training followed by domain-specific fine-tuning instead of the conventional unsupervised pre-training followed by supervised fine-tuning (labeled detection data is too scarce to train a large CNN from scratch)
e. Focusing on regions of interest proposed by selective search (SS) reduces the amount of computation
f. Domain-specific fine-tuning: SGD training of the CNN parameters using only warped region proposals
g. Hard negative mining is used to balance the number of positive and negative examples during training. On ILSVRC2013 the negatives are mined only from the val1 set, because the training set is not exhaustively annotated; the validation set is used instead, and it is split with a technique that minimizes the relative class imbalance
h. The choice of architecture has a large effect on R-CNN detection performance
i. Linear bounding-box regression reduces localization errors
j. Positive and negative examples are defined differently for fine-tuning the CNN and for training the object detection SVMs. Hypothesis: fine-tuning data is limited. Paradox: the definition of positive examples for fine-tuning is looser than that for the SVMs; this is caused by the limited annotated data, and it also makes it harder to localize objects precisely
k. Dataset redundancy
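Items i and j can be illustrated with a short Python sketch. The IoU thresholds (0.5 for fine-tuning positives, 0.3 for SVM negatives) and the regression-target formulas are the ones reported in the R-CNN paper rather than stated in these notes; the function names, the `(cx, cy, w, h)` box format, and the use of `0` for background and `None` for ignored regions are assumptions made for this example.

```python
import math


def finetune_label(max_iou, gt_class, pos_thresh=0.5):
    """Label of a proposal for CNN fine-tuning (the looser definition).

    max_iou:  largest IoU between the proposal and any ground-truth box.
    gt_class: class id (>= 1) of that best-overlapping ground-truth box.
    Returns gt_class if the overlap is high enough, else 0 (background).
    """
    return gt_class if max_iou >= pos_thresh else 0


def svm_label(is_ground_truth, max_iou, neg_thresh=0.3):
    """Label of a region for one category's SVM (the stricter definition).

    Only ground-truth boxes are positives (+1); proposals with IoU below
    neg_thresh are negatives (-1); everything in between is ignored (None).
    """
    if is_ground_truth:
        return 1
    if max_iou < neg_thresh:
        return -1
    return None


def regression_targets(proposal, ground_truth):
    """Bounding-box regression targets for boxes given as (cx, cy, w, h).

    Center offsets are scaled by the proposal size; width and height
    targets are log ratios, so the targets are scale-invariant.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))
```

A proposal with IoU 0.4 shows the difference between the two definitions: it counts as background for fine-tuning but is ignored when training the SVMs.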
4. Disadvantages:
Feature computation is time-consuming, because the system repeatedly applies the deep convolutional network to every extracted warped region
The region proposal method is time-consuming
5. Personal reviews:
Foundation of CNN-based object detection