Fast R-CNN
Summary
0. History
The beginning of love between Kaiming and Ross (Ross, you write I write...)
1. Objective
Use a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations, replacing the multi-stage R-CNN/SPPnet pipeline and, unlike SPPnet, allowing fine-tuning of the conv layers that precede the spatial pyramid pooling
2. Model adopted
Region proposals -> conv layers + max pooling layers to produce a feature map -> RoI pooling layer to extract a fixed-length feature vector per proposal -> fc layers -> two sibling output layers: a softmax layer giving probabilities over K classes plus background, and a layer that outputs sets of 4 values encoding refined bounding-box positions for the K classes
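To make the pipeline concrete, here is a minimal PyTorch sketch of the detection head (the original implementation is in Caffe, not this; the 512-channel VGG16-style feature map, the 7x7 RoI size, and the fc widths are assumptions):

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    def __init__(self, num_classes: int, in_channels: int = 512, roi_size: int = 7):
        super().__init__()
        self.roi_size = roi_size
        self.fc = nn.Sequential(
            nn.Linear(in_channels * roi_size * roi_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # Two sibling output layers: K+1 class scores and 4 box deltas per class.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        self.bbox_pred = nn.Linear(4096, 4 * num_classes)

    def forward(self, feature_map, rois, spatial_scale):
        # rois: (R, 5) tensor of [batch_index, x1, y1, x2, y2] in image coordinates.
        pooled = roi_pool(feature_map, rois, (self.roi_size, self.roi_size),
                          spatial_scale=spatial_scale)
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```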
3. Special features of the system
a. The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (hyperparameters; it is a special case of SPPnet's pooling layer with only one pyramid level)
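A hand-rolled sketch of the mechanism, assuming PyTorch: each RoI on the feature map is split into an H×W grid and max-pooled per cell, so the output length is fixed regardless of RoI size (the shapes and the example RoI are arbitrary):

```python
import torch
import torch.nn.functional as F

def roi_max_pool(feature_map, roi, out_hw=(7, 7)):
    # feature_map: (C, H, W); roi: (x1, y1, x2, y2) in feature-map coordinates.
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    # adaptive_max_pool2d divides the region into out_hw cells and takes the
    # max in each cell: SPP with a single pyramid level.
    return F.adaptive_max_pool2d(region.unsqueeze(0), out_hw).squeeze(0)

feat = torch.randn(512, 38, 50)              # conv feature map
pooled = roi_max_pool(feat, (5, 4, 30, 25))  # any valid RoI
print(pooled.shape)                          # torch.Size([512, 7, 7])
```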
b. training can update all network layers and is a single-stage using a multi-task loss
c. training process:
- initialize the network from one pre-trained on an image classification task (ImageNet in the paper)
- BP through SPPnet- and R-CNN-style training is inefficient because each training RoI comes from a different image, and an RoI may have a very large receptive field (often spanning the whole image). Fast R-CNN therefore samples SGD mini-batches hierarchically: N images and R/N RoIs from each image, so RoIs from the same image share computation and memory in the forward and backward passes; this converges faster than R-CNN (see the sampling sketch after this list)
- each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v; a multi-task loss L jointly trains for classification and bounding-box regression (see the loss sketch after this list)
- the bounding-box loss for background RoIs (u = 0) is zero
- 25% of the RoIs are taken from proposals with an IoU overlap of at least 0.5 with a ground-truth bounding box and are labeled with a foreground object class; the rest are sampled from proposals with maximum IoU in [0.1, 0.5) as background examples and labeled with u = 0
- backpropagation through the RoI pooling layer is well designed (genius): an activation input may be routed to several different outputs when two regions of interest overlap, so its gradient accumulates the gradients of every output cell for which it was the argmax (a runnable check follows this list)
- 2 ways of achieving scale invariance: either fix the size of training and testing images and force the network to learn scale-invariant object detection ("brute force"), or use an image pyramid to present images at different scales (discussed below)
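A sketch of the hierarchical mini-batch sampling described above, assuming N = 2 images and R = 128 so each image contributes 64 RoIs; the function name and data layout are illustrative, not the paper's code:

```python
import random

def sample_rois(rois_with_iou, rois_per_image=64, fg_fraction=0.25):
    # rois_with_iou: list of (roi, max_iou_with_gt, gt_class) for ONE image,
    # so all sampled RoIs share that image's forward/backward computation.
    fg = [r for r in rois_with_iou if r[1] >= 0.5]
    bg = [r for r in rois_with_iou if 0.1 <= r[1] < 0.5]
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))   # 25% foreground
    n_bg = min(rois_per_image - n_fg, len(bg))
    sampled = random.sample(fg, n_fg) + random.sample(bg, n_bg)
    # Foreground keeps its ground-truth class u >= 1; background gets u = 0.
    return [(roi, cls if iou >= 0.5 else 0) for roi, iou, cls in sampled]
```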
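And a sketch of the multi-task loss L(p, u, t^u, v) = L_cls(p, u) + lambda·[u >= 1]·L_loc(t^u, v): log loss for classification plus smooth L1 for box regression, with the box term switched off for background, lambda = 1 as in the paper; the tensor layouts are assumptions:

```python
import torch
import torch.nn.functional as F

def multi_task_loss(cls_scores, bbox_preds, u, v, lam=1.0):
    # cls_scores: (R, K+1); bbox_preds: (R, 4K); u: (R,) labels; v: (R, 4) targets.
    loss_cls = F.cross_entropy(cls_scores, u)
    fg = u >= 1                      # [u >= 1]: background has zero box loss
    if fg.any():
        # Select the 4 deltas t^u belonging to each RoI's ground-truth class u.
        idx = (u[fg] - 1).unsqueeze(1) * 4 + torch.arange(4, device=u.device)
        t_u = bbox_preds[fg].gather(1, idx)
        loss_loc = F.smooth_l1_loss(t_u, v[fg])
    else:
        loss_loc = cls_scores.new_zeros(())
    return loss_cls + lam * loss_loc
```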
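The RoI-pooling backward pass can be checked directly with torchvision's roi_pool: an activation that wins the max in a cell of each of two overlapping RoIs receives the sum of both gradients (the values below are arbitrary):

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 1, 16, 16)
feat[0, 0, 6, 6] = 100.0          # force one activation to win every cell it is in
feat.requires_grad_(True)

rois = torch.tensor([[0., 0., 0., 9., 9.],      # two overlapping RoIs,
                     [0., 4., 4., 13., 13.]])   # format [batch_idx, x1, y1, x2, y2]
out = roi_pool(feat, rois, output_size=(2, 2))
out.sum().backward()
# (6, 6) lies inside both RoIs and is the argmax of one cell in each,
# so it accumulates gradient from both outputs.
print(feat.grad[0, 0, 6, 6].item())             # 2.0
```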
d. Fast R-CNN detection uses truncated SVD for faster inference: each large fc layer is factorized into two smaller fc layers, cutting detection time by roughly 30% with only a small drop in mAP (0.3 points)
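A sketch of the truncated-SVD compression, assuming PyTorch: the weight matrix W (u×v) of a fc layer is approximated as U_t·Sigma_t·V_t^T, replacing one layer of uv weights with two layers totaling t(u + v); the layer size and rank t below are illustrative:

```python
import torch
import torch.nn as nn

def compress_fc(fc: nn.Linear, t: int):
    U, S, Vh = torch.linalg.svd(fc.weight.data, full_matrices=False)
    first = nn.Linear(fc.in_features, t, bias=False)   # weight: Sigma_t @ V_t^T
    second = nn.Linear(t, fc.out_features, bias=True)  # weight: U_t, keeps the bias
    first.weight.data = S[:t, None] * Vh[:t]
    second.weight.data = U[:, :t]
    second.bias.data = fc.bias.data.clone()
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 4096)
small = compress_fc(fc, t=256)    # ~8x fewer multiply-adds for this layer
x = torch.randn(8, 4096)
print((fc(x) - small(x)).abs().max().item())  # approximation error
```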
e. Training through the RoI pooling layer is important for very deep nets, but it suffices to fine-tune only the layers from conv3_1 and up, which reduces training time while preserving accuracy
f. Multi-task training improves accuracy, while multi-scale training is shown to bring only a marginal accuracy gain at a large cost in compute time
g. Softmax slightly outperforms one-vs-rest SVMs in the experiments, as it introduces competition between classes when scoring an RoI
h. More proposals do not help and can even slightly hurt accuracy
4. Personal review
Fast R-CNN solves the multi-stage training problem and makes the region proposal method the core remaining problem to solve