Fast R-CNN

Summary

0. History

The beginning of the love story between Kaiming and Ross (Ross: you write one, I write one...)

1. Objective

Use a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations. This removes the multi-stage pipeline of R-CNN/SPPnet and makes it possible to fine-tune the network layers that precede the spatial pyramid pooling.

2. Model adopted

Region proposals -> conv layers + max pooling layers produce a feature map; an RoI pooling layer extracts a fixed-length feature vector from the feature map for each proposal -> fc layers -> two sibling output layers: one outputs softmax probabilities over the K object classes plus background, the other outputs a set of 4 real values per class encoding refined bounding-box positions for each of the K classes
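
A minimal PyTorch sketch of the two sibling output layers (an illustration, not the paper's Caffe implementation; the 4096-d feature size follows the VGG16 variant, and 21 = 20 classes + background follows PASCAL VOC):

```python
import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Two sibling output layers on top of each RoI's fc feature vector."""
    def __init__(self, feat_dim=4096, num_classes=21):  # 20 classes + background
        super().__init__()
        self.cls_score = nn.Linear(feat_dim, num_classes)      # softmax logits
        self.bbox_pred = nn.Linear(feat_dim, num_classes * 4)  # 4 offsets per class

    def forward(self, x):
        return self.cls_score(x), self.bbox_pred(x)

head = FastRCNNHead()
feats = torch.randn(8, 4096)       # 8 RoI feature vectors
scores, deltas = head(feats)       # (8, 21) and (8, 84)
probs = scores.softmax(dim=-1)     # per-class probabilities for each RoI
```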

3. Special features of the system

a. The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H*W (H and W are layer hyperparameters). It is a special case of SPPnet's spatial pyramid pooling with only one pyramid level.
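
A toy single-RoI version of that layer (an assumed helper for illustration; real code must also map image coordinates onto the feature map through the conv stride, and torchvision.ops.roi_pool is a batched production version):

```python
import torch

def roi_pool(feat, roi, out_h=7, out_w=7):
    """Max-pool the features inside one RoI into a fixed out_h x out_w grid.

    feat: (C, Hf, Wf) conv feature map.
    roi:  (x1, y1, x2, y2) integer corners, already in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    out = feat.new_empty(feat.shape[0], out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Split the RoI into an out_h x out_w grid of sub-windows
            # and take the max activation inside each sub-window.
            ys = y1 + (y2 - y1) * i // out_h
            ye = max(y1 + (y2 - y1) * (i + 1) // out_h, ys + 1)
            xs = x1 + (x2 - x1) * j // out_w
            xe = max(x1 + (x2 - x1) * (j + 1) // out_w, xs + 1)
            out[:, i, j] = feat[:, ys:ye, xs:xe].amax(dim=(1, 2))
    return out

pooled = roi_pool(torch.randn(512, 38, 50), (4, 3, 30, 20))  # -> (512, 7, 7)
```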

b. Training is single-stage with a multi-task loss, and it can update all network layers (unlike SPPnet fine-tuning, which cannot update the conv layers below the pyramid pooling)

c. training process:

     - the network is initialized from a model pre-trained on an ImageNet classification task

     - backpropagation through SPPnet and R-CNN is inefficient because each training RoI comes from a different image, and an RoI may have a very large receptive field (often spanning nearly the whole image). Fast R-CNN instead samples SGD mini-batches hierarchically: N images, then R/N RoIs from each image (N = 2 and R = 128 in the paper), so RoIs from the same image share computation and memory in the forward and backward passes; this converges faster than R-CNN

     - each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v; a multi-task loss L is used to jointly train for classification and bounding-box regression (see the sketch after this list)

     - for background RoIs (u = 0) the bounding-box loss is zero, via the Iverson bracket [u >= 1] in L

     - 25% of the RoIs are taken from proposals that have IoU overlap of at least 0.5 with a ground-truth bounding box and are labeled with a foreground object class; the rest are taken from proposals with IoU in [0.1, 0.5) and labeled as background with u = 0

     - backpropagation through the RoI pooling layer is cleverly designed: one input activation may be mapped to several different outputs (when two regions of interest overlap), so its gradient accumulates the gradients of every output for which it was selected as the max

     - 2 ways of achieving scale invariance: either fix the size of training and testing images ("brute force") and force the network to learn scale-invariant object detection, or use an image pyramid to present each image at several scales (discussed below)
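
As referenced above, a minimal PyTorch sketch of the RoI labeling rule and the multi-task loss L(p, u, t, v) = L_cls(p, u) + λ[u >= 1]L_loc(t_u, v), with smooth L1 for the box term. The helper names and tensor shapes are assumptions for illustration, not the paper's Caffe code:

```python
import torch
import torch.nn.functional as F

def label_rois(max_ious, num_rois=64, fg_frac=0.25):
    """Pick RoI indices for one image: IoU >= 0.5 with some ground-truth box ->
    foreground (~25% of the sample); IoU in [0.1, 0.5) -> background (u = 0).

    max_ious: (R,) max IoU of each proposal against all ground-truth boxes.
    """
    fg = torch.nonzero(max_ious >= 0.5).flatten()
    bg = torch.nonzero((max_ious >= 0.1) & (max_ious < 0.5)).flatten()
    n_fg = min(int(fg_frac * num_rois), fg.numel())
    fg = fg[torch.randperm(fg.numel())[:n_fg]]
    bg = bg[torch.randperm(bg.numel())[:num_rois - n_fg]]
    return fg, bg

def multi_task_loss(cls_scores, bbox_deltas, u, v, lam=1.0):
    """cls_scores: (R, K+1) logits; bbox_deltas: (R, (K+1)*4) per-class offsets;
    u: (R,) class labels with 0 = background; v: (R, 4) regression targets."""
    R, num_cls = cls_scores.shape
    l_cls = F.cross_entropy(cls_scores, u)      # log loss over K+1 classes
    fg = u >= 1                                 # Iverson bracket [u >= 1]:
    if not fg.any():                            # box loss vanishes for background
        return l_cls
    t_u = bbox_deltas.view(R, num_cls, 4)[torch.arange(R), u]  # deltas of true class
    return l_cls + lam * F.smooth_l1_loss(t_u[fg], v[fg])
```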

d. For faster detection, Fast R-CNN uses truncated SVD, which decomposes each large fc layer into two smaller ones; this cuts detection time by over 30% with only a small drop in mAP (0.3 points)
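
A sketch of that compression (assuming PyTorch rather than the paper's Caffe; the t values mirror the paper's choices of 1024 for VGG16's fc6 and 256 for fc7):

```python
import torch
import torch.nn as nn

def truncated_svd_fc(fc, t):
    """Replace an fc layer with weight W (u x v) by two thinner layers using the
    top-t singular values: W ~= U_t diag(S_t) V_t^T, so the uv parameters of W
    shrink to t(u + v)."""
    W, b = fc.weight.data, fc.bias.data            # W: (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(W.shape[1], t, bias=False)   # weight = diag(S_t) V_t^T
    first.weight.data = S[:t, None] * Vh[:t]
    second = nn.Linear(t, W.shape[0])              # weight = U_t, keeps the bias
    second.weight.data = U[:, :t].clone()
    second.bias.data = b.clone()
    return nn.Sequential(first, second)

fc6 = nn.Linear(25088, 4096)               # VGG16's fc6
fc6_small = truncated_svd_fc(fc6, t=1024)  # ~103M weights -> ~30M
```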

e. Training through the RoI pooling layer is important for very deep nets, but it turns out to be reasonable to fine-tune only the layers from conv3_1 and up, which reduces training time while preserving accuracy
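
A minimal illustration of that choice (the index of conv3_1 in torchvision's VGG16 feature stack is an assumption; verify it for your own backbone):

```python
import torchvision

# Freeze everything below conv3_1 (features[0:10] in torchvision's VGG16),
# so conv1 and conv2 act as fixed feature extractors during fine-tuning.
vgg = torchvision.models.vgg16()
for layer in vgg.features[:10]:
    for p in layer.parameters():
        p.requires_grad = False
```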

f. Multi-task training improves accuracy, while multi-scale (image-pyramid) training and testing bring only a marginal accuracy gain at a large cost in compute time

g. Softmax slightly outperforms one-vs-rest SVMs in the experiments; it introduces competition between classes when scoring an RoI

h. More proposals do not help and can even slightly hurt accuracy

4. Personal review

Fast R-CNN solves the multi-stage training problem and makes the region proposal method the remaining core problem to solve

 
