Faster R-CNN

BigFanTheory
6 min read · May 3, 2021

Faster R-CNN was proposed after Fast R-CNN to further improve the computational efficiency of generating bounding-box proposals, which previously relied on selective search. The difference between Faster R-CNN and Fast R-CNN is the use of a Region Proposal Network (RPN) for region proposals instead of an external method such as selective search. The RPN takes feature maps as input and outputs region proposals, which are then fed into the Region of Interest (RoI) pooling layer. The difference between their structures can be seen as follows:

The key steps of Faster R-CNN:

(1) Conv layers: a basic stack of conv + ReLU + pooling layers (ZF model or VGG-16) is used to obtain the feature map, which is used by both the region proposal network and the fully connected layers.

(2) Region Proposal Network: the RPN generates region proposals. It uses softmax to classify whether an anchor belongs to the foreground or the background, and bounding-box regression to obtain the actual size and location of the bounding box from the anchor box.

(3) RoI pooling: this layer takes in the feature map and the region proposals, outputs fixed-size proposal feature maps, and sends them to the fully connected layers.

(4) Classification: takes in the proposal feature maps and outputs the label of each proposal. It also performs bounding-box regression a second time to obtain the precise location of the object.

As shown above, the convolution backbone has 13 conv + 13 ReLU + 4 pooling layers. The RPN applies a 3x3 convolution and generates the foreground anchors as well as the bounding-box shifts; combined, these form the proposals. The RoI pooling layer combines the proposals with the feature map to obtain the proposal features, and forwards them to the fully connected and softmax layers for classification.
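The shape flow through these stages can be sketched numerically. These are illustrative sizes only, assuming a 600x800 input, a VGG-16 backbone with stride 16, and a 7x7 pooled output; the variable names are hypothetical:

```python
# Illustrative tensor shapes through Faster R-CNN (not a runnable network):
# assumes a 600x800 input, VGG-16 backbone (stride 16), k = 9 anchors per
# feature-map cell, and a 7x7 RoI pooling output.
H, W = 600, 800
fh, fw = H // 16, W // 16                 # feature map spatial size: 37 x 50
k = 9                                     # anchors per feature-map cell

feature_map = (512, fh, fw)               # conv layers output
rpn_cls_scores = (2 * k, fh, fw)          # foreground/background scores
rpn_bbox_shifts = (4 * k, fh, fw)         # per-anchor box shifts
proposals = (2000, 4)                     # after the proposal layer's NMS
roi_features = (2000, 512, 7, 7)          # RoI pooling output per proposal

print(feature_map, rpn_cls_scores, rpn_bbox_shifts)
```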

  1. Region Proposal Network

Faster R-CNN further optimizes the region-proposal process of Fast R-CNN. The RPN shown above can be separated into two parts: one performs the softmax classification of the anchors into foreground and background; the other performs the bounding-box regression. The anchor boxes and the bounding-box shifts are combined in the proposal layer to produce the proposals.

For each cell in the feature map, 9 anchor boxes are set as initial candidate proposals. The bounding boxes obtained this way are not very accurate, so a second bounding-box regression later makes the locations more precise. As shown in the figure above, there are 2k scores and 4k coordinates. The softmax classification separates foreground from background and is thus a binary classification. If the original picture has a size of 600x800, the feature map has a final stride of 16 (VGG) and each cell of the feature map has 9 anchors, which gives a total of around 17k anchors. During training, only a small portion of these anchors is selected.
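The ~17k figure can be checked with a small anchor generator. This is a sketch in the spirit of the paper's scheme (3 scales x 3 aspect ratios per cell, giving boxes of roughly 128, 256, and 512 pixels on a side); the helper names are hypothetical:

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Generate the 9 base anchors (x1, y1, x2, y2) centered on one cell."""
    anchors = []
    cx = cy = (base_size - 1) / 2.0
    for r in ratios:                      # r is the aspect ratio h/w
        for s in scales:
            side = base_size * s          # box side before reshaping by ratio
            w = side / np.sqrt(r)         # keep area constant per scale
            h = side * np.sqrt(r)
            anchors.append([cx - (w - 1) / 2, cy - (h - 1) / 2,
                            cx + (w - 1) / 2, cy + (h - 1) / 2])
    return np.array(anchors)

def all_anchors(feat_h, feat_w, stride=16):
    """Tile the 9 base anchors over every feature-map cell."""
    base = generate_anchors(base_size=stride)                # (9, 4)
    sx, sy = np.meshgrid(np.arange(feat_w) * stride,
                         np.arange(feat_h) * stride)
    shifts = np.stack([sx.ravel(), sy.ravel(),
                       sx.ravel(), sy.ravel()], axis=1)      # (H*W, 4)
    return (base[None, :, :] + shifts[:, None, :]).reshape(-1, 4)

# A 600x800 image with stride 16 gives a 37x50 feature map:
anchors = all_anchors(600 // 16, 800 // 16)
print(anchors.shape)   # 37 * 50 * 9 = 16650 anchors, i.e. roughly 17k
```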

A 1x1 convolution generates a matrix of (WxH)x(9x2), which stores the foreground and background information. There are WxHx9 anchor boxes to classify, each with 2 outputs: the foreground and background probabilities, respectively. The loss function is the cross-entropy loss. The labels are either 0 or 1, depending on whether the anchor matches a ground-truth bounding box. Anchors are labeled according to the following rules:

(1) The anchor with the largest IoU with a ground-truth box is labeled as a positive sample

(2) Anchors with an IoU larger than 0.7 are positive

(3) Anchors with an IoU smaller than 0.3 are negative
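These rules can be sketched with a simple IoU-based labeler. The function names are hypothetical; anchors that fall in neither category are marked -1 and ignored during training, as in the paper:

```python
import numpy as np

def iou(boxes, gt):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4)."""
    x1 = np.maximum(boxes[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt, hi=0.7, lo=0.3):
    """1 = foreground, 0 = background, -1 = ignored during training."""
    overlaps = iou(anchors, gt)                   # (N, M)
    max_iou = overlaps.max(axis=1)                # best GT overlap per anchor
    labels = np.full(len(anchors), -1)
    labels[max_iou < lo] = 0                      # rule (3): background
    labels[max_iou >= hi] = 1                     # rule (2): foreground
    labels[overlaps.argmax(axis=0)] = 1           # rule (1): best anchor per GT
    return labels
```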

  1.1 Proposal Layer

The proposal layer has the following functions:

(1) Generate the bounding boxes (~17k) with their corresponding softmax scores and shifted (regressed) box results

(2) Sort the bounding boxes by foreground softmax score and keep the pre-NMS top-N (e.g., 12000) boxes

(3) Clip bounding boxes with negative coordinates (out of the image) to the image boundary

(4) Remove boxes whose height or width is smaller than a threshold, i.e., boxes that are too small

(5) Apply non-maximum suppression (with threshold 0.7) so that only one box remains per object

(6) Sort the boxes by foreground score after NMS, return the post-NMS top-N (e.g., 2000) boxes as the proposals, and output their [x1, y1, x2, y2] coordinates
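Step (5), greedy non-maximum suppression, can be sketched as follows (a minimal reference version, not an optimized one):

```python
import numpy as np

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop all
    remaining boxes whose IoU with it exceeds `thresh`."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box against every remaining candidate
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]  # suppress heavy overlaps
    return keep
```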

To summarize, the RPN functions as the following flow:

Generate anchor boxes → filter anchors based on softmax foreground score → bounding box regression → proposal layer NMS and return proposals

The loss function used to train the RPN is the combination of the smoothed L1 loss (for box regression) and the softmax cross-entropy loss (for classification).
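A sketch of the two loss terms. The `beta` transition point and the mean/sum reductions here are illustrative assumptions; the paper additionally weights the sum and applies the regression term only to foreground anchors:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smoothed L1: quadratic near zero, linear for large errors, which makes
    it less sensitive to outlier regression targets than plain L2."""
    diff = np.abs(pred - target)
    return np.where(diff < beta,
                    0.5 * diff ** 2 / beta,
                    diff - 0.5 * beta).sum()

def softmax_cross_entropy(logits, labels):
    """Softmax + cross-entropy over foreground/background logits (N, 2)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# RPN total loss = classification term + lambda * regression term,
# the regression term computed only over foreground anchors.
```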

Some details that need attention:

(1) The anchor-box sizes are defined with respect to the original picture, and the coordinates output by the proposal layer are also in original-image coordinates

(2) Softmax is used to classify whether an anchor contains foreground

(3) Bounding-box regression outputs the shift values relative to the original anchor

(4) The exact proposal location is obtained in the proposal layer by combining the anchor box with the shifts

(5) The bounding-box regression does not predict the box size directly because the relation between the final box and the original anchor is more complex than linear; predicting relative shifts makes the targets scale-invariant and easier to learn
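Detail (5) corresponds to the paper's log-space parameterization: the regressor predicts shifts (tx, ty, tw, th) relative to the anchor rather than absolute coordinates. A minimal sketch with hypothetical helper names:

```python
import numpy as np

def bbox_transform(anchor, gt):
    """Regression targets (tx, ty, tw, th) that map `anchor` onto `gt`:
    center offsets normalized by anchor size, log-scale size ratios."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + aw / 2, anchor[1] + ah / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + gw / 2, gt[1] + gh / 2
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def bbox_apply(anchor, deltas):
    """Inverse transform: apply predicted shifts to an anchor box."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + aw / 2, anchor[1] + ah / 2
    cx, cy = ax + deltas[0] * aw, ay + deltas[1] * ah
    w, h = aw * np.exp(deltas[2]), ah * np.exp(deltas[3])
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
```

The two functions are exact inverses, so a perfect prediction of the targets recovers the ground-truth box exactly.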

2. RoI Pooling Layer

RoI pooling takes in the feature map and the proposals and generates the proposal feature maps. The RPN proposal boxes have different sizes, and their coordinates are defined in the original image (MxN), while the feature map has size M/S x N/S. To take in these two inputs from different coordinate systems and output a fixed dimension, RoI pooling does the following:

(1) Scale the proposals by 1/S and map them onto the feature map

(2) Divide each proposal region on the feature map, whatever its size, into a pooled_w x pooled_h grid

(3) Max-pool over each grid cell

At the end of RoI pooling, every proposal yields a feature map of the fixed size pooled_w x pooled_h.
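A minimal single-channel sketch of the three steps, assuming stride S = 16 and a 7x7 pooled size (real implementations pool every channel and batch many proposals at once):

```python
import numpy as np

def roi_pool(feat, proposal, stride=16, pooled=7):
    """Max-pool one proposal region of a 2-D feature map down to a fixed
    pooled x pooled grid."""
    # (1) scale original-image coordinates onto the feature map
    x1, y1, x2, y2 = [int(round(c / stride)) for c in proposal]
    region = feat[y1:y2 + 1, x1:x2 + 1]
    h, w = region.shape
    # (2) mesh the region into pooled x pooled bins of (roughly) equal size
    ys = np.linspace(0, h, pooled + 1).astype(int)
    xs = np.linspace(0, w, pooled + 1).astype(int)
    out = np.zeros((pooled, pooled))
    for i in range(pooled):
        for j in range(pooled):
            # (3) max-pool each bin; force at least one element per bin
            patch = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = patch.max()
    return out
```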

3. Classification

After obtaining the proposal feature maps of fixed size, the classification stage does two things: it passes them through fully connected layers and classifies each object with softmax, and it performs bounding-box regression again to obtain the detailed object locations.

4. Training of Faster R-CNN

The training of Faster R-CNN is a fine-tuning of pretrained models such as VGG_CNN_M_1024 or ZF. It can be summarized in 4 steps:

(1) Train the RPN for the first time and obtain the proposals and the model

(2) Start from the ImageNet pretrained model, use the proposals from step 1, and train Fast R-CNN for the first time

(3) Start from the model of step 2 and set the learning rate of the weights shared between the RPN and Fast R-CNN to zero, then retrain the RPN; only the unshared RPN weights are updated

(4) Keep all other weights fixed and train only the fully connected layers.

