Mask R-CNN

BigFanTheory
May 3, 2021

Proposed in the best-paper winner of ICCV 2017, Mask R-CNN performs instance segmentation at the pixel level. It is an extension of Faster R-CNN, which was also co-authored by Kaiming He. Mask R-CNN keeps almost the same structure as Faster R-CNN, except for an additional mask prediction branch and the replacement of RoI pooling with RoI Align.

The basic structure of Faster R-CNN is shown below. The loss for training the RPN combines a softmax cross-entropy loss for the object/background classification with a smooth L1 loss for the bounding-box regression.
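As a rough illustration, here is a minimal PyTorch sketch of such a two-term RPN loss. The tensor shapes, argument names, and the positive-anchor mask are assumptions made for this example, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, cls_targets, bbox_deltas, bbox_targets, pos_mask, beta=1.0):
    """Sketch of an RPN loss: softmax cross-entropy over object/background
    plus smooth L1 on box-regression offsets, for positive anchors only."""
    # cls_logits: [N, 2] scores, cls_targets: [N] labels in {0, 1} (long tensor)
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    # bbox_deltas / bbox_targets: [N, 4] (tx, ty, tw, th); pos_mask: [N] bool
    reg_loss = F.smooth_l1_loss(bbox_deltas[pos_mask], bbox_targets[pos_mask], beta=beta)
    return cls_loss + reg_loss
```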

In the RoI pooling step, mapping a proposal onto the feature map rounds its starting and ending coordinates. This is the first issue for pixel-level segmentation. After that, the RoI layer lays a fixed-size grid (6×6 in this example) over each proposal's feature map and max-pools over each cell. During this gridding, the starting and ending coordinates of each cell are rounded again; these two roundings are called quantization. The quantization barely affects classification, but it hurts pixel-level alignment.
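A minimal numeric sketch of the two rounding steps, with made-up stride and box coordinates, shows how the pooled bins drift away from the true proposal:

```python
import math

stride = 16                       # assumed feature-map stride of the backbone
x1, x2 = 10.0, 210.0              # proposal edges in image pixels (made up)
fx1, fx2 = x1 / stride, x2 / stride          # 0.625, 13.125 on the feature map
qx1, qx2 = math.floor(fx1), math.floor(fx2)  # first quantization: 0, 13

bins = 6                          # pooling grid size used in this article's example
bin_w = (qx2 - qx1) / bins        # ~2.17 feature cells per bin
# Second quantization: each bin boundary is rounded to a whole cell, so the
# pooled region drifts from the true proposal by a fraction of a cell, i.e.
# several pixels once multiplied back by the stride.
edges = [qx1 + math.floor(i * bin_w) for i in range(bins + 1)]
print(edges)   # [0, 2, 4, 6, 8, 10, 13] -- uneven, misaligned bins
```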

RoI Align solves these issues by (1) not rounding the proposal coordinates and (2) using bilinear interpolation to compute the feature value at sampled points inside each cell. The intuition is that RoI Align never relies on rounding to locate the proposal on the feature map.
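For a rough comparison, torchvision exposes both operators; the dummy feature map and box below are invented for illustration and are not from the paper:

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.arange(16 * 16, dtype=torch.float32).reshape(1, 1, 16, 16)
# One box in (batch_idx, x1, y1, x2, y2) format, with non-integer coordinates.
boxes = torch.tensor([[0, 1.3, 1.3, 9.7, 9.7]])

pooled  = roi_pool(feat, boxes, output_size=(6, 6), spatial_scale=1.0)
aligned = roi_align(feat, boxes, output_size=(6, 6), spatial_scale=1.0,
                    sampling_ratio=2, aligned=True)
# roi_pool quantizes the box and bin edges before max pooling, while
# roi_align keeps the fractional coordinates and bilinearly interpolates
# the feature value at sampled points in each bin, then averages them.
```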

The figure above demonstrates the overall process of Mask R-CNN. Different from Faster R-CNN, there is an additional head after RoI Align that upsamples the RoI Align output so that a more precise mask can be obtained. When training the mask branch, K mask predictions (one per category) are output, and an average binary cross-entropy loss is computed. Out of the K outputs, only the mask whose category matches the ground truth contributes to the mask loss.
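A minimal sketch of that per-class mask loss, assuming 28×28 mask targets and the variable names below (both are assumptions for the example):

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """The head predicts K binary masks per RoI; only the mask for the
    ground-truth class is compared against the target with average BCE."""
    # mask_logits: [N, K, 28, 28], gt_classes: [N] long, gt_masks: [N, 28, 28] in [0, 1]
    idx = torch.arange(mask_logits.size(0), device=mask_logits.device)
    selected = mask_logits[idx, gt_classes]          # [N, 28, 28]
    return F.binary_cross_entropy_with_logits(selected, gt_masks)
```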

Mask R-CNN reports a mask AP of roughly 33 to 37 on the COCO instance segmentation task (depending on the backbone), higher than the previous state of the art (FCIS). Training in the paper uses synchronized training across 8 GPUs. The framework can also be applied to human pose estimation.

To me, the design and success of Mask R-CNN feel almost like an art-design process. Many of the tricks, such as the use of the smooth L1 loss or the sophisticated training procedure, read more like art. However, the design of the per-class mask cross-entropy output and the removal of rounding at the starting and ending points of the bounding box follow quite intuitively from the problem description.

The cost of pixel-level detection is rather high, and the ability to demonstrate pixel-level segmentation is certainly very cool. In most practical cases, however, pixel-level detection is still not needed. The accuracy of detection itself is not yet high enough for users to trust it 100%, let alone the reliability of the result at every pixel. It remains, for sure, a very interesting research topic.

[Ref. 1] K. He, G. Gkioxari, P. Dollár, R. Girshick, "Mask R-CNN", ICCV 2017

[Ref. 2] 实例分割模型Mask R-CNN详解：从R-CNN，Fast R-CNN，Faster R-CNN再到Mask R-CNN (A detailed explanation of the instance segmentation model Mask R-CNN: from R-CNN, Fast R-CNN, and Faster R-CNN to Mask R-CNN)
