Anchor Free Approaches

BigFanTheory · Apr 27, 2021

Here, I want to introduce two anchor-free approaches: Fully Convolutional One-Stage Object Detection (FCOS, ICCV 2019 by the University of Adelaide) and CenterNet, starting with the focal loss that both of them build on.

1. Focal Loss [ICCV 2017]

Focal loss was proposed to help one-stage detectors reach the accuracy of two-stage detectors such as Faster R-CNN or R-FCN, which require region proposals. Two-stage detectors achieve better accuracy at the cost of efficiency. In 2017, the focal loss [Ref. 1] was introduced to improve the accuracy of one-stage detectors such as YOLO and SSD. A single image usually contains tens of thousands of candidate locations, of which only a few are positive samples. This results in a highly imbalanced dataset with a very small portion of positive samples. The imbalance causes two problems: (1) training is inefficient, as most locations are easy negatives that contribute no useful learning signal, and (2) the easy negatives can overwhelm training and lead to degenerate models.

Earlier approaches include OHEM (online hard example mining), which scores each example by its loss, applies non-maximum suppression (NMS), and then constructs a mini-batch from the highest-loss samples. OHEM emphasizes the misclassified (hard) examples, but it completely discards the easy examples instead of merely down-weighting them.

Focal loss is a modified cross-entropy loss with a modulating weight. It down-weights the samples that are already classified well (high score) and focuses training on the samples with lower scores. Mathematically, it has the formula below. As shown in the figure (only the positive samples are plotted), the loss is damped according to the score: for a very hard example (Pt ≈ 0), the damping factor (1 − Pt)^γ is close to one and its loss is barely changed; for a sample with Pt around 0.5 (with γ = 2), the loss is reduced by a factor of 4; and for samples with Pt of 0.9 and about 0.968, the loss is reduced by roughly 100 and 1000 times. The higher the score, the stronger the damping.

[Ref. 1]
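To make the damping factors above concrete, here is a minimal NumPy sketch (focal_loss is a hypothetical helper with γ = 2; the α class-balancing weight from [Ref. 1] is omitted):

```python
import numpy as np

# Focal loss from [Ref. 1]: FL(p_t) = -(1 - p_t)^gamma * log(p_t), with gamma = 2.
def focal_loss(p_t, gamma=2.0):
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# Damping factor (1 - p_t)^gamma, i.e. how much smaller the loss is than plain cross-entropy.
for p_t in [0.1, 0.5, 0.9, 0.968]:
    damping = (1.0 - p_t) ** 2
    print(f"p_t={p_t:.3f}: loss scaled by {damping:.4f} (~{1.0 / damping:.0f}x smaller than CE)")
# Hard examples (p_t near 0) are barely damped; well-classified ones are damped ~100-1000x.
```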

The focal loss was first demonstrated with RetinaNet, which combines a ResNet backbone with a Feature Pyramid Network (FPN) to generate feature maps at different scales. Each feature map is connected to two subnets, used for classification and bounding-box regression respectively. The overall model is quite clean and verifies the usefulness of focal loss.

Figure Credit: https://zhuanlan.zhihu.com/p/68786098
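As a rough PyTorch sketch of the per-level head structure described in [Ref. 1] (make_subnet is a hypothetical helper; K classes, A anchors per location, four 3×3 conv + ReLU blocks of 256 channels before the final prediction conv):

```python
import torch.nn as nn

def make_subnet(out_channels, in_channels=256, width=256):
    layers = []
    for _ in range(4):                                   # 4x (3x3 conv + ReLU)
        layers += [nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(inplace=True)]
        in_channels = width
    layers.append(nn.Conv2d(width, out_channels, 3, padding=1))  # final prediction conv
    return nn.Sequential(*layers)

K, A = 80, 9
cls_subnet = make_subnet(K * A)   # per-anchor class scores (sigmoid + focal loss in training)
box_subnet = make_subnet(4 * A)   # per-anchor box regression offsets
```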

2. Fully Convolutional One-Stage Object Detection (FCOS)

FCOS was proposed in 2019 by the University of Adelaide [Ref. 2]. It is a proposal-free method that avoids the use of anchors, so FCOS saves the computation spent on redundant anchor-box calculations. Most importantly, FCOS does not introduce the many anchor-related hyper-parameters, which typically need to be carefully tuned for good results. It only needs non-maximum suppression (NMS) as post-processing, making it a much simpler detector than, for example, YOLO.

As shown in the figure below, FCOS consists of a backbone network, a feature pyramid, and a classification + center-ness + regression head. C3, C4 and C5 are the feature maps of the backbone network, P3 to P7 are the feature levels used for the final prediction, and H × W are the height and width of each feature map.

For each location in a feature map, a 4D real vector (l*, t*, r*, b*) is defined as the regression target, as shown below for a location (x, y) associated with bounding box i. Intuitively, l*, t*, r* and b* are the distances from the location to the four sides of the bounding box. If a location falls into multiple bounding boxes, it is considered an ambiguous sample, and the bounding box with the minimal area is chosen as the regression target.
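Since the equation image is not reproduced here, the regression targets from [Ref. 2] for a location (x, y) assigned to box Bi = (x0(i), y0(i), x1(i), y1(i)) can be written as:

```latex
l^* = x - x_0^{(i)}, \qquad t^* = y - y_0^{(i)}, \qquad
r^* = x_1^{(i)} - x, \qquad b^* = y_1^{(i)} - y
```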

2.1 Loss Function

Let Fi (H × W × C) be the feature map at layer i of the backbone CNN and s be the total stride up to this layer. The ground-truth bounding boxes are {Bi}, where each Bi is defined as (x0(i), y0(i), x1(i), y1(i), c(i)): the left, top, right and bottom sides of the bounding box, and c(i) the class the object belongs to. For the COCO dataset, the total number of classes is 80. Each location (x, y) on the feature map Fi can be mapped back onto the original image as (⌊s/2⌋ + xs, ⌊s/2⌋ + ys), which is near the center of the receptive field of location (x, y).
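As a toy illustration of this mapping (hypothetical values; stride s = 8 corresponds to feature level P3):

```python
# Map a feature-map location (x, y) back onto the input image: (s//2 + x*s, s//2 + y*s).
s = 8  # total stride of the feature level (e.g. P3 in FCOS)

def to_image_coords(x, y, s):
    return (s // 2 + x * s, s // 2 + y * s)

print(to_image_coords(10, 20, s))  # (84, 164), near the center of that cell's receptive field
```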

The final layer of the network predicts an 80D vector of classification scores plus a 4D vector t = (l, t, r, b) of bounding-box distances. It is worth pointing out that, instead of a single C-class multi-class classifier, FCOS uses C binary classifiers. FCOS has 9× fewer network outputs than the popular anchor-based detectors that place 9 anchor boxes at each location.

The training loss is defined below, where Lcls is the focal loss, Lreg is the IoU loss, and Npos is the number of positive samples. The summation runs over all locations of the feature maps Fi.
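Written out (the original equation image is omitted), the training loss from [Ref. 2] is:

```latex
L(\{p_{x,y}\}, \{t_{x,y}\}) =
  \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(p_{x,y},\, c^*_{x,y}\big)
  + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}}\, L_{reg}\big(t_{x,y},\, t^*_{x,y}\big)
```

where the indicator selects the positive locations and λ (set to 1 in the paper) balances the two terms.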

At inference, the input image is forwarded through the network to obtain the classification score p and the regression prediction t at each location. Locations with p greater than 0.05 are taken as positive samples, and their predicted distances (l, t, r, b) are inverted to recover the box corners, as in the sketch below.
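A minimal sketch of this inversion (decode_fcos_box is a hypothetical helper; the location (x, y) is assumed to be already mapped back to image coordinates):

```python
import numpy as np

def decode_fcos_box(x, y, l, t, r, b):
    # Invert the (l, t, r, b) distances into box corners (x0, y0, x1, y1).
    return np.array([x - l, y - t, x + r, y + b])

print(decode_fcos_box(100.0, 80.0, l=20.0, t=10.0, r=30.0, b=40.0))  # [ 80.  70. 130. 120.]
```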

2.2 Center-ness for FCOS

The results at this point are still somewhat behind the anchor-based detectors, mainly because locations far from an object's center predict low-quality bounding boxes. FCOS therefore introduces center-ness, defined as follows. The center-ness branch is trained with a binary cross-entropy loss added to the total loss, and at test time the predicted center-ness is multiplied with the classification score. Intuitively, positions close to the center keep a high score, while positions far from the center are down-weighted and their low-quality boxes are filtered out by the final NMS.
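For reference (the equation image is not shown here), the center-ness target from [Ref. 2] is:

```latex
\text{centerness}^* =
\sqrt{ \frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)} }
```

It equals 1 at the exact center of a box and decays toward 0 near its border.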

The recall of FCOS is similar to that of the best anchor-based detectors, and the AP it achieves is more than 1% higher than that of popular one-stage detectors.

3. CenterNet

CenterNet [Ref. 3] represents each object by the center point of its bounding box and regresses the remaining properties from that center. Consider an image of size 512 × 512 (W × H); with an output stride R = 4, the output heads have spatial dimension 128 × 128. The three types of outputs are: (1) the dimension head, (2) the heatmap head, and (3) the offset head.

image credit: https://medium.com/visionwizard/centernet-objects-as-points-a-comprehensive-guide-2ed9993c48bc
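For concreteness, the shapes of the three heads under these assumptions (C = 80 classes for COCO; the channel ordering here is only illustrative):

```python
# Output-head shapes for a 512 x 512 input with output stride R = 4.
W, H, R, C = 512, 512, 4, 80
heads = {
    "heatmap":   (W // R, H // R, C),  # one center heatmap per class
    "dimension": (W // R, H // R, 2),  # box width and height at each center
    "offset":    (W // R, H // R, 2),  # sub-pixel correction of the center location
}
print(heads)  # {'heatmap': (128, 128, 80), 'dimension': (128, 128, 2), 'offset': (128, 128, 2)}
```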

3.1 Dimension Head

This head is used to predict the width and height of the bounding boxes. Given the box coordinates (x1, y1, x2, y2) of object k with class c, the regression target for the dimension head is sk = (x2 − x1, y2 − y1). The dimension of this head is (W/R, H/R, 2).

3.2 Heatmap Head

The ground-truth heatmaps are produced by mapping each ground-truth center onto the low-resolution map and splatting a Gaussian kernel around it. There are C heatmaps, one per class, each marking the centers of that class. If two Gaussians of the same class overlap, the element-wise maximum is used as the target.

Gaussian Kernel for an object located at (px, py)
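Reconstructed from [Ref. 3] (the equation image is not reproduced), the heatmap value for class c around a ground-truth center p, with p̃ = ⌊p/R⌋ its low-resolution position and σ_p an object-size-adaptive standard deviation, is:

```latex
Y_{xyc} = \exp\!\left( -\,\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2 \sigma_p^2} \right)
```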

3.3 Offset Head

This head is used to recover the discretization error caused by downsampling the input. After the center points are predicted on the low-resolution map, their coordinates have to be mapped back to the original, higher-resolution image. This introduces an error, because pixel positions in the original image are integers while the downscaled centers are floats. The offset head solves this problem by predicting a local offset for each center. The dimension of this head is (W/R, H/R, 2).
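A toy example of the error being compensated (hypothetical center coordinates; R = 4 as in the example above):

```python
R = 4
center_x, center_y = 211, 97                        # ground-truth center in the input image
low_res   = (center_x / R, center_y / R)            # (52.75, 24.25) on the 128 x 128 map
quantized = (center_x // R, center_y // R)          # (52, 24): integer heatmap cell
offset_target = (low_res[0] - quantized[0], low_res[1] - quantized[1])
print(offset_target)                                # (0.75, 0.25) -> what the offset head learns
```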

3.4 Loss Function

The loss functions for the three types of heads are as follows.

(1) L1 Norm Dimension Size Loss

Dimension Size Loss

Ŝ is the prediction and s is the ground-truth size. Raw pixel values are used in the loss instead of values normalized by the feature-map size.
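Written out (the equation image is omitted), the size loss from [Ref. 3] over the N objects in an image is:

```latex
L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|
```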

(2) Heatmap Focal Loss
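The heatmap is trained with the penalty-reduced, pixel-wise focal loss from [Ref. 3], with α = 2 and β = 4 in the paper and N the number of object centers (equation reconstructed here since the image is not shown):

```latex
L_k = \frac{-1}{N} \sum_{xyc}
\begin{cases}
  \left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log \hat{Y}_{xyc}
    & \text{if } Y_{xyc} = 1 \\[4pt]
  \left(1 - Y_{xyc}\right)^{\beta} \hat{Y}_{xyc}^{\alpha} \log\!\left(1 - \hat{Y}_{xyc}\right)
    & \text{otherwise}
\end{cases}
```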

(3) L1 Norm Offset Loss
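The offset head uses an L1 loss between the predicted offset Ô at the quantized center p̃ = ⌊p/R⌋ and the sub-pixel remainder (reconstructed from [Ref. 3]):

```latex
L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|
```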

And the total loss now becomes:
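With the weights reported in [Ref. 3] (λ_size = 0.1, λ_off = 1), it reads:

```latex
L_{det} = L_k + \lambda_{size}\, L_{size} + \lambda_{off}\, L_{off}
```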

A comparison between CenterNet and state-of-the-art detection methods:

[Ref. 1] T.-Y. Lin et al., "Focal Loss for Dense Object Detection", ICCV 2017.

[Ref. 2] Z. Tian et al., "FCOS: Fully Convolutional One-Stage Object Detection", ICCV 2019.

[Ref. 3] X. Zhou et al., "Objects as Points", arXiv 2019.
