YOLO v3 and YOLO advancements

BigFanTheory
6 min read · Apr 25, 2021

YOLO stands for “You Only Look Once”. It is an object detection method that learns features with a deep convolutional neural network to detect objects. The key steps YOLO takes are as follows:

1. Convolutional Implementation of Sliding Window

YOLO is a fully convolutional network (FCN): it uses only convolutional layers. The main implementation of YOLO v3 [2018] is based on a variant of Darknet, which originally has 53 layers trained on ImageNet. For the detection task, 53 more layers were stacked on top, making a total of 106 layers with skip connections and upsampling layers. Convolutional layers of stride 2 are used to downsample the feature maps instead of pooling, which helps reduce the loss of low-level features that pooling would cause.

Because the GPU speeds up training by processing batches in parallel, all images are kept at a fixed width and height. The output size of a convolutional layer applied to an H x W input is (H (or W) + 2 × padding − filter size) / stride + 1. The feature map can therefore be approximated as the input downscaled by a factor of the stride. For example, an image of size 416x416 yields an output of size 13x13 at a stride of 32.
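As a quick sanity check, here is that formula in Python (a minimal sketch; the function name is mine, not Darknet’s):

```python
def conv_output_size(in_size, kernel, stride, padding):
    """Output spatial size of a convolution: (H + 2*P - K) // S + 1."""
    return (in_size + 2 * padding - kernel) // stride + 1

# A 416x416 input downsampled by a stack of stride-2, 3x3, pad-1 convolutions:
size = 416
for _ in range(5):                      # 2^5 = 32, the coarsest YOLO v3 stride
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
print(size)                             # -> 13, matching the 13x13 feature map
```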

2. Anchor Box

Predicting the raw width and height of a bounding box directly leads to unstable gradients during training. The now-popular solution is to employ pre-defined bounding boxes called anchors. The real bounding box is then predicted as log-space transforms and/or offsets applied to an anchor.

YOLO v3 has 3 anchor boxes per cell, so each cell predicts at most 3 bounding boxes. The anchor box responsible for producing the bounding box of a detected object is the one with the highest IoU with the ground truth box.

Specifically, bx, by, bw, bh in the formulas below are the center coordinates, width and height of the predicted bounding box. tx, ty, tw, th are the raw outputs of the neural network. cx, cy are the top-left coordinates of the grid cell, and pw, ph are the width and height of the anchor.

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)

The formulas used to obtain the coordinates and dimensions of a bounding box from the anchor box.

Notice the sigmoid function squashes the offset outputs to be between 0 and 1. These offsets are (1) shifts from the top-left corner (cx, cy) of the cell, and (2) normalized by the dimension of the feature map. For example, assume the selected anchor box sits at top-left corner (3, 4) on a 13x13 feature map and the network outputs offsets of (0.7, 0.3). The center of the bounding box is then (3.7, 4.3). If the predicted width bw and height bh of the bounding box are (0.4, 0.8) in normalized units, the box containing the detected object is of size 5.2 x 10.4 in grid units.
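Here is a minimal NumPy sketch of the decoding step, reproducing the worked example above (function and variable names are mine):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply the YOLO v3 transforms: sigmoid offsets for the center,
    log-space scaling of the anchor for the width and height."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Worked example from the text, in grid units of a 13x13 feature map:
# offsets (0.7, 0.3) at cell (3, 4) put the center at (3.7, 4.3), and
# normalized sizes (0.4, 0.8) scale to 5.2 x 10.4 grid cells.
bx, by = 3 + 0.7, 4 + 0.3
bw, bh = 0.4 * 13, 0.8 * 13
print(bx, by, bw, bh)   # -> 3.7 4.3 5.2 10.4 (up to float rounding)
```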

2.0 YOLO v2

Before jumping into the advances of v3, let’s get a bit more understanding of v2 and its innovations. v2 was introduced at CVPR 2017; it used Darknet-19 as the backbone model and batch normalization (introduced in 2015). v2 also introduced the anchor idea after v1, following Faster-RCNN. Unlike Faster-RCNN, which uses 9 hand-picked anchors per cell, v2 used K-means clustering on the training set boxes to pick the most common height and width ratios (the paper settles on 5 anchors).
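A minimal sketch of that clustering, assuming ground-truth (width, height) pairs have been collected from the training set; the distance metric 1 − IoU is the one described in the YOLO v2 paper, while the function names are mine:

```python
import numpy as np

def iou_wh(wh, centroids):
    """IoU between co-centered boxes, given only (width, height) pairs."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100):
    """K-means over ground-truth box sizes with distance 1 - IoU."""
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    centroids = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the smallest 1 - IoU distance
        assign = np.array([np.argmax(iou_wh(wh, centroids)) for wh in boxes_wh])
        for i in range(k):
            members = boxes_wh[assign == i]
            if len(members):                 # guard against empty clusters
                centroids[i] = members.mean(axis=0)
    return centroids
```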

2.1 Prediction at different scales for YOLO v3

YOLO v3 makes predictions across 3 different scales. The detection algorithm is applied on feature maps of size 52x52, 26x26 and 13x13, which are produced by convolutional layers with strides of 8, 16 and 32 respectively, given an input image size of 416x416.

Image Credit: https://zhuanlan.zhihu.com/p/97170924

The network downsamples the input image until the first detection layer, where a detection is made on the stride-32 feature map. This layer is then upsampled by a factor of 2 and concatenated with an earlier feature map of the same size. A second detection is made at the layer with stride 16. The upsampling procedure is repeated, and a third and final detection is made on the layer with stride 8.

At each scale there are 3 anchor boxes, making a total of 9. Those 9 anchors are of different sizes: 10x13, 16x30, 33x23, 30x61, 62x45, 59x119, 116x90, 156x198 and 373x326. These sizes were obtained through K-means clustering on the COCO dataset.
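Grouped by scale (the standard assignment: the finest grid gets the smallest anchors), the anchors look like this:

```python
# The 9 COCO anchors, grouped by detection stride. Sizes are in pixels
# of the 416x416 input image.
anchors = {
    8:  [(10, 13), (16, 30), (33, 23)],       # 52x52 grid, small objects
    16: [(30, 61), (62, 45), (59, 119)],      # 26x26 grid, medium objects
    32: [(116, 90), (156, 198), (373, 326)],  # 13x13 grid, large objects
}
```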

The author reported a better recall rate for smaller objects, which was often a drawback of the previous versions. Upsampling and detection at different scales help the detector learn finer-grained features, which is instrumental for small-object detection.

3. Non-max Suppression (NMS) and threshold filtering

At each scale, each cell can predict B bounding boxes (in the case of YOLO v3, B = 3). For each bounding box there are 5 + C outputs: the 4 coordinate parameters mentioned above plus 1 objectness score, where C is the number of classes.
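For the 80-class COCO setting, the per-layer output shapes work out as follows (a quick shape check, not Darknet code):

```python
B, C = 3, 80                       # 3 anchors per cell, 80 COCO classes
channels = B * (5 + C)             # -> 255 output channels per detection layer
for grid in (13, 26, 52):
    print((grid, grid, channels))  # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```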

This yields far too many boxes, most of them redundant. The first step is to discard all boxes whose probability of containing an object falls below a threshold, which is called threshold filtering. This can be done by constructing a boolean mask and keeping only the boxes with a probability above the threshold.
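A minimal sketch of threshold filtering with a boolean mask (names are mine):

```python
import numpy as np

def filter_boxes(boxes, scores, threshold=0.5):
    """Keep only boxes whose detection score clears the threshold.

    boxes:  (N, 4) array of box coordinates
    scores: (N,) array of objectness * class probability
    """
    mask = scores >= threshold        # boolean mask over all N boxes
    return boxes[mask], scores[mask]
```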

Threshold filtering gets rid of anomalous detections. However, multiple boxes remain for each detected object. The non-max suppression (NMS) method, which relies on the ratio of the intersection to the union of two boxes (intersection over union, IoU), works as follows (see the sketch after the list):

(1) Select the box with the highest probability of detection;

(2) Remove all the boxes with a high IoU with the selected box;

(3) Mark the selected box as “processed” and repeat from (1) with the remaining boxes.
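A minimal NumPy sketch of that greedy procedure, assuming boxes in (x1, y1, x2, y2) form (helper names are mine):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an (N, 4) array of boxes."""
    lt = np.maximum(box[:2], boxes[:, :2])    # top-left of intersections
    rb = np.minimum(box[2:], boxes[:, 2:])    # bottom-right of intersections
    inter = np.clip(rb - lt, 0, None).prod(axis=1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop its high-IoU neighbours, repeat."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep
```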

NMS ensures only one bounding box remains for every detected object. Cases where more than one object falls into the same grid cell usually cause a low recall rate.

4. Interpret the outputs

For each cell on the feature map, there are (5 + C) x B outputs. Ideally, each cell of the feature map predicts an object through one of its bounding boxes if the object’s center falls in the receptive field of that cell. The receptive field is the region of the input image visible to that cell.

Image Credit: https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/

The cell containing the center of the ground truth box of the object is responsible for prediction, which is the cell marked in red in the image.
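The responsible cell can be found by integer-dividing the ground-truth center by the layer’s stride (a toy illustration; the coordinates are made-up values):

```python
# Which cell is responsible for a ground-truth box? The one containing
# the box center, found by dividing the center by the layer's stride.
center_x, center_y = 250.0, 180.0   # GT box center in input-image pixels
stride = 32                         # coarsest detection layer (13x13 grid)
cell = (int(center_x // stride), int(center_y // stride))
print(cell)                         # -> (7, 5)
```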

YOLO v3 employs Darknet-53 as its backbone, which is more powerful than Darknet-19 (used in YOLO v2) and more efficient than ResNet-101 or ResNet-152. It improves top-5 classification accuracy to 93.8% from 91.8% in YOLO v2. YOLO v3 and tiny-YOLO offer users a choice in the tradeoff between accuracy and speed. However, efficiency is degraded, on one hand by the multi-scale detection and the more complex CNN model, and on the other hand by redundant computation, since a fixed number of bounding boxes must be predicted in each cell regardless of content. FCOS and CenterNet are anchor-free methods that detect objects without predefined anchors and improve detection efficiency.

5. The difference between SSD and YOLO v3

(1) YOLO v3 has 3 anchors per level, making a total of 9 anchor boxes. Their sizes are pre-determined through K-means clustering of all the objects in the COCO dataset. In the case of SSD, the anchor sizes are set manually.

(2) In YOLO v3, we keep all anchor boxes with an IoU (with the ground truth, GT) higher than 0.3, which means there can be multiple anchors for one GT box. If there is no such anchor box, the anchor with the highest IoU is selected instead (see the sketch after this list).

In the case of SSD, the anchor box with the highest IoU is chosen first. After that, anchors with an IoU larger than 0.5 are also considered.

(3) The loss function of YOLO v3 is GIoU plus sigmoid binary cross-entropy. SSD uses smooth L1 loss and softmax cross-entropy loss.
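A minimal sketch of the YOLO v3 matching rule from point (2), assuming boxes in (x1, y1, x2, y2) form (helper names are mine):

```python
import numpy as np

def iou_many(gt, anchors):
    """IoU of one (x1, y1, x2, y2) box against an (N, 4) array of anchors."""
    lt = np.maximum(gt[:2], anchors[:, :2])
    rb = np.minimum(gt[2:], anchors[:, 2:])
    inter = np.clip(rb - lt, 0, None).prod(axis=1)
    areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    return inter / ((gt[2] - gt[0]) * (gt[3] - gt[1]) + areas - inter)

def match_anchors(gt_box, anchor_boxes, threshold=0.3):
    """Keep every anchor with IoU > threshold; if none qualifies,
    fall back to the single best-IoU anchor."""
    ious = iou_many(np.asarray(gt_box, float), np.asarray(anchor_boxes, float))
    matched = np.where(ious > threshold)[0]
    return matched if matched.size else np.array([np.argmax(ious)])
```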

6. Advances after YOLO v3

YOLO v2 can process images at 40–90 FPS, while YOLO v3 allows us to easily trade off between speed and accuracy just by changing the model size, without any retraining.

Ref. 1: YOLOv3: An Incremental Improvement

Ref. 2: YOLOv3 Summary / Code / Comparison with SSD (yolov3总结/代码/SSD对比)

Ref. 3: YOLO v4 or YOLO v5 or PP-YOLO?
