YOLOv2 and YOLOv3 study notes

The basic idea

YOLOv2 is the second version of YOLO. The object detection system still only needs to "look once" at the image. Its overall structure is as follows:


It mainly consists of two parts:

  • Neural network: maps the input picture to a 13\times 13\times 125 tensor, which encodes the predicted object locations and category information.
  • Detector: "decodes" the tensor output by the neural network and outputs the class and location of each object.

Neural network part

The neural network part of YOLOv2 uses a neural network with a jump layer. The specific structure is as follows:


The design of the neural network has not changed drastically. Compared with the YOLOv1 neural network design, the main changes are as follows:

  • A batch normalization layer is added after each convolutional layer, which accelerates the convergence of the network.
  • At the 16th layer the network splits into two paths, passing low-level features directly to a higher layer, which improves the performance of the model.
  • The fully connected layer is removed, and the original position information is saved in the final output vector.
  • The input size becomes 416\times 416\times 3, so higher-resolution pictures can be recognized.

The final input image size of the network is 416\times 416\times 3 and the output tensor size is 13\times 13\times 125.

Detector part

YOLOv2 uses the anchor box method. The tensor output by the neural network has size 13\times 13\times 125: the 13\times 13 grid divides the picture into 13 rows and 13 columns, a total of 169 cells, and each cell carries 125 values. The 125 values of each cell decompose as 125 = 5\times (5+20): each cell contains 5 anchor boxes, and each anchor box contains 25 values, namely the confidence that an object exists, the object's center location (x, y), its size (w, h), and its category scores (20 classes). As shown below:


Each cell includes 5 anchor boxes, and each anchor box includes 25 values:

  • Objectness confidence: is there an object? (1 value)
  • Object location (4 values)
  • Object class scores (20 values)
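The decomposition above can be sketched as a reshape of the raw output tensor. This is a minimal illustration; the exact ordering of the 25 fields within each anchor (confidence first, then location, then classes) is an assumption chosen to match the list above, not necessarily Darknet's actual memory layout.

```python
import numpy as np

S, B, C = 13, 5, 20                       # grid size, anchors per cell, VOC classes
# Simulated raw network output, one flat vector of 125 values per cell
output = np.zeros((S, S, B * (5 + C)))    # shape (13, 13, 125)

# Make the per-anchor structure explicit: 125 = 5 x (5 + 20)
output = output.reshape(S, S, B, 5 + C)   # shape (13, 13, 5, 25)

conf = output[..., 0]                     # objectness confidence, shape (13, 13, 5)
box = output[..., 1:5]                    # x, y, w, h, shape (13, 13, 5, 4)
classes = output[..., 5:]                 # 20 class scores, shape (13, 13, 5, 20)
```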

The objectness mark conf_{ijk} is easy to understand: it is the confidence that an object exists in the k-th anchor box of cell (i, j). The 20-dimensional category vector is also straightforward: the class with the largest value is the predicted category.

The four values for the object location are x_{ijk}, y_{ijk}, w_{ijk}, h_{ijk}, and their relationship to the object's center point and size is:

$$\begin{aligned} b_x &= f(x_{ijk}) + c_x \\ b_y &= f(y_{ijk}) + c_y \\ b_w &= p_w e^{w_{ijk}} \\ b_h &= p_h e^{h_{ijk}} \end{aligned}$$

Here b_x, b_y are the actual coordinates of the object's center point, and b_w, b_h are the object's size (width and height). c_x, c_y are the offsets of the cell (row x, column y) from the upper left corner of the image, and f squashes its input into the range 0 to 1, so the predicted center stays inside the cell (in the YOLOv2 paper f is the sigmoid function). p_w and p_h are the preset sizes of the anchor box. As shown below:
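The decoding of one anchor box can be sketched as follows, taking f to be the sigmoid and measuring everything in grid-cell units. This is an illustrative helper, not the original implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(x, y, w, h, cx, cy, pw, ph):
    """Map raw anchor outputs (x, y, w, h) to a box in grid-cell units."""
    bx = sigmoid(x) + cx        # center stays within the cell at offset (cx, cy)
    by = sigmoid(y) + cy
    bw = pw * np.exp(w)         # size scales the anchor prior (pw, ph)
    bh = ph * np.exp(h)
    return bx, by, bw, bh

# Raw outputs of 0 give the center of the cell and exactly the prior size:
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=2.0, ph=2.0)
# -> (6.5, 6.5, 2.0, 2.0)
```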


Each cell includes 5 anchor boxes with different preset sizes. The preset sizes can be specified manually or learned from data; in YOLOv2 they are obtained by k-means clustering on the bounding boxes of the training set.
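A simplified sketch of that clustering, using the paper's distance d = 1 - IoU between width/height pairs (equivalently, assigning each box to the anchor with the highest IoU). The function names and the plain k-means loop here are illustrative assumptions, not the original code.

```python
import numpy as np

def iou_wh(wh, anchors):
    """IoU between boxes and anchors, assuming shared centers (width/height only)."""
    inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(wh[:, None, 1], anchors[None, :, 1])
    union = (wh[:, 0] * wh[:, 1])[:, None] \
        + anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=50, seed=0):
    """Cluster ground-truth (w, h) pairs with distance 1 - IoU."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, anchors), axis=1)  # nearest = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors
```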

Model training

The neural network part is based on the Darknet-19 model. Training is divided into two stages: pre-training and training.

  • Pre-training: the network is first trained as a classifier on ImageNet for 160 epochs with SGD, using an initial learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005, and momentum of 0.9. After training on 224x224 images, the network is fine-tuned on 448x448 images.
  • Training: the last convolutional layer of Darknet-19 is removed, the structure is modified into the YOLOv2 network, and the model is trained on the VOC data set with an MSE (mean squared error) cost function.

In addition, multi-scale training is introduced during training. Since the fully connected layers were removed, the network does not depend on a fixed picture size, so training draws image sizes from 320 to 608 in steps of 32: {320, 352, ..., 608}.
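The candidate sizes are simply the multiples of 32 (the network's total downsampling factor) in that range:

```python
# Multi-scale training sizes: multiples of 32 between 320 and 608.
# Every few batches a new size is drawn from this list.
sizes = list(range(320, 608 + 1, 32))
# -> [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
```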


YOLOv3

YOLOv3 is the latest update of YOLO, and its main improvements are as follows:

  • The network structure changed: the backbone changed from Darknet-19 to Darknet-53, and skip connections became much more common.
  • The head activation function changed: the activation for category prediction was changed from softmax to sigmoid.
  • The number of scales changed: predictions are now made at 3 different scales, each with 3 anchor boxes per cell (instead of 5 anchor boxes at a single scale).
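The softmax-to-sigmoid change matters because class labels need not be mutually exclusive (e.g. "woman" and "person" can both apply). A small numeric sketch with three hypothetical class logits:

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])  # hypothetical scores for 3 overlapping classes

softmax = np.exp(logits) / np.exp(logits).sum()  # forced to sum to 1: one winner
sigmoid = 1.0 / (1.0 + np.exp(-logits))          # independent per-class probabilities

# With sigmoid, several classes can exceed a 0.5 threshold at once:
predicted = sigmoid > 0.5
# -> [True, True, False]
```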

The network

The network structure is as follows:


The network structure clearly borrows from the design of ResNet, connecting low-level features directly to higher layers. Also note that the network uses no pooling layers; downsampling is achieved with stride-2 convolutional layers instead.
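As a sanity check on that downsampling scheme, the standard convolution output-size formula shows how repeated stride-2, 3x3, padding-1 convolutions take a 416x416 input down to the 13x13 grid (a small framework-free sketch):

```python
def conv_out_size(n, k=3, s=2, p=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Five stride-2 convolutions halve the resolution five times,
# replacing the pooling layers used for downsampling: 416 -> 13.
n = 416
for _ in range(5):
    n = conv_out_size(n)
# n -> 13
```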

Reference: https://cloud.tencent.com/developer/article/1156245