YOLO (You Only Look Once) is an object detection system. Its defining feature is that it fuses object localization and object classification into a single step, using one deep learning model to directly compute the location and category of each object. The basic idea is as follows:

yolo_basic.JPG

1. The picture is divided into $S \times S$ grid cells, as shown in the leftmost picture. If the center of an object falls in a grid cell, that cell is responsible for predicting the type and position of the object. For each grid cell, the following data needs to be calculated:

- Data for $B$ bounding boxes, $B \times 5$ values in total. Each bounding box has 5 values:
  - x: the x coordinate of the center of the object
  - y: the y coordinate of the center of the object
  - w: the width of the object
  - h: the height of the object
  - conf: the confidence of the bounding box, defined as $conf = P(object) \times IOU_{pred}^{truth}$. This measure accounts for both the probability that the bounding box contains an object and the overlap (IOU) between the predicted bounding box and the real object.

- Category: $C$ values in total, one per object category, used to mark which category the object in the grid cell belongs to.

In the VOC data set there are 20 object categories, that is, $C = 20$. Taking $S = 7$ and $B = 2$, the final output has $7 \times 7 \times (20 + 5 \times 2) = 7 \times 7 \times 30$ values, expressed as the dimension [7,7,30].
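To make the [7,7,30] layout concrete, here is a minimal sketch that computes the per-cell length and splits one cell's 30 values into its bounding boxes and class scores. The ordering (all bounding boxes first, then the classes) and the helper name `split_cell` are assumptions made for illustration; the notes above do not fix the order.

```python
S, B, C = 7, 2, 20
CELL_LEN = B * 5 + C          # 5 values per bounding box plus C class scores
OUTPUT_SHAPE = [S, S, CELL_LEN]


def split_cell(vec, B=2, C=20):
    """Split one grid cell's flat prediction into boxes and class scores.

    Assumed layout (hypothetical): B groups of (x, y, w, h, conf),
    followed by C class scores.
    """
    if len(vec) != B * 5 + C:
        raise ValueError("unexpected cell length")
    boxes = [tuple(vec[5 * j:5 * j + 5]) for j in range(B)]
    classes = list(vec[5 * B:])
    return boxes, classes


print(OUTPUT_SHAPE)  # [7, 7, 30]
boxes, classes = split_cell(list(range(30)))
print(boxes[1])      # (5, 6, 7, 8, 9)
print(len(classes))  # 20
```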

The design of the network structure is as follows:

yolo_network.JPG

The network design is based on GoogLeNet, with the Inception structure simply replaced by a 1x1 convolution followed by a 3x3 convolution. Note also that the activation function is leaky ReLU with a leak constant of 0.1.
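For reference, the leaky ReLU activation with a leak constant of 0.1 can be written for a single value as below. This is only a scalar sketch; the function name `leaky_relu` is chosen here for illustration.

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: pass positive inputs through, scale the rest by alpha."""
    return x if x > 0 else alpha * x


print(leaky_relu(2.0))   # 2.0
print(leaky_relu(-2.0))  # -0.2
```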

In addition, the x, y, w, and h parameters mentioned above are all normalized: the center position x, y is normalized by the size of the grid cell (it is the offset of the center within its cell), and w, h are normalized by the size of the picture. After this, x, y, w, and h all lie in the range 0~1.
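The normalization can be sketched as follows: find the grid cell that contains the object center, express the center as an offset inside that cell, and divide the width and height by the picture size. The function name `encode_box` and the 448x448 picture size used in the example are assumptions for illustration, not values taken from these notes.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Convert a pixel-space box (center cx, cy; size w, h) to normalized targets.

    Returns (row, col, x, y, w_norm, h_norm), where x and y are offsets
    inside the responsible grid cell and w_norm, h_norm are fractions of
    the picture size.
    """
    gx = cx * S / img_w            # center measured in grid-cell units
    gy = cy * S / img_h
    col = min(int(gx), S - 1)      # clamp so a center on the border stays in the grid
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h


# Hypothetical 448x448 picture with an object centered at (224, 100).
print(encode_box(224, 100, 112, 224, 448, 448))
# (1, 3, 0.5, 0.5625, 0.25, 0.5)
```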

The model is first pre-trained on ImageNet, where it reaches a Top-5 accuracy of 88%. When detection training starts, the pre-trained parameters of the first 20 convolutional layers are retained.

The training cost function is divided into five parts, as shown below:

$$
\begin{aligned}
Loss &= Loss_{xy} + Loss_{wh} + Loss_{obj} + Loss_{noobj} + Loss_c \\
Loss_{xy} &= \lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} l_{ij}^{obj}\left[(x_i-x'_i)^2 + (y_i-y'_i)^2\right] \\
Loss_{wh} &= \lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} l_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{w'_i})^2 + (\sqrt{h_i}-\sqrt{h'_i})^2\right] \\
Loss_{obj} &= \sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} l_{ij}^{obj}(C_i-C'_i)^2 \\
Loss_{noobj} &= \lambda_{noobj}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} l_{ij}^{noobj}(C_i-C'_i)^2 \\
Loss_c &= \sum\limits_{i=0}^{S^2} l_{i}^{obj}\sum\limits_{c\in classes}(p_i(c)-p'_i(c))^2
\end{aligned}
$$

Here, $l_{i}^{obj}$ marks whether there is an object in grid cell $i$: it is 1 if there is, otherwise 0. $l_{ij}^{obj}$ marks whether the $j$-th bounding box of grid cell $i$ is responsible for an object: it is 1 if so, otherwise 0, and $l_{ij}^{noobj}$ is its complement. Since one grid cell generates multiple bounding boxes, the bounding box with the highest IOU with the real region is the one used to predict the object. $\lambda_{coord}$ and $\lambda_{noobj}$ are the weights for the two cases of grid cells containing objects and not containing objects: $\lambda_{coord}$ scales the position terms, and $\lambda_{noobj}$ scales the confidence term of bounding boxes without objects. Their values are $\lambda_{noobj} = 0.5$ and $\lambda_{coord} = 5$.
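The selection of the bounding box with the highest IOU can be sketched like this. The corner format `(x_min, y_min, x_max, y_max)` and the helper names `iou` and `responsible_box` are assumptions made for this example; the notes themselves describe boxes by center and size.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_box(pred_boxes, truth):
    """Return (k, indicators, ious) for one grid cell that contains an object.

    k is the index of the predicted box with the highest IOU against the
    real box; indicators[j] is 1 for j == k and 0 otherwise.
    """
    ious = [iou(p, truth) for p in pred_boxes]
    k = max(range(len(ious)), key=ious.__getitem__)
    indicators = [1 if j == k else 0 for j in range(len(pred_boxes))]
    return k, indicators, ious


truth = (0, 0, 2, 2)
preds = [(1, 1, 3, 3), (0, 0, 2, 1)]   # IOUs: 1/7 and 0.5
k, indicators, ious = responsible_box(preds, truth)
print(k, indicators)  # 1 [0, 1]
```

With $P(object) = 1$ for this cell, the confidence target of the responsible box under the definition above would be its IOU, here 0.5.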

$Loss_{xy}$ and $Loss_{wh}$ are the cost terms for the location of the object, $Loss_{obj}$ and $Loss_{noobj}$ are the cost terms for the confidence that an object exists, and $Loss_c$ is the cost term for the object category. Suppose the network outputs a tensor of dimension [7,7,(20+2x5)], and suppose there is only one object, whose center lies in the grid cell located at $(a, b)$, indexed here as $i = a \times b$. Then $l_{i=a\times b}^{obj} = 1$ and $l_{i\neq a\times b}^{obj} = 0$, so $Loss_c = \sum\limits_{c\in classes}(p_{a\times b}(c)-p'_{a\times b}(c))^2$. This grid cell has $B$ bounding boxes. If the bounding box labeled $k$ has the highest IOU with the object, then $l^{obj}_{i = a\times b,\, j = k} = 1$ and $l^{obj}_{ij} = 0$ for every other pair $(i, j)$. That is, when calculating the terms weighted by $l^{obj}_{ij}$, only bounding box $k$ is considered, and bounding box $k$ is designated as responsible for detecting the object.
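As a worked example, the sketch below evaluates the five cost terms for a single grid cell. It rests on several assumptions that these notes do not spell out: every bounding box that is not responsible for an object is treated as a "no object" box with a confidence target of 0, predicted w and h are non-negative so the square roots are defined, and the function name `cell_loss` and its argument layout are made up for illustration.

```python
from math import sqrt


def cell_loss(pred_boxes, pred_classes, true_box=None, true_conf=0.0,
              true_classes=None, k=0, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of the five cost terms for one grid cell.

    pred_boxes   : list of B tuples (x, y, w, h, conf)
    pred_classes : list of C predicted class probabilities
    true_box     : (x, y, w, h) of the object, or None if the cell is empty
    true_conf    : confidence target of the responsible box (its IOU)
    k            : index of the responsible box (ignored for empty cells)
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if true_box is not None and j == k:
            tx, ty, tw, th = true_box
            loss += lambda_coord * ((x - tx) ** 2 + (y - ty) ** 2)      # Loss_xy
            loss += lambda_coord * ((sqrt(w) - sqrt(tw)) ** 2
                                    + (sqrt(h) - sqrt(th)) ** 2)        # Loss_wh
            loss += (conf - true_conf) ** 2                             # Loss_obj
        else:
            loss += lambda_noobj * conf ** 2    # Loss_noobj, assumed target 0
    if true_box is not None:
        loss += sum((p - t) ** 2
                    for p, t in zip(pred_classes, true_classes))        # Loss_c
    return loss


boxes = [(0.5, 0.5, 0.25, 0.25, 0.8), (0.4, 0.6, 0.16, 0.36, 0.2)]
with_obj = cell_loss(boxes, [0.9, 0.1], true_box=(0.5, 0.5, 0.25, 0.25),
                     true_conf=1.0, true_classes=[1.0, 0.0], k=0)
empty = cell_loss(boxes, [0.9, 0.1])
print(round(with_obj, 6), round(empty, 6))  # 0.08 0.34
```

In the first call the position terms vanish because box 0 matches the real box exactly, so the total is 0.04 (object confidence) + 0.02 (no-object confidence of box 1) + 0.02 (category). In the second call only the no-object term contributes.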

During testing, each picture yields $S \times S \times B$ bounding boxes. As I understand it, each grid cell is attributed to a single object, so for each grid cell one can keep only the bounding box with the highest confidence among its $B$ bounding boxes, which leaves $S \times S$ bounding boxes. Among these, a threshold can be set to filter out the grid cells with low confidence. Finally, non-maximum suppression is performed on the remaining ones: among bounding boxes that belong to the same category and whose IOU with each other exceeds a certain threshold, the confidence levels are compared and only the bounding box with the highest confidence is kept.
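The filtering and non-maximum suppression steps can be sketched as below. This is a plain greedy per-category NMS written for illustration; the threshold values 0.2 and 0.5, the corner box format `(x_min, y_min, x_max, y_max)`, and the function names are assumptions, not values from these notes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thresh=0.2, iou_thresh=0.5):
    """Greedy per-category non-maximum suppression.

    detections: list of (box, conf, category) tuples.
    Boxes below conf_thresh are dropped first; the rest are visited in order
    of decreasing confidence, and a box is kept only if it does not overlap
    an already kept box of the same category by more than iou_thresh.
    """
    candidates = sorted((d for d in detections if d[1] >= conf_thresh),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf, cat in candidates:
        if all(c != cat or iou(box, kb) <= iou_thresh for kb, _, c in kept):
            kept.append((box, conf, cat))
    return kept


dets = [
    ((0, 0, 10, 10), 0.9, "dog"),
    ((1, 1, 11, 11), 0.8, "dog"),    # overlaps the first dog box, IOU = 81/119
    ((1, 1, 11, 11), 0.7, "cat"),    # same place, different category: kept
    ((20, 20, 30, 30), 0.1, "dog"),  # below the confidence threshold
]
for box, conf, cat in nms(dets):
    print(cat, conf, box)
# dog 0.9 (0, 0, 10, 10)
# cat 0.7 (1, 1, 11, 11)
```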

Reference: YOLO1 study notes: basic ideas, network design, training and prediction (Tencent Cloud, Cloud + Community), https://cloud.tencent.com/developer/article/1156247