SSD object detection: system structure and network training

First published on personal blog

System structure

system.png

The SSD detection system is a single-stage object detection system: it extracts object locations and judges object categories in a single pass. Its main characteristic is that the features the recognizer uses to detect objects come not only from the final output of the neural network, but also from the network's intermediate results. The system is divided into the following parts (a code sketch follows the list):

  • Neural network part: serves as a feature extractor, extracting image features
  • Recognizer: from the features extracted by the neural network, generates candidate boxes containing object location and category information (implemented with convolutions)
  • Post-processing: decodes and filters (NMS) the candidate boxes produced by the recognizer, and outputs the final boxes
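
As a minimal sketch of how these three parts chain together (`backbone`, `heads`, and `decode_and_nms` are hypothetical stand-ins, not names from the original; the real components are detailed in the sections below):

```python
import torch

def detect(image: torch.Tensor, backbone, heads, decode_and_nms):
    features = backbone(image)                           # neural network: multi-scale feature maps
    raw = [head(f) for head, f in zip(heads, features)]  # recognizer: one conv head per feature map
    return decode_and_nms(raw)                           # post-processing: decode + NMS
```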

Neural Networks

network.PNG

The network structure of the system is shown in the figure above. The base network is VGG-16, which is composed of a sequence of 3x3 convolutions. Before the conv5_3 convolution there are 4 max-pooling layers with stride 2, so the output of this layer is 16 times smaller in width and height than the input. In the SSD300 network the input image is normalized to 300x300, so the width and height of this layer's output are $\lceil 300 / 16 \rceil = 19$, with 512 channels; that is, the output size of the base network VGG-16 is 512x19x19.

After the base network, the following layers are appended (kernel sizes are given as out_channels x in_channels x kernel_h x kernel_w):

| name | input | kernel size | stride | padding | output | to recognizer |
| --- | --- | --- | --- | --- | --- | --- |
| conv6 | 512x19x19 | 1024x512x3x3 | 1 | 1 | 1024x19x19 | N |
| conv7 | 1024x19x19 | 1024x1024x1x1 | 1 | 0 | 1024x19x19 | Y |
| conv8_1 | 1024x19x19 | 256x1024x1x1 | 1 | 0 | 256x19x19 | N |
| conv8_2 | 256x19x19 | 512x256x3x3 | 2 | 1 | 512x10x10 | Y |
| conv9_1 | 512x10x10 | 128x512x1x1 | 1 | 0 | 128x10x10 | N |
| conv9_2 | 128x10x10 | 256x128x3x3 | 2 | 1 | 256x5x5 | Y |
| conv10_1 | 256x5x5 | 128x256x1x1 | 1 | 0 | 128x5x5 | N |
| conv10_2 | 128x5x5 | 256x128x3x3 | 1 | 0 | 256x3x3 | Y |
| conv11_1 | 256x3x3 | 128x256x1x1 | 1 | 0 | 128x3x3 | N |
| conv11_2 | 128x3x3 | 256x128x3x3 | 1 | 0 | 256x1x1 | Y |

Rows marked Y in the "to recognizer" column are fed to the recognizer. The recognizer therefore receives feature maps of different sizes, 5 + 1 = 6 in total (5 from the additional output layers and 1 from the base network): 10x10, 5x5, 3x3, 1x1, and two of size 19x19.
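
The table can be checked with a short PyTorch sketch (our own code, not from the original; ReLU activations are assumed after each convolution, and the base-network output is simulated with random data):

```python
import torch
import torch.nn as nn

# Extra layers exactly as in the table above.
layers = nn.ModuleDict({
    "conv6":    nn.Conv2d(512, 1024, 3, stride=1, padding=1),
    "conv7":    nn.Conv2d(1024, 1024, 1),
    "conv8_1":  nn.Conv2d(1024, 256, 1),
    "conv8_2":  nn.Conv2d(256, 512, 3, stride=2, padding=1),
    "conv9_1":  nn.Conv2d(512, 128, 1),
    "conv9_2":  nn.Conv2d(128, 256, 3, stride=2, padding=1),
    "conv10_1": nn.Conv2d(256, 128, 1),
    "conv10_2": nn.Conv2d(128, 256, 3),
    "conv11_1": nn.Conv2d(256, 128, 1),
    "conv11_2": nn.Conv2d(128, 256, 3),
})

x = torch.randn(1, 512, 19, 19)   # simulated 512x19x19 base-network output
recognizer_inputs = [x]           # the base-network output is also used
for name, conv in layers.items():
    x = torch.relu(conv(x))
    if name in {"conv7", "conv8_2", "conv9_2", "conv10_2", "conv11_2"}:
        recognizer_inputs.append(x)   # rows marked Y feed the recognizer
        print(name, tuple(x.shape))
```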

Recognizer

The recognizer is composed of a convolutional layer with 3x3 kernels and

$$box\_num \times (class\_num + 4)$$

output channels, where box_num is the number of candidate boxes generated at each grid point of the feature map, and class_num is the number of categories (including the background category), as shown below:

default_box.PNG

The figure shows a 4x4 feature map, i.e. $4 \times 4 = 16$ grid points in total. Each grid point has 3 candidate boxes, i.e. box_num = 3. The category information consists of p values, i.e. there are p categories to judge (p includes the background category), so class_num = p. The remaining 4 values, after loc, are position fine-tuning information. An $m \times n$ feature map, after being processed by the recognizer, becomes a tensor of size

$$(box\_num \times (class\_num + 4)) \times m \times n$$

which contains $m \times n \times box\_num$ candidate boxes.
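
For illustration, a recognizer head in PyTorch could be a single convolution like the following (a sketch; `make_head` and the example numbers are ours, not from the original):

```python
import torch.nn as nn

# One recognizer head per feature map: a 3x3 convolution whose output
# channels encode box_num boxes, each with class_num scores + 4 offsets.
def make_head(in_channels: int, box_num: int, class_num: int) -> nn.Conv2d:
    return nn.Conv2d(in_channels, box_num * (class_num + 4),
                     kernel_size=3, padding=1)

# e.g. on the 1024x19x19 conv7 output, with 6 boxes per grid point and
# 21 categories: make_head(1024, 6, 21) maps (N, 1024, 19, 19) to
# (N, 6*25, 19, 19), i.e. 19*19*6 candidate boxes.
```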

Post-processing

The first post-processing step parses the data in each candidate box. Each candidate box is composed of 4 + class_num values: 4 position values x, y, w, h and class_num category confidences. The parsing is essentially the same as for anchor boxes:

$$b_x = W\left(\frac{i + 0.5}{f_w} + x \cdot d_w\right), \qquad b_y = H\left(\frac{j + 0.5}{f_h} + y \cdot d_h\right)$$

$$b_w = W \cdot d_w \cdot e^{w}, \qquad b_h = H \cdot d_h \cdot e^{h}$$

where $b_x, b_y$ are the center coordinates of the detected object and $b_w, b_h$ its width and height; $W, H$ are the width and height of the input image; $(i, j)$ is the grid point where the candidate box is located, with values ranging over $0 \sim f_w - 1$ and $0 \sim f_h - 1$; $f_w, f_h$ are the width and height of the feature map containing the candidate box ($f_w = f_h = 4$ in the figure above); and $d_w, d_h$ are the default normalized width and height of the corresponding default box. For the category information, the largest confidence is chosen:

$$\text{class} = \arg\max_p c_p$$
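The decoding of a single box can be written directly from these formulas (a sketch; the per-coordinate variance scaling used by some SSD implementations is omitted):

```python
import numpy as np

# Decode one candidate box into pixel coordinates, following the
# formulas above. (x, y, w, h) are the network's fine-tuning outputs.
def decode_box(x, y, w, h, i, j, fw, fh, dw, dh, W, H):
    bx = W * ((i + 0.5) / fw + x * dw)   # center x in pixels
    by = H * ((j + 0.5) / fh + y * dh)   # center y in pixels
    bw = W * dw * np.exp(w)              # width in pixels
    bh = H * dh * np.exp(h)              # height in pixels
    return bx, by, bw, bh
```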

The second post-processing step uses NMS (non-maximum suppression) to filter the candidate boxes: when the IoU of two candidate boxes exceeds a threshold, the box with the lower confidence conf is discarded.
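
A plain implementation of this filtering (a sketch in Python; the 0.45 threshold is a common default, not a value from the original):

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, confs, threshold=0.45):
    order = np.argsort(confs)[::-1]   # highest confidence first
    keep = []
    for idx in order:
        # Keep a box only if it overlaps no already-kept box too much.
        if all(iou(boxes[idx], boxes[k]) <= threshold for k in keep):
            keep.append(idx)
    return keep
```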

Network training

Network training is divided into two parts:

  • Label creation: a generic detection label is the position information of the objects; for training, these labels must be mapped onto the default boxes
  • Cost function: the starting point of backpropagation, defining the training task

Label creation

default box generation

At each grid point of a feature map, the default boxes have a fixed area, and the aspect ratio has several optional values, as shown below:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}, \qquad a_r \in \left\{2, 3, \tfrac{1}{2}, \tfrac{1}{3}\right\}$$

where $s_k$ is the normalized size parameter of the k-th feature map (the ratio of its actual size to the image size):

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad s_{min} = 0.2, \quad s_{max} = 0.9$$

That is, when k = 1 (the largest feature map) the size parameter is 0.2 times the image size, and when k = m (the smallest feature map) it is 0.9 times the image size. $w_k^a, h_k^a$ are the default normalized width and height of the default boxes with different aspect ratios on the k-th feature map. In addition to the four default boxes above, each grid point of each feature map also has two boxes with aspect ratio 1, whose size factors are:

$$s_k \quad \text{and} \quad s_k' = \sqrt{s_k s_{k+1}}$$

In summary, each grid point of each feature map corresponds to 6 default boxes.
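
The generation of the six (width, height) pairs per grid point follows directly from these formulas (a sketch; for k = m the same formula is extended one step to obtain $s_{k+1}$):

```python
import math

def default_sizes(k, m, s_min=0.2, s_max=0.9):
    # Size parameters of the k-th and (k+1)-th feature maps.
    s_k  = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    s_k1 = s_min + (s_max - s_min) * k / (m - 1)
    # Four boxes with aspect ratios 2, 3, 1/2, 1/3.
    boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a))
             for a in (2.0, 3.0, 0.5, 1.0 / 3.0)]
    # Two extra boxes with aspect ratio 1 and scales s_k, sqrt(s_k * s_{k+1}).
    boxes.append((s_k, s_k))
    boxes.append((math.sqrt(s_k * s_k1), math.sqrt(s_k * s_k1)))
    return boxes   # six normalized (w, h) pairs
```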

label matching

When matching labels to a default box, traverse the object labels of the input image. If the IoU of an object and the default box exceeds a certain threshold, that default box is assigned to detect the object, and its label is created as follows:

  • Location information: applying the formulas shown in post-processing above in reverse yields the location label
  • Category information: set the confidence of the object's category to 1, and all others to 0

Traversing all default boxes in this way produces the label for one piece of input data.
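
A sketch of the category-matching loop (our own code; it assumes the `iou()` helper from the NMS sketch above, and boxes in the same corner-coordinate format; the location targets, obtained by inverting the decoding formulas, are omitted for brevity):

```python
def match_labels(default_boxes, gt_boxes, gt_classes, threshold=0.5):
    labels = []
    for d in default_boxes:
        labels.append(0)   # background (category 0) by default
        for g, cls in zip(gt_boxes, gt_classes):
            if iou(d, g) > threshold:
                labels[-1] = cls   # this default box detects the object
                break
    return labels
```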

Cost function

The cost function consists of two parts, for classification accuracy and localization accuracy respectively:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where x is the matching information: $x_{ij}^p \in \{0, 1\}$, with $x_{ij}^p = 1$ when the i-th default box is matched to the j-th object, which belongs to category p, and 0 otherwise; N is the number of matched (positive) default boxes. The first part is the classification accuracy, using the softmax loss function:

$$L_{conf}(x, c) = -\sum_{i \in Pos} x_{ij}^p \log \hat{c}_i^p - \sum_{i \in Neg} \log \hat{c}_i^0, \qquad \hat{c}_i^p = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)}$$

Pos refers to the default boxes that are not background in the label (p > 0), and Neg to the rest. c is the confidence vector in the network output, and $\hat{c}_i^p$ is the normalized confidence of category p in the i-th default box output by the SSD.

The second part is the localization accuracy, using the smooth L1 function as the cost:

$$L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^k \, \text{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right), \qquad \text{smooth}_{L1}(t) = \begin{cases} 0.5t^2 & |t| < 1 \\ |t| - 0.5 & \text{otherwise} \end{cases}$$
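
Put together, the two parts can be computed as follows (a PyTorch sketch with our own helper name; for simplicity it sums the classification loss over all boxes, deferring hard negative mining to the next section):

```python
import torch
import torch.nn.functional as F

def ssd_loss(conf, loc, target_cls, target_loc, alpha=1.0):
    # conf: (num_boxes, class_num); loc: (num_boxes, 4)
    # target_cls: (num_boxes,) with 0 = background; target_loc: (num_boxes, 4)
    pos = target_cls > 0
    n = pos.sum().clamp(min=1).float()         # N, the number of positives
    l_conf = F.cross_entropy(conf, target_cls, reduction="sum")   # softmax loss
    l_loc = F.smooth_l1_loss(loc[pos], target_loc[pos], reduction="sum")
    return (l_conf + alpha * l_loc) / n
```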

Other training details

Positive and negative examples

A ratio of positive : negative = 1:3 is maintained. Since positive examples are generally far fewer than negative ones, all positive examples are retained, and negative examples are selected at three times the number of positives. The selection criterion is confidence: the negative examples with the highest confidence loss are chosen.
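
A sketch of this hard negative mining in PyTorch (our own helper; `conf_loss` is assumed to be the per-box classification loss computed without reduction):

```python
import torch

def mine_negatives(conf_loss, target_cls, ratio=3):
    # conf_loss: (num_boxes,) per-box classification loss
    pos = target_cls > 0
    num_neg = ratio * int(pos.sum())          # negatives = 3x positives
    neg_loss = conf_loss.clone()
    neg_loss[pos] = 0.0                       # exclude positives from ranking
    _, idx = neg_loss.sort(descending=True)   # hardest negatives first
    neg = torch.zeros_like(pos)
    neg[idx[:num_neg]] = True
    return pos | neg                          # mask of boxes entering the loss
```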

Image preprocessing

One of the following operations is randomly applied to the input picture:

  • Use the original image
  • Crop a patch whose IoU with an object is at least 0.3, 0.5, 0.7, or 0.9
  • Crop a random patch

After this random selection, the processed image is randomly flipped.
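
A minimal sketch of this preprocessing (PIL, our own code; the IoU-constrained option needs the object boxes and resampling, so it is only indicated by a comment):

```python
import random
from PIL import Image

def augment(img: Image.Image) -> Image.Image:
    op = random.choice(["original", "iou_crop", "random_crop"])
    if op != "original":
        # For "iou_crop", a real implementation resamples patches until the
        # IoU with some object exceeds random.choice([0.3, 0.5, 0.7, 0.9]).
        w, h = img.size
        cw, ch = random.randint(w // 2, w), random.randint(h // 2, h)
        x0, y0 = random.randint(0, w - cw), random.randint(0, h - ch)
        img = img.crop((x0, y0, x0 + cw, y0 + ch))
    if random.random() < 0.5:                 # followed by a random flip
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    return img
```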

Reference: SSD target detection: system structure and network training, Tencent Cloud community, https://cloud.tencent.com/developer/article/1368761