The model of the perceptron is shown in the figure below:

linear_classifier_structure.png

The formula is as follows: $$ f(x) = sign(w/cdot x + b)/sign(x) =/begin{cases} +1 & x/geq 0/-1 & x <0/end{ cases} $$ For this classifier, its hypothesis space is all linear classifiers in the feature space, which can be understood as all hyperplanes in the feature space from a geometric point of view. Then, as long as the sample is linearly separable in the feature space (it can be perfectly divided by a hyperplane), the hyperplane must be in the hypothesis space by the perceptron, so the perceptron can be used to distinguish all linear Dividable sample.

The goal of the perceptron is to find the hyperplane that can perfectly divide the linearly separable samples. In order to achieve this goal, we need to define a cost function. The cost function means that the function describes the performance of the model, which usually needs to meet the following conditions:

- Condition 1: The better the model performance, the smaller the cost function, the worse the model performance, the larger the cost function
- Condition 2: Continuous and leadable

The first condition is that the cost function can describe the performance of the model, and the second condition is that the model can be optimized using gradient descent methods.

For the perceptron, the selected cost function is: $$ L(w,b) =-\sum\limits_{x_i/in M}y_i(w/cdot x_i + b) $$ where M is the set of samples that are misclassified. For this cost function, it is obvious that it is continuous and differentiable, and when the classification is wrong, $y_i$ and $w/cdot x_i + b$ have different signs, and the cost is positive. When the model performance is good, there are fewer samples in M and the cost is small; when the model performance is poor, there are many samples in M and the cost is higher.

The cost function can also be explained from a geometric point of view. The distance from any point in space $x_0$ to the hyperplane is: $$/cfrac{1}{||w||}/cdot |w/cdot x_0 + b| $ $ From this, the misclassified samples $y_i$ and $w/cdot x_i + b$ have different signs, which results in the following: $$/cfrac{1}{||w||}/cdot |w/cdot x_i + b| =/cfrac{1}{||w||}/cdot (w/cdot x_i + b)/cdot y_i $$ Take M as the set of misclassified samples, and the distances from all misclassified samples to the hyperplane are as follows: $$

- \cfrac{1}{||w||}/sum\limits_{x_i/in M} y_i(w/cdot x_i + b) $$ does not consider the constant $\cfrac{1}{||w||}$ , You can get the perceptron cost function $L(w,b) =-\sum\limits_{x_i/in M}y_i(w/cdot x_i + b)$

The perceptron algorithm is driven by errors. From the above cost function, the perceptron learning algorithm becomes: $$ argmin_{w,b}(-\sum\limits_{x_i/in M}y_i/cdot (w/cdot x_i + b)) $$ In order to make the cost function fall fastest, optimize w and b in the direction of the negative gradient of the cost function. Take the gradient of w and b for the cost function: $$/nabla_wL(w,b) =-\sum\limits_{x_i/in M}y_i/cdot x_i/\nabla_bL(w,b) =-\sum\limits_{ x_i/in M}y_i $$ The update method can be obtained from this: $$ w^{n} = w^{n-1}-\eta/cdot/nabla_wL(w{n-1},b{n-1 })/b^{n} = b^{n-1}-\eta/cdot/nabla_bL(w{n-1},b{n-1}) $$ where, $\eta$ is the learning rate, Represents the step length of each update, the greater the learning rate, the more obvious the update. Therefore, each time a batch of misclassified points is selected, the above optimization is performed, and a perceptron model that can be correctly classified can be learned through multiple cycles.

Bring the gradient expression into the update formula: $$ w^{n} = w^{n-1} +\eta/cdot/sum\limits_{x_i/in M}y_i/cdot x_i/b^{n} = b^{n-1} +\eta/cdot/sum\limits_{x_i/in M}y_i $$ If the initial values of w and b are both 0 and $\eta = 1$, then w can be considered as an error sample The sum of $y_i/cdot x_i$, b is the sum of error sample labels, from which the following formula can be obtained: $$ w =/sum\limits_{x_i/in M} a_i/cdot y_i/cdot x_i/b =/sum\limits_{x_i/in M} a_i/cdot y_i $$ where $a_i$ is the number of times the sample was misclassified. It can be found that the more the number of misclassifications, the larger the proportion of the sample in the parameter. Then select a batch of data input each time, count the incorrectly classified samples into the parameters according to the update formula, and repeat it several times until there are no wrong samples.

The dual optimization algorithm of the perceptron already has some shadows of the idea of support vector machines-only a few "key samples" determine the position of the classifier hyperplane, other samples are not important, there are w and b formulas obtained by the dual algorithm: $$ w =/sum\limits_{x_i/in M} a_i/cdot y_i/cdot x_i/b =/sum\limits_{x_i/in M} a_i/cdot y_i $$

You can think of $a_i$ as the weight of the sample. The greater the number of classification errors, the greater the weight $a_i$, that is, the more important the sample. It is natural to think that samples that are closer to the final hyperplane are more likely to be classified incorrectly. The weight of such samples is higher, and the more important these samples are.

The perceptron is the basis of the neural network, and the perceptron is also called a single-layer neural network. A major drawback of the perceptron is that it cannot solve the linear inseparability problem. To solve this problem, it is necessary to map the original linearly inseparable sample to another feature space where the sample is linearly separable. There are two main mapping methods:

- Manually specify the mapping method: manually specify the mapping method, represented by the kernel function (core method)
- Automatically find the mapping method: use the machine learning method to automatically obtain the mapping method, which is represented by a neural network

The neural network can be decomposed as follows:

linear_classifier_nn.png

When the input is a linearly inseparable sample, it is mapped through a complex hidden layer (fully connected layer, convolutional layer or capsule layer), and finally the data is mapped to a linearly separable feature space before the output layer, and then the perception Machine performs linear classification. The mapping method is learned by the neural network. All neural networks whose output layer is the perceptron layer can be understood under this framework.

In addition, the perceptron also laid the basic theory of neural networks. For example, the basic learning framework of neural networks is also gradient descent: backpropagation is used to calculate the gradient, and the optimization algorithm iteratively obtains new parameters.

"Statistical Learning Methods" Li Hang

Reference: https://cloud.tencent.com/developer/article/1110945 Deep Understanding of Perceptron-Cloud + Community-Tencent Cloud