## The principle of gradient descent

Gradient descent is a commonly used first-order optimization method, and it is one of the simplest and most classic methods for solving unconstrained optimization problems.

The most typical example of gradient descent is to walk down from a mountain. Every time you look for the steepest direction of your current position and walk down in small steps, you will eventually reach the bottom of the mountain (not considering the situation of valleys for the time being).

1. explain what is a gradient? This is to talk about differentiation first. I believe everyone is familiar with differentiation, and you will be more familiar with a few examples.

Let's first look at the differential of a single variable:

Look at the differentiation of multiple variables:

Supplement: The difference between derivative and differential Derivative is the slope of the function at a certain point, which is the ratio of Δy and Δx; and differential refers to the increment of the ordinate obtained after the tangent of the function at a certain point takes the increment Δx on the abscissa , Generally expressed as dy.

The gradient is the vector composed of the differential result, let

Have

Then, the derivative of the function f(x,y,z) at (1,2,3) is

Therefore, the gradient of the function f(x,y,z) at (1,2,3) is (6,11,6).

The gradient is a vector. For a one-variable function, the gradient is the derivative at that point, representing the slope of the tangent. For a multivariate function, the direction of the gradient is the direction in which the function rises fastest at that point.

The gradient descent method is to find the opposite direction of the gradient every time, so that it can reach the local lowest point.

Then why can the local lowest point be reached in the opposite direction of the gradient? This problem is easy to see intuitively, but for the sake of strictly prohibiting it, we still give a mathematical proof.

For the continuous differentiable function f(x), starting from a random point, if you want to find the local lowest point, you can construct a sequence

, Can satisfy

Then we can continue to execute the process to converge to a local minimum, please refer to the figure below.

Then the question is how to find the next point

And guarantee

What? Let's take a univariate function as an example. For unary functions, x will have two directions: either the positive direction (

), either in the negative direction (

), how to choose the direction of each step, you need to use the famous Taylor formula, first look at the following Taylor expansion:

among them

Represents the derivative of f(x) at x.

If you want

, You need to guarantee

,make

Step size

Is a small positive number, so

Therefore, there is

We follow every step

Update x, this is the principle of gradient descent.

Here again

To explain, α is called the learning rate or step size in the gradient descent algorithm, which means that we can control the distance of each step through α. It is necessary to ensure that the steps are not too small, and that the sun will go down before reaching the bottom of the mountain; and also that the steps cannot be too large, which may cause the lowest point to be missed.

Adding a minus sign in front of the gradient means moving in the opposite direction of the gradient, because the gradient is the fastest rising direction, so the direction is the fastest falling direction.

## Examples of gradient descent

Gradient descent of a unary function

Let the unary function be

The derivative of the function is

Let the starting point be

, Step size

, According to the formula of gradient descent

, After 4 iterations:

Gradient descent of multivariate function

Let the binary function be

The gradient of the function is

Let the starting point be (2,3), and the step length

,According to the formula of gradient descent, after many iterations, there is

# loss function

The loss function is also called the cost function. It is a function used to measure the difference between the value h(θ) predicted by the model and the true value y. If there are multiple samples, you can take the value of all the cost functions Find the mean value and record it as J(θ). The cost function has the following properties:

1. For each algorithm, the cost function is not unique;
2. The cost function is a function of the parameter θ;
3. The total cost function J(θ) can be used to evaluate the quality of the model. The smaller the cost function, the better the model and parameters are in line with the training samples (x, y);
4. J(θ) is a scalar.

The most common cost function is the mean square error function, namely

among them,

• m is the number of training samples

Represents the estimated value, the expression is as follows

• y is the value in the original training sample

All we need to do is to find the value of θ such that J(θ) is the smallest. The graph of the cost function is very similar to the graph we drew above, as shown in the figure below.

Seeing this graph, I believe you will also know that we can use the gradient descent algorithm to find the value of θ that can minimize the cost function.

First find the gradient of the cost function

There are two variables here

with

, For the convenience of matrix representation, we add one dimension to x. The value of this dimension is all 1, and it will be multiplied to

on. Then the matrix form of the cost function is:

Looking at the formula in this way, many students may not quite understand it. We can express the specific content of each matrix so that it is easy for everyone to understand.

matrix

for:

The matrix X is:

The matrix y is:

It is easy to understand the above formula after writing it out.

Let's give an example of using the gradient descent algorithm to achieve linear regression. There is a set of data as shown in the figure below, we try to find the linear regression model of these points.

First generate matrix X and matrix y

# generate matrix X
X0 = np.ones((m, 1))
X1 = np.arange(1, m+1).reshape(m, 1)
X = np.hstack((X0, X1))

# matrix y
y = np.array([2,3,3,5,8,10,10,13,15,15,16,19,19,20,22,22,25,28]).reshape(m,1 )


Define the gradient function according to the above formula

def gradient_function(theta, X, y):
diff = np.dot(X, theta)-y
return (1./m) * np.dot(np.transpose(X), diff)


Next is the most important gradient descent algorithm, we take

with

The initial values ​​of are all 1, and then the gradient descent process is performed.

def gradient_descent(X, y, alpha):
theta = np.array([1, 1]).reshape(2, 1)
while not np.all(np.absolute(gradient) <= 1e-5):
theta = theta-alpha * gradient
return theta


Through this process, the final result

,

, The linear regression curve is as follows

# Appendix source code

## Gradient descent program of matlab unary function

clc;
close all;
clear all;
%%
delta = 1/100000;
x = -1.1:delta:1.1;
y = x.^2;
dot = [1, 0.2, 0.04, 0.008];
figure;plot(x,y);
axis([-1.2, 1.2, -0.2, 1.3]);
grid on
hold on
plot(dot, dot.^2,'r');
for i=1:length(dot)
text(dot(i),dot(i)^2,['\theta_{',num2str(i),'}']);
end
title('The gradient descent process of a univariate function');


## Gradient descent program of python unary function

import numpy as np
import matplotlib.pyplot as plt

delta = 1/100000
x = np.arange(-1.1, 1.1, delta)
y = x ** 2
dot = np.array([1, 0.2, 0.04, 0.008])
plt.figure(figsize=(7,5))
plt.plot(x,y)
plt.grid(True)
plt.xlim(-1.2, 1.2)
plt.ylim(-0.2, 1.3)
plt.plot(dot, dot**2,'r')
for i in range(len(dot)):
plt.text(dot[i],dot[i]**2,r'$\theta_%d$'% i)
plt.title('The gradient descent process of a unary function')
plt.show()


## Gradient descent program of julia unary function

using PyPlot
delta = 1/100000
x = -1.1:delta:1.1
y = x.^2
dot = [1, 0.2, 0.04, 0.008]
plot(x, y)
grid(true)
axis("tight")
plot(dot, dot.^2, color="r")
for i=1:length(dot)
text(dot[i], dot[i]^2, "\$\\theta_$i\$") end title("Single variable function gradient descent")  ## Matlab binary function gradient descent program pecision = 1/100; [x,y] = meshgrid(-3.1:pecision:3.1); z = x.^2 + y.^2; figure; mesh(x,y,z); dot = [[2,3];[1.6,2.4];[1.28,1.92];[5.09e-10, 7.64e-10]]; hold on scatter3(dot(:,1),dot(:,2),dot(:,1).^2+dot(:,2).^2,'r*'); for i=1:4 text(dot(i,1)+0.4,dot(i,2),dot(i,1).^2+0.2+dot(i,2).^2+0.2,['\theta_{',num2str (i),')']); end title('The gradient descent process of a binary function')  ## Python binary function gradient descent program import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D x = np.linspace(-3.1,3.1,300) y = np.linspace(-3.1,3.1,300) x,y = np.meshgrid(x, y) z = x**2 + y**2 dot = np.array([[2,3],[1.6,2.4],[1.28,1.92],[5.09e-10, 7.64e-10]]) fig = plt.figure(figsize = (10,6)) ax = fig.gca(projection = '3d') cm = plt.cm.get_cmap('YlGnBu') surf = ax.plot_surface(x, y, z, cmap=cm) fig.colorbar(surf,shrink=0.5, aspect=5) ax.scatter3D(dot[:,0], dot[:,1], dot[:,0]**2 + dot[:,1]**2, marker='H',c='r') for i in range(len(dot)-1): ax.text(dot[i,0]+0.4, dot[i,1], dot[i,0]**2 + dot[i,1]**2, r'$\Theta_%d\$'% i)
ax.text(dot[3,0]+0.4, dot[3,1]+0.4, dot[3,0]**2 + dot[3,1]**2-0.4, r'min')
plt.show()


## Gradient descent program of julia binary function

The text of this picture is not marked for life and death, friends who want to know can tell us. One more thing, although I have published a Julia tutorial before, it also contains 4 drawing tools (Plots, GR, Gadfly & PyPlot), but I haven’t drew a 3-dimensional graph. It’s really exhausting to draw this graph today. Turns around, there is basically no 3D drawing program on Julia's official website that can be used directly. For the specific drawing process and problems encountered in debugging, I will also organize an article to Zhihu and the official account. You can take a look.

using Plots
Plots.plotlyjs()
n = 50
x = range(-3, stop=3, length=n)
y = x
z = zeros(n,n)
for i in 1:n, k in 1:n
z[i,k] = x[i]^2 + y[k]^2
end

surface(x, y, z)
dot = [[2 3]; [1.6 2.4]; [1.28 1.92]; [5.09e-10 7.64e-10]]
scatter!(dot[:,1], dot[:,2], dot[:,1].^2 .+ dot[:,2].^2)


## Linear regression of matlab gradient descent

m = 18;
X0 = ones(m,1);
X1 = (1:m)';
X = [X0, X1];
y = [2,3,3,5,8,10,10,13,15,15,16,19,19,20,22,22,25,28]';
alpha = 0.01;
theta = gradient_descent(X, y, alpha, m);

function [grad_res] = gradient_function(theta, X, y, m)
diff = X * theta-y;
grad_res = X'* diff/m;
end

function [theta_res] = gradient_descent(X, y, alpha, m)
theta = [1;1];
theta = theta-alpha * gradient;
end
theta_res = theta;
end


## Python gradient descent linear regression

import numpy as np
import matplotlib.pyplot as plt

# y = np.array([2,3,3,5,8,10,10,13,15,15,16,19,19,20,22,22,25,28])
# x = np.arange(1,len(y)+1)
# plt.figure()
# plt.scatter(x,y)
# plt.grid(True)
# plt.show()

# sample length
m = 18

# generate matrix X
X0 = np.ones((m, 1))
X1 = np.arange(1, m+1).reshape(m, 1)
X = np.hstack((X0, X1))

# matrix y
y = np.array([2,3,3,5,8,10,10,13,15,15,16,19,19,20,22,22,25,28]).reshape(m,1 )

# alpha
alpha = 0.01

def cost_function(theta, X, y):
diff = np.dot(X, theta)-y
return (1./2*m) * np.dot(np.transpose(diff), diff)

def gradient_function(theta, X, y):
diff = np.dot(X, theta)-y
return (1./m) * np.dot(np.transpose(X), diff)

def gradient_descent(X, y, alpha):
theta = np.array([1, 1]).reshape(2, 1)
while not np.all(np.absolute(gradient) <= 1e-5):
theta = theta-alpha * gradient
return theta

[theta0, theta1] = gradient_descent(X, y, alpha)
plt.figure()
plt.scatter(X1,y)
plt.plot(X1, theta0 + theta1*X1, color='r')
plt.title('Linear regression fitting based on gradient descent algorithm')
plt.grid(True)
plt.show()


## Julia gradient descent linear regression

m = 18
X0 = ones(m,1)
X1 = Array(1:m)
X = [X0 X1];
y = [2,3,3,5,8,10,10,13,15,15,16,19,19,20,22,22,25,28];
alpha = 0.01;
theta = gradient_descent(X, y, alpha, m)

function gradient_function(theta, X, y, m)
diff = X * theta .- y;
grad_res = X'* diff/m;
end

function gradient_descent(X, y, alpha, m)
theta = [1,1]