# Theoretical study

## Capsule structure

The capsule can be seen as a vectorized neuron. For a single neuron, the data flowing in the current deep network is all scalar. For example, for a certain neuron of a multilayer perceptron, its input is a number of scalars, and its output is a scalar (not considering batch processing); for capsules, each neuron input is a number of vectors, and the output is a vector ( Do not consider batch processing). Forward propagation is as follows:

capsule_structure.png

Among them, $I_i$ is the i-th input (vector), $W_i$ is the i-th weight (matrix), and $U_i$ is the intermediate variable (vector), which is obtained by the cross product of the input and the weight. $c_i$ is the routing weight (scalar). It should be noted that the scalar is determined during the forward propagation process (using the dynamic routing algorithm), and is not a parameter optimized through back propagation. Squash is an activation function. The formula for forward propagation is as follows:

$$U_i = W_i^T/times I_i$$

$$S =/sum/limits_{i = 0}^n c_i/cdot U_i$$

$$Result = Squash(S) =/cfrac{||S||2}{1+||S||2}/cdot/cfrac{S}{||S||}$$

It can be seen from the above that the data type flowing in the capsule structure is a vector, and its activation function Squash inputs a vector and outputs a vector.

## Dynamic routing algorithm

The dynamic routing algorithm is suitable for determining the algorithm of $c_i$ in the capsule structure, and its pseudo code is as follows:

dynamic_route.jpg

1. its input is $U_{j|i}$ as the intermediate variable of this layer, where i is the number of capsules in this layer, j is the number of capsules in the next layer, and finally the output of the capsule is $v_j$. The steps are described as follows :

1. Initialization: Initialize a temporary variable b, which is a matrix of all 0s in $i/times j$
2. Get the connection weight c of this step: $c_i = softmax(b_i)$, pass the temporary variable b through softmax to ensure that the sum of the components of $c_i$ is 1.
3. Get the weighted sum result S of this step: $s_j =/sum_i c_{ij}u_{j|i}$, calculate the weighted sum by connecting the weights in this step
4. Non-linear activation: $v_j = squash(s_j)$, after the non-linear activation function, get the capsule output of this step
5. Iteration temporary variable: $b_{ij} = b_{ij} + u_{i|j}/cdot v_{j}$, so the output of this step is close to the direction of the intermediate variable, increase the temporary variable b, that is, increase the weight; If the output of this step is opposite to the intermediate variable, reduce the temporary variable b, that is, reduce the weight.
6. If it has been iterated to the specified number of times, output $v_j$, otherwise skip to step 2

At the same time, for the number of iterations j, the paper indicates that too many iterations will lead to overfitting. In practice, it is recommended to use 3 iterations.

## Output and cost function

The output of the output layer capsule is a vector, and the length of the vector is the probability. In other words, the result of forward propagation is the result represented by the output capsule that outputs the longest vector. When backpropagating, it is also necessary to consider that the output of the network is a vector instead of a scalar. Therefore, the following cost function is included in the original paper (the cost function of each output, the cost function is the sum of all output cost functions $L =/sum\limits_ {c=0}^n L_c$)

$$L_c = T_c max(0,m^+-||V_c||)^2 +/lambda (1-T_c)max(0,||v_c||-m^-) ^ 2$$

Among them, $T_c$ is a scalar. When the classification result is c, $T_c = 1$, otherwise $T_c = 0$; $\lambda$ is a fixed value (usually 0.5) to ensure numerical stability; $m+$ And $m-$ are also fixed values:

• For the output capsule of $T_c = 1$, when the output vector is greater than $m^+$, the cost function is 0, otherwise it is not 0
• For the output capsule with $T_c = 0$, when the output vector is less than $m^-$, the cost function is 0, otherwise it is not 0

## Overall structure

The original paper gave an example of recognizing the MNIST handwritten digit data set. The network architecture is shown in the following figure:

capsnet_mnist.jpg

• The first layer is an ordinary convolution layer, using 9*9 convolution, the number of output channels is 256, and the output data size is 20*20*256
• The second layer is a convolutional layer, which is composed of 32 parallel convolutional layers, and each convolutional layer corresponds to a vector in the vector data. Each convolutional layer is 9*9*256*8 (input channel is 256, output channel is 8). Therefore, the output is 6*6*32*8, that is, the window size is 6*6, the output channel is 32, and each data is a vector of 8 components.
• The third layer is the capsule layer, which behaves like a fully connected layer. The input is 6*6*32=1152 8-component input vectors, and the output is 10 16-component vectors, corresponding to 1152*10 weights, each weight is 8*16 matrix, and the final output is 10 16-component vector
• Finally, 10 16-component vectors are output, and the final classification result is the output with the largest vector length.

# Code reading (PyTorch)

This code reading does not care about the specific implementation method, mainly reading the CapsNet implementation ideas

## Front capsule layer (convolutional layer)

class PrimaryCaps(nn.Module):
def __init__(self, num_capsules=8, in_channels=256, out_channels=32, kernel_size=9):
super(PrimaryCaps, self).__init__()

self.capsules = nn.ModuleList([
nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=2, padding=0)
for _ in range(num_capsules)])

def forward(self, x):
u = [capsule(x) for capsule in self.capsules]
u = torch.stack(u, dim=1)
u = u.view(x.size(0), 32 * 6 * 6, -1)
return self.squash(u)

def squash(self, input_tensor):
squared_norm = (input_tensor ** 2).sum(-1, keepdim=True)
output_tensor = squared_norm * input_tensor/((1. + squared_norm) * torch.sqrt(squared_norm))
return output_tensor

Focus on the forward communication part:

def forward(self, x):
u = [capsule(x) for capsule in self.capsules]
u = torch.stack(u, dim=1)
u = u.view(x.size(0), 32 * 6 * 6, -1)
return self.squash(u)

self.capsulesIt is num_capsulesa [in_channels,out_channels,kernel_size,kernel_size]convolutional layer, corresponding to the operation of the second convolutional layer described above. Note that the output of this part is directly transformed into [batch size,1152,8]the form, and the output vector is squeezed by the squash activation function

## Capsule layer

class DigitCaps(nn.Module):
def __init__(self, num_capsules=10, num_routes=32 * 6 * 6, in_channels=8, out_channels=16):
super(DigitCaps, self).__init__()

self.in_channels = in_channels
self.num_routes = num_routes
self.num_capsules = num_capsules

self.W = nn.Parameter(torch.randn(1, num_routes, num_capsules, out_channels, in_channels))

def forward(self, x):
batch_size = x.size(0)
x = torch.stack([x] * self.num_capsules, dim=2).unsqueeze(4)

W = torch.cat([self.W] * batch_size, dim=0)
u_hat = torch.matmul(W, x)

b_ij = Variable(torch.zeros(1, self.num_routes, self.num_capsules, 1))
if USE_CUDA:
b_ij = b_ij.cuda()

num_iterations = 3
for iteration in range(num_iterations):
c_ij = F.softmax(b_ij)
c_ij = torch.cat([c_ij] * batch_size, dim=0).unsqueeze(4)

s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)
v_j = self.squash(s_j)

if iteration <num_iterations-1:
a_ij = torch.matmul(u_hat.transpose(3, 4), torch.cat([v_j] * self.num_routes, dim=1))
b_ij = b_ij + a_ij.squeeze(4).mean(dim=0, keepdim=True)

return v_j.squeeze(1)

def squash(self, input_tensor):
squared_norm = (input_tensor ** 2).sum(-1, keepdim=True)
output_tensor = squared_norm * input_tensor/((1. + squared_norm) * torch.sqrt(squared_norm))
return output_tensor

### Get intermediate vector

batch_size = x.size(0)
x = torch.stack([x] * self.num_capsules, dim=2).unsqueeze(4)

W = torch.cat([self.W] * batch_size, dim=0)
u_hat = torch.matmul(W, x)

This part calculates the intermediate vector $U_i$

### Dynamic routing

for iteration in range(num_iterations):
c_ij = F.softmax(b_ij)
c_ij = torch.cat([c_ij] * batch_size, dim=0).unsqueeze(4)

s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)
v_j = self.squash(s_j)

if iteration <num_iterations-1:
a_ij = torch.matmul(u_hat.transpose(3, 4), torch.cat([v_j] * self.num_routes, dim=1))
b_ij = b_ij + a_ij.squeeze(4).mean(dim=0, keepdim=True)

In the structure of dynamic routing:

• The first line calculates the result of the softmax function, using the temporary variable b
• Line 5 calculates the weighted sum
• Line 6 calculates the output of the current iteration number
• Lines 9 and 10 update the value of the temporary vector

### Cost function

def margin_loss(self, x, labels, size_average=True):
batch_size = x.size(0)
v_c = torch.sqrt((x**2).sum(dim=2, keepdim=True))
left = F.relu(0.9-v_c).view(batch_size, -1)
right = F.relu(v_c-0.1).view(batch_size, -1)
loss = labels * left + 0.5 * (1.0-labels) * right
loss = loss.sum(dim=1).mean()
return loss

This function is a cost function, which implements the cost function in two cases ($T_c = 0, T_c = 1$).

# Reference

The code comes from higgsfield's github

For text information, please refer to Max Pechyonkin 's blog translated by weakish :

Also refer to:

Reference: https://cloud.tencent.com/developer/article/1110783 CapsNet study notes theory study code reading (PyTorch) reference materials-Cloud + Community-Tencent Cloud