A capsule can be seen as a vectorized neuron. In conventional deep networks, the data flowing through a single neuron is scalar: for a neuron in a multilayer perceptron, the inputs are several scalars and the output is one scalar (ignoring batching). For a capsule, the inputs are several vectors and the output is a vector (again ignoring batching). Forward propagation is as follows:

![capsule structure](capsule_structure.png)

Here $I_i$ is the i-th input (a vector), $W_i$ is the i-th weight (a matrix), and $U_i$ is an intermediate variable (a vector) obtained by multiplying the input by the weight matrix. $c_i$ is the routing weight (a scalar). Note that $c_i$ is determined during forward propagation (by the dynamic routing algorithm); it is not a parameter optimized through backpropagation. Squash is the activation function. The forward propagation formulas are:

$$U_i = W_i^T \times I_i$$

$$S = \sum\limits_{i = 0}^{n} c_i \cdot U_i$$

$$Result = Squash(S) = \cfrac{||S||^2}{1 + ||S||^2} \cdot \cfrac{S}{||S||}$$

As can be seen above, the data flowing through the capsule structure is vector-valued, and the Squash activation function takes a vector as input and outputs a vector.
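The Squash formula above can be sketched in NumPy (a minimal illustration of the formula itself; the `eps` guard against division by zero is my addition):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash non-linearity: rescales s to length ||s||^2 / (1 + ||s||^2)
    while preserving its direction."""
    squared_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / np.sqrt(squared_norm + eps)

v = squash(np.array([3.0, 4.0]))  # ||s|| = 5, so ||v|| = 25/26, direction kept
```

Short vectors are shrunk toward zero and long vectors saturate toward unit length, which is what allows vector length to be read as a probability later on.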

The dynamic routing algorithm determines the $c_i$ in the capsule structure. Its pseudocode is as follows:

![dynamic routing pseudocode](dynamic_route.jpg)

Its input is the set of intermediate variables $U_{j|i}$ of this layer, where $i$ indexes the capsules in this layer and $j$ indexes the capsules in the next layer; the final output of each capsule is $v_j$. The steps are as follows:

- Initialization: initialize a temporary variable $b$, an all-zero matrix of size $i \times j$
- Get the connection weights for this step: $c_i = softmax(b_i)$; passing the temporary variable $b$ through softmax ensures that the components of $c_i$ sum to 1
- Get the weighted sum for this step: $s_j = \sum_i c_{ij} u_{j|i}$, weighting the intermediate variables by this step's connection weights
- Non-linear activation: $v_j = squash(s_j)$; applying the non-linear activation function gives the capsule output for this step
- Update the temporary variable: $b_{ij} = b_{ij} + u_{j|i} \cdot v_{j}$. If the output of this step is close in direction to the intermediate variable, the dot product increases the temporary variable $b$, i.e., increases the weight; if the output points in the opposite direction, it decreases $b$, i.e., decreases the weight
- If the specified number of iterations has been reached, output $v_j$; otherwise go back to step 2
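The routing steps above can be sketched in NumPy (shapes and names are illustrative; this follows the pseudocode, not any particular reference implementation):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def squash(s, eps=1e-8):
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return sq / (1.0 + sq) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: intermediate vectors U_{j|i}, shape (num_in, num_out, dim).
    Returns the output capsules v_j, shape (num_out, dim)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))               # temporary variable, all zeros
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                    # weights sum to 1 over j
        s = np.sum(c[..., None] * u_hat, axis=0)  # weighted sum s_j
        v = squash(s)                             # capsule outputs v_j
        b = b + np.sum(u_hat * v[None], axis=-1)  # agreement (dot product) update
    return v

v = dynamic_routing(np.random.randn(1152, 10, 16))
```

Note that $c$ is recomputed from $b$ each iteration; nothing here is learned by gradient descent.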

Regarding the number of routing iterations, the paper notes that too many iterations leads to overfitting; in practice, 3 iterations are recommended.

The output of each output-layer capsule is a vector, and the length of the vector represents a probability. In other words, the forward-propagation result is the class represented by the output capsule with the longest vector. Backpropagation must also account for the network output being a vector rather than a scalar, so the original paper uses the following cost function (one per output; the total cost is the sum over all outputs, $L = \sum\limits_{c=0}^{n} L_c$):

$$L_c = T_c \max(0, m^+ - ||v_c||)^2 + \lambda (1 - T_c) \max(0, ||v_c|| - m^-)^2$$

Here $T_c$ is a scalar: when the classification label is $c$, $T_c = 1$, otherwise $T_c = 0$. $\lambda$ is a fixed value (usually 0.5) that ensures numerical stability, and $m^+$ and $m^-$ are also fixed values:

- For an output capsule with $T_c = 1$: when the length of the output vector exceeds $m^+$, the cost is 0; otherwise it is non-zero
- For an output capsule with $T_c = 0$: when the length of the output vector is below $m^-$, the cost is 0; otherwise it is non-zero
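A minimal NumPy sketch of this margin loss (the parameter names are mine; the paper's values are $m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$):

```python
import numpy as np

def margin_loss(lengths, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """lengths: ||v_c|| for each class, shape (num_classes,)
    labels: one-hot T_c, shape (num_classes,)"""
    present = labels * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1 - labels) * np.maximum(0.0, lengths - m_neg) ** 2
    return np.sum(present + absent)

# Correct class long (0.95 > m^+), wrong classes short (0.05 < m^-): loss is 0
loss = margin_loss(np.array([0.95, 0.05, 0.05]), np.array([1.0, 0.0, 0.0]))
# loss == 0.0
```

Both cases contribute: a too-short correct capsule is penalized through the first term, and a too-long wrong capsule through the second.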

The original paper gives an example of recognizing the MNIST handwritten digit dataset. The network architecture is shown in the figure below:

![CapsNet architecture for MNIST](capsnet_mnist.jpg)

- The first layer is an ordinary convolutional layer using 9*9 kernels with 256 output channels; its output size is 20*20*256
- The second layer is a convolutional capsule layer composed of 32 parallel convolutions, each contributing a vector in the vector-valued output. Each convolution has a 9*9*256*8 kernel (256 input channels, 8 output channels). The output is therefore 6*6*32*8: a 6*6 spatial window, 32 channels, where each entry is an 8-component vector
- The third layer is the capsule layer, which behaves like a fully connected layer. The input is 6*6*32 = 1152 8-component vectors and the output is 10 16-component vectors, corresponding to 1152*10 weights, each an 8*16 matrix
- The final output is 10 16-component vectors, and the classification result is the class whose output vector is longest
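Reading off the prediction from the 10 output capsules can be illustrated as follows (the data here is made up purely to show the argmax-over-lengths step):

```python
import numpy as np

# The 10 output capsules are 16-component vectors; the prediction is the
# class whose capsule has the greatest length (illustrative data).
v = np.zeros((10, 16))
v[3, :] = 0.2                          # give class 3 the longest vector
lengths = np.linalg.norm(v, axis=-1)   # one length per class
predicted = int(np.argmax(lengths))    # -> 3
```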

The code reading below does not dwell on low-level implementation details; it mainly follows the ideas behind the CapsNet implementation.

```python
class PrimaryCaps(nn.Module):
    def __init__(self, num_capsules=8, in_channels=256, out_channels=32, kernel_size=9):
        super(PrimaryCaps, self).__init__()
        self.capsules = nn.ModuleList([
            nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                      kernel_size=kernel_size, stride=2, padding=0)
            for _ in range(num_capsules)])

    def forward(self, x):
        u = [capsule(x) for capsule in self.capsules]
        u = torch.stack(u, dim=1)
        u = u.view(x.size(0), 32 * 6 * 6, -1)
        return self.squash(u)

    def squash(self, input_tensor):
        squared_norm = (input_tensor ** 2).sum(-1, keepdim=True)
        output_tensor = squared_norm * input_tensor / ((1. + squared_norm) * torch.sqrt(squared_norm))
        return output_tensor
```

Focus on the forward propagation part:

```python
def forward(self, x):
    u = [capsule(x) for capsule in self.capsules]
    u = torch.stack(u, dim=1)
    u = u.view(x.size(0), 32 * 6 * 6, -1)
    return self.squash(u)
```

`self.capsules` is a list of `num_capsules` convolutional layers, each of shape `[in_channels, out_channels, kernel_size, kernel_size]`, corresponding to the second convolutional layer described above. Note that the output of this part is reshaped directly into the form `[batch_size, 1152, 8]`, and the output vectors are then squeezed by the squash activation function.
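The shape bookkeeping of that `stack` + `view` can be checked with a NumPy stand-in (random arrays replace the real conv outputs; only the shapes matter here):

```python
import numpy as np

# 8 parallel convolutions each yield [batch, 32, 6, 6]; stacking on a new
# axis 1 and reshaping gives [batch, 32*6*6, 8] = [batch, 1152, 8],
# mirroring torch.stack(u, dim=1) followed by u.view(batch, 1152, -1).
batch = 4
conv_outs = [np.random.randn(batch, 32, 6, 6) for _ in range(8)]
u = np.stack(conv_outs, axis=1)          # [batch, 8, 32, 6, 6]
u = u.reshape(batch, 32 * 6 * 6, -1)     # [batch, 1152, 8]
```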

```python
class DigitCaps(nn.Module):
    def __init__(self, num_capsules=10, num_routes=32 * 6 * 6, in_channels=8, out_channels=16):
        super(DigitCaps, self).__init__()
        self.in_channels = in_channels
        self.num_routes = num_routes
        self.num_capsules = num_capsules
        self.W = nn.Parameter(torch.randn(1, num_routes, num_capsules, out_channels, in_channels))

    def forward(self, x):
        batch_size = x.size(0)
        x = torch.stack([x] * self.num_capsules, dim=2).unsqueeze(4)
        W = torch.cat([self.W] * batch_size, dim=0)
        u_hat = torch.matmul(W, x)
        b_ij = Variable(torch.zeros(1, self.num_routes, self.num_capsules, 1))
        if USE_CUDA:
            b_ij = b_ij.cuda()
        num_iterations = 3
        for iteration in range(num_iterations):
            c_ij = F.softmax(b_ij)
            c_ij = torch.cat([c_ij] * batch_size, dim=0).unsqueeze(4)
            s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)
            v_j = self.squash(s_j)
            if iteration < num_iterations - 1:
                a_ij = torch.matmul(u_hat.transpose(3, 4), torch.cat([v_j] * self.num_routes, dim=1))
                b_ij = b_ij + a_ij.squeeze(4).mean(dim=0, keepdim=True)
        return v_j.squeeze(1)

    def squash(self, input_tensor):
        squared_norm = (input_tensor ** 2).sum(-1, keepdim=True)
        output_tensor = squared_norm * input_tensor / ((1. + squared_norm) * torch.sqrt(squared_norm))
        return output_tensor
```

```python
batch_size = x.size(0)
x = torch.stack([x] * self.num_capsules, dim=2).unsqueeze(4)
W = torch.cat([self.W] * batch_size, dim=0)
u_hat = torch.matmul(W, x)
```

This part computes the intermediate vectors $U_i$ (`u_hat`): each 8-component input vector is multiplied by the per-capsule weight matrix via `torch.matmul`.
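A NumPy shape sketch of this computation (sizes taken from the architecture above; `np.matmul` broadcasts over the leading dimensions the same way `torch.matmul` does):

```python
import numpy as np

# W: [batch, 1152, 10, 16, 8] weight matrices, one 8x16 matrix per
#    (input capsule, output capsule) pair, tiled over the batch.
# x: [batch, 1152, 10, 8, 1] input vectors, tiled over output capsules.
# matmul over the last two axes gives u_hat: [batch, 1152, 10, 16, 1].
batch = 2
W = np.random.randn(batch, 1152, 10, 16, 8)
x = np.random.randn(batch, 1152, 10, 8, 1)
u_hat = np.matmul(W, x)
```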

```python
for iteration in range(num_iterations):
    c_ij = F.softmax(b_ij)
    c_ij = torch.cat([c_ij] * batch_size, dim=0).unsqueeze(4)
    s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)
    v_j = self.squash(s_j)
    if iteration < num_iterations - 1:
        a_ij = torch.matmul(u_hat.transpose(3, 4), torch.cat([v_j] * self.num_routes, dim=1))
        b_ij = b_ij + a_ij.squeeze(4).mean(dim=0, keepdim=True)
```

In this dynamic routing loop:

- `c_ij = F.softmax(b_ij)` computes the softmax of the temporary variable b, giving the routing weights
- `s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)` computes the weighted sum
- `v_j = self.squash(s_j)` computes the output of the current iteration
- the `a_ij` and `b_ij` lines update the value of the temporary variable

```python
def margin_loss(self, x, labels, size_average=True):
    batch_size = x.size(0)
    v_c = torch.sqrt((x ** 2).sum(dim=2, keepdim=True))
    left = F.relu(0.9 - v_c).view(batch_size, -1)
    right = F.relu(v_c - 0.1).view(batch_size, -1)
    loss = labels * left + 0.5 * (1.0 - labels) * right
    loss = loss.sum(dim=1).mean()
    return loss

This function is the cost function; it implements both cases ($T_c = 0$ and $T_c = 1$). Note that, unlike the formula in the paper, this implementation does not square the margin terms.

The code comes from higgsfield's GitHub repository.

For the accompanying text, please refer to **Max Pechyonkin**'s blog series, translated by **weakish**:

- CapsNet Introduction Series, Part 1: The intuition behind capsule networks
- CapsNet Introduction Series, Part 2: How capsules work
- CapsNet Introduction Series, Part 3: The dynamic routing algorithm between capsules
- CapsNet Introduction Series, Part 4: Capsule network architecture

Also refer to:

Reference: https://cloud.tencent.com/developer/article/1110783 (CapsNet study notes: theory and code reading (PyTorch), Tencent Cloud Community)