This article contains many formulas, and Jianshu does not support formula rendering, so the full version has been published on the author's personal blog.

The storage design concept of the DianNao series is split storage, which has several advantages:

- Higher bandwidth: compared with a single memory of the same total size, several separate memories can provide greater aggregate bandwidth.
- Matched bit width: different kinds of data need different bit widths. Storing each kind in a memory of matching width avoids wasting bit width.

The storage designs of DianNao and DaDianNao are essentially the same. The difference is that DaDianNao uses on-chip eDRAM to increase the capacity of on-chip storage. The following figure shows the storage part of DaDianNao. The storage part of DianNao is similar; see the DianNao architecture diagram in the overall architecture part:

DaDianNao_store.JPG

The storage is split into three parts:

- NBin: stores input data; requires a bit width of $T_n$ entries (the number of inputs needed per operation $\times$ the bit width of each input)
- NBout: stores partial sums and final results; also requires a bit width of $T_n$ entries
- SB: stores weights; requires a bit width of $T_n \times T_n$ entries
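To make these widths concrete, the sketch below computes the per-access width of each buffer. The values $T_n = 16$ and 16-bit data are assumptions chosen for illustration, and `buffer_widths` is a hypothetical helper, not something defined in the papers.

```python
def buffer_widths(t_n: int, data_bits: int) -> dict:
    """Return the per-access width, in bits, of NBin, NBout and SB."""
    return {
        "NBin": t_n * data_bits,        # T_n inputs per access
        "NBout": t_n * data_bits,       # T_n partial sums or results per access
        "SB": t_n * t_n * data_bits,    # T_n x T_n weights per access
    }

# Assumed example values: T_n = 16 lanes, 16-bit data.
widths = buffer_widths(t_n=16, data_bits=16)
print(widths)
```

The weight buffer is $T_n$ times wider than the other two, which is one reason a single memory of uniform width would waste bits.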

The reuse strategy of DianNao and DaDianNao is to reuse the input data, that is, the data in NBin. NBin is overwritten only after every operation that needs its contents has completed. Therefore, in DaDianNao, all computing units share the NBin and NBout implemented in eDRAM (the eDRAM router part in the figure), while each has its own SB storage (each node has 4 eDRAMs).

ShiDianNao's storage is quite distinctive. Because of its special nature, it does not use DaDianNao-style eDRAM to build a very large on-chip store; it uses only 288KB of SRAM, so its storage organization is worth studying. The following figure shows the design of the NBin cache and its controller:

ShiDianNao_store.JPG

Each memory is split into $2 \times P_y$ banks, and each bank is $P_x \times 16$ bits wide. Here $P_y$ is the number of rows of the computing array, $P_x$ is the number of columns of the computing array, and 16 bits is the data width. The memory supports six read modes:

- Read bank 0 through bank $P_y-1$: $P_y \times P_x \times 16$ bits of data in total, enough to fill every node in the computing array.
- Read bank $P_y$ through bank $2 \times P_y-1$: $P_y \times P_x \times 16$ bits of data in total, enough to fill every node in the computing array.
- Read one bank: $P_x \times 16$ bits of data in total, enough to fill one row of the computing array.
- Read a single value (16 bits) from one bank.
- Read the data at the specified interval in each bank: $2 \times P_y \times 16$ bits of data in total.
- Read the data at the specified position in each of bank $P_y$ through bank $2 \times P_y-1$: $P_y \times 16$ bits of data in total, enough to fill one column of the computing array.
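The read modes can be illustrated with a toy software model. This is only a sketch: `BankedBuffer` and its method names are hypothetical, each stored value is a `(bank, row, column)` tuple so results are easy to trace, and the fifth mode is left out because its addressing is not described in enough detail here.

```python
class BankedBuffer:
    """Toy model of a buffer with 2*P_y banks; each bank row holds P_x values."""

    def __init__(self, p_x: int, p_y: int, rows_per_bank: int):
        self.p_x, self.p_y = p_x, p_y
        # banks[b][r][c] holds the tuple (b, r, c) as a traceable stand-in for data.
        self.banks = [
            [[(b, r, c) for c in range(p_x)] for r in range(rows_per_bank)]
            for b in range(2 * p_y)
        ]

    def read_lower_banks(self, row):        # mode 1: banks 0 .. P_y-1
        return [self.banks[b][row] for b in range(self.p_y)]

    def read_upper_banks(self, row):        # mode 2: banks P_y .. 2*P_y-1
        return [self.banks[b][row] for b in range(self.p_y, 2 * self.p_y)]

    def read_bank(self, bank, row):         # mode 3: P_x values, one array row
        return self.banks[bank][row]

    def read_single(self, bank, row, col):  # mode 4: one 16-bit value
        return self.banks[bank][row][col]

    def read_column_upper(self, row, col):  # mode 6: one value per upper bank
        return [self.banks[b][row][col] for b in range(self.p_y, 2 * self.p_y)]


buf = BankedBuffer(p_x=2, p_y=2, rows_per_bank=4)
tile = buf.read_lower_banks(row=0)             # P_y x P_x values: fills the array
column = buf.read_column_upper(row=0, col=0)   # P_y values: fills one column
print(tile, column)
```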

Writes are buffered: the data to be written is first collected in the output registers, and it is written from the output registers to memory in one pass after all arithmetic units have completed their operations.

PuDianNao abandons splitting the memory by purpose and instead splits it by reuse frequency. Its design is also closer to a general-purpose CPU, with the aim of realizing a general-purpose machine learning processor. PuDianNao divides the seven machine learning algorithms it implements into two types in terms of storage behavior:

![DianNao_map.png](https://upload-images.jianshu.io/upload_images/7241055-d462eafd790ab692.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The first type resembles k-NN (k-nearest neighbors): the reuse intervals of the data (the number of data items between one use and the next) cluster clearly into a few groups. The second type resembles NB (naive Bayes): apart from an obvious cluster at position 1, the reuse intervals are spread over a continuous range. PuDianNao therefore implements three on-chip buffers:

- ColdBuffer: 16KB, stores data with longer reuse intervals; it has a smaller bit width.
- HotBuffer: 8KB, stores data with shorter reuse intervals; it has a larger bit width.
- OutputBuffer: 8KB, stores output data.
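The notion of a reuse interval can be made concrete with a short sketch. It assumes one possible reading of the definition, namely that the interval is the number of accesses from one use of an item to its next use; `reuse_intervals` and the sample trace are hypothetical.

```python
from collections import Counter

def reuse_intervals(trace):
    """Return a histogram {interval: count} of reuse intervals in an access trace."""
    last_seen = {}          # item -> position of its most recent access
    histogram = Counter()
    for position, item in enumerate(trace):
        if item in last_seen:
            histogram[position - last_seen[item]] += 1
        last_seen[item] = position
    return histogram

# "a", "b", "c" are each reused after 3 accesses; "d" is reused immediately.
trace = ["a", "b", "c", "a", "b", "c", "d", "d"]
hist = reuse_intervals(trace)
print(dict(hist))
```

Building such a histogram for an algorithm's memory trace gives the kind of distribution that the two categories above are based on.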

The mapping method refers to how existing hardware accelerators implement operations in neural networks, including convolution, pooling, and fully connected layers.

Since the papers of DianNao and DaDianNao did not clearly explain how these two accelerators are mapped, the following content is **personal speculation**

The computing units of DianNao and DaDianNao are both NFU. With reference to their design, their functions are described as follows:

$$ mul: y_i = \sum\limits^{T_n}_{i=1} w_i \cdot x_i \\ max: y_i = \max\{x_1, x_2, \ldots, x_{T_n}\} $$
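As a behavioural illustration only, the two NFU functions can be written as plain Python functions. The names `nfu_mul` and `nfu_max` are hypothetical, and the adder tree and max tree are modelled simply with `sum` and `max`.

```python
def nfu_mul(weights, inputs):
    """MUL mode: multiply T_n weight/input pairs and reduce them with an adder tree."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must both have T_n elements")
    return sum(w * x for w, x in zip(weights, inputs))

def nfu_max(inputs):
    """MAX mode: reduce T_n inputs with a max tree."""
    return max(inputs)

# Example with T_n = 2.
y_mul = nfu_mul([1, 2], [3, 4])   # 1*3 + 2*4
y_max = nfu_max([3, 4])
print(y_mul, y_max)
```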

Both vector inner product and convolution ultimately reduce to multiplying the elements at corresponding positions and summing the products, so both can use the MUL function of the computing core, that is, with NFU-2 configured as an adder tree. In storage, the input data is laid out in [height, width, channel] order: all channels of the first data position are stored first, then all channels of the second data position, and so on. The weight data is laid out in [height, width, output channel, input channel] order. The implementation diagram is as follows:

DianNao_map.png

The figure above is an example of $T_n = 2$, in which the meaning of the data is as follows:

Mark | Source | Description |
---|---|---|
X000 | Input data | Data position (0,0), channel 0 data |
X001 | Input data | Data position (0,0), channel 1 data |
W0000 | Parameter | Data position (0,0), channel 0 data corresponds to the parameters of output channel 0 |
W0001 | Parameter | Data position (0,0), channel 1 data corresponds to the parameters of output channel 0 |
W0010 | Parameter | Data position (0,0), channel 0 data corresponds to the parameters of output channel 1 |
W0011 | Parameter | Data position (0,0), channel 1 data corresponds to the parameters of output channel 1 |
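The two layouts determine where each labelled element sits in memory. The sketch below computes flat offsets for both, assuming a dense row-major arrangement with no padding; the helper names are hypothetical.

```python
def input_offset(y, x, c, width, channels):
    """Offset of an input element in [height, width, channel] order."""
    return (y * width + x) * channels + c

def weight_offset(ky, kx, out_c, in_c, k_width, out_channels, in_channels):
    """Offset of a weight in [height, width, output channel, input channel] order."""
    return ((ky * k_width + kx) * out_channels + out_c) * in_channels + in_c

# With 2 input and 2 output channels (the T_n = 2 example), the elements at
# data position (0,0) occupy consecutive offsets in the order listed in the table.
# The image width and kernel width of 3 are arbitrary example values.
x_offsets = [input_offset(0, 0, c, width=3, channels=2) for c in range(2)]
w_offsets = [
    weight_offset(0, 0, out_c, in_c, k_width=3, out_channels=2, in_channels=2)
    for out_c in range(2)
    for in_c in range(2)
]
print(x_offsets, w_offsets)   # X000, X001 and W0000, W0001, W0010, W0011
```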

The operation implemented in convolution is as follows:

DianNao_conv_map.png
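Since the papers do not spell out the mapping, the following is only one plausible sketch of how a convolution output could be built from repeated MUL operations: for every kernel position, the input channels are consumed $T_n$ at a time and the results are accumulated into a partial sum, as NBout would hold it. Stride 1, no padding, and all function names are assumptions.

```python
T_N = 2  # lanes per MUL operation, matching the T_n = 2 example

def nfu_mul(weights, inputs):
    """Multiply-and-add over up to T_N weight/input pairs."""
    return sum(w * x for w, x in zip(weights, inputs))

def conv_output(inputs, weights, out_y, out_x, out_c):
    """Compute one output value.

    inputs[y][x][c] follows [height, width, channel];
    weights[ky][kx][out_c][in_c] follows [height, width, output, input channel].
    """
    partial_sum = 0                              # would live in NBout
    in_channels = len(inputs[0][0])
    for ky in range(len(weights)):
        for kx in range(len(weights[0])):
            pixel = inputs[out_y + ky][out_x + kx]   # all channels of one position
            kernel = weights[ky][kx][out_c]          # matching weights
            for c0 in range(0, in_channels, T_N):    # T_N channels per MUL
                partial_sum += nfu_mul(kernel[c0:c0 + T_N], pixel[c0:c0 + T_N])
    return partial_sum

# 2x2 input with 2 channels, 2x2 kernel of ones, 1 output channel.
inputs = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
weights = [[[[1, 1]], [[1, 1]]], [[[1, 1]], [[1, 1]]]]
result = conv_output(inputs, weights, out_y=0, out_x=0, out_c=0)
print(result)
```

Because the channels of one data position are contiguous in the [height, width, channel] layout, each inner slice corresponds to one contiguous read.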

For the pooling layer, the input data is laid out in [channel, height, width] order, and NFU-2 is configured as a max tree.
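A minimal functional sketch of max pooling over one channel plane is given below. It only models the result, not the hardware schedule; the function name, window size and stride are example assumptions.

```python
def max_pool(plane, window, stride):
    """Max pooling over one channel plane stored as plane[y][x]."""
    pooled = []
    for y in range(0, len(plane) - window + 1, stride):
        row = []
        for x in range(0, len(plane[0]) - window + 1, stride):
            # Gather the window and reduce it, as a max tree would.
            values = [plane[y + dy][x + dx]
                      for dy in range(window) for dx in range(window)]
            row.append(max(values))
        pooled.append(row)
    return pooled

plane = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
pooled = max_pool(plane, window=2, stride=2)
print(pooled)
```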

ShiDianNao implements convolution, pooling, and vector inner product on its computing array, so the mapping is more complicated. The following descriptions all assume $P_x=P_y=2$.

The simplified graph of each node of ShiDianNao is shown below, and the following description will use this graph:

ShiDianNao_node_model.png

The first step of convolution is initialization: the data is read into the computing array using cache read mode 1 or 2:

ShiDianNao_map0.png

Next, read the first neuron of Bank2 and of Bank3 and feed them in at the right side of the computing array, so that the window of input data held by the array moves one step to the right. This is equivalent to extending the frame that marks the data taking part in the operation to the right:

ShiDianNao_map1.png

Next, read the second neuron of Bank2 and of Bank3 and feed them in at the right side of the computing array, so that the window of input data moves one more step to the right. This is equivalent to extending the frame that marks the data taking part in the operation further to the right:

ShiDianNao_map2.png

Then read the two neurons in Bank1, feed them in at the bottom of the array, and shift the data upwards. This is equivalent to extending the frame that marks the data taking part in the operation downward:

ShiDianNao_map3.png

The following table shows the weights and data used by each computing node:

Node coordinate | Parameter=K00 | Parameter=K10 | Parameter=K20 | Parameter=K01 |
---|---|---|---|---|
0,0 (upper left) | X00 | X10 | X20 | X01 |
0,1 (upper right) | X10 | X20 | X30 | X11 |
1,0 (lower left) | X01 | X11 | X21 | X02 |
1,1 (lower right) | X11 | X21 | X31 | X12 |
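The pattern in the table can be stated compactly: the node at (row, col), when the broadcast weight is K with indices (kx, ky), uses the input X with indices (col + kx, row + ky). The sketch below regenerates the table from that rule. The label convention (first index horizontal, second index vertical) is inferred from the table itself, and `operand_schedule` is a hypothetical helper.

```python
P_X, P_Y = 2, 2   # columns and rows of the computing array

def operand_schedule(kernel_steps):
    """Map each node (row, col) to the input labels it uses, one per kernel step."""
    schedule = {}
    for row in range(P_Y):
        for col in range(P_X):
            schedule[(row, col)] = [
                f"X{col + kx}{row + ky}" for kx, ky in kernel_steps
            ]
    return schedule

steps = [(0, 0), (1, 0), (2, 0), (0, 1)]   # K00, K10, K20, K01
table = operand_schedule(steps)
for node, operands in table.items():
    print(node, operands)
```

Each node's operand at one step is the operand its right-hand neighbour held at the previous step (or its lower neighbour's when moving to the next kernel row), which is what allows data to be passed between nodes instead of being reread from memory.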

Note that the behavior of the arithmetic units and SB described above is stated in the original paper, while the memory behavior is **personal inference**. In addition, the derivation in the original paper stops here for the sake of brevity, and the remaining **steps cannot be completely inferred from the steps above**. The paper states that this reuse scheme saves 44.4% of bandwidth. Since $4 \times 9 \times 44.4\% \approx 16$, about 16 of the $4 \times 9 = 36$ reads are saved, leaving 20 reads in total. There are 16 data values in the figure, so presumably the values reused the most are the central ones: X11, X21, X12 and X22. The original figure for this section is shown below:
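The numbers in this paragraph can be checked with a short count. The sketch assumes a 3x3 kernel on the 2x2 array (inferred from the factor $4 \times 9$) and takes the figure of 20 reads from the text; it counts how often each input value is needed across all nodes.

```python
from collections import Counter

P_X, P_Y, K = 2, 2, 3   # 2x2 array; 3x3 kernel is inferred from the factor 4 x 9

uses = Counter()
for oy in range(P_Y):               # one output neuron per node
    for ox in range(P_X):
        for ky in range(K):
            for kx in range(K):
                uses[f"X{ox + kx}{oy + ky}"] += 1

total_uses = sum(uses.values())     # operands needed without any reuse
top = max(uses.values())
most_reused = sorted(label for label, n in uses.items() if n == top)

reads = 20                          # read count taken from the text
saving = 1 - reads / total_uses
print(total_uses, len(uses), most_reused, round(saving, 3))
```

The count gives 36 operand uses over 16 distinct values, with the four central values needed by every node, which is consistent with both the 44.4% figure and the guess about which values are reused most.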

ShiDianNao_map_source.JPG

Pooling is mapped in a similar way to convolution. Because the stride of pooling is generally not 1, note that the depth of FIFO-H and FIFO-V is no longer 1. Here $S_x$ and $S_y$ denote the strides in the X and Y directions, respectively.

In matrix multiplication, each computing node corresponds to one output neuron, and a node does not start on the next output neuron until the current one is finished. Unlike convolution, the data being broadcast is the input data rather than the weights, because in matrix multiplication the weights outnumber the input data and are not reused. Each operation is divided into the following steps:

- Read one input value and $P_x \times P_y$ weights; each computing node receives one weight and the broadcast input value.
- Each computing node multiplies the input value by its weight and adds the product to its previous partial sum.
- When all computations for an output neuron are complete, the accumulated result of each node is written back to on-chip storage.
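The three steps above can be sketched as a simple loop. This is a functional model under stated assumptions, not the hardware schedule: `fully_connected_pass` is a hypothetical name, and `weights[node][i]` is assumed to hold the weight linking input `i` to the output neuron assigned to `node`.

```python
P_X, P_Y = 2, 2   # computing array dimensions used in this section

def fully_connected_pass(inputs, weights):
    """Compute P_X * P_Y output neurons, one per computing node."""
    nodes = P_X * P_Y
    partial_sums = [0] * nodes
    for i, x in enumerate(inputs):            # step 1: broadcast one input value
        for node in range(nodes):             # each node receives its own weight
            partial_sums[node] += weights[node][i] * x   # step 2: multiply-accumulate
    return partial_sums                       # step 3: results go back to storage

inputs = [1, 2, 3]
weights = [
    [1, 0, 0],   # node 0
    [0, 1, 0],   # node 1
    [0, 0, 1],   # node 2
    [1, 1, 1],   # node 3
]
outputs = fully_connected_pass(inputs, weights)
print(outputs)
```

Every weight is used exactly once per pass, while each input value is shared by all nodes, which is why the input is the broadcast operand here.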

The mapping method of PuDianNao is relatively simple: for the sake of flexibility, the whole chip is controlled in a software-like manner. Inference proceeds as follows:

- The control module directs the DMA to move the specified data from off-chip storage into the on-chip buffer and on to the specified processing unit
- The processing unit processes the data under the control of the control module
- The DMA transfers the results from the processing unit back to the buffer

Reference: DianNao series accelerator summary (2): storage and mapping, Tencent Cloud Community, https://cloud.tencent.com/developer/article/1157328