Summary of DianNao series accelerators (1)-Introduction to architecture and computing unit Overall architecture computing module


This article is the first in a series summarizing the DianNao accelerators. It contains many formulas; since Jianshu does not support formula rendering, the full version will be published on my personal blog once the series is complete.


The DianNao series is a series of machine learning accelerators launched by the Institute of Computing Technology of the Chinese Academy of Sciences, including the following four members:

  • DianNao: neural network accelerator, the pioneering work of the series
  • DaDianNao: a neural network "supercomputer", the multi-core scaled-up version of DianNao
  • ShiDianNao: a dedicated accelerator for machine vision, integrating the video-processing path
  • PuDianNao: a general machine learning accelerator and the culmination of the series, supporting seven machine learning algorithms

Compared with other neural network accelerators, the DianNao series pays particular attention to storage optimization in addition to implementing the computation itself.

Overall structure

The overall structure of the DianNao series is similar, divided into the following three parts:

  • Computing core: performs the accelerated computation
  • Cache: buffers input data, output data, and parameters to reduce memory bandwidth requirements
  • Control: coordinates the work of the computing core and the caches

The overall architecture of the first three generations (DianNao, DaDianNao, ShiDianNao) is shown below:


where:

  • NBin, NBout, and SB: on-chip buffers storing input data, output data (or temporary data), and parameters, respectively
  • NFU: the computing core, which performs the neural network operations

The architecture diagrams from the original papers are shown below (left: DianNao/DaDianNao; right: ShiDianNao):


To support more machine learning algorithms (PuDianNao is not designed solely for neural networks), the last member, PuDianNao, abandons function-based buffer partitioning in favor of partitioning by reuse frequency. The architecture therefore changes accordingly, as shown below:


where:

  • HotBuf, ColdBuf: input data buffers, holding frequently reused data and data reused at long intervals, respectively
  • OutBuf: output data buffer, used to store output data
  • FU: functional unit, which performs the machine learning operations
  • Controller: the control core, coordinating the buffers and functional units

The system structure diagram drawn in the original paper is as follows:


Calculation module

The computation module performs the operations to be accelerated and is one of the core parts of each accelerator.

Operational analysis

Each paper in the DianNao series devotes considerable space to analyzing the target computations, which is very helpful for learners.

DianNao and DaDianNao

The neural network operations supported by these two accelerators are relatively basic. As the papers summarize, implementing a convolutional neural network requires the following operations:

  • Convolution operation: $out(x,y)^{f_o} = \sum\limits_{f_i = 0}^{K_{if}} \sum\limits_{k_x = 0}^{K_x} \sum\limits_{k_y = 0}^{K_y} w_{f_i,f_o}(k_x,k_y) \times in(x+k_x,\, y+k_y)^{f_i}$
  • Pooling operation: $out(x,y)^f = \max\limits_{0 \leq k_x \leq K_x,\; 0 \leq k_y \leq K_y} in(x+k_x,\, y+k_y)^f$
  • LRN (local response normalization; batch normalization was not yet popular at the time): $out(x,y)^f = \cfrac{in(x,y)^f}{\left(c + \alpha \sum\limits_{g=\max(0,\, f-k/2)}^{\min(N_f,\, f+k/2)} \left(a(x,y)^g\right)^2\right)^{\beta}}$
  • Matrix multiplication: $out(j) = t\left(\sum\limits_{i = 0}^{N_i} w_{ij} \cdot in(i)\right)$
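As an illustrative sketch (mine, not from the papers), the convolution and max-pooling operations above can be written out in plain Python for a single output position; the names (`in_maps`, `weights`, `K_x`, `K_y`) are chosen for clarity and are not DianNao terminology.

```python
def conv_at(in_maps, weights, x, y):
    """out(x,y) for one output feature map: sum over input maps and the kernel window.

    in_maps[f_i][row][col] are the input feature maps;
    weights[f_i][k_x][k_y] is the kernel connecting input map f_i to this output map.
    """
    total = 0.0
    for f_i, w in enumerate(weights):
        for k_x, row in enumerate(w):
            for k_y, w_val in enumerate(row):
                total += w_val * in_maps[f_i][x + k_x][y + k_y]
    return total

def maxpool_at(in_map, x, y, K_x, K_y):
    """out(x,y) for one feature map: maximum over a K_x-by-K_y window."""
    return max(in_map[x + k_x][y + k_y]
               for k_x in range(K_x) for k_y in range(K_y))
```

For example, with a single 2x2 input map and an identity-diagonal 2x2 kernel, `conv_at` sums the two diagonal elements, while `maxpool_at` picks the largest element in the window.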

DianNao does not implement the LRN function; it is only implemented in DaDianNao. In addition, DaDianNao supports neural network training, and the operations required for training are essentially the same as those for inference.


ShiDianNao

In addition to the operations supported by DianNao, ShiDianNao also supports LCN (Local Contrast Normalization): $$ O^{m_i}_{a,b} = \cfrac{I^{m_i}_{a,b}}{\left(k + \alpha \times \sum\limits_{j=\max(0,\, m_i - M/2)}^{\min(M_i - 1,\, m_i + M/2)} \left(I^{j}_{a,b}\right)^2\right)^{\beta}} $$


PuDianNao

PuDianNao supports seven machine learning algorithms: neural networks, linear models, support vector machines, decision trees, naive Bayes, k-nearest neighbors, and k-means clustering. Since many operations must be supported, PuDianNao's operation analysis focuses mainly on storage behavior. From the design of its computing core, the operations PuDianNao accelerates directly are: vector dot product, distance computation, counting, sorting, and nonlinear functions. Other, uncovered computations are handled by an ALU.
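To make the five directly accelerated operation classes concrete, here is a hedged pure-Python sketch (names are mine, not PuDianNao's); each function corresponds to one operation class and to the algorithms that rely on it:

```python
def dot(a, b):
    # Vector dot product: neural networks, linear models, SVM.
    return sum(x * y for x, y in zip(a, b))

def sq_distance(a, b):
    # Squared Euclidean distance: k-nearest neighbors, k-means.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def count_matches(a, b):
    # Counting of matching elements: naive Bayes, decision trees.
    return sum(1 for x, y in zip(a, b) if x == y)

def k_smallest(values, k):
    # Sorting / selecting the k smallest: k-nearest neighbors.
    return sorted(values)[:k]
```

Nonlinear functions (the fifth class) are covered by the piecewise linear approximation discussed later in the computing-module section.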

Calculation module design


DianNao's computation module set the tone for the whole series. Its structure is as follows:


The computation module is a three-stage pipeline:

  • NFU-1: a multiplier array of 16-bit fixed-point multipliers (1 sign bit, 5 integer bits, 10 fractional bits)
  • NFU-2: adder tree / maximum tree, which accumulates or takes the maximum of the multiplier results, optionally accumulating with a previously stored partial sum. Registers at the end of this stage can store the partial sum of the current operation.
  • NFU-3: nonlinear activation function, implemented by piecewise linear approximation

For vector multiplication and convolution, NFU-1 performs the element-wise multiplications, NFU-2 sums the products, and NFU-3 applies the activation function. For pooling, NFU-2 computes the maximum or average of multiple elements. Although the computation module is very simple, it covers most of the operations a neural network requires (LRN is not implemented in DianNao).
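The three pipeline stages can be sketched functionally as follows. This is my own minimal model, not the RTL; the segment encoding used for the piecewise linear activation is an assumption chosen for clarity.

```python
def nfu1_multiply(inputs, weights):
    # NFU-1: multiplier array, element-wise products of inputs and weights.
    return [x * w for x, w in zip(inputs, weights)]

def nfu2_reduce(products, mode="sum", partial=0.0):
    # NFU-2: "sum" models the adder tree (optionally folding in a stored
    # partial sum); "max" models the maximum tree used for pooling.
    if mode == "sum":
        return partial + sum(products)
    return max(products)

def nfu3_activate(x, segments):
    # NFU-3: piecewise linear approximation. Each segment is
    # (lower_bound, a, b), meaning f(x) ~= a*x + b for x >= lower_bound;
    # segments are sorted by ascending lower bound.
    for lo, a, b in reversed(segments):
        if x >= lo:
            return a * x + b
    lo, a, b = segments[0]
    return a * x + b

# Example: ReLU is exactly representable with two linear segments.
relu_segments = [(float("-inf"), 0.0, 0.0), (0.0, 1.0, 0.0)]
```

Chaining the three stages on one output neuron reproduces the vector-multiply-then-activate flow described above; switching NFU-2 to `mode="max"` and bypassing NFU-1/NFU-3 gives max pooling.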


DaDianNao's computing unit (NFU) is essentially the same as DianNao's. The biggest differences are a few additional datapaths for training and a more flexible configuration. The NFU is 16x16: 16 output neurons, each with 16 inputs, so the input side must supply 256 values per cycle. The NFU can also optionally skip stages, making its function flexibly configurable. The structure of DaDianNao's NFU is as follows:



ShiDianNao is the only accelerator in the series that exploits data reuse at the arithmetic-unit level, and the only one that uses a two-dimensional array of arithmetic units. The structure of its array is as follows:


ShiDianNao's array is a 2D mesh. For each processing element (PE), the parameters used in the computation come from the kernel, while the input data may come from:

  • the input buffer NBin
  • the PE below
  • the PE to the right

The following figure shows the structure (left) and abstract structure (right) of each arithmetic unit:


Each PE both forwards data and performs computation:

  • Forwarding data: each input may come from the right PE, the PE below, or NBin. A control signal selects one and stores it in the input register, and may additionally push it into FIFO-H or FIFO-V. Likewise, control signals select values from FIFO-H and FIFO-V to drive the FIFO output port.
  • Performing computation: according to control signals, the PE performs addition, accumulation, multiply-add, comparison, etc., stores the result in the output register, and either keeps it there or drives it onto the PE output port.

From the structure diagram above, the operations a PE supports include: multiplying the kernel and input data and adding the product to the output register (multiply-accumulate), taking the maximum or minimum of the input data and the output register (used in pooling), adding the kernel and input data (vector addition), and adding the input data to the output register (accumulation), among others.
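The PE operations just listed can be modeled as a single step function acting on the output register; this is a hedged sketch (the operation names are mine), where each call corresponds to one control-signal configuration:

```python
def pe_step(op, kernel, data_in, out_reg):
    """One PE cycle: returns the new output-register value.

    op selects the datapath configuration; kernel is the weight,
    data_in the selected input, out_reg the current register value.
    """
    if op == "mac":   # multiply-accumulate: out += kernel * in
        return out_reg + kernel * data_in
    if op == "max":   # pooling: keep the larger of register and input
        return max(out_reg, data_in)
    if op == "min":   # pooling variant: keep the smaller
        return min(out_reg, data_in)
    if op == "add":   # vector addition: kernel + input
        return kernel + data_in
    if op == "acc":   # accumulation: out += input
        return out_reg + data_in
    raise ValueError(f"unknown op: {op}")
```

Running a sequence of `"mac"` steps over a kernel window, followed by an activation, reproduces one convolution output at that PE.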


PuDianNao's computing unit is the only heterogeneous one in the series: besides the MLU (Machine Learning Unit), it contains an ALU for general-purpose operations and for operations the MLU cannot handle. The structure of the computing unit (left) and the MLU (right) is shown below:


The MLU is divided into six layers:

  • Counting/comparison layer: performs bitwise AND or comparison of two inputs, with the results accumulated; can be output directly and can be bypassed
  • Addition layer: element-wise addition of two inputs; can be output directly and can be bypassed
  • Multiplication layer: element-wise multiplication of two inputs, or of one input with the addition layer's result; can be output directly
  • Adder-tree layer: sums the results of the multiplication layer
  • Accumulation layer: accumulates the results of the adder-tree layer; can be output directly
  • Special processing layer: a nonlinear function implemented by piecewise linear approximation, plus a k-sorter that outputs the k smallest values from the previous layer

This is the most versatile computing unit in the DianNao series, and its configuration is very flexible. For example, to implement a vector dot product (multiply element-wise, then accumulate), the counting and addition layers are bypassed, and the data flows through the multiplication layer, the adder-tree layer, and the accumulation layer.
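The dot-product configuration described above can be sketched as a rough functional model (an assumption of mine, not the hardware specification), with the bypassed layers simply omitted from the dataflow:

```python
def mlu_dot(vec_a, vec_b, acc=0.0):
    """Vector dot product through the MLU: counting and addition layers
    bypassed; data flows multiplication -> adder tree -> accumulation."""
    products = [a * b for a, b in zip(vec_a, vec_b)]  # multiplication layer
    tree_sum = sum(products)                          # adder-tree layer
    return acc + tree_sum                             # accumulation layer
```

The `acc` parameter models the accumulation layer folding in a partial sum from a previous pass, which is how longer vectors than the hardware width would be handled.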

Reference: DianNao series accelerator summary (1) - Introduction to architecture and computing unit, Tencent Cloud Community.