How to Implement a Convolutional Neural Network Using High Level Synthesis

Introduction

Deep Learning has taken the world by storm and now has applications in almost every field, from image and speech recognition to medical software, from data analysis to the fine arts. Even though the idea of Deep Learning is not new and the basic principle is pretty straightforward, only recently has it grown in popularity thanks to the use of high performance CPUs and GPUs in the training process.

In this blog post I talk about how to implement a Deep Learning algorithm for handwritten digit recognition by taking advantage of the power of parallel computing inside an FPGA fabric and speeding up the development process using High Level Synthesis (HLS).

Workflow overview



workflow

Figure 1: Workflow overview


Step 1: Choosing a Neural Network architecture

First I explored a few different neural network architectures and chose one that was best suited to the task at hand by satisfying a set of performance criteria and design restrictions.

I recommend using Python for training the neural network because there are lots of resources available online: from APIs (TensorFlow, Caffe, Keras, etc.) to the wider community that shares insights via blogs, forums, and various tutorials. Training takes a lot of time and another advantage of using python is that you have access to a number of cloud computing services to speed things up like FloydHub, Amazon Web Services and the Intel AI Academy. Other languages could be used for training, e.g. Matlab or even C, but I think Python has more features that help speed up the training process.

The first thing to do is to define the metrics that are going to be used to evaluate different neural network architectures. In my exploration I used the following selection criteria:

  • Accuracy
  • Parameter count
  • Parallelization potential

I explored two main training techniques for this task:

  • Extreme Machine Learning (EML)
  • Backpropagation (backprop)

EML is a technique that has proven to have a high accuracy on the Modified National Institute of Standards and Technology data set widely known as MNIST (for all training I used the MNIST dataset because of its robustness), but its neural architecture is similar to a feedforward neural network which comes with a high parameter count and low parallelization potential. Feedforward neural network require all the values from the previous layer to be known in order to start computing the next layer. This leaves little room for improvement using HLS.

Backpropagation is widely used to train Feedforward Neural Networks and multiple variations of Convolutional Neural Networks (CNN). I trained multiple variations of NNs and even a Multi-Column CNN (MC-CNN). As expected the MC-CNN had the best accuracy on the MNIST dataset.

Each neural network has a set of parameters (weights), which are trained, and hyperparameters, which are more or less empirically chosen.

Some of the hyperparameters have a direct influence on the Neural Network structure:

  • Number of convolutional layers
  • Number of fully connected (FC) layers
  • Number of units (neurons) in each FC layer
  • Kernel size, stride, padding
  • Number of kernels
  • Number of columns (only for MC-CNN)
  • Activation function

Other hyperparameters only affect the training process:

  • Dropout
  • Mini-batch training, size of mini-batch
  • Learning rate
  • Minimization algorithm

I ended up choosing a regular CNN because of the low parameter count (Table 1 – highlighted in green)


Neural Network Parameter count
Extreme Machine learning algorithms >1,000,000
Fully Connected Networks 100,000-12,000,000
Convolutional Neural Networks 5,000-1,000,000
Multi-Column CNNs Number of columns * CNN parameters

Table 1: Parameter count for each type of neural network



To obtain the final architecture I trained 194 models covering a range of hyperparameters, as seen in Table 2. The training of the 194 models took a total of 29 hours and 17 minutes using an nVidia Tesla GPU.


Hyperparameter Range Observations
Number of convolutional layers [2:10]
Number of fully connected layers [1:5]
Number of units (neurons) in each FC layer [20:500]
Kernel size {2,3,5,7} Anything higher than 7 will decrease the size of the resulting map too fast.
Padding Never I wanted to decrease the number of parameters after each convolution.
Stride 1 for convolutions and 2 for pooling Used a constant small stride to slowly decrease the feature map size after each convolution or pooling.
Number of kernels [5:50] After 20 the performance was saturated, after 35 the performance started to decrease.
Activation function Sigmoid
tanh
ReLU
ReLU: was chosen because it speeds up the training and prevents training saturation.
Dropout {0%, 40-80%} Ended up not using dropout at all.
Mini-batch training, size of mini-batch {2,4,8,16, …1024} Best performance was obtain using a 64 size minibatch.
Learning rate [0.001: 0.1] Best performance was obtained using 0.01.
Minimization algorithm Gradient Descent
RMSprop
Adam
As expected Adam provided the best results.

Table 2: Hyperparameters


figure2

Figure 2: 10xCONV7-POOL2-20xCONV7-POOL2- 84-10


The accuracy of all the trained networks was above 90%; the minimum requirement was to have an accuracy of 98.5% with the goal being to hit 99%. In the end a trade-off was made, choosing an architecture with a low parameter count to be able to have more resources available on the FPGA for parallelization.


Arhitecture Number of parameters Accuracy
10xCONV7-POOL2-20xCONV7-POOL2-84-10 17430 98.93
6xCONV7-POOL2-16xCONV7-POOL2-92-10 14014 98.57
6xConv5-POOL2-24xCONV3-POOL2-12xCONV3-POOL2-84-10 5346 98.25

Table 3: Three promising architectures from which I chose the one highlighted in green.


Step 2: Implementation of the Neural Network in C

To be able to deploy the neural network algorithm on an FPGA, the algorithm needs to be written in a Hardware Description Language. For this reason I had to manually rewrite the entire inference step of the neural network in C/C++. To ensure that the newly implemented C code worked fine, it was tested against the Python inference using the same trained weights.

Step 3: Adding High Level Synthesis specific constructs

HLS comes with a number of #pragma directives that allow the user to guide the HLS compiler as to how to generate the desired design. It also contains a number of libraries containing predefined data types and functions that can boost the performance of your design.

HLS constructs help improve the design, but are not necessary to obtain a synthesizable code. The initial synthesis results were bad, but the design could be implemented on a big enough FPGA. In this project the synthesis is targeting a Zynq XC7Z020 System on a Chip.


BRAM DSP FF LUT Clock period (ns) Latency (cc)
Values 294 15 3,311 6,936 8.43 3,240,373
Percentage 105% 6% 3% 13%

Table 4: Initial synthesis results


Table 3 shows the synthesis results of the C code implemented during Step 2. In this case, the BRAM count is high and the latency is poor. Fortunately, after using specific HLS constructs, the results improved significantly, as we will see in the following sections.

3.1 Fixed point arithmetics

The initial C/C++ code uses the float data type for parameters, which comes at a great cost. Fixed point operations are less precise, use less hardware and won’t affect the global accuracy of the Neural Network.

#include "ap_fixed.h"
typedef ap_fixed<WIDTH, INT_WIDTH> fixed_p;

The current algorithm can use as few as 4 bits for the fixed point data type with a small decrease in accuracy, but a significant gain in available hardware.

Table 4 shows the hardware requirements for fixed and floating point operations. When synthesized, a single multiplier operation can use up to 858 FF using a 32 bit floating point data type (float). Fixed point arithmetics uses significantly fewer resources for basic operations, e.g. addition (32 LUT and 32 FF for a fixed point adder), while the addition operation in floating point arithmetics can use up to 430 LUT and 749 FF. The final synthesized CNN uses both dedicated DSP blocks and LUT/FF-based multipliers/adders.



table5

Table 5: source – A Survey of FPGA Based Neural Network Accelerator


3.2 Loop unrolling and loop pipelining

Both these actions are performed by a specific #pragma directive applied on a loop.

  • #pragma HLS unroll factor=N
  • #pragma HLS pipeline
  • Unrolling a loop basically copies the loop body N times. If a factor is equal to the number of iterations, the loop is fully unrolled. Each instance of the loop body is executed in parallel, which speeds up the design significantly.

    The pipeline directive adds a pipeline register for each loop input, which initially increases the propagation delay, but in time provides an increase in throughput.

    While both of these directives can improve overall design latency, they come at a cost, e.g. they increase the area of the design. Thus, you need to make a trade-off between hardware consumption and overall design latency.

    3.3 Dataflow pipelining

    The previous sections looked at how any C/C++ code can be optimized for FPGA deployment. In my workflow, I used three main functions, as previously stated: conv(), pool() and fc(). Even though each function is heavily parallelized at the loop level, the functions are called sequentially.

    One of the main reasons I chose to use CNN is that you do not need all the data from the previous layer to start computing the output response for the current layer. This trait of the CNN allows me to further improve the throughput, by parallelizing the execution of convolutional and pooling layers. The initial C/C++ code implementation did not support the dataflow pragma directive. To make the code compatible with the dataflow pragma, some modifications had to be made:

  • data streams were added between the functions
  • data buffers were added in each layer


  • figure3

    Figure 3: Dataflow pipelining


    The data buffers are dimensioned in such a way that they can hold enough data to process one output for the next layer. After the output is computed, one element exits the buffer while another one enters and is used to compute the next output.

    As a result, all the convolutional and pooling layers will finish at almost the same time. Unfortunately, the fully connected layers are not able to benefit from this optimization because they are executed sequentially: they need all the data from the previous layer to compute the output. This is not a reason for concern because less than 1% of the computation takes place in the fully connected layers.

    Step 4: Synthesis

    In HLS the term synthesis means converting the C code into Verilog or VHDL. At the end of a successful synthesis process you will end up with RTL folders containing Verilog and VHDL code and a synthesis log containing information about area, latency and clock frequency. The design needs to be tweaked until the desired hardware performance is achieved.

    After implementing all the upgrades mentioned in step 3, the report looked fantastic compared with the first iteration. The number of CC dropped from 3,240,373 down to 70,532 with a small area consumption.


    BRAM DSP FF LUT Clock period (ns) Latency (cc)
    Values 96 158 23.069 30.779 13.91 70.532
    Percentage 34% 71% 21% 57%

    Table 6: Final synthesis results


    Step 5: Validate synthesis results (C/RTL co-simulation)

    In this step the C code becomes a “golden model” against which the generated RTL is tested. A C testbench is used to generate and check the stimuli. If the test passes, your design is good to go!

    Conclusions

    The field of Deep Learning and FPGA programming is growing rapidly thanks to numerous research papers and community driven projects. New platforms are emerging while others are becoming extinct – e.g. CPUs were the main training processor ten years ago, but now they are rarely used.

    There are still improvements to be made. High Level Synthesis is still a relatively new technique in the field of FPGA programming. Also most implementations are for the forward propagation part of the neural network, even though backpropagation algorithms can also benefit from running on an FPGA-based platform.

    In the future I plan to explore new neural network topologies more suited to FPGA implementations, such as Reduced Precision Neural Networks or Quantized Neural Networks which use low-precision arithmetics in order to reduce the hardware consumption of basic operations and the size of weight storage.

    Acknowledgment

    The content of this article is part of my diploma thesis and it has been developed under the umbrella of AMIQ Education Program.


    Comments

    Leave a Comment:

    Your comment will be visible after approval.

    (will not be published)

    This site uses Akismet to reduce spam. Learn how your comment data is processed.