1. Introduction
In recent years, deep learning and artificial neural networks (ANNs) have achieved
remarkable results in a variety of fields. ANNs have become a cornerstone of artificial
intelligence research by outperforming conventional algorithms in areas such as image
recognition, speech recognition, and natural language processing. However, the computational
complexity and memory requirements of neural networks make them difficult to deploy
in resource-constrained environments such as mobile and embedded systems. To address
these issues, extensive research has been conducted on various neural network architectures [1-5].
Convolutional neural networks (CNNs) have garnered the most attention for efficiently
performing computations in the field of image recognition. In contrast to conventional
deep neural networks (DNNs) with a fully connected layer structure, CNNs incorporate
convolutional and pooling layers, making them highly efficient in learning the structures
and patterns of images.
The success of CNNs has led to extensive research on various CNN architectures and
their variations for a wide range of visual problems. For instance, models such as VGGNet,
ResNet, and Inception have demonstrated outstanding performance on image classification
and object recognition tasks. Moreover, this success has influenced other areas of
research, leading to the development and exploration of CNN-based models in various
domains such as speech recognition, natural language processing, and reinforcement
learning.
Despite their success, large models remain computationally intensive. Researchers
have explored alternatives such as fixed-point representations or removing certain layers
to improve efficiency, alongside growing interest in binary neural networks (BNNs) [6-10].
BNNs binarize weights and activations to −1 or +1, offering substantial reductions
in memory and power consumption [8,11,12]. This is particularly advantageous when deploying neural networks in resource-constrained
environments such as edge devices.
However, this binarization often leads to accuracy losses compared to conventional
networks, prompting investigations into techniques such as batch normalization, multi-bit
neural networks, and various activation functions to mitigate these drawbacks while
improving efficiency [13-15]. Recent advances include models such as XNOR-Net, which uses binary weights and activations
to reduce computational demands without sacrificing accuracy [8,11,16]. In addition, active research focuses on optimizing computations by approximating
accumulators.
This paper introduces BNNs while focusing on the optimization techniques for BNN computations.
Furthermore, it proposes a novel approach for accumulators and analyzes the impacts
on accuracy, power, and hardware utilization when this approach is applied to FPGA-based
BNNs. The remainder of this paper proceeds as follows. Section 2 provides a brief
overview of various neural networks. Section 3 explains the proposed BNN accelerator
with a low-power accumulator. Experimental results are presented in Section 4. Finally,
Section 5 concludes the paper.
2. Background of Neural Networks
2.1. Deep Neural Network (DNN)
The DNN is a specific type of ANN comprising multiple hidden layers, as illustrated
in Fig. 1. It has garnered significant attention in tandem with the advancements in machine
learning models. DNNs excel in effectively learning complex nonlinear patterns, resulting
in high performance across diverse domains.
Fig. 1. General deep neural network.
The input layer receives data, the hidden layers extract complex patterns, and the output
layer produces results. The number of neurons in the input layer matches the dimensionality
of the input data, the hidden layers vary in number and size, and the number of neurons
in the output layer corresponds to the network's task. Activation functions add the
nonlinearity that is vital for learning intricate patterns; the choice of function affects
neuron outputs, and each function has its own advantages and disadvantages. Common
types include the sigmoid, hyperbolic tangent, rectified linear unit (ReLU), and leaky
ReLU.
Fig. 2. Neuron computational process.
Forward propagation in neural networks involves multiplying input data (e.g., x1)
by corresponding weights (e.g., wi1), summing the results, and passing them through
an activation function to generate predictions as shown in Fig. 2. Backpropagation is crucial for error reduction, adjusting weights based on output
layer errors using methods like gradient descent. This iterative process refines weights,
minimizing overall error and enhancing network performance. Gradient descent optimizes
weight updates based on loss function gradients but faces challenges such as local minima
and slow convergence.
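As an illustration of this neuron-level computation, a minimal sketch of a single neuron's forward pass is given below (the function name, the bias term, and the choice of ReLU as the activation are illustrative assumptions, not taken from a specific implementation):

```cpp
#include <vector>
#include <algorithm>

// Minimal sketch of one neuron's forward pass: a weighted sum of the
// inputs plus a bias, followed by a ReLU activation (illustrative choice).
float neuron_forward(const std::vector<float>& x,
                     const std::vector<float>& w,
                     float bias) {
    float sum = bias;
    for (size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * w[i];            // multiply-and-accumulate
    }
    return std::max(0.0f, sum);        // ReLU activation
}
```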
2.2. Convolutional Neural Network (CNN)
CNNs excel in image recognition and processing by learning regional patterns and analyzing
spatial structures through convolution and pooling operations. The convolution layer
applies filters (e.g., 3×3 or 5×5) to the input data to create a feature map. These
filters, smaller than the input data, are adjustable and are used to detect local
patterns in the image. In the convolution operation, the filter moves across the input
image, multiplying and summing overlapping data regions to produce results. This output
then feeds into the activation function, commonly ReLU in CNNs. ReLU outputs zero
for inputs less than zero and the input value itself for non-negative inputs.
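A simplified sketch of this sliding-window operation for a single channel, assuming a stride of 1 and no padding (the function name and sizes are illustrative), could look as follows:

```cpp
#include <vector>
#include <algorithm>

// Sketch of a single-channel 2D convolution followed by ReLU
// (stride 1, no padding; sizes are illustrative).
std::vector<std::vector<float>> conv2d_relu(
        const std::vector<std::vector<float>>& img,   // H x W input
        const std::vector<std::vector<float>>& filt)  // K x K filter
{
    const int H = img.size(), W = img[0].size(), K = filt.size();
    std::vector<std::vector<float>> out(H - K + 1,
                                        std::vector<float>(W - K + 1, 0.0f));
    for (int r = 0; r < H - K + 1; ++r)
        for (int c = 0; c < W - K + 1; ++c) {
            float acc = 0.0f;
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
                    acc += img[r + i][c + j] * filt[i][j];  // multiply and sum
            out[r][c] = std::max(0.0f, acc);                // ReLU
        }
    return out;
}
```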
The pooling layer processes data from convolutional operations and activation functions
to reduce the feature map size. This reduction decreases computational complexity,
speeds up model convergence, and enhances robustness against overfitting. Additionally,
the pooling layer maintains recognition of patterns despite shifts in their positions
within the feature map, making it robust to spatial variations in the image.
Two common types of pooling in CNNs are max-pooling and average pooling. Max-pooling
extracts the maximum value from a specified region (e.g., 2×2, 3×3) of the input feature
map, capturing significant features and reducing image size while preserving important
information. Average pooling, on the other hand, takes the average value from a region,
retaining smaller-scale features and being more robust to noise.
The pooling layer has key parameters: pooling size and stride. The pooling size determines
the region’s dimensions, while the stride indicates the step size of the pooling operation.
Proper tuning of these parameters helps control the output size, retain relevant information,
and manage model complexity. Pooling reduces computational load, speeds up training,
and mitigates overfitting by reducing dimensionality. However, it may cause some
information loss and decreased accuracy, and its fixed size can limit its applicability
to images of various sizes.
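For instance, a minimal max-pooling sketch parameterized by pooling size and stride (names are illustrative; the output size follows floor((input − pool)/stride) + 1) is:

```cpp
#include <vector>
#include <algorithm>

// Sketch of 2D max-pooling with configurable pooling size and stride.
std::vector<std::vector<float>> max_pool(
        const std::vector<std::vector<float>>& fmap,  // input feature map
        int pool, int stride)
{
    const int H = fmap.size(), W = fmap[0].size();
    const int outH = (H - pool) / stride + 1;
    const int outW = (W - pool) / stride + 1;
    std::vector<std::vector<float>> out(outH, std::vector<float>(outW));
    for (int r = 0; r < outH; ++r)
        for (int c = 0; c < outW; ++c) {
            float m = fmap[r * stride][c * stride];
            for (int i = 0; i < pool; ++i)
                for (int j = 0; j < pool; ++j)
                    m = std::max(m, fmap[r * stride + i][c * stride + j]);
            out[r][c] = m;   // maximum of the pooling window
        }
    return out;
}
```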
2.3. Binary Neural Network (BNN)
CNNs demand substantial computation and memory, making them unsuitable for low-end
devices or energy-constrained environments. To address this, researchers have focused
on binary neural networks (BNNs). BNNs reduce computational and memory requirements
by binarizing weights and activation values into +1 or -1 using a sign function. This
binarization allows efficient processing with bitwise operations like XOR, AND, and
bit shifts, replacing traditional multiplication and addition.
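As a minimal illustration of this binarization (the helper names and the encoding of +1 as bit 1 and −1 as bit 0 are assumptions commonly used in such implementations, not taken from a specific design):

```cpp
#include <cstdint>

// Binarize a real-valued weight or activation to +1 / -1 with the sign function.
inline int binarize(float x) {
    return (x >= 0.0f) ? +1 : -1;
}

// Encode +1 as bit 1 and -1 as bit 0, so that the +1/-1 product of two
// values maps to an XNOR of their bits in hardware.
inline uint32_t to_bit(int b) {
    return (b > 0) ? 1u : 0u;
}
```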
BNNs operate similarly to conventional CNNs but use binarized weights and activation
values, enabling faster hardware computations. They include convolutional, pooling,
and fully connected layers, with computations performed on binarized values. BNN learning
relies on backpropagation but requires alternative binarization functions and approximation
methods, such as the straight-through estimator, for weight updates. Standard optimization
techniques like SGD, Momentum, AdaGrad, and Adam can be used.
BNNs achieve computational efficiency by binarizing weights and activation values,
but this can reduce accuracy due to information loss and limited network expressiveness.
Applying standard learning algorithms directly to BNNs is challenging, requiring modifications
to handle binarized values. Therefore, finding suitable optimization methods is crucial
to address these issues and improve the learning process.
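For reference, one common formulation of the straight-through estimator mentioned above binarizes in the forward pass and passes the gradient through a clipped identity in the backward pass (the clipping threshold of 1 is a typical choice, not necessarily the one used in this work):

$$ y = \operatorname{sign}(x), \qquad \frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial y}\cdot \mathbf{1}_{|x| \le 1}. $$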
3. BNN Accelerator with Low-Power Accumulator
The proposed accelerator for BNNs incorporates approximation bit operations and is
implemented using the Xilinx Vivado high-level synthesis (HLS) tool targeting the
Xilinx Zynq-7000 FPGA. The evaluation is based on the Modified National Institute
of Standards and Technology (MNIST) dataset, which consists of handwritten digits.
To assess the hardware resource utilization and accuracy, measurements are performed
using the HLS tool. The tool provides insights into the utilization of hardware resources
such as logic elements, memory, and digital signal processing (DSP) blocks. Additionally,
accuracy metrics can be obtained through evaluations based on the MNIST dataset. The
values for hardware resources, speed, and power consumption are compared using the
results obtained from FPGA synthesis.
The proposed BNN accelerator incorporates loop-flattening and pipelining techniques,
along with a novel approximation accumulator design. These optimizations aim to reduce
the latency and power consumption.
The loop-flattening and pipelining techniques are employed to minimize the latency
of the computations. Loop parallelization enables concurrent execution of independent
operations, reducing the overall execution time. Pipelining breaks down the computation
into a series of stages, allowing for overlapping of operations and reducing the overall
latency.
Furthermore, the new approximation accumulator design is implemented, which includes
pop-count and selective bit operations. This design efficiently performs the required
binary operations, reducing the power consumption.
By applying loop parallelization, pipelining and the new approximation accumulator
design, the proposed BNN accelerator reduces the latency and power consumption compared
to those of conventional designs.
Fig. 3. Overall architecture of the designed binary neural network (BNN).
The overall structure of the BNN is shown in Fig. 3. The resource utilization of each layer is measured, and by concentrating optimization
on the layer with the highest resource consumption, an efficient accelerator model
based on bit operations can be designed.
The design and algorithms for the accelerator model are discussed in detail in their
respective sections. The focus is on optimizing the computations by capitalizing on
the layer consuming the most hardware resources. By efficiently utilizing the capabilities
of the bit operations in this layer, the accelerator model achieves enhanced computational
efficiency.
3.1. Pipelining and Loop-Flattening
Fig. 4 provides an overview of the Vivado HLS software. Vivado HLS allows designs to be
described and verified through simulation using high-level languages such as C, C++,
and SystemC. Once verification is complete, the design is generated at the register
transfer level (RTL), in either Verilog hardware description language (HDL) or very
high-speed integrated circuit HDL (VHDL), and is packaged as an intellectual property (IP)
core.
Fig. 4. Vivado high-level synthesis (HLS) overview: (a) block diagram, (b) design
flow.
Fig. 4(b) shows the flow of HLS. It is similar to the traditional RTL design flow but includes
an additional HLS coding flow. The advantages of HLS include a significant reduction
in design time and cost-effectiveness. Compared to the relatively high entry barrier
to using Verilog HDL, HLS using C-based optimization methods allows for a faster design
implementation. It also helps reduce the design time in critical areas such as synthesis,
place and routing (P&R), verification, and timing closure.
- High-level languages like C enable faster coding and easier optimization compared to low-level languages like Verilog HDL.
- HLS provides automated optimization techniques, improving the efficiency and performance of the design.
- HLS reduces the effort and time required for verification by leveraging the abstractions and simulation capabilities of the high-level language.
- The generated RTL code can be seamlessly integrated with the rest of the design flow (including the synthesis, P&R, and timing closure processes), further reducing the design time.
Overall, Vivado HLS offers a more efficient and streamlined design flow, allowing
designers to achieve faster and more cost-effective design implementations than those
from traditional RTL design methods.
Regarding the reduction in coding time, the amount of code required can differ significantly
between designing in Verilog and in Vivado HLS. For example, when designing pipelining,
Vivado HLS allows a simple implementation with just one line by using a directive (pragma).
In contrast, in Verilog HDL, the entire design needs to be manually coded, resulting
in a substantial amount of code. The reduced
coding effort in Vivado HLS not only reduces the design time but also decreases the
amount of code that needs to be debugged. Consequently, within the same time frame,
Vivado HLS enables the design of a wider range of architectures compared to Verilog
HDL.
Vivado HLS offers higher-level abstractions and optimization techniques that simplify
the design process and reduce the required amount of manual coding. By utilizing directives
and built-in libraries, designers can express their intentions in a more concise and
efficient manner. This leads to faster design exploration and parameter tuning, as
well as reduced debugging efforts.
Therefore, Vivado HLS provides a more productive and streamlined development experience,
allowing designers to achieve a broader range of architectural designs within the
same time constraints compared to traditional Verilog HDL coding.
Achieving timing closure, especially when working with high-frequency clocks, can
be challenging. However, Vivado HLS addresses this issue by incorporating library
information and target frequency details during the generation of the Verilog code
from the C description. This allows for the creation of RTL code able to operate at
the desired timing for high-frequency designs.
By providing library information, Vivado HLS ensures that the generated RTL code leverages
optimized and pre-characterized components capable of meeting the specified timing
requirements. Additionally, the target frequency information helps guide the HLS optimization
process to achieve the desired performance.
However, Vivado HLS has limitations. One of the main drawbacks is that the efficiency
of the conversion from C to Verilog may not be optimal. When designers manually write
Verilog code, they have more control over the hardware resources and can optimize
the design to make more efficient use of those resources. However, when using Vivado
HLS, the efficiency of the generated Verilog code depends on the designer's proficiency
and the quality of the HLS tool's optimizations.
As Vivado HLS performs automated transformations and optimizations based on the C
description, it may generate code that utilizes more hardware resources than necessary.
This can lead to suboptimal resource utilization and potentially affect performance
or resource constraints. Designers may need to manually optimize the generated Verilog
code or fine-tune the directives of the HLS tool to achieve the desired efficiency.
Overall, the trade-off between efficiency and productivity is a consideration when
using Vivado HLS, and designers need to carefully evaluate their requirements, constraints,
and design goals to determine the most suitable approach for their specific project.
Loop-flattening and pipelining can be easily implemented using the Xilinx Vivado HLS
tool with the provided commands. Fig. 5 illustrates the concepts of loop-flattening and pipelining. Pipelining reduces the
latency by overlapping the execution of multiple operations, but at the cost of increased
hardware resource utilization. In contrast, loop parallelization can save hardware
resources but may result in increased latency compared to pipelining.
Fig. 5. Pipelining and loop-flattening optimization process. (a) Non-optimization.
(b) Pipelining. (c) Loop-flattening.
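As a sketch of how these optimizations are expressed in Vivado HLS (the loop names, bounds, and the packed-XNOR body are illustrative assumptions, not the accelerator's actual source), each technique requires only a single directive placed inside the loop:

```cpp
// Illustrative nested loop with Vivado HLS directives.
// PIPELINE overlaps successive iterations of the loop; LOOP_FLATTEN
// collapses the perfect loop nest into a single loop, removing
// loop-entry overhead.
void xnor_map(const unsigned int act[32][32],
              const unsigned int wgt[32][32],
              unsigned int out[32][32]) {
Row:
    for (int r = 0; r < 32; ++r) {
Col:
        for (int c = 0; c < 32; ++c) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_FLATTEN
            // bitwise XNOR of packed binary activations and weights
            out[r][c] = ~(act[r][c] ^ wgt[r][c]);
        }
    }
}
```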
To optimize the hardware resource usage, an analysis of the hardware resource utilization
is conducted for each layer, as shown in Fig. 6 and Table 1. Based on this analysis, optimization techniques such as loop parallelization or
pipelining can be selectively applied to the layer that utilizes the most hardware
resources. This allows for a more efficient utilization of the available hardware
resources while considering the trade-off between latency and resource usage.
Fig. 6. Hardware usage for each layer based on BNN.
Table 1. Hardware usage for each layer based on binary neural network (BNN).
| Metric | Block RAM (BRAM_18K) (unit) | Digital signal processing (DSP48E) (unit) | Flip-Flop (unit) | Look-Up Table (LUT) (unit) |
| Main | 0 | 0 | 239 | 160 |
| Convolution_1 | 64 | 28 | 5053 | 9523 |
| Convolution_2 | 32 | 49 | 5248 | 11087 |
| Fully connected_1 | 0 | 25 | 3986 | 7375 |
| Fully connected_2 | 0 | 0 | 2132 | 1709 |
| Pooling | 0 | 0 | 256 | 963 |
By carefully analyzing the resource requirements of each layer and applying the appropriate
optimization techniques, designers can achieve an optimal balance between hardware
resource utilization and performance in their design.
The first convolutional layer in the BNN design performs a preprocessing step of converting
the input image into binary values (0 and 1). As a result, it tends to have a higher
memory usage compared to the second convolutional layer. Additionally, convolutional
layers generally account for a significant portion of the overall computations in
a BNN design.
To optimize the computational and memory requirements, pipeline and loop parallelization
techniques are applied to the convolutional layer with the highest computational and
memory demands. In contrast, for the pooling layer (which typically involves fewer
computations), only loop parallelization is applied. By selectively applying these
optimization techniques, a balance can be struck between accuracy and hardware resource
utilization, considering the convolutional layer with higher computational demands
and the pooling layer with relatively lower computational demands.
Table 2 shows the accuracy and hardware resource utilization when only pipeline
optimization is applied to the computation-intensive parts of the existing BNN. The
utilization of Block RAM (BRAM) and flip-flops increases noticeably, whereas the
utilization of look-up tables and DSPs decreases slightly; the latency decreases
by 12%.
Table 2. Accuracy and hardware resource consumption of pipelined BNNs.
| Metric | Conventional BNN | Pipelined BNN | Pipelined vs conventional |
| Accuracy (%) | 94.51 | 94.51 | 100% |
| Flip-Flop (unit) | 21088 | 24613 | 116% |
| Look-Up Table (unit) | 39133 | 35455 | 90% |
| DSP (unit) | 110 | 103 | 93% |
| BRAM_18K (unit) | 100 | 184 | 184% |
| Power (W) | 1.927 | 1.965 | 101% |
| Latency (Clock) | 140476 | 123865 | 88% |
This indicates that pipeline optimization reduces the computation latency while slightly
reducing the utilization of certain hardware resources. The trade-off between accuracy
and resource usage is an important consideration in BNN optimization, and the results
in Table 2 provide insight into the impact of pipeline optimization on the overall
performance and resource utilization of the BNN.
Table 3. Accuracy and hardware resource consumption of BNNs with pipelining and loop
parallelization.
| Metric | Conventional BNN | Optimized BNN | Optimized vs conventional |
| Accuracy (%) | 94.51 | 94.51 | 100% |
| Flip-Flop (unit) | 21088 | 23806 | 112% |
| Look-Up Table (unit) | 39133 | 38582 | 98% |
| DSP (unit) | 110 | 103 | 93% |
| BRAM_18K (unit) | 100 | 184 | 184% |
| Power (W) | 1.927 | 1.969 | 102% |
| Latency (Clock) | 140476 | 100228 | 71% |
Table 3 shows the accuracy and hardware resource utilization when both loop parallelization
and pipeline optimization are applied to the computation-intensive convolutional layers,
while only loop parallelization is applied to the relatively less compute-intensive
pooling layers. Compared with applying only pipeline optimization, the hardware resource
utilization remains similar, but the latency is reduced by an additional 16%. Overall,
this results in a total latency reduction of 29% compared to the original BNN design.
This indicates that by combining both loop parallelization and pipeline optimization
in the convolutional layers and applying loop parallelization in the pooling layers,
the latency can be further reduced while maintaining a hardware resource utilization
comparable to that of the pipeline optimization-only approach.
Fig. 7 shows the block diagram of the BNN design created using Vivado HLS and implemented
as a Verilog language IP in the Vivado tool. It is designed while targeting the Xilinx
Zynq-7000 FPGA. The bnn_0 IP is responsible for performing all of the operations of
the BNN. The BRAMs are used for storing values such as input images, weights, biases,
and other necessary data.
Fig. 7. Block diagram of BNN design implemented in FPGA.
Table 4 provides the power consumption results of the hardware implementation.
It shows that applying pipelining increases the usage of hardware resources,
leading to increased power consumption. “Processing System 7” (PS7), which represents
the processor subsystem of the Zynq SoC, consumes the largest share of power. Because
the PS7 dominates the overall power consumption in the SoC implementation, the total
power increases by only 1% and 3% (for pipelining and loop-flattening, respectively)
compared to the baseline. Meanwhile, the pipeline and loop-flattening optimizations
result in significant decreases in latency (by 12% and 29%, respectively), highlighting
their advantages in terms of reduced delay.
Table 4. Comparison and breakdown of power consumption of various approaches.
| Metric | Conventional | Pipelining | Loop-flattening | Pipelining vs conventional | Loop-flattening vs conventional |
| Clock (W) | 0.026 | 0.025 | 0.026 | 96% | 100% |
| Signals (W) | 0.049 | 0.058 | 0.06 | 118% | 122% |
| Logic (W) | 0.038 | 0.036 | 0.041 | 94% | 107% |
| BRAM (W) | 0.058 | 0.094 | 0.088 | 162% | 151% |
| DSP (W) | 0.035 | 0.028 | 0.03 | 80% | 85% |
| Processing System 7 (PS7) (W) | 1.564 | 1.564 | 1.564 | 100% | 100% |
| Total power (W) | 1.771 | 1.804 | 1.808 | 101% | 103% |
3.2. Pop-Count Operation
The pop-count operation is an operation used in digital circuit design. Traditional
multiply-and-accumulate operations involve the use of adders and multipliers and consume
a significant amount of hardware resources. The pop-count operation counts the number
of 1s in a binary representation of a value. This operation can be implemented easily
and allows for faster computation with reduced hardware resources.
In BNNs, both the weights and activation values are binarized, and the pop-count operation
is commonly used to compute the inner product of two binary vectors. For example, consider
two vectors A = [1, 1, −1, −1] and B = [1, −1, −1, 1], encoded bitwise as 1100 and 1001
(with +1 mapped to 1 and −1 mapped to 0). Performing an XNOR operation yields
A⊙B = 1010, which marks the positions where the two vectors agree. Applying the
pop-count operation to this result gives popcount(1010) = 2, and the dot product is
recovered as 2 × 2 (number of 1s) − 4 (length of the vector) = 0, which matches the
direct computation A · B = (1 × 1) + (1 × −1) + (−1 × −1) + (−1 × 1) = 0. The pop-count
operation is thus used in BNNs to maximize computation speed, leveraging their binary
nature.
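A compact sketch of this XNOR/pop-count dot product for word-packed binary vectors is shown below; it uses the GCC/Clang __builtin_popcount intrinsic, and the function and variable names are illustrative assumptions:

```cpp
#include <cstdint>

// Dot product of two binary vectors packed into a 32-bit word, where
// bit 1 encodes +1 and bit 0 encodes -1 and only the lowest n bits are valid.
int binary_dot(uint32_t a, uint32_t b, int n) {
    uint32_t agree = ~(a ^ b);                // XNOR: 1 where the elements match
    if (n < 32) agree &= (1u << n) - 1;       // mask out unused high bits
    int matches = __builtin_popcount(agree);  // pop-count
    return 2 * matches - n;                   // matches minus mismatches
}

// Example from the text: A = 1100, B = 1001 (n = 4) gives
// agree = 1010, popcount = 2, and dot = 2*2 - 4 = 0.
```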
3.3. Proposed Low-Power Accumulator
BNNs are neural networks with weights and activation values limited to +1 or −1, allowing
for more efficient learning and inference compared to ANNs. Whereas ANNs mainly use
multiplication and addition operations, BNNs employ bitwise operations such as XNOR,
providing significant advantages in memory usage. However, as the amount of learning
increases, more hardware resources are required. Correspondingly, many studies have
addressed this issue. One of the proposed solutions involves using accumulators to
add the results of previous operations to those of the current operation in each layer;
however, this approach consumes a significant amount of hardware resources.
Fig. 8. Conventional accumulator structure.
Fig. 8 depicts a general accumulator for accumulating and storing calculation results. In
the case of two operands involved in a computation, one operand is stored in the accumulator
register, whereas the other is fetched from memory or another register for the operation.
In BNNs, the accumulator is responsible for accumulating the results of the binary
dot product operations. When calculating the neuron output values in each layer, the
accumulator accumulates the binary dot product of the previous layer’s output values
and weights, then outputs them as the current layer’s output values. These output
values are then fed as inputs to the next layer.
The accumulator operation method in BNNs reduces the computational complexity and
memory usage compared to those in other neural networks. As a result, BNNs can be
effectively utilized in low-power and resource-limited hardware environments. However,
the drawback is the reduced accuracy, as the representation and computation are limited
to 1 and −1 (unlike in other neural networks).
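As a rough sketch of the conventional accumulation described above (the names are illustrative), each layer simply adds every partial pop-count result into a running register:

```cpp
// Sketch of a conventional accumulator: every partial pop-count result
// is added to the running sum, regardless of how much it contributes.
int accumulate(const int partial[], int num_terms) {
    int acc = 0;
    for (int i = 0; i < num_terms; ++i) {
        acc += partial[i];   // accumulate each binary dot product result
    }
    return acc;
}
```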
Fig. 9. Algorithm of the proposed BNN.
Fig. 9 shows the operation process of the BNN applied in this study. Initially, the input
data and weights undergo bitwise operations, followed by pop-count operations, before
entering the accumulator as inputs. The corresponding formula can be expressed as
follows:
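One way to write this accumulation, as a sketch with notation introduced here ($\odot$ for the bitwise XNOR and $\mathrm{popcount}(\cdot)$ for the number of 1s in its argument), is

$$ \mathrm{acc}_{i} = \mathrm{acc}_{i-1} + \mathrm{popcount}\left(x_{i} \odot w_{i}\right), \qquad \mathrm{acc}_{0} = 0, $$

where $x_i$ and $w_i$ denote the $i$-th packed input and weight words, and the final output is $\mathrm{acc}_{N}$ after all $N$ bit operations.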
The accumulator adds each new value to the previous sum and, upon completing all bit
operations, outputs the final result. The pop-count operation is used to identify the
number of 1s; it is frequently employed in BNNs for computational optimization, as
they predominantly involve bit operations.
BNNs primarily focus on the computational speed and hardware resource usage rather
than the accuracy. Therefore, it is not necessary to accurately compute the cumulative
operations at each layer. It is more efficient to maintain a level of accuracy sufficient
for recognition while reducing the power consumption and hardware resource usage.
Not all bits have an equal impact on recognition during cumulative operations, so
it is inefficient to perform cumulative operations for bits with a relatively lower
impact. Consequently, the accumulator does not perform cumulative operations for bits
with less impact on the accuracy, and instead directly outputs the previous value.
Fig. 10 shows the accuracy obtained when the cumulative operation is skipped for each
bit position. Skipping the 32nd bit still yields a high accuracy of 94.7%, whereas
skipping the 6th bit drops the accuracy to 68.8%. This confirms that the 6th bit has
a more significant influence on the final classification than the 32nd bit. Based on
these results, a low-power accumulator is proposed herein.
Fig. 10. Accuracy when not performing operation for each bit.
Fig. 11 illustrates the proposed low-power accumulator. Conventional accumulators sequentially
compute each bit, thereby consuming significant amounts of computational time and
hardware resources. In contrast, the proposed accumulator performs parallel operations
while adding an enable signal to each adder. If the enable signal is ‘1,’ the operation
proceeds as usual; however, if the enable signal is ‘0,’ the operation does not take
place. By conducting parallel operations with the same number of adders as before,
the computational speed increases, and the addition of the enable signal to each adder
significantly reduces the hardware resource usage.
Fig. 11. Proposed accumulator structure.
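A behavioral sketch of this enable-gated parallel accumulation is given below; the word width, array layout, and names are assumptions made for illustration, not the exact RTL of the proposed design:

```cpp
// Behavioral sketch of the enable-gated accumulator: each adder updates
// its partial sum only when its enable bit is '1'; otherwise the previous
// value is passed through unchanged.
void gated_accumulate(const int partial[32],   // parallel partial results
                      const bool enable[32],   // per-adder enable signals
                      int acc[32])             // accumulator registers
{
    for (int i = 0; i < 32; ++i) {
        if (enable[i]) {
            acc[i] += partial[i];   // normal accumulation when enabled
        }
        // when enable[i] == 0, acc[i] simply keeps its previous value
    }
}
```

In this sketch, bits found to have little influence on the final classification (such as the 32nd bit in Fig. 10) would have their enable signals tied to '0', so their accumulation is skipped.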
3.4. Proposed Accumulator-Based BNN
Fig. 3 shows the overall architecture of the BNN with the proposed low-power accumulator,
as designed using the MNIST dataset. The MNIST dataset consists of images of ten
handwritten digits, each with a size of 28 × 28 pixels. When the input data is received,
the pixel values of the image are converted to 0 or 1. The preprocessed image data
is then fed into the input layer. The data output from the input layer serves as the
input for the convolution layer, which extracts image features using 32 filters. After
the convolution operation is completed, max-pooling is performed.
Then, 64 filters are applied again, and pooling is performed once more before the
data is converted into a one-dimensional format. The one-dimensional data passes through
a fully connected layer to ultimately perform the classification.
The image data is converted to 0 or 1 through the preprocessing process, and other
data such as weights and biases are also binarized. Therefore, most operations are
performed using XOR or bitwise operators.
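A minimal sketch of this preprocessing step, thresholding the 28 × 28 grayscale pixels to 0 or 1 (the threshold value of 128 and the function name are assumptions for illustration), is:

```cpp
#include <cstdint>

// Sketch of MNIST input preprocessing: threshold each 8-bit grayscale
// pixel of the 28 x 28 image to a binary value (0 or 1).
void binarize_image(const uint8_t img[28][28], uint8_t bin[28][28]) {
    const uint8_t threshold = 128;   // illustrative threshold
    for (int r = 0; r < 28; ++r)
        for (int c = 0; c < 28; ++c)
            bin[r][c] = (img[r][c] >= threshold) ? 1 : 0;
}
```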