1. Introduction
In recent years, deep learning and artificial neural networks (ANNs) have achieved
remarkable results in a variety of fields. ANNs have become a cornerstone of artificial
intelligence research by outperforming conventional algorithms in areas such as image
recognition, speech recognition, and natural language processing. However, the computational
complexity and memory requirements of neural networks make them difficult to deploy
in resource-constrained environments such as mobile and embedded systems. To address
these issues, extensive research has been conducted on various neural network architectures [1-5].
Convolutional neural networks (CNNs) have garnered the most attention for efficiently
performing computations in the field of image recognition. In contrast to conventional
deep neural networks (DNNs) with a fully connected layer structure, CNNs incorporate
convolutional and pooling layers, making them highly efficient in learning the structures
and patterns of images.
The success of CNNs has led to extensive research on various CNN architectures and
their variations for a wide range of visual problems. For instance, models such as VGGNet,
ResNet, and Inception have demonstrated outstanding performance on image classification
and object recognition tasks. Moreover, this success has influenced other areas of
research, leading to the development and exploration of CNN-based models in various
domains such as speech recognition, natural language processing, and reinforcement
learning.
Despite their success, large models remain computationally intensive. Researchers
have explored alternatives such as fixed-point representations or removing certain layers
to improve efficiency, alongside growing interest in binary neural networks (BNNs) [6-10].
BNNs binarize weights and activations to −1 or +1, offering substantial reductions
in memory and power consumption [8,11,12]. This is particularly advantageous when deploying neural networks in resource-constrained
environments such as edge devices.
However, this binarization often leads to accuracy losses compared to conventional
networks, prompting investigations into techniques such as batch normalization, multi-bit
neural networks, and various activation functions to mitigate these drawbacks while
improving efficiency [13-15]. Recent advances include models such as XNOR-Net, which uses binary weights and activations
to reduce computational demands without sacrificing accuracy [8,11,16]. In addition, active research focuses on optimizing computations by approximating
accumulators.
This paper introduces BNNs while focusing on the optimization techniques for BNN computations.
Furthermore, it proposes a novel approach for accumulators and analyzes the impacts
on accuracy, power, and hardware utilization when this approach is applied to FPGA-based
BNNs. The remainder of this paper proceeds as follows. Section 2 provides a brief
overview of various neural networks. Section 3 explains the proposed BNN accelerator
with a low-power accumulator. Experimental results are presented in Section 4. Finally,
Section 5 concludes the paper.
2. Background of Neural Networks
2.1. Deep Neural Network (DNN)
The DNN is a specific type of ANN comprising multiple hidden layers, as illustrated
in Fig. 1. It has garnered significant attention in tandem with the advancements in machine
learning models. DNNs excel in effectively learning complex nonlinear patterns, resulting
in high performance across diverse domains.
Fig. 1. General deep neural network.
The input layer receives data, the hidden layers extract complex patterns, and the output
layer produces results. The number of neurons in the input layer matches the dimensionality
of the input data, the hidden layers vary in number and size, and the number of neurons
in the output layer corresponds to the network's task. Activation functions add the
nonlinearity that is vital for learning intricate patterns; the choice of function affects
neuron outputs, and each function has its own advantages and disadvantages. Common
types include the sigmoid, hyperbolic tangent, rectified linear unit (ReLU), and leaky
ReLU.
Fig. 2. Neuron computational process.
Forward propagation in neural networks involves multiplying input data (e.g., x1)
by corresponding weights (e.g., wi1), summing the results, and passing them through
an activation function to generate predictions as shown in Fig. 2. Backpropagation is crucial for error reduction, adjusting weights based on output
layer errors using methods like gradient descent. This iterative process refines weights,
minimizing overall error and enhancing network performance. Gradient descent optimizes
weight updates based on loss function gradients but faces challenges such as local minima
and slow convergence.
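As an illustration of this neuron-level computation, a minimal sketch of a single neuron's forward pass is given below (the function name, the bias term, and the choice of ReLU as the activation are illustrative assumptions, not taken from a specific implementation):

```cpp
#include <vector>
#include <algorithm>

// Minimal sketch of one neuron's forward pass: a weighted sum of the
// inputs plus a bias, followed by a ReLU activation (illustrative choice).
float neuron_forward(const std::vector<float>& x,
                     const std::vector<float>& w,
                     float bias) {
    float sum = bias;
    for (size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * w[i];            // multiply-and-accumulate
    }
    return std::max(0.0f, sum);        // ReLU activation
}
```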
2.2. Convolutional Neural Network (CNN)
CNNs excel in image recognition and processing by learning regional patterns and analyzing
spatial structures through convolution and pooling operations. The convolution layer
applies filters (e.g., 3×3 or 5×5) to the input data to create a feature map. These
filters, smaller than the input data, are adjustable and are used to detect local
patterns in the image. In the convolution operation, the filter moves across the input
image, multiplying and summing overlapping data regions to produce results. This output
then feeds into the activation function, commonly ReLU in CNNs. ReLU outputs zero
for inputs less than zero and the input value itself for non-negative inputs.
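A simplified sketch of this sliding-window operation for a single channel, assuming a stride of 1 and no padding (the function name and sizes are illustrative), could look as follows:

```cpp
#include <vector>
#include <algorithm>

// Sketch of a single-channel 2D convolution followed by ReLU
// (stride 1, no padding; sizes are illustrative).
std::vector<std::vector<float>> conv2d_relu(
        const std::vector<std::vector<float>>& img,   // H x W input
        const std::vector<std::vector<float>>& filt)  // K x K filter
{
    const int H = img.size(), W = img[0].size(), K = filt.size();
    std::vector<std::vector<float>> out(H - K + 1,
                                        std::vector<float>(W - K + 1, 0.0f));
    for (int r = 0; r < H - K + 1; ++r)
        for (int c = 0; c < W - K + 1; ++c) {
            float acc = 0.0f;
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
                    acc += img[r + i][c + j] * filt[i][j];  // multiply and sum
            out[r][c] = std::max(0.0f, acc);                // ReLU
        }
    return out;
}
```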
The pooling layer processes data from convolutional operations and activation functions
to reduce the feature map size. This reduction decreases computational complexity,
speeds up model convergence, and enhances robustness against overfitting. Additionally,
the pooling layer maintains recognition of patterns despite shifts in their positions
within the feature map, making it robust to spatial variations in the image.
Two common types of pooling in CNNs are max-pooling and average pooling. Max-pooling
extracts the maximum value from a specified region (e.g., 2×2, 3×3) of the input feature
map, capturing significant features and reducing image size while preserving important
information. Average pooling, on the other hand, takes the average value from a region,
retaining smaller-scale features and being more robust to noise.
The pooling layer has key parameters: pooling size and stride. The pooling size determines
the region’s dimensions, while the stride indicates the step size of the pooling operation.
Proper tuning of these parameters helps control the output size, retain relevant information,
and manage model complexity. Pooling reduces computational load, speeds up training,
and mitigates overfitting by reducing dimensionality. However, it may cause some
information loss and decreased accuracy, and its fixed size can limit its applicability
to images of various sizes.
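For instance, a minimal max-pooling sketch parameterized by pooling size and stride (names are illustrative; the output size follows floor((input − pool)/stride) + 1) is:

```cpp
#include <vector>
#include <algorithm>

// Sketch of 2D max-pooling with configurable pooling size and stride.
std::vector<std::vector<float>> max_pool(
        const std::vector<std::vector<float>>& fmap,  // input feature map
        int pool, int stride)
{
    const int H = fmap.size(), W = fmap[0].size();
    const int outH = (H - pool) / stride + 1;
    const int outW = (W - pool) / stride + 1;
    std::vector<std::vector<float>> out(outH, std::vector<float>(outW));
    for (int r = 0; r < outH; ++r)
        for (int c = 0; c < outW; ++c) {
            float m = fmap[r * stride][c * stride];
            for (int i = 0; i < pool; ++i)
                for (int j = 0; j < pool; ++j)
                    m = std::max(m, fmap[r * stride + i][c * stride + j]);
            out[r][c] = m;   // maximum of the pooling window
        }
    return out;
}
```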
2.3. Binary Neural Network (BNN)
CNNs demand substantial computation and memory, making them unsuitable for low-end
devices or energy-constrained environments. To address this, researchers have focused
on binary neural networks (BNNs). BNNs reduce computational and memory requirements
by binarizing weights and activation values into +1 or -1 using a sign function. This
binarization allows efficient processing with bitwise operations like XOR, AND, and
bit shifts, replacing traditional multiplication and addition.
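As a minimal illustration of this binarization (the helper names and the encoding of +1 as bit 1 and −1 as bit 0 are assumptions commonly used in such implementations, not taken from a specific design):

```cpp
#include <cstdint>

// Binarize a real-valued weight or activation to +1 / -1 with the sign function.
inline int binarize(float x) {
    return (x >= 0.0f) ? +1 : -1;
}

// Encode +1 as bit 1 and -1 as bit 0, so that the +1/-1 product of two
// values maps to an XNOR of their bits in hardware.
inline uint32_t to_bit(int b) {
    return (b > 0) ? 1u : 0u;
}
```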
BNNs operate similarly to conventional CNNs but use binarized weights and activation
values, enabling faster hardware computations. They include convolutional, pooling,
and fully connected layers, with computations performed on binarized values. BNN learning
relies on backpropagation but requires alternative binarization functions and approximation
methods, such as the straight-through estimator, for weight updates. Standard optimization
techniques like SGD, Momentum, AdaGrad, and Adam can be used.
BNNs achieve computational efficiency by binarizing weights and activation values,
but this can reduce accuracy due to information loss and limited network expressiveness.
Applying standard learning algorithms directly to BNNs is challenging, requiring modifications
to handle binarized values. Therefore, finding suitable optimization methods is crucial
to address these issues and improve the learning process.
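For reference, one common formulation of the straight-through estimator mentioned above binarizes in the forward pass and passes the gradient through a clipped identity in the backward pass (the clipping threshold of 1 is a typical choice, not necessarily the one used in this work):

$$ y = \operatorname{sign}(x), \qquad \frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial y}\cdot \mathbf{1}_{|x| \le 1}. $$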
3. BNN Accelerator with Low-Power Accumulator
The proposed accelerator for BNNs incorporates approximation bit operations and is
implemented using the Xilinx Vivado high-level synthesis (HLS) tool targeting the
Xilinx Zynq-7000 FPGA. The evaluation is based on the Modified National Institute
of Standards and Technology (MNIST) dataset, which consists of handwritten digits.
To assess the hardware resource utilization and accuracy, measurements are performed
using the HLS tool. The tool provides insights into the utilization of hardware resources
such as logic elements, memory, and digital signal processing (DSP) blocks. Additionally,
accuracy metrics can be obtained through evaluations based on the MNIST dataset. The
values for hardware resources, speed, and power consumption are compared using the
results obtained from FPGA synthesis.
The proposed BNN accelerator incorporates loop-flattening and pipelining techniques,
along with a novel approximation accumulator design. These optimizations aim to reduce
the latency and power consumption.
The loop-flattening and pipelining techniques are employed to minimize the latency
of the computations. Loop parallelization enables concurrent execution of independent
operations, reducing the overall execution time. Pipelining breaks down the computation
into a series of stages, allowing for overlapping of operations and reducing the overall
latency.
Furthermore, the new approximation accumulator design is implemented, which includes
pop-count and selective bit operations. This design efficiently performs the required
binary operations, reducing the power consumption.
By applying loop parallelization, pipelining and the new approximation accumulator
design, the proposed BNN accelerator reduces the latency and power consumption compared
to those of conventional designs.
Fig. 3. Overall architecture of the designed binary neural network (BNN).
The overall structure of the BNN is shown in Fig. 3. The resource utilization of each layer is measured, and by concentrating optimization
on the layer with the highest resource consumption, an efficient accelerator model
based on bit operations can be designed.
The design and algorithms for the accelerator model are discussed in detail in their
respective sections. The focus is on optimizing the computations by capitalizing on
the layer consuming the most hardware resources. By efficiently utilizing the capabilities
of the bit operations in this layer, the accelerator model achieves enhanced computational
efficiency.
3.1. Pipelining and Loop-Flattening
Fig. 4 provides an overview of the Vivado HLS software. Vivado HLS allows designs to be
described and verified through simulation using high-level languages such as C, C++,
and SystemC. Once verification is complete, the design is generated at the register
transfer level (RTL), in either Verilog hardware description language (HDL) or very
high-speed integrated circuit HDL (VHDL), and is packaged as an intellectual property (IP)
core.
Fig. 4. Vivado high-level synthesis (HLS) overview: (a) block diagram, (b) design
flow.
Fig. 4(b) shows the flow of HLS. It is similar to the traditional RTL design flow but includes
an additional HLS coding flow. The advantages of HLS include a significant reduction
in design time and cost-effectiveness. Compared to the relatively high entry barrier
to using Verilog HDL, HLS using C-based optimization methods allows for a faster design
implementation. It also helps reduce the design time in critical areas such as synthesis,
place and routing (P&R), verification, and timing closure.
- High-level languages like C enable faster coding and easier optimization compared to low-level languages like Verilog HDL.
- HLS provides automated optimization techniques, improving the efficiency and performance of the design.
- HLS reduces the effort and time required for verification by leveraging the abstractions and simulation capabilities of the high-level language.
- The generated RTL code can be seamlessly integrated with the rest of the design flow (including the synthesis, P&R, and timing closure processes), further reducing the design time.
Overall, Vivado HLS offers a more efficient and streamlined design flow, allowing
designers to achieve faster and more cost-effective design implementations than those
from traditional RTL design methods.
Regarding the reduction in coding time, the amount of code required can differ significantly
between designing in Verilog and in Vivado HLS. For example, when designing pipelining,
Vivado HLS allows a simple implementation with just one line by using a directive (pragma).
In contrast, in Verilog HDL, the entire design needs to be manually coded, resulting
in a substantial amount of code. The reduced
coding effort in Vivado HLS not only reduces the design time but also decreases the
amount of code that needs to be debugged. Consequently, within the same time frame,
Vivado HLS enables the design of a wider range of architectures compared to Verilog
HDL.
Vivado HLS offers higher-level abstractions and optimization techniques that simplify
the design process and reduce the required amount of manual coding. By utilizing directives
and built-in libraries, designers can express their intentions in a more concise and
efficient manner. This leads to faster design exploration and parameter tuning, as
well as reduced debugging efforts.
Therefore, Vivado HLS provides a more productive and streamlined development experience,
allowing designers to achieve a broader range of architectural designs within the
same time constraints compared to traditional Verilog HDL coding.
Achieving timing closure, especially when working with high-frequency clocks, can
be challenging. However, Vivado HLS addresses this issue by incorporating library
information and target frequency details during the generation of the Verilog code
from the C description. This allows for the creation of RTL code able to operate at
the desired timing for high-frequency designs.
By providing library information, Vivado HLS ensures that the generated RTL code leverages
optimized and pre-characterized components capable of meeting the specified timing
requirements. Additionally, the target frequency information helps guide the HLS optimization
process to achieve the desired performance.
However, Vivado HLS has limitations. One of the main drawbacks is that the efficiency
of the conversion from C to Verilog may not be optimal. When designers manually write
Verilog code, they have more control over the hardware resources and can optimize
the design to make more efficient use of those resources. However, when using Vivado
HLS, the efficiency of the generated Verilog code depends on the designer's proficiency
and the quality of the HLS tool's optimizations.
As Vivado HLS performs automated transformations and optimizations based on the C
description, it may generate code that utilizes more hardware resources than necessary.
This can lead to suboptimal resource utilization and potentially affect performance
or resource constraints. Designers may need to manually optimize the generated Verilog
code or fine-tune the directives of the HLS tool to achieve the desired efficiency.
Overall, the trade-off between efficiency and productivity is a consideration when
using Vivado HLS, and designers need to carefully evaluate their requirements, constraints,
and design goals to determine the most suitable approach for their specific project.
Loop-flattening and pipelining can be easily implemented using the Xilinx Vivado HLS
tool with the provided commands. Fig. 5 illustrates the concepts of loop-flattening and pipelining. Pipelining reduces the
latency by overlapping the execution of multiple operations, but at the cost of increased
hardware resource utilization. In contrast, loop parallelization can save hardware
resources but may result in increased latency compared to pipelining.
Fig. 5. Pipelining and loop-flattening optimization process. (a) Non-optimization.
(b) Pipelining. (c) Loop-flattening.
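As a sketch of how these optimizations are expressed in Vivado HLS (the loop names, bounds, and the packed-XNOR body are illustrative assumptions, not the accelerator's actual source), each technique requires only a single directive placed inside the loop:

```cpp
// Illustrative nested loop with Vivado HLS directives.
// PIPELINE overlaps successive iterations of the loop; LOOP_FLATTEN
// collapses the perfect loop nest into a single loop, removing
// loop-entry overhead.
void xnor_map(const unsigned int act[32][32],
              const unsigned int wgt[32][32],
              unsigned int out[32][32]) {
Row:
    for (int r = 0; r < 32; ++r) {
Col:
        for (int c = 0; c < 32; ++c) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_FLATTEN
            // bitwise XNOR of packed binary activations and weights
            out[r][c] = ~(act[r][c] ^ wgt[r][c]);
        }
    }
}
```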
To optimize the hardware resource usage, an analysis of the hardware resource utilization
is conducted for each layer, as shown in Fig. 6 and Table 1. Based on this analysis, optimization techniques such as loop parallelization or
pipelining can be selectively applied to the layer that utilizes the most hardware
resources. This allows for a more efficient utilization of the available hardware
resources while considering the trade-off between latency and resource usage.
Fig. 6. Hardware usage for each layer based on BNN.
Table 1. Hardware usage for each layer based on binary neural network (BNN).
| Metric | Block RAM (BRAM_18K) (unit) | Digital signal processing (DSP48E) (unit) | Flip-Flop (unit) | Look-Up Table (LUT) (unit) |
| Main | 0 | 0 | 239 | 160 |
| Convolution_1 | 64 | 28 | 5053 | 9523 |
| Convolution_2 | 32 | 49 | 5248 | 11087 |
| Fully connected_1 | 0 | 25 | 3986 | 7375 |
| Fully connected_2 | 0 | 0 | 2132 | 1709 |
| Pooling | 0 | 0 | 256 | 963 |
By carefully analyzing the resource requirements of each layer and applying the appropriate
optimization techniques, designers can achieve an optimal balance between hardware
resource utilization and performance in their design.
The first convolutional layer in the BNN design performs a preprocessing step of converting
the input image into binary values (0 and 1). As a result, it tends to have a higher
memory usage compared to the second convolutional layer. Additionally, convolutional
layers generally account for a significant portion of the overall computations in
a BNN design.
To optimize the computational and memory requirements, pipeline and loop parallelization
techniques are applied to the convolutional layer with the highest computational and
memory demands. In contrast, for the pooling layer (which typically involves fewer
computations), only loop parallelization is applied. By selectively applying these
optimization techniques, a balance can be struck between accuracy and hardware resource
utilization, considering the convolutional layer with higher computational demands
and the pooling layer with relatively lower computational demands.
Table 2 shows the accuracy and hardware resource utilization when only pipeline
optimization is applied to the computation-intensive parts of the existing BNN. The
utilization of Block RAM (BRAM) and flip-flops increases noticeably, whereas the
utilization of look-up tables and DSPs decreases slightly; the latency decreases
by 12%.
Table 2. Accuracy and hardware resource consumption of pipelined BNNs.
| Metric | Conventional BNN | Pipelined BNN | Pipelined vs conventional |
| Accuracy (%) | 94.51 | 94.51 | 100% |
| Flip-Flop (unit) | 21088 | 24613 | 116% |
| Look-Up Table (unit) | 39133 | 35455 | 90% |
| DSP (unit) | 110 | 103 | 93% |
| BRAM_18K (unit) | 100 | 184 | 184% |
| Power (W) | 1.927 | 1.965 | 101% |
| Latency (Clock) | 140476 | 123865 | 88% |
This indicates that pipeline optimization reduces the computation latency while slightly
reducing the utilization of certain hardware resources. The trade-off between accuracy
and resource usage is an important consideration in BNN optimization, and the results
in Table 2 provide insight into the impact of pipeline optimization on the overall
performance and resource utilization of the BNN.
Table 3. Accuracy and hardware resource consumption of BNNs with pipelining and loop
parallelization.
| Metric | Conventional BNN | Optimized BNN | Optimized vs conventional |
| Accuracy (%) | 94.51 | 94.51 | 100% |
| Flip-Flop (unit) | 21088 | 23806 | 112% |
| Look-Up Table (unit) | 39133 | 38582 | 98% |
| DSP (unit) | 110 | 103 | 93% |
| BRAM_18K (unit) | 100 | 184 | 184% |
| Power (W) | 1.927 | 1.969 | 102% |
| Latency (Clock) | 140476 | 100228 | 71% |
Table 3 shows the accuracy and hardware resource utilization when both loop parallelization
and pipeline optimization are applied to the computation-intensive convolutional layers,
while only loop parallelization is applied to the relatively less compute-intensive
pooling layers. Compared with applying only pipeline optimization, the hardware resource
utilization remains similar, but the latency is reduced by an additional 16%. Overall,
this results in a total latency reduction of 29% compared to the original BNN design.
This indicates that by combining both loop parallelization and pipeline optimization
in the convolutional layers and applying loop parallelization in the pooling layers,
the latency can be further reduced while maintaining a hardware resource utilization
comparable to that of the pipeline optimization-only approach.
Fig. 7 shows the block diagram of the BNN design created using Vivado HLS and implemented
as a Verilog language IP in the Vivado tool. It is designed while targeting the Xilinx
Zynq-7000 FPGA. The bnn_0 IP is responsible for performing all of the operations of
the BNN. The BRAMs are used for storing values such as input images, weights, biases,
and other necessary data.
Fig. 7. Block diagram of BNN design implemented in FPGA.
Table 4 provides the power consumption results of the hardware implementation.
It shows that applying pipelining increases the usage of hardware resources,
leading to increased power consumption. “Processing System 7” (PS7), which represents
the processor subsystem of the Zynq SoC, consumes the largest share of power. Because
the PS7 dominates the overall power consumption in the SoC implementation, the total
power increases by only 1% and 3% (for pipelining and loop-flattening, respectively)
compared to the baseline. Meanwhile, the pipeline and loop-flattening optimizations
result in significant decreases in latency (by 12% and 29%, respectively), highlighting
their advantages in terms of reduced delay.
Table 4. Comparison and breakdown of power consumption of various approaches.
| Metric | Conventional | Pipelining | Loop-flattening | Pipelining vs conventional | Loop-flattening vs conventional |
| Clock (W) | 0.026 | 0.025 | 0.026 | 96% | 100% |
| Signals (W) | 0.049 | 0.058 | 0.06 | 118% | 122% |
| Logic (W) | 0.038 | 0.036 | 0.041 | 94% | 107% |
| BRAM (W) | 0.058 | 0.094 | 0.088 | 162% | 151% |
| DSP (W) | 0.035 | 0.028 | 0.03 | 80% | 85% |
| Processing System 7 (PS7) (W) | 1.564 | 1.564 | 1.564 | 100% | 100% |
| Total power (W) | 1.771 | 1.804 | 1.808 | 101% | 103% |
3.2. Pop-Count Operation
The pop-count operation is an operation used in digital circuit design. Traditional
multiply-and-accumulate operations involve the use of adders and multipliers and consume
a significant amount of hardware resources. The pop-count operation counts the number
of 1s in a binary representation of a value. This operation can be implemented easily
and allows for faster computation with reduced hardware resources.
In BNNs, both the weights and activation values are binarized, and the pop-count operation
is commonly used to compute the inner product of two binary vectors. For example, consider
two vectors A = [1, 1, −1, −1] and B = [1, −1, −1, 1], encoded bitwise as 1100 and 1001
(with +1 mapped to 1 and −1 mapped to 0). Performing an XNOR operation yields
A⊙B = 1010, which marks the positions where the two vectors agree. Applying the
pop-count operation to this result gives popcount(1010) = 2, and the dot product is
recovered as 2 × 2 (number of 1s) − 4 (length of the vector) = 0, which matches the
direct computation A · B = (1 × 1) + (1 × −1) + (−1 × −1) + (−1 × 1) = 0. The pop-count
operation is thus used in BNNs to maximize computation speed, leveraging their binary
nature.
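A compact sketch of this XNOR/pop-count dot product for word-packed binary vectors is shown below; it uses the GCC/Clang __builtin_popcount intrinsic, and the function and variable names are illustrative assumptions:

```cpp
#include <cstdint>

// Dot product of two binary vectors packed into a 32-bit word, where
// bit 1 encodes +1 and bit 0 encodes -1 and only the lowest n bits are valid.
int binary_dot(uint32_t a, uint32_t b, int n) {
    uint32_t agree = ~(a ^ b);                // XNOR: 1 where the elements match
    if (n < 32) agree &= (1u << n) - 1;       // mask out unused high bits
    int matches = __builtin_popcount(agree);  // pop-count
    return 2 * matches - n;                   // matches minus mismatches
}

// Example from the text: A = 1100, B = 1001 (n = 4) gives
// agree = 1010, popcount = 2, and dot = 2*2 - 4 = 0.
```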
3.3. Proposed Low-Power Accumulator
BNNs are neural networks with weights and activation values limited to +1 or −1, allowing
for more efficient learning and inference compared to ANNs. Whereas ANNs mainly use
multiplication and addition operations, BNNs employ bitwise operations such as XNOR,
providing significant advantages in memory usage. However, as the amount of learning
increases, more hardware resources are required. Correspondingly, many studies have
addressed this issue. One of the proposed solutions involves using accumulators to
add the results of previous operations to those of the current operation in each layer;
however, this approach consumes a significant amount of hardware resources.
Fig. 8. Conventional accumulator structure.
Fig. 8 depicts a general accumulator for accumulating and storing calculation results. In
the case of two operands involved in a computation, one operand is stored in the accumulator
register, whereas the other is fetched from memory or another register for the operation.
In BNNs, the accumulator is responsible for accumulating the results of the binary
dot product operations. When calculating the neuron output values in each layer, the
accumulator accumulates the binary dot product of the previous layer’s output values
and weights, then outputs them as the current layer’s output values. These output
values are then fed as inputs to the next layer.
The accumulator operation method in BNNs reduces the computational complexity and
memory usage compared to those in other neural networks. As a result, BNNs can be
effectively utilized in low-power and resource-limited hardware environments. However,
the drawback is the reduced accuracy, as the representation and computation are limited
to 1 and −1 (unlike in other neural networks).
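As a rough sketch of the conventional accumulation described above (the names are illustrative), each layer simply adds every partial pop-count result into a running register:

```cpp
// Sketch of a conventional accumulator: every partial pop-count result
// is added to the running sum, regardless of how much it contributes.
int accumulate(const int partial[], int num_terms) {
    int acc = 0;
    for (int i = 0; i < num_terms; ++i) {
        acc += partial[i];   // accumulate each binary dot product result
    }
    return acc;
}
```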
Fig. 9. Algorithm of the proposed BNN.
Fig. 9 shows the operation process of the BNN applied in this study. Initially, the input
data and weights undergo bitwise operations, followed by pop-count operations, before
entering the accumulator as inputs. The corresponding formula can be expressed as
follows:
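One way to write this accumulation, as a sketch with notation introduced here ($\odot$ for the bitwise XNOR and $\mathrm{popcount}(\cdot)$ for the number of 1s in its argument), is

$$ \mathrm{acc}_{i} = \mathrm{acc}_{i-1} + \mathrm{popcount}\left(x_{i} \odot w_{i}\right), \qquad \mathrm{acc}_{0} = 0, $$

where $x_i$ and $w_i$ denote the $i$-th packed input and weight words, and the final output is $\mathrm{acc}_{N}$ after all $N$ bit operations.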
The accumulator adds each new value to the previous sum and, upon completing all bit
operations, outputs the final result. The pop-count operation is used to identify the
number of 1s; it is frequently employed in BNNs for computational optimization, as
they predominantly involve bit operations.
BNNs primarily focus on the computational speed and hardware resource usage rather
than the accuracy. Therefore, it is not necessary to accurately compute the cumulative
operations at each layer. It is more efficient to maintain a level of accuracy sufficient
for recognition while reducing the power consumption and hardware resource usage.
Not all bits have an equal impact on recognition during cumulative operations, so
it is inefficient to perform cumulative operations for bits with a relatively lower
impact. Consequently, the accumulator does not perform cumulative operations for bits
with less impact on the accuracy, and instead directly outputs the previous value.
Fig. 10 shows the accuracy obtained when the cumulative operation is skipped for each
bit position. Skipping the 32nd bit still yields a high accuracy of 94.7%, whereas
skipping the 6th bit drops the accuracy to 68.8%. This confirms that the 6th bit has
a more significant influence on the final classification than the 32nd bit. Based on
these results, a low-power accumulator is proposed herein.
Fig. 10. Accuracy when not performing operation for each bit.
Fig. 11 illustrates the proposed low-power accumulator. Conventional accumulators sequentially
compute each bit, thereby consuming significant amounts of computational time and
hardware resources. In contrast, the proposed accumulator performs parallel operations
while adding an enable signal to each adder. If the enable signal is ‘1,’ the operation
proceeds as usual; however, if the enable signal is ‘0,’ the operation does not take
place. By conducting parallel operations with the same number of adders as before,
the computational speed increases, and the addition of the enable signal to each adder
significantly reduces the hardware resource usage.
Fig. 11. Proposed accumulator structure.
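A behavioral sketch of this enable-gated parallel accumulation is given below; the word width, array layout, and names are assumptions made for illustration, not the exact RTL of the proposed design:

```cpp
// Behavioral sketch of the enable-gated accumulator: each adder updates
// its partial sum only when its enable bit is '1'; otherwise the previous
// value is passed through unchanged.
void gated_accumulate(const int partial[32],   // parallel partial results
                      const bool enable[32],   // per-adder enable signals
                      int acc[32])             // accumulator registers
{
    for (int i = 0; i < 32; ++i) {
        if (enable[i]) {
            acc[i] += partial[i];   // normal accumulation when enabled
        }
        // when enable[i] == 0, acc[i] simply keeps its previous value
    }
}
```

In this sketch, bits found to have little influence on the final classification (such as the 32nd bit in Fig. 10) would have their enable signals tied to '0', so their accumulation is skipped.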
3.4. Proposed Accumulator-Based BNN
Fig. 3 shows the overall architecture of the BNN with the proposed low-power accumulator,
as designed using the MNIST dataset. The MNIST dataset consists of images of ten
handwritten digits, each with a size of 28 × 28 pixels. When the input data is received,
the pixel values of the image are converted to 0 or 1. The preprocessed image data
is then fed into the input layer. The data output from the input layer serves as the
input for the convolution layer, which extracts image features using 32 filters. After
the convolution operation is completed, max-pooling is performed.
Then, 64 filters are applied again, and pooling is performed once more before the
data is converted into a one-dimensional format. The one-dimensional data passes through
a fully connected layer to ultimately perform the classification.
The image data is converted to 0 or 1 through the preprocessing process, and other
data such as weights and biases are also binarized. Therefore, most operations are
performed using XOR or bitwise operators.
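A minimal sketch of this preprocessing step, thresholding the 28 × 28 grayscale pixels to 0 or 1 (the threshold value of 128 and the function name are assumptions for illustration), is:

```cpp
#include <cstdint>

// Sketch of MNIST input preprocessing: threshold each 8-bit grayscale
// pixel of the 28 x 28 image to a binary value (0 or 1).
void binarize_image(const uint8_t img[28][28], uint8_t bin[28][28]) {
    const uint8_t threshold = 128;   // illustrative threshold
    for (int r = 0; r < 28; ++r)
        for (int c = 0; c < 28; ++c)
            bin[r][c] = (img[r][c] >= threshold) ? 1 : 0;
}
```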