
  1. (School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea {nematullo.9006, neloyou}@knu.ac.kr )



Keywords: LSR, Loss function, Gradient descent, Backpropagation

1. Introduction

Artificial neural networks (ANNs), whose earliest form, the perceptron, was proposed by Frank Rosenblatt in the late 1950s, are used widely for the prediction and classification of complex data [1]. An ANN is a composition of artificial neurons, inspired by biological neurons and arranged in layers, that learns its parameters, the weights and biases. A typical neuron in the hidden and output layers performs the following activities.

· Neurons receive inputs from the previous layer.

· Each input is multiplied by its weight, and the products are summed; this is called the weighted sum.

· A bias is added to adjust the weighted sum.

· The result is passed into an activation function, which maps the value into a bounded range.

· The output of the activation function, which is the outcome of the neuron, is passed on to the next layer.

Neurons in the output layer produce the result of the network. A minimal code sketch of this forward pass is given below.
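To make these steps concrete, the following minimal Python sketch computes one neuron's output from its inputs, weights, and bias. It assumes NumPy is available; the sigmoid activation and the example numbers are illustrative choices, not taken from the paper.

```python
# Minimal sketch of one neuron's forward pass (assumes NumPy; sigmoid and the
# example values are illustrative assumptions).
import numpy as np

def neuron_forward(inputs, weights, bias):
    weighted_sum = np.dot(inputs, weights) + bias   # weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-weighted_sum))      # sigmoid squashes into (0, 1)

x = np.array([0.5, -1.2, 3.0])    # outputs of the previous layer
w = np.array([0.8, 0.1, -0.4])    # weights of this neuron
b = 0.2                           # bias
print(neuron_forward(x, w, b))    # value passed on to the next layer
```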

The loss is calculated from the predicted value (regression or classification) for a given input. After performing this for a set of input data, the optimal value of each parameter is obtained by applying gradient descent (GD) to the cost function [2] during backpropagation. Understanding the loss function, GD, and backpropagation is essential in this process. This process lets ML researchers train a model repeatedly until the loss approaches 0, yielding an accurate trained model; GD helps the researcher find the minimum of the cost function [3]. Up to this point, only networks with one or two inputs and one or two outputs, described using a perceptron, have been considered for a feedforward neural network.

The backpropagation algorithm is a generalization of the perceptron for training on datasets with multiple inputs; the resulting model is often called a multilayer perceptron. Backpropagation makes it possible to apply gradient methods to train multilayer networks, updating the weights to reduce the error; GD or alternative forms, such as stochastic gradient descent (SGD), are generally utilized [4]. Because ANNs have beneficial applications in many fields [1,5], including pattern recognition, face identification, signal recognition, and machine translation, it is essential to shed continuous light on DL algorithms, such as CNNs, RNNs, GNNs, and GANs. Hence, this paper explains ANNs for novice DL practitioners. The remainder of this paper is organized as follows. Section 2 gives a detailed survey of least square regression, loss function, GD, and backpropagation applications. Section 3 provides a straightforward, step-by-step guide to least square regression, the loss function, and GD with backpropagation. Finally, the conclusion summarizes the paper.

2. Related Work

2.1 Least Square Regression

Least square regression (LSR) is vital for prediction applications, including microarray missing-value estimation, disease spread estimation, and weather condition estimation. The authors of [6,7] proposed a novel algorithm called least square multi-splitting (LSMS) to solve large-scale regression problems in parallel, designing a partitioning technique for regression based on cluster analysis. The LSMS approach can compete with LSR as a general-purpose algorithm, and the results show that LSMS is a comprehensive and well-performing technique for partitioning large-scale regression problems. For comparison, at every truncation, the partitions are estimated using cluster analysis based on three linkages: single, complete, and average.

Another work is an optimal choice for multicategory classification and is based on a newly proposed technique called group-wise retargeted least squares regression (GReLSR), an extension of the earlier retargeted least squares regression (ReLSR) proposed by XuYao Lingfeng et al. The authors reformulated ReLSR with a groupwise regularization that restricts the translation values of ReLSR, producing the novel GReLSR technique. The performance of GReLSR was compared with seven prior multicategory classification techniques, and the results showed that GReLSR is a state-of-the-art method that outperforms the prior findings [8].

Drug availability has become an important part of human life. One study forecasts drug availability for the next month using the LSR algorithm. To forecast the stock needed for medicine requirements, the authors collected data from January to November 2017 from the Puskemas community health center, East Kalimantan, Indonesia, to predict future drug quantity requirements. A shortcoming of this work is that the authors did not compare LSR with other prediction algorithms to establish the accuracy of their results [9]. Zhao Shuping et al. [10] proposed a novel discriminant- and sparsity-based LSR. In contrast to prior LSR techniques, the authors modeled the relationship among the training samples with L1 regularization to jointly learn the discriminative projection matrix and the orthogonal relaxing term. A comprehensive experiment on the Yale B, LFW, and 15-Scene databases demonstrated outperforming results in the image classification area. The work was compared with eight prior algorithms, and the results showed the efficiency of the proposed approach for classification.

The COVID-19 outbreak is a global concern. O. Roseline et al. reported predictive modelling of COVID-19 based on a linear regression model, analyzing the impact of traveling history and contacts on confirmed COVID-19 cases in Nigeria. Ordinary LSR was used, with diagnostic checks, to fit datasets derived from the Nigeria Centre for Disease Control (NCDC) website covering April 5 to April 13, 2020, and the results were then compared. The comparative results show that traveling history and contacts increase the likelihood of being infected with COVID-19 by 85% and 88%, respectively [11]. Generally, COVID-19 spreads rapidly in countries with dense populations, and India is one such country.

Another work used multiple linear regression to predict upcoming active COVID-19 cases for the last two weeks of August 2020 in Odisha and India. The authors suggested that the containment facilities in India and Odisha should be reinforced by the responsible authorities to keep people healthy and decrease the number of patients [12]. Qin Lei et al. proposed another prediction of COVID-19 case numbers. Data on patient categories such as dry cough, fever, chest distress, coronavirus, and pneumonia were derived from a social media search index from 31 December 2019 until 17 February 2020. The authors proposed a COVID-19 prediction method based on five regularized regression methods, including subset selection, forward selection, lasso regression, ridge regression, and elastic net, to prevent overfitting during prediction. The comparative results for patients showed that by January 22, 2020, the numbers of coronavirus, pneumonia, and new suspected cases would rise sharply and then decrease slightly [13,14]. Table 1 lists the different regression analysis techniques.

Table 1. Summary of Different Versions of Regression Analysis Techniques.

References | Approaches | Regression | Classification | Multi-class classification | Parallelism
[7] | LSMS | | | | Yes
[8] | GReLSR | | | Yes |
[9] | LSR | Yes | | |
[10] | Discriminant LSR | | Yes | |
[11] | OLS | Yes | | |
[12] | LSR+MLSR | Yes | | |
[13] | Lasso+Ridge | Yes | | |
[14] | PLSR | Yes | | |

2.2 Loss Function

A loss is the difference between the original outcome and the predicted outcome for a given instance; it is also called the error or residual. The aggregate loss over a set of instances is called the cost of the model. The sum of squared errors (SSE) or cross-entropy (CE) is typically used as a cost function. In [15], parameter values for different purposes, such as sequential sampling, classification, and optimal control, were estimated using a cost function. Sequential sampling analysis is an important computational technique in various fields, including economics, engineering, medical science, and statistics. K Jampachaisri et al. proposed an Empirical Bayes (EB) prediction method [15] for parameter estimation in a sequential sampling plan (SSP) based on the squared error loss (SEL) and precautionary loss (PL) functions. The proposed EB approach was compared with a prior single-sampling statistical approach. The results show that EB in an SSP, computed with SEL and PL, affords the highest probability of acceptance and the smallest average sample number. An advantage of the proposed method is that the mean square error remains convex in its parameter. Nevertheless, the authors did not compare their proposed method with more than one sequential sampling technique.

DL algorithms are becoming an essential tool in everyday life, and developing new techniques for training on complex datasets quickly and accurately is in demand. In [16], a novel loss function called the reward cum penalty loss function based extreme learning machine (RP-ELM) was proposed, penalizing data points that do not fall on the targeted location and rewarding data points that do. The authors compared the RP-ELM results with prior loss functions to demonstrate precise classification outputs. The results show that the proposed RP-ELM performs better than the hinge and quadratic losses in terms of fast and correct classification. The main purpose of any activation function is to introduce nonlinearity into otherwise linear transformations of the data. S Ma et al. proposed a nonlinear optimization method, the variable-step beetle antennae search algorithm with Gaussian direction (GBAS), to improve the performance of the Huber loss function [17].

Currently, DL algorithms are de facto important tools in many areas [18-20]. Novel ideas were proposed in [21] to construct multiple loss functions based on kernel density estimation (KDE) for estimating the probability density function (PDF) in spoofing detection. The authors used the recent automatic speaker verification (ASV) spoof 2019 dataset to reflect a realistic scenario. The experimental results show that the proposed KDE-based loss functions are superior to the conventional loss functions exploited for anti-spoofing detection until now, and the finding suggests a new idea of building several loss functions on KDE in the field of DL. A weakness of the work is that the comparison of the proposed KDE-based loss functions with conventional loss functions was not presented clearly for the reader.

Traffic classification is also a long-standing challenge in the research community. In [22], a new loss function called UniLoss was proposed. The authors addressed the problem of imbalanced VoIP traffic datasets, where minority categories suffer poor classification performance despite high overall accuracy. In the experimental part of the work, the newly proposed UniLoss function was run in four different types of deep neural networks (DNNs), including a convolutional neural network (CNN), a recurrent neural network (RNN), ResNet, and FusionNet, and two conventional loss functions were run in the same DNN algorithms for comparison. The outcome showed that the proposed UniLoss function achieves a higher F1 score on single categories and thus better performance than the two conventional loss functions. The strength of this finding is that the novel UniLoss approach outperforms the conventional loss functions; a drawback is that the authors experimented only with VoIP data and not with other types of data for classification.

With the remarkable development of biomedical technology, DL algorithms are also becoming an important prognostic tool in biomedical science. H Seo et al. proposed a generalized loss function (GLF) with functional parameters in [23] for optimal decision making in small-target segmentation. The proposed method displayed more precise detection and segmentation of lung and liver cancer tumors, and the outcome showed that the proposed GLF provides a more accurate diagnosis than prior techniques in terms of detection and segmentation of lung and liver tumors. Similarly, several past studies have used different types of loss functions [24,25]. Table 2 lists the papers that used loss functions for different applications.

Table 2. Summary of Different Versions of Loss Function Techniques.

References | Loss Function Approach | For Sequential Sampling | Classification | Prediction | Object Detection
[15] | EB | Yes | | |
[16] | RP-ELM | | Yes | |
[17] | GBAS | | | Yes |
[18] | Cross Entropy & ED | | | Yes |
[19] | WC-Entropy Loss | | | | Yes
[20] | Asymmetric Loss | | | Yes |
[21] | KDE | | Yes | |
[22] | UniLoss | | Yes | |
[23] | GLF | | | | Yes
[24] | RTC-L1 | | Yes | Yes |
[25] | Gaussian Loss Function | | | Yes |

2.3 Gradient Descent

Gradient descent (GD) is an optimization algorithm for finding a minimum of a function; it is used to minimize the error of a model while training on a fixed dataset. It has diverse applications, including image classification, entity clustering, weather prediction, and disease spread prediction. D Zou et al. proposed a technique [26] for binary classification that trains a deep, fully connected neural network with the rectified linear unit (ReLU) activation function and the cross-entropy loss function using GD. The results showed that, with proper random weight initialization, GD can find a global minimum of the training loss for an over-parameterized deep ReLU network under assumptions on the fixed training data. They compared the training performance of their findings with two previously proposed techniques, and their technique could discover global minima faster than the methods used in the experimental part of the paper. An advantage of the finding is that the authors could prove their claim that Gaussian random initialization followed by GD produces a sequence of iterates that remain inside a small perturbation region centered at the initial weights. A disadvantage is that the authors did not compare the performance of their proposed method with more than two previous techniques.

J Flynn et al. proposed a novel method for inferring a multiplane image (MPI) scene representation in combination with learned gradient descent (LGD) [27]. They compared the MPI results with the well-known Soft3D method and several deep-learning methods. The results showed that the proposed MPI scene representation, combined with learned GD, performs better than the traditional techniques, particularly for solving complicated, nonlinear inverse problems. The strength of the proposed MPI technique is that the authors could implement their novel idea based on LGD; the weak point is the RAM requirement and the training speed of LGD, which takes multiple days even when utilizing more than one GPU. J Lee et al. analyzed the learning dynamics of GD theoretically in the parameter space of deep nonlinear networks for classification [28]. They compared their proposed technique with traditional SGD and CEL for classification purposes by exploiting the MNIST and CIFAR datasets. The results showed that the proposed technique is more effective than the SGD and CEL techniques for training a wide DNN while attaining minimum error. The strength of the work is that its theoretical results are precise; a limitation is that the authors did not compare their proposed classification technique on more than two classification datasets to prove its robustness.

One continuing line of progress in the DL research community concerns optimizing convex, nonconvex, and concave functions. A two-time-scale gradient descent ascent (GDA) for solving nonconvex-concave minimax problems was proposed in [29]. The main aim of this work was to solve the nonconvex-concave minimax problem and compare the performance of GDA with two prior techniques, the Wasserstein robustness model (WRM) and gradient descent max (GDmax). The results showed that the proposed GDA outperforms both prior techniques compared in experiments on three types of classification datasets. An advantage of the finding is that the authors could prove the solution of the nonconvex-concave minimax problem while training on combined classification datasets; on the other hand, the authors did not justify why only three classification datasets were chosen for the comparison when many classification datasets are available to show the efficiency of GDA. M M Amiri et al. in [30] proposed a novel analog distributed stochastic gradient descent (A-DSGD) to reduce noise over the channel bandwidth in combination with a parameter server (PS). One prior technique, digital distributed stochastic gradient descent (D-DSGD), was used for a performance comparison with A-DSGD, and the results showed that A-DSGD is faster than D-DSGD because of how it uses the available channel bandwidth. In the last decade, many studies have used different types of GD for different applications, such as classification [31], sparse linear problems [32], new task learning [33], data overfitting prevention [34], and inverse filtering [35].

2.4 Backpropagation

A combined Empirical Mode Decomposition-Variational Mode Decomposition-Genetic Algorithm-Backpropagation (EMD-VMD-GA-BP) model was introduced in [36] for carbon price prediction. The proposed model used several techniques for accurate carbon price estimation in the Hubei market, particularly a backpropagation (BP) neural network combined with a genetic algorithm (GA) for accurate prediction. The proposed model outperformed other prediction models, and its advantage is that it can meaningfully reduce the effort of carbon price time-series forecasting. In [37], a Multi-Input Multi-Output (MIMO) model for Wireless Sensor Networks (WSNs) was proposed to address the Cluster Head (CH) identification challenge for MIMO sensor networks. The proposed technique is based on a BP neural network algorithm, and its main aim is to address the location identification problem of the CH for MIMO sensor networks, which is useful in Intelligent Transportation Systems (ITSs). The model minimizes the total estimation error compared with other proposed techniques. ML is also used in areas such as agriculture. L Wang et al. proposed maize growth monitoring on the North China Plain utilizing a hybrid genetic algorithm-based BP neural network [38]. The experimental part of the study used the remotely sensed leaf area index (LAI) and vegetation temperature condition index (VTCI), derived from Global LAnd Surface Satellite (GLASS) and Moderate-resolution Imaging Spectroradiometer (MODIS) data, which were chosen as the key indicators of maize growth.

The hybrid GA-based BP neural network (GA-BPNN) model was designed to provide enough information on maize growth at the core growing stage, and the proposed GA-BPNN performed satisfactorily compared with other techniques. A new combined ML method called Genetic Algorithm-Backpropagation (GA-BP) was proposed to predict clothing pressure [39,40]. The proposed GA-BP algorithm does not require complex modeling compared with prior girdle pressure prediction models, such as the General Regression Neural Network (GRNN) and grey BP. W Yang et al. proposed an analysis method for skeletal anthropology based on an improved BP algorithm, whose main aim was to determine whether a given skull is male or female from skeletal features [40]. The proposed improved BP algorithm showed better classification accuracy, 97.232% at the training stage, with a mean square error of 0.01. For performance comparison, the results were compared with prior techniques, including the cranial sagittal chord and apical sagittal chord; consequently, the improved BP showed outstanding performance compared with the two prior techniques.

3. Mathematical Calculations

3.1 Least Square Regression

The least square regression model aims to determine the relationship between the dependent and independent variables, and large amounts of missing data can be estimated using the least square model. Every data point of the dependent variable depends upon the alteration of the independent variable [41,42]. Fig. 1 shows how linear regression fits the data points (X: input, Y: output) given in Table 4 using the line function f(x) = mx + b.

Fig. 1. Fitting line on the data points.
../../Resources/ieie/IEIESPC.2023.12.3.223/fig1.png

The line equation on the data points was determined by estimating (approximating) the values of m and b. The first step of LSR is calculating the slope of the function, then computing the y-intercept, and finally substituting the input values into the explored model. Before calculating the slope, it is necessary to know the means of the input and output values, as shown in Table 5. Because the first goal is to determine the slope m of the function, the mean of each column must be subtracted from each observation on the X and Y axes and recorded in a separate column; Table 5 lists this process in detail. With the values calculated in Table 5, the slope (m) [42] can be found using the following formula

(1)
$ m=\frac{\sum \left(X-\overline{X}\right)\left(Y-\overline{Y}\right)}{\sum \left(X-\overline{X}\right)^{2}}=\frac{80.82}{28}=2.89, $

The next step after finding the slope is finding the y-intercept ($b_{0}$), the point at which the trend line crosses the y-axis,

(2)

$ \begin{align*} \overline{Y}&=b_{0}+m\overline{X} \\ 9.86&=b_{0}+2.89\left(3\right) \\ 9.86&=b_{0}+8.67 \\ b_{0}&=9.86-8.67=1.19. \end{align*} $

The values of m and b were calculated manually here; they can also be computed in Excel, which can then plot the line of best fit (trend line), as shown in Fig. 2.
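As a check on the hand calculation, the short Python sketch below (a minimal version assuming NumPy is available) recomputes m and b from the Table 4 data using Eqs. (1) and (2).

```python
# Minimal least-squares sketch for the Table 4 data (assumes NumPy).
import numpy as np

X = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
Y = np.array([3, 6, 3, 12, 6, 18, 21], dtype=float)

x_mean, y_mean = X.mean(), Y.mean()

# Slope m = sum((X - X_mean)(Y - Y_mean)) / sum((X - X_mean)^2), intercept b = Y_mean - m * X_mean
m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"m = {m:.2f}, b = {b:.2f}")  # about 2.89 and 1.18; the text obtains 1.19 from rounded means
```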

Fig. 2. Line of best fit.
../../Resources/ieie/IEIESPC.2023.12.3.223/fig2.png
Table 3. Summary of the Different Versions of Gradient Descent Methods.

References | Approaches | Inverse Filtering | Classification | Convex setting | Image representation | Catastrophic forgetting | Minimum error analysis
[26] | GD | | Yes | | | |
[27] | LGD | | | | Yes | |
[28] | GD | | | | | | Yes
[29] | GDA | | | Yes | | |
[31] | GD | | Yes | | | |
[32] | OGD | | | Yes | | |
[34] | GD | | Yes | | | |
[35] | PGDA | Yes | | | | |

Table 4.

X | Y
0 | 3
1 | 6
2 | 3
3 | 12
4 | 6
5 | 18
6 | 21

Table 5.

X | Y | X − X̄ | Y − Ȳ | (X − X̄)² | (X − X̄)(Y − Ȳ)
0 | 3 | -3 | -6.86 | 9 | 20.58
1 | 6 | -2 | -3.86 | 4 | 7.72
2 | 3 | -1 | -6.86 | 1 | 6.86
3 | 12 | 0 | 2.14 | 0 | 0
4 | 6 | 1 | -3.86 | 1 | -3.86
5 | 18 | 2 | 8.14 | 4 | 16.28
6 | 21 | 3 | 11.14 | 9 | 33.42
Mean | X̄ = 3 | Ȳ = 9.86 | | Σ = 28 | Σ = 80.82

3.2 Loss Function

The main goal of training with a loss function is to minimize the loss incurred while training on complex datasets [2,43]. Table 6 provides an example training dataset of working hours used to predict a worker's pay: when a worker works approximately one hour, the salary is 13 dollars. What happens, then, when more than one hour is worked?

Table 6.

Working time (X) | Salary of Worker (Y)
1 | 13$
2 | 26$
3 | 39$
4 | ?
5 | ?
6 | ?

What happens when more than 10 hours are worked? What if there is a much larger amount of work data to predict from? For this type of training, multiple ML algorithms can be used to deal with such challenges, and it is important to follow some simple equations, or ML hypotheses. Table 7 provides an example of predicting a worker's salary. With the working time as the input X and the salary of a worker as the variable Y, the ML hypothesis for a simple prediction is as follows

(3)
$\hat{Y}=X\star W$,

where $\hat{Y}$ is the predicted output variable. The input data X is multiplied by the variable W, which is updated over the whole input data to predict an accurate output [44,45]. The next stage, computing the loss, uses the following loss formula:

(4)
$ Loss=\left(\hat{Y}-Y\right)^{2}=\left(X\times W-Y\right)^{2}\,. $

Tables 7-9 list the prediction of a worker's salary for various W values.

Tables 10(a) and (b) list the mean square error, or total loss, of each training pass. The loss decreases through every iteration. This example does not show the condition in which the loss equals 0; it only shows the total loss decreasing in every training pass, and as long as L > 0, more training is needed to achieve the actual goal. Fig. 3 shows the different $\hat{Y}$ lines over the data points during the training process, illustrating loss minimization using simple mathematical computations, graphs, and tables. The first random guess W$^{1}$, here set to W$^{1}=1$, should be fixed before each pass. If there are 1000 or 10000 random guesses, the computer can automatically provide the variable value and compute the other random guesses W$^{n}$ for updating the output [6]. Therefore, the variable W plays an important role in neural networks. Here, three random guesses of W, with the simple values 1, 2, and 3, are used for an easy understanding of the loss function [45]. Table 11 shows three values on the X and Y axes, along with three more columns describing the outputs of the predictions needed to minimize the loss while training. A straightforward mathematical computation is performed below using the same values as in the prediction part. The hypothesis $\hat{y}=x\star w$ was used to update each training input x [8], and $\mathrm{Cost}=\sum (\hat{y}-y)^{2}$ was used to measure the loss to be minimized. For example,

(5)

Computing for W$^{1}$:

$ \begin{align*} \hat{Y}&=1\cdot 1=1 \\ \hat{Y}&=3\cdot 1=3 \\ \hat{Y}&=5\cdot 1=5 \end{align*} $

Subtracting $Y$ from $\hat{Y}$ to find the loss,

$ \left(1-3\right)^{2}+\left(3-9\right)^{2}+\left(5-15\right)^{2}=2^{2}+6^{2}+10^{2}=140 $

Computing for W$^{2}$:

$ \begin{align*} \hat{Y}&=1\cdot 2=2 \\ \hat{Y}&=3\cdot 2=6 \\ \hat{Y}&=5\cdot 2=10 \end{align*} $

Subtracting $Y$ from $\hat{Y}$ to find the loss,

$ \left(2-3\right)^{2}+\left(6-9\right)^{2}+\left(10-15\right)^{2}=1^{2}+3^{2}+5^{2}=35 $

Computing for W$^{3}$:

$ \begin{align*} \hat{Y}&=1\cdot 3=3 \\ \hat{Y}&=3\cdot 3=9 \\ \hat{Y}&=5\cdot 3=15 \end{align*} $

Subtracting $Y$ from $\hat{Y}$ to find the loss,

$ \left(3-3\right)^{2}+\left(9-9\right)^{2}+\left(15-15\right)^{2}=0^{2}+0^{2}+0^{2}=0 $

As shown in the above examples, the loss in the first iteration was 140, but it fell to 35 in the second iteration.

Eventually, the loss approaches 0. Fig. 4 shows the loss for different possible lines; the red segments indicate the gap between the true line and the predicted line. The particular point where L = 0 in the graph denotes the minimum of the function [45].
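The arithmetic of Eq. (5) can be reproduced with a few lines of Python. The sketch below (assuming NumPy) sweeps the three candidate weights over the Table 11 data and prints the same cost values.

```python
# Sketch reproducing Eq. (5): total squared loss for W = 1, 2, 3 on the Table 11 data.
import numpy as np

X = np.array([1.0, 3.0, 5.0])
Y = np.array([3.0, 9.0, 15.0])

for W in (1.0, 2.0, 3.0):
    y_hat = X * W                              # hypothesis: y_hat = x * w
    cost = np.sum((y_hat - Y) ** 2)            # Cost = sum of squared losses
    print(f"W = {W:.0f}: cost = {cost:.0f}")   # prints 140, 35, 0
```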

Table 7.

Working time (X) | Salary of Worker (Y) | Prediction, Ŷ (W=1) | Loss (W=1)
1 | 13$ | 1$ | 144
2 | 26$ | 2$ | 576
3 | 39$ | 3$ | 1296
4 | 52$ | 4$ | 2304
 | | | MSE = 4320 / 4 = 1080

Table 8.

Working time (X) | Salary of Worker (Y) | Prediction, Ŷ (W=13) | Loss (W=13)
1 | 13$ | 13$ | 0
2 | 26$ | 26$ | 0
3 | 39$ | 39$ | 0
4 | 52$ | 52$ | 0
 | | | MSE = 0

Table 9.

Working time (X) | Salary of Worker (Y) | Prediction, Ŷ (W=21) | Loss (W=21)
1 | 13$ | 21$ | 64
2 | 26$ | 42$ | 256
3 | 39$ | 63$ | 576
4 | 52$ | 84$ | 1024
 | | | MSE = 1920 / 4 = 480

Fig. 3. Several random guesses of Ŷ approaching the true line.
../../Resources/ieie/IEIESPC.2023.12.3.223/fig3.png
Fig. 4. Loss for each line.
../../Resources/ieie/IEIESPC.2023.12.3.223/fig4.png

3.3 Gradient Descent and Backpropagation

Gradient descent is an iterative optimization algorithm for finding the minimum of the cost function [46], as shown in Fig. 5. This section reviews the concept of gradient descent with simple, illustrative calculations on the cost function, the sum of squared errors (SSE). The equation of the SSE is given below

(6)
$ SSE\left(w\right)=\frac{1}{N}\sum _{n=1}^{N}\left(\hat{y}_{n}-y_{n}\right)^{2} $

Table 12 explains the cost calculation on the given data.

For the first training pass, the predictions $\hat{Y}$ for the Table 12 data are needed to calculate the loss function [46]. With the initial weight W = 1,

(7)

$ \begin{align*} \hat{Y}&=X\star W \\ \hat{Y}&=1\cdot 1=1 \\ \hat{Y}&=2\cdot 1=2 \\ \hat{Y}&=3\cdot 1=3 \end{align*} $

Fig. 5. Gradient Descent.
../../Resources/ieie/IEIESPC.2023.12.3.223/fig5.png
Table 10.

(a)

Working time (X) | Loss (W=1) | Loss (W=5) | Loss (W=10)
1 | 144 | 64 | 9
2 | 576 | 256 | 36
3 | 1296 | 576 | 81
4 | 2304 | 1024 | 144
MSE | 4320/4 = 1080 | 1920/4 = 480 | 270/4 = 67.5

(b)

Working time (X) | Loss (W=13) | Loss (W=16) | Loss (W=21) | Loss (W=25)
1 | 0 | 9 | 64 | 144
2 | 0 | 36 | 256 | 576
3 | 0 | 81 | 576 | 1296
4 | 0 | 144 | 1024 | 2304
MSE | 0 | | |

Table 11.

X | Y | Ŷ1 | Ŷ2 | Ŷ3
1 | 3 | 1 | 2 | 3
3 | 9 | 3 | 6 | 9
5 | 15 | 5 | 10 | 15

Table 12.

X | Y
1 | 3
2 | 6
3 | 9

Fig. 6. Backpropagation.
../../Resources/ieie/IEIESPC.2023.12.3.223/fig6.png

The gradient of the cost function was calculated with respect to all the parameters (weights and biases) to find the global loss minimum [47]. The equation is given below.

(8)
$ \omega =\omega -\alpha \cdot \frac{\partial loss}{\partial \omega }, $

where ${\omega}$ is the weight parameter, $\frac{\partial loss}{\partial \omega }$ is the derivative of the loss with respect to that weight, and ${\alpha}$ is the learning rate that controls the speed of convergence. The weight value is updated by subtracting the scaled derivative of the loss function [48]. For the weight values W$_{1}$, W$_{2}$, and W$_{3}$ obtained over three successive updates in the network shown in Fig. 6, with learning rate $\alpha =0.01$, the calculations are as below,

(9)

For weight W$_{1}=1$,

$W_{1}=W_{1}-\alpha \star \sum 2x\star \left(\hat{y}-y\right)$

$ \begin{align*} \begin{array}{l} X_{1}\Rightarrow 2\star X_{1}\star \left(X_{1}\star W_{1}-y_{1}\right)\\ =2\star 1\star \left(1\star 1-3\right)\\ =2\star \left(1-3\right)=2\star \left(-2\right)=-4 \end{array} \\ \begin{array}{l} X_{2}\Rightarrow 2\star X_{2}\star \left(X_{2}\star W_{1}-y_{2}\right)\\ =2\star 2\star \left(2\star 1-6\right)\\ =4\star \left(2-6\right)=4\star \left(-4\right)=-16 \end{array} \\ \begin{array}{l} X_{3}\Rightarrow 2\star X_{3}\star \left(X_{3}\star W_{1}-y_{3}\right)\\ =2\star 3\star \left(3\star 1-9\right)\\ =6\star \left(3-9\right)=6\star \left(-6\right)=-36 \end{array} \end{align*} $

For weight W$_{2}=1.56$,

$W_{2}=W_{2}-\alpha \star \sum 2x\star \left(\hat{y}-y\right)$

$ \begin{align*} \begin{array}{l} X_{1}\Rightarrow 2\star X_{1}\star \left(X_{1}\star W_{2}-y_{1}\right)\\ =2\star 1\star \left(1\star 1.56-3\right)\\ =2\star \left(1.56-3\right)=2\star \left(-1.44\right)=-2.88 \end{array} \\ \begin{array}{l} X_{2}\Rightarrow 2\star X_{2}\star \left(X_{2}\star W_{2}-y_{2}\right)\\ =2\star 2\star \left(2\star 1.56-6\right)\\ =4\star \left(3.12-6\right)=4\star \left(-2.88\right)=-11.52 \end{array} \\ \begin{array}{l} X_{3}\Rightarrow 2\star X_{3}\star \left(X_{3}\star W_{2}-y_{3}\right)\\ =2\star 3\star \left(3\star 1.56-9\right)\\ =6\star \left(4.68-9\right)=6\star \left(-4.32\right)=-25.92 \end{array} \end{align*} $

For weight W$_{3}\approx 2$,

$W_{3}=W_{3}-\alpha \star \sum 2x\star \left(\hat{y}-y\right)$

$ \begin{align*} \begin{array}{l} X_{1}\Rightarrow 2\star X_{1}\star \left(X_{1}\star W_{3}-y_{1}\right)\\ =2\star 1\star \left(1\star 2-3\right)\\ =2\star \left(2-3\right)=2\star \left(-1\right)=-2 \end{array} \\ \begin{array}{l} X_{2}\Rightarrow 2\star X_{2}\star \left(X_{2}\star W_{3}-y_{2}\right)\\ =2\star 2\star \left(2\star 2-6\right)\\ =4\star \left(4-6\right)=4\star \left(-2\right)=-8 \end{array} \\ \begin{array}{l} X_{3}\Rightarrow 2\star X_{3}\star \left(X_{3}\star W_{3}-y_{3}\right)\\ =2\star 3\star \left(3\star 2-9\right)\\ =6\star \left(6-9\right)=6\star \left(-3\right)=-18 \end{array} \end{align*} $
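The three hand-computed updates above can be reproduced with the following Python sketch (a minimal version assuming NumPy), which iterates the update rule of Eq. (8) on the Table 12 data with the learning rate α = 0.01 used in the text.

```python
# Sketch of the gradient-descent loop of Eqs. (8)-(9) on the Table 12 data.
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 6.0, 9.0])

w, alpha = 1.0, 0.01                     # initial weight and learning rate
for step in range(1, 6):
    grad = np.sum(2 * X * (X * w - Y))   # d(cost)/dw summed over the samples
    w = w - alpha * grad                 # Eq. (8): w <- w - alpha * d(loss)/dw
    cost = np.sum((X * w - Y) ** 2)
    print(f"step {step}: w = {w:.4f}, cost = {cost:.4f}")
# w moves from 1 toward the true value 3 (1 -> 1.56 -> 1.96 -> ...) and the cost shrinks.
```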

For a full training run, we used 50,000 images for training and 10,000 images for testing over 10 epochs. In the execution, the learning rate was set to 0.01. The total number of iterations was 1,000, reaching approximately 95% accuracy, as shown in Fig. 7.

Fig. 7. Backpropagation and loss minimization.
../../Resources/ieie/IEIESPC.2023.12.3.223/fig7.png

Conclusion

The objective of this article was to explain ANN functionality, including the cost function, gradient calculation, and backpropagation. An extensive explanation of ANNs was provided for novices in DL, with the aim of giving a straightforward clarification of how ANNs learn. In a future study, we will adopt the same approach to explain other deep learning algorithms, such as CNNs, RNNs, and DNNs, and their latest versions.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2022-00166267).

REFERENCES

1 
M. Majumder, “Artificial Neural Network,” pp. 49-54, 2015.DOI
2 
“Loss function - Wikipedia.” (accessed Mar. 29, 2023).URL
3 
“Gradient descent - Wikipedia.” (accessed Mar. 30, 2023).URL
4 
“Backpropagation - Wikipedia.” (accessed Mar. 30, 2023).URL
5 
S. A. Kalogirou, “Applications of artificial neural-networks for energy systems,” Appl. Energy, vol. 67, no. 1-2, pp. 17-35, 2000.DOI
6 
L. da Fontoura Costa and G. Travieso, “Fundamentals of neural networks,” Neurocomputing, vol. 10, no. 2, pp. 205-207, 1996.DOI
7 
G. Inghelbrecht, R. Pintelon, and K. Barbe, “Large-Scale Regression: A Partition Analysis of the Least Squares Multisplitting,” IEEE Trans. Instrum. Meas., vol. 69, no. 6, pp. 2635-2647, 2020.DOI
8 
L. Wang and C. Pan, “Groupwise Retargeted Least-Squares Regression,” IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 4, pp. 1352-1358, 2018.DOI
9 
N. Dengen, Haviluddin, L. Andriyani, M. Wati, E. Budiman, and F. Alameka, “Medicine Stock Forecasting Using Least Square Method,” Proc. - 2nd East Indones. Conf. Comput. Inf. Technol. Internet Things Ind. EIConCIT 2018, no. Ci, pp. 100-103, 2018.DOI
10 
S. Zhao, B. Zhang, and S. Li, “Discriminant and Sparsity Based Least Squares Regression with l1 Regularization for Feature Representation,” ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., vol. 2020-May, pp. 1504-1508, 2020.DOI
11 
R. O. Ogundokun, A. F. Lukman, G. B. M. Kibria, J. B. Awotunde, and B. B. Aladeitan, “Predictive modelling of COVID-19 confirmed cases in Nigeria,” Infect. Dis. Model., vol. 5, pp. 543-548, 2020.DOI
12 
S. Rath, A. Tripathy, and A. Ranjan, “Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company’s public news and information,” no. January, 2020.URL
13 
L. Qin et al., “Prediction of number of cases of 2019 novel coronavirus (COVID-19) using social media search index,” Int. J. Environ. Res. Public Health, vol. 17, no. 7, 2020.DOI
14 
R. Gan, J. Tan, L. Mo, Y. Li, and D. Huang, “Using Partial Least Squares Regression to Fit Small Data of H7N9 Incidence Based on the Baidu Index,” IEEE Access, vol. 8, pp. 60392-60400, 2020.DOI
15 
K. Jampachaisri, K. Tinochai, S. Sukparungsee, and Y. Areepong, “Empirical bayes based on squared error loss and precautionary loss functions in sequential sampling plan,” IEEE Access, vol. 8, pp. 51460-51469, 2020.DOI
16 
P. Anand and A. Bharti, “A combined reward-penalty loss function based extreme learning machine for binary classification,” 2019 2nd Int. Conf. Adv. Comput. Commun. Paradig. ICACCP 2019, 2019.DOI
17 
S. Ma, D. Li, T. Hu, Y. Xing, Z. Yang, and W. Nai, “Huber Loss Function Based on Variable Step Beetle Antennae Search Algorithm with Gaussian Direction,” Proc. - 2020 12th Int. Conf. Intell. Human-Machine Syst. Cybern. IHMSC 2020, vol. 1, pp. 248-251, 2020.DOI
18 
B. Sung Lee, R. Phattharaphon, S. Yean, J. Liu, and M. Shakya, “Euclidean Distance based Loss Function for Eye-Gaze Estimation,” 2020 IEEE Sensors Appl. Symp. SAS 2020 - Proc., 2020.DOI
19 
T. H. Phan and K. Yamamoto, “Resolving Class Imbalance in Object Detection with Weighted Cross Entropy Losses,” arXiv, 2020.DOI
20 
Di. Rengasamy, B. Rothwell, and G. P. Figueredo, “Asymmetric Loss Functions for Deep Learning Early Predictions of Remaining Useful Life in Aerospace Gas Turbine Engines,” Proc. Int. Jt. Conf. Neural Networks, 2020.DOI
21 
A. Gomez-Alanis, J. A. Gonzalez-Lopez, and A. M. Peinado, “A Kernel Density Estimation Based Loss Function and its Application to ASV-Spoofing Detection,” IEEE Access, vol. 8, no. i, pp. 108530-108543, 2020.DOI
22 
L. Xu, X. Zhou, X. Lin, Y. Ren, Y. Qin, and J. Liu, “A New Loss Function for Traffic Classification Task on Dramatic Imbalanced Datasets,” IEEE Int. Conf. Commun., vol. 2020-June, 2020.DOI
23 
H. Seo, M. Bassenne, and L. Xing, “Closing the Gap between Deep Neural Network Modeling and Biomedical Decision-Making Metrics in Segmentation via Adaptive Loss Functions,” IEEE Trans. Med. Imaging, vol. 40, no. 2, pp. 585-593, 2021.DOI
24 
N. Zhang et al., “Robust T-S Fuzzy Model Identification Approach Based on FCRM Algorithm and L1-Norm Loss Function,” IEEE Access, vol. 8, pp. 33792-33805, 2020.DOI
25 
Z. Li, J. F. Cai, and K. Wei, “Towards the optimal construction of a loss function without spurious local minima for solving quadratic equations,” arXiv, vol. 66, no. 5, pp. 3242-3260, 2018.DOI
26 
D. Zou, Y. Cao, D. Zhou, and Q. Gu, “Gradient descent optimizes over-parameterized deep ReLU networks,” Mach. Learn., vol. 109, no. 3, pp. 467-492, 2020.DOI
27 
J. Flynn et al., “Deepview: View synthesis with learned gradient descent,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 2362-2371, 2019.DOI
28 
J. Lee et al., “Wide neural networks of any depth evolve as linear models under gradient descent,” J. Stat. Mech. Theory Exp., vol. 2020, no. 12, 2020.DOI
29 
T. Lin, C. Jin, and M. I. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” arXiv, 2019.URL
30 
M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” arXiv, vol. 68, pp. 2155-2169, 2019.DOI
31 
S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. Klivans, “Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent,” arXiv, 2020.URL
32 
E. Amid, M. K. Warmuth, J. Abernethy, and S. Agarwal, “Winnowing with Gradient Descent,” Proc. Mach. Learn. Res., vol. 125, pp. 1-20, 2020.URL
33 
M. Farajtabar, N. Azizan, A. Mott, and A. Li, “Orthogonal gradient descent for continual learning,” arXiv, vol. 108, 2019.URL
34 
M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks,” arXiv, 2019.URL
35 
C. Cheng, N. Emirov, and Q. Sun, “Preconditioned gradient descent algorithm for inverse filtering on spatially distributed networks,” arXiv, vol. 27, pp. 1834-1838, 2020..DOI
36 
W. Sun and C. Huang, “A carbon price prediction model based on secondary decomposition algorithm and optimized back propagation neural network,” J. Clean. Prod., vol. 243, p. 118671, 2020..DOI
37 
A. Mukherjee, D. K. Jain, P. Goswami, Q. Xin, L. Yang, and J. J. P. C. Rodrigues, “Back Propagation Neural Network Based Cluster Head Identification in MIMO Sensor Networks for Intelligent Transportation Systems,” IEEE Access, vol. 8, pp. 28524-28532, 2020..DOI
38 
L. Wang, P. Wang, S. Liang, Y. Zhu, J. Khan, and S. Fang, “Monitoring maize growth on the North China Plain using a hybrid genetic algorithm-based back-propagation neural network model,” Comput. Electron. Agric., vol. 170, no. 46, p. 105238, 2020.DOI
39 
Z. Jie and M. Qiurui, “Establishing a Genetic Algorithm-Back Propagation model to predict the pressure of girdles and to determine the model function,” Text. Res. J., vol. 90, no. 21-22, pp. 2564-2578, 2020.DOI
40 
W. Yang, X. Liu, K. Wang, J. Hu, G. Geng, and J. Feng, “Sex determination of three-dimensional skull based on improved backpropagation neural network,” Comput. Math. Methods Med., vol. 2019, 2019.DOI
41 
L. P. Huelsman, for Engineers, no. November. McGraw-Hill Science/Engineering/Math, 1990.URL
42 
“Analysis of the vulnerability estimation and neighbor value prediction in autonomous systems | Scientific Reports.” (accessed Mar. 30, 2023).URL
43 
J. Brownlee, “Loss and Loss Functions for Training Deep Learning Neural Networks,” Mach. Learn. Mastery, pp. 1-19, 2019,URL
44 
H. D. Learning et al., “Perceptron,” pp. 1-9, 2020,URL
45 
G. C. Mqef, “Mathematics for,” Quant. Lit. Why Numer. matters Sch., no. c, pp. 533-540, 2009,URL
46 
S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms, vol. 9781107057. 2013.DOI
47 
G. Lawson, “Maxima and minima,” Edinburgh Math. Notes, vol. 32, pp. xxii-xxiii, 1940,DOI
48 
L. Multipliers, “Paul ’ s Online Notes Section 3-5: Lagrange Multipliers,” pp. 1-15, 2020.URL

Author

Rahmatov Nematullo
../../Resources/ieie/IEIESPC.2023.12.3.223/au1.png

Rahmatov Nematullo is currently a Post-doctoral researcher in Kyungpook National University. He received his Ph.D. degree in the Department of Computer Science and Engineering at Kyungpook National University Daegu, South Korea in 2022. His research interests include Artificial Intelligence, Machine and Deep Learning, Natural Language Processing and Neural Machine translation. His future research plan is sentence modeling and machine translation based on novel algorithms of Deep Learning. He was a recipient of Computer Science and Engineering award in 2019 at the Department of Computer Science and Engineering, Kyungpook National University.

Hoki Baek
../../Resources/ieie/IEIESPC.2023.12.3.223/au2.png

Hoki Baek received his B.S, M.S., and Ph.D. from the Department of Computer Science at Ajou University in Suwon, South Korea, in 2006, 2008, and 2014, respectively. From March 2014 to February 2015, he served as a full-time researcher at Ajou University's Jangwee Defense Research Institute, and from March 2015 to February 2021, he was a Lecture Professor in the Department of Military and Digital Convergence at Ajou University. Currently, he is an Assistant Professor for the School of Computer Science and Engineering at Kyungpook National University. He is a life member and a director of the Korean Institute of Communications and Information Sciences (KICS), and is an editorial board member for the Journal of the Korean Institute of Communications and Information Sciences (J-KICS). He is a member of the Defense Information Technology Standards (DITA) Standard Working Group (SWG) for the Ministry of National Defense, serving from June 2020 to May 2024. His research interests include 5G/6G communications and networks, UAV networks, Wi-Fi, IoT, military communications and networks, and positioning and time synchronization.