
1. (College of Architecture and Information Engineering, Shandong Vocational College of Industry, Zibo 256414, China; qiwang_qwqw@outlook.com)



Keywords: Machine learning, Bayesian, Automation, CASH, Artificial intelligence

1. Introduction

Automated machine-learning technology can reduce or eliminate manual involvement in model selection and parameter tuning, thereby improving the efficiency and performance of machine-learning models. Machine-learning pipeline design is integral to automated machine learning and has received extensive attention [1,2]. In practical applications, however, existing machine-learning pipeline design algorithms are mostly used for the automatic modeling of static data sets. They cannot accurately capture concept drift in the data, so a model trained at one stage cannot adapt to the data of the next stage, and its accuracy degrades. In addition, for the CASH (Combined Algorithm Selection and Hyperparameter optimization) problem, the solution quality and efficiency of existing machine-learning pipeline design algorithms are unsatisfactory [3,4]. To this end, this study divides the machine-learning pipeline design problem into two sub-problems: searching the machine-learning pipeline structure with reinforcement learning and optimally configuring the pipeline hyperparameters with a Bayesian network model. On this basis, a Bayesian-model-based algorithm framework for automated machine-learning pipeline design (AutoML for PipeLine Design, Auto-PLD) is proposed, and the Bayesian proxy model and Bayesian acquisition function used in the hyperparameter optimization process are improved to enhance the performance of Auto-PLD. The research has two main innovations. The first is dividing the machine-learning pipeline design problem into two sub-problems to optimize the pipeline structure and hyperparameters simultaneously. The second is improving the Bayesian proxy model and the Bayesian acquisition function, thereby improving the performance of Auto-PLD. The research provides theoretical guidance and ideas for the practical application of automated machine-learning technology and serves as a reference for the development of automated machine learning in China.

2. Related Works

Application data in various fields have grown considerably as information technology has developed and spread. To make more efficient use of these data, artificial intelligence technology based on machine-learning models has received increasing attention. Automated machine-learning technology can make machine learning automatic, efficient, and intelligent, and it lowers the application threshold of artificial intelligence technology, which has attracted the academic community. Tan et al. proposed a COVID-19 prediction method based on automated machine-learning technology to analyze the chest CT scans of patients with COVID-19 pneumonia, enabling clinical prediction of the disease. The results showed that the AUC value of the method exceeded 0.95, proving its effectiveness [5]. Alsharef et al. used an automated machine-learning framework for time series forecasting, improving the efficiency and performance of data modeling; this study provides additional help and reference for related researchers and industries [6]. Wever et al. applied automated machine learning, which supports the construction of pipelined algorithm models, to multi-label classification. The results showed that automated machine-learning technology performs well in multi-label classification [7]. Baudart et al. proposed an orthogonal combinator to address the defect that gradual automated machine-learning techniques must change large-scale non-combined code; applying this combinator to gradual automated machine-learning techniques can improve their operational efficiency [8]. Li et al. introduced the VolcanoML framework for end-to-end automated machine learning, which effectively improved the decomposition of the search space in automated machine-learning techniques [9]. Automatic machine-learning technology cannot always achieve the best prediction performance of a model within a limited time. Zogaj et al. addressed this problem by reducing the number of rows in the input tabular data set, improving the efficiency of automatic machine learning; experimental data confirmed the effectiveness of the method [10]. Li et al. combined Internet of Things technology, blockchain technology, and automated machine-learning technology to build an open and intelligent customer service platform that helps users conduct data transactions while ensuring user safety [11]. Yakovlev et al. reported that machine-learning technology could not deploy models quickly because of massive data growth and introduced automated machine-learning technology to achieve fast and accurate modeling; the experimental results verified the effectiveness of the method [12].

The Bayesian classification algorithm achieves classification based on probability and statistics. It has the advantages of a simple classification method, high classification accuracy, and fast classification speed, and it works well on large databases. Alade et al. used a Bayesian algorithm to optimize a support vector machine (SVM) and construct a prediction model that accurately predicts the specific heat capacity of alumina/ethylene glycol nanofluids; the experimental results showed that the model accuracy reached 99.95% [13]. Scanagatta et al. surveyed the network structure learning of Bayesian algorithms and proposed alternatives for handling incomplete data and continuous variables; the study also tested current software tools [14]. Yao et al. explored the influence of silk processing parameters on the physical properties of silk fibers using a rapid Bayesian algorithm and improved silk processing; the experimental results showed that the mechanical properties of the silk improved significantly after the rapid Bayesian algorithm was introduced [15]. Maheswari et al. combined a decision tree and a naive Bayesian algorithm to mine healthcare data for heart disease prediction; the experimental results validated the prediction accuracy of the method [16].

Joseph et al. used a sparse Bayesian method to solve the dictionary learning problem and verified its global convergence and stability; the method also performed well in image denoising [17]. Mat et al. proposed a Bayesian classification model to prevent malicious attacks from Android malware. Experiments on samples from the AndroZoo and Drebin databases showed that the accuracy of the model exceeded 90% [18]. Salvato et al. used a Bayesian algorithm to cross-match the counterparts of all-sky X-ray surveys; the experimental results showed that the method is faster and more accurate [19]. Liu et al. proposed a hybrid Bayesian algorithm and applied it to evaluate the synergistic capability of related equipment in retrieving ice cloud microphysics; the experimental results verified the effectiveness of the algorithm [20].

Automatic machine-learning technology and Bayesian algorithms are now widely used. However, existing research applies automatic machine learning mostly to the automatic modeling of static data sets, and its effect in actual application scenarios is poor. In response, this study proposes a machine-learning pipeline automation design method that combines Bayesian algorithms and reinforcement learning so that it also performs well in practical applications. The research provides new ideas for the practical application of automated machine-learning technology and helps promote the development of artificial intelligence technology.

3. Construction of the Auto-PLD Algorithm Framework for Classical Scenarios

3.1 Basic Structure Design of the Machine-learning Pipeline

Automated machine-learning technology can automatically select an algorithm for a given data set and tune its hyperparameters through a particular control strategy. Hence, manual intervention is reduced, and the performance of machine-learning algorithms and their accuracy on the data set are improved. The main problem faced by automated machine-learning techniques is the CASH problem, which combines algorithm selection and hyperparameter optimization. CASH can be described as follows. Suppose there is a set of machine-learning algorithms $A=\left\{A_{1},A_{2},\ldots ,A_{n}\right\}$ and a data set divided into two disjoint subsets, the training set $D_{1}$ and the test set $D_{2}$. The goal of the CASH problem is to find the algorithm $A_{i}\in A$ that, after training on $D_{1}$ and tuning its hyperparameters, performs best on $D_{2}$. The above process can be expressed using formula (1).

(1)
$ A_{i}\in \arg \min _{A_{i}\in A}L\left(A_{i},D_{1},D_{2}\right) $

where $L\left(A_{i},D_{1},D_{2}\right)$ is the loss function. In machine-learning application scenarios, the design of data preprocessing and feature preprocessing algorithms must often be considered as well. Taking the classic classification task as an example, the machine-learning pipeline has multiple algorithms participating in data preprocessing, feature preprocessing, and final classification. A complete automated machine-learning pipeline structure can be expressed using formula (2).

(2)
$ m=\left(m_{1},m_{2},\ldots ,m_{l}\right) $

where $l$ is the number of algorithms $m_{1},m_{2},\ldots ,m_{l}$ that form the pipeline in turn. In a machine-learning pipeline, the input data is $\left\langle F,y\right\rangle$, where $F$ is the input feature set and $y$ is the corresponding data label. $F$ can be represented by two sets, namely the discrete features $f_{1}$ and the continuous features $f_{2}$. According to the above, the machine-learning pipeline can be designed as shown in Fig. 1.

In Fig. 1, $M_{d1},M_{d2},M_{d3},M_{f},M_{c}$ represent the set of algorithms for preprocessing discrete data in the machine-learning pipeline, the set of algorithms capable of preprocessing discrete and continuous data simultaneously, the set of algorithms for preprocessing continuous data, the set of feature preprocessing algorithms, and the set of classification algorithms, respectively.
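To make the CASH objective in formula (1) concrete, the following minimal Python sketch selects among a few candidate algorithms by training each on $D_{1}$ and comparing their losses on $D_{2}$ (hyperparameter tuning is omitted for brevity). The candidate set, data set, and loss definition are illustrative assumptions, not the configuration used in this paper.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# D1 (training set) and D2 (test set) are disjoint subsets of the data set.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {  # A = {A_1, ..., A_n}, an assumed candidate set
    "rf": RandomForestClassifier(random_state=0),
    "svc": SVC(),
    "knn": KNeighborsClassifier(),
}

def loss(model):
    """L(A_i, D1, D2): misclassification rate on D2 after training on D1."""
    model.fit(X1, y1)
    return 1.0 - model.score(X2, y2)

# A_i in argmin_{A_i in A} L(A_i, D1, D2), as in formula (1)
best = min(candidates, key=lambda name: loss(candidates[name]))
print("selected algorithm:", best)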

Fig. 1. Machine-learning pipeline feature transformation.

3.2 Machine-learning Pipeline Structure Search Based on Reinforcement Learning

To realize machine-learning automation, it is necessary to determine the machine-learning pipeline structure and, at the same time, optimize the hyperparameters corresponding to that structure. This study proposes an Auto-PLD algorithm framework consisting of two parts, dividing the machine-learning pipeline design problem into two sub-problems, as shown in Fig. 2.

In Fig. 2, the two stages, A and B, are optimized alternately to realize the simultaneous optimization of the machine-learning pipeline structure and hyperparameters. The training process of reinforcement learning is a process of continuous interaction between the agent and the environment. In this process, the decision-making strategy is updated through the interaction information, and the agent continues to act. The essence of reinforcement learning is that the agent maximizes its reward through its decision-making strategy, as shown in Fig. 3.

The workflow of stage A is essentially the same as that of reinforcement learning. Because stage A has the Markov property, it can be modeled as a reinforcement learning problem, and reinforcement learning is used to determine the pipeline structure. The goal of stage A is to find a sequence of the form in formula (2). Therefore, the state space of reinforcement learning can be determined using formula (3).

(3)
$ M=M_{d1}\times M_{d2}\times M_{d3}\times M_{f}\times M_{c} $

The study proposes a 0-1 sequence to represent the state space. A coding table combines all the algorithms into a unique sequence, and each bit in the sequence represents one algorithm. A bit of 0 indicates that the algorithm at that position is not selected; a bit of 1 indicates that it is selected. The set of states in reinforcement learning is denoted by $S$. One bit is added at the end of the sequence to express the terminal state more intuitively.
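A minimal sketch of this 0-1 encoding is given below; the coding-table entries and the size of each algorithm set are assumptions chosen only for illustration.

# Illustrative 0-1 sequence encoding of a pipeline structure. One coding
# table lists every algorithm in M_d1, M_d2, M_d3, M_f, and M_c in a fixed
# order; bit i is 1 iff algorithm i is selected, and a final bit marks the
# terminal state.
CODING_TABLE = [
    "onehot",          # M_d1: discrete-data preprocessing
    "imputer",         # M_d2: discrete and continuous preprocessing
    "scaler",          # M_d3: continuous-data preprocessing
    "pca",             # M_f : feature preprocessing
    "random_forest",   # M_c : classification
]

def encode(selected, terminated=False):
    """Map a set of selected algorithms to the 0-1 state sequence."""
    bits = [1 if name in selected else 0 for name in CODING_TABLE]
    bits.append(1 if terminated else 0)  # extra terminal-state bit
    return tuple(bits)

def decode(state):
    """Recover the pipeline structure from a 0-1 state sequence."""
    return [name for name, b in zip(CODING_TABLE, state[:-1]) if b == 1]

s = encode({"imputer", "pca", "random_forest"}, terminated=True)
print(s)          # (0, 1, 0, 1, 1, 1)
print(decode(s))  # ['imputer', 'pca', 'random_forest']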

In summary, the length of the 0-1 sequence is $\left| M_{d1}\cup M_{d2}\cup M_{d3}\cup M_{f}\cup M_{c}\right| +1$. The sequence structure proposed in the research corresponds directly to the machine-learning pipeline structure and has the advantages of low dimensionality, constant length, and simple implementation. In the problem corresponding to stage A, the action space of reinforcement learning contains two kinds of actions: selecting an algorithm and evaluating the entire pipeline. $X$ denotes the set of actions. By executing an action, the agent realizes a state transition. To avoid unreasonable pipeline structures, different candidate action sets need to be designed for different states. From the definition of the state space, the algorithm corresponding to the last 1 in the 0-1 sequence is known; it is denoted $m_{s}^{last}$, the last algorithm in the pipeline corresponding to state $s\in S$. $s_{0}$ is the start state, in which the sequence is all 0. $s_{e}$ is the terminal state, in which the last bit of the sequence is 1. The other state sets are defined in formula (4).

(4)
$ \left\{\begin{array}{l} S_{d1}=\left\{s\left| m_{s}^{last}\in M_{d1},\forall s\in S\right.\right\}\\ S_{d2}=\left\{s\left| m_{s}^{last}\in M_{d2},\forall s\in S\right.\right\}\\ S_{d3}=\left\{s\left| m_{s}^{last}\in M_{d3},\forall s\in S\right.\right\}\\ S_{f}=\left\{s\left| m_{s}^{last}\in M_{f},\forall s\in S\right.\right\}\\ S_{c}=\left\{s\left| m_{s}^{last}\in M_{c},\forall s\in S\right.\right\} \end{array}\right. $

Let $X_{s}$ denote the set of possible actions of the agent in state $s$, and let $a_{e}$ denote the action of evaluating the machine-learning pipeline. Then $X_{s}$ is given by formula (5).

(5)
$ X_{s}=\left\{\begin{array}{ll} M_{d1}\cup M_{d2}\cup M_{d3}\cup M_{f}\cup M_{c}, & if\,s=s_{0}\\ M_{d2}\cup M_{d3}\cup M_{f}\cup M_{c}, & if\,s\in S_{d1}\\ M_{d3}\cup M_{f}\cup M_{c}, & if\,s\in S_{d2}\\ M_{f}\cup M_{c}, & if\,s\in S_{d3}\\ M_{c}, & if\,s\in S_{f}\\ \left\{a_{e}\right\}, & if\,s\in S_{c}\\ \emptyset , & if\,s=s_{e} \end{array}\right. $
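The sketch below mirrors formula (5) under the illustrative coding table introduced above; the stage labels and the evaluation-action symbol are assumptions used only for demonstration.

# State-dependent candidate action sets X_s, following formula (5): after an
# algorithm from one stage is chosen, only algorithms from later stages (or,
# once a classifier is chosen, the evaluation action a_e) remain available.
STAGE_OF = {"onehot": "d1", "imputer": "d2", "scaler": "d3",
            "pca": "f", "random_forest": "c"}
ORDER = ["d1", "d2", "d3", "f", "c"]
A_EVAL = "a_e"  # evaluate the whole pipeline

def actions(pipeline, terminated):
    """Return X_s for the state encoding the given partial pipeline."""
    if terminated:                        # s = s_e: no further actions
        return set()
    if not pipeline:                      # s = s_0: any algorithm may start
        return set(STAGE_OF)
    last_stage = STAGE_OF[pipeline[-1]]   # stage of m_s^last
    if last_stage == "c":
        return {A_EVAL}                   # classifier chosen: only evaluate
    later = ORDER[ORDER.index(last_stage) + 1:]
    return {a for a, st in STAGE_OF.items() if st in later}

print(actions([], False))                            # all five algorithms
print(actions(["imputer"], False))                   # scaler, pca, random_forest
print(actions(["imputer", "random_forest"], False))  # {'a_e'}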

In reinforcement learning, the reward function describes the effect of the agent's actions in the environment, and training maximizes the cumulative reward. In the machine-learning pipeline structure search, the research uses performance evaluation as the reference index of the reward value. The performance of a machine-learning pipeline is closely related to the selection of hyperparameters. To minimize the noise introduced by different hyperparameter configurations, the reward value in stage A is defined as the best performance evaluated so far for the pipeline structure $s$. Thus, the reward function in stage A is defined in formula (6).

(6)
$ r_{s}=\left\{\begin{array}{ll} \max \left(r_{s},r_{s}^{now}\right), & s_{-}=s'\\ 0, & s_{-}=s'' \end{array}\right. $

In formula (6), the initial value of $r_{s}$ is 0. $s_{-}$ represents the next state. $s'$ and $s''$ are the terminal state and a non-terminal state, respectively. $r_{s}^{now}$ is the current performance of the machine-learning pipeline with structure $s$. Based on the above, the machine-learning pipeline structure search is completed.
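The sketch below illustrates the reward of formula (6) together with a tabular Q-learning update (Q-learning is the reinforcement learning method adopted in the experiments of Section 4). Here, evaluate_pipeline() is a hypothetical stand-in for configuring and scoring a pipeline with structure s, and the learning rate and discount factor are assumed values.

from collections import defaultdict

best_reward = defaultdict(float)   # r_s per structure, initialised to 0
Q = defaultdict(float)             # Q-values over (state, action) pairs
ALPHA, GAMMA = 0.1, 0.95           # assumed learning rate and discount

def reward(s, next_is_terminal, evaluate_pipeline):
    """Formula (6): best score seen so far at termination, 0 otherwise."""
    if not next_is_terminal:       # s_ = s'' (non-terminal next state)
        return 0.0
    r_now = evaluate_pipeline(s)   # r_s^now: current pipeline performance
    best_reward[s] = max(best_reward[s], r_now)
    return best_reward[s]          # r_s = max(r_s, r_s^now)

def q_update(s, a, r, s_next, next_actions):
    """One tabular Q-learning step for the structure-search agent."""
    target = r + GAMMA * max((Q[(s_next, a2)] for a2 in next_actions),
                             default=0.0)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])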

Fig. 2. Auto-PLD framework.
Fig. 3. Basic process of reinforcement learning.

3.3 Bayesian-based Machine-learning Pipeline Hyperparameter Optimization

Assume the machine-learning pipeline structure is $m=\left(m_{1},m_{2},\ldots ,m_{l}\right)$; a set of optimal hyperparameters then needs to be configured in its hyperparameter space $\theta _{1},\theta _{2},\ldots ,\theta _{l}$. Following the SMBO (sequential model-based global optimization) algorithm framework, the research proposes a public Bayesian model to optimize the hyperparameters under different machine-learning pipeline structures. Bayesian optimization is a common and effective global optimization algorithm that obtains the global optimum by calculating the extreme value of the objective function, as shown in formula (7).

(7)
$ x=\arg \max _{x\in \chi }f\left(x\right) $

where $\chi $ is the search space, $f\left(\cdot \right)$ is the objective function, and $x$ is a given query point. In the automatic design of machine-learning pipelines, hyperparameter optimization is the problem of optimizing the loss function over the hyperparameter space, so a hyperparameter space for Bayesian optimization must be defined. The hyperparameter space should meet four basic requirements: support for integer and floating-point parameters, support for categorical parameters, support for conditional parameters, and support for forbidden clauses. The research uses a 0-1 sequence to represent the state space of reinforcement learning, with each bit representing one algorithm. Each bit of the sequence is therefore treated as a categorical parameter with optional values 0 and 1. When a bit takes the value 1, the hyperparameter space of the algorithm represented by that bit becomes one of the components of the machine-learning pipeline hyperparameter space. As shown in Fig. 4, the machine-learning pipeline structure search is completed through reinforcement learning, which determines the 0-1 sequence and the pipeline structure. When a bit in the sequence takes the value 0, the hyperparameter space corresponding to its child node is defined as None. When a bit takes the value 1, the child node holds the hyperparameter space of the algorithm at that position; for AdaBoost, for example, the hyperparameters are the learning rate, the number of estimators, and the maximum depth.
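A minimal sketch of this conditional hyperparameter space is shown below; the dictionary layout and the AdaBoost parameter ranges are assumptions for illustration only.

# Conditional hyperparameter space keyed on the 0-1 sequence: each bit is a
# categorical parameter with values {0, 1}; a bit of 1 activates the child
# hyperparameter space of the algorithm at that position, otherwise None.
SPACE = {
    "adaboost": {                       # child space, active only if bit = 1
        "learning_rate": ("float", 0.01, 1.0),   # assumed range
        "n_estimators": ("int", 50, 500),        # assumed range
        "max_depth": ("int", 1, 10),             # assumed range
    },
}

def active_space(coding_table, bits):
    """Return the conditional hyperparameter space induced by a 0-1 sequence."""
    space = {}
    for name, bit in zip(coding_table, bits):
        # condition: the child space exists only when its parent bit is 1
        space[name] = SPACE.get(name) if bit == 1 else None
    return space

print(active_space(["adaboost"], [1]))  # AdaBoost subspace is activated
print(active_space(["adaboost"], [0]))  # {'adaboost': None}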

Through the above, a complete mapping of the machine-learning pipeline structure into the public hyperparameter space is obtained. In the SMBO framework, the core component is the proxy (surrogate) model. Compared with other models, the Gaussian process is more flexible in representing distributions over functions, so it is usually selected as the Bayesian proxy model. However, a proxy model constructed this way depends heavily on a parameterized kernel function that is suitable only for continuous hyperparameters, so its effect in the automatic design of machine-learning pipelines is not ideal. This study therefore proposes a weighted Hamming distance kernel function to optimize the proxy model so that categorical parameters are handled better. The method still uses a Gaussian process to construct the proxy model but measures the similarity of categorical parameters with a weighted Hamming distance, yielding the combined kernel function in formula (8).

(8)
$ k_{\textit{mixed}}\left(\theta _{i},\theta _{j}\right)=\exp \left[\sum _{l\in P_{cont}}\left(-\lambda _{l}\left(\theta _{i,l}-\theta _{j,l}\right)^{2}\right)+\sum _{l\in P_{cat}}\left(-\lambda _{l}\left[1-\delta \left(\theta _{i,l},\theta _{j,l}\right)\right]\right)\right]$

where $k_{\textit{mixed}}\left(\cdot \right)$ is the combined kernel function, $P_{cont}$ and $P_{cat}$ are the continuous numerical parameter set and the categorical parameter set, respectively, $\delta \left(\cdot \right)$ is the Kronecker delta function, and $\lambda _{l}$ is the kernel weight of the $l$-th parameter. Using a Gaussian process as the proxy model is computationally complex and time-consuming, so this study uses the random forest algorithm as the proxy model instead. Its advantages are a small computational load and a short processing time, making it more suitable for machine-learning pipeline design. After the proxy model is determined, Expected Improvement (EI) is used as the Bayesian acquisition function, and the hyperparameters of the machine-learning pipeline are finally determined. Based on the above, the Auto-PLD algorithm framework is constructed to realize the automatic design of the machine-learning pipeline.
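The following sketch implements the mixed kernel of formula (8) and an Expected Improvement acquisition for a maximization objective; the parameter indexing, the lambda weights, and the EI exploration constant xi are illustrative assumptions rather than the paper's exact settings.

import numpy as np
from scipy.stats import norm

def k_mixed(theta_i, theta_j, cont_idx, cat_idx, lam):
    """Formula (8): squared-exponential over P_cont, weighted Hamming over P_cat."""
    cont = sum(-lam[l] * (theta_i[l] - theta_j[l]) ** 2 for l in cont_idx)
    # Kronecker delta: 1 when the categorical values match, else 0
    cat = sum(-lam[l] * (1.0 - float(theta_i[l] == theta_j[l])) for l in cat_idx)
    return np.exp(cont + cat)

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI from the surrogate's posterior mean/std, for maximization."""
    sigma = np.maximum(sigma, 1e-12)    # guard against zero variance
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

theta_a = [0.10, 200, "gini"]     # mixed configuration: floats, ints, category
theta_b = [0.05, 180, "entropy"]
lam = [1.0, 1e-4, 0.5]            # assumed per-parameter weights
print(k_mixed(theta_a, theta_b, cont_idx=[0, 1], cat_idx=[2], lam=lam))
print(expected_improvement(mu=0.85, sigma=0.05, f_best=0.84))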

Fig. 4. Machine-learning pipeline hyperparameter space.

4. Performance Evaluation of the Auto-PLD Algorithm

Comparative experiments were designed and conducted to evaluate the performance of the Auto-PLD algorithm framework based on the Bayesian model. Table 1 lists the experimental environment.

The 10 datasets used in the experiment were all classification task datasets from OpenML-CC18. Approximately 70% of the samples in each dataset were used as the training set; the remaining 30% were used as the testing set. Each experiment was run 10 times, and the average was taken as the final result. In the Auto-PLD framework, Q-learning was adopted as the reinforcement learning method. In addition to the proposed method, two methods, Auto-sklearn and Auto-PLD-random, were constructed for comparison. The meta-learning of Auto-sklearn was pre-trained on large-scale public data sets that contained the data used in the experiment, so the meta-learning function was turned off to reduce experimental error. In Auto-PLD-random, the reinforcement learning and Bayesian optimization components were replaced with random methods, i.e., the structure and hyperparameter configuration of the machine-learning pipeline were entirely random. The performance of the three methods was compared using balanced accuracy as the evaluation index with time budgets of 1 h, 4 h, and 8 h on each test set.
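A minimal sketch of this evaluation protocol is shown below, with a stand-in dataset and classifier assumed in place of the OpenML-CC18 tasks and the searched pipelines.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
scores = []
for run in range(10):                            # each experiment ran 10 times
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=run)   # ~70% train / 30% test
    model = RandomForestClassifier(random_state=run).fit(X_tr, y_tr)
    scores.append(balanced_accuracy_score(y_te, model.predict(X_te)))
print("mean balanced accuracy:", np.mean(scores))  # average as the final result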

As shown in Table 2, Auto-sklearn performed best when the time budget was 1 h. Its average balanced accuracy was 0.840, which was 0.005 higher than that of Auto-PLD-random and 0.009 higher than that of Auto-PLD. Auto-PLD performed best when the time budget was 4 h. Its average balanced accuracy was 0.842, which was 0.003 higher than that of Auto-PLD-random and 0.001 higher than that of Auto-sklearn. When the time budget was 8 h, Auto-PLD again performed best, with an average balanced accuracy of 0.845, which was 0.007 higher than that of Auto-PLD-random and 0.003 higher than that of Auto-sklearn. Auto-sklearn had an advantage under a small time budget because Auto-PLD must first search the machine-learning pipeline structure thoroughly, which requires accumulating a sufficient number of training samples for reinforcement learning. Once the time budget increased, the performance of Auto-PLD was significantly better than that of Auto-sklearn and Auto-PLD-random. These results verify the performance of Auto-PLD.

The machine-learning pipeline evaluation success rates of the three methods on each data set were compared, as shown in Table 3. Different time budgets had little impact on the success rate. Auto-PLD had the highest success rate, exceeding 92%. The success rate of Auto-sklearn was slightly lower, exceeding 91%. Auto-PLD-random had the lowest success rate, approximately 81%. This showed that adopting the search strategy proposed in this study improves the quality of the machine-learning pipelines evaluated during the search.

The average number of machine-learning pipelines attempted per hour by the three methods under different time budgets was compared, as shown in Fig. 5. Auto-PLD-random attempted the most pipelines per hour, exceeding 140,000. The average numbers of attempts per hour for Auto-PLD and Auto-sklearn were comparable, between 30,000 and 50,000. When the time budget was 8 h, the average number of attempts per hour of Auto-PLD was 5034 fewer than that of Auto-sklearn.

The number of times each algorithm occurred in the optimal machine-learning pipelines found by the three methods was compared, as shown in Table 4. No single algorithm was optimal for all problems; different problems require different algorithms to obtain the optimal solution. Therefore, it is necessary to ensure the diversity of the AutoML algorithm library.

Fig. 6 shows how the performance of the best machine-learning pipeline changed over time on the dataset with id 14. With time budgets of 1 h, 4 h, and 8 h, the balanced accuracy values of Auto-PLD were 0.849, 0.858, and 0.863, respectively, higher than those of the other two methods.

To avoid chance effects in the experimental results, the performance of the best machine-learning pipeline over time was also tested on the dataset with id 307, as shown in Fig. 7. With time budgets of 1 h, 4 h, and 8 h, the balanced accuracy values of Auto-PLD were 0.982, 0.985, and 0.987, respectively, higher than those of the other two methods. These results indicate that Auto-PLD performs better. In summary, Auto-PLD, based on reinforcement learning and the Bayesian model, performed well in the automatic design of machine-learning pipelines.

Fig. 5. Average number of machine-learning pipeline attempts per hour under different time budgets.
Fig. 6. Time-varying performance of the best machine-learning pipeline on dataset id 14.
Fig. 7. Time-varying performance of the best machine-learning pipeline on dataset id 307.
Table 1. Experimental environment.

Project | Configuration information
Operating system | CentOS 7.0
CPU | 2x Intel(R) Xeon(R) E5-2620 v3 @ 2.40 GHz (6C 12T 3.19 GHz, 3.2 GHz IMC, 6x 256 kB L2, 15 MB L3)
Memory | 64 GB (8x 8 GB), 1866 MHz
Hard disk | 2x 2 TB, 3.5-inch, RAID-1
Programming language | Python 3.6.10
Machine-learning library | Scikit-learn 0.21.3
Ray | Ray 0.8.2
Network | 1 Gb/s Ethernet

Table 2. Balanced accuracy values of the three methods (columns 11-1480 are OpenML dataset ids).

Time budget | Method | 11 | 14 | 18 | 31 | 50 | 54 | 307 | 1053 | 1461 | 1480 | Average
1 h | Auto-PLD | 0.974 | 0.847 | 0.745 | 0.719 | 1.000 | 0.826 | 0.980 | 0.676 | 0.853 | 0.689 | 0.831
1 h | Auto-PLD-random | 1.000 | 0.855 | 0.746 | 0.719 | 1.000 | 0.833 | 0.985 | 0.675 | 0.852 | 0.685 | 0.835
1 h | Auto-sklearn | 1.000 | 0.849 | 0.753 | 0.736 | 0.997 | 0.840 | 0.981 | 0.683 | 0.854 | 0.703 | 0.840
4 h | Auto-PLD | 0.987 | 0.857 | 0.748 | 0.736 | 1.000 | 0.857 | 0.979 | 0.681 | 0.861 | 0.716 | 0.842
4 h | Auto-PLD-random | 1.000 | 0.859 | 0.747 | 0.728 | 1.000 | 0.841 | 0.982 | 0.675 | 0.860 | 0.693 | 0.839
4 h | Auto-sklearn | 1.000 | 0.865 | 0.755 | 0.728 | 1.000 | 0.844 | 0.983 | 0.678 | 0.856 | 0.699 | 0.841
8 h | Auto-PLD | 1.000 | 0.858 | 0.756 | 0.733 | 1.000 | 0.856 | 0.986 | 0.682 | 0.863 | 0.712 | 0.845
8 h | Auto-PLD-random | 1.000 | 0.855 | 0.750 | 0.730 | 1.000 | 0.833 | 0.984 | 0.678 | 0.862 | 0.705 | 0.838
8 h | Auto-sklearn | 1.000 | 0.876 | 0.755 | 0.730 | 0.999 | 0.836 | 0.979 | 0.678 | 0.858 | 0.709 | 0.842

Table 3. Machine-learning pipeline evaluation success rates of the three methods (%; columns 11-1480 are OpenML dataset ids).

Time budget | Method | 11 | 14 | 18 | 31 | 50 | 54 | 307 | 1053 | 1461 | 1480 | Average
1 h | Auto-PLD | 92.42 | 91.05 | 94.34 | 95.13 | 93.02 | 91.77 | 92.58 | 90.26 | 89.73 | 92.06 | 92.24
1 h | Auto-PLD-random | 83.07 | 80.16 | 79.23 | 85.14 | 80.03 | 78.17 | 82.14 | 81.33 | 80.25 | 83.27 | 81.28
1 h | Auto-sklearn | 90.05 | 92.34 | 93.46 | 91.08 | 88.42 | 89.73 | 92.05 | 91.44 | 93.02 | 90.08 | 91.17
4 h | Auto-PLD | 93.26 | 90.58 | 93.49 | 96.22 | 91.05 | 92.38 | 91.46 | 91.04 | 89.42 | 93.05 | 92.20
4 h | Auto-PLD-random | 81.85 | 82.34 | 80.96 | 86.33 | 81.02 | 77.93 | 83.18 | 82.71 | 81.04 | 82.98 | 82.03
4 h | Auto-sklearn | 91.04 | 91.58 | 94.07 | 90.96 | 89.14 | 90.13 | 90.25 | 91.46 | 91.05 | 90.17 | 90.99
8 h | Auto-PLD | 92.84 | 91.03 | 92.45 | 95.66 | 90.72 | 93.11 | 92.45 | 90.01 | 90.42 | 92.17 | 92.07
8 h | Auto-PLD-random | 80.72 | 81.08 | 79.34 | 87.25 | 82.33 | 78.96 | 82.15 | 83.44 | 80.42 | 84.03 | 81.97
8 h | Auto-sklearn | 90.05 | 92.64 | 95.41 | 89.46 | 88.74 | 92.08 | 91.42 | 90.05 | 90.44 | 89.53 | 90.98

Table 4. Number of algorithm occurrences in the optimal machine-learning pipeline.

Time budget | Method | AdaBoost | Bernoulli NB | ExtraTrees | GBDT | Gaussian NB | KNeighbors | LinearSVC | SGD | SVC | RF
1 h | Auto-PLD | 14 | 0 | 21 | 16 | 0 | 4 | 9 | 1 | 26 | 26
1 h | Auto-PLD-random | 10 | 1 | 17 | 32 | 0 | 4 | 11 | 1 | 25 | 20
1 h | Auto-sklearn | 13 | 0 | 20 | 7 | 2 | 10 | 14 | 2 | 14 | 27
4 h | Auto-PLD | 9 | 0 | 17 | 26 | 0 | 4 | 9 | 1 | 27 | 21
4 h | Auto-PLD-random | 5 | 0 | 16 | 32 | 0 | 6 | 10 | 0 | 33 | 17
4 h | Auto-sklearn | 20 | 0 | 24 | 8 | 0 | 9 | 8 | 2 | 11 | 23
8 h | Auto-PLD | 4 | 0 | 24 | 33 | 0 | 3 | 8 | 1 | 30 | 20
8 h | Auto-PLD-random | 8 | 0 | 18 | 33 | 2 | 6 | 10 | 0 | 30 | 18
8 h | Auto-sklearn | 17 | 1 | 24 | 7 | 0 | 6 | 14 | 2 | 15 | 25

5. Conclusion

Automated machine learning is a technology that uses machines to replace manual model selection and parameter optimization. It can automate model design and improve modeling speed and performance, and machine-learning pipeline automation design is integral to it. Taking the classic classification problem as an example, this study completed the machine-learning pipeline structure search based on reinforcement learning, realized the optimal configuration of hyperparameters based on the Bayesian network, and proposed the Auto-PLD algorithm framework. Auto-PLD was tested, and the experimental results showed that with a time budget of four hours, the average balanced accuracy of Auto-PLD was 0.842, which was 0.003 higher than that of Auto-PLD-random and 0.001 higher than that of Auto-sklearn. With a time budget of eight hours, the average balanced accuracy of Auto-PLD was 0.845, which was 0.007 higher than that of Auto-PLD-random and 0.003 higher than that of Auto-sklearn. Under different time budgets, Auto-PLD had the highest machine-learning pipeline evaluation success rate on the various datasets, exceeding 92%. With a budget of eight hours, the average number of machine-learning pipeline attempts per hour of Auto-PLD was 5034 fewer than that of Auto-sklearn. On the dataset with id 14, with time budgets of one, four, and eight hours, the balanced accuracy values of Auto-PLD were 0.849, 0.858, and 0.863, respectively, higher than those of the other two methods. On the dataset with id 307, with the same time budgets, the balanced accuracy values of Auto-PLD were 0.982, 0.985, and 0.987, respectively, again higher than those of the other two methods. In summary, the Auto-PLD proposed in this study performs well and has essential applications in the automated design of machine-learning pipelines. The scale of the data used in the experiments is limited, which may cause certain experimental errors; subsequent work should therefore expand the scale and number of datasets and conduct more experimental tests to reduce the influence of accidental factors.

6. Funding

The research is supported by Zibo Key Research and Development Program (city school-city integration) project “Building an integrated platform for industry-academia-research based on digital twin technology to empower Zibo’s digital economy” (No. 2021SNPT0055).

REFERENCES

[1] Tsiakmaki, M., Kostopoulos, G., Kotsiantis, S., Ragos, O., "Fuzzy-based active learning for predicting student academic performance using autoML: a step-wise approach," Journal of Computing in Higher Education, vol. 33, no. 3, pp. 635-667, 2021.
[2] Gupta, G., Katarya, R., "EnPSO: An AutoML technique for generating ensemble recommender system," Arabian Journal for Science and Engineering, vol. 46, no. 9, pp. 8677-8695, 2021.
[3] Roman, D., Saxena, S., Robu, V., Pecht, M., Flynn, D., "Machine learning pipeline for battery state-of-health estimation," Nature Machine Intelligence, vol. 3, no. 5, pp. 447-456, 2021.
[4] Ajirlou, A. F., Partin-Vaisband, I., "A machine learning pipeline stage for adaptive frequency adjustment," IEEE Transactions on Computers, vol. 71, no. 3, pp. 587-598, 2021.
[5] Tan, H. B., Xiong, F., Jiang, Y. L., Huang, W. C., Wang, Y., Li, H. H., You, T., Fu, T. T., Peng, B. W., "The study of automatic machine learning base on radiomics of non-focus area in the first chest CT of different clinical types of COVID-19 pneumonia," Scientific Reports, vol. 10, no. 1, pp. 1-10, 2020.
[6] Alsharef, A., Aggarwal, K., Kumar, M., Mishra, "A Review of ML and AutoML solutions to forecast time-series data," Archives of Computational Methods in Engineering, vol. 29, pp. 5297-5311, 2022.
[7] Wever, M., Tornede, A., Mohr, F., Hüllermeier, E., "AutoML for multi-label classification: Overview and empirical evaluation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 9, pp. 3037-3054, 2021.
[8] Baudart, G., Hirzel, M., Kate, K., Ram, R., Shinnar, A., Tsay, J., "Pipeline combinators for gradual AutoML," Advances in Neural Information Processing Systems, vol. 34, pp. 19705-19718, 2021.
[9] Li, Y., Shen, Y., Zhang, W., Zhang, C., Cui, B., "VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition," The VLDB Journal, vol. 2022, pp. 218-218, 2022.
[10] Zogaj, F., Cambronero, J. P., Rinard, M. C., Cito, J., "Doing more with less: characterizing dataset downsampling for AutoML," Proceedings of the VLDB Endowment, vol. 14, no. 11, pp. 2059-2072, 2021.
[11] Li, Z., Guo, H., Wang, W. M., Guan, Y. J., Barenji, A. V., Huang, G. Q., McFall, K. S., Chen, X., "A blockchain and AutoML approach for open and automated customer service," IEEE Transactions on Industrial Informatics, vol. 15, no. 6, pp. 3642-3651, 2019.
[12] Yakovlev, A., Moghadam, H. F., Moharrer, A., Cai, J. X., Chavoshi, N., Varadarajan, V., Agrawal, S. R., Idicula, S., Karnagel, T., Jinturkar, S., Agarwal, N., "Oracle AutoML: a fast and predictive AutoML pipeline," Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 3166-3180, 2020.
[13] Alade, I. O., Abd Rahman, M. A., Saleh, T. A., "Predicting the specific heat capacity of alumina/ethylene glycol nanofluids using support vector regression model optimized with Bayesian algorithm," Solar Energy, vol. 183, pp. 74-82, 2019.
[14] Scanagatta, M., Salmerón, A., Stella, F., "A survey on Bayesian network structure learning from data," Progress in Artificial Intelligence, vol. 8, no. 4, pp. 425-439, 2019.
[15] Yao, Y., Allardyce, B. J., Rajkhowa, R., Hegh, D., Sutti, A., Subianto, S., Rana, S., Greenhill, S., Greenhill, X. G., Greenhill, J. M., "Improving the tensile properties of wet spun silk fibers using rapid Bayesian algorithm," ACS Biomaterials Science & Engineering, vol. 6, no. 5, pp. 3197-3207, 2020.
[16] Maheswari, S., Pitchai, R., "Heart disease prediction system using decision tree and naive Bayes algorithm," Current Medical Imaging, vol. 15, no. 8, pp. 712-717, 2019.
[17] Joseph, G., Murthy, C. R., "On the convergence of a Bayesian algorithm for joint dictionary learning and sparse recovery," IEEE Transactions on Signal Processing, vol. 68, pp. 343-358, 2019.
[18] Mat, S. R. T., Ab Razak, M. F., Kahar, M. N. M., Arif, J. M., Firdaus, A., "A Bayesian probability model for Android malware detection," ICT Express, vol. 8, no. 3, pp. 424-431, 2022.
[19] Salvato, M., Buchner, J., Budavári, T., Dwelly, T., Merloni, A., Brusa, M., Rau, A., Fotopoulou, S., Nandra, K., "Finding counterparts for all-sky X-ray surveys with NWAY: a Bayesian algorithm for cross-matching multiple catalogs," Monthly Notices of the Royal Astronomical Society, vol. 473, no. 4, pp. 4937-4955, 2018.
[20] Liu, Y., Mace, G. G., "Assessing synergistic radar and radiometer capability in retrieving ice cloud microphysics based on hybrid Bayesian algorithms," Atmospheric Measurement Techniques, vol. 15, no. 4, pp. 927-944, 2022.

Author

Qi Wang

Qi Wang is a lecturer at Shandong Vocational College of Industry and a member of the Shandong Electronics Society. He received a bachelor's degree in computer science and technology from Shandong University of Technology in 2005 and a master's degree in software engineering from the University of Electronic Science and Technology in 2008. He holds the title of Zibo Technical Expert and is a Huawei HCIP holder, an H3C certified lecturer, and a Microsoft MCSE. He has guided students to first prize in the virtual reality competition of the National Vocational College Skills Competition. His main research fields are network communication protocols, graphics, edge computing, and artificial intelligence.