Smart tourism is of great significance in society, and its core problem is how to obtain and use tourism-related information efficiently to provide a better tourism experience. This paper proposes a data mining method based on the Apriori association rule algorithm to address the difficulty of searching complex and diverse tourism information. Operator data serve as the data source, and the Apriori association rule algorithm serves as the foundation of a smart tourism information search method. The method is constrained by the order in which tourists visit different tourist locations, and multithreaded parallel computation is achieved through a parallel computing framework. The experimental results show that the initial accuracy of the proposed method in mining data types reached 97.8%. When testing the number of association rules, the proposed method produced only 2317 association rules at a support level of 0.032. On large-scale datasets, the proposed method had a runtime of only 13.6 ks when processing 50M data items, which was lower than that of the other methods. Hence, the proposed method can search smart tourism information effectively, with high search efficiency and data accuracy.


## 1. Introduction

The tourism industry is undergoing unprecedented changes with the rise of smart tourism and the popularization of the Internet. The large amount of data tourists generate during their travels, such as location information and search history, provides enormous opportunities for better travel experiences and personalized services (Pop et al., 2022; Alyahya and McLean, 2022). On the other hand, the tourism field is flooded with information, which makes it difficult for tourists and tourism industry practitioners to process and filter it and leads to information overload. Moreover, tourism information is scattered across various channels and platforms, including social media, tourism websites, and applications, and integrating it into a comprehensive travel guide is also a challenge. As a part of tourism information, operator data include user location data and mobile device usage, which can be used to analyze user behavior and needs. The Apriori algorithm is a commonly used method for mining association rules and can discover correlations in data, such as tourists’ preferences and behavior habits. With the Apriori algorithm, a system can quickly identify frequently occurring itemsets in massive data and generate recommendation rules. Compared to traditional recommendation algorithms, this approach can improve the speed and accuracy of information searches for tourist attractions. Conditional constraints allow the mining results to be screened and filtered further to improve the accuracy of the information, and a parallel computing framework can accelerate data processing and analysis and improve real-time performance. In this context, this study proposes a smart tourism information search method based on data mining (DM) to provide a feasible technical reference for the tourism industry.

The paper consists of four main parts. The first part reviews current research on tourism information and DM technologies. The second part presents the design of the DM-based information search method. The third part analyzes the effectiveness of the proposed method. The fourth part summarizes the study.

## 2. Related Works

As the tourism industry expands, processing technology and content analysis methods related to tourism information have become crucial auxiliary decision-making methods for the tourism industry. Some researchers have evaluated tourism information. Doborjeh et al. (2022) proposed an artificial intelligence-based method for collecting data and information on tourism and hotels, targeting demand forecasting and behavioral pattern-related data, and developing novel personalized prediction frameworks. The experimental results showed that the proposed method had good collection efficiency. Stylos et al. (2021) proposed a method based on big data analysis to address the issue of information processing in the tourism industry. During the process, retrospective methods were used to develop common themes and dimensions, and predictive models with certain agility were set up. The experimental results showed that the proposed method has good processing performance. Hu and Yang (2021) proposed a multi-factor-based analysis strategy for analyzing online comment information in the tourism industry. During the process, a fundamental element analysis was conducted based on the final utility, and the corrected average effect of important factors was calculated. The experimental results indicated that the proposed method could provide effective analysis results.

Lin et al. (2021) proposed a method based on collection and statistical techniques for processing tourism information. Regression and relevant data analysis were conducted using structural equation software to evaluate the development connections between relevant enterprises and society. The experimental results showed that the proposed method had good processing performance.

Chen et al. (2022) proposed a method based on sequential mixing for analyzing honeymoon tourism information. During the process, the conceptual framework was based on the process from stimulus to response, analyzing the emotional driving force of destination attributes in the population. The proposed method could provide effective analysis results.

Some researchers conducted relevant research on DM. Sunhare et al. (2022) proposed an information conversion technology based on DM to optimize the device interaction process in the Internet of Things. Based on cloud technology, the process collects data from heterogeneous environments, transforms the data into knowledge information for resource and system performance optimization, and analyzes the contribution of cloud-assisted technology to big DM. The proposed method produced good optimization performance.

Schorn et al. (2021) proposed a DM-based method for analyzing specific metabolic diversity. A paired omics data platform was established to achieve quick access to data, and connections were established between different data for unsupervised mining strategy integration. The experimental results indicated that the proposed method had strong operability.

Savaglio and Fortino (2021) proposed a DM-based approach for task objectives, such as accuracy in IoT scenarios. Inspired by software engineering principles, a simulation-driven method was constructed to predict dynamic and constrained scenarios. Intelligent monitoring applications were used for case studies. The proposed method had multiple advantages. Yates and Islam (2022) proposed a DM-based method for local execution model training tasks in smartphone devices. The method analyzed the intersection of data in smartphones and conducted performance research based on the usage patterns of smartphones. The proposed method achieved local task execution. Jayasri and Aruna (2022) proposed a DM-based technology for data analysis in healthcare. During the process, it embedded association rules into the MapReduce framework and developed algorithm rules based on health-related data. The proposed method performed big data analysis effectively.

In summary, although DM technology has been studied or applied in various fields, research on tourism information is still relatively scarce. Hence, this paper proposes a tourism information search technology based on DM to provide more technical references for developing the tourism industry.

## 3. Intelligent Tourism Information Search Method

Smart tourism information can be used to improve the tourism experience, provide personalized suggestions, and improve the operational efficiency of tourism practitioners. This section describes the technical means used in the smart tourism information search method designed in this study.

### 3.1 Design of Mining Method based on Operator Data

With the rapid development of the global tourism industry, smart tourism has become a field of considerable interest. Operator data include user communication records, location information, and other travel-related data, and they provide an important data foundation for building smart tourism systems (Nimrah and Saifullah, 2022; Huang et al., 2022). On the other hand, powerful DM and information retrieval technologies are required to exploit operator data and achieve smart tourism (Maihulla et al., 2022). The Apriori algorithm can mine frequent itemsets in a dataset and generate association rules, and it has relatively loose requirements for data preprocessing. Therefore, the tourism DM method is designed based on Apriori. Fig. 1 presents the basic process of the designed Apriori method.

The goal of the designed algorithm is to generate a final set of frequent itemsets that no longer changes (Fig. 1). In the flow shown, candidate itemsets are generated over three levels. The database is first scanned to count the occurrences of the candidate itemsets at the current level. After the candidates that do not meet the requirement are removed, the next-level candidate set is generated from the current-level frequent itemsets. After the occurrences of the third-level candidate itemsets are counted, the method determines whether a new frequent itemset can be generated. If so, it continues to generate a new candidate set and scans the database to count the occurrences of the candidates. If not, the algorithm terminates. Frequent itemsets are determined by support, for which a minimum value is set in advance. If the support of a generated itemset is greater than the minimum support value, it is determined to be a frequent itemset. The support calculation is expressed as Eq. (1).
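Consistent with the symbol definitions that follow, the support is assumed here to take the standard ratio form:

##### (1)

$ S\left(X,Y\right)=\frac{N\left(X,Y\right)}{N} $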

where $S$ represents support; $X$ and $Y$ represent itemsets that do not intersect with each other; $N\left(X,Y\right)$ is the number of records containing both $X$ and $Y$; and $N$ is the total number of records. A candidate hash tree is constructed with a hash function for selecting frequent items. The entire dataset is scanned to obtain all possible frequent itemsets and extract the candidate frequent itemsets. The method then compares the candidate frequent itemsets with the data in the hash tree to calculate the confidence level of the frequent itemsets. The confidence level calculation is shown in Eq. (2).
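Judging from the ratio $S\left(X,Y\right)/S\left(X\right)$ that is compared with the minimum confidence in Eq. (4), the confidence is assumed here to take the standard form:

##### (2)

$ c\left(X\rightarrow Y\right)=\frac{S\left(X,Y\right)}{S\left(X\right)} $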

where $c$ represents the confidence level. When constructing a dataset, each item must be sorted according to specific rules derived from the association relationships between the elements. When there is a subset of the $X$-term set, the subset representation is expressed as Eq. (3).

where $X'$ represents a subset of the $X$ term set, and $K$ is an alternate item. Eq. (4) expresses the constraint with minimum confidence.

##### (4)

$ \frac{S\left(X-X',Y+X'\right)}{S\left(X-X'\right)}=\frac{S\left(X,Y\right)}{S\left(X\right)+K}<\frac{S\left(X,Y\right)}{S\left(X\right)}<\alpha $

where $\alpha $ represents the minimum confidence level. In a set containing multiple items, the total number of rules is calculated using Eq. (5).
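For a set of $d$ items, the number of possible association rules is given by the standard combinatorial result, which is assumed here to be the intended expression (for example, $d=3$ gives $R=12$):

##### (5)

$ R=3^{d}-2^{d+1}+1 $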

where $R$ is the total number of rules, and $d$ is the number of items in the set. When conducting tourism DM based on operator data, it is necessary to mine the tourists’ travel routes and trends and to make predictions for future time periods accordingly. For this analysis, a binary attribute transaction set must be constructed, as shown in Fig. 2.

Fig. 2 shows that when constructing a binary attribute transaction set, the transaction categories must be preset and the information records collected from the system. Each entry is set to 1 if the transaction was accessed and 0 if it was not, which yields a binary attribute transaction matrix. The Apriori algorithm for DM based on operator data then iterates and filters itemsets repeatedly to mine frequent itemsets containing binary attributes, obtain valuable association rules from the data, and extract effective information.
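As a minimal sketch of the procedure described in this section, the Python code below builds a binary attribute transaction matrix and runs a level-wise Apriori search over it. The destination names, the visit records, and the threshold of 0.3 are hypothetical illustrations rather than values from the experiments, and the hash-tree candidate counting described above is replaced by a direct scan for brevity.

```python
from itertools import combinations


def to_binary_matrix(records, categories):
    """Encode each record as a 0/1 row: 1 if the category was accessed."""
    return [[1 if c in rec else 0 for c in categories] for rec in records]


def support(itemset, matrix, index):
    """Fraction of rows in which every item of `itemset` is set to 1."""
    hits = sum(all(row[index[i]] for i in itemset) for row in matrix)
    return hits / len(matrix)


def apriori(matrix, categories, min_support):
    """Level-wise search: count candidates, prune, build the next level."""
    index = {c: k for k, c in enumerate(categories)}
    frequent = {}
    level = [frozenset([c]) for c in categories]
    while level:
        counted = {s: support(s, matrix, index) for s in level}
        # Keep itemsets whose support is strictly greater than the minimum.
        kept = {s: v for s, v in counted.items() if v > min_support}
        frequent.update(kept)
        keys = list(kept)
        size = len(keys[0]) + 1 if keys else 0
        # Join step: unions of two frequent itemsets that grow by one item.
        joined = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        level = [c for c in joined
                 if all(frozenset(sub) in kept
                        for sub in combinations(c, size - 1))]
    return frequent


# Hypothetical visit records, for illustration only.
categories = ["lake", "museum", "old_town", "tower"]
records = [
    {"lake", "museum"},
    {"lake", "museum", "old_town"},
    {"lake", "old_town"},
    {"museum", "tower"},
    {"lake", "museum", "old_town"},
]
matrix = to_binary_matrix(records, categories)
frequent = apriori(matrix, categories, min_support=0.3)
```

The loop stops as soon as no candidate survives pruning, which mirrors the termination check in the flow above.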

### 3.2 Optimized Design of Improved Apriori Information Mining Method

The number of tourists in the tourism industry changes constantly, and the data volume increases significantly on some special dates. The Apriori algorithm constructed above risks a high load when processing large-scale data, which could decrease its operational efficiency. Hence, optimization is needed (Taher et al., 2021). In addition, although some mined association rules meet the conditions for strong association, they are unsuitable for practical application scenarios and may mislead subsequent data analysis. Therefore, inappropriate association rules must be excluded (Eskandari et al., 2022). To make the mined data better suited to the actual scenario, the order in which tourists visit different locations is used as a constraint, because the data from the first travel location best reflect the priority of different travel locations. The matching conditions for the statistics are expressed as Eq. (6).

where $V$, $S$, $lac$, and $ci$ are the tourist, tourist destination, location code, and sector, respectively. Eq. (7) expresses the first set of tourist destinations and frequent $k$-item sets obtained.

##### (7)

$ \left\{\begin{array}{l} \mathit{first\_v\_sec}=\left\{S_{1},S_{2},\ldots ,S_{n}\right\}\\ k_{item}=\left\{X_{1},X_{2},\ldots ,X_{k}\right\} \end{array}\right. $

where $\mathit{first\_v\_sec}$ is the set of first tourist destinations, and $k_{item}$ is a frequent $k$-itemset. The constraint that the $k$-itemset will no longer be calculated backward is expressed as Eq. (8).

where $X_{i}$ represents the set of items with fewer than $k$ terms. The conditions for two tourist destinations not to be strongly correlated are expressed as Eq. (9).

where $A$ represents the collection of tourist destinations, and $A_{i}$ represents any element in the collection of tourist destinations. If each item in the frequent $k$-itemset does not meet the conditions for the first tourist location, the condition for continuing to mine the $(k+1)$-itemsets is expressed as Eq. (10).

where $T$ represents the label attribute. The requirements placed on the original dataset and the association rules differ between scenarios. Hence, the description of the degree of correlation also varies, which affects the computational efficiency of the algorithm. This study introduces a correlation measure to reduce the number of candidate sets and improve computational efficiency. The association rules need to satisfy Eq. (11).

##### (11)

$ \begin{array}{l} \exists A_{i}\left(1\leq i\leq n\right)\in \mathit{first\_v\_sec}\vee T\left(A_{1}\right)\\ =\ldots =T\left(A_{n}\right)=T\left(B_{1}\right)=\ldots =T\left(B_{n}\right) \end{array} $

where $B$ represents any collection of tourist destinations outside of set $A$. Eq. (12) expresses whether the visit behavior of two tourist destinations is independent.

##### (12)

$ \left\{\begin{array}{l} P\left(AB\right)=P\left(A\right)P\left(B\right),\text{Mutually independent}\\ P\left(AB\right)\neq P\left(A\right)P\left(B\right),\text{Have relevance} \end{array}\right. $

where $P$ is the probability of tourists visiting a certain tourist destination. The representation of association rules based on the probability correlation is expressed as Eq. (13).
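Given the independence test in Eq. (12) and the ratio that appears in Eq. (15), the probability correlation is assumed here to be the following ratio, which equals 1 exactly when the two visits are independent:

##### (13)

$ P\left(A\rightarrow B\right)=\frac{P\left(AB\right)}{P\left(A\right)P\left(B\right)} $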

where $P\left(A\rightarrow B\right)$ is the probability correlation between two tourist destinations. Eq. (14) expresses the correlation analysis between itemsets.

##### (14)

$ \left\{\begin{array}{l} P\left(A\rightarrow B\right)>1,\text{Positive correlation}\\ P\left(A\rightarrow B\right)<1,\text{Negative correlation}\\ P\left(A\rightarrow B\right)=1,\text{Mutually independent} \end{array}\right. $

When two itemsets are positively correlated, the occurrence of one increases the probability of the other, and the conditions for constructing strong association rules are expressed as Eq. (15).

##### (15)

$ \frac{P\left(AB\right)}{P\left(A\right)P\left(B\right)}>1\wedge \left(A_{i}\in \mathit{first\_v\_sec}\vee T\left(A_{1}\right)=\ldots =T\left(A_{n}\right)=T\left(B_{1}\right)=\ldots =T\left(B_{n}\right)\right) $

Rules that do not satisfy Eq. (15) are filtered out. A parallel computing framework is introduced to optimize the Apriori algorithm and further improve its computational speed in tourism information mining. Fig. 3 presents the parallel operation method.
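The filtering condition in Eq. (15) can be sketched in Python as follows. This is an illustrative reading of the condition, not the implementation used in the experiments: the function names, the destination names, the labels, and the probabilities are all hypothetical, and the first-visit test is interpreted as "at least one antecedent item is a first-visit destination", following Eq. (11).

```python
def correlation(p_ab, p_a, p_b):
    """Ratio P(AB) / (P(A) P(B)); a value of 1 means independent visits."""
    return p_ab / (p_a * p_b)


def is_strong_rule(antecedent, consequent, p_ab, p_a, p_b, first_visits, label):
    """Check positive correlation AND (first-visit item OR shared label)."""
    positively_correlated = correlation(p_ab, p_a, p_b) > 1
    has_first_visit = any(a in first_visits for a in antecedent)
    same_label = len({label[x] for x in antecedent | consequent}) == 1
    return positively_correlated and (has_first_visit or same_label)


# Hypothetical inputs, for illustration only.
first_visits = {"lake"}
label = {"lake": "nature", "museum": "culture", "tower": "culture"}

# Negatively correlated, so rejected even though "lake" is a first visit.
rule_1 = is_strong_rule({"lake"}, {"museum"}, 0.6, 0.8, 0.8, first_visits, label)
# Positively correlated and both destinations share the "culture" label.
rule_2 = is_strong_rule({"museum"}, {"tower"}, 0.2, 0.8, 0.2, first_visits, label)
# Positively correlated, but no first-visit item and the labels differ.
rule_3 = is_strong_rule({"tower"}, {"lake"}, 0.3, 0.4, 0.5, first_visits, label)
```

Only the second rule passes, which shows how the constraint removes rules that are statistically strong but do not fit the visiting scenario.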

Parallel operations divide the dataset that needs to be processed into small blocks based on the number of CPU threads and memory capacity (Fig. 3). Subsequently, the dataset is grouped, and different small blocks in each group are input into the corresponding allocated CPU threads and memory for parallel computation. Fig. 4 shows the actual running process of parallel computing based on memory.
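The block-splitting idea can be sketched with the Python standard library as follows. The helper names and the sample transactions are hypothetical, and the block count is simply tied to the number of workers; memory capacity is not modelled. Threads are used only to show the partitioning pattern, since in standard CPython a process pool (`ProcessPoolExecutor`) would usually be preferred for CPU-bound counting.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_blocks(transactions, n_blocks):
    """Split the list into n_blocks contiguous blocks of nearly equal size."""
    size, extra = divmod(len(transactions), n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        end = start + size + (1 if b < extra else 0)
        blocks.append(transactions[start:end])
        start = end
    return blocks


def count_block(block):
    """Count, per item, the transactions in one block that contain it."""
    counts = Counter()
    for transaction in block:
        counts.update(set(transaction))
    return counts


def parallel_item_counts(transactions, n_workers=4):
    """Count each block in a worker thread, then merge the partial counts."""
    blocks = split_blocks(transactions, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_block, blocks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical transactions, for illustration only.
transactions = [
    ["lake", "museum"],
    ["lake"],
    ["museum", "tower"],
    ["lake", "tower"],
    ["lake"],
]
counts = parallel_item_counts(transactions, n_workers=4)
```

Because each block is counted independently, the merged result does not depend on how the data are partitioned.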

During actual parallel computing, the data are first loaded from memory, and the loaded data are used as the data source (Fig. 4). The data sources are then grouped, and the grouped data are allocated to the preset nodes. The dataset is transformed on each node to generate new in-memory variables. Variables are then broadcast or shared to reduce the time spent on data transmission, and the parallel computing parameters are adjusted until the data skew is eliminated. Column action operations are then performed on the dataset to generate the calculation results. Finally, the result sets are aggregated, and the tourism information mining results are output. Combined with the distributed programming architecture, the algorithm process is optimized into a two-stage form. The first stage completes the generation of frequent 1-itemsets, as shown in Fig. 5.

During the first calculation stage, the transactions are first read into an initial in-memory dataset using the flatMap function (Fig. 5). The map function then transforms each transaction item into a pair of the item and a value. Candidate 1-itemsets are generated, and their support is counted using reduceByKey. Pruning is performed against the preset minimum support, and the retained itemsets form the frequent 1-itemsets. The second stage completes the generation of the remaining frequent itemsets, as shown in Fig. 6.
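The first stage can be illustrated with a plain-Python emulation of the flatMap, map, and reduceByKey steps. The sketch does not use the parallel framework itself: `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this example, and the transactions and the threshold of 0.5 are illustrative.

```python
from collections import defaultdict


def flat_map(func, data):
    """Apply func to each element and flatten the results into one list."""
    return [y for x in data for y in func(x)]


def reduce_by_key(pairs):
    """Sum the values of all (key, value) pairs that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def frequent_1_itemsets(transactions, min_support):
    """Return {item: support} for items above the minimum support."""
    # flatMap step: emit every distinct item of every transaction.
    items = flat_map(lambda t: sorted(set(t)), transactions)
    # map step: pair each item with the value 1.
    pairs = [(item, 1) for item in items]
    # reduceByKey step: add up the values per item.
    counts = reduce_by_key(pairs)
    n = len(transactions)
    # Pruning step: keep items whose support exceeds the threshold.
    return {item: c / n for item, c in counts.items() if c / n > min_support}


# Hypothetical transactions, for illustration only.
transactions = [
    ["lake", "museum"],
    ["lake"],
    ["museum", "tower"],
    ["lake", "tower"],
    ["lake"],
]
frequent_1 = frequent_1_itemsets(transactions, min_support=0.5)
```

With these inputs only "lake" survives pruning, since it appears in four of the five transactions.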

During the second calculation stage, the frequent itemsets are loaded and transformed into pairs of transactions and counts (Fig. 6). A candidate itemset is then generated and broadcast to each working node. Grouped calculations are then performed and the correlation is evaluated, and the support is calculated for the itemsets that satisfy the correlation condition. The data are pruned according to the preset minimum support threshold to obtain the final frequent itemsets. In this way, the tourism information search method constructed in this study mines tourists' historical data held by operators to collect and supplement smart tourism information.

## 4. Effectiveness Analysis of Smart Tourism Information Search Methods

A more efficient and accurate information search method can meet the growing demand of users for tourism information and improve the quality of the tourism experience. This section tests the performance of the smart tourism information search method in terms of the accuracy of the mined data types, the number of association rules, and the ROC curve. The application effect of the designed method is then analyzed in an actual scenario in terms of running time, amount of redundant data, and CPU load. The effectiveness of the designed method is assessed by combining the performance tests and the application analysis.

### 4.1 Performance Testing of Intelligent Tourism Information Search Methods

The effectiveness of the tourism information search method designed in this study for smart tourism DM was tested on historical data collected by a tourist attraction operator over one year. After selection and cleaning, these data served as the test dataset, which was divided into four quarterly datasets for the first, second, third, and fourth quarters. The proposed method was compared with the Extreme Gradient Boosting (XGBoost) algorithm and the Double Clustering Algorithm (DCA). First, the accuracy of the data types mined by the proposed method was tested, as shown in Fig. 7.

When testing the accuracy of the mined data types, the accuracy of every method on both the first-quarter and third-quarter datasets decreased as the running time increased (Fig. 7). On the first-quarter dataset, the initial accuracy of XGBoost was 97.2% (Fig. 7(a)), and the initial accuracy of DCA was 92.4%. The initial accuracy of the proposed method was 97.8%, which decreased to 93.5% after 200 seconds. On the third-quarter dataset, the initial accuracy of XGBoost was 94.6% (Fig. 7(b)), and the initial accuracy of DCA was 92.3%. The initial accuracy of the proposed method was 97.7%, which decreased to 93.6% after 200 seconds. This suggests that the proposed method classifies the mined data types more accurately and that its accuracy decreases slowly with time. The number of association rules produced by the proposed method was then tested under different confidence and support levels, as shown in Fig. 8.

The number of association rules for XGBoost at a confidence level of 0.002 was 15611 (Fig. 8(a)) and decreased to 4437 at a confidence level of 0.028. The number of association rules for DCA decreased to 5978 at a confidence level of 0.028. The number of association rules for the proposed method was 9723 at a confidence level of 0.002 and decreased to 2137 at a confidence level of 0.028. The number of association rules for XGBoost decreased to 8954 at a support level of 0.032 (Fig. 8(b)). The number of association rules for DCA was 14919 at a support level of 0.006 and decreased to 5387 at a support level of 0.032. The number of association rules for the proposed method was 7423 at a support level of 0.006 and decreased to 2317 at a support level of 0.032. Hence, the proposed method produces a more concise set of association rules. The ROC curve obtained by the proposed method was plotted to test its performance comprehensively, as shown in Fig. 9.

The ROC curve of XGBoost was farthest from the top left corner of the rectangle and closer to the diagonal line between the bottom left corner and the top right corner, resulting in the smallest area enclosed (Fig. 9). Although the ROC curve of DCA intersects with the ROC curve of XGBoost, it is closer to the upper left corner of the rectangle and encloses a relatively larger area than XGBoost. The ROC curve of the proposed method is closest to the upper left corner of the rectangle and does not intersect with the ROC curves of DCA and XGBoost, forming the largest area. Hence, the proposed method can maintain a higher level of functionality when searching for intelligent tourism information.

### 4.2 Application Analysis of Intelligent Tourism Information Search Methods

The effectiveness of the proposed method in a practical operating environment was tested by selecting data from a scenic area operator for application analysis. During the analysis, each search used the data collected on that same day, and the data were not preprocessed before analysis. Based on the actual situation, Wednesday data were selected as a small-scale dataset, and Sunday data were selected as a large-scale dataset. The running time of the proposed method was tested on these two datasets of different scales, as shown in Fig. 10.

On datasets of both scales, the running time of every method increased with the number of data items processed (Fig. 10). On the small-scale dataset, the running time of XGBoost increased to 865 seconds when the number of data items increased to 400K (Fig. 10(a)). The running time of DCA increased to 902 seconds at 400K data items, and that of the proposed method increased to 221 seconds. On the large-scale dataset, XGBoost had a running time of 23.1 ks at 50M data items (Fig. 10(b)), which increased to 60.8 ks at 400M data items. The running time of DCA was 37.9 ks at 50M data items and increased to 63.7 ks at 400M. The running time of the proposed method was 13.6 ks at 50M data items and increased to 33.0 ks at 400M. Hence, the proposed method has higher operational efficiency on datasets of different scales. The amount of redundant data generated during runtime was then tested, as shown in Fig. 11.

On datasets of both scales, the number of redundant data items generated by every method increased with the size of the processed data (Fig. 11). On the small-scale dataset, XGBoost generated 401 redundant data items when the number of data items increased to 250K (Fig. 11(a)), and DCA generated 586. The proposed method generated 28 redundant data items at 50K data items and 102 at 250K. On the large-scale dataset, XGBoost generated 614291 redundant data items when the number of data items increased to 250M (Fig. 11(b)), and DCA generated 662153. The proposed method generated 21367 redundant data items at 50M data items and 152372 at 250M. Hence, the proposed method can perform more accurate and effective DM when searching for smart tourism information. The CPU load during an information search was then analyzed, as shown in Fig. 12.

During the intelligent tourism information search process, the maximum CPU performance utilization ratio of the XGBoost method reached 91.8%, and the average CPU performance utilization ratio during the period was 61.3%, with significant fluctuations in the ratio (Fig. 12(a)). The maximum CPU performance utilization ratio of the DCA method was 73.3%, and the average CPU performance utilization ratio during the period was 47.6%, with significant fluctuations in the ratio (Fig. 12(b)). The maximum CPU performance utilization ratio of the proposed method was 43.2%, and the average CPU performance utilization ratio during the period was 39.7%, with slight fluctuations in the ratio (Fig. 12(c)). Hence, the proposed method can effectively reduce the processor performance burden and maintain stable operation after multithreading allocation.

## 5. Conclusion

Searching for smart tourism information can provide richer data references for developing the tourism industry. A smart tourism information search technology based on DM was proposed to address the diversity of smart tourism information and the associated data analysis issues. The essential information in the data was mined through frequent itemset calculations, and binary attribute transaction sets were used for data analysis. The concept of correlation was introduced to reduce the number of candidate sets. In the accuracy test for mined data types, the proposed method started from an accuracy of up to 97.8% and still achieved 93.5% on the first-quarter dataset and 93.6% on the third-quarter dataset after 200 seconds. The number of association rules of the proposed method decreased to 2137 at a confidence level of 0.028 and to 2317 at a support level of 0.032. The proposed method ran for only 221 seconds when the number of data items was increased to 400K. The area enclosed by the ROC curve of the proposed method was larger than that of the other methods. When testing the amount of redundant data, the proposed method generated only 152372 redundant data items when the number of data items was increased to 250M on the large-scale dataset. The CPU utilization of the proposed method did not exceed 43.2% during operation. The results indicate that the smart tourism information search technology designed in this study has better operational performance and lower hardware requirements during runtime, and it can complete information search tasks faster and more accurately than the other methods. Nevertheless, the proposed method was only tested over a short time span and in a relatively simple data environment. In future studies, the experimental scope will be expanded, and variable parameters will be added to enrich the experimental results and optimize the method.

### REFERENCES

## Author

Deng Liu is an associate professor at Wuhan Technical College of Communications (430065, Hubei, China). His research interests include Tourism Management and travel research.