Zongli Xin (Northern Beijing Vocational Education Institute, Beijing 101400, China)



Keywords: One-way frequent pattern tree, Spark platform, Design and implementation, Preschool education book, Recommendation system

1. Introduction

In the era of big data, the information environment of online book platforms has shifted from information scarcity to information overload, and the corresponding service mode should likewise shift from “people looking for information” to “information looking for people” [1,2]. This places new requirements and challenges on online book platforms: rather than passively waiting for users to search, a platform should adapt to user needs and improve content delivery, ease of use, and service responsiveness [3]. Keyword search forces users to spend considerable time and energy screening and evaluating results to find the content they need among masses of information, and so does not substantially solve the problem of information overload [4]. Moreover, when user needs are unclear or the keywords describe the desired content inaccurately, the returned results are often of no interest, and users must search repeatedly before finding satisfactory ones. Association rule mining offers one way forward, since its value lies in uncovering hidden association rules that provide an important reference for decision making. As scholars have deepened their research on this technology, association rule mining has been applied in different fields with good results [5,6]. At present, association rules are widely used in retail, medical care, the securities market, and other fields. In retail, for example, rules mined from consumption behavior reveal customers' purchasing habits, helping merchants design reasonable commodity layouts and formulate sales strategies that maximize profit; in medicine, common characteristics of patients with a certain disease can be found in large volumes of case records, which has important guiding significance; and in securities, patterns mined from existing stock trends have important reference value for investors' decisions [7].

Books, as easily transportable and affordable goods, have become a significant component of e-commerce. The vast amount of data generated in daily transactions contains valuable insights, posing a challenge for businesses to effectively extract useful information. Data mining technology is essential for merchants to analyze transaction records, providing a crucial basis for decision-making. However, traditional frequent item set mining algorithms are limited by computing power and memory when applied to large-scale data [8,9]. To address this, parallelization techniques, particularly using the Spark framework, can enhance the efficiency of frequent item set mining by processing data in memory and reducing disk reads [11,12]. This paper explores improvements to existing mining algorithms, leveraging Spark’s distributed computing capabilities, and applies these advancements to a parallelized book recommendation system [13,14]. By mining transactional data, the improved algorithm enhances recommendation accuracy, demonstrating both practical and broad applicability in e-commerce settings.

2. Design and Implementation of the Frequent Term Set Mining Algorithm Based on UFP-tree

2.1. Creation of the Constrained Subtree

Data cleaning is not simply a process of selecting high-quality data; it adds, deletes, groups, or reorganizes the original data so as to improve data quality and reduce the impact of quality gaps on the mining results. As shown in Eqs. (1) and (2), data from multiple files or multiple database environments are integrated together and stored in the same data warehouse.

(1)
$\varepsilon_i^k = \frac{1}{n_i} \sum_{j=1}^{n_i} p_{ij}, p_{ij} \in r_i$
(2)
$\sigma_i^k = \sqrt{\frac{1}{n_i-1} \sum_{j=1}^{n_i} (p_{ij} - \varepsilon_i^k)^2}, p_{ij} \in r_i$
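As a concrete illustration of Eqs. (1) and (2), the sketch below computes the per-group mean and sample standard deviation of the attribute values $p_{ij}$; the object and method names are ours, and the data are made-up numbers.

```scala
// A sketch of Eqs. (1) and (2): mean (epsilon) and sample standard
// deviation (sigma) of the attribute values p_ij within one record
// group r_i. Names and data are illustrative, not from the paper.
object CleaningStats {
  def meanAndStd(values: Seq[Double]): (Double, Double) = {
    val n = values.length
    require(n > 1, "a sample standard deviation needs at least two values")
    val mean = values.sum / n                                    // Eq. (1)
    val variance = values.map(v => (v - mean) * (v - mean)).sum / (n - 1)
    (mean, math.sqrt(variance))                                  // Eq. (2)
  }

  def main(args: Array[String]): Unit = {
    val (eps, sigma) = meanAndStd(Seq(3.0, 5.0, 7.0, 9.0))
    println(f"mean=$eps%.3f std=$sigma%.3f") // mean=6.000 std=2.582
  }
}
```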

Data integration can effectively reduce data redundancy, data value conflicts, and other problems, and effectively improve the efficiency of data processing. Data selection narrows the scope of the data, as shown in Eqs. (3) and (4), extracting only the portion that users are interested in, in order to better serve users' decision-making and improve mining efficiency. Data transformation converts the data into the analysis model required by a specific data mining algorithm.

(3)
$var(X) = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n\bar{X}^2 \right)$
(4)
$var(Y) = \frac{1}{n-1} \left( \sum_{i=1}^n Y_i^2 - n\bar{Y}^2 \right)$

Different mining tasks process different data types, so the data transformation method should be chosen according to the subsequent mining task. Data transformation methods include normalization, reduction, switching, rotation, and projection. As shown in Eqs. (5) and (6), an appropriate data mining algorithm is selected according to the characteristics and uses of the data, so that the patterns of interest can be searched out more accurately. The parameters may need to be adjusted several times during execution until satisfactory results are obtained.

(5)
$g(x, y) = \left( \frac{1}{2\pi\sigma_x\sigma_y} \right) \exp \left[ -\frac{1}{2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j\omega x \right]$
(6)
$G(\mu, \upsilon) = \exp \left\{ -\frac{1}{2} \left[ \frac{(\mu - W)^2}{\sigma_\mu^2} + \frac{\upsilon^2}{\sigma_\nu^2} \right] \right\}$

The patterns obtained by data mining may have no practical value or contradict the actual scenario, as shown in Eqs. (7) and (8), which requires pattern evaluation. The pattern obtained by data mining is evaluated in practical application to determine whether the model can achieve the predetermined goal. At this stage, the final decision to use the mining results is made.

(7)
$w_{mn}(x, y) = \iint I(x_1, y_1) g_{mn}^*(x - x_1, y - y_1) \, dx_1 \, dy_1$
(8)
$E_{snake} = \int [E_{in}(\nu(s)) + E_{out}(\nu(s))] \, ds$

This algorithm attempts to use samples to reduce the number of database scans, which effectively reduces I/O and thus improves performance. As shown in Eqs. (9) and (10), however, because transaction data are non-uniformly distributed, this method cannot guarantee the accuracy of the mining results; that is, the data may be distorted. The algorithm is therefore suitable for mining scenarios with high efficiency requirements and less stringent accuracy requirements.

(9)
$g(|\nabla u_0(x, y)|) = \frac{1}{1 + |\nabla (G_\sigma * u_0)(x, y)|^2}$
(10)
$J_m(U, \nu) = \sum_{k=1}^n \sum_{i=1}^c (u_{ik})^m \|x_k - \nu_i\|^2$

2.2. Mining of Frequent Item Sets

Association refers to special regularities among the values of two or more variables, including simple association, quantitative association, temporal association, and so on. Association rule mining is the most commonly used method in association analysis. As shown in Eqs. (11) and (12), association rule mining can uncover correlations between data and reveal hidden, important knowledge, where X and Y are the disjoint rule antecedent and rule consequent.

(11)
$cut(G_1, G_2) = \sum_{i \in V_1(G_1), j \in V_2(G_2)} d_{ij}$
(12)
$cut(A, B) = \sum_{u \in A, v \in B} w(u, v)$

Because there is no exact correlation function for the associations between the data, as shown in Eqs. (13) and (14), the support and confidence measures are needed to describe the strength of the mined association rules.

(13)
$w_{ij} = \exp \left( -\frac{|I_i - I_j|^2}{\sigma_I^2} \right)$
(14)
$\bar{c}(A, B) = \frac{c(A, B, w(u, v))}{c(A, B, 1)}$
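To make the two measures concrete, here is a minimal sketch that computes support and confidence for a rule X ⇒ Y over a toy transaction list; the item names and data are illustrative only.

```scala
// Support: fraction of transactions containing an item set.
// Confidence of X => Y: support(X ∪ Y) / support(X).
object RuleMetrics {
  type Txn = Set[String]

  def support(itemset: Set[String], db: Seq[Txn]): Double =
    db.count(t => itemset.subsetOf(t)).toDouble / db.size

  def confidence(x: Set[String], y: Set[String], db: Seq[Txn]): Double =
    support(x ++ y, db) / support(x, db)

  def main(args: Array[String]): Unit = {
    val db = Seq(Set("milk", "bread"), Set("milk", "diaper"),
                 Set("milk", "bread", "diaper"), Set("bread"))
    println(support(Set("milk", "bread"), db))         // 0.5
    println(confidence(Set("milk"), Set("bread"), db)) // 0.666...
  }
}
```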

A frequent item set is one whose support count is equal to or greater than the given minimum support. As shown in Eqs. (15) and (16), of the two steps of association rule mining (finding all frequent item sets, then generating rules from them), the second is carried out on the basis of the first and is relatively simple, so the key and difficulty of association rule mining lies in frequent item set mining.

(15)
$Rcut(A, B) \triangleq \frac{c_1(A, B)}{c_2(A, B)}$
(16)
$c_1(A, B) \triangleq \sum_{u \in A, v \in B, (u, v) \in E} w_1(u, v)$

The Partition algorithm divides the original transaction database into m subblocks, as shown in Eqs. (17) and (18). The division principle is that each subblock must fit in memory. Each subblock is then mined independently for its local frequent item sets, where the minimum support count of a subblock is the product of min_sup and the number of transactions in that subblock. The local frequent item sets of all subblocks can be obtained by scanning the original transaction database once.

(17)
$\Gamma(u) = \frac{\sum_{i>j} w_{ij}(u_i - u_j)^2}{\sum_{i>j} w_{ij}u_iu_j} = \frac{u^T L u}{\frac{1}{2} u^T W u}$
(18)
$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}$

Using the property that any globally frequent item set must be frequent in at least one subblock, as shown in Eqs. (19) and (20), the union of the local frequent item sets forms the global candidate set, which a second scan then verifies. The Partition algorithm is highly parallelizable and requires only two scans of the complete database; however, the candidate set can be quite large, and the redundant computation cost is high.

(19)
$Ncut(A, B) = \frac{\sum_{(x_i>0, x_j<0)} -w_{ij}x_ix_j}{\sum_{x_i>0} d_i} + \frac{\sum_{(x_i<0, x_j>0)} -w_{ij}x_ix_j}{\sum_{x_i<0} d_i}$
(20)
$Ncut(A, B) = \frac{y^T(D-W)y}{y^T D y}$
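A hedged sketch of the Partition scheme described above follows: each subblock is mined locally with a support count scaled to its own size, the local results are unioned into global candidates, and a second scan verifies them. The mineLocal helper is a deliberately simplified stand-in (it mines only 1-item sets), not the paper's actual miner.

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  // Simplified local miner: frequent 1-item sets only, for brevity.
  def mineLocal(txns: Seq[Set[String]], minCount: Int): Set[Set[String]] =
    txns.flatMap(_.map(item => Set(item)))
        .groupBy(identity)
        .collect { case (s, g) if g.size >= minCount => s }
        .toSet

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("partition-sketch").getOrCreate()
    val sc = spark.sparkContext
    val db = sc.parallelize(Seq(Set("a", "b"), Set("a", "c"),
                                Set("a", "b", "c"), Set("b", "c"),
                                Set("a", "b")), numSlices = 2)
    val minSupRatio = 0.6
    val total = db.count()
    // Local pass: each subblock uses min_sup times its own transaction count.
    val candidates = db.mapPartitions { it =>
      val block = it.toSeq
      val localCount = math.max(1, math.ceil(minSupRatio * block.size).toInt)
      mineLocal(block, localCount).iterator
    }.distinct().collect().toSet
    // Verification pass: keep only globally frequent candidates.
    val frequent = candidates.filter { c =>
      db.filter(t => c.subsetOf(t)).count() >= minSupRatio * total
    }
    println(s"frequent item sets: $frequent")
    spark.stop()
  }
}
```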

3. A Parallelization Implementation of the UFIM Algorithm on the Spark

3.1. Parallel-based Analysis of the UFIM Algorithm

When using such algorithms, the accuracy of the mining results depends on the sampling capacity: if the sample is too small, the results may be heavily biased and fail to meet the mining requirements; if the sample is too large, the point of using a sampling algorithm is lost [15]. The hash-based DHP algorithm proposed by Park et al. uses a hash technique to reduce the number of candidate sets in Ck. For example, while scanning the transaction database to generate frequent 1-item sets (Fig. 1 shows a flowchart of the collaborative filtering algorithm), all 2-item sets of each transaction can be generated, mapped to different buckets of a hash table, and the corresponding bucket counts incremented [16,17]. If a bucket count is less than the support threshold, none of the 2-item sets in that bucket can be frequent 2-item sets. Hash-based techniques can thus effectively reduce the size of Ck [18,19], though the efficiency of DHP depends on the chosen hash function. Transaction compression improves the efficiency of frequent item set mining by shrinking the transaction database scanned in later passes: transactions that contain no item of Ck are marked or deleted before computing the Ck supports, and as k increases, the number of transactions actually scanned falls significantly [20,21].
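The bucket-counting step of DHP can be sketched as follows; the bucket count and hash function here are illustrative choices, not the ones used by Park et al.

```scala
// DHP bucket counting: hash every 2-item set of each transaction into a
// fixed number of buckets; a bucket whose count is below the support
// threshold cannot contain any frequent 2-item set, so its candidates
// are pruned before support counting.
object DhpSketch {
  val numBuckets = 7
  def bucket(a: String, b: String): Int =
    math.abs((a, b).hashCode) % numBuckets

  def main(args: Array[String]): Unit = {
    val db = Seq(Set("a", "b", "c"), Set("a", "b"), Set("b", "c"), Set("a", "c"))
    val minSup = 2
    val counts = Array.fill(numBuckets)(0)
    for (t <- db; pair <- t.toSeq.sorted.combinations(2))
      counts(bucket(pair(0), pair(1))) += 1
    // A candidate 2-item set survives only if its bucket is frequent.
    val candidates = for {
      t <- db
      pair <- t.toSeq.sorted.combinations(2)
      if counts(bucket(pair(0), pair(1))) >= minSup
    } yield pair.toSet
    println(candidates.distinct)
  }
}
```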

Fig. 1. Flow chart of the collaborative filtering algorithm.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig1.png

The Apriori algorithm was the first frequent item set mining algorithm. It can mine all frequent item sets, but its inherent performance bottlenecks prevent it from adapting well to the needs of the big data era [22,23]. Table 1 shows how running time changes with support. The FP-Growth algorithm emerged to address the defects of Apriori: it mines all frequent items by compressing the transaction database into a frequent pattern tree (FP-tree) and applying pattern growth. Compared with Apriori, FP-Growth greatly improves time efficiency [24,25]. It compresses the large database into a tree structure, avoiding the cost of scanning the database repeatedly, and adopts a divide-and-conquer approach to frequent item set mining, transforming the problem of finding long frequent item sets into iteratively discovering short ones [26,27]. However, the FP-tree occupies a great deal of memory and, when the data are large, may not fit in memory at all. During mining, a large number of conditional pattern bases and conditional frequent pattern trees must be generated, consuming considerable time and space, and the recursive pattern-growth strategy also occupies a lot of CPU time and reduces mining efficiency [28,29].

Table 1. Changes of running time with support.

Minimum support level | FP-Growth running time | Algorithm running time | UFIM running time
10 | 25.736 | 1.248 | 1.063
20 | 13.149 | 0.421 | 0.354
30 | 5.631 | 0.188 | 0.172
40 | 2.950 | 0.172 | 0.156
50 | 1.520 | 0.148 | 0.125

One improvement creates a two-dimensional array when building the conditional FP-tree, whose elements are the counts of every pair of frequent items [30]. Fig. 2 is the flowchart of the content-based recommendation algorithm. Through this array, the items that remain frequent can be determined when generating a new conditional FP-tree, reducing recursion over the original conditional FP-tree and thus improving the efficiency of frequent item set mining. Another improvement uses the H-struct data structure, which is more space-saving than candidate generation and pattern growth and extends well to various types of data. A further refinement adds a hash structure to store the items and their positions in the item header table: to find an item i in the header table, the value corresponding to the key is first looked up in the hash table, and the structure is then read directly in header-table order. There is also an algorithm that uses the original FP-tree without generating conditional FP-trees. Its basic idea is to scan the transaction database twice, place the data in a highly compressed unidirectional tree, and then use this tree to construct subtrees constrained by single items, recursively mining all frequent item sets in a pattern-growth manner. On the job-execution side, once the input data and output directory specified for a job are verified, the job JAR file, job splits, configuration file, and other resources required for the job are sent to the JobTracker, which stores them in a folder created specifically for the job and named with the job ID. After receiving the job, the JobTracker places it in a queue to await dispatch by the job scheduler. When the job is scheduled, the JobTracker creates Map tasks according to the job splits and assigns them to TaskTrackers for execution.
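A minimal sketch of the pair-count array mentioned above: cell (i, j) of an upper-triangular array holds the co-occurrence count of frequent items i and j, so infrequent pairs can be skipped without rebuilding a conditional FP-tree. The header table and transactions are made up for illustration.

```scala
// Pair-count array: item indices come from a (hypothetical) item header
// table; only the upper triangle is filled since pairs are unordered.
object PairCountSketch {
  def main(args: Array[String]): Unit = {
    val headerTable = Vector("a", "b", "c", "d") // frequent 1-items
    val index = headerTable.zipWithIndex.toMap
    val n = headerTable.length
    val counts = Array.ofDim[Int](n, n)
    val db = Seq(Set("a", "b"), Set("a", "b", "c"), Set("b", "d"))
    for (t <- db; Seq(x, y) <- t.toSeq.map(index).sorted.combinations(2))
      counts(x)(y) += 1 // upper triangle only
    val minSup = 2
    for (i <- 0 until n; j <- i + 1 until n if counts(i)(j) >= minSup)
      println(s"frequent pair: ${headerTable(i)}, ${headerTable(j)}")
  }
}
```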

Fig. 2. Flow chart of the content base recommendation algorithm.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig2.png

3.2. The UFIM Algorithm Is Designed Based on Spark

In Spark, Map tasks prioritize “data localization,” running on the same node as the data to minimize data transmission time. Table 2 illustrates how running time changes with the number of transactions. Unlike Map tasks, Reduce tasks do not consider data localization. TaskTrackers report regularly to the JobTracker to confirm that they are either executing tasks or idle and ready for new ones; if a node fails, its task is quickly reassigned to another slave node. Each task runs in its own Java virtual machine, and outputs are saved in temporary HDFS files before being merged into a final result file. Users submit applications through a client, which initializes the SparkContext on the driver node. During setup, two scheduling modules are created: the DAGScheduler and the TaskScheduler. The DAGScheduler groups tasks into stages based on their dependencies, while the TaskScheduler submits these stages (as task sets) for execution, with task monitoring and reporting during execution. Cluster resource management is handled by Spark's own manager or by external systems such as YARN or Mesos. Upon completion, the Executor process stores the results and the resource manager reclaims the resources, completing the Spark application's execution.
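For orientation, a minimal Spark driver sketch is shown below: the SparkContext is initialized on the driver, the DAG scheduler cuts the job into stages behind the scenes, and the result returns to the driver. The application name and local master URL are placeholders; on a cluster, the master would instead be supplied via spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder app name and master; in production these usually come
    // from spark-submit (e.g. --master yarn) rather than being hardcoded.
    val conf = new SparkConf().setAppName("ufim-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Two narrow transformations plus one action form a single stage;
    // a shuffle operation (e.g. reduceByKey) would add a stage boundary.
    val total = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0).count()
    println(s"count = $total")
    sc.stop()
  }
}
```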

Table 2. Running time changes with the number of transactions.

Transaction number | Number of frequent item sets | Algorithm running time | UFIM running time
1000 | 389279 | 0.734 | 0.610
2000 | 563663 | 1.094 | 0.860
3000 | 530501 | 1.047 | 0.958
4000 | 275087 | 1.172 | 0.957
5000 | 597519 | 1.156 | 0.984

The Spark computing framework features several operational modes: Spark Local, Spark Standalone, Spark Mesos, and Spark YARN. Spark Local Mode simulates distributed operations using multiple single threads. This straightforward mode requires only unzipping the Spark package and modifying common configurations. However, it lacks a distribution mechanism, making it primarily suitable for verifying the logical correctness of applications rather than assessing time efficiency. Spark Standalone Mode operates independently of other resource management systems and can be deployed in a single cluster with complete services. It only supports FIFO scheduling for multiple simultaneous tasks. Compared to MapReduce, this mode utilizes a master/slave architecture and employs Zookeeper to address single points of failure. In contrast to MapReduce’s distinction between Map and Reduce tasks, Spark improves resource utilization by allowing a single slot to handle various task types. Spark Mesos Mode provides two scheduling options: coarse-grained and fine-grained. In coarse-grained mode, Mesos allocates all resources of the running environment, retaining them even when not in use, and recovers these resources after application execution. This mode offers a low startup cost for tasks but lacks flexibility in resource allocation, leading to underutilization of cluster resources and subsequent waste, as shown in Fig. 4, which depicts the accuracy evaluation. The fine-grained mode allocates resources before task execution.

Fig. 3. Flow chart of the hybrid recommendation algorithm.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig3.png

Fig. 4. Accuracy assessment.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig4.png

The UFP-tree is a compact data structure similar to the FP-tree but features key differences. The UFP-tree is unidirectional, containing pointers only from the current node to its parent, while the FP-tree has bidirectional pointers. FP-tree nodes are identified by data items, whereas UFP-tree nodes are identified by frequent 1-item sets. Additionally, children in FP-trees are unordered, while those in UFP-trees are arranged in ascending order. Each node in the FP-tree has six domains, while the UFP-tree has only four, resulting in lower memory usage. After constructing the UFP-tree, a constrained subtree can be generated to mine all frequent item sets. The constrained subtree is analyzed in two scenarios: pointing to the same or different endpoints. The proposed improvement reduces recursion and storage requirements without altering the frequent item sets mined. The UFIM algorithm primarily comprises the UFP mining process, which generates frequent 1-item sets and constrained subtrees, and the subsequent mining process for longer item sets.
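A hedged sketch of a four-field UFP-tree node follows, assuming illustrative field names: item rank, support count, a single parent pointer (the tree is unidirectional), and children kept in ascending rank order via a sorted map.

```scala
import scala.collection.mutable

// Four fields per node, versus six in the FP-tree: item rank, count,
// parent pointer, and an ascending-ordered child map. Field names are
// our own choices, not taken from the paper.
class UfpNode(val itemRank: Int, var count: Int, val parent: UfpNode) {
  val children: mutable.TreeMap[Int, UfpNode] = mutable.TreeMap.empty

  // Insert one transaction, given as an ascending list of item ranks.
  def insert(ranks: List[Int]): Unit = ranks match {
    case Nil => ()
    case r :: rest =>
      val child = children.getOrElseUpdate(r, new UfpNode(r, 0, this))
      child.count += 1
      child.insert(rest)
  }
}

object UfpTreeSketch {
  def main(args: Array[String]): Unit = {
    val root = new UfpNode(-1, 0, null)
    // Transactions already recoded as ascending frequent-item ranks.
    Seq(List(1, 2, 3), List(1, 2), List(2, 3)).foreach(root.insert)
    println(root.children.keys.mkString(", ")) // 1, 2
  }
}
```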

4. Design of Recommendation System for Preschool Education Based on UFIM Algorithm

In a single-machine environment, frequent item set mining algorithms can efficiently process small data volumes, as illustrated in Fig. 5, which shows the recall rate evaluation. However, the computing power and memory limits of such an environment hinder algorithm performance, often making it difficult, for example, to hold the constructed frequent pattern tree in memory. Hadoop handles big data processing effectively but is limited by writing Map-stage results to local disk and reading them back for the Reduce phase, which makes iterative frequent item set mining inefficient. In contrast, Spark, a memory-based big data processing framework, retains intermediate results in memory, minimizing repeated read/write operations and improving performance during iterative processing. This section focuses on the parallel design and implementation of the UFIM algorithm on the Spark platform, ensuring that processing tasks do not interfere with one another. In real-world applications, complex tasks are divided into subtasks, which can introduce time and space overhead for data synchronization. Nevertheless, parallel computing, whether through designing new algorithms or extracting the parallelizable components of existing serial algorithms, significantly improves execution efficiency compared with traditional serial methods.

Fig. 5. Recall assessment.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig5.png

This paper adopts the second approach to parallelize the proposed serial UFIM algorithm. Fig. 6 shows the coverage evaluation graph. For each item m in the item header table, the subtree constrained by that single item is constructed, and the frequent item sets mined from that constrained subtree are exactly those containing m; the mining procedures for different items in the header table clearly do not interfere with one another. Parallelizing the UFIM mining process therefore amounts to dividing the work into subtasks, taking the data needed to construct a single constrained subtree as the basic unit, and distributing them to the worker nodes. Each worker node independently mines the frequent item sets containing its items, and the sets gathered from all nodes together constitute the frequent item sets of the whole transaction database. Since Spark operates on RDDs, parallelizing the UFIM algorithm on the Spark platform requires converting the transaction database to an RDD and then mining all frequent item sets on that basis. The parallelization of UFIM on Spark is divided into five main steps, the first of which is to convert the transaction database into the initial RDD.

Fig. 6. Coverage assessment.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig6.png

Spark's parallel computing relies on RDDs (Resilient Distributed Datasets) to process the transaction database. The transaction database is first converted into an RDD, referred to as T. As illustrated in Fig. 7, the support counts of all items are calculated in parallel to yield the frequent 1-item sets (F_List). The transactions in T are then sorted according to F_List, generating a new dataset T'. The items in T' are grouped, producing G_List, which records group numbers and the corresponding segments of the transaction database. The UFIM algorithm is then executed in parallel to mine the frequent item sets. Unlike the serial UFIM algorithm, this approach uses grouping to prevent redundant mining across groups: each group builds a UFP-tree item header table containing only the serial numbers of the frequent items specific to that group. The transaction database is transformed into an RDD using the map and reduceByKey operations: map converts each transaction into a <key, value> pair with the transaction as the key and the value initialized to 1, and reduceByKey aggregates these values to form the initial RDD used for all subsequent frequent item set mining, so the original database never needs to be re-scanned.
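The conversion step can be sketched as follows, with an in-memory stand-in for the input file; identical transactions collapse into one <transaction, count> pair, so later passes never re-scan the raw database.

```scala
import org.apache.spark.sql.SparkSession

object TxnRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("txn-rdd").getOrCreate()
    val sc = spark.sparkContext
    // Stand-in for reading the transaction file; each line is one transaction.
    val lines = sc.parallelize(Seq("a b c", "a b", "a b c"))
    // map each transaction to (transaction, 1), then merge duplicates.
    val initialRdd = lines.map(t => (t, 1)).reduceByKey(_ + _)
    initialRdd.collect().foreach { case (t, n) => println(s"$t -> $n") }
    // prints: "a b c -> 2" and "a b -> 1"
    spark.stop()
  }
}
```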

Fig. 7. Diversity assessment plots.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig7.png

The purpose of data grouping in the Spark cluster is to enable parallel computing without inter-node communication, as shown in Fig. 8, which illustrates the user satisfaction evaluation. First, the F_List set is numbered in descending order of support count, generating F_Ranked_List in the format <key: frequent item, value: frequent item rank>. Next, the transactions in T are renumbered and infrequent items are removed using F_Ranked_List to obtain T'. Data items are then grouped: the maximum number of frequent items per group is determined from F_List, and the number of frequent items per group (Q) is calculated and increased by one. Using a flatMap operation, the items in T' are divided into their groups. The division iterates through each transaction, assigning a group number gid = (t[i] − 1) / g_size; if that group has not yet been assigned for the transaction, the corresponding prefix of the transaction is added to the group, as in the sketch below. A groupByKey operation then aggregates the data by group number to form G_List. Frequent item set mining proceeds in parallel across the nodes using the grouped data in G_List, with each node independently applying a serial frequent item set mining algorithm to generate the frequent item sets and their support values.
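The grouping step sketched below follows the gid = (t[i] − 1) / g_size rule with flatMap and groupByKey; the group size and sample transactions are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object GroupingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("grouping").getOrCreate()
    val sc = spark.sparkContext
    val gSize = 2 // illustrative number of ranks per group
    // Transactions already recoded as ascending frequent-item ranks (T').
    val tPrime = sc.parallelize(Seq(List(1, 3, 4), List(2, 4), List(1, 2)))
    val grouped = tPrime.flatMap { t =>
      // Walk right to left; each item's group receives the prefix ending
      // at that item, and each group is emitted at most once per txn.
      val seen = scala.collection.mutable.Set[Int]()
      t.indices.reverse.flatMap { i =>
        val gid = (t(i) - 1) / gSize
        if (seen.add(gid)) Some(gid -> t.take(i + 1)) else None
      }
    }.groupByKey() // G_List: group number -> its share of the database
    grouped.collect().foreach { case (gid, txns) =>
      println(s"group $gid: ${txns.mkString("; ")}")
    }
    spark.stop()
  }
}
```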

Fig. 8. User satisfaction assessment chart.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig8.png

The previous step mined all the frequent item sets with items represented by their serial numbers, as illustrated in Fig. 9 for the click evaluation. To convert these serial numbers back into distinct items, a map operation translates each serial number into its corresponding item from F_Ranked_List. The resulting set, denoted Result, can then be persisted to a specified location with saveAsTextFile(). Given the vast number of books stored, with potentially thousands of similar or identically titled volumes, efficient retrieval is crucial to the user experience; major book websites such as Amazon carry around 3.1 million books.
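A minimal sketch of this final translation and save step, assuming a hypothetical rank-to-item mapping broadcast to the workers and a placeholder output directory:

```scala
import org.apache.spark.sql.SparkSession

object SaveResultSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("save-result").getOrCreate()
    val sc = spark.sparkContext
    // Hypothetical F_Ranked_List inverse: rank -> item name.
    val rankToItem = sc.broadcast(Map(1 -> "milk", 2 -> "bread", 3 -> "eggs"))
    // Mined results as (item-rank set, support count) pairs.
    val minedRanks = sc.parallelize(Seq((Set(1, 2), 5L), (Set(3), 4L)))
    val result = minedRanks.map { case (ranks, sup) =>
      ranks.map(rankToItem.value).mkString("{", ",", "}") + s" : $sup"
    }
    result.saveAsTextFile("/tmp/frequent-itemsets") // placeholder output dir
    spark.stop()
  }
}
```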

Fig. 9. Click rate evaluation.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig9.png

Extensive searches based on classifications, titles, and authors, though viable, often consume time without guaranteeing satisfactory outcomes, an issue both users and businesses seek to mitigate. Reading through book content for selection further compounds time expenditure. Novices in certain professions might inadvertently purchase low-quality books, dampening their interest and, subsequently, impeding societal progress. A personalized book recommendation system serves practical importance, with this paper deploying the Spark-based parallel UFIM algorithm to enhance book recommendations, reinforcing its applicability. An integral facet determining software acceptance in the market pertains to the standardization of the development process.

Software development commences with requirement analysis, aimed at comprehending user demands and defining the software system’s functionalities within the business context. Subsequent phases span design, implementation, and testing. Design intricacies cover software architecture and data structure specifications post-requirement assessments, further segmented into summary and detailed designs. Implementation focuses on materializing module functions per design, emphasizing the optimal selection of programming languages and tools for efficient software creation. Testing stages scrutinize software alignment with design expectations, differentiating between white box and black box testing methodologies based on internal software comprehension emphasis.

5. Experimental Analyses

User-based collaborative filtering suits scenarios with few users and many items (Fig. 10 shows the conversion rate evaluation). In most scenarios, however, the number of users is larger, and user-based collaborative filtering would then take too long to compute, even though offline recommendation is calculated in advance and need not be time-sensitive. In addition, with user-based collaborative filtering, changes in users' interests can make the recommendation results inaccurate. On balance, in scenarios where the scoring matrix is relatively dense, this paper chooses item-based collaborative filtering: it is better suited to the later stages of a recommendation system, where items are fewer than users, and it can still meet users' personalization needs. Because the similarity between items is more stable and does not change frequently, it is also better suited to offline recommendation, ensuring the accuracy of the recommendation system while saving computation time.

Fig. 10. Conversion rate assessment.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig10.png

Item-based collaborative filtering needs to find groups of users interested in the same item. Fig. 11 is the retention assessment chart. To capture user interest, we can use the rating data: if a person has rated an item, that indicates interest in the item, so the rating data can be mined to calculate the similarity between items. The behavioral data in the existing dataset are readers' ratings of books. After the connection from the system to MongoDB is configured and the Spark session is defined, Spark reads the offline rating dataset Rating from the database, calls the map() method to extract the required fields “userId”, “productId”, and “rating”, defines the result as the table ratingDF, and converts the data structure to a DataFrame for subsequent calculation and processing. The cache() method is also called to cache the table for performance.
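A hedged sketch of this loading step is shown below. It assumes the MongoDB Spark connector (2.x) is on the classpath; the URI, database, and collection names are placeholders, and the field names follow those mentioned in the text.

```scala
import org.apache.spark.sql.SparkSession

object LoadRatingsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("load-ratings")
      // Placeholder connection string: database "recommender",
      // collection "Rating".
      .config("spark.mongodb.input.uri",
              "mongodb://localhost:27017/recommender.Rating")
      .getOrCreate()
    import spark.implicits._
    val ratingDF = spark.read
      .format("com.mongodb.spark.sql.DefaultSource") // connector 2.x format
      .load()
      .select($"userId", $"productId", $"rating") // project required fields
      .cache() // reused by the similarity computation that follows
    ratingDF.show(5)
    spark.stop()
  }
}
```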

Fig. 11. Retention rate assessment chart.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig11.png

The loaded rating data are preprocessed and fed into the core algorithm. Fig. 12 is the turnover rate evaluation chart. Two values must be computed for the similarity formula: the numerator, which is the number of users who rated each pair of books at the same time, and the denominator, which is built from the number of ratings of each individual book. The rating table ratingDF calls the groupBy() method with the parameter productId, grouping users along the product dimension and counting the ratings of each book. The resulting table is also a DataFrame, named productRatingCountDF, containing the field productId and the rating count of each book.
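The counting and similarity computation can be sketched as follows; the toy ratings are illustrative, and the co-occurrence similarity coCount / sqrt(n1 * n2) is one common choice consistent with the numerator and denominator described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CooccurrenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("cooccurrence").getOrCreate()
    import spark.implicits._
    // Toy stand-in for the ratingDF loaded from MongoDB.
    val ratingDF = Seq((1, 10, 4.0), (1, 11, 5.0), (2, 10, 3.0),
                       (2, 11, 4.0), (3, 10, 5.0))
      .toDF("userId", "productId", "rating")
    // Ratings per book: the pieces of the denominator.
    val productRatingCountDF = ratingDF.groupBy("productId").count()
    // Users who rated both books of a pair: the numerator.
    val pairCounts = ratingDF.as("a").join(ratingDF.as("b"), "userId")
      .where(col("a.productId") < col("b.productId"))
      .groupBy(col("a.productId").as("p1"), col("b.productId").as("p2"))
      .count().withColumnRenamed("count", "coCount")
    val sim = pairCounts
      .join(productRatingCountDF.withColumnRenamed("productId", "p1")
                                .withColumnRenamed("count", "n1"), "p1")
      .join(productRatingCountDF.withColumnRenamed("productId", "p2")
                                .withColumnRenamed("count", "n2"), "p2")
      .withColumn("similarity", col("coCount") / sqrt(col("n1") * col("n2")))
    sim.show() // books 10 and 11: coCount=2, n1=3, n2=2, similarity≈0.816
    spark.stop()
  }
}
```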

Fig. 12. Turnover rate assessment.

../../Resources/ieie/IEIESPC.2025.14.6.803/fig12.png

6. Conclusion

This paper begins by outlining the research background of book recommendation systems, including their significance, current status both domestically and internationally, and the various stages of research. The objective is to develop a book recommendation system capable of handling large-scale data processing while improving accuracy. To implement the recommendation system, two core technologies are utilized. The first is a computing framework for large-scale data processing. This paper selects the Spark architecture, considering the need for both offline computing and online real-time processing. It explains Spark’s principles, including Resilient Distributed Datasets (RDDs), and discusses the evolution of the Spark community to address diverse scenarios within the Spark ecosystem. Compared to Hadoop, Spark is more suitable for streaming computing, allowing for rapid calculations of large datasets. The second core technology is the recommendation algorithm. This study examines mainstream recommendation algorithms, explaining their central concepts and suitability for different scenarios. The validation of these algorithms contributes to the development of a prototype system capable of addressing various multi-scenario challenges.

Based on the research and analysis of these core technologies, this paper introduces the design and implementation of the system. First, the system architecture is described, covering the offline recommendation module, the real-time recommendation module, and the data transmission and storage module. For the offline and real-time recommendation modules, the description centers on the recommendation algorithm each module uses, its core code, and its implementation steps. For the data transmission and storage module, it covers the tools used for offline and online data processing and traces every link of the data life cycle from acquisition and filtering to computation and storage, so that the tools and processing involved are clearer. According to the evaluation index curves, algorithm training converges at around 450 rounds; the fluctuation range of PSNR is 42.3∼42.6, and that of SSIM is 0.92∼0.96. By the evaluation index analysis, PRI increased by about 2.5%∼3.5%, GCE decreased by about 15%∼20%, and VOI decreased by about 13%∼16%. The numbers of pixels in groups A and B were 408,000 and 745,608; image similarity reached 97.84% in group A, 96.21% in group B, and 99.33% in group C.

Funding

This work was supported by the Northern Beijing Vocational Education Institute project “Research and Application of Teachers' Guidance Ability Development for Family Education for Preschool Education Majors in Higher Vocational Colleges Based on Home-School Cooperation” (YD202103).

References

1. Xu D., Zhang Y., Jin F., 2021, The role of AKR1 family in tamoxifen resistant invasive lobular breast cancer based on data mining, BMC Cancer, Vol. 21.
2. Fiaschetti M., Graziano M., Heumann B. W., 2021, A data-based approach to identifying regional typologies and exemplars across the urban-rural gradient in Europe using affinity propagation, Regional Studies, Vol. 55, No. 12, pp. 1939-1954.
3. Vemprala P., Bhatt P., Valecha R., Rao H. R., 2021, Emotions during the COVID-19 crisis: A health versus economy analysis of public responses, American Behavioral Scientist, Vol. 65, No. 14, pp. 1972-1989.
4. Bao Y., Yang Q., 2022, A data-mining compensation approach for yaw misalignment on wind turbine, IEEE Transactions on Industrial Informatics, Vol. 17, No. 12, pp. 8154-8164.
5. Kaspersky A., 2021, Selected morphotic parameters differentiating ulcerative colitis from Crohn's disease, Acta Mechanica et Automatica, Vol. 15, No. 4, pp. 249-253.
6. Mandorino M., Figueiredo A. J., Cima G., Tessitore A., 2021, A data mining approach to predict non-contact injuries in young soccer players, International Journal of Computer Science in Sport, Vol. 20, No. 2, pp. 147-163.
7. Baruah A. J., Baruah S., 2021, Data augmentation and deep neuro-fuzzy network for student performance prediction with MapReduce framework, International Journal of Automation and Computing, Vol. 18, pp. 981-992.
8. Sharifi M., Khatibi T., Emamian M. H., Sadat S., Hashemi H., Fotouhi A., 2021, Development of glaucoma predictive model and risk factors assessment based on supervised models, BioData Mining, Vol. 14.
9. Singh T., Panda S. S., Mohanty S. R., Dwibedy A., 2023, Opposition learning based Harris hawks optimizer for data clustering, Journal of Ambient Intelligence and Humanized Computing, Vol. 14, pp. 8347-8362.
10. Torres-Ramos S., Fajardo-Robledo N. S., Pérez-Carrillo L. A., Castillo-Cruz C., Retamoza-Vega P. del R., Rodríguez-Betancourtt V. M., Neri-Cortés C., 2021, Mentors as female role models in STEM disciplines and their benefits, Sustainability, Vol. 13, No. 23.
11. Guimarães N., Figueira Á., Torgo L., 2021, Can fake news detection models maintain the performance through time? A longitudinal evaluation of twitter publications, Mathematics, Vol. 9, No. 22, pp. 2988.
12. Torres-Berru Y., Batista V. F. L., 2021, Data mining to identify anomalies in public procurement rating parameters, Electronics, Vol. 10, No. 22.
13. Maciejewska K., Froelich W., 2021, Hierarchical classification of event-related potentials for the recognition of gender differences in the attention task, Entropy, Vol. 23, No. 11.
14. Wei S., Shen X., Shao M., Sun L., 2021, Applying data mining approaches for analyzing hazardous materials transportation accidents on different types of roads, Sustainability, Vol. 13, No. 22.
15. Bertón M. J., 2021, Text and data mining exception in South America: A way to foster AI development in the region, GRUR International, Vol. 70, No. 12, pp. 1145-1157.
16. Meng J., Zhang H., Wang X., Zhao Y., 2021, Data mining to atmospheric corrosion process based on evidence fusion, Materials, Vol. 14, No. 22, pp. 6954.
17. Wei J., Dai J., Zhao Y., Han P., Zhu Y., Huang W., 2021, Application of association rules analysis in mining adverse drug reaction signals, Applied Sciences, Vol. 11, No. 22.
18. Xu J., Liu Z., Lin Y., Liu Y., Tian J., Gu Y., Liu S., 2021, Grey correlation analysis of haze impact factor PM2.5, Atmosphere, Vol. 12, No. 11.
19. Neissi L., Golabi M., Albaji M., Naseri A. A., 2022, Evaluating evapotranspiration using data mining instead of physical-based model in remote sensing, Theoretical and Applied Climatology, Vol. 147, pp. 701-716.
20. Xu F., Qu S., 2021, Data mining of students' consumption behaviour pattern based on self-attention graph neural network, Applied Sciences, Vol. 11, No. 22.
21. Ferencsik D. K., Varga E. B., 2021, Cycling activity dataset creation and application for feedback giving, Acta Marisiensis Seria Technologica, Vol. 18, No. 2, pp. 29-35.
22. Ekerete I., Garcia-Constantino M., Diaz-Skeete Y., Nugent C., McLaughlin J., 2021, Fusion of unobtrusive sensing solutions for sprained ankle rehabilitation exercises monitoring in home environments, Sensors, Vol. 21, No. 22.
23. Gao Q., Molloy J., Axhausen K. W., 2021, Trip purpose imputation using GPS trajectories with machine learning, ISPRS International Journal of Geo-Information, Vol. 10, No. 11.
24. Kaewyotha J., Songpan W., 2021, Multi-objective design of profit volumes and closeness ratings using MBHS optimizing based on the PrefixSpan mining approach (PSMA) for product layout in supermarkets, Applied Sciences, Vol. 11, No. 22.
25. Mariana-Ioana M., Czibula G., Oneț-Marian Z.-E., 2021, Towards using unsupervised learning for comparing traditional and synchronous online learning in assessing students' academic performance, Mathematics, Vol. 9, No. 22.
26. Konieczny J., Stojanowski J., Rydzyńska K., Kusztal M., Krajewska M., 2021, Artificial intelligence—A tool for risk assessment of delayed-graft function in kidney transplant, Journal of Clinical Medicine, Vol. 10, No. 22, pp. 5244.
27. Yang J., Wang L., 2021, Applying MMD data mining to match network traffic for stepping-stone intrusion detection, Sensors, Vol. 21, No. 22.
28. Wang D., Zou Y., Li H., Yu S., Xia L., Cheng X., Xu T., 2021, Data mining: Traditional spring festival associated with hypercholesterolemia, Cardiovascular Disorders, Vol. 21.
29. Rahman M. A., Duradoni M., Guazzini A., 2022, Identification and prediction of phubbing behavior: A data-driven approach, Neural Computing and Applications, Vol. 34, pp. 3885-3894.
30. Sheikhhosseini Z., Mirzaei N., Heidari R., Monkaresi H., 2021, Delineation of potential seismic sources using weighted K-means cluster analysis and particle swarm optimization (PSO), Acta Geophysica, Vol. 69, pp. 2161-2172.

Author

Zongli Xin
../../Resources/ieie/IEIESPC.2025.14.6.803/au1.png

Zongli Xin holds a master's degree in administration. She graduated from Southwest Jiaotong University in 2006 and works at Beijing Jingbei Vocational College. Her research interests include preschool education and home schooling.