Shi Yan
                        (School of Physical Education, University of Sanya, Sanya, 572000, China)
                        
 
            
            
Copyright © The Institute of Electronics and Information Engineers (IEIE)
            
            
            
            
            
               
                  
Keywords
               
                Action recognition,  Convolutional neural network,  Visual relation,  Self-attention mechanism
             
            
          
         
            
                  1. Introduction
The number of videos on the Internet has exploded, and computer hardware and information science have advanced significantly. However, accurately extracting valuable information from action images in online physical education videos and performing intelligent identification and classification remain open problems. A convolutional neural network (CNN) is the main deep learning tool used in computer vision tasks. The convolution operation extracts the features of an input image, and the low-level features of the data are combined to form discriminative high-level features [1]. In addition, CNNs can learn and classify features from massive data, showing very good model generalization ability [2].
               
The skeleton point modality has attracted the attention of many researchers [3]. Action image recognition using the skeleton point modality can greatly reduce the adverse effects of lens movement and background noise, so this method is well suited to recognizing action images in videos [4]. In order to optimize action recognition based on the skeleton point modality, a late fusion method for spatiotemporal depth features based on a CNN was studied, and the skeleton point modality was improved by using a spatiotemporal graph attention network (STGAT). The proposed CSTGAT model can identify human action images in video clips, provide a new way of teaching physical education, assist distance learning, and promote the harmonious and stable development of society.
               
We used a CNN and took the joints as the nodes of a graph network, using a fixed adjacency matrix to describe the relationships between nodes, which allows the features of other nodes to be updated and obtained quickly. STGAT was used to capture cross spatiotemporal information in the spatiotemporal neighborhood, expand the spatial receptive field of nodes, and introduce a separation learning strategy to accurately aggregate the features of each order of the spatiotemporal neighborhood. A dynamic time weighting strategy dynamically weights the information of each frame in the local spatiotemporal neighborhood, and an explicit motion capture strategy reduces the redundancy of local spatiotemporal features and improves recognition accuracy.
               
             
            
                  2. Related Work
As a tool for solving problems in video understanding and computer vision in recent years, action recognition technology has attracted widespread attention. Action image recognition requires judging and classifying the actions present in multiple frames of a video clip and attaching the corresponding labels [5]. Many scholars have conducted in-depth studies on this issue. Anitha et al. set up a robust human action recognition system based on image processing and used it to detect representations of human behavior [6]. Nwoye et al. designed a novel spatial attention mechanism (a class-activation-guided attention mechanism) to capture individual action triplets in a scene and analyzed surgical actions in endoscopic videos to achieve accurate action recognition [7]. Jiang et al. established a deep learning framework based on an SMO-algorithm-optimized model and an artificial-intelligence-based action recognition model for sports combination training, and they studied methods to improve the accuracy of recognizing combined movements [8].
               
Based on a trampoline motion decomposition method using deep learning image recognition, Liu et al. explored the key phases of an athlete's trampoline somersault [9]. Silva et al. developed a skeleton-driven action recognition approach based on spatiotemporal image representations and CNNs to study stereotyped movements in children with autism spectrum disorder [10]. Ali et al. explored the visible spectrum of video media for action recognition and used Beta-Liouville hidden Markov models for multimodal action recognition [11]. Kim et al. studied multi-view action recognition and classification using skeleton-based features for viewpoint-aware action recognition [12].
               
With the development of deep learning technology, the use of deep CNNs to classify action images has attracted the attention of many researchers. The structure of a CNN is becoming increasingly simple, and the performance and generalization ability of the model are stronger than those of other classification methods, so it has been widely used in various fields [13]. Zhou et al. designed a short-text classification algorithm based on semantic expansion and a CNN to extract effective information from a large number of original texts and improve classification performance for short texts. Tests on four datasets showed that the proposed model performed better than the most advanced models with lower computational cost [14].
               
Satyanarayana et al. built a CNN model to detect and classify vehicles on a road for intelligent transportation. The model does not require real-time implementation, which makes it more convenient, and its detection accuracy is as high as 98.5% [15]. Eldho et al. effectively removed Gaussian impulse noise from digital images using a new type of pseudo-CNN without adjustable parameters for image preprocessing and then used a CNN optimization model to process the images. The results showed that this method produced better qualitative and quantitative results than the current best technology and can also remove noise efficiently [16].
               
Hu et al. built a network integration framework based on a CNN that enhances local and regional motion history images in order to address facial expression recognition in video sequences [17]. Jagannathan et al. used a CNN prediction model to make timely predictions of land and natural resource information for mitigating the urban heat island phenomenon [18]. Focusing on the slow retrieval speed and easy loss of information in video retrieval, Chen et al. used 3D-CNN technology to extract spatiotemporal image features and conducted experiments on a large number of datasets. Their method has the advantage of high efficiency and can effectively improve the retrieval speed of video images [19].
               
Overall, it can be observed that relevant domestic and foreign research in the field of action image recognition and CNNs has achieved good evaluations in practice. Therefore, we used a CNN to optimize the action recognition method, designed a spatiotemporal attention network based on the self-attention mechanism, improved the skeleton point modality, constructed an action recognition model for video clips, and realized a new teaching mode to meet the needs of physical exercise and learning.
               
             
            
                  3. Construction of CSTGAT Model based on CNN
               
                     3.1 Spatiotemporal Depth Feature Fusion based on CNN
Sports actions in action image recognition are specialized and complex and are difficult to identify accurately, so more attention should be paid to processing the actions of the various parts of the human body. Human actions are discriminative not only in the spatial dimension, but also in the temporal dimension [20]. When performing a recognition task with human action images, it is necessary to deeply mine features in the spatiotemporal dimensions from online videos. For extracting spatial depth features, a method combining a depth map and an RGB map is adopted. While extracting relevant features, it can also accurately distinguish the scene level and the human body in the image.
                  
A CNN can perform deep learning from a large number of samples, thereby obtaining the corresponding features and streamlining the otherwise long and complex feature extraction process. Moreover, it can directly process the collected two-dimensional action images, which gives it strong applicability. The same structure is used for depth maps and RGB maps; the difference between the two types of maps lies in the input signals, so distinctive features are mined from each. The low-level features of the CNN focus on mining common features, while the high-level features are biased toward extracting unique features of the image.
                  
Let the graph structure be $Q=\left(R,L\right)$, where $R=\left(r_{1},r_{2},\cdots ,r_{S}\right)$ denotes the $S$ graph nodes corresponding to the joints, $L$ denotes the graph edges corresponding to the bones between joints, and the $S\times S$ adjacency matrix $O$ describes the connections between joints. If $r_{i}$ is connected with $r_{j}$, then $O_{ij}=1$; otherwise, $O_{ij}=0$. In general, $Q$ is an undirected graph, so $O$ is a symmetric matrix. Given the input vector $U$ and the graph structure, the graph convolution operation of each time step can be calculated, as shown in Eq. (1).
                  
                  
where $U^{in}$ is the input feature, $U^{out}$ is the output vector, $Y$ is a trainable feature transformation matrix, $\chi$ is the degree matrix used to normalize $O$, and $I$ is the identity matrix added to $O$ to introduce self-loop connections so that each node retains its own features. We used the Softmax loss function. $Z$ represents the output of the last layer of the neural network, which is a vector whose dimension corresponds to the number of action classes. The definition of the Softmax loss function is expressed as Eq. (2).
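As a concrete illustration of the graph convolution step described above, the following minimal NumPy sketch assumes the standard symmetrically normalized adjacency; since Eq. (1) is not reproduced here, the exact normalization used in the paper may differ, and all tensor names are illustrative.

import numpy as np

def graph_conv_step(U_in, O, Y):
    # U_in: (S, C_in) joint features at one time step
    # O:    (S, S) binary skeleton adjacency matrix
    # Y:    (C_in, C_out) trainable feature transformation matrix
    A = O + np.eye(O.shape[0])                # add self-loops (the matrix I in the text)
    deg = A.sum(axis=1)                       # node degrees used for normalization (chi)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt      # degree-normalized adjacency
    return A_norm @ U_in @ Y                  # U_out, under this assumed form of Eq. (1)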
                  
                  
In order to reduce the error generated by the loss function, the parameters of the CNN are optimized using the stochastic gradient descent algorithm, and the iterative process stops when the network converges to a stable state. We input the depth map and the independent RGB image into the deep CNN model for feature extraction and then fuse the extracted results into new features. The new features contain the spatial information of both the RGB image and the individual depth image. The resulting new feature is the spatial depth feature (SDF), which is calculated with Eq. (3).
                  
                  
                  where $A_{1}$ represents the accuracy of RGB map calculation, $A_{2}$ represents the
                     accuracy of depth map calculation, $SDF_{1}$ is the feature of the RGB map, and $SDF_{2}$
                     is the feature of the depth map. 
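Since Eq. (3) is not reproduced, one natural reading of the accuracy-weighted fusion described above is the following, where the exact combination rule is an assumption:

$$SDF=\frac{A_{1}\,SDF_{1}+A_{2}\,SDF_{2}}{A_{1}+A_{2}}$$

That is, the RGB-stream and depth-stream features are weighted by the accuracy each stream achieves on its own before being combined.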
                  
                  Human movements in teaching videos contain not only spatial characteristics, but also
                     temporal characteristics, so it is also necessary to extract the temporal depth characteristics
                     of action images. A commonly used deep learning method for processing temporal feature
                     information is based on the two-layer structure of a recurrent neural network (RNN),
                     in which the calculation of the output layer is shown in Eq. (4).
                  
                  
where $f$ represents the model function that the RNN needs to train, and $h_{t}$ represents the output layer. The RNN processes the sequence iteratively over its time scale. Therefore, the RNN works well for modeling and feature extraction on sequence data.
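For reference, the standard single-layer recurrence that this passage appears to describe is given below (an assumption, since Eq. (4) is not reproduced), where $u_{t}$ is the input at time $t$ and $W$ and $V$ are trainable matrices:

$$h_{t}=f\left(Wh_{t-1}+Vu_{t}\right)$$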
                  
The temporal depth feature extraction network is trained using the cross-entropy loss function, which is defined in Eq. (5).
                  
                  
where $v_{t}$ represents the correct label at time point $t$, and $v'_{t}$ represents the predicted label calculated by the network. In order to keep the value of the loss function low, the gradient of the loss function with respect to the parameters $B$ is calculated, and the total gradient is given by Eq. (6).
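A minimal sketch of the per-sequence cross-entropy loss described above is given below; averaging over time steps and the one-hot label encoding are assumptions, since Eq. (5) is not reproduced.

import numpy as np

def cross_entropy_loss(v_true, v_pred, eps=1e-12):
    # v_true: (T, K) one-hot correct labels v_t for each time point t
    # v_pred: (T, K) predicted class probabilities v'_t from the network
    v_pred = np.clip(v_pred, eps, 1.0)                        # avoid log(0)
    return -np.mean(np.sum(v_true * np.log(v_pred), axis=1))  # mean over time steps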
                  
                  
According to the calculated loss and gradient results, the weights can be adjusted automatically, and the optimized network model is finally obtained through learning and training. A traditional RNN suffers from gradient dispersion: when the information flow of a teaching video is too long, the number of iterations becomes so large that vanishing or exploding gradients make training difficult. In order to solve this problem, the LSTM-RNN method was used to learn temporal depth features. The unit structure of the LSTM is shown in Fig. 1.
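The following PyTorch sketch shows one way the temporal stream could be realized with an LSTM; the framework choice, layer sizes, and classification from the final time step are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class TemporalDepthLSTM(nn.Module):
    # Illustrative LSTM for temporal depth features (all sizes are assumptions).
    def __init__(self, in_dim=2048, hidden_dim=512, num_classes=60):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):            # x: (batch, T, in_dim) per-frame features
        out, _ = self.lstm(x)        # the cell state C is updated at every time step
        return self.fc(out[:, -1])   # classify from the final hidden state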
                  
The internal unit structure of the LSTM contains a state C, which is updated iteratively as the time steps of the input sequence advance; this solves the problem of gradient dispersion. The late fusion method was adopted to fuse the spatial depth features and the temporal depth features. First, the probabilities of the spatial and temporal depth features are superimposed using linear weighting before being passed to the subsequent process, and then the predicted value is obtained. The calculation of late fusion is shown in Eq. (7).
                  
                  
                  where $\varepsilon $ represents the weighting parameter, $P$ represents the final
                     prediction probability, $N$ is the number of sample features after multiple calculations
                     and analysis of the video, $P_{1}$ represents the output probability of spatial depth
                     features, and $P_{2}$ represents the output probability of temporal depth features.
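Since Eq. (7) is not reproduced, a plausible form of the linear late-fusion rule described above is the following, where the complementary weight $1-\varepsilon$ and the averaging over the $N$ sample features are assumptions:

$$P=\frac{1}{N}\sum_{n=1}^{N}\left[\varepsilon P_{1}^{\left(n\right)}+\left(1-\varepsilon \right)P_{2}^{\left(n\right)}\right]$$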
                  
                  
                        Fig. 1. Unit structure of LSTM.
 
                
               
                     3.2 CSTGAT Model based on Skeleton Point Action Recognition
Due to the rapid development of wearable motion capture devices and human pose estimation algorithms, action recognition based on skeleton points is increasingly widely used. With skeleton point data, the influence of lens movement, lighting changes, and image noise can be largely avoided, and methods using skeleton point data for action recognition pay more attention to the movements of the human body itself. However, methods based on high-order adjacency matrix decomposition suffer from high computational cost and an inability to distinguish the importance of different neighbors. Therefore, an STGAT based on a self-attention mechanism is introduced to solve these problems. STGAT can adaptively compute the connections between the physical structures of human actions in a local spatiotemporal neighborhood. The self-attention operator at each time step is defined in Eq. (8).
                  
                  
where $D_{e}$ represents the weight of the connection between node $e$ and the other nodes, $v_{e}$ represents the index of the output position, and $i$ represents the index of all possible node positions. The function $C$ normalizes the obtained results, the function $f$ gives the connection weight between the two nodes $v_{e}$ and $v_{i}$, and $g$ performs the feature-dimension transformation ($g=1$).
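Since Eq. (8) itself is not reproduced, one hedged reading that combines these definitions with the embedded Gaussian form given later in this subsection is the following (the pairwise indexing of $D$ is an assumption):

$$D_{ei}=\frac{1}{C\left(u\right)}f\left(u_{e},u_{i}\right)g\left(u_{i}\right),\qquad f\left(u_{e},u_{i}\right)=e^{\xi \left(u_{e}\right)^{T}\tau \left(u_{i}\right)}$$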
                  
According to the adjacency matrix $D$, the output features $U^{out}$ can be calculated with Eq. (9).
                  
                  
                  where $\vartheta $ is the activation function, and $E$ represents a feature transformation
                     matrix that can be learned. The study uses an embedded Gaussian function to measure
                     the similarity of a set of vectors, and its definition is expressed as Eq. (10).
                  
                  
In Eq. (10), $\xi$ is a function that maps the feature $u_{e}$ to a high-dimensional space, and $\tau$ is the function that maps the feature $u_{i}$ to the same high-dimensional space. The embedded Gaussian function is well matched to the Softmax function. For a given position $e$, the normalization factor $C$ allows the term $\frac{1}{C\left(u\right)}e^{\xi {\left(u_{e}\right)^{T}}\tau \left(u_{i}\right)}$ to be implemented as a Softmax along the dimension $i$. Through this equation, the self-attention module can be constructed. We instantiate $\xi$ and $\tau$ as 1${\times}$1 convolutions, and the number of output channels can be set to $C_{e}<C$ to reduce the computational cost. When calculating the result of the output channel, $C_{out}/d$ is used to regulate the amount of computation of the output channel. The process of building the self-attention module is shown in Fig. 2.
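A minimal PyTorch sketch of the embedded-Gaussian self-attention described above is shown below, with $\xi$ and $\tau$ realized as 1x1 convolutions and the Softmax normalization taken along the neighbor dimension $i$; the channel sizes and the concrete realization are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionAdjacency(nn.Module):
    # Computes a learned adjacency D from joint features (sketch).
    # x: (batch, C, S) features of S joints at one time step.
    def __init__(self, in_channels, embed_channels):
        super().__init__()
        self.xi = nn.Conv1d(in_channels, embed_channels, kernel_size=1)   # xi as 1x1 conv
        self.tau = nn.Conv1d(in_channels, embed_channels, kernel_size=1)  # tau as 1x1 conv

    def forward(self, x):
        q = self.xi(x)                                   # (B, C_e, S)
        k = self.tau(x)                                  # (B, C_e, S)
        scores = torch.einsum('bcs,bct->bst', q, k)      # xi(u_e)^T tau(u_i) for all pairs
        return F.softmax(scores, dim=-1)                 # Softmax over i acts as 1/C(u)

Under this reading, Eq. (9) would correspond to applying the learned adjacency $D$ and a trainable transformation $E$ to the input features, followed by the activation $\vartheta$.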
                  
The multi-head attention module can be used to learn different types of adjacency matrices, which represent different connection relationships between nodes. By running $K$ independent self-attention modules in parallel, adjacency matrices with different structures can be learned. The calculation of the output channel is expressed as Eq. (11).
                  
                  
                  where $D_{k}$ represents the adjacency matrix calculated by the $k$th self-attention
                     module, and $E_{k}$ represents the feature transformation matrix calculated by the
                     $k$th self-attention module. 
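Building on the single-head sketch above, a hedged sketch of the $K$-head version is shown below; splitting the output channels evenly across heads and concatenating the per-head results are assumptions, since Eq. (11) is not reproduced.

import torch
import torch.nn as nn

class MultiHeadAttentionAdjacency(nn.Module):
    # K independent heads, each learning its own adjacency D_k and transformation E_k
    # (reuses the SelfAttentionAdjacency sketch above).
    def __init__(self, in_channels, out_channels, embed_channels, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            SelfAttentionAdjacency(in_channels, embed_channels) for _ in range(num_heads))
        self.E = nn.ModuleList(
            nn.Conv1d(in_channels, out_channels // num_heads, kernel_size=1)
            for _ in range(num_heads))

    def forward(self, x):                                # x: (B, C, S)
        outs = []
        for head, E_k in zip(self.heads, self.E):
            D_k = head(x)                                # (B, S, S) learned adjacency
            U_k = E_k(x)                                 # (B, C_out / K, S) transformed features
            outs.append(torch.einsum('bst,bct->bcs', D_k, U_k))  # aggregate over neighbors
        return torch.cat(outs, dim=1)                    # (B, C_out, S)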
                  
The parallel processing of the self-attention mechanism provides a more flexible and stable way to establish different kinds of connections between skeletal joints. In order to allow the information of each convolution module to reach the target node through a shorter path and to remove background noise more effectively, the scope of the spatial graph attention network is expanded to the time domain so that effective information in the spatiotemporal neighborhood can be captured. A sampling operation is performed using a sliding window with range $\gamma$ and expansion coefficient $m$. The time steps of the input sequence are sampled to generate the corresponding local action sequence expressed as Eq. (12).
                  
                  
where $\gamma$ is used to control the time range of the sampling sequence, and $m$ represents the selection of a frame from a video segment. The spatiotemporal attention network computes, for each selected frame, the corresponding spatiotemporal adjacency matrix, which is defined in Eq. (13).
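A small sketch of the dilated sliding-window sampling described above is given below; since Eq. (12) is not reproduced, the centering of the window and the boundary clamping are assumptions.

def temporal_windows(sequence, gamma, m):
    # sequence: list of per-frame joint-feature arrays
    # gamma:    window range (number of frames in each local neighborhood)
    # m:        expansion (dilation) coefficient between sampled frames
    T = len(sequence)
    windows = []
    for t in range(T):
        idx = [min(max(t + m * (k - gamma // 2), 0), T - 1) for k in range(gamma)]
        windows.append([sequence[i] for i in idx])       # local action sequence around frame t
    return windows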
                  
                  
The spatiotemporal adjacency matrix $D_{\gamma }^{t}$ is obtained by calculating the similarity between each point and all of its neighbors in the local spatiotemporal neighborhood. The spatiotemporal network then calculates the output vector of each frame according to Eq. (14).
                  
                  
In order to achieve the research goals, the scope of STGAT is expanded through a separation learning method, which divides the joints in the local spatiotemporal neighborhood into groups. Through grouping, STGAT only needs to calculate the connection weights of each edge within each group, and the extracted features are concatenated to obtain all multi-scale features. Then, two methods are introduced to dynamically weight STGAT. An optimization parameter that can be updated with the network, $F_{DTW}$, is added. The adaptive dynamic time weighting process is shown in Fig. 3.
                  
The adaptive dynamic weighting method can only dynamically weight the action images in the local spatiotemporal neighborhood, so an explicit motion capture method is needed to remove over-extracted features in the local spatiotemporal neighborhood and increase the temporal awareness of each frame of the action image. The explicit motion capture strategy not only highlights the changes in human motion, but also cooperates with the adaptive dynamic temporal weighting method to effectively reduce redundant extracted features. Through the fusion of spatiotemporal depth features based on a CNN and the use of skeleton points for human action recognition, a CNN-based skeleton point self-attention action recognition model, called the CSTGAT model, was constructed. The specific flow of the CSTGAT model is shown in Fig. 4.
                  
                  Three evaluation indicators were used to evaluate the quality of the prediction model:
                     accuracy, recall, and the F1 value. First, the definition of accuracy is expressed
                     as Eq. (15):
                  
                  
where $TP$ represents the number of positive data with prediction results that are consistent with the actual situation, and $NP$ represents the number of positive data with prediction results that are inconsistent with the actual situation. The recall rate is calculated with Eq. (16):
                  
                  
                  where $P$ represents the total number of positive samples. The F1 value can be obtained
                     by calculating the harmonic mean of precision and recall. The larger the value is,
                     the better the prediction effect of the model will be. The F1 value is calculated
                     with Eq. (17):
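A small sketch of the three indicators as they are described here is given below; the denominator of the first metric follows the TP/NP definitions in the text, and the rest of the details are assumptions since Eqs. (15)-(17) are not reproduced.

def accuracy(tp, np_):
    # tp: positives predicted correctly; np_: positives predicted incorrectly (as defined above)
    return tp / (tp + np_)

def recall(tp, p_total):
    # p_total: total number of positive samples P
    return tp / p_total

def f1_value(precision, recall_):
    # F1 value: harmonic mean of precision and recall
    return 2 * precision * recall_ / (precision + recall_)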
                  
                  
                  
                        Fig. 2. The flow of the self-attention module.
 
                  
                        Fig. 3. Adaptive dynamic time weighting process.
 
                  
                        Fig. 4. The flow of the CSTGAT model.
 
                
             
            
                  4. Performance Analysis of CSTGAT Model based on CNN
               In order to verify the relevant performance of the CSTGAT model based on a CNN, three
                  action recognition databases were selected for an experiment: MSR 3D Online Action,
                  NTU RGB+D 60, and NTU RGB+D 120. There are 49,286 video actions in the MSR 3D Online
                  Action dataset, which are divided into 60 categories. There are about 350 images in
                  each video. There are 5776 video actions in the NTU RGB+D 60 dataset, which are also
                  divided into 60 categories. There are 91,854 video actions in the NTU RGB+D 120 dataset,
                  which are divided into 120 categories. 
               
At present, the most mainstream video action recognition models include the TSN model and the SOTA model. The TSN model samples a series of short clips from the whole video and obtains the video prediction result based on the consensus of these clips, which is very useful for classifying long videos. The SOTA model has a fast inference speed and high accuracy [21]. Therefore, the TSN model, the SOTA model, and the CSTGAT model were selected for comparison. The data samples were divided into a training set and a validation set according to different shooting angles. In the process of CNN training, a model using a penalty-term loss function and a model using the cross-entropy loss function were both trained, and the training results of the two loss functions are shown in Fig. 5.
               
It can be seen in Fig. 5(a) that the model using the loss function with a penalty term has an average accuracy of 40.7%. Moreover, the value of its cost function fluctuates greatly during CNN training, and it is difficult for the penalty-term loss function model to achieve good convergence. In Fig. 5(b), the model using the cross-entropy loss function has an average accuracy of 91.6%, and the value of its cost function is low and stable, so the model converges well. The experimental results show that the cross-entropy loss function allows the model to reach a stable convergence state more quickly and effectively, so the cross-entropy loss function was adopted in this study. After training the CNN, the convergence behavior of the different models was obtained, as shown in Fig. 6.
               
As shown in Fig. 6, the convergence of the CSTGAT model is better than that of the other two action recognition models. To achieve stable convergence, the CSTGAT model needs only 217 iterations, the SOTA model needs 262 iterations, and the TSN model needs 285 iterations. The experimental results show that the convergence performance of the CSTGAT model is better and that network training completes successfully. In the experiment, different feature methods were used for training and validation on the three datasets, and the accuracy results are shown in Table 1.
               
It can be observed from Table 1 that on the NTU RGB+D 120 dataset, the training accuracy of the CSTGAT model is 97.2%, and the validation accuracy is 97.5%. The training accuracy of the TSN model is 88.6%, and the validation accuracy is 88.2%. The training accuracy of the SOTA model is 90.5%, and the validation accuracy is 90.2%. The validation accuracy of the CSTGAT model is higher than those of the TSN model and the SOTA model by 9.3% and 7.3%, respectively.
               
On the MSR 3D Online Action dataset, the training accuracy of the CSTGAT model is 96.1%, and the validation accuracy is 96.9%. The training accuracy of the SOTA model is 89.6%, and the validation accuracy is 90.1%. The training accuracy of the TSN model is 87.9%, and the validation accuracy is 88.1%. Compared with the CSTGAT model, the TSN model's validation accuracy is 8.8% lower, and the SOTA model's validation accuracy is 6.8% lower.
               
On the NTU RGB+D 60 dataset, the training accuracy of the CSTGAT model is 97.2%, and the validation accuracy is 96.8%. The training accuracy of the TSN model is 86.5%, and the validation accuracy is 87.8%. The training accuracy of the SOTA model is 90.9%, and the validation accuracy is 91.5%. The validation accuracy of the CSTGAT model is higher than those of the TSN model and the SOTA model by 9.0% and 5.3%, respectively. In the experiment, the different models were run on the validation set, and the comparison between the predicted values and the actual values of the different action recognition models is shown in Fig. 7.
               
Fig. 7 shows that the accuracy of the SOTA model is 91.50%, the accuracy of the CSTGAT model is 98.47%, and the accuracy of the TSN model is 69.15%. Compared with the SOTA model, the accuracy of the CSTGAT model is 6.97% higher, and compared with the TSN model, it is 29.32% higher. The results show that the CSTGAT model can handle a large amount of computation while maintaining high accuracy. In the experiment, each model was run 100 times on the validation set, and the resulting precision and recall are shown in Fig. 8.
               
As shown in Fig. 8, the precision and recall curves of the CSTGAT model remain stable as the number of experiments increases, and its average precision is 97.43%. The curves of the other two models fluctuate greatly; the average precision of the TSN model is 86.59%, while that of the SOTA model is 90.71%. The precision and recall of the CSTGAT model are higher than those of the other two action recognition models, indicating that the CSTGAT model has higher accuracy.
               
The average recall rate of the CSTGAT model is 71.65%, while that of the SOTA model is 61.86%. The average recall rate of the TSN model is 49.53%, which is 22.03% lower than that of the CSTGAT model. The results show that the CSTGAT model has higher precision and more comprehensive recall. The three action recognition models were then tested on the validation set, and their performance was evaluated using the F1 value. The variation of the F1 value of the three action recognition models is shown in Fig. 9.
               
It can be observed from Fig. 9 that over 100 tests, the CSTGAT model has the most stable F1 curve. The experimental results show that the average F1 value of the CSTGAT model is 96.83%, while that of the SOTA model is 85.94%. The average F1 value of the TSN model is 69.11%, which is lower than that of the CSTGAT model by 27.72%. As the number of iterations increases, the CSTGAT model remains very stable with little fluctuation, while the SOTA model and the TSN model fluctuate with greater range and frequency; the SOTA model has the largest fluctuation range and the worst model expressiveness. Based on these results, the CSTGAT action recognition model can achieve extremely high accuracy and precision and can accurately identify human movements in videos, which is conducive to the development of online teaching methods.
               
               
                     Fig. 5. Changes in the training process of CNNs for models using different loss functions.
 
               
                     Fig. 6. Convergence process of different models in CNN training.
 
               
                     Fig. 7. Error analysis of different models.
 
               
                     Fig. 8. Precision and recall for different models.
 
               
                     Fig. 9. Variation of the F1 value for different models.
 
               
                     Table 1. Comparison of the accuracy of CSTGAT model and other latest models on three datasets.
                  
                        
                           
| Dataset | Model | Training set accuracy (%) | Validation set accuracy (%) |
| NTU RGB+D 120 | TSN | 88.6 | 88.2 |
| NTU RGB+D 120 | SOTA | 90.5 | 90.2 |
| NTU RGB+D 120 | CSTGAT | 97.2 | 97.5 |
| MSR 3D Online Action | TSN | 87.9 | 88.1 |
| MSR 3D Online Action | SOTA | 89.6 | 90.1 |
| MSR 3D Online Action | CSTGAT | 96.1 | 96.9 |
| NTU RGB+D 60 | TSN | 86.5 | 87.8 |
| NTU RGB+D 60 | SOTA | 90.9 | 91.5 |
| NTU RGB+D 60 | CSTGAT | 97.2 | 96.8 |
                     
                  
                
             
            
                  5. Conclusion
This study provided a solution for the late fusion of spatiotemporal depth features based on a CNN and for skeleton point action recognition based on a self-attention mechanism. Combining these recognition methods, a skeleton point action recognition model based on a CNN was constructed. The results showed that after CNN training, the CSTGAT model achieved stable convergence within only 217 iterations. In contrast, the SOTA model needed 45 more iterations than the CSTGAT model, and the TSN model needed 68 more iterations. The accuracy of the CSTGAT model was 98.47%, which is 6.97% higher than that of the SOTA model and 29.32% higher than that of the TSN model.
               
The precision of the CSTGAT model was 97.43%, which was 10.84% higher than that of the TSN model and 6.72% higher than that of the SOTA model. The recall rate of the CSTGAT model was 71.65%, which was 9.79% higher than that of the SOTA model and 22.03% higher than that of the TSN model. After 100 tests, the F1 value of the CSTGAT model was 96.83%, which was 10.89% higher than that of the SOTA model and 27.72% higher than that of the TSN model.
               
In summary, the CSTGAT model can perform action recognition more efficiently and accurately and has better model expressiveness. However, this research still has shortcomings: the model has too many parameters, and its structure needs to be simplified in future research.
               
             
          
         
            
                  
                     REFERENCES
                  
                     
                        
[1] M. Guo, Z. Yu, Y. Xu, and C. Li, "ME-Net: A deep convolutional neural network for extracting mangrove using Sentinel-2A data," Remote Sensing, vol. 13, no. 7, pp. 1-24, 2021.
[2] H. Wu, B. Zhou, K. Zhu, C. Shang, H. Y. Tam, and C. Lu, "Pattern recognition in distributed fiber-optic acoustic sensor using intensity and phase stacked convolutional neural network with data augmentation," Optics Express, vol. 29, no. 3, pp. 3269-3283, 2021.
[3] R. Rastgoo, K. Kiani, and S. Escalera, "Hand sign language recognition using multi-view hand skeleton," Expert Systems with Applications, vol. 150, no. 8, pp. 113336, 2020.
[4] M. F. Tsai and C. H. Chen, "Spatial temporal variation graph convolutional networks (STV-GCN) for skeleton-based emotional action recognition," IEEE Access, vol. 9, pp. 13870-13877, 2021.
[5] P. Gao, D. Zhao, and X. Chen, "Multi-dimensional data modelling of video image action recognition and motion capture in deep learning framework," IET Image Processing, vol. 14, no. 7, pp. 1257-1264, 2020.
[6] U. Anitha, R. Narmadha, D. R. Sumanth, and D. N. Kumar, "Robust human action recognition system via image processing," Procedia Computer Science, vol. 167, pp. 870-877, 2020.
[7] C. I. Nwoye, T. Yu, C. Gonzalez, B. Seeliger, P. Mascagni, D. Mutter, and N. Padoy, "Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos," Medical Image Analysis, vol. 78, pp. 102433, 2022.
[8] H. Jiang and S. B. Tsai, "An empirical study on sports combination training action recognition based on SMO algorithm optimization model and artificial intelligence," Mathematical Problems in Engineering, vol. 2021, pp. 1-11, 2021.
[9] Y. Liu, H. Dong, and L. Wang, "Trampoline motion decomposition method based on deep learning image recognition," Scientific Programming, vol. 2021, no. 9, pp. 1-8, 2021.
[10] V. Silva, F. Soares, C. P. Leo, J. S. Esteves, and G. Vercelli, "Skeleton driven action recognition using an image-based spatial-temporal representation and convolution neural network," Sensors, vol. 21, no. 13, pp. 4342, 2021.
[11] S. Ali and N. Bouguila, "Multimodal action recognition using variational-based Beta-Liouville hidden Markov models," IET Image Processing, vol. 14, no. 17, pp. 4785-4794, 2020.
[12] S. H. Kim and D. Cho, "Viewpoint-aware action recognition using skeleton-based features from still images," Electronics, vol. 10, no. 9, pp. 1118, 2021.
[13] P. Xuan, Z. Gong, H. Cui, B. Li, and T. Zhang, "Fully connected autoencoder and convolutional neural network with attention-based method for inferring disease-related lncRNAs," Briefings in Bioinformatics, no. 3, pp. 89-91, 2022.
[14] H. Wang, J. He, X. Zhang, and S. Liu, "A short text classification method based on N-gram and CNN," Chinese Journal of Electronics, vol. 29, no. 2, pp. 248-254, 2020.
[15] L. Abraham and M. Sasikumar, "Vehicle detection and classification from high resolution satellite images," Journal of Bacteriology, vol. 2, no. 1, pp. 1-8, 2014.
[16] M. Mafi, W. Izquierdo, H. Martin, M. Cabrerizo, and M. Adjouadi, "Deep convolutional neural network for mixed random impulse and Gaussian noise reduction in digital images," IET Image Processing, vol. 14, no. 3, pp. 3791-3801, 2020.
[17] G. S. Hayes, S. N. McLennan, J. D. Henry, L. H. Phillips, and I. Labuschagne, "Task characteristics influence facial emotion recognition age-effects: A meta-analytic review," Psychology and Aging, no. 2, pp. 295-315, 2020.
[18] J. Jagannathan and C. Divya, "Deep learning for the prediction and classification of land use and land cover changes using deep convolutional neural network," Ecological Informatics, vol. 65, no. 15, pp. 101412, 2021.
[19] H. Chen, C. Hu, F. Lee, C. Lin, W. Yao, L. Chen, and Q. Chen, "A supervised video hashing method based on a deep 3D convolutional neural network for large-scale video retrieval," Sensors, vol. 21, no. 9, pp. 3094, 2021.
[20] R. Ji, "Research on basketball shooting action based on image feature extraction and machine learning," IEEE Access, vol. 8, pp. 138743-138751, 2020.
[21] C. Chen, J. Song, C. Peng, G. Wang, and Y. Fang, "A novel video salient object detection method via semisupervised motion quality perception," IEEE, vol. 32, no. 5, pp. 2732-2745, 2019.

 
                   
                
             
            Author
            
            
Yan Shi was born on August 20, 1986. She is an associate professor and holds a master's degree. She graduated from the Xi'an Institute of Physical Education in July 2008, majoring in human movement science, and graduated from the same institute in July 2011, also in human movement science. She now works in the School of Physical Education, University of Sanya. She has published 10 academic articles and participated in 4 scientific research projects.