
  1. (Jilin Animation Institute, Changchun 130000, China)



Keywords: Scene planning, automatic generation of animations, active learning, deep learning models

1. Introduction

Computer animation automatic generation technology based on artificial intelligence starts from the design and production process of animation and studies the whole process from a script written in natural language to the final animation, aiming to improve the automation and intelligence of animation production, thereby reducing production costs, shortening production cycles, and improving the efficiency of the animation industry [1,2]. Automatic animation generation is a novel area of research. In contrast to traditional animation production, which involves lengthy cycles and substantial investment of manpower and financial resources, this new approach leverages advances in computer hardware, graphics algorithms, and artificial intelligence [3,4]. Automatic animation generation technology therefore relies on computer technologies such as artificial intelligence to explore how computers can comprehend storylines, generate the corresponding scenes, and ultimately produce animations. Its primary aim is to transform conventional production methods and enhance the intelligence and efficiency of the production process. As a cutting-edge topic, it has attracted extensive attention from scholars and researchers, who have achieved promising results in several directions.

One significant focus is automatic text-to-scene conversion, which investigates how to generate corresponding 3D scenes automatically from text described in natural language. In a text-to-scene conversion system, each natural language statement is analyzed and translated; however, unlike general natural language processing (NLP), the NLP techniques employed for automatic text-to-scene conversion must tackle additional challenges, such as understanding the spatial relationships and other constraints between objects that are conveyed through words, particularly prepositional phrases. Notable systems in this domain include the WordsEye system, the CarSim system, and the SWAN system [5,6]. Research on generating computer animation from natural language is mainly driven by stories described in natural language, which the computer translates into formal descriptions used to generate virtual character (agent) animations. It generally includes the following steps: story understanding, generating agent characters, constructing animation scenes, and simulating agent actions to form the animation [7,8]; a representative system is CONFUCIUS. Story understanding is an important module in automatic animation generation systems: it is not only about understanding each isolated sentence but also about inferring and analyzing the implicit states, constraints, and conflicts in the story [9]. It therefore requires a suitable knowledge base and the ability to reason over that knowledge. By comparison, the task of environmental sound classification has so far received far less attention and research [10].

Effectively harnessing, or circumventing, these acoustic signals has become a primary research focus in the 21st century, making the study of environmental sound a crucial aspect of acoustics. In an age of information overload, newly released public datasets have significantly advanced related fields, including environmental sound classification [11,12]. To streamline the design and implementation of parallel methods for deep learning models and to make parallel policy design more adaptable, researchers have begun to investigate automatic parallel methods for deep learning models grounded in graph algorithms or machine learning techniques. These approaches provide a distributed training framework for the entire deep learning process and enable automated search for distributed parallel policies [13]. Automatic parallel training methods based on graph algorithms mainly rely on the graph partitioning and scheduling principles of parallel computing to achieve model parallelism. While such methods solve quickly, they are often constrained to specific types of networks and applicable only to particular deep learning models; moreover, the model training process and many execution details still require human intervention, reflecting a heavy reliance on domain knowledge [14,15]. Different initialization conditions must also be set for different models. Some scholars have therefore proposed machine learning based methods that output end-to-end automatic parallel policies. However, in dynamic distributed environments, sample data easily becomes stale, so the performance of the output distributed parallel policy fluctuates widely. Reinforcement learning, which can interact with a dynamic environment in real time and learn autonomously without labels, has become the mainstream machine learning approach for automatic parallelism. Although existing reinforcement learning based automatic parallel methods have solved some of the problems of automatic parallel training, many shortcomings remain, such as inadequate adaptation of the policy to the dynamic environment [16,17]. At present, the performance evaluation of a policy focuses only on the resource requirements of the model itself, without considering the availability of storage and communication bandwidth resources in the environment, so parallel policies guided by static resources lose performance in actual environments [18,19].

2. Distributed Training Performance Evaluation Model

2.1. Deep Learning Models

Deep learning mainly relies on expert experience to manually design parallel strategies for the parallel training of large-scale deep learning models. Deep learning models have a typical hierarchical structure, so manual parallel strategy design mainly performs coarse-grained layer partitioning to divide the model into different sub-models, as shown in Eqs. (1) and (2), and then schedules them onto different devices for execution, balancing computing and communication loads without affecting the model's computational structure.

(1)
$ P(\vec{x}|\lambda) = \sum_{i=1}^{M} p_i b_i(\vec{x}) $
(2)
$ \lambda = (p_i, \vec{\mu}_i, \Sigma_i) $

By studying the computational and memory characteristics of different layers in natural language domain models, the LSTM layer and the Attention layer are allocated for execution to devices that match their computational and storage performance. As shown in Eq. (3), nodes with large-scale network parameters and high computational time complexity are often distributed across different layers, and coarse-grained hierarchical division makes it impossible for a single worker node in the cluster to complete the training of such complex network layers.

(3)
$ b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}} \times \exp\left( -\frac{1}{2}(\vec{x}-\vec{\mu}_i)' \Sigma_i^{-1} (\vec{x}-\vec{\mu}_i) \right) $
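For concreteness, Eqs. (1)-(3) define a Gaussian mixture model. The following NumPy sketch evaluates the component densities of Eq. (3) and the mixture likelihood of Eq. (1); the component count, feature dimension, and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Component density b_i(x) of Eq. (3), with a full covariance matrix."""
    D = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def gmm_likelihood(x, weights, means, covs):
    """Mixture likelihood P(x|lambda) of Eq. (1), lambda = (p_i, mu_i, Sigma_i) as in Eq. (2)."""
    return sum(p * gaussian_density(x, mu, s) for p, mu, s in zip(weights, means, covs))

# Illustrative parameters: M = 2 components, D = 3 features.
rng = np.random.default_rng(0)
weights = np.array([0.4, 0.6])                    # mixture weights p_i, summing to 1
means = rng.normal(size=(2, 3))                   # component means mu_i
covs = np.array([np.eye(3), 2.0 * np.eye(3)])     # component covariances Sigma_i
print(gmm_likelihood(rng.normal(size=3), weights, means, covs))
```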

A more fine-grained segmentation method was later proposed: the skeleton model of a neural network is studied by retaining the computing nodes that are critical to its performance. The matrix partitioning methods of high-performance computing are then used to partition the model's parameter tensors, as shown in Eq. (4), achieving finer-grained tensor-level parallelism, and adaptive communication schemes are designed for different model layers by analyzing underlying communication technologies such as RPC and MPI.

(4)
$ p(x_1, x_2, \dots, x_N|\lambda) = \prod_{n=1}^{N} p(x_n|\lambda) $

With deeper study of parallel granularity and the underlying technologies, the number of possible combinations of parallel strategies grows exponentially, making it difficult to find the optimal combination with manual parallelization methods. As shown in Eqs. (5) and (6), parallel strategies customized for particular deep learning models are also not universal.

(5)
$ \log p(x_1, x_2, \dots, x_N|\lambda) = \sum_{n=1}^{N} \log p(\vec{x}_n|\lambda) = \sum_{n=1}^{N} \log \left( \sum_{i=1}^{M} p_i b_i(\vec{x}_n) \right) $
(6)
$ W^* = \text{argmax}_W \frac{P(Y|W)P(W)}{P(Y)} $

Due to the shortcomings of manual parallel methods, many scholars have begun to study automatic parallel methods. At present there are two mainstream types, as shown in Eqs. (7) and (8): automatic model parallel methods based on graph algorithms, and automatic parallel methods based on machine learning algorithms.

(7)
$ W^* = \text{argmax}_W P(Y|W)P(W) $
(8)
$ P(Y|W) = \sum_Q P(Y,Q|W) $

The initial graph-algorithm-based automatic model parallel methods adopted adaptive graph partitioning and scheduling similar to that used in high-performance computing, as shown in Eqs. (9) and (10). The proposed FM graph partitioning algorithm analyzes the structural characteristics of neural networks and guides the static partitioning of the computation graph by balancing node computation cost against the data volume of dependent edges, balancing multiple loads while minimizing communication cost.

(9)
$ P(Q|W) = \prod_{l=1}^{L} P(q^{W_l}|W_l) $
(10)
$ P(\Theta, Y|W) = a_{\Theta_0 \Theta_1} \prod_{t=1}^{T} b_{\Theta_t}(y_t)\, a_{\Theta_t \Theta_{t+1}} $
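Eq. (8) marginalizes over the hidden state sequences, and Eq. (10) gives the per-path factorization into transition and emission terms. The standard way to compute this sum without enumerating paths is the forward algorithm; below is a minimal NumPy sketch, assuming a discrete-observation HMM with illustrative transition, emission, and initial-state parameters.

```python
import numpy as np

def forward_likelihood(obs, pi, a, b):
    """P(Y|W) = sum_Q P(Y,Q|W) (Eq. (8)) via the forward recursion,
    using the transition/emission factorization of Eq. (10)."""
    alpha = pi * b[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(y_1)
    for y in obs[1:]:
        alpha = (alpha @ a) * b[:, y]         # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(y_t)
    return alpha.sum()

# Illustrative 2-state, 3-symbol model.
pi = np.array([0.6, 0.4])
a = np.array([[0.7, 0.3],
              [0.4, 0.6]])
b = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_likelihood([0, 1, 2, 1], pi, a, b))
```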

Opt deep learning was then proposed, together with a tensor-segmentation-based method for parallel scheduling; the search for the optimal policy is automated using dynamic programming, as shown in Eqs. (11) and (12). However, it still focuses on coarse-grained tensor segmentation at the layer level, so the final performance improvement of the policy is limited.

(11)
$ f_t = \sigma(W_f \cdot [x_t, h_{t-1}] + b_f) $
(12)
$ i_t = \sigma(W_i \cdot [x_t, h_{t-1}] + b_i) $

2.2. Model Parallel Methods

A scheduling algorithm, FastT, based on the DAG of a deep learning model was proposed; it schedules operators by setting operator priorities and the critical path of the graph, as shown in Eq. (13). However, its parallel strategy brings limited improvement to RNN models. Beachi implements strategy search using three graph algorithms, based respectively on topological sorting, earliest start time, and minimum communication volume, and can search a model parallel strategy in tens of seconds at the fastest.

(13)
$ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [x_t, h_{t-1}] + b_c) $

These methods rely on specific constraints, so the policy effects are inconsistent across different deep learning models. Therefore, although the above graph algorithms search quickly, as shown in Eq. (14), they do not consider the structural features of the deep learning model: the performance of the obtained parallel strategy improves little, and the methods apply only to some networks and training scenarios, with poor portability.

(14)
$ o_t = \sigma(W_o \cdot [x_t, h_{t-1}] + b_o) $

These methods still require manual analysis of the details of the model training process and still rely heavily on domain knowledge. To overcome the limited scenarios and the excessive reliance on knowledge of fields such as parallel computing and computer architecture, as shown in Eqs. (15) and (16), automatic model parallelism based on machine learning algorithms has become a hot research topic.

(15)
$ h_t = o_t \odot \tanh(c_t) $
(16)
$ a'_i = LSTM(a_i), \forall i \in [1, 2, \dots, T] $
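For completeness, Eqs. (11)-(15) are the standard recurrences of one LSTM cell, and Eq. (16) simply applies that cell to each element of a sequence. A minimal NumPy sketch with illustrative dimensions follows; the weight initialization and sizes are placeholders, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing Eqs. (11)-(15); W holds W_f, W_i, W_c, W_o."""
    z = np.concatenate([x_t, h_prev])                    # [x_t, h_{t-1}]
    f = sigmoid(W["f"] @ z + b["f"])                     # forget gate, Eq. (11)
    i = sigmoid(W["i"] @ z + b["i"])                     # input gate, Eq. (12)
    c = f * c_prev + i * np.tanh(W["c"] @ z + b["c"])    # cell state, Eq. (13)
    o = sigmoid(W["o"] @ z + b["o"])                     # output gate, Eq. (14)
    h = o * np.tanh(c)                                   # hidden state, Eq. (15)
    return h, c

# Illustrative sizes: input dimension 4, hidden dimension 3.
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(3, 7)) for k in "fico"}
b = {k: np.zeros(3) for k in "fico"}
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):                      # Eq. (16): apply the cell to each a_i
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```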

One line of work establishes an evaluation model for the parameter server architecture, predicts policy performance through machine learning methods, and automatically searches for the job training strategy that minimizes cost. As shown in Eqs. (17) and (18), a multi-dimensional prediction method for strategy execution time, based on dimensions such as message partitioning and communication topology, has been implemented in data parallel mode to guide strategy search for synchronous data parallelism.

(17)
$ b'_j = LSTM(b_j), \forall j \in [1, 2, \dots, T] $
(18)
$ \tilde{X} = MCFF(X_L, X'_L, X_R, X'_R) \in R^{4 \times D \times T} $

Parallex constructs linear prediction models of computation and communication for different partitioning methods of sparse model parameters, to guide their adaptive partitioning, and uses Bayesian optimization to determine a trustworthy scheduling size that guides policy output. However, these methods require a large amount of labeled data from the actual environment; in a dynamic cluster environment, as shown in Eqs. (19) and (20), the quality of the labeled data decreases, so the actual strategy falls short of expectations and is difficult to apply in practice.

(19)
$ \vec{F} = CNN(\tilde{X}) \in R^m $
(20)
$ y = \text{argmax}[H(\vec{F})] $

The Hierarchical method extracts deep learning model and device topology features and uses reinforcement learning algorithms to guide the automatic output of the model's parallel policy. Because of the need for frequent sampling and the large search space, as shown in Eq. (21), the strategy search cost of the above two methods is high, so the performance improvement is limited compared with model parallel methods based on expert experience.

(21)
$ \vec{y}'_i = \text{softmax}(\vec{X}'_i \cdot W + b) $

Spotlight abstracts the parallel training of deep learning models into an operator scheduling problem and, for the first time, formulates it as a Markov decision process. POST subsequently optimized it further by introducing cross entropy and proximal policy gradient methods into the sampling process to improve search efficiency. As shown in Eq. (22), the FlexFlow framework was then proposed as an improvement based on Opt deep learning.

(22)
$ l(\vec{y}_i, \vec{y}'_i) = -\sum_{j=0}^{C} y_{i,j} \log y'_{i,j} $
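Eq. (21) is a softmax output layer and Eq. (22) is the cross-entropy loss between predicted and true class distributions. A minimal NumPy sketch with illustrative shapes and logits is shown below.

```python
import numpy as np

def softmax(logits):
    """Eq. (21): softmax over the class dimension (numerically stabilized)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (22): l(y, y') = -sum_j y_j log y'_j, averaged over samples."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=-1))

# Illustrative batch of 2 samples and 3 classes.
logits = np.array([[1.0, 0.2, -0.5], [0.1, 0.1, 2.0]])   # stand-in for X'_i . W + b
y_true = np.eye(3)[[0, 2]]                                # one-hot labels
print(cross_entropy(y_true, softmax(logits)))
```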

3. Research on Aggregated Time-Frequency Domain Deep Learning Models for Automatic Animation Generation

3.1. Hybrid Neural Network Model Based on Time-Frequency Domain Aggregation Features

Deep learning work has introduced the idea of SOAP to establish a multi-dimensional parallel policy search space; however, its implementation is complex and it cannot be applied to recurrent neural network models. The methods above are thus effective only for some network models: when facing a different network model, the strategy search model must be retrained, so they are not portable [20,21], and the cost of designing and implementing parallel strategies remains high across network models. To address this issue, Placeto introduces a graph embedding encoding method that gives parallel strategies portability and reduces training time on similar networks [22,23]. A simulation executor has also been proposed that predicts single-step execution time under a specified strategy through static data simulation; although it accelerates strategy search, the error in the predicted execution performance of parallel strategies is relatively large. BRKGA introduces graph neural networks and constructs an execution cost model for deep learning models in order to search for parallel strategies with better execution performance; Auto Map achieves automatic parallel policy search for fine-grained models based on XLA-IR graphs; and Trinity, a reinforcement learning based adaptive distributed parallel training method, uses proximal policy optimization to expand the learning ability of the policy network [24,25]. A multi-dimensional distributed training evaluation model has been proposed to evaluate policy performance at a fine-grained level, but it considers only the static resource allocation of parallel policies; in practical environments, insufficient policy resources may degrade policy performance. HeterPS combines pipeline technology with reinforcement learning to achieve coarse-grained hierarchical scheduling [26,27].

The accuracy of evaluation models in dynamic environments is low. At present, the performance evaluation of a strategy focuses only on the computation, memory, and communication requirements of the model itself, without considering the load balancing of environment memory and the availability of communication bandwidth resources during policy output [28,29], so the performance of strategies guided by traditional evaluation models falls short of expectations. Fig. 1 shows the flowchart of reinforcement learning in animation interaction design. Strategy search is also time-consuming: current reinforcement learning automatic parallel methods all use full-process sampling for value estimation, and a policy iteration is performed only after a full interaction with the environment, so the effective sampling rate decreases as the policy space increases, greatly lengthening the learning time of the overall policy iteration [30].

Fig. 1. The flowchart of reinforcement learning in animation interaction design.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig1.png

As the state space and action space grow, the performance of this method continues to decline, and it converges easily to a locally optimal strategy: the sampling and learning process is based entirely on the policy experience collected in the current sample, and once a policy iteration has been performed on this experience the information is discarded. Therefore, when a suboptimal strategy appears, the model leans continuously towards the local optimum and loses the ability to explore different strategies, so the overall model can converge only to a local optimum.

Speech animation is a vital aspect of automatic speech recognition research and plays a significant role in the field. Initially, the recognition of speech animation was not treated as an independent research topic; rather, it emerged from the requirements of automatic speech recognition. The pronunciation of a word typically involves about three animations, so recognizing text from speech sequences essentially involves identifying these speech animations. The evolution of automatic speech recognition has spanned nearly sixty years, meaning that research into this technology has existed since the advent of computers. As a medium for information exchange, speech has led to numerous practical applications and advancements built on automatic speech recognition technology. Although some early systems focused exclusively on Arabic numerals pronounced in English by a single speaker, the underlying models and research methods were adapted and refined by subsequent researchers. Interest in voiceprint recognition then began to flourish within the speech domain, with Gaussian mixture models and their enhancements being used for voiceprint recognition. Given the close relationship between voiceprint recognition and speech animation recognition, techniques and models from voiceprint recognition have also been applied to speech animation recognition, and the Gaussian mixture model became a prominent representation during this phase.

Moreover, our understanding of speech animation extends beyond its constituent units. Fig. 2 shows the structure of the autoencoder used for texture compression in animation. With the advent of deep learning, models based on temporal neural networks and convolutional neural networks have also been used in speech animation recognition, and many practical applications have been built on them; a typical example is further improving the accuracy of speech animation recognition. In addition, in work combining speech and animation, learning the correspondence between speech animation and mouth shape makes the speech of animated characters more anthropomorphic, which suggests many applications in animation and related fields, especially robot anthropomorphism.

Fig. 2. Structure diagram of autoencoder for animation texture compression.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig2.png

3.2. Analysis of Deep Learning Models

This algorithm enables single-layer perceptrons to become multi-layer perceptrons while still learning to update their weight parameters; combined with the non-linear activation function after each layer of neurons, it effectively solves the XOR problem. This algorithm steered artificial neural networks towards non-linear computation and towards deeper and larger network structures. Robert established the universal approximation theorem for multi-layer perceptrons theoretically, showing that artificial neural networks with hidden layers can, in principle, approximate any continuous function on a specified domain; this proof reassured later researchers that artificial neural networks were worthy of extensive study and application.

In the realm of acoustic signal classification, automatic animation generation has emerged as a significant research direction. Numerous deep learning models, particularly convolutional neural networks and their enhancements, have been introduced to extract high-dimensional hidden features and more discriminative characteristics from low-level animation signal features. Researchers emphasizing temporal information have also developed various deep learning models rooted in time series analysis to capture the temporal dynamics between successive frames of animation signals. As noted above, a major line of work applies hybrid neural network models that integrate both time-domain and frequency-domain characteristics; notable examples include active learning models and CLC models. CLDNN, for instance, uses logarithmic Mel animation features as input to a convolutional neural network and then feeds the features learned by this network into a temporal prediction network. Fig. 3 shows the evaluation of the animation style transfer effect. CLC uses an LSTM to learn temporal characteristics and a deep learning model to learn animation characteristics in parallel, and then trains a classifier network by simply cascading the temporal features encoded by the LSTM with the animation features learned by the deep learning model. However, the temporal and frequency-domain information should be studied as a whole, so that highly integrated fusion features can be learned and ultimately used for DNN classification.

Fig. 3. Evaluation of animation style transfer effect.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig3.png
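A CLDNN-style pipeline of the kind described above (a CNN over log-Mel features, a recurrent layer over the resulting sequence, and a DNN classifier) can be sketched in PyTorch as follows; the layer sizes, number of classes, and input shape are illustrative assumptions, not the configuration of the original system.

```python
import torch
import torch.nn as nn

class CLDNNSketch(nn.Module):
    """CNN -> LSTM -> DNN over log-Mel inputs of shape (batch, 1, mel_bins, frames)."""
    def __init__(self, mel_bins=64, hidden=128, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),            # pool frequency only, keep time
        )
        self.lstm = nn.LSTM(32 * (mel_bins // 2), hidden, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, x):
        f = self.cnn(x)                                   # (B, C, F', T)
        f = f.permute(0, 3, 1, 2).flatten(2)              # (B, T, C*F') sequence for the LSTM
        out, _ = self.lstm(f)
        return self.dnn(out[:, -1])                       # classify from the last time step

logits = CLDNNSketch()(torch.randn(4, 1, 64, 100))        # 4 clips, 64 mel bins, 100 frames
print(logits.shape)                                        # torch.Size([4, 10])
```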

Two main improvements are made to convolutional neural networks. First, the step sizes of the convolution kernels and pooling operations along the time axis are chosen so that the feature map size in the time dimension is preserved; the network can then process temporal inputs and produce temporal outputs of consistent length, and the strong learning ability of convolution kernels and their focus on local information effectively improve recognition accuracy. Second, the final fully connected layer is enhanced with a weight-sharing mechanism that links the fully connected layer to each temporal node; the model is then trained by computing the cross-entropy loss between the predictions of all temporal nodes and the actual temporal labels, which largely avoids the pre-segmentation and post-processing of animation signals.

In earlier research on voiceprints, Gaussian mixture models and universal background Gaussian mixture models were predominantly used, with support vector machines (SVM) integrated into the GMM-UBM framework. By extracting the mean vector of each Gaussian mixture component, a Gaussian supervector was created to serve as a sample for the SVM; this approach leveraged the powerful nonlinear classification capability of SVM kernel functions and significantly improved performance over the earlier GMM-UBM approach. Fig. 4 shows the automatic generation and evaluation of character action sequences. A theoretical framework for joint factor analysis has also been proposed, further enhancing and compensating for the FA model. In the figure, the red line represents the performance of a specific model, usually the benchmark or optimal model; the green line represents another model or algorithm compared against it; the blue line shows the results of a further strategy or optimization algorithm; and the gray line generally represents a random baseline or control group.

Fig. 4. Automatic generation and evaluation of character action sequences.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig4.png
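The two timeline-preserving modifications described at the start of this subsection can be sketched as follows: convolution and pooling use stride 1 along the time axis so the temporal length is preserved, and one weight-shared linear layer is applied at every time step, with the cross-entropy loss taken over all temporal nodes. Shapes, channel counts, and class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FramewiseCNN(nn.Module):
    """CNN that preserves the time axis and applies a shared linear classifier per frame."""
    def __init__(self, mel_bins=40, n_classes=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),             # downsample frequency, keep time
        )
        self.per_frame = nn.Linear(32 * (mel_bins // 4), n_classes)   # shared across time steps

    def forward(self, x):                                 # x: (B, 1, mel_bins, T)
        f = self.features(x)                              # (B, 32, mel_bins//4, T)
        f = f.permute(0, 3, 1, 2).flatten(2)              # (B, T, 32 * mel_bins//4)
        return self.per_frame(f)                          # (B, T, n_classes): one output per frame

model = FramewiseCNN()
logits = model(torch.randn(2, 1, 40, 120))                # (2, 120, 20)
labels = torch.randint(0, 20, (2, 120))                   # a frame-level label per temporal node
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 20), labels.reshape(-1))
print(logits.shape, loss.item())
```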

A theoretical model of the full factor space was proposed, which maps speech signals onto a fingerprint-like identity vector in this space. The dimensionality of animation signal features is greatly reduced, the key features of the content are extracted, and the corresponding traditional prediction methods are relatively mature; this raised the recognition rate by a further step and kept a dominant position on the problem for a long time. Convolutional neural networks were later extended to many fields and achieved significant results in areas such as facial recognition, video surveillance, behavioral motion analysis, natural language processing, and the speech recognition related to this research topic.

A convolutional neural network includes a feature extractor consisting of convolutional layers and subsampling layers. In a convolutional layer, a neural unit is connected only to some units of the adjacent layers, and the layer usually contains several feature planes composed of neurons arranged in a matrix. The neurons of the same feature plane share weight parameters, which are called the convolution kernel. Convolution kernels are generally initialized as a random matrix of small values drawn from a certain probability distribution; during the training of a deep convolutional neural network, the kernels are iteratively updated to learn reasonable weight parameters. Fig. 5 shows the evaluation of scene layout deep learning optimization. The direct benefit of kernels sharing parameters is a reduction in the complexity of the network, that is, in the number of connections between layers, which also lowers the risk of overfitting. Another core operation of convolutional neural networks is pooling, i.e., subsampling, which generally takes two forms: maximum subsampling and mean subsampling. Its function is similar to that of a convolution kernel, and it can be regarded as a special type of convolution kernel.

Fig. 5. Scene layout deep learning optimization evaluation diagram.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig5.png
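The two subsampling forms mentioned above can be illustrated on a toy feature map; the tensor values here are only for demonstration.

```python
import torch
import torch.nn.functional as F

# Max and mean subsampling (pooling) on a 4x4 single-channel feature map.
fmap = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
print(F.max_pool2d(fmap, kernel_size=2))   # keeps the largest value in each 2x2 window
print(F.avg_pool2d(fmap, kernel_size=2))   # keeps the mean of each 2x2 window instead
```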

4. Research on Deep Learning Models for Automatic Animation Generation and Active Learning

Although traditional Gaussian mixture models and other machine learning methods can solve classification problems to a certain extent, in the era of big data, deep learning methods, or combinations of deep learning and machine learning, fit data more effectively and achieve better performance. Training environmental sound datasets with convolutional neural network models has achieved excellent performance compared with traditional algorithms, providing a good reference in both data and deep network models for later researchers. Piczak extracts animation features from sound signals, performs Mel cepstral analysis and a cosine transform to obtain Mel animation cepstral coefficient features, and takes the first-order difference of the MFCC features to obtain two-channel data samples; a convolutional neural network combined with a fully connected network is then trained, with the Dropout mechanism used to prevent overfitting. Fig. 6 shows the performance evaluation of the lighting rendering algorithm. A sample-level deep learning network was also developed, applying one-dimensional convolution directly to the raw waveform to learn animation features and concatenating the feature data from the previous pooling layer as classification features. To exploit animation features, MFCC features, and CRP features, two prominent image classification networks, AlexNet and GoogleNet, were employed after integrating the three features into a three-channel image. Additionally, a preprocessing method based on the Constant Q Transform (CQT) was designed to extract more effective animation features before they are fed into deep learning models. In the accompanying graph, the yellow line indicates a newly proposed method or algorithm, typically used to showcase its performance on specific tasks, while the blue line represents a comparison model.

Fig. 6. Performance evaluation of lighting rendering algorithm.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig6.png
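The two-channel input construction described above (static cepstral coefficients plus their first-order difference) can be reproduced with librosa; the file name, sample rate, and coefficient count below are illustrative assumptions.

```python
import numpy as np
import librosa

def two_channel_features(path, n_mfcc=40):
    """Static MFCCs plus their first-order delta, stacked as a two-channel sample."""
    y, sr = librosa.load(path, sr=22050)                      # illustrative sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    delta = librosa.feature.delta(mfcc, order=1)              # first-order difference
    return np.stack([mfcc, delta], axis=0)                    # (2, n_mfcc, frames)

# Example with a hypothetical file: features = two_channel_features("clip.wav")
```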

Drawing on the effective application of deep learning, LSTM, and DNN in active learning models, an enhanced CLDNN model based on a bidirectional long short-term memory network has been proposed for automatic animation generation tasks. Building on the CLDNN architecture, the research team also introduces an active network that adds attention-weighted calculations after each LSTM layer. In the context of automatic animation generation, another approach combines deep learning with LSTM: a parallel joint LSTM and deep learning network structure is designed to independently learn the relationships between animation frame sequences and the high-dimensional local representations of animation features. In the LSTM module, Mel-frequency cepstral coefficient features serve as the temporal input, facilitating the extraction and learning of temporal relationships, while the deep learning module directly takes the animation image after the short-time Fourier transform as input; the sequence lengths of the deep learning module and the LSTM module in the time dimension are kept consistent. After the two independent modules learn their features separately, the features are cascaded and fed, as a new representation, into a fully connected network for classification. Fig. 7 shows the accuracy evaluation of speech-driven lip synchronization. In the TUT2016 automatic animation generation experiment, this model demonstrated that the parallel combination of LSTM and deep learning outperforms independent DNN, deep learning, and LSTM models in classification accuracy. Many improvements have since been made to this parallel model, such as using adversarial networks to construct feature data instead of constructing environmental sound data directly, since generating plausible animation signals is very difficult whereas constructing animation features is comparatively easy and effective.

Fig. 7. Speech driven mouth shape synchronization accuracy evaluation diagram.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig7.png
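The parallel two-branch structure described above (an LSTM over MFCC frames and a CNN over the spectrogram image, with the two feature vectors concatenated before a fully connected classifier) can be sketched as follows; branch widths, input shapes, and class count are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class ParallelCRNN(nn.Module):
    """Two independently learned branches, concatenated for classification."""
    def __init__(self, n_mfcc=40, n_classes=15, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten())    # (B, 16*4*4)
        self.classifier = nn.Sequential(
            nn.Linear(hidden + 16 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, mfcc_seq, spec_img):
        _, (h, _) = self.lstm(mfcc_seq)                    # temporal branch: last hidden state
        t_feat = h[-1]                                     # (B, hidden)
        f_feat = self.cnn(spec_img)                        # spectrogram branch: (B, 256)
        return self.classifier(torch.cat([t_feat, f_feat], dim=1))

model = ParallelCRNN()
out = model(torch.randn(2, 100, 40), torch.randn(2, 1, 128, 100))
print(out.shape)                                           # torch.Size([2, 15])
```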

Next, LSTM and deep learning networks were used to extract features in parallel, and a mechanism combining a fully connected neural network with a support vector machine model was fused to classify environmental sounds. The method in this paper begins by training a classification model for the corresponding image, incorporating LSTM and deep learning branches at the base of the model and connecting the feature vectors generated by both paths as ASC features; at the top of the model, a three-layer fully connected neural network is employed, ending with a Softmax function for classification. The input features consist of animation features or Mel-frequency animation features, with the Gaussian mixture model replaced by a more robust convolutional neural network. The performance improvement arises because deep neural networks learn the intricate relationships within speech signal features more effectively and are more robust than Gaussian mixture models. This approach uses time-delay deep neural network models for speech animation recognition. Table 1 shows the DCASE2016 dataset. A weight sharing method was further proposed that can better learn speech features; experiments on speech animation recognition and speech search tasks show that the convolutional neural network based model reduces error rates by 6%-10% compared with traditional models.

Table 1. DCASE2016 dataset.

Category   Scene class        Description
Indoor     Cafe/restaurant    Small cafe/restaurant
Indoor     Grocery store      Medium-size grocery store
Indoor     Home               Home environment
Indoor     Library            Library environment
Indoor     Metro station      Metro environment
Indoor     Office             Multiple persons, typical workday

Whether it is text or a speech animation, each corresponds to a segment of the speech signal. Speech signals of different lengths generate frame sequences of different sizes, so temporal neural networks require continuous speech signals to be pre-segmented and the label sequences to be obtained through post-processing, which limits their applicability to this task. A method requiring neither pre-segmentation nor post-processing was proposed to solve these two problems, and experiments on the TIMIT speech dataset showed that it has advantages over HMM and hybrid HMM-RNN methods; a BLSTM-CTC model for speech animation recognition was subsequently proposed on the basis of that method. However, existing hybrid models also have shortcomings. Epoch: 64 indicates that the model has completed 64 rounds of training, i.e., the model traverses the entire dataset many times, gradually improving performance and accuracy by continuously updating weights and adjusting parameters. Although there are two ways to reduce multi-channel feature maps to a single-channel feature map, both may lose global contextual information and even destroy structural information. Table 2 shows the impact of sliding window length and Mel feature dimension on the Multi-L deep learning model. In the approach that uses LSTM to learn temporal characteristics and deep learning models to learn animation characteristics in parallel, a classifier network is trained by simply cascading the temporal characteristics encoded by the LSTM and the animation graph characteristics learned by deep learning; however, the temporal and frequency domain information should be learned as a whole so that highly integrated fusion features can be obtained and ultimately used for DNN classification.
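The segmentation-free training used by the BLSTM-CTC model mentioned above rests on the CTC loss, which aligns frame-level network outputs with unsegmented label sequences during training. A minimal PyTorch usage sketch follows; the sequence lengths, batch size, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

# CTC lets frame-level outputs be trained against unsegmented label sequences,
# so no pre-segmentation or post-processing alignment step is needed.
T, B, C = 50, 2, 28                                       # frames, batch, classes (index 0 = blank)
log_probs = torch.randn(T, B, C).log_softmax(dim=2)       # e.g. per-frame BLSTM outputs
targets = torch.randint(1, C, (B, 12))                    # unaligned label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```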

5. Experimental Analysis

In the model structure that serially connects convolutional neural networks and long short-term memory networks, animation features are first fed into the deep learning network module, and the output of that module is then used as the input of the LSTM network module. This sub-network structure has two obvious problems. The convolution kernels of convolutional neural networks can effectively learn local feature information within their receptive fields, but as the kernels move, the temporal information of consecutive animation signal frames gradually becomes confused, ultimately losing the global context of the animation and hampering the effective training of the subsequent LSTM temporal module. Fig. 8 shows the evaluation of the enhancement effect of 3D model details. In the convolutional module, multiple convolution kernels are usually used to enrich the diversity of feature maps; in the CLDNN architecture, however, the final multi-channel feature map of the deep learning module must be reduced to the single-channel feature map required by the LSTM.

Fig. 8. Evaluation of 3D model detail enhancement effect.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig8.png
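The channel reduction just described can be done in more than one way; the short sketch below shows two common options on an illustrative CNN output of shape (batch, channels, frequency bins, frames): averaging the channels, or "stretching" (flattening) channels and frequency into one feature axis per frame.

```python
import torch

feat = torch.randn(2, 32, 10, 100)                # (B, C, F, T): illustrative CNN output

mean_reduced = feat.mean(dim=1)                   # (B, F, T): average across channels
stretched = feat.permute(0, 3, 1, 2).flatten(2)   # (B, T, C*F): stretch channels into features

print(mean_reduced.shape, stretched.shape)
```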

This leads to a loss of temporal information and may even compromise the integrity of the temporal structure. The sampling mechanism of animation signals makes it possible to analyze the time domain and the frequency domain of an animation independently, and current methods that use temporal neural networks to analyze time-domain features and convolutional neural networks to analyze frequency-domain features have achieved certain performance improvements. More importantly, however, the time and frequency domains of an animation should be processed and analyzed as a whole, in order to learn more highly aggregated time-frequency domain fusion features. Fig. 9 shows the evaluation of transition smoothness between animation frames; existing convolutional neural networks, temporal neural networks, and their hybrids still analyze time-domain and frequency-domain information independently.

Fig. 9. Evaluation of transition smoothness between animation frames.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig9.png

The network architecture of the CLC model first combines the animation features learned by deep learning with the temporal features learned by the LSTM, and then classifies and trains on the joint features. However, no further learning is carried out on the joint features, so the processing and analysis of animation signals is inadequate and the hybrid model cannot fully exploit its performance advantage. Fig. 10 shows the diversity evaluation of character expression synthesis, including the single-channel feature map Mf generated by the stretching (flattening) dimensionality reduction operation and its correlation matrix.

Table 2. The influence of sliding window length and Mel feature dimension on multi-L deep learning models.

(Epoch: 64)
Frames    10     20     30     40     50     60     70     80     90     100
Flatten   512    1024   2048   2560   3072   3584   4608   5120   5632   6144
Segment   73.16  75.77  75.86  76.91  77.42  77.61  77.86  78.15  77.72  78.06
Vote      81.79  83.58  82.30  83.07  83.07  82.82  83.33  83.58  83.07  82.30
Mean      82.30  83.84  82.30  84.61  83.58  83.33  83.58  83.84  83.07  83.07

Although all of the original multi-channel data is retained, the structure of the animation features is significantly disrupted and a large amount of redundant information is preserved. Such disrupted data structures damage the global contextual information of the animation and cannot serve as efficient embedding encodings for training LSTM temporal networks. To validate our model, we performed a series of ablation experiments for thorough analysis. Fig. 11 shows the evaluation of automatic conversion from animation script to script. To verify the effectiveness of the Multi-L deep learning model, an LSTM network was designed to learn temporal features, and the original time-frequency feature map was fed into the LSTM network as an input sequence.

Fig. 10. Diversity evaluation of character expression synthesis.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig10.png

Fig. 11. Evaluation diagram of automatic conversion from animation script to script.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig11.png

The output of the LSTM network is fed into an efficient deep learning model to learn temporal encoding features; the deep learning model learns frequency features directly from the original animation; and the final Multi-L deep learning model learns highly integrated features from the MFFM fusion features. Fig. 12 shows the feedback evaluation of interactive animation creation; animation features affect classification performance more strongly than time-domain features. The results of the Multi-L deep learning model are superior to those of the LSTM network and the deep learning model alone, which verifies the effectiveness of MFFM.

Fig. 12. Interactive animation creation feedback evaluation chart.

../../Resources/ieie/IEIESPC.2025.14.6.776/fig12.png

6. Conclusion

A fully convolutional neural network can learn all of the content information of a sample and can reshape the network structure through deconvolution; drawing on this approach, our model could also be improved to learn global content information. We hope to verify these ideas experimentally in the future, further refine the model, and enhance its performance. The earliest and most basic deep neural network achieved a speech animation recognition accuracy of 77%, a significant improvement over traditional models. Building on deep recursive recurrent neural networks, the accuracy of speech animation recognition was notably enhanced, surpassing 80% for the first time with a final accuracy of 82.3%. The model is trained for 50 iterations with an initial learning rate of 0.01; the learning rate of the LSTM module is set to 0.001 and is adjusted at the 15th, 25th, 35th, and 40th iterations. A year later, convolutional neural networks were proposed for animation recognition, followed by an improved hierarchical convolutional neural network that further raised the recognition accuracy to 83.5%. Ultimately, this method reached an accuracy of 87.43%, the best result among end-to-end trained models to date.
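The training schedule summarized above (a base learning rate of 0.01, a rate of 0.001 for the LSTM module, and adjustments at the 15th, 25th, 35th, and 40th of 50 iterations) can be expressed with per-module parameter groups and a milestone scheduler; the optimizer choice, decay factor, and module names in this sketch are assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

model = nn.ModuleDict({"lstm": nn.LSTM(40, 64, batch_first=True),
                       "classifier": nn.Linear(64, 10)})

# Base learning rate 0.01, with 0.001 for the LSTM module (per the reported settings);
# SGD with momentum and the 0.1 decay factor are assumptions, not stated in the text.
optimizer = torch.optim.SGD([
    {"params": model["lstm"].parameters(), "lr": 0.001},
    {"params": model["classifier"].parameters()}], lr=0.01, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[15, 25, 35, 40], gamma=0.1)

for epoch in range(50):
    # ... one training pass over the data would go here ...
    scheduler.step()                 # lowers the learning rates at the listed milestones
```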

In the scene planning section, this article outlines the design architecture of the planning layer and explores the formal representation of scenes based on the narrative. The use of attention signals is justified by their ability to significantly enhance problem-solving effectiveness: the attention mechanism selectively concentrates on the most pertinent aspects of the input data, allowing the model to extract essential features more efficiently. This not only boosts the model's performance but also improves its ability to comprehend complex scenes and dynamic changes. Attention signals also help the model capture keyframe information more accurately during animation generation, resulting in smoother and more natural animations. The Text2Scene system has basically achieved the goal of converting simple natural language texts into three-dimensional scenes. In the task of speech animation recognition, the proposed convolutional neural network is capable of recognizing multiple consecutive animations in an animation signal; its convolution kernel receptive field can learn the features of a single animation over a short period, but the limitation of ignoring long-term contextual content is also apparent. In future work we hope to add a temporal network to the model, so that the powerful learning ability of convolution kernels can capture local information of the audio signal while the temporal network learns long-range contextual dependencies.

References

[1] Alonso G. E., Jin X. G., 2022, Skeleton-level control for multi-agent simulation through deep reinforcement learning, Computer Animation and Virtual Worlds, Vol. 33, No. 3-4, pp. 11.
[2] Aylagas M. V., Leon H. A., Teye M., Tollmar K., 2022, Voice2Face: Audio-driven facial and tongue rig animations with cVAEs, Computer Graphics Forum, Vol. 41, No. 8, pp. 255-265.
[3] Bertiche H., Madadi M., Escalera S., 2021, PBNS: Physically based neural simulation for unsupervised garment pose space deformation, ACM Transactions on Graphics, Vol. 40, No. 6, pp. 14.
[4] Chen J., Fan C., Zhang Z., Li G., Zhao Z., Deng Z., 2023, A music-driven deep generative adversarial model for Guzheng playing animation, IEEE Transactions on Visualization and Computer Graphics, Vol. 29, No. 2, pp. 1400-1414.
[5] Chen S. C., Liu G. T., Wong S. K., 2021, Generation of multiagent animation for object transportation using deep reinforcement learning and blend-trees, Computer Animation and Virtual Worlds, Vol. 32, No. 3-4, pp. 10.
[6] Choi Y., Seo B., Kang S., Choi J., 2023, Study on 2D sprite generation using the impersonator network, KSII Transactions on Internet and Information Systems, Vol. 17, No. 7, pp. 1794-1806.
[7] Ding W., Li W. F., 2023, High speed and accuracy of animation 3D pose recognition based on an improved deep convolution neural network, Applied Sciences-Basel, Vol. 13, No. 13, pp. 7566.
[8] Duan L., 2022, Application of animation character intelligent analysis algorithm based on deep learning, Mobile Information Systems, Vol. 2022.
[9] He C., Jia Y., 2023, Automatic depth estimation and background blurring of animated scenes based on deep learning, Traitement du Signal, Vol. 40, No. 5, pp. 2225-2232.
[10] Hu Z. P., Wu R., Li L., Zhang R., Hu Y., Qiu F., 2023, Deep learning applications in games: A survey from a data perspective, Applied Intelligence, Vol. 53, No. 24, pp. 31106-31128.
[11] Kwiatkowski A., Alvarado E., Kaiogeiton V., Liu C. K., Pettré J., van de Panne M., Cani M.-P., 2022, A survey on reinforcement learning methods in character animation, Computer Graphics Forum, Vol. 41, No. 2, pp. 613-639.
[12] Li C., Lussell L., Komura T., 2021, Multi-agent reinforcement learning for character control, Visual Computer, Vol. 37, No. 12, pp. 3115-3123.
[13] Li R., Shi R., Kanai T., 2023, Detail-aware deep clothing animations infused with multi-source attributes, Computer Graphics Forum, Vol. 42, No. 1, pp. 231-244.
[14] Liu G. T., Wong S. K., 2024, Mastering broom-like tools for object transportation animation using deep reinforcement learning, Computer Animation and Virtual Worlds, Vol. 35, No. 3, pp. 15.
[15] Liu L. J., Zheng Y. Y., Tang D., Yuan Y., Fan C. J., Zhou K., 2019, NeuroSkinning: Automatic skin binding for production characters with deep graph networks, ACM Transactions on Graphics, Vol. 38, No. 4, pp. 114.
[16] Luo Y. S., Soeseno J. H., Chen T. P. C., Chen W. C., 2020, CARL: Controllable agent with reinforcement learning for quadruped locomotion, ACM Transactions on Graphics, Vol. 39, No. 4, pp. 38.
[17] Morace C. C., Le T. N. H., Yao S. Y., Zhang S. W., Lee T. Y., 2022, Learning a perceptual manifold with deep features for animation video resequencing, Multimedia Tools and Applications, Vol. 81, No. 17, pp. 23687-23707.
[18] Moutafidou A., Toulatzis V., Fudos I., 2024, Deep fusible skinning of animation sequences, Visual Computer, Vol. 40, No. 8, pp. 5695-5715.
[19] Paier W., Hilsmann A., Eisert P., 2020, Interactive facial animation with deep neural networks, IET Computer Vision, Vol. 14, No. 6, pp. 359-369.
[20] Park S., Ryu H., Lee S., Lee S., Lee J., 2019, Learning predict-and-simulate policies from unorganized human motion data, ACM Transactions on Graphics, Vol. 38, No. 6, pp. 205.
[21] Peng T., Kuang J., Liang J., Hu X., Miao J., Zhu P., Li L., Yu F., Jiang M., 2023, GSNet: Generating 3D garment animation via graph skinning network, Graphical Models, Vol. 129, pp. 10.
[22] Qiao Z., Li T. X., Hui L., Liu R. J., 2023, A deep learning-based framework for fast generation of photorealistic hair animations, IET Image Processing, Vol. 17, No. 2, pp. 375-387.
[23] Shan F., Wang Y. Y., 2022, Animation design based on 3D visual communication technology, Scientific Programming, Vol. 2022.
[24] Tan J., Tian Y., 2023, Fuzzy retrieval algorithm for film and television animation resource database based on deep neural network, Journal of Radiation Research and Applied Sciences, Vol. 16, No. 4, pp. 100675.
[25] Ullah S., Ijjeh A. A., Kudela P., 2023, Deep learning approach for delamination identification using animation of Lamb waves, Engineering Applications of Artificial Intelligence, Vol. 117, pp. 105520.
[26] Wu P., Chen S. J., 2022, A study on the relationship between painter's psychology and anime creation style based on a deep neural network, Computational Intelligence and Neuroscience, Vol. 2022.
[27] Yu C., Wang W. M., Yan J. H., 2020, Self-supervised animation synthesis through adversarial training, IEEE Access, Vol. 8, pp. 128140-128151.
[28] Yu Z. X., Wang H. H., Ren J., 2022, RealPRNet: A real-time phoneme-recognized network for "believable" speech animation, IEEE Internet of Things Journal, Vol. 9, No. 7, pp. 5357-5367.
[29] Zhang Y. L., Ban X. J., Du F. L., Di W., 2020, FluidsNet: End-to-end learning for Lagrangian fluid simulation, Expert Systems with Applications, Vol. 152, pp. 113410.
[30] Zhang Z. N., Wu Y. H., Pan Z. G., Li W. Q., Su Y., 2022, A novel animation authoring framework for the virtual teacher performing experiment in mixed reality, Computer Applications in Engineering Education, Vol. 30, No. 2, pp. 550-563.

Author

Xuelian Gao
../../Resources/ieie/IEIESPC.2025.14.6.776/au1.png

Xuelian Gao graduated from the Animation College of Jilin Art Institute in 2009 with a Bachelor of Arts degree, and from Changchun University of Science and Technology in 2017 with a master's degree in Industrial Engineering. She has been a professional course teacher at Jilin Animation College since 2011. Her research interests include animation production and the application of artificial intelligence in animation. She has led many provincial social science research projects and has published more than ten related academic papers. During her teaching career, she has guided students in various competitions, where they have won many awards, and she has received the Outstanding Instructor Award.