Deep Learning Models for Automatic Animation Generation and Active Learning
Gao Xuelian
(Jilin Animation Institute, Changchun 130000, China)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Scene planning, Automatic generation of animations, Active learning, Deep learning models
1. Introduction
Computer animation automatic generation technology based on artificial intelligence studies the entire animation design and production process, from a script written in natural language to the final implemented animation, aiming to improve the automation and intelligence of animation production and thereby reduce production costs, shorten production cycles, and improve the efficiency of the animation industry [1,2]. The automatic generation of animation represents a novel area of research. In contrast
to traditional animation production, which involves lengthy cycles and substantial
investment in manpower and financial resources, this new approach leverages advancements
in computer hardware technology, graphics algorithms, and artificial intelligence
[3,4]. Consequently, automatic animation generation technology has emerged, relying on
computer technologies like artificial intelligence to explore how computers can comprehend
storylines, generate corresponding scenes, and ultimately produce animations. The
primary aim is to transform conventional animation production methods and enhance
the intelligence and efficiency of the production process. This technology serves
as a cutting-edge topic, garnering extensive attention from scholars and researchers
who have achieved promising results in various aspects. One significant focus area
is automatic text-to-scene conversion, which investigates how to automatically generate
corresponding 3D scenes from text described in natural language. In a text-to-scene conversion system, each natural language statement is analyzed and translated; however, unlike general natural language processing (NLP) technologies, the NLP techniques employed for automatic text-to-scene conversion must tackle additional challenges. These include understanding spatial relationships and other relevant constraints between objects as conveyed through words, particularly in prepositional phrases. Notable systems in this domain include WordsEye, CarSim, and SWAN [5,6]. Research on the generation of computer animation from natural language is mainly
driven by stories described in natural language, which the computer translates into formal descriptions and then uses to generate animations of virtual character agents. It generally includes the following steps: story understanding, generating agent characters, constructing animation scenes, and simulating agent actions to form the animation [7,8]. A representative system is CONFUCIUS. Story understanding is an important module in animation automatic generation systems: it means not only understanding each isolated sentence, but also inferring and analyzing the implicit states, constraints, and conflicts in the story [9]. Such a system must therefore possess the necessary knowledge base and the ability to reason with that knowledge. By comparison, the task of environmental sound classification has until recently received relatively little attention [10].
Effectively harnessing or circumventing these acoustic signals has emerged as a primary
research focus in the 21st century, making the study of environmental sound a crucial
aspect of acoustics. In an age characterized by information overload, the introduction
of new public datasets has significantly advanced related fields, including the classification
of environmental sounds [11,12]. To streamline the design and implementation of parallel methods for deep learning
models and enhance the adaptability of parallel policy designs, researchers have started
investigating an alternative automatic parallel method for deep learning models, which
is grounded in graph algorithms or machine learning techniques. This approach offers
a distributed training framework for the entire deep learning process and facilitates
automated distributed parallel policy search [13]. The automatic parallel training method that employs graph algorithms primarily relies
on the principles of graph partitioning scheduling from parallel computing to achieve
model parallelism. While these methods solve quickly, they are often constrained to specific types of networks, applying only to particular deep learning models. Furthermore, the model training process and various execution details still require human intervention, and different initialization conditions must be set for different models, reflecting a significant reliance on domain knowledge [14,15]. Therefore,
some scholars have proposed machine learning-based methods to achieve end-to-end automatic parallel policy output. However, in dynamic distributed environments the sampled data easily becomes invalid, resulting in large fluctuations in the performance of the output distributed parallel policy. Reinforcement learning, owing to its ability to interact with dynamic environments in real time and to learn autonomously without labels, has become the mainstream machine learning approach for automatic parallelism.
Although existing reinforcement-learning-based automatic parallel methods have solved some of the problems of automatic parallel training, many shortcomings remain, such as inadequate adaptation of the policy to the dynamic environment [16,17]. Currently, the performance evaluation of the policy only focuses on the resource
requirements of the model itself, without considering the availability of storage
and communication bandwidth resources in the environment, resulting in a decline in
performance of parallel policies guided by static resources in actual environments
[18,19].
2. Distributed Training Performance Evaluation Model
2.1. Deep Learning Models
Parallel training of large-scale deep learning models currently relies mainly on expert experience to design parallel strategies by hand. Deep learning models have a typical hierarchical structure, so manual parallel strategy design mainly performs coarse-grained layer partitioning to divide the model into different sub-models, as shown in Eqs. (1) and (2), which are then scheduled to different devices for execution, balancing the computing and communication loads without affecting the model's computational structure.
By studying the computational and memory characteristics of the different layers of natural language domain models, the LSTM and attention layers are allocated for execution to devices that match their computational and storage performance. As shown in Eq. (3), nodes with large-scale network parameters and high computational time complexity are often distributed across different layers, and coarse-grained hierarchical division makes it impossible for a single worker node in the cluster to complete the training of such complex network layers.
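To make the coarse-grained layer partitioning described above concrete, the sketch below greedily assigns a list of layers to the currently least-loaded device. The layer names and cost figures are hypothetical placeholders rather than values from this paper's model.

```python
# Minimal sketch of coarse-grained layer-to-device partitioning.
# Layer names and cost numbers are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    compute_cost: float   # estimated compute load (arbitrary units)
    memory_cost: float    # estimated parameter memory (arbitrary units)

def partition_layers(layers, num_devices):
    """Greedily place each layer on the currently least-loaded device."""
    load = [0.0] * num_devices            # accumulated compute load per device
    placement = {}                        # layer name -> device id
    for layer in layers:
        device = min(range(num_devices), key=lambda d: load[d])
        placement[layer.name] = device
        load[device] += layer.compute_cost
    return placement, load

if __name__ == "__main__":
    model = [Layer("embedding", 2.0, 4.0), Layer("lstm_1", 8.0, 6.0),
             Layer("attention", 5.0, 2.0), Layer("lstm_2", 8.0, 6.0),
             Layer("classifier", 1.0, 1.0)]
    placement, load = partition_layers(model, num_devices=2)
    print(placement)   # e.g. {'embedding': 0, 'lstm_1': 1, ...}
    print(load)        # per-device compute load after placement
```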
A more fine-grained segmentation method was later proposed, which studies the skeleton model of a neural network by retaining the computing nodes that are key to its performance. Using the matrix partitioning methods of high-performance computing, the parameter tensor matrices of the model are partitioned, as shown in Eq. (4), to achieve finer-grained tensor-level parallelism, and adaptive communication schemes are designed for different model layers by analyzing underlying communication technologies such as RPC and MPI.
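As an illustration of this tensor-level parallelism, a minimal sketch is given below, assuming a single weight matrix that is split column-wise across devices with NumPy so that each device multiplies the input by its own shard; the shapes are arbitrary.

```python
# Minimal sketch of tensor-level (column-wise) partitioning of one weight matrix.
import numpy as np

def split_columns(weight, num_devices):
    """Split a (in_dim, out_dim) weight matrix into column shards, one per device."""
    return np.array_split(weight, num_devices, axis=1)

def sharded_matmul(x, shards):
    """Each device multiplies the same input by its shard; results are concatenated."""
    partial_outputs = [x @ shard for shard in shards]   # one matmul per device
    return np.concatenate(partial_outputs, axis=1)       # gather along features

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 512))          # batch of activations (assumed shape)
w = rng.normal(size=(512, 1024))        # full parameter matrix
shards = split_columns(w, num_devices=4)
assert np.allclose(sharded_matmul(x, shards), x @ w)  # same result as unsharded
```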
As the study of parallel granularity and of the underlying technologies deepens, the number of possible combinations of parallel strategies grows exponentially, making it difficult to find the optimal combination with manual parallelization methods. As shown in Eqs. (5) and (6), parallel strategies customized for particular deep learning models are also not universal. Because of these shortcomings of manual parallel methods, many scholars have begun to study automatic parallel methods. At present, the mainstream automatic parallel methods fall into two types, as shown in Eqs. (7) and (8): one is based on graph algorithms and the other on machine learning algorithms.
The earliest graph-algorithm-based automatic parallel methods adopted adaptive graph partitioning and scheduling approaches similar to those used in high-performance computing, as shown in Eqs. (9) and (10). The proposed FM graph partitioning algorithm guides static partitioning of the computation graph by balancing node computation cost against the data volume on dependent edges, balancing multiple loads while minimizing communication cost. By analyzing the structural characteristics of neural networks, Opt deep learning was proposed together with a parallel scheduling method based on tensor segmentation, and the automatic search for the optimal policy was implemented with dynamic programming, as shown in Eqs. (11) and (12). However, it still focuses on coarse-grained tensor segmentation of layers, which ultimately limits the performance improvement of the resulting policy.
2.2. Model Parallel Methods
The scheduling algorithm FastT, based on the DAG of a deep learning model, achieves operator scheduling by assigning operator priorities and computing the critical path of the graph, as shown in Eq. (13); however, its parallel strategy brings only limited improvement on RNN models. Beachi implements strategy search using three graph algorithms, based on topological sorting, earliest start time, and minimum communication volume, and can find a model parallel strategy in as little as tens of seconds.
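The graph-based scheduling idea, operator priorities derived from the critical path of the DAG combined with topological-order placement, can be sketched as follows. The tiny DAG and per-operator times are invented for illustration and do not reproduce FastT or Beachi.

```python
# Minimal sketch: critical-path priorities on an operator DAG, then
# topological-order placement onto devices (illustrative numbers only).
from functools import lru_cache
from graphlib import TopologicalSorter

# op -> (compute_time, successors); the DAG and times are hypothetical.
graph = {
    "input":  (1.0, ("conv1",)),
    "conv1":  (4.0, ("lstm",)),
    "lstm":   (6.0, ("fc",)),
    "fc":     (2.0, ("output",)),
    "output": (0.5, ()),
}

@lru_cache(maxsize=None)
def critical_path(op):
    """Longest remaining compute time from op to the end of the DAG."""
    time, succs = graph[op]
    return time + max((critical_path(s) for s in succs), default=0.0)

def schedule(num_devices=2):
    predecessors = {op: set() for op in graph}
    for op, (_, succs) in graph.items():
        for s in succs:
            predecessors[s].add(op)
    finish = [0.0] * num_devices          # accumulated load per device
    placement = {}
    for op in TopologicalSorter(predecessors).static_order():
        device = min(range(num_devices), key=lambda d: finish[d])
        placement[op] = device
        finish[device] += graph[op][0]
    return placement

print({op: critical_path(op) for op in graph})  # critical-path priority of each op
print(schedule())
```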
These methods rely on specific constraints, so the resulting policies behave inconsistently across different deep learning models. Although the graph algorithms above search quickly, as shown in Eq. (14), they do not consider the structural features of the deep learning model, so the obtained parallel strategy improves performance only slightly, applies only to some networks and training scenarios, and has poor portability. Such methods also still require manual analysis of the details of the model training process and remain heavily dependent on domain knowledge. To overcome the limited applicability and the excessive reliance on knowledge from fields such as parallel computing and computer architecture, as shown in Eqs. (15) and (16), automatic model parallelism based on machine learning algorithms has become a hot research topic.
One line of work establishes an evaluation model for the parameter-server architecture, predicts policy performance with machine learning methods, and automatically searches for the job training strategy that minimizes cost. As shown in Eqs. (17) and (18), a multi-dimensional prediction method for strategy execution time, based on message partitioning, communication topology, and other dimensions, was implemented in data-parallel mode to guide strategy search for synchronous data parallelism.
Parallex constructs a linear prediction model of computation and communication under different partitioning schemes of sparse model parameters, which guides the adaptive partitioning of those parameters, and uses Bayesian optimization to determine a trustworthy scheduling size that guides policy output. However, these methods require a large amount of labeled data from the actual environment; in a dynamic cluster environment, as shown in Eqs. (19) and (20), the quality of the labeled data decreases, so the actual effect of the strategy falls short of expectations and the methods are difficult to apply in practice.
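A linear computation-plus-communication cost model of the kind just described can be sketched as below. The hardware coefficients and the candidate partitionings are placeholder numbers; in practice the coefficients would be fitted to profiled measurements.

```python
# Minimal sketch of a linear cost model for one candidate partitioning:
# predicted step time = compute term + communication term (coefficients assumed).
def predict_step_time(flops_per_device, bytes_exchanged,
                      flops_per_second=1e12, bandwidth_bytes_per_second=1e10,
                      fixed_overhead=0.002):
    compute_time = flops_per_device / flops_per_second
    comm_time = bytes_exchanged / bandwidth_bytes_per_second
    return compute_time + comm_time + fixed_overhead

# Compare two hypothetical partitionings of the same model.
candidates = {
    "coarse_layer_split": (6.0e11, 8.0e8),   # (FLOPs per device, bytes exchanged)
    "fine_tensor_split":  (3.5e11, 2.4e9),
}
for name, (flops, comm_bytes) in candidates.items():
    print(name, f"{predict_step_time(flops, comm_bytes):.4f} s")
```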
The Hierarchical method extracts features of the deep learning model and of the device topology and uses reinforcement learning algorithms to guide the automatic output of the model's parallel policy. Because of the need for frequent sampling and the large search space, as shown in Eq. (21), the strategy search cost of these two methods is high, so the performance improvement over expert-designed model parallel methods is limited.
Spotlight abstracts the parallel training of deep learning models as an operator scheduling problem and, for the first time, formulates it as a Markov decision process. Subsequently, POST further optimized it by introducing cross-entropy and proximal policy gradient methods into the sampling process to improve search efficiency. As shown in Eq. (22), the FlexFlow framework was then proposed as an improvement built on Opt deep learning.
3. Research on Aggregated Time-frequency Domain Deep Learning Models for Automatic Animation Generation
3.1. Hybrid Neural Network Model Based on Time-frequency Domain Aggregation Features
The FlexFlow framework introduces the idea of SOAP to establish a multi-dimensional parallel policy search space; however, its implementation is complex and it cannot be applied to recurrent neural network models. The methods above are therefore effective only for some network models, and when a different network model is encountered the strategy search model must be retrained, so they are not portable [20,21], and the cost of designing and implementing parallel strategies for different network models remains high. To address this issue, Placeto introduces a graph embedding encoding method that gives parallel strategies portability and reduces training time on similar networks [22,23]. A simulation executor has also been proposed that predicts the single-step execution time of a specified strategy from static data simulation; although it accelerates strategy search, the error in the predicted execution performance of parallel strategies is relatively large. BRKGA introduces graph neural networks and builds an execution cost model of the deep learning model to search for parallel strategies with better execution performance; Auto Map achieves automatic parallel policy search for fine-grained models based on XLA-IR graphs; and Trinity, a reinforcement-learning-based adaptive distributed parallel training method, uses proximal policy optimization to expand the learning ability of the policy network [24,25]. A multi-dimensional distributed training evaluation model has been proposed to evaluate policy performance at a fine-grained level, but it considers only the static resource allocation of parallel policies, and in practical environments insufficient policy resources may degrade policy performance; HeterPS combines pipeline technology with reinforcement learning to achieve coarse-grained hierarchical scheduling [26,27].
The accuracy of such evaluation models in dynamic environments is low: at present, policy performance evaluation focuses only on the computation, memory, communication, and other requirements of the model itself, without considering the load balancing of the environment's memory or the availability of communication bandwidth resources while the strategy is being produced, so models are evaluated inaccurately in dynamic environments [28,29] and the strategies guided by traditional evaluation models fall short of expectations. Fig. 1 shows the flowchart of reinforcement learning in animation interaction design; its strategy search process is time-consuming. Current reinforcement learning automatic parallel methods all use full-episode sampling for value estimation, performing a policy iteration only after completing a full interaction with the environment, so the rate of effective sampled data decreases as the policy space grows, greatly increasing the learning time of the overall policy iteration [30].
Fig. 1. The flowchart of reinforcement learning in animation interaction design.
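The full-episode sampling pattern criticized above can be written schematically as a REINFORCE-style loop in which the policy is updated only after one complete placement episode has been sampled. The environment, reward, and network below are toy stand-ins, not the system evaluated in this paper.

```python
# Schematic REINFORCE-style loop for device placement: the policy is updated
# only after a full episode (one complete placement) is sampled.
import torch
import torch.nn as nn

NUM_OPS, NUM_DEVICES = 6, 2
policy = nn.Sequential(nn.Linear(NUM_OPS, 64), nn.ReLU(), nn.Linear(64, NUM_DEVICES))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def episode_reward(placement):
    # Toy reward: prefer balanced placements (stand-in for measured step time).
    counts = torch.bincount(placement, minlength=NUM_DEVICES).float()
    return -counts.var()

for iteration in range(200):
    log_probs, placement = [], []
    for op in range(NUM_OPS):                      # sample one full episode
        state = torch.nn.functional.one_hot(torch.tensor(op), NUM_OPS).float()
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        placement.append(action)
    reward = episode_reward(torch.stack(placement))
    loss = -reward * torch.stack(log_probs).sum()  # one policy update per episode
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```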
As the state space and action space grow, the performance of this approach continues to decline, and it readily converges to a locally optimal strategy: the sampling and learning process is based entirely on the policy experience collected in the current round of sampling, and after a policy iteration based on that experience the information is discarded. Consequently, once a suboptimal strategy appears, the model keeps leaning toward that local optimum and loses the ability to explore different strategies, so the overall model can only converge to a local optimum.
Speech animation is a vital aspect of automatic speech recognition research and plays a significant role in the field. Initially, the recognition of speech animation was not treated as an independent research topic; rather, it emerged from the requirements of automatic speech recognition. The pronunciation of a word typically involves about three speech animations, so recognizing text from a speech sequence essentially amounts to identifying these speech animations. The evolution of automatic speech recognition has spanned nearly sixty years, meaning that research into this technology has existed since the advent of computers. As a medium for information exchange, speech has led to numerous practical applications and advancements built on automatic speech recognition technology. Although some early systems focused exclusively on Arabic numerals pronounced in English by a single speaker, the underlying models and research methods were adapted and refined by subsequent researchers. Interest in voiceprint recognition then flourished within the speech domain, with models based on Gaussian mixtures and their enhancements being used for the task. Given the close relationship between voiceprint recognition and speech animation recognition, techniques and models from voiceprint recognition also found application in speech animation recognition, and the Gaussian mixture model became a prominent representation during this phase. Moreover, our understanding of speech animation extends beyond merely its constituent units. Fig. 2 shows the structure of the autoencoder used for texture compression in animation.
With the advent of deep learning, models based on temporal neural networks and convolutional neural networks have also been applied to speech animation recognition, and many practical applications have been built on them; a typical goal is to further improve the accuracy of speech animation recognition. In addition, in work combining speech and animation, learning the correspondence between speech animation and mouth shape makes the speech of animated characters more anthropomorphic, which opens up many applications in animation and related fields, especially robot anthropomorphism.
Fig. 2. Structure diagram of autoencoder for animation texture compression.
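As a rough illustration of the autoencoder structure referenced in Fig. 2, a minimal sketch is given below that compresses texture patches through a small convolutional bottleneck and reconstructs them. The patch size and channel widths are assumptions, not values from the paper.

```python
# Minimal convolutional autoencoder sketch for texture compression (assumed sizes).
import torch
import torch.nn as nn

class TextureAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                 # 3x64x64 -> 32x8x8 bottleneck
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                 # mirror with transposed convs
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TextureAutoencoder()
patch = torch.rand(8, 3, 64, 64)                      # batch of texture patches
recon = model(patch)
loss = nn.functional.mse_loss(recon, patch)           # reconstruction objective
```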
3.2. Analysis of Deep Learning Models
The error backpropagation algorithm enabled single-layer perceptrons to grow into multi-layer perceptrons that can still learn and update their weight parameters, and the non-linear activation function following each layer of neurons effectively solves the XOR problem. This algorithm led artificial neural networks toward non-linear computation and toward deeper and larger network structures. Robert established the universal approximation theorem for multi-layer perceptrons through theoretical means, indicating that artificial neural networks with hidden layers can, in principle, approximate any continuous function within a specified domain. This proof reassured later researchers that artificial neural networks are worthy of extensive study and application.
In the realm of acoustic signal classification, automatic animation generation has emerged as a significant research endeavor. Numerous deep learning models, particularly those based on convolutional neural networks and their enhancements, have been introduced to extract high-dimensional hidden features and more discriminative characteristics from low-level animation signal features. Researchers emphasizing temporal information have likewise developed various deep learning models rooted in time series analysis to capture the temporal dynamics between successive frames of animation signals. As previously noted, a major challenge in applying deep learning models to acoustic signals is how to build a hybrid neural network model that integrates both time-domain and frequency-domain characteristics. Notable examples of such models include active learning models and CLC models. CLDNN, for instance, uses logarithmic Mel animation features as input to a convolutional neural network and then feeds the features learned by this network into a temporal prediction neural network. Fig. 3 shows the evaluation of the animation style transfer effect. CLC uses an LSTM to learn temporal characteristics and a deep learning model to learn animation characteristics in parallel, and then trains a classifier network by simply cascading the temporal characteristics encoded by the LSTM with the animation features learned by deep learning. However, the temporal and frequency-domain information should be studied as a whole so as to learn highly integrated fusion features, which are ultimately used for DNN classification.
Fig. 3. Evaluation of animation style transfer effect.
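A minimal CLDNN-style pipeline of the kind described above, with log-Mel features passed through a CNN, then an LSTM, then a DNN classifier, might look like the sketch below. The feature dimensions, layer widths, and number of classes are illustrative assumptions.

```python
# Minimal CLDNN-style sketch: CNN over log-Mel features -> LSTM -> DNN classifier.
import torch
import torch.nn as nn

class CLDNNSketch(nn.Module):
    def __init__(self, n_mels=64, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(                      # convolve over the Mel axis
            nn.Conv2d(1, 16, kernel_size=(5, 5), padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),          # pool frequency, keep time
        )
        self.lstm = nn.LSTM(input_size=16 * (n_mels // 2),
                            hidden_size=128, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, num_classes))

    def forward(self, logmel):                         # (batch, 1, n_mels, frames)
        feat = self.cnn(logmel)                        # (batch, 16, n_mels/2, frames)
        feat = feat.permute(0, 3, 1, 2).flatten(2)     # (batch, frames, features)
        out, _ = self.lstm(feat)
        return self.dnn(out[:, -1])                    # classify from last time step

logits = CLDNNSketch()(torch.randn(4, 1, 64, 100))     # 4 clips, 100 frames each
print(logits.shape)                                    # torch.Size([4, 10])
```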
Two main improvements are made to the convolutional neural network. First, by adjusting the stride of the convolution kernels and of the pooling operations along the time axis so that the feature map keeps the same size in the time dimension, the network can process temporal inputs and produce temporal outputs of matching length; exploiting the strong learning ability of convolution kernels and their focus on local information effectively improves recognition accuracy. Second, the final fully connected layer of the convolutional neural network is enhanced by introducing a fully connected mechanism with shared weights that links the fully connected layer to each temporal node. The model is then trained by computing the cross-entropy loss between the predicted outcomes of all temporal nodes and the actual temporal labels, effectively mitigating the need for pre-segmentation and post-processing of animation signals, as illustrated in the sketch below.
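A minimal sketch of these two modifications follows: strides along the time axis are kept at 1 so the number of frames is preserved, and a shared-weight fully connected layer produces a prediction at every temporal node, trained with cross-entropy against per-frame labels. The shapes and class count are assumptions.

```python
# Sketch of the modified CNN: time dimension preserved, shared-weight fully
# connected layer applied at every temporal node (illustrative shapes only).
import torch
import torch.nn as nn

class TemporalCNNSketch(nn.Module):
    def __init__(self, n_mels=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # stride 1 on the time axis keeps the number of frames unchanged
            nn.Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1), padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),       # pool frequency only, not time
            nn.Conv2d(32, 32, kernel_size=(3, 3), padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        # Shared-weight "fully connected" layer linked to each temporal node.
        self.per_frame_fc = nn.Linear(32 * (n_mels // 4), num_classes)

    def forward(self, x):                            # (batch, 1, n_mels, frames)
        f = self.features(x)                         # (batch, 32, n_mels/4, frames)
        f = f.permute(0, 3, 1, 2).flatten(2)         # (batch, frames, features)
        return self.per_frame_fc(f)                  # logits for every frame

model = TemporalCNNSketch()
clip = torch.randn(2, 1, 64, 120)                    # 2 clips, 120 frames
frame_labels = torch.randint(0, 10, (2, 120))        # one label per temporal node
logits = model(clip)                                 # (2, 120, 10)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10), frame_labels.reshape(-1))
```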
In earlier research on voiceprints, Gaussian mixture models and universal background models (GMM-UBM) were predominantly utilized, with support vector machines (SVM) integrated into the GMM-UBM framework. By extracting the mean vector of each Gaussian mixture component individually, a Gaussian supervector was created to serve as a sample for the SVM. This approach leveraged the powerful nonlinear classification capability of SVM kernel functions, significantly improving performance over the plain GMM-UBM approach. Fig. 4 shows the automatic generation of evaluation graphs for character action sequences.
A theoretical analysis framework for joint factor analysis has been proposed, further enhancing and compensating for the FA model. In Fig. 4, the red line represents the performance of a specific model, usually the benchmark or optimal model; the green line represents the result of another model or algorithm, used for comparison with the red-line model; the blue line shows the results of a further strategy or optimization algorithm; and the gray lines generally represent random benchmarks, baseline models, or control groups.
Fig. 4. Automatic generation and evaluation of character action sequences.
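The GMM supervector idea described earlier in this subsection, stacking the component mean vectors and feeding them to an SVM, can be sketched with scikit-learn as follows. For simplicity the sketch fits a small GMM per utterance instead of adapting a universal background model, and all data are synthetic.

```python
# Simplified sketch of GMM supervectors fed to an SVM (illustrative only:
# proper systems adapt a universal background model rather than refitting).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def supervector(utterance_features, n_components=4):
    """Fit a small GMM and stack its component means into one supervector."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(utterance_features)
    return gmm.means_.ravel()                 # (n_components * feature_dim,)

rng = np.random.default_rng(0)
# Two toy classes of "utterances": each is a (frames, feature_dim) matrix.
utterances = [rng.normal(loc=c, size=(200, 13)) for c in (0.0, 1.0) for _ in range(10)]
labels = [c for c in (0, 1) for _ in range(10)]

X = np.stack([supervector(u) for u in utterances])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.score(X, labels))                   # training accuracy on the toy data
```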
A theoretical model of the total factor space was then proposed, which maps a speech signal onto a fingerprint-like identity vector in that space. The features of animation signals are thus greatly reduced in dimension while the key features of the content are retained, and the corresponding traditional prediction methods are relatively mature; this improved the recognition rate by a clear margin and dominated the problem for a long period. Later, convolutional neural networks were extended to many fields and achieved significant results in areas such as facial recognition, video surveillance, behavioral motion analysis, natural language processing, and the speech recognition problems related to this research topic. A convolutional neural network includes a feature extractor consisting of convolutional layers and subsampling layers. In a convolutional layer, each neural unit is connected only to some units of the adjacent layers, and the layer usually contains several feature planes composed of neurons arranged in a matrix. Neurons within the same feature plane share weight parameters, which are called convolution kernels. Convolution kernels are generally initialized as random matrices of small values drawn from a chosen probability distribution, and during training of a deep convolutional neural network they are iteratively updated to reasonable weight parameters. Fig. 5 shows the evaluation of scene layout deep learning optimization. The direct benefit of parameter sharing in convolution kernels is a reduction in the complexity of the network, that is, in the number of connections between layers, which also reduces the risk of overfitting. Another core operation of convolutional neural networks is pooling, i.e., subsampling, which generally takes two forms: max subsampling and mean subsampling. Its function is similar to that of a convolution kernel, and it can be regarded as a special type of convolution kernel.
Fig. 5. Scene layout deep learning optimization evaluation diagram.
4. Research on Deep Learning Models for Automatic Animation Generation and Active
Learning
Although traditional Gaussian mixture models and other machine learning methods can solve classification problems to a certain extent, in the era of big data deep learning methods, or combinations of deep learning and machine learning, fit the data more effectively and achieve better performance. Training convolutional neural network models on environmental sound datasets has yielded excellent performance compared with traditional algorithms, providing a good reference for later researchers in both data and deep network design. Piczak extracts animation features from the sound signal, performs Mel cepstral analysis on them, applies a cosine transform to obtain Mel animation cepstral coefficient (MFCC) features, and computes the first-order difference of the MFCC features to obtain two-channel data samples; a convolutional neural network combined with a fully connected network is then trained, with a Dropout mechanism used to prevent overfitting.
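The two-channel input just described, MFCC-type features plus their first-order difference stacked as channels, can be reproduced roughly with librosa as in the sketch below; the sampling rate, number of coefficients, frame count, and file name are assumptions.

```python
# Sketch of two-channel sample construction: MFCC features plus their
# first-order difference stacked as channels (file name and sizes assumed).
import numpy as np
import librosa

def two_channel_sample(path, sr=22050, n_mfcc=60, frames=101):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    delta = librosa.feature.delta(mfcc, order=1)              # first-order difference
    sample = np.stack([mfcc, delta])                          # (2, n_mfcc, T)
    return sample[:, :, :frames]                              # crop to a fixed window

# x = two_channel_sample("clip.wav")   # -> array of shape (2, 60, 101)
# x can then be fed to a CNN with two input channels and Dropout, as described.
```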
Fig. 6 shows the performance evaluation of the lighting rendering algorithm. A sample-level deep learning network was developed that applies one-dimensional convolutions directly to the raw waveform to learn animation features, ultimately concatenating feature data from the preceding pooling layer to serve as classification features. In another approach, animation features, MFCC features, and CRP features are combined into a three-channel image and fed to two prominent image classification networks, AlexNet and GoogLeNet. Additionally, a preprocessing method based on the Constant-Q Transform (CQT) was designed to extract more effective animation features before they are fed into deep learning models. In Fig. 6, the yellow line indicates a newly proposed method or algorithm, typically used to showcase its performance on specific tasks, while the blue line represents an alternative model for comparison.
Fig. 6. Performance evaluation of lighting rendering algorithm.
Drawing inspiration from the effective application of deep learning, LSTM, and DNN in active learning models, an enhanced CLDNN model based on a bidirectional long short-term memory network has been proposed and applied to automatic animation generation tasks. Building on the CLDNN architecture, the research team also introduces an active network that performs attention-weighted calculations after each LSTM layer. Another approach to automatic animation generation combines deep learning with LSTM: a parallel joint LSTM and deep learning network structure is proposed that independently learns the relationships between animation frame sequences and high-dimensional local representations of animation features. In the LSTM module, Mel-frequency cepstral coefficient features serve as the temporal input sequence, facilitating the extraction and learning of temporal relationships, while the deep learning module directly takes the animation image after the short-time Fourier transform as input. The sequence lengths of the deep learning module and the LSTM module along the time dimension are kept consistent. After the two independent modules learn their features separately, the features are cascaded and fed as a new representation into a fully connected network for classification; a sketch of this parallel structure is given after Fig. 7. Fig. 7 shows the accuracy evaluation of speech-driven lip synchronization. In the TUT2016 automatic animation generation experiment, this model demonstrated that the parallel combination of LSTM and deep learning outperforms independent DNN, deep learning, and LSTM models in classification accuracy. Many improvements have since been made to this parallel model, for example using adversarial networks to construct feature data instead of constructing environmental sound data directly, because generating plausible animation signals is very difficult, whereas constructing animation features is comparatively easy and effective.
Fig. 7. Speech driven mouth shape synchronization accuracy evaluation diagram.
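The parallel joint structure described above, with an MFCC sequence feeding an LSTM branch and an STFT spectrogram feeding a CNN branch whose outputs are concatenated before a fully connected classifier, might be sketched as follows; all dimensions are illustrative assumptions.

```python
# Sketch of a parallel LSTM + CNN model: temporal branch and spectrogram branch
# learn independently, then their features are concatenated for classification.
import torch
import torch.nn as nn

class ParallelLSTMCNN(nn.Module):
    def __init__(self, n_mfcc=40, num_classes=15):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=128, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),                                  # -> 32*4*4 = 512 features
        )
        self.classifier = nn.Sequential(nn.Linear(128 + 512, 128), nn.ReLU(),
                                        nn.Linear(128, num_classes))

    def forward(self, mfcc_seq, spectrogram):
        _, (h, _) = self.lstm(mfcc_seq)                    # temporal branch
        temporal = h[-1]                                   # (batch, 128)
        spectral = self.cnn(spectrogram)                   # (batch, 512)
        return self.classifier(torch.cat([temporal, spectral], dim=1))

model = ParallelLSTMCNN()
mfcc_seq = torch.randn(4, 100, 40)          # (batch, frames, MFCC coefficients)
spec = torch.randn(4, 1, 128, 100)          # (batch, 1, freq bins, frames) from STFT
print(model(mfcc_seq, spec).shape)          # torch.Size([4, 15])
```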
Next, LSTM and deep learning networks are used to extract features in parallel, and a mechanism combining fully connected neural networks and support vector machine models is finally fused to classify environmental sounds. The method in this paper begins by training a classification model on the appropriate image, placing LSTM and deep learning at the base of the model and concatenating the feature vectors produced by both paths as the ASC features. At the top of the model, a three-layer fully connected neural network is employed, ending with a Softmax function for classification. The input features consist of animation features or Mel-frequency animation features, and the Gaussian mixture model is replaced by a more robust convolutional neural network. The performance improvement arises from the ability of deep neural networks to learn the intricate relationships within speech signal features more effectively, providing greater robustness than Gaussian mixture models. A related approach applies time-delayed deep neural network models to speech animation recognition. Table 1 describes the DCASE2016 dataset. A weight-sharing method was further proposed that learns speech features better; experiments on speech animation recognition and speech search tasks show that the convolutional neural network-based model reduces error rates by 6%-10% compared with traditional models.
Table 1. DCASE2016 dataset.
| Category | Scene | Description |
| Indoor | Cafe/restaurant | Small cafe/restaurant |
| Indoor | Grocery store | Medium size grocery store |
| Indoor | Home | Home environment |
| Indoor | Library | Library environment |
| Indoor | Metro station | Metro environment |
| Indoor | Office | Multiple persons, typical workday |
Whether text or speech animation, both correspond to a segment of speech signal. Speech signals of different lengths generate frame sequences of different sizes, and temporal neural networks require the continuous speech signal to be pre-segmented into pieces, with the label sequence then recovered through post-processing and other methods, which limits their applicability to this task. A method requiring neither pre-segmentation nor post-processing was proposed to solve these two problems, and experiments on the TIMIT speech dataset showed that it outperforms HMM and hybrid HMM-RNN methods. A BLSTM-CTC model for speech animation recognition was then proposed on this basis; a sketch of this segmentation-free training objective is given at the end of this section. However, existing hybrid models also have shortcomings. In the experiments, "Epoch: 64" indicates that the model has completed 64 rounds of training, i.e., it traverses the entire dataset many times, gradually improving performance and accuracy by continuously updating weights and adjusting parameters. Although there are two ways to reduce a multi-channel feature map to a single-channel feature map, both may lose global contextual information and even destroy structural information. The Multi-L approach uses an LSTM to learn temporal characteristics and deep learning models to learn animation characteristics in parallel; Table 2 shows the impact of sliding window length and Mel feature dimension on the Multi-L deep learning model. A classifier network is then trained by simply cascading the temporal characteristics encoded by the LSTM with the animation graph characteristics learned by deep learning. However, the temporal and frequency-domain information should be learned as a whole, so as to obtain highly integrated fusion features that are ultimately used for DNN classification.
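The segmentation-free objective referred to above can be illustrated with a CTC loss: a BLSTM emits per-frame label distributions including a blank symbol, and CTC sums over all alignments between the long frame sequence and the shorter target sequence. The sizes below are assumptions.

```python
# Sketch of a BLSTM trained with CTC loss, avoiding pre-segmentation and
# post-processing: the target sequence is shorter than the frame sequence.
import torch
import torch.nn as nn

num_classes, blank = 40, 0                         # 39 labels + 1 blank (assumed)
blstm = nn.LSTM(input_size=13, hidden_size=64, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * 64, num_classes)
ctc = nn.CTCLoss(blank=blank)

features = torch.randn(2, 120, 13)                 # 2 utterances, 120 frames of MFCCs
targets = torch.randint(1, num_classes, (2, 20))   # 20 labels each (no blanks)
input_lengths = torch.full((2,), 120, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

out, _ = blstm(features)
log_probs = proj(out).log_softmax(dim=-1)          # (batch, frames, classes)
loss = ctc(log_probs.permute(1, 0, 2),             # CTC expects (frames, batch, classes)
           targets, input_lengths, target_lengths)
loss.backward()
```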
5. Experimental Analysis
The serially connected convolutional neural network and long short-term memory network structure first feeds animation features into the deep learning (CNN) module and then uses the output of that module as the input to the LSTM module. This sub-network structure has two obvious problems. First, the convolution kernels can effectively learn local feature information within their receptive fields, but as the kernels move, the temporal information of consecutive frames of the animation signal gradually becomes confused, ultimately causing a loss of global context and impairing the training of the subsequent LSTM temporal module. Fig. 8 shows the evaluation of the enhancement effect of 3D model details. Second, the convolutional module usually uses multiple convolution kernels to enrich the diversity of the feature maps, yet in the CLDNN architecture the final multi-channel feature map must be reduced to the single-channel feature map required by the LSTM.
Fig. 8. Evaluation of 3D model detail enhancement effect.
This reduction leads to a loss of temporal information and may even compromise the integrity of the temporal structure. The sampling mechanism of animation signals allows the time domain and the frequency domain of an animation to be analyzed and examined independently, and current methods that use temporal neural networks to analyze time-domain features and convolutional neural networks to analyze frequency-domain features have achieved certain performance improvements. More importantly, however, the time and frequency domains of an animation should be processed and analyzed as a whole in order to learn more highly aggregated time-frequency fusion features, whereas existing convolutional neural networks, temporal neural networks, and their hybrids analyze time-domain and frequency-domain information independently. Fig. 9 shows the evaluation of transition smoothness between animation frames.
Fig. 9. Evaluation of transition smoothness between animation frames.
The network architecture of the CLC model first combines the animation features learned by deep learning with the temporal features learned by the LSTM and then trains a classifier on the joint features. However, no further learning is performed on the joint features, so the animation signal is processed and analyzed inadequately and the hybrid model cannot realize its full performance advantage. Fig. 10 shows the diversity evaluation of character expression synthesis, together with the single-channel feature map Mf generated by the stretching-transformation dimensionality reduction operation and its correlation matrix.
Table 2. The influence of sliding window length and Mel feature dimension on multi-L
deep learning models.
| Frames | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| Flatten | 512 | 1024 | 2048 | 2560 | 3072 | 3584 | 4608 | 5120 | 5632 | 6144 |
| Segment | 73.16 | 75.77 | 75.86 | 76.91 | 77.42 | 77.61 | 77.86 | 78.15 | 77.72 | 78.06 |
| Vote | 81.79 | 83.58 | 82.30 | 83.07 | 83.07 | 82.82 | 83.33 | 83.58 | 83.07 | 82.30 |
| Mean | 82.30 | 83.84 | 82.30 | 84.61 | 83.58 | 83.33 | 83.58 | 83.84 | 83.07 | 83.07 |
(All values are reported at Epoch: 64; Flatten gives the flattened feature dimension, and Segment, Vote, and Mean are accuracies in %.)
The structural integrity of the animation is significantly compromised here: because all of the original multi-channel data is retained, an excessive amount of redundant information is preserved. These disrupted data structures adversely affect the global contextual information of the animation, making them ineffective as embedding encodings for training the LSTM temporal network. To validate our model, we performed a series of ablation experiments for thorough analysis. Fig. 11 shows the evaluation of automatic conversion from animation script to script. To verify the effectiveness of the Multi-L deep learning model, an LSTM network was designed to learn temporal features, with the original time-frequency feature map fed into the LSTM network as the input sequence.
Fig. 10. Diversity evaluation of character expression synthesis.
Fig. 11. Evaluation diagram of automatic conversion from animation script to script.
The output of the LSTM network is fed into an efficient deep learning model to learn temporally encoded features; deep learning models learn frequency features directly from the original animation; and the final Multi-L deep learning model learns highly integrated features from the MFFM fusion features. Fig. 12 shows the feedback evaluation of interactive animation creation, in which animation features have a stronger effect on classification performance than time-domain features. The results of Multi-L deep learning are superior to those of the LSTM network and the plain deep learning models, which verifies the effectiveness of MFFM.
Fig. 12. Interactive animation creation feedback evaluation chart.
6. Conclusion
A fully convolutional neural network can learn all of the content information in a sample and can reshape the network structure through deconvolution. Drawing on this approach, we can also improve our model so that it learns global content information; we hope to verify these ideas experimentally in the future, further refine the model, and enhance its performance. The initial and most basic deep neural network achieved a speech animation recognition accuracy of 77%, a significant improvement over traditional models. Building upon deep recursive recurrent neural networks, the accuracy of speech animation recognition was notably enhanced, surpassing 80% for the first time and reaching a final accuracy of 82.3%. The model is trained for 50 iterations with an initial learning rate of 0.01; the learning rate of the LSTM module is set to 0.001 and is adjusted at the 15th, 25th, 35th, and 40th iterations. A year later, convolutional neural networks were proposed to address animation recognition challenges, followed by an improved hierarchical convolutional neural network that further raised performance to a recognition accuracy of 83.5%. Ultimately, this method reached an accuracy of 87.43%, the best result among end-to-end trained models to date.
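The training schedule just stated, 50 iterations with a base learning rate of 0.01, the LSTM module at 0.001, and adjustments at iterations 15, 25, 35, and 40, could be written in PyTorch roughly as below. The model, the decay factor, and the loss are placeholders rather than the paper's actual configuration.

```python
# Rough sketch of the stated training schedule: separate learning rates for the
# LSTM module and the rest, with milestones at iterations 15, 25, 35, and 40.
# The model and the decay factor (gamma) are assumptions, not from the paper.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.head = nn.Linear(128, 10)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

model = ToyModel()
optimizer = torch.optim.SGD([
    {"params": model.lstm.parameters(), "lr": 0.001},   # LSTM module learning rate
    {"params": model.head.parameters(), "lr": 0.01},    # base learning rate
], momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15, 25, 35, 40], gamma=0.1)  # gamma is an assumed value

for iteration in range(50):
    logits = model(torch.randn(8, 20, 64))
    loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```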
In the scene planning section, this article outlines the design architecture of the
planning layer and explores the formal representation of scenes based on the narrative.
The utilization of attention signals is justified by their ability to significantly
enhance problem-solving effectiveness. The attention mechanism selectively concentrates
on the most pertinent aspects of the input data, allowing the model to extract essential
features more efficiently. This method not only boosts the model’s performance but
also improves its capability to comprehend complex scenes and dynamic changes. Furthermore,
attention signals facilitate the model’s ability to more accurately capture keyframe
information during animation generation, resulting in smoother and more natural animations.
The Text2Scene system has basically achieved the goal of converting simple natural
language texts into three-dimensional scenes. In the task of speech animation recognition, although our proposed convolutional neural network can recognize multiple consecutive animations in an animation signal, its convolution kernels act through limited receptive fields, which learn the features of a single animation over a short time span; this also exposes the limitation of ignoring long-term contextual content. In future work we hope to add a temporal network to the model, so that the powerful learning ability of convolution kernels can be used to capture local information in the audio signal while the temporal network learns long-range contextual dependencies.
References
Alonso G. E., Jin X. G., 2022, Skeleton-level control for multi-agent simulation through
deep reinforcement learning, Computer Animation and Virtual Worlds, Vol. 33, No. 3-4,
pp. 11

Aylagas M. V., Leon H. A., Teye M., Tollmar K., 2022, Voice2Face: Audio-driven facial
and tongue rig animations with cVAEs, Computer Graphics Forum, Vol. 41, No. 8, pp.
255-265

Bertiche H., Madadi M., Escalera S., 2021, PBNS: Physically based neural simulation
for unsupervised garment pose space deformation, ACM Transactions on Graphics, Vol.
40, No. 6, pp. 14

Chen J., Fan C., Zhang Z., Li G., Zhao Z., Deng Z., 2023, A music-driven deep generative
adversarial model for Guzheng playing animation, IEEE Transactions on Visualization
and Computer Graphics, Vol. 29, No. 2, pp. 1400-1414

Chen S. C., Liu G. T., Wong S. K., 2021, Generation of multiagent animation for object
transportation using deep reinforcement learning and blend-trees, Computer Animation
and Virtual Worlds, Vol. 32, No. 3-4, pp. 10

Choi Y., Seo B., Kang S., Choi J., 2023, Study on 2D sprite generation using the
impersonator network, KSII Transactions on Internet and Information Systems, Vol.
17, No. 7, pp. 1794-1806

Ding W., Li W. F., 2023, High speed and accuracy of animation 3D pose recognition
based on an improved deep convolution neural network, Applied Sciences-Basel, Vol.
13, No. 13, pp. 7566

Duan L., 2022, Application of animation character intelligent analysis algorithm based
on deep learning, Mobile Information Systems, Vol. 2022

He C., Jia Y., 2023, Automatic depth estimation and background blurring of animated
scenes based on deep learning, Traitement Du Signal, Vol. 40, No. 5, pp. 2225-2232

Hu Z. P., Wu R., Li L., Zhang R., Hu Y., Qiu F., 2023, Deep learning applications
in games: A survey from a data perspective, Applied Intelligence, Vol. 53, No. 24,
pp. 31106-31128

Kwiatkowski A., Alvarado E., Kaiogeiton V., Liu C. K., Pettré J., van de Panne M.,
Cani M.-P., 2022, A survey on reinforcement learning methods in character animation,
Computer Graphics Forum, Vol. 41, No. 2, pp. 613-639

Li C., Lussell L., Komura T., 2021, Multi-agent reinforcement learning for character
control, Visual Computer, Vol. 37, No. 12, pp. 3115-3123

Li R., Shi R., Kanai T., 2023, Detail-aware deep clothing animations infused with
multi-source attributes, Computer Graphics Forum, Vol. 42, No. 1, pp. 231-244

Liu G. T., Wong S. K., 2024, Mastering broom-like tools for object transportation
animation using deep reinforcement learning, Computer Animation and Virtual Worlds,
Vol. 35, No. 3, pp. 15

Liu L. J., Zheng Y. Y., Tang D., Yuan Y., Fan C. J., Zhou K., 2019, NeuroSkinning:
Automatic skin binding for production characters with deep graph networks, ACM Transactions
on Graphics, Vol. 38, No. 4, pp. 114

Luo Y. S., Soeseno J. H., Chen T. P. C., Chen W. C., 2020, CARL: Controllable agent
with reinforcement learning for quadruped locomotion, ACM Transactions on Graphics,
Vol. 39, No. 4, pp. 38

Morace C. C., Le T. N. H., Yao S. Y., Zhang S. W., Lee T. Y., 2022, Learning a perceptual
manifold with deep features for animation video resequencing, Multimedia Tools and
Applications, Vol. 81, No. 17, pp. 23687-23707

Moutafidou A., Toulatzis V., Fudos I., 2024, Deep fusible skinning of animation sequences,
Visual Computer, Vol. 40, No. 8, pp. 5695-5715

Paier W., Hilsmann A., Eisert P., 2020, Interactive facial animation with deep neural
networks, IET Computer Vision, Vol. 14, No. 6, pp. 359-369

Park S., Ryu H., Lee S., Lee S., Lee J., 2019, Learning predict-and-simulate policies
from unorganized human motion data, ACM Transactions on Graphics, Vol. 38, No. 6,
pp. 205

Peng T., Kuamg J., Liang J., Hu X., Miao J., Zhu P., Li L., Yu F., Jiang M., 2023,
GSNet: Generating 3D garment animation via graph skinning network, Graphical Models,
Vol. 129, pp. 10

Qiao Z., Li T. X., Hui L., Liu R. J., 2023, A deep learning-based framework for fast
generation of photorealistic hair animations, IET Image Processing, Vol. 17, No. 2,
pp. 375-387

Shan F., Wang Y. Y., 2022, Animation design based on 3D visual communication technology,
Scientific Programming, Vol. 2022

Tan J., Tian Y., 2023, Fuzzy retrieval algorithm for film and television animation
resource database based on deep neural network, Journal of Radiation Research and
Applied Sciences, Vol. 16, No. 4, pp. 100675

Ullah S., Ijjeh A. A., Kudela P., 2023, Deep learning approach for delamination identification
using animation of Lamb waves, Engineering Applications of Artificial Intelligence,
Vol. 117, pp. 105520

Wu P., Chen S. J., 2022, A study on the relationship between painter's psychology
and anime creation style based on a deep neural network, Computational Intelligence
and Neuroscience, Vol. 2022

Yu C., Wang W. M., Yan J. H., 2020, Self-supervised animation synthesis through adversarial
training, IEEE Access, Vol. 8, pp. 128140-128151

Yu Z. X., Wang H. H., Ren J., 2022, RealPRNet: A real-time phoneme-recognized network
for "believable" speech animation, IEEE Internet of Things Journal, Vol. 9, No. 7,
pp. 5357-5367

Zhang Y. L., Ban X. J., Du F. L., Di W., 2020, FluidsNet: End-to-end learning for
Lagrangian fluid simulation, Expert Systems with Applications, Vol. 152, pp. 113410

Zhang Z. N., Wu Y. H., Pan Z. G., Li W. Q., Su Y., 2022, A novel animation authoring
framework for the virtual teacher performing experiment in mixed reality, Computer
Applications in Engineering Education, Vol. 30, No. 2, pp. 550-563

Author
Xuelian Gao graduated from the Animation College of Jilin Art Institute in 2009
with a bachelor of Arts degree. She graduated from Changchun University of Science
and Technology in 2017 with a master’s degree in Industrial Engineering. She has been
a professional course teacher at Jilin Animation College since 2011. Her research
interests include animation production and the application of artificial intelligence
in animation. She has served as the head of many provincial social science research
projects and published more than ten related academic papers. During her teaching
career, she guided students to participate in various competitions, won many awards,
and won the Outstanding Instructor Award.