Huiyong Bak¹, Sangmin Lee¹

¹ (Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Korea 22211253inha.edu@inha.edu, sanglee@inha.ac.kr)
                        
 
            
            
            Copyright © The Institute of Electronics and Information Engineers(IEIE)
            
            
            
            
            
               
                  
Keywords
               
                Violent scene discrimination,  Wav2vec 2.0,  Audio signal processing
             
            
          
         
            
                  1. Introduction
               The movie industry produces thousands of movies every year; however, movies with violent
                  content are not suitable for children. Watching violent scenes in movies tends to
                  make children more aggressive and leads to unhealthy attitudes. Thus, it is imperative
                  to have a violent scene discrimination system (VSDS) to protect children from viewing
                  violence in movies. Moreover, these systems can be useful for child-suitability ratings
                  for movies [1,2].
               
               Since most violent scenes are related to the behavior of objects, visual information
                  is commonly used to discriminate violent scenes. However, visual information does
                  not capture audio cues such as screams and profanities. Audio can also capture
                  violent events that last less than a second, such as gunshots. Thus, it is beneficial
                  to utilize audio information in violent scene discrimination.
               
               Previous studies on audio-based violent scene discrimination are as follows. Mu et al.
                  built a VSDS using 2D convolutional neural networks (CNNs) [3]. Sarman and Sert built
                  a VSDS using the support vector machine (SVM), random forest, and bagging [4].
                  Potharaju et al. also built a VSDS using an SVM [5]. Gu et al. proposed a violent scene
                  detection system using a mel spectrogram and the CNN-based VGGish [6]. Among these
                  studies, the one using the mel spectrogram and VGGish showed good performance, but it
                  had two limitations. First, the mel spectrogram can extract unique features of audio
                  signals, but it cannot extract mutual information that audio data have in common.
                  Second, VGGish was pre-trained using audio that is not related to violent scenes,
                  such as sports and games.
               
               To address the limitations of previous studies, a new system is proposed that discriminates
                  violent scenes in movies by using audio signals. The proposed system extracts audio
                  features with Wav2vec 2.0, which can extract mutual information in audio data. The audio
                  features are then used as the input for a 1D CNN and long short-term memory (LSTM),
                  which can effectively discriminate audio data, and violent scenes are discriminated
                  through fully connected and softmax layers.
               
               Section 2 describes the techniques in the proposed system, which is presented in Section
                  3. Section 4 describes the experiment conducted, how the proposed system was used
                  in it, and the performance evaluation and results. Section 5 concludes the paper.
               
             
            
                  2. Technologies of the Proposed System
               
                     2.1 Wav2vec 2.0
                  As shown in Fig. 1, speech input for Wav2vec 2.0 is converted into vectors of specific lengths through
                     the 1D CNN. The transformed vectors, called latent speech representations, are the
                     input for the transformer encoder, which creates contextualized representations that
                     restore the masked parts using the surrounding information. Wav2vec 2.0 is trained
                     so that the contextualized representations become similar to the quantized
                     representations [7]. A model trained in this way has the advantage of extracting mutual
                     information common to audio data. Therefore, in the proposed system, audio features are
                     extracted using a pre-trained Wav2vec 2.0 model, which learns its representations
                     through self-supervised learning on unlabeled human speech.
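
The similarity-based training described above can be illustrated with a small contrastive loss. This is a minimal sketch and not the training code of [7]: it covers only the contrastive term, and the toy vectors, the function names, and the temperature value of 0.1 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking `target` among all candidates."""
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[0]


# Toy 3-dimensional vectors (hypothetical values, not model outputs).
context = [1.0, 0.2, 0.0]            # contextualized representation
quantized_target = [0.9, 0.1, 0.0]   # quantized representation of the same frame
distractors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Low loss when the context vector resembles the true quantized target.
loss_good = contrastive_loss(context, quantized_target, distractors)
# High loss when a distractor is treated as the target instead.
loss_bad = contrastive_loss(
    context, distractors[0], [quantized_target, distractors[1]]
)
```

Minimizing this loss pushes the contextualized representation of a masked frame toward its own quantized representation and away from the distractors.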
                  
                  
                        Fig. 1. The structure of Wav2vec 2.0.
 
                
               
                     2.2 The CNN and LSTM
                  The CNN creates a feature map with the spatial characteristics of the data through
                     the convolution layer. The feature map is reduced in size through pooling, which
                     compresses the features. After repeating this process, the data are classified using
                     fully connected and softmax layers [8]. Therefore, the proposed system uses the CNN
                     to extract spatial features from the contextualized representations of Wav2vec 2.0.
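
As an illustration of the convolution and pooling steps, the following minimal sketch applies a 1D convolution followed by max pooling to a short sequence. The signal, the kernel values, and the function names are hypothetical; the layers in the proposed system are learned and operate on much larger inputs.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(feature_map, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 1.0]
edge_kernel = [-1.0, 1.0]  # hypothetical kernel that responds to increases

feature_map = conv1d(signal, edge_kernel)   # length 7 - 2 + 1 = 6
pooled = max_pool1d(feature_map, size=2)    # length 6 // 2 = 3
```

The convolution produces one response per position, and pooling keeps the strongest response in each window, which halves the length of the feature map here.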
                  
                  A recurrent neural network (RNN) is a structure used to handle time series data, such
                     as audio signals. The RNN learns from time series data by feeding the previous hidden
                     state into the next step. The RNN is limited in that the gradient used in
                     backpropagation decreases or increases exponentially with the length of the time
                     series. To overcome this limitation, the LSTM architecture adds a cell state to the
                     RNN hidden state. Because backpropagation through the cell state does not pass through
                     nonlinear functions such as tanh, the LSTM can mitigate the vanishing and exploding
                     gradients of the RNN [9]. Therefore, the proposed system uses the LSTM to extract
                     temporal characteristics from the contextualized representations of Wav2vec 2.0.
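
The role of the cell state can be seen in a single scalar LSTM step, following the standard formulation in [9]. This is a minimal sketch: the weights are hand-picked hypothetical values (forget gate close to 1, input gate close to 0) rather than trained parameters, and they are chosen so that the additive cell-state path is easy to observe.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))
    i = sigmoid(pre("input"))
    o = sigmoid(pre("output"))
    g = math.tanh(pre("candidate"))
    c = f * c_prev + i * g   # additive update: no tanh applied to c_prev
    h = o * math.tanh(c)
    return h, c


weights = {
    "forget": (0.0, 0.0, 10.0),    # forget gate ~1: keep the old cell state
    "input": (0.0, 0.0, -10.0),    # input gate ~0: ignore the new candidate
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8] * 10:    # 30 time steps
    h, c = lstm_step(x, h, c, weights)
```

With these settings the cell state stays close to its initial value of 1.0 after 30 steps, which is the behavior that lets information and gradients survive over long sequences.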
                  
                
             
            
                  3. Violent Scene Discrimination
               
                     3.1 Proposed System Overview
                  As shown in Fig. 2, the proposed system feeds the audio signal into the backbone network, which uses
                     the pre-trained Wav2vec 2.0 to extract features with mutual information. The transfer
                     network is trained on the extracted features. The proposed system then discriminates
                     violent scenes using the backbone network and the trained transfer network.
                  
                  
                        Fig. 2. The proposed system.
 
                
               
                     3.2 Backbone of the Proposed System
                  The backbone network converts the input audio signal into audio features using Wav2vec 2.0.
                     Because the Wav2vec 2.0 model is trained with unlabeled audio through self-supervised
                     learning, it can extract the mutual information from audio signals.
                  
                
               
                     3.3 Transfer Network in the System
                  The transfer network utilizes the CNN and LSTM. The CNN captures spatial features
                     through its convolutional layer. Because the LSTM receives the previous hidden state
                     as input, it captures temporal characteristics. By combining both models, the transfer
                     network has the advantage of considering spatial and temporal characteristics
                     simultaneously.
                  
                  Because the backbone network uses a 1D CNN, the 1D CNN is also used in the transfer
                     network to preserve nonlinear information in the backbone network. A 1D CNN is suitable
                     for audio because it can convolve 1D data [10].
                  
                  The LSTM performs well on time series prediction tasks. It increases prediction
                     accuracy by reducing the importance of data far from the prediction point and
                     increasing the importance of data near the prediction point. Therefore, the LSTM
                     is used in the transfer network.
                  
                
             
            
                  4. Experiment 
               
                     4.1 Dataset used in the Experiment
                  The dataset used in this paper, called the Violent Movie Scenes Dataset (VMD), was
                     generated to discriminate violent scenes. Because the concept of a violent scene is
                     subjective and difficult to characterize, each audio clip from the movies was manually
                     labeled using the violent scene criteria from a previous study, as shown in Table 1 [5].
                  
                  The details of the dataset used in this study are presented in Table 2. Violent and non-violent scenes were extracted from 69 movies. Of those movies, scenes
                     from 34 were used for training, scenes from 15 movies were used for validation, and
                     scenes from 20 movies were used for testing. In total, 2,400 scenes were extracted
                     from the 69 movies selected. Training and validation sets were used for training,
                     and the testing set was used for evaluation.
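
A split at the movie level, as described above, keeps every scene of a given movie inside a single subset. The following is a minimal sketch of such a split; the paper does not state how movies were assigned to the subsets, so the random assignment, the seed, the function name, and the placeholder movie identifiers are all assumptions.

```python
import random


def split_by_movie(movie_ids, n_train, n_val, n_test, seed=0):
    """Assign whole movies to train/validation/test so no movie is shared."""
    ids = sorted(set(movie_ids))
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("subset sizes must add up to the number of movies")
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Placeholder identifiers for the 69 movies (hypothetical names).
movies = [f"movie_{k:02d}" for k in range(69)]
train_movies, val_movies, test_movies = split_by_movie(movies, 34, 15, 20)
```

Scenes are then placed in the subset of the movie they were extracted from, so the testing scenes come from movies never seen during training.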
                  
                  
                        Table 1. Criteria for classification of violent scenes [5].

                                 | Violent scenes |
                                 | Categories | Detail |
                                 | Person-related sound | Angry voice, Scream |
                                 | Weapon-related sound | Gunshot, Bomb |
                                 | Vehicle-related sound | Accident |
                                 | Fight-related sound | Fight |
                                 | Environment sound | Sharp |
                        
                     
                   
                  
                        Table 2. Dataset used in the study.
                     
                           
                              
                                 | Scene type | Training (34 Movies) | Validation (15 Movies) | Testing (20 Movies) |
                                 | Violent | 800 | 200 | 200 |
                                 | Non-violent | 800 | 200 | 200 |
                                 | Total | 1,600 | 400 | 400 |
                        
                     
                   
                
               
                     4.2 Implementation Details
                  
                        4.2.1 Backbone Network in the System
                     The backbone network used the Wav2vec 2.0 base model without fine-tuning [7]. To reduce computation, the backbone network adopted the base model trained with 960
                        h of speech. As shown in Fig. 3, when audio is input to the backbone network, the network generates the
                        corresponding audio features. The total number of parameters in the backbone network is 95 M.
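
The number of feature frames produced by the backbone follows from the convolutional feature encoder of Wav2vec 2.0, whose kernel widths and strides are given in [7]. The sketch below computes it for one clip; the clip length of 1 s at 16 kHz is an assumption (the paper does not state the clip length), and the hidden size of 768 is that of the base model.

```python
# Kernel widths and strides of the Wav2vec 2.0 feature encoder, as given in [7].
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of output frames for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1   # unpadded convolution
    return length


SAMPLE_RATE = 16_000   # Hz, the rate the pre-trained model expects
CLIP_SECONDS = 1       # assumption: clip length is not given in the paper
HIDDEN_SIZE = 768      # feature dimension of the base model

frames = num_frames(SAMPLE_RATE * CLIP_SECONDS)
feature_shape = (HIDDEN_SIZE, frames)
```

Under this assumption, each clip yields 49 frames of 768-dimensional features, which is consistent with the trailing dimensions of the audio features reported in Section 4.2.2.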
                     
                     
                           Fig. 3. Processing the backbone network.
 
                   
                  
                        4.2.2 Transfer Network used in the Proposed System
                     As shown in Fig. 4, the transfer network transforms audio features of size $100 \times 768 \times 49$ into
                        a feature map of size $16 \times 112 \times 720$ with spatial features through the
                        1D CNN.

                     The kernel size and the number of output channels of the 1D CNN were both set to 25.
                        The transformed feature map is the input for the LSTM and is converted into an LSTM
                        feature of size $16 \times 112 \times 48$ that captures how the data change over
                        time. In the LSTM, the hidden dimension was 48 and the number of layers was 2.
                        Subsequently, the LSTM features were classified as violent or non-violent by fully
                        connected and softmax layers.
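
The final classification step can be sketched as a fully connected layer followed by a softmax over the two classes. The 3-dimensional feature vector, the weights, and the class order below are hypothetical values for illustration; in the proposed system the input is the 48-dimensional LSTM feature and the weights are learned.

```python
import math


def fully_connected(features, weights, biases):
    """One output per class: dot(weights[k], features) + biases[k]."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ("violent", "non-violent")   # assumed class order

features = [0.5, -1.0, 2.0]            # stand-in for an LSTM feature vector
weights = [[1.0, 0.0, 1.0],            # row for "violent"
           [0.0, 1.0, -1.0]]           # row for "non-violent"
biases = [0.0, 0.0]

logits = fully_connected(features, weights, biases)
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]
```

The predicted class is the one with the highest softmax probability.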
                     
                     The total number of parameters in the transfer network was 0.3 M.
                     
                           Fig. 4. Processing of the transfer network.
 
                   
                
               
                     4.3 Methods for Performance Evaluation
                  Eqs. (1) and (2) were used to evaluate the performance of the proposed system. These
                     metrics are based on the confusion matrix presented in Table 3 [11]. In Eq. (2), $P$ is the number of ground-truth violent scenes, $N$ is the total number of data items, and $L_i = 1$
                     when the $i$-th item is violent; otherwise, $L_i = 0$.
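
For reference, the two metrics can be written in their standard forms using the symbols defined above. This is a sketch under two assumptions: that Eq. (1) is the usual accuracy computed from the confusion matrix, and that Eq. (2) is the usual average precision over the $N$ items ranked in descending order of their predicted violence score.

$$
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

$$
\mathrm{AP} = \frac{1}{P} \sum_{i=1}^{N} L_i \cdot \frac{1}{i} \sum_{j=1}^{i} L_j
$$

In the second expression, the inner sum divided by $i$ is the precision among the top $i$ ranked items, and it contributes only at the ranks where the item is violent.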
                  
                  
                  
                  
                        Table 3. Confusion matrix of the experiment results.
                     
                           
                              
                                 | Predicted class | True class: Violent | True class: Non-violent |
                                 | Violent | 199 (TP) | 14 (FP) |
                                 | Non-violent | 1 (FN) | 186 (TN) |
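
The accuracy can be recomputed directly from the counts in Table 3. The sketch below does so and also shows a ranked-list average precision on a toy example; the function names are hypothetical, and the average precision function assumes the standard definition in which items are sorted by descending predicted score.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all scenes that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def average_precision(ranked_labels):
    """ranked_labels: 1 = violent, 0 = non-violent, sorted by descending score."""
    positives = sum(ranked_labels)
    if positives == 0:
        raise ValueError("at least one violent item is required")
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank   # precision among the top `rank` items
    return total / positives


# Counts taken from Table 3.
acc = accuracy(tp=199, fp=14, fn=1, tn=186)

# Toy ranking (hypothetical): violent, non-violent, violent.
toy_ap = average_precision([1, 0, 1])
```

With the counts from Table 3, the accuracy is 385/400 = 0.9625, which matches the 96.25% reported for the proposed system on VMD.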
                        
                     
                   
                
               
                     4.4 Results
                  The proposed system was trained with the training and validation data. The confusion
                     matrix in Table 3 shows the result of evaluating the trained model on the testing
                     data. Table 4 compares the performance obtained using Eqs. (1) and (2) with that of previous studies.

                  The proposed algorithm was compared with that of Gu et al. [6], which was the most
                     recent of the previous studies and had high performance. The VCD dataset used by
                     Gu et al. was not disclosed, but MediaEval 2015 is public. Therefore, the proposed
                     system was evaluated on MediaEval 2015, where its average precision was 4.5
                     percentage points higher (18.69% versus 14.16%). Additionally, the algorithm proposed
                     by Gu et al. was applied to the VMD dataset used in this paper, and the proposed
                     algorithm again performed better (96.25% versus 89.75% accuracy). The higher accuracy
                     is attributed to extracting mutual information from audio using Wav2vec 2.0 and to
                     using a 1D CNN and LSTM, which can effectively discriminate audio data.
                  
                  In Table 4, the datasets used in previous studies are MediaEval 2014, MediaEval 2015, Violent
                     video dataset (VSD), and Violent scenes dataset (VCD). Among them, VSD and VCD are
                     datasets in which the authors of the respective papers directly extracted violent
                     scenes from movies and YouTube. In contrast, MediaEval 2014 and MediaEval 2015 are the
                     most widely used public datasets for discriminating violent scenes and were extracted
                     from hundreds of movies [5, 6, 13, 14].
                  
                        Table 4. Experimental results and comparison.
                     
                           
                              
                                 | Researcher | Sarman and Sert [4] | Potharaju et al. [5] | Gu et al. [6] | Gu et al. [6] | Our group | Our group | Our group |
                                 | Algorithm | Random Forest | SVM | Mel Spectrogram VGGish | Mel Spectrogram VGGish | Proposed System | Mel Spectrogram VGGish | Proposed System |
                                 | Dataset | MediaEval 2014 | VSD | VCD | MediaEval 2015* | MediaEval 2015* | VMD | VMD |
                                 | Accuracy | - | 78.22% | 80.55% | - | - | 89.75% | 96.25% |
                                 | Average Precision | 68.80% | - | - | 14.16% | 18.69% | 79.50% | 99.50% |

* Because the MediaEval data are imbalanced, average precision is used as the evaluation metric.
                     				
                  
 
                
             
            
                  5. Conclusion
               Automatic identification of violent scenes is required to protect users from unwanted
                  and violent media. In this study, a system was proposed to discriminate violent movie
                  scenes based on audio signals. The proposed system uses Wav2vec 2.0 for audio feature
                  extraction, and a 1D CNN-LSTM combination to discriminate extracted audio features
                  into violent and non-violent scenes. The proposed system discriminated violent scenes
                  with an accuracy of 96.25% when using VMD, which is superior to results in previous
                  studies. This study considered only audio features to discriminate movie scenes as
                  violent or non-violent. Although it is generally more effective to discriminate violent
                  scenes using visual information along with audio signals, the proposed approach is
                  expected to be particularly useful for media with limited visual information, such
                  as radio.
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
                  				This research was supported by the Basic Science Research Program through the
                  National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2020R1A2C2004624
                  and NRF-2018R1A6A1A03025523).
                  			
               
             
            
                  
                     REFERENCES
                  
                     
                        
                        [1] Gentile D. A., 2004, Media Violence as a Risk Factor for Children: A Longitudinal
                           Study, American Psychological Society 16th Annual Convention, Chicago, IL

                        [2] Shafaei M., Samghabadi N. S., Kar S., Solorio T., 2019, Rating for Parents: Predicting
                           Children Suitability Rating for Movies Based on Language of the Movies, arXiv preprint
                           arXiv:1908.07819

                        [3] Mu G., Cao H., Jin Q., 2016, Violent Scene Detection Using Convolutional
                           Neural Networks and Deep Audio Features, Chinese Conference on Pattern Recognition,
                           Springer, Singapore

                        [4] Sarman S., Sert M., 2018, Audio Based Violent Scene Classification Using
                           Ensemble Learning, 2018 6th International Symposium on Digital Forensic and Security
                           (ISDFS), IEEE

                        [5] Potharaju Y., Kamsali M., Kesavari C. R., 2019, Classification of Ontological Violence
                           Content Detection through Audio Features and Supervised Learning, International Journal
                           of Intelligent Engineering and Systems, Vol. 12, No. 3, pp. 20-230

                        [6] Gu C., Wu X., Wang S., 2020, Violent Video Detection Based on Semantic Correspondence,
                           IEEE Access, Vol. 8, pp. 85958-85967

                        [7] Baevski A., Zhou H., Mohamed A., Auli M., Jun. 2020, Wav2vec 2.0: A Framework for
                           Self-Supervised Learning of Speech Representations

                        [8] Krizhevsky A., Sutskever I., Hinton G. E., 2012, ImageNet Classification with
                           Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems,
                           Vol. 25, pp. 1097-1105

                        [9] Hochreiter S., Schmidhuber J., 1997, Long Short-Term Memory, Neural Computation,
                           Vol. 9, No. 8, pp. 1735-1780

                        [10] Kiranyaz S., Avci O., Abdeljaber O., Ince T., Gabbouj M., Inman D. J., 2019, 1D Convolutional
                           Neural Networks and Applications: A Survey, arXiv preprint arXiv:1905.03554

                        [11] Olson D. L., Delen D., 2008, Advanced Data Mining Techniques, Springer Science &
                           Business Media

                        [12] Schedl M., et al., 2015, VSD2014: A Dataset for Violent Scenes Detection in Hollywood
                           Movies and Web Videos, 2015 13th International Workshop on Content-Based Multimedia
                           Indexing (CBMI), pp. 1-6

                        [13] Sjöberg M., Baveye Y., Wang H., Quang V. L., Ionescu B., Dellandréa E., Chen L., The
                           MediaEval 2015 Affective Impact of Movies Task, MediaEval 2015 Workshop

 
                   
                
             
            Author
            
            
               			Huiyong Bak received his B.S. from the Department of Mechatronics Engineering,
               Inha University, Incheon, Republic of Korea in 2021. He is currently pursuing an M.S.
               in the Department of Electrical and Computer Engineering, Inha University. His research
               interests include deep learning using audio signals.
               		
            
            
            
               			Sangmin Lee received a B.S., an M.S., and a Ph.D. from Inha University, all in
               electronic engineering, in 1987, 1989, and 2000, respectively. He is currently a Professor
               with the School of Electronic Engineering, Inha University, Korea. His research interests
               include bio-signal processing and psycho-acoustics.