
1. School of Arts, Zhengzhou Technology and Business University, Zhengzhou 451400, China (xujunz0806@hotmail.com)



Keywords: Deep neural network, Vocal singing, Main feature, Extraction

1. Introduction

Singing has become an important part of electronic music, and the popularity of this type of music has in turn brought electronic music to a new height of popularity. Moreover, the method of directly sampling the singing of well-known singers also solves the problem of insufficient singing ability in electronic music [1]. The vocal effector is a device placed between the microphone and the mixer that can add many effects to the human voice: from regular delays, reverbs, and frequency shaping to special effects such as chorus, echo, auto-tune, and vocoder. Traditional real instruments are rarely found in much electronic music, so in order to make human voices match this music, some ``dehumanizing'' processing of the singing voice is often needed [2]. Mechanical analog synths and a human voice that sounds like robotic speech make a song sound like artificial intelligence from the future [3]. The effect most commonly used on electronic music vocals is the compressor. It works by narrowing the dynamic range of the output signal, making weak signals louder and strong signals quieter. Hardly any electronic music tolerates vocal dynamics as wide as those of rock and roll [4], and a sudden shout can easily ruin the immersive electronic music listening experience. The singer can therefore actively control dynamics while singing through vocal cord closure, the bubble (vocal fry) singing method, and control of the guttural sound mentioned above.

The vocals that electronic music pursues are wrapped inside the music rather than standing out from it. By mastering compression on the vocal effector, one can find the most ``wrapped'' placement of the voice within the music through the compression settings, so that the voice never escapes from the mix [5]. It works by setting a threshold on the compressor at the level where the vocal no longer audibly overpowers the music. The compressor kicks in whenever the vocal signal level picked up by the microphone exceeds the threshold, so even if the total output is increased during tuning or mixing, the output level does not rise by the same amount. In the author's experience with vocal effects, the reverberation parameters with the greatest impact on singing electronic works are Decay Time (reverberation time) and Pre-delay [6]. Mastering these two reverberation parameters on the vocal effector can make the voice distinctive on the electronic music stage. Although vocal modulation is very complex and many parts can still be left to live tuners, some crucial decisive factors can be handled by the singer himself, which better ensures the live presentation of electronic music. Electronic music styles suited to small vocal spaces, requiring short vocals and less sweet or prominent melodies, such as most electronic dance music (EDM), dubstep, house, or jungle, can use a short Decay Time [7]. A short Decay Time makes the voice instantly tight and grainy; combined with small sampled spaces such as cars, bathtubs, or cabinets, it creates a compressed, suffocating feeling well suited to singing Psy-Trance and trip-hop electronic music [8]. For electronic music that is more relaxed, strongly atmospheric, and rich in pad-like tones, such as some soothing future bass, a Decay Time of about 4 to 5 seconds combined with a Pre-delay around a 30% starting point creates relatively gorgeous, ``swishing'' vocals. Pre-delay affects the perceived acoustic space of the voice, evoking theaters, rooms, corridors, and so on [9]. The pre-delay of a large space can create a very pure retro feeling; the VintageCall single released by the author also applies this retro spatial feeling created by pre-delay [10].

Feature extraction of frequency-hopping signals can help the military analyze all the information carried by such signals and deal effectively with information feature extraction in various complex environments, particularly vocal feature extraction, thereby advancing the intelligent development of vocal singing training. The extraction and parameter estimation of frequency-hopping signals for vocal singing in complex environments studied in this article can be applied to any environment in which single or multiple frequency-hopping signals exist, in order to solve existing practical problems.

Unlike traditional algorithms, this article focuses on the waveform features used in vocal feature recognition, and studies and verifies time-frequency denoising and anti-interference processing. A method that uses connected-domain labeling to remove interference is proposed, which can be fused with existing music waveform algorithms, facilitating the practical application of subsequent intelligent models. The proposed method effectively promotes the intelligent application of vocal feature recognition systems.

The contributions of this article are as follows:

(1) This article proposes an improved vocal feature extraction algorithm that can operate in complex vocal singing environments, promoting the practical application of feature extraction algorithms.

(2) This article combines several common information feature extraction algorithms and applies communication principles to acoustic signal processing, improving the extraction performance of information feature extraction algorithms.

The research significance of this article is that the improved algorithm can be widely applied in multimedia industry software to reduce the difficulty of music source separation tasks, enhance the support of intelligent algorithms for the music industry, and assist the further popularization of music.

2. Related work

Some electronic music genres have common vocal processing conventions; although the details of each work differ, the signature effects remain essential. For example, DreamPop (dream pop) vocals carry a ``chorus'' effect; Trap (trap music) vocals are very dry and heavily compressed and almost always carry about 10% auto-tune; and Trance (psychedelic trance) vocals often use short pre-delays [11]. These effects do not produce good results simply by being added after singing: if the singer has no experience with them, it is easy to leave no room for them when singing, and the result after the effects are added is unsatisfactory. Below, the author shares some experience and personal thoughts on reserving space for effects when singing [12]. The principle of the chorus effect is to modulate the sound, slightly shift the pitch, and superimpose the result on the original audio to form an effect similar to a chorus. When singing with this vocal effect, especially live, the first thing to note is that the effect is very sensitive to pitch: every sung note is pitch-shifted, and the shift is based on the original pitch. If the original note is not sung accurately enough, the chorused sound will therefore sound very dissonant. This kind of effect is thus more suitable for works such as DreamPop, whose melodies are dominated by long notes, and may not achieve good results for works with many short note durations in the melody [13]. When singing, vibrato should be reduced as much as possible, especially large low-frequency vibrato, which causes the pitch recognized by the chorus effect to change constantly and produces a very unpleasant result; the smoother the singing, the better. When singing slides, it is necessary to plan in advance which note to start from, which note to land on, and at what speed and tempo to slide to the final note, so as to avoid the temporary out-of-tune impression caused by the chorus effect during a slide that lasts too long [14]. In addition, when the chorus effect is applied, the singer should keep vocal cord closure relatively strong, that is, the ``voiced'' part of the voice should be greater than the ``unvoiced'' part. Because the chorus effect audibly widens the sound, it makes the voice lose a certain graininess. Therefore, a singer who relies mainly on breathiness or has insufficient vocal cord closure may not sound unpleasant in the music, but after the chorus effect is added the voice will sound extraordinarily weak. Minimizing hard onsets is also necessary, since a strong guttural attack destroys the beauty of the chorus effect; this is another aspect that needs to be planned in advance when singing for this effect [15].

Double (multi-track overlay) recording is a common method in some electronic music. Since the New Wave era, it has been widely used in the vocal singing and recording practice of electronic music [16]. The way album vocals were recorded in that period may have inspired its electronic music, but this has not been proven. This way of recording singing resembles recording harmony, but it is different. First, a monophonic vocal is sung in the center channel with stronger sound pressure and vocal cord closure as the main vocal. Afterwards, several takes are sung with an airy, open-vocal-cord technique and placed on the left and right channels, or a small portion can be placed in the center channel and most of them on the sides [17]. This kind of singing requires a lot of room to be reserved for the effect, and the most important thing is to minimize the vibration of the vocal cords. In normal singing, the voiced sound produced by normal vocal cord vibration does not sound out of place, but when multiple tracks are superimposed, the voiced sound generated by vocal cord vibration is multiplied, producing an unpleasant result [18]. When recording multi-track overlays, the singer can get very close to the microphone, suppress chest resonance, and sing only with mouth and some head resonance, raising the throat so that as much air as possible passes through. In this way, the sound minimizes the mid-low frequencies and retains the high frequencies, which forms a psychedelic, drifting texture when the tracks are superimposed [19].

Overall, there are currently many algorithms for estimating frequency-hopping parameters, which can play a role in feature extraction of vocal singing information. In comparison, time-frequency analysis is more intuitive and does not require much prior information, but its estimation results depend closely on the clarity and focus of the time-frequency map, and mature algorithms are mostly designed for single frequency-hopping signals. Therefore, this article conducts in-depth research on the optimization of time-frequency maps and the estimation of multi-hop parameters.

This paper combines a deep neural network to construct a main feature extraction model for vocal singing and analyzes vocal singing waveforms with intelligent methods to improve the teaching effect of vocal singing.

3. Extraction of vocal singing waveforms in complex environments

3.1 Time-frequency graph denoising

In a noisy environment, in order to extract multiple vocal singing waveforms, a denoising threshold should first be set for the joint time-frequency analysis, so that the intercepted time-frequency map suffers less noise interference and a clear time-frequency distribution map is obtained.

If the received multi-vocal singing waveform is first transformed with the STFT to obtain the time-frequency matrix $STFT(m,n)$, where $m\in [1$, $M]$, $n\in [1$, $N]$, the denoising threshold of the time-frequency map is [20]

(1)
$ \text{threshold}=\frac{w}{MN} \sum _{m=1}^{M}\sum _{n=1}^{N}\text{STFT}(m,n). $

When the weight is exactly at the critical value, the energy distribution trend of the entire time-frequency matrix changes from a rapid downward trend to a slow one; that is, there is a turning point, and the weight corresponding to this turning point is the ideal weight.

The specific steps of energy threshold denoising are as follows:

(1) First, the algorithm takes the weight $w$ over the interval $[1$, $10]$ with a step of 0.1 and obtains the discrete thresholds $\text{threshold}(i)$ according to formula (1), where $1\le i\le 99$;

(2) Secondly, the algorithm applies the short-time Fourier transform (STFT) to obtain the corresponding time-frequency map and, for each weight $w$, counts the number of time-frequency points in the matrix that exceed the threshold, that is [21]

(2)
$ c(i)=c(i)+\begin{cases} 1, & \text{STFT}(m,n)\ge \text{threshold}(i),\\ 0, & \text{STFT}(m,n)< \text{threshold}(i). \end{cases} $

Then, the corresponding energy distribution statistical curve under each threshold can be obtained, and the curve is shown in Fig. 1.

(3) The algorithm computes the second-order difference of $c(i)$ and finds the first point closest to zero in the second-order difference; the weight corresponding to this point is the critical value of the turning point.

Fig. 1 is a curve of the energy distribution in the time-frequency diagram under Gaussian noise as a function of the weight $w$. The turning point can be clearly seen, and the weight corresponding to this turning point is the optimal denoising weight, and the interception threshold set by this weight can well remove the unnecessary noise interference in the time-frequency diagram.
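A minimal numpy/scipy sketch of steps (1)-(3) is given below. The function name, the STFT segment length, the use of the STFT magnitude as the time-frequency value, and the near-zero tolerance are illustrative assumptions rather than the paper's exact implementation.

import numpy as np
from scipy.signal import stft

def optimal_energy_threshold(x, fs=16000, nperseg=256):
    # Sweep the weight w, count points above each threshold (Eqs. (1)-(2)),
    # and locate the turning point of c(i) via its second-order difference.
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    S = np.abs(Z)                                   # magnitude time-frequency matrix STFT(m, n)
    mean_energy = S.mean()                          # (1/MN) * sum over all time-frequency points
    weights = np.arange(1.0, 10.05, 0.1)            # w over [1, 10] in steps of 0.1
    thresholds = weights * mean_energy              # Eq. (1)
    c = np.array([(S >= th).sum() for th in thresholds])   # Eq. (2): points above each threshold
    d2 = np.diff(c.astype(float), n=2)              # second-order difference of c(i)
    near_zero = np.where(np.abs(d2) <= 0.01 * np.abs(d2).max())[0]
    turn = near_zero[0] if near_zero.size else int(np.argmin(np.abs(d2)))
    return weights[turn], thresholds[turn]          # turning-point weight and its threshold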

It can be seen from Fig. 2 that the noise exhibits a higher peak value and a lower relative power, while the signal has a lower peak value and a higher relative power [22].

This paper proposes a simple and effective adaptive dynamic threshold method. From the previous analysis, the distribution of noise in the time-frequency matrix is relatively uniform and scattered, whereas, because of the fixed frequency set and time continuity of multiple vocal singing waveforms, the distribution of the vocal singing waveforms in the time-frequency matrix is sparse and their energy is concentrated. Exploiting this characteristic, the entire time-frequency matrix is arranged in ascending order, the first 20% of the data are taken according to experience, and their average value is computed. Then, the maximum value of the entire time-frequency matrix is selected, supplemented by the mean of the global maximum and minimum values as a correction, and the three terms are weighted $1:2:1$ to obtain a new threshold. The definition formula is:

(3)
$ \text{threshold}=\frac{\text{STFT}_{\max } +2\mu _{20\% } +D}{4}. $

In the formula, $\text{STFT}_{\max}$ is the global maximum value, $\mu _{20\% }$ is the mean of the first 20% of the sorted time-frequency matrix, and $D$ is the correction term, i.e., the mean of the global maximum and minimum values. It has been verified that this dynamic threshold formula also has a good interception effect at lower signal-to-noise ratios.
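A minimal numpy sketch of Eq. (3) follows. Note that the text is slightly ambiguous about whether the 20% share is taken from the low or high end of the sorted matrix; the sketch follows the description above (the first 20% in ascending order), which is an interpretation, not a confirmed detail of the paper's code.

import numpy as np

def adaptive_threshold(S):
    # S: magnitude time-frequency matrix.
    flat = np.sort(S, axis=None)                         # ascending order
    mu20 = flat[: max(1, int(0.2 * flat.size))].mean()   # mean of the first 20% of sorted values
    s_max = flat[-1]                                      # global maximum STFT_max
    D = 0.5 * (flat[-1] + flat[0])                        # correction: mean of global max and min
    return (s_max + 2.0 * mu20 + D) / 4.0                 # 1:2:1 weighting, Eq. (3)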

In order to observe the denoising effect, two frequency sets are given. The frequency set of the vocal singing waveform $s_{1} (t)$ is $[4300$, $4600$, $4900$, $5200$, $5500$, $5900$, $6200$, $6500]$ Hz, and the beating speed is $20$ hop/s. The frequency set of the vocal singing waveform $s_{2} (t)$ is $[700$, $1000$, $1300$, $1700$, $2000]$ Hz, and the beating speed is $12.5$ hop/s. The sampling rate is $16$ kHz, and the total simulation time is $0.4$ s. The noise is Gaussian noise, and the STFT is used for time-frequency transformation. The signal-to-noise ratio is $0$ dB.

As can be seen from Fig. 3, energy threshold denoising, histogram denoising and improved threshold denoising can remove noise very well. In order to quantitatively evaluate the quality of the three denoising algorithms, the signal point detection rate (SPDR) is defined as the measurement index, which is defined as

(4)
$ SPDR=\frac{AS}{BS} . $

In the formula, AS is the number of signal points after processing, and BS is the number of signal points before processing. It illustrates the validity and stability of the extracted signal. In addition, the computational complexity is used to illustrate engineering practicality.
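A short sketch of Eq. (4), assuming the signal points are available as boolean (or 0/1) time-frequency masks of the same shape; the function name is illustrative.

import numpy as np

def spdr(points_before, points_after):
    # SPDR = AS / BS: signal points kept after processing over signal points before.
    bs = np.count_nonzero(points_before)
    return np.count_nonzero(points_after) / bs if bs else 0.0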

From Fig. 4(a), it can be seen that at low SNR the signal point detection rate of the improved denoising threshold is higher than that of energy threshold denoising and histogram denoising, while at high signal-to-noise ratio its detection rate is lower. However, its overall denoising performance is relatively stable.

In order to observe the computational complexity of the three methods more intuitively, the change trend of the total number of sampling points is shown in Fig. 4(b).

Therefore, it can be seen from Fig. 4(b) that the improved threshold denoising provides more stable extraction and more stable time-frequency focusing than histogram denoising and energy threshold denoising. Moreover, its computational complexity is low.

Fig. 1. Energy distribution of the time-frequency matrix under different weights.


Fig. 2. Histogram statistics of time-frequency graph.


Fig. 3. Original time-frequency diagram and time-frequency diagram after denoising.


Fig. 4. Signal point detection rate and complexity.


3.2 Overview of waveform processing methods for vocal singing

To describe morphological filtering vividly, the vocal singing waveform to be processed is defined as $A$, the predefined structural element as $B$, and each set belongs to $E^{N} $.

(1) Dilation operation

Dilation is a morphological operation that combines two sets using vector addition of their elements. $A$ and $B$ are sets in the $N$-dimensional space $\left(E^{N} \right)$ with $N$-tuples $a=(a_{1}$, $\ldots$, $a_{N})$ and $b=(b_{1}$, $\ldots$, $b_{N})$ as element coordinates, respectively. The dilation of $A$ by $B$ is therefore the set of all possible vector sums of pairs of elements, one from $A$ and the other from $B$.

Definition 1: If both $A$ and $B$ belong to a subset of $E^{N} $, the dilation of $A$ by $B$ can be expressed as $A\oplus B$ and defined as

(5)
$ A\oplus B=\left\{c\in E^{N} \mid c=a+b,~a\in A~\text{and}~b\in B\right\}. $

(2) Erosion operation

Erosion is the reverse of the dilation operation. It mainly erodes the elements of the vocal singing waveform; that is, the vectors of the two sets of elements are subtracted and then combined. If both $A$ and $B$ are sets in $N$-dimensional Euclidean space, then the erosion of $A$ by $B$ is the set of all elements $x$ such that $x+b \in A$ for every $b \in B$. Some vocal waveform processors use the name shrink or reduce instead of erosion.

Definition 2: The erosion of A by B can be represented by $A \Theta B$ and is defined as follows:

(6)
$ A\Theta B=\left\{x\in E^{N} \mid x+b\in A~\text{for every}~b \in B \right\}. $

(3) Open operation

In practice, dilation and erosion are usually used in pairs: either the vocal waveform is dilated and the result eroded, or it is eroded and the result dilated. In both cases, the successive application of dilation and erosion removes specific vocal waveform details smaller than the structuring element without suppressing the overall geometric shape of the feature. The open operation is one such combination of erosion and dilation operations.

Definition 3: The open operation of B on A can be represented by $A\circ B$ and is defined as follows:

(7)
$ A\circ B=\left(A\Theta B\right)\oplus B . $

(4) Close operation

Definition 4: The close operation of B on A can be represented by $A\bullet B$ and is defined as follows:

(8)
$ A\bullet B=\left(A\oplus B\right)\Theta B . $

The above are the four basic operations of mathematical morphology. So-called morphological filtering uses a suitable structural element $B$ to perform the open and close operations on the target vocal singing waveform $A$, denoted as $A\odot B$, and the relevant definition formula is

(9)
$ A\odot B=\left(A\circ B\right)\bullet B . $

After the target vocal singing waveform is morphologically filtered, isolated discrete points are eliminated, sharp protrusions are smoothed, and disconnected or partially missing parts are reconnected and filled in. Because the binarized time-frequency map satisfies the relevant conditions of morphological filtering, applying morphological filtering to the time-frequency map can eliminate various interferences and smooth the vocal singing waveform to a certain extent.
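A minimal sketch of the filtering of Eq. (9) applied to a binarized time-frequency map, using scipy.ndimage; the structuring element shape and size below are illustrative assumptions, not the paper's choice.

import numpy as np
from scipy.ndimage import binary_opening, binary_closing

def morphological_filter(binary_tf, structure=None):
    # Eq. (9): opening (erosion then dilation, Eq. (7)) removes isolated noise
    # points; the following closing (dilation then erosion, Eq. (8)) reconnects
    # broken hop segments in the binary time-frequency map.
    if structure is None:
        structure = np.ones((1, 3), dtype=bool)   # short structuring element (illustrative)
    opened = binary_opening(binary_tf, structure=structure)
    return binary_closing(opened, structure=structure)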

In vocal singing waveform processing, in addition to common noise interference, there are also some common signal interferences. The main interference signals are fixed-frequency, frequency-sweep, and burst interference; brief descriptions of the three types follow. The time-frequency transform used in the analysis is a threshold-adaptive STFT-SPWVD, because other time-frequency analyses are not very effective when multiple signals coexist.

(1) Fixed frequency interference

The fixed-frequency signal refers to a communication signal whose carrier frequency does not change over the whole period during which the receiver receives the signal; it is continuous in time and has a certain bandwidth. Therefore, in a time-frequency analysis it appears as a straight line parallel to the time axis. Fig. 5(a) is the time-frequency analysis diagram of two fixed-frequency signals; it can be seen that the frequency is fixed and the time is continuous.

(2) Frequency sweep interference

The frequency sweep signal refers to a signal whose frequency changes over time, and usually occupies a relatively large bandwidth. Fig. 5(b) is the time-frequency diagram of the frequency sweep signal. It can be seen that the relationship between its frequency and time is a linear change, occupying a large frequency range and showing continuity in time.

(3) Burst interference

The burst signal is a signal whose frequency changes randomly with time; each burst has a fixed frequency over its time segment, but the duration is very short, usually less than the hop period of the vocal singing waveform. Fig. 5(c) is a time-frequency diagram of an 8-segment burst signal; it can be seen that the signal is intermittent and its frequency changes randomly over the observation period.

In the actual environment, the received signal generally contains other signals. In order to illustrate the scenario in which various interference signals coexist with multiple vocal singing waveforms, a corresponding simulation environment is constructed. The simulation duration is $0.4$ s, the sampling frequency is $16$ kHz, the frequency set of the vocal singing waveform $s_{1} (t)$ is $[4300$, $4600$, $4900$, $5200$, $5500$, $5900$, $6200$, $6500]$ Hz, and the hop speed is $20$ hop/s. The frequency set of the vocal singing waveform $s_{2} (t)$ is $[700$, $1000$, $1300$, $1700$, $2000]$ Hz, and the hop speed is $12.5$ hop/s. The frequencies of the two fixed-frequency signals are $2800$ Hz and $3200$ Hz, respectively. The frequency range of the swept interference signal is $[7000$-$7800]$ Hz. The burst signal is $[0$-$8000]$ Hz, and eight segments are randomly selected. The noise is white Gaussian noise of $10$ dB, and Fig. 6 is the STFT-SPWVD time-frequency diagram of the superimposed multiple vocal singing waveforms, interference signals, and noise.
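A numpy sketch of this simulated mixture follows. The listed parameter values are taken from the text; the helper name hop_signal, the random hop ordering, the burst duration, and the unit amplitudes are illustrative assumptions.

import numpy as np

fs, T = 16000, 0.4
t = np.arange(int(fs * T)) / fs

def hop_signal(freqs, hop_rate, t, seed=0):
    # Frequency-hopping tone: a new carrier drawn from `freqs` every 1/hop_rate seconds.
    rng = np.random.default_rng(seed)
    hop_idx = np.floor(t * hop_rate).astype(int)           # hop number of each sample
    carriers = rng.choice(freqs, size=hop_idx.max() + 1)    # one carrier per hop
    return np.sin(2 * np.pi * carriers[hop_idx] * t)

s1 = hop_signal([4300, 4600, 4900, 5200, 5500, 5900, 6200, 6500], 20.0, t, seed=1)
s2 = hop_signal([700, 1000, 1300, 1700, 2000], 12.5, t, seed=2)
fixed = np.sin(2 * np.pi * 2800 * t) + np.sin(2 * np.pi * 3200 * t)  # fixed-frequency interference
sweep = np.sin(2 * np.pi * (7000 * t + 0.5 * (800 / T) * t ** 2))    # 7000-7800 Hz linear sweep

rng = np.random.default_rng(3)
burst = np.zeros_like(t)
for start in rng.uniform(0, T - 0.01, size=8):                        # eight short random bursts
    seg = (t >= start) & (t < start + 0.01)
    burst[seg] = np.sin(2 * np.pi * rng.uniform(0, 8000) * t[seg])

mix = s1 + s2 + fixed + sweep + burst
mix += np.sqrt(np.var(mix) / 10.0) * rng.standard_normal(t.size)      # approx. 10 dB white Gaussian noise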

A commonly used method for eliminating interference is based on power spectrum cancellation. It mainly exploits the fact that different signals present different forms of power in order to extract the vocal singing waveform. The main extraction steps are:

(1) First, the algorithm performs a time-frequency transformation on the received signal to obtain the time-frequency analysis matrix $TFR_{s} (m,n)$. If the segment length is selected as $W$ according to the actual situation, the time-frequency power spectrum is obtained as:

(10)
$ P_{s} (m,n)=\frac{1}{W} \left|TFR_{s} (m,n)\right|^{2}. $

Among them, $m$ is the time sampling point, $n$ is the frequency sampling point.

(2) The algorithm sums the time-frequency matrix of the received signal along the time axis, i.e., it obtains the frequency-dependent power spectrum, and then averages it. The average power spectrum is

(11)
$ \overline{P_{s} (n)}=\frac{1}{M} \sum _{m=1}^{M}P_{s} (m,n) . $

(3) After subtracting the average power spectrum from the obtained time-frequency spectrum, we get

(12)
$ P_{sub} (m,n)=P_{s} (m,n)-\overline{P_{s} (n)} . $

(4) Observing the resulting time-frequency matrix, it can be found that $P_{sub} (m,n)$ has many elements less than 0, so the time-frequency matrix after cancellation is obtained by intercepting it with 0 as the threshold.
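A minimal sketch of steps (1)-(4) using scipy's STFT as the time-frequency transform (the paper uses a threshold-adaptive STFT-SPWVD, so this is a simplification); the segment length and function name are assumptions.

import numpy as np
from scipy.signal import stft

def power_spectrum_cancellation(x, fs=16000, nperseg=256):
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)     # TFR_s: rows = frequency, columns = time
    P = (np.abs(Z) ** 2) / nperseg                # Eq. (10): time-frequency power spectrum
    P_mean = P.mean(axis=1, keepdims=True)        # Eq. (11): average over time for each frequency
    P_sub = P - P_mean                            # Eq. (12): subtract the average spectrum
    return np.where(P_sub > 0, P_sub, 0.0)        # Step (4): clip negative residuals at zero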

Fig. 5. Signal time-frequency diagram.


Fig. 6. Time-frequency diagram of signal in complex environment.


4. System construction and experimental research

4.1 System construction

The CNN network structure in this paper is shown in Fig. 7 and shares its basic framework with the traditional AlexNet. Specifically, it contains eight layers: the first five are convolutional layers alternating with pooling layers, and the remaining three are fully connected layers for classification. The input images of the CNN are the harmonic spectrogram and percussive spectrogram separated using HPSS together with the spectrogram of the original music signal; the input image size is normalized to $256\times 256$ and then fed into the first convolution filter.

Fig. 7. CNN structure diagram.

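A minimal PyTorch sketch of the eight-layer AlexNet-style network described above (five convolutional layers with pooling, three fully connected layers, 256x256 input). The channel counts, kernel sizes, and number of output classes are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

class VocalFeatureCNN(nn.Module):
    # 5 convolutional layers (with pooling) + 3 fully connected layers.
    # Input: 3-channel 256x256 image (harmonic, percussive, and original spectrograms).
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d((7, 7))      # fixed spatial size before the classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                             # x: (batch, 3, 256, 256)
        return self.classifier(self.pool(self.features(x)))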

The pseudocode segment in this article is as follows (dynamic allocation and release of a two-dimensional array shuju of size x by y, written here as C):

int **shuju;                                      /* pointer to the 2D data array shuju(x, y) */
shuju = (int **)malloc(x * sizeof(int *));        /* dynamic memory application: array of row pointers */
for (int i = 0; i < x; i++)
    shuju[i] = (int *)malloc(y * sizeof(int));    /* allocate one row of the array */
/* ... use shuju[i][j] ... */
for (int i = 0; i < x; i++)
    free(shuju[i]);                               /* free each row */
free(shuju);                                      /* free the 2D array of row pointers */

Through the above steps, the time-frequency diagram of the vocal singing waveform after removing the interference can be obtained. However, power spectrum cancellation has a certain ability to eliminate fixed-frequency interference and burst interference, but it does not have a good elimination effect on frequency sweep interference. Fig. 8 is a time-frequency analysis diagram before and after time-frequency cancellation processing.

Fig. 8. The original time-frequency diagram and the time-frequency diagram after cancellation.

../../Resources/ieie/IEIESPC.2025.14.3.307/image8.png

The hardware environment of this article is as follows:

Operating system: Windows 10, 64-bit; Programming language: Python; Deep learning framework: PyTorch; Development platform: PyCharm; CPU: Intel Core i5-3210M @ 2.5 GHz; GPU: NVIDIA GeForce RTX 2070.

As can be seen from Fig. 8, the vocal singing waveform is more obvious on the time-frequency diagram after time-frequency cancellation, and the burst interference and fixed-frequency interference are partially but not completely eliminated. For swept-frequency interference, time-frequency cancellation has no effect at all. However, time-frequency cancellation preserves the time-frequency characteristics of the vocal singing waveforms well. Therefore, time-frequency cancellation can extract multi-vocal singing waveforms, but the final effect is not ideal, and other means are needed to eliminate frequency-sweep interference, burst interference, and the residue of fixed-frequency signals.

To better illustrate the performance of the vocal singing waveform extraction method, a quality evaluation index is proposed: under a fixed signal-to-noise ratio, observe the change between the signal-to-interference ratio of the input vocal singing waveform and the signal-to-interference ratio after noise and various interferences are removed. The binary image of the time-frequency transform of the theoretical vocal singing waveform is $a2$, and the binary image after interference removal is $a1$. The signal-to-interference ratio between the extracted vocal waveform and the interference is calculated as follows:

First, a suitable method is used to extract the vocal singing waveform, and the total energy value $u1$ of the processed signal is calculated, that is,

(13)
$ u1=\text{sum}(\text{sum}(a1*a2*TFR)) . $

In the formula, $*$ represents element-wise (point) multiplication of matrices rather than matrix multiplication, and $a1*a2$ represents the points where $a1$ and $a2$ agree on the signal. $TFR$ is the time-frequency transform matrix of the input signal, and multiplying it by the mask gives the energy distribution of the extracted vocal singing waveform.

Then, the interference energy that remains uneliminated, i.e., the energy value $u2$ of signals other than the vocal singing waveform, is calculated; its corresponding expression is

(14)
$ u2=\text{sum}(\text{sum}((a1-(a1*a2))*TFR)). $

Finally, the signal-to-interference ratio $Q1$ of the extracted vocal singing waveform and the interference signal is calculated, that is

(15)
$ Q1=10\lg \left(\frac{u1}{u2} \right). $

After calculating the signal-to-interference ratio after extraction, the signal-to-interference ratio before extraction must also be computed. Similarly, the total energy $u3$ of the theoretical vocal singing waveform in the time-frequency transform is first calculated, that is,

(16)
$ u3=\text{sum}(\text{sum}(a2*TFR)) . $

Then, all background energy values $u4$ of the input signal are calculated, namely

(17)
$ u4=\text{sum}(\text{sum}((1-a2)*TFR)) . $

Finally, the signal-to-interference ratio $Q2$ between the signal and the interference at the input is calculated, that is,

(18)
$ Q2=10\lg \left(\frac{u3}{u4} \right). $
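A numpy sketch of Eqs. (13)-(18) is given below, assuming $a1$ and $a2$ are 0/1 masks and $TFR$ is the input time-frequency matrix; the small epsilon is an added safeguard against division by zero and is not part of the paper's formulas.

import numpy as np

def extraction_sir(a1, a2, TFR, eps=1e-12):
    # a1: binary mask after interference removal; a2: binary mask of the
    # theoretical vocal singing waveform; TFR: input time-frequency matrix.
    a1 = np.asarray(a1, dtype=float)
    a2 = np.asarray(a2, dtype=float)
    u1 = np.sum(a1 * a2 * TFR)             # Eq. (13): energy of correctly extracted points
    u2 = np.sum((a1 - a1 * a2) * TFR)      # Eq. (14): residual interference energy
    u3 = np.sum(a2 * TFR)                  # Eq. (16): theoretical waveform energy
    u4 = np.sum((1.0 - a2) * TFR)          # Eq. (17): background energy of the input
    Q1 = 10.0 * np.log10(u1 / (u2 + eps))  # Eq. (15): SIR after extraction
    Q2 = 10.0 * np.log10(u3 / (u4 + eps))  # Eq. (18): SIR of the input
    return Q1, Q2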

In order to illustrate the performance of the three extraction methods, a corresponding simulation environment is constructed, in which the simulation time is 0.4 s, the sampling frequency is 16 kHz, and the frequency set of the vocal singing waveform $s_{1} (t)$ is $[4300$, $4600$, $4900$, $5200$, $5500$, $5900$, $6200$, $6500]$ Hz, and the hop speed is 20 hop/s. The frequency set of the vocal singing waveform $s_{2} (t)$ is $[700$, $1000$, $1300$, $1700$, $2000]$ Hz, and the hop speed is 12.5 hop/s. The frequencies of the two fixed-frequency signals are 2800 Hz and 3200 Hz, respectively. The frequency range of the swept interference signal is $[7000$-$7800]$ Hz. The burst signal is $[0$-$8000]$ Hz, and eight segments are randomly selected.

The effect of the proposed deep neural network-based method for extracting the main features of vocal singing is evaluated, and the results are shown in Table 1.

Table 1. Evaluation of the effect of the main feature extraction method in vocal singing based on deep neural network.

Number | Feature extraction | Number | Feature extraction | Number | Feature extraction
1 | 92.17 | 15 | 88.13 | 29 | 87.48
2 | 89.64 | 16 | 92.35 | 30 | 87.44
3 | 88.24 | 17 | 87.01 | 31 | 91.97
4 | 91.67 | 18 | 88.19 | 32 | 89.22
5 | 92.65 | 19 | 88.62 | 33 | 92.69
6 | 89.63 | 20 | 87.76 | 34 | 90.98
7 | 89.58 | 21 | 88.84 | 35 | 90.11
8 | 88.61 | 22 | 88.92 | 36 | 88.84
9 | 87.47 | 23 | 89.78 | 37 | 92.41
10 | 90.05 | 24 | 90.54 | 38 | 92.75
11 | 88.00 | 25 | 87.97 | 39 | 87.11
12 | 91.04 | 26 | 92.20 | 40 | 90.08
13 | 87.36 | 27 | 87.91 | 41 | 89.48
14 | 87.64 | 28 | 92.64 | 42 | 88.17

4.2 Analysis and Discussion

As can be seen from Fig. 9, the proposed connected-domain labeling algorithm has the best extraction effect, followed by morphological filtering, with time-frequency cancellation last. The main reason is that when morphological filtering and time-frequency cancellation are used to extract the vocal singing waveform, much of the waveform's energy is lost, resulting in a lower signal-to-interference ratio after extraction. When the vocal singing waveform is extracted with the connected-domain labeling algorithm, the energy loss is small, so the signal-to-interference ratio after extraction is relatively high. The higher the signal-to-interference ratio, the better the suppression of interference.

It can be seen from Table 1 that the main feature extraction method for vocal singing based on a deep neural network proposed in this paper can play an important role in extracting the main features of vocal music.

The experiment in this article achieved a certain recognition rate. Overall, the characteristic parameters of vocal signals change constantly over time, making them non-stationary and impossible to analyze with techniques intended for stationary signals. However, the vocals in a song are produced by the movement of the human oral muscles, which is relatively slow and lagging; at the same time, the accompaniment is produced by instruments such as the piano and guitar through percussion and string vibration, so it also has a certain lag. From this perspective, although the vocal signal is time-varying, a sufficiently short time interval of the signal can still be regarded as a quasi-static process, indicating that the vocal signal has short-term stationarity.

This paper studies and verifies time-frequency denoising and proposes a new denoising threshold from the perspectives of complexity and engineering practicality, which can be better applied to engineering and joint time-frequency denoising. After denoising, this paper compares direct time-frequency map noise removal with morphological filtering for noise removal. Furthermore, considering the computational complexity and completeness of extraction, this paper proposes a method that removes interference by applying connected-domain labeling. The quantitative evaluation results of the deep neural network-based main feature extraction method in vocal singing are distributed between $[87$, $92]$, indicating that the proposed method achieves high accuracy in main feature extraction for vocal singing.

Fig. 9. Extraction curve of vocal singing waveform.


5. Conclusion

As an abstract subject, vocal music not only requires mastery of a variety of basic knowledge, but also requires the learner to integrate it well under the influence of psychological, physiological, cultural, and other factors. It is not simply singing on a stage, but an intuitive expression of people's thoughts, culture, and art; only by deeply understanding its meaning can the charm of vocal singing be better displayed on stage. The innovation of this paper is that, starting from a dialectical study of the relationship between vocal singing feature recognition and machine learning, and from the perspectives of autonomous aesthetics, heteronomous aesthetics, and music aesthetics, it dialectically analyzes the relationship between vocal singing skills and feature expression. This article combines deep neural networks to construct a main feature extraction model for vocal singing and uses intelligent methods to analyze the vocal singing waveform. The research results indicate that the proposed deep neural network-based method for extracting the main features of vocal singing is effective and can play a positive role in practice.

This article proposes an improved algorithm for parameter estimation of frequency-hopping signals after removing noise and interference, but it is still affected by the signal-to-noise ratio. A low signal-to-noise ratio means that the obtained time-frequency map may be fragmented. Although morphological filtering can be applied to complete the fragmented signal, this causes deviations in the horizontal and vertical coordinates when the center of gravity of the connected domain is later extracted, so targeted improvements need to be made in this area in the future.

REFERENCES

1 
T. Magnusson, ``Musical organics: A heterarchical approach to digital organology,'' Journal of New Music Research, vol. 46, no. 3, pp. 286-303, 2017.DOI
2 
R. H. Jack, A. Mehrabi, T. Stockman, and A. McPherson, ``Action-sound latency and the perceived quality of digital musical instruments: Comparing professional percussionists and amateur musicians,'' Music Perception: An Interdisciplinary Journal, vol. 36, no. 1, pp.109-128, 2018.DOI
3 
F. Calegario, M. M. Wanderley, S. Huot, G. Cabral, and G. Ramalho, ``A method and toolkit for digital musical instruments: Generating ideas and prototypes,'' IEEE MultiMedia, vol. 24, no. 1, pp. 63-71, 2017.DOI
4 
D. Tomašević, S. Wells, I. Y. Ren, A. Volk, and M. Pesek, ``Exploring annotations for musical pattern discovery gathered with digital annotation tools,'' Journal of Mathematics and Music, vol. 15, no. 2, pp. 194-207, 2021.DOI
5 
X. Serra, ``The computational study of a musical culture through its digital traces,'' Acta Musicologica, vol. 89, no. 1, pp. 24-44, 2017.URL
6 
I. B. Gorbunova and N. N. Petrova, ``Digital sets of instruments in the system of contemporary artistic education in music: Socio-cultural aspect,'' Journal of Critical Reviews, vol. 7, no. 19, pp. 982-989, 2020.URL
7 
A. C. Tabuena, ``Chord-interval, direct-familiarization, musical instrument digital interface, circle of fifths, and functions as basic piano accompaniment transposition techniques,'' International Journal of Research Publications, vol. 66, no. 1, pp. 1-11, 2020.URL
8 
L. Turchet and M. Barthet, ``An ubiquitous smart guitar system for collaborative musical practice,'' Journal of New Music Research, vol. 48, no. 4, pp. 352-365, 2019.DOI
9 
R. Khulusi, J. Kusnick, C. Meinecke, C. Gillmann, J. Focht, and S. Jänicke, ``A survey on visualizations for musical data,'' Computer Graphics Forum, vol. 39, no. 6, pp. 82-110, 2020.DOI
10 
E. Cano, D. FitzGerald, A. Liutkus, M. D. Plumbley, and F. R. Stöter, ``Musical source separation: An introduction,'' IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 31-40, 2018.DOI
11 
T. Magnusson, ``The migration of musical instruments: On the socio-technological conditions of musical evolution,'' Journal of New Music Research, vol. 50, no. 2, pp. 175-183, 2021.DOI
12 
I. B. Gorbunova and N. N. Petrova, ``Music computer technologies, supply chain strategy and transformation processes in socio-cultural paradigm of performing art: Using digital button accordion,'' International Journal of Supply Chain Management, vol. 8, no. 6, pp. 436-445, 2019.DOI
13 
J. A. A. Amarillas, ``Marketing musical: Música, industria y promoción en la era digital,'' INTERdisciplina, vol. 9, no. 25, pp. 333-335, 2021.URL
14 
G. Scavone and J. O. Smith, ``A landmark article on nonlinear time-domain modeling in musical acoustics,'' The Journal of the Acoustical Society of America, vol. 150, no. 2, 2021.DOI
15 
L. Turchet, T. West, and M. M. Wanderley, ``Touching the audience: Musical haptic wearables for augmented and participatory live music performances,'' Personal and Ubiquitous Computing, vol. 25, no. 4, pp. 749-769, 2021.DOI
16 
L. C. Way, ``Populism in musical mash ups: Recontextualising Brexit,'' Social Semiotics, vol. 31, no. 3, pp. 489-506, 2021.DOI
17 
K. Stensæth, ``Music therapy and interactive musical media in the future: Reflections on the subject-object interaction,'' Nordic Journal of Music Therapy, vol. 27, no. 4, pp. 312-327, 2018.DOI
18 
C. Michalakos, ``Designing musical games for electroacoustic improvisation,'' Organised Sound, vol. 26, no. 1, pp. 78-88, 2021.DOI
19 
A. Amendola, G. Gabbriellini, P. Dell’Aversana, and A. J. Marini, ``Seismic facies analysis through musical attributes,'' Geophysical Prospecting, vol. 65, no. S1, pp. 49-58, 2017.DOI
20 
M. J. Hasan, J. Uddin, and S. N. Pinku, ``A novel modified SFTA approach for feature extraction,'' Proc. of 2016 3rd International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), Dhaka, Bangladesh, pp. 1-5, 2016.DOI
21 
J. Hasan, M. Sohaib, and J.-M. Kim, ``An explainable AI-based fault diagnosis model for bearings,'' Sensors, vol. 21, no. 12, 4070, 2021.DOI
22 
J. Hasan and J.-M. Kim, ``Bearing fault diagnosis under variable rotational speeds using stockwell transform-based vibration imaging and transfer learning,'' Applied Sciences, vol. 8, no. 12, 2357, 2018.DOI

Author

Xujun Zheng

Xujun Zheng is a lecturer in music performance from Zhengzhou Technology and Business University in Zhengzhou, China. He holds a bachelor's degree in art from Sichuan Conservatory of Music, as well as his master's and doctoral degrees in art from Russian Herzen University. He has been engaged in vocal teaching research for more than ten years and has published several articles in international journals. His main research direction is pop music singing, arrangement, and production. For several years, he has adhered to the concept of independently completing lyrics, music, and composition, and is committed to creating his own unique style of Chinese pop music.