3.1 Time-frequency graph denoising
In a noisy environment, extracting multiple vocal singing waveforms first requires a
denoising threshold for the joint time-frequency analysis, so that the thresholded
time-frequency map suppresses noise interference and yields a clear time-frequency
distribution.
The received multi-vocal singing waveform is first transformed with the STFT to obtain
the time-frequency matrix $STFT(m,n)$, where $m\in [1, M]$ and $n\in [1, N]$. The
denoising threshold of the time-frequency map is then given by formula (1) [20].
When the weight reaches its critical value, the energy distribution of the entire
time-frequency matrix changes from a rapid decline to a slow decline. The curve
therefore has a turning point, and the weight at this turning point is the ideal
weight.
The specific steps of energy threshold denoising are as follows:
(1) First, the algorithm steps the weight $w$ through the interval $[1, 10]$ in
increments of $0.1$ and obtains the discrete threshold values $\text{threshold}(i)$
from formula (1), where $1\le i\le 99$;
(2) Second, the algorithm applies the short-time Fourier transform (STFT) to obtain
the time-frequency map and, for each weight $w$, counts the number of time-frequency
points in the matrix that exceed the corresponding threshold, denoted $c(i)$ [21].
This gives the statistical curve of the energy distribution over the thresholds,
shown in Fig. 1.
(3) The algorithm takes the second-order difference of $c(i)$ and finds the first
point at which it is closest to zero. The weight at that point is the critical value
of the turning point.
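The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the method of [20] or [21]. Formula (1) is not reproduced here, so the sketch assumes that each threshold is the weight multiplied by the mean energy of the matrix. It also reads "closest to zero for the first time" as the first second-order difference whose magnitude falls below a small fraction `tol` of the largest one. The function names and the parameter `tol` are hypothetical.

```python
def count_above(energy, threshold):
    """Count the time-frequency points whose energy exceeds the threshold."""
    return sum(1 for row in energy for e in row if e > threshold)


def second_difference(c):
    """Second-order difference c[i+2] - 2*c[i+1] + c[i]."""
    return [c[i + 2] - 2 * c[i + 1] + c[i] for i in range(len(c) - 2)]


def first_flat_index(counts, tol=0.05):
    """Index into counts where the second difference first becomes nearly zero."""
    d2 = second_difference(counts)
    peak = max((abs(v) for v in d2), default=0)
    for i, v in enumerate(d2):
        if abs(v) <= tol * peak:
            return i + 1  # centre of the three-point stencil
    return None  # no sufficiently flat point found


def turning_weight(energy, weights, tol=0.05):
    """Return the weight at the turning point of the count curve c(i)."""
    flat = [e for row in energy for e in row]
    base = sum(flat) / len(flat)  # ASSUMPTION: base level is the mean energy
    counts = [count_above(energy, w * base) for w in weights]
    i = first_flat_index(counts, tol)
    return None if i is None else weights[i]


# Weights from 1 to 10 in steps of 0.1, as in step (1).
weights = [round(1 + 0.1 * k, 1) for k in range(91)]
```

In practice `energy` would hold the squared magnitudes of $STFT(m,n)$, and a smoothed count curve would make the second-order difference less sensitive to small fluctuations.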
Fig. 1 shows the energy distribution of the time-frequency map under Gaussian noise
as a function of the weight $w$. The turning point is clearly visible. The weight at
this turning point is the optimal denoising weight, and the interception threshold
set by this weight effectively removes unwanted noise interference from the
time-frequency map.
It can be seen from Fig. 2 that the noise exhibits a higher peak value and a lower relative power, while the
signal has a lower peak value and a higher relative power [22].
This paper proposes a simple and effective adaptive dynamic threshold method. The
preceding analysis shows that noise is distributed fairly uniformly and diffusely
over the time-frequency matrix. In contrast, because multiple vocal singing waveforms
have fixed frequency sets and are continuous in time, they are sparse in the
time-frequency matrix and their energy is highly concentrated. The method exploits
this difference as follows. The elements of the time-frequency matrix are sorted in
ascending order, and the mean of the first 20% of the data (a proportion chosen
empirically) is computed. The global maximum of the time-frequency matrix is then
taken, together with the mean of the global maximum and the global minimum as a
correction term. Finally, the three values are weighted $1:2:1$ to obtain the new
threshold.
The definition formula is:
In the formula, $STFT_{{\rm max}} $ is the global maximum value, $\mu _{20\% } $ is
the mean value of the first 20% of the time-frequency matrix after sorting in
ascending order, and $D$ is the correction mean. This dynamic threshold formula has
been shown to retain a good interception effect even at lower signal-to-noise ratios.
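The construction of the dynamic threshold can be sketched as follows. The text does not state which of the three quantities receives the weight $2$, nor how the weighted sum is normalized. The sketch therefore makes two assumptions: the weights $1:2:1$ apply to $STFT_{\rm max}$, $\mu_{20\%}$ and $D$ in that order, and the sum is divided by the total weight. The function names and the `weights` parameter are hypothetical, and the assignment can be changed through that parameter.

```python
def adaptive_threshold(tfr, fraction=0.2, weights=(1, 2, 1)):
    """Dynamic threshold from the max, the low-end mean and the correction mean."""
    values = sorted(v for row in tfr for v in row)  # ascending order
    k = max(1, int(len(values) * fraction))         # size of the first 20%
    mu_low = sum(values[:k]) / k                    # mean of the first 20%
    v_max, v_min = values[-1], values[0]
    d = (v_max + v_min) / 2                         # correction mean D
    w_max, w_mu, w_d = weights                      # ASSUMED order of the 1:2:1 weights
    # ASSUMPTION: the weighted sum is normalized by the total weight.
    return (w_max * v_max + w_mu * mu_low + w_d * d) / (w_max + w_mu + w_d)


def apply_threshold(tfr, threshold):
    """Keep points at or above the threshold and set the rest to zero."""
    return [[v if v >= threshold else 0 for v in row] for row in tfr]
```

For a matrix holding the values $1$ to $10$, the first 20% are $1$ and $2$, so $\mu_{20\%}=1.5$, $D=5.5$, and under the assumed assignment the threshold is $(10+2\cdot 1.5+5.5)/4=4.625$.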
To observe the denoising effect, two frequency sets are used. The frequency set of
the vocal singing waveform $s_{1} (t)$ is $[4300, 4600, 4900, 5200, 5500, 5900, 6200,
6500]$ Hz, with a hop rate of $20$ hop/s. The frequency set of the vocal singing
waveform $s_{2} (t)$ is $[700, 1000, 1300, 1700, 2000]$ Hz, with a hop rate of $12.5$
hop/s. The sampling rate is $16$ kHz, and the total simulation time is $0.4$ s. The
noise is Gaussian, the STFT is used for the time-frequency transformation, and the
signal-to-noise ratio is $0$ dB.
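A test signal with these parameters can be generated along the following lines. The sketch assumes unit-amplitude sinusoids, a hop order that simply follows the listed frequency sets, and an SNR measured against the sum of the two waveforms. None of these details is stated in the text, and the function names are hypothetical.

```python
import math
import random

FS = 16_000      # sampling rate in Hz
DURATION = 0.4   # total simulation time in s


def hopping_tone(freqs, hop_rate, fs=FS, duration=DURATION):
    """Unit-amplitude tone stepping through freqs at hop_rate hops per second."""
    n_total = round(fs * duration)
    samples_per_hop = fs / hop_rate
    out = []
    for n in range(n_total):
        hop = int(n / samples_per_hop)
        f = freqs[hop % len(freqs)]  # ASSUMPTION: sequential hop order
        out.append(math.sin(2 * math.pi * f * n / fs))
    return out


def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise at the requested SNR in dB."""
    rng = random.Random(seed)
    power = sum(x * x for x in signal) / len(signal)
    sigma = math.sqrt(power / 10 ** (snr_db / 10))
    return [x + rng.gauss(0.0, sigma) for x in signal]


s1 = hopping_tone([4300, 4600, 4900, 5200, 5500, 5900, 6200, 6500], 20)
s2 = hopping_tone([700, 1000, 1300, 1700, 2000], 12.5)
received = add_noise([a + b for a, b in zip(s1, s2)], snr_db=0)
```

With these settings each hop of $s_{1}(t)$ lasts $0.05$ s and each hop of $s_{2}(t)$ lasts $0.08$ s, so both waveforms visit every frequency in their set exactly once within the $0.4$ s record.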
As can be seen from Fig. 3, energy threshold denoising, histogram denoising and improved threshold denoising
all remove noise well. To evaluate the three denoising algorithms quantitatively, the
signal point detection rate (SPDR) is used as the measurement index, defined as
In the formula, AS is the number of signal points after processing, and BS is the
number of signal points before processing. The SPDR reflects the validity and
stability of the extracted signal. In addition, the computational complexity is used
to assess the engineering practicality of each method.
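As an illustration, the SPDR can be computed from two binary masks of the time-frequency map. The sketch assumes that the index is the ratio AS/BS and that AS counts the signal points that survive denoising. Both readings are inferred from the definitions of AS and BS rather than stated in the text, and the function name is hypothetical.

```python
def spdr(before_mask, after_mask):
    """Signal point detection rate, ASSUMED here to be AS / BS."""
    # BS: signal points present before processing.
    bs = sum(1 for row in before_mask for v in row if v)
    # AS: signal points still present after processing.
    as_ = sum(
        1
        for row_b, row_a in zip(before_mask, after_mask)
        for b, a in zip(row_b, row_a)
        if b and a
    )
    if bs == 0:
        raise ValueError("no signal points before processing")
    return as_ / bs


before = [[1, 1, 0], [0, 1, 1]]
after = [[1, 0, 0], [0, 1, 1]]
rate = spdr(before, after)  # 3 of the 4 signal points survive
```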
From Fig. 4(a), it can be seen that at low SNR the signal point detection rate of the improved threshold
denoising is higher than that of energy threshold denoising and histogram denoising.
At high SNR its signal point detection rate is lower, but its overall denoising
performance remains relatively stable.
To compare the computational complexity of the three methods more intuitively, the
trend of the complexity with the total number of sampling points is shown in Fig. 4(b).
It can be seen from Fig. 4(b) that the improved threshold denoising provides more stable extraction and more
stable time-frequency focusing than histogram denoising and energy threshold
denoising. Moreover, its computational complexity is low.
Fig. 1. Energy distribution of the time-frequency matrix under different weights.
Fig. 2. Histogram statistics of time-frequency graph.
Fig. 3. Original time-frequency diagram and time-frequency diagram after denoising.
Fig. 4. Signal point detection rate and complexity.
3.2 Overview of waveform processing methods for vocal singing
To describe morphological filtering concretely, the vocal singing waveform to be
processed is defined as $A$, the predefined structuring element is $B$, and both
sets belong to $E^{N} $.
(1) Dilation operation
Dilation is a morphological transformation that combines two sets using vector
addition of set elements. $A$ and $B$ are sets in the $N$-dimensional space $E^{N}$
whose elements are the $N$-tuples $a=(a_{1}, \ldots, a_{N})$ and $b=(b_{1}, \ldots,
b_{N})$, respectively. The dilation of $A$ by $B$ is then the set of all possible
vector sums of pairs of elements, one taken from $A$ and the other from $B$.
Definition 1: If $A$ and $B$ are both subsets of $E^{N} $, the dilation of $A$ by
$B$ is denoted $A\oplus B$ and defined as
$$A\oplus B=\{c\in E^{N} \mid c=a+b \text{ for some } a\in A \text{ and } b\in B\}$$
(2) Erosion operation
Erosion is the morphological dual of dilation. It erodes the elements of the vocal
singing waveform by combining the two sets using vector subtraction of set elements.
If $A$ and $B$ are sets in $N$-dimensional Euclidean space, then the erosion of $A$
by $B$ is the set of all elements $x$ such that $x+b \in A$ for every $b \in B$.
Some vocal waveform processors use the name shrink or reduce instead of erosion.
Definition 2: The erosion of $A$ by $B$ is denoted $A \Theta B$ and is defined as
follows:
$$A\,\Theta\,B=\{x\in E^{N} \mid x+b\in A \text{ for every } b\in B\}$$
(3) Open operation
In practice, dilation and erosion are usually used in pairs: either dilation of the
vocal waveform followed by erosion of the result, or erosion of the vocal waveform
followed by dilation. In both cases, the repeated application of dilation and erosion
removes vocal waveform details smaller than the structuring element without globally
distorting the geometry of the features that remain. The open operation combines the
two in the second order: erosion followed by dilation.
Definition 3: The opening of $A$ by $B$ is denoted $A\circ B$ and is defined as
follows:
$$A\circ B=(A\,\Theta\,B)\oplus B$$
(4) Close operation
Definition 4: The closing of $A$ by $B$ is denoted $A\bullet B$ and is defined as
follows:
$$A\bullet B=(A\oplus B)\,\Theta\,B$$
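The four operations can be illustrated directly on finite point sets. The sketch below represents $A$ and $B$ as Python sets of integer coordinate tuples and uses the standard set-theoretic definitions of dilation, erosion, opening and closing. The function names are chosen for illustration, and $B$ is assumed to be non-empty.

```python
from itertools import product


def dilation(a, b):
    """All vector sums of one element from a and one element from b."""
    return {tuple(x + y for x, y in zip(p, q)) for p, q in product(a, b)}


def erosion(a, b):
    """All points x such that x + q lies in a for every q in b."""
    # Any valid x must equal p - q for some p in a and q in b.
    candidates = {tuple(x - y for x, y in zip(p, q)) for p, q in product(a, b)}
    return {
        x
        for x in candidates
        if all(tuple(u + v for u, v in zip(x, q)) in a for q in b)
    }


def opening(a, b):
    """Erosion followed by dilation: removes details smaller than b."""
    return dilation(erosion(a, b), b)


def closing(a, b):
    """Dilation followed by erosion: fills gaps smaller than b."""
    return erosion(dilation(a, b), b)


B = {(0, 0), (0, 1)}                        # two-point structuring element
A = {(0, 0), (0, 1), (0, 2), (3, 3)}        # a short run plus an isolated point
opened = opening(A, B)                      # the isolated point (3, 3) is removed
closed = closing({(0, 0), (0, 2)}, B)       # the gap at (0, 1) is filled
```

On a binarized time-frequency map, the first coordinate can be read as the frequency index and the second as the time index, so this structuring element favors components that persist over at least two consecutive time points.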
The above are the four basic operations of mathematical morphology. Morphological
filtering applies the opening and closing operations to the target vocal singing
waveform $A$ with a suitable structuring element $B$. It is denoted $A\odot B$, and
the relevant definition formula is
After the target vocal singing waveform is morphologically filtered, isolated
discrete points are eliminated, sharp protrusions are smoothed, and broken or
partially missing parts are reconnected. Because the binarized time-frequency map
satisfies the conditions for morphological filtering, filtering the time-frequency
map in this way can eliminate various interferences and smooth the vocal singing
waveform to a certain extent.
When processing vocal singing waveforms, common signal interference arises in
addition to the usual noise. The main interference signals are fixed-frequency,
frequency-sweep and burst signals, each of which is briefly described below. The
time-frequency transform used in the analysis is a threshold-adaptive STFT-SPWVD,
because other time-frequency analyses are less effective when multiple signals
coexist.
(1) Fixed frequency interference
A fixed-frequency signal is a communication signal whose carrier frequency does not
change while the receiver receives it. It is continuous in time and has a certain
bandwidth, so in a time-frequency analysis it appears as a straight line parallel to
the time axis. Fig. 5(a) shows the time-frequency diagram of two fixed-frequency signals: the frequency is
fixed and the signal is continuous in time.
(2) Frequency sweep interference
A frequency-sweep signal is a signal whose frequency changes over time, usually
occupying a relatively large bandwidth. Fig. 5(b) shows the time-frequency diagram of the frequency-sweep signal:
its frequency varies linearly with time, occupies a large frequency range, and is
continuous in time.
(3) Burst interference
A burst signal is a signal whose frequency changes randomly with time. Each frequency
occupies a fixed time slot, but its duration is very short, usually less than the hop
period of the vocal singing waveform. Fig. 5(c) shows the time-frequency diagram of an 8-segment burst signal:
the signal is intermittent and its frequency changes randomly over the entire
observation period.
In a real environment, the received signal generally contains other signals. To
illustrate scenarios in which various interference signals coexist with multiple
vocal singing waveforms, a corresponding simulation environment is constructed.
The simulation duration is $0.4$ s and the sampling frequency is $16$ kHz. The
frequency set of the vocal singing waveform $s_{1} (t)$ is $[4300, 4600, 4900, 5200,
5500, 5900, 6200, 6500]$ Hz, with a hop rate of $20$ hop/s. The frequency set of the
vocal singing waveform $s_{2} (t)$ is $[700, 1000, 1300, 1700, 2000]$ Hz, with a hop
rate of $12.5$ hop/s. The frequencies of the two fixed-frequency signals are $2800$
Hz and $3200$ Hz, respectively. The swept interference signal covers $[7000, 7800]$
Hz. The burst signal lies in $[0, 8000]$ Hz, with eight segments selected at random.
The noise is white Gaussian noise of $10$ dB. Fig. 6 shows the STFT-SPWVD time-frequency diagram of the
superposition of the multiple vocal singing waveforms, the interference signals and
the noise.
A commonly used method for eliminating interference is power spectrum cancellation.
It extracts the vocal singing waveform by exploiting the different forms that the
power of different signals takes. The main steps of the extraction are as follows:
(1) First, the algorithm applies the time-frequency transform to the received signal
to obtain the time-frequency analysis matrix $TFR_{s} (m,n)$. With the segment length
chosen as $W$ according to the actual situation, the time-frequency spectrum is
obtained as:
Here, $m$ is the time sampling index and $n$ is the frequency sampling index.
(2) The algorithm sums the time-frequency matrix of the received signal along the
time axis to obtain the frequency-dependent power spectrum, and then averages it.
The average power spectrum is obtained as
(3) Subtracting the average power spectrum from the time-frequency spectrum obtained
in step (1) gives
(4) Inspection of the resulting time-frequency matrix shows that $P_{sub}
(m,n)$ has many elements less than 0, so the time-frequency matrix after cancellation
is obtained by thresholding it at 0.
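Steps (2) to (4) can be sketched as follows for a time-frequency matrix stored as a list of rows, with one row per time index $m$ and one column per frequency index $n$. For simplicity the sketch treats the whole record as a single segment, which is an assumption: the segment length $W$ from step (1) is not modelled. The function name is hypothetical.

```python
def power_spectrum_cancellation(tfr):
    """Subtract the time-averaged power per frequency bin and clip at zero."""
    n_time = len(tfr)
    n_freq = len(tfr[0])
    # Step (2): average along the time axis for each frequency bin.
    avg = [sum(tfr[m][n] for m in range(n_time)) / n_time for n in range(n_freq)]
    # Steps (3) and (4): subtract the average and use 0 as the threshold.
    return [
        [max(0.0, tfr[m][n] - avg[n]) for n in range(n_freq)]
        for m in range(n_time)
    ]


# Column 0 varies over time, column 1 is constant like a fixed-frequency signal.
tfr = [[1, 5], [3, 5], [2, 5]]
cancelled = power_spectrum_cancellation(tfr)
```

In this example the constant column is removed entirely, while the time-varying column keeps only the part that exceeds its own average. This shows why the method suppresses components that are continuous in time.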
Fig. 5. Signal time-frequency diagram.
Fig. 6. Time-frequency diagram of signal in complex environment.