Image Recognition Processing Technology Based on Virtual Reality Technology and Adaptive
Feature Fusion
Daogui Lin 1,2
1 (The Internet of Things and Artificial Intelligence College, Fujian Polytechnic of Information Technology, Fuzhou, 350003, China)
2 (Faculty of Computing and Informatics, University Malaysia Sabah, Sabah, 88300, Malaysia)
Copyright © 2025 The Institute of Electronics and Information Engineers
Keywords
Virtual reality, Image recognition, AWF, YOLOv4, SIoU loss function
1. Introduction
With the gradual advancement and maturation of computer and simulation technology, virtual reality technology (VRT) has driven rapid progress in many industries, such as health care, education, and entertainment. VRT allows people to experience a more realistic virtual world and is applied not only in education, entertainment, and medical fields, but also in military and simulation applications. Virtual environment construction refers to the use of computer technology to combine the physical environment of the real world with the virtual world to build a virtual world. It can be divided into two main aspects: physical simulation and image recognition (IR) technology. Physical simulation constructs a virtual world by simulating physical phenomena in the real world. IR uses computer technology to compare input images with pre-trained models, identify features in the images, and construct the virtual world. IR directly influences the application: higher accuracy in IR results in higher accuracy in detecting objects.
Traditional image processing techniques have high requirements for datasets and professional knowledge [1-3]. Traditional IR processing techniques have strong interpretability and high accuracy in identifying and detecting practical problems with small data scales. However, feature extraction and selection depend heavily on manual design, and manually designed features struggle to capture high-level semantic features. Meanwhile, such techniques have low stability in multi-objective complex environments [4,5]. In contrast, one-stage object detection algorithms such as the single-shot multibox detector (SSD) and YOLO have strong advantages in the field of IR processing. This research designs an IR method for the VRT setting that is based on YOLOv4 and optimized through an adaptive weighted fusion (AWF) module, namely the AWF-YOLOv4 object detection algorithm. The contribution of the research is the proposal of the AWF-YOLOv4 object detection algorithm, which improves the accuracy and stability of image recognition, addresses the challenges of image recognition in virtual reality applications, and optimizes the user experience in virtual environments. In turn, more accurate and efficient image recognition technology enhances the overall effectiveness and interactive experience of VRT applications.
The research is organized into four parts. The first part reviews the literature on IR, adaptive features, and VRT. The second part proposes the IR method based on the AWF-YOLOv4 object detection algorithm. The third part analyzes the application effect of the proposed IR method. The fourth part summarizes the research results and discusses the limitations and future directions of the research.
2. Related Works
The utilization of VRT in image recognition can achieve effective interaction between virtual and real environments, and has been applied in cultural heritage protection and medical and health care. As a key technology for applying VRT, image recognition has been studied by many researchers. Li et al. analyzed the change detection of remote
sensing images under adverse weather conditions and proposed a deep translation image
change detection network. The network converts images from one domain to another through
a cyclic structure. The results of dataset testing showed that the detection method
of this image recognition is robust and effective [6]. Chen et al. proposed an image recognition processing method on the grounds of deep
neural networks and utilized it in the field of video image processing in robot vision
tasks. The test set validation showed that the neural network structure proposed by
this algorithm outperformed many image processing methods in both quantitative and
qualitative measurements. This method not only eliminated obvious image noise data,
but also preserved the true situation of medical images well [7]. Esfahlani et al. proposed a drone robot operating system and computer image recognition,
which considered the temporal changes in fire intensity, motion attributes, colors,
etc., and utilized state-of-the-art indoor and outdoor synchronous positioning and
drawing. This system could achieve inter-frame motion estimation, avoiding problems such
as motion failure caused by image data loss, and thus achieve the detection of dangerous
signals and abnormal data in natural environments [8]. Zheng et al. demonstrated accurate recognition of human activity behavior through a behavior recognition algorithm that fuses autoencoder and recurrent neural network features [9].
Zhang et al. proposed a dual mode physiological fusion feature sleep recognition technique,
which considered the correlation between each feature and category. Through 10-fold cross-validation experiments, the results proved the high recognition rate of this method [10]. Gou et al. analyzed real-time disaster detection technology based on end-to-end models. The model included an attention fusion module, a convolutional neural network feature extraction module, and a maximum mean discrepancy domain adaptation module. The experimental results verified its superior performance [11]. Zhao et al. proposed a deep feature fusion algorithm for multimodal data and applied
it to the auxiliary diagnosis of infectious respiratory diseases. This model could
achieve automated and intelligent diagnosis of infectious respiratory diseases [12]. Wang et al. proposed a multi-feature, multimodal biometric recognition technique, which identified the essential manifold structure of quaternion fusion features. The experimental results showed that, compared to other feature fusion algorithms, this method had better performance [13]. Zhou et al. proposed a feature fusion algorithm based on Bayesian decision
theory to identify radar deception interference signals. This algorithm could identify
radar deception interference and had high recognition accuracy [14]. Siriwardhana et al. designed a multimodal emotion recognition algorithm based on self-supervised feature fusion, which incorporated an attention fusion mechanism. The benchmark test results verified the robustness of the model [15]. To deal with the declining accuracy of target tracking methods in thermal infrared tracking scenarios, Yuan et al. designed an adaptive multi-feature fusion model, which could adaptively integrate deep convolutional neural network features and handcrafted features, and adopted a model update strategy to adaptively track target changes. The results showed that this method significantly improved the tracking effect [16]. In view of the complexity and time-variability of large-scale multivariate time series data, Xia et al. proposed a hybrid feature adaptive fusion network and adopted an attention mechanism to deal with redundancy and conflicts between different scales, so as to recognize and classify features. The results showed that the accuracy of this method reached 96%, and the performance was significantly improved [17].
Based on the research results of domestic and foreign scholars, it can be observed that many object detection algorithms have achieved notable results in various fields. However, there is still significant room for improvement in detection accuracy (DA) and detection speed. This research therefore builds on VRT and proposes the AWF-YOLOv4 object detection algorithm, aiming to contribute to the improvement of image processing technology.
3. Image Recognition for AWF-YOLOv4 Object Detection Algorithm
For feature fusion, many object detection algorithms use feature pyramid networks (FPN) or path aggregation networks (PAN) in the neck network [18]. While these methods achieve cross-scale feature fusion, they fail to account for the disparities in information content between features of the same scale at different stages, and they do not address the information loss caused by semantic discrepancies. Therefore, the research proposes the AWF-YOLOv4 object detection algorithm, which builds on the YOLOv4 object detection algorithm. It applies AWF to pairs of feature maps and adds a cross-stage fusion path to the FPN feature fusion network to prevent the loss of feature-layer information in the feature fusion stage. The fused feature network is then combined with the basic object detection algorithm.
Fig. 1. AWF for independent convolution and shared convolution.
3.1. AWF and Cross-scale and Cross-stage Feature Fusion Networks in Image Recognition
The residual blocks in the backbone network of the object detection algorithm do not consider the differences in semantic levels of features during the feature fusion stage. To make more efficient use of the residual blocks, this study embeds AWF into the residual blocks to achieve adaptive weighted fusion of features, avoiding treating features that cover different spatial information equally. Feature extraction thereby becomes more flexible and refined, strengthening the regions containing effective information while weakening the regions containing invalid information. Figs. 1(a) and 1(b) show the AWF with independent convolution and with shared convolution, respectively.
The AWF module has two variants, each consisting of three parts: compression, extraction, and allocation. The difference between the variants lies in the type of convolution used in the compression stage. The compression part processes the two input feature maps through $1 \times 1$ convolution, compressing the number of channels to a constant $T$ while extracting feature information; the compressed feature maps are then used to extract weight information. The extraction part first concatenates the two intermediate feature maps along the channel dimension to obtain a combined feature map, which contains information from both input feature maps (IFMs) and has $2T$ channels. A further $1 \times 1$ convolution then compresses the channel count of the combined feature map to 2, yielding a weight feature map whose two channels contain the spatial weight information of the two IFMs. The entire process uses $1 \times 1$ convolution because it does not change the spatial size of the IFMs, preserves all the information of the original feature maps, and passes the position information in the feature maps intact to the next layer. In addition, $1 \times 1$ convolution realizes a linear combination across multiple feature maps, so the output feature map integrates information from multiple channels and enriches the features extracted by the network. Since the two weight channels are obtained by convolving the same combined feature map, there is a certain dependency between them, which is used to control the weighted fusion of the two input feature maps. Finally, the weight values are mapped using Eq. (1).
In Eq. (1), the parameter values at the two channels are $\alpha_{i,j}$ and $\beta_{i,j}$, and the two spatial weight values are $\omega^\alpha_{i,j}$ and $\omega^\beta_{i,j}$, respectively. After the spatial weights are obtained, the allocation operation matches the obtained spatial weights with the original IFMs. By multiplying these weights with the corresponding IFMs, the dependency between the weights is passed to the IFMs and the correlation between the IFMs is established. Adding the IFMs that have thus been correlated with each other forms the final output feature map, realizing adaptive weighted fusion of features. The calculation formula for the weight feature map is Eq. (2). In Eq. (2), the IFMs are $C_1$ and $C_2$, the convolution operation is $Conv(\cdot)$, and the channel-wise concatenation is $Concat(\cdot)$. The calculation formula for the output feature map is Eq. (3).
In Eq. (3), the first and second channels of the weight feature map are $Z[0]$ and $Z[1]$, respectively.
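To make the three-part structure concrete, the following is a minimal PyTorch sketch of an AWF module consistent with the description above. The channel constant $T$, the use of a channel-wise softmax as the weight mapping of Eq. (1), and all class and parameter names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn


class AdaptiveWeightedFusion(nn.Module):
    """Sketch of AWF: compression, extraction, and allocation of two feature maps."""

    def __init__(self, channels: int, t: int = 16, shared: bool = False):
        super().__init__()
        # Compression: 1x1 convolutions squeeze each input to T channels.
        # The shared variant reuses a single convolution for both inputs.
        self.compress_a = nn.Conv2d(channels, t, kernel_size=1)
        self.compress_b = self.compress_a if shared else nn.Conv2d(channels, t, kernel_size=1)
        # Extraction: a 1x1 convolution maps the 2T-channel concatenation
        # to a 2-channel weight feature map.
        self.extract = nn.Conv2d(2 * t, 2, kernel_size=1)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        z = self.extract(torch.cat([self.compress_a(x1), self.compress_b(x2)], dim=1))
        # Eq. (1) is assumed here to normalize the two channels into weights
        # that sum to one at every spatial position (softmax over channels).
        w = torch.softmax(z, dim=1)
        # Allocation (Eq. (3)): weight each input map and add the results.
        return w[:, 0:1] * x1 + w[:, 1:2] * x2


# Example: fuse two 256-channel feature maps of identical spatial size.
if __name__ == "__main__":
    awf = AdaptiveWeightedFusion(channels=256, shared=True)
    a, b = torch.randn(1, 256, 52, 52), torch.randn(1, 256, 52, 52)
    print(awf(a, b).shape)  # torch.Size([1, 256, 52, 52])
```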
In the AWF structure with independent convolution, there are significant semantic differences between features at different levels. Each convolutional block only learns the features of a specific semantic level, which is beneficial for learning the distribution characteristics of semantic information at that level. Meanwhile, the convolution operations can reduce the semantic gap between features, enabling the learned features to be fused with subsequent features. Because of the semantic difference between the output features of the stacked layers and the identity mapping of the IFMs, this study also investigates the use of shared convolution to simultaneously learn the characteristics of the two feature maps and their spatial relationship [19,20]. This study embeds the AWF module into the residual block of the backbone network, and Figs. 2(a) and 2(b) show the original residual block and the improved residual block, respectively. The improved residual block feeds the output of the stacked layers and the identity mapping of the input features into the AWF module instead of adding them directly, and the AWF module extracts and assigns weights to the two feature maps to establish the correlation between them. The learning ability of the residual module is therefore further improved.
Fig. 2. Original residual block and improved residual block.
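The improved residual block of Fig. 2(b) can then be sketched as below: the stacked-layer output and the identity branch are passed to a two-input fusion module (such as the AWF sketch above) instead of being added directly. The layer configuration and activation are assumptions; when no fusion module is supplied, the block falls back to the original residual addition.

```python
from typing import Optional

import torch
import torch.nn as nn


class AWFResidualBlock(nn.Module):
    """Residual block whose skip addition is replaced by an adaptive fusion module."""

    def __init__(self, channels: int, fuse: Optional[nn.Module] = None):
        super().__init__()
        # Assumed stacked layers: a 1x1 and a 3x3 convolution with BN and activation.
        self.stack = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )
        # Any module taking (stack_output, identity) can be plugged in here,
        # e.g. the AdaptiveWeightedFusion sketch shown earlier.
        self.fuse = fuse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.stack(x)
        # Plain addition reproduces the original residual block of Fig. 2(a).
        return self.fuse(out, x) if self.fuse is not None else out + x
```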
The utilization of FPN and PAN in the neck feature fusion component of the YOLOv4 network enables feature fusion at different scales throughout the multi-scale feature fusion process. However, this approach does not cover the fusion of features across different stages at the same scale. Low-level features can
obtain rich semantic information from other feature layers during the feature fusion
process, but this information will lose the original spatial data during cascading
and convolution operations. Meanwhile, the GiraffeDet network adopts a combination
of cross-scale connections and skip layer connections in the neck network. This can
enable the model to interact between low-level information and high-level information,
with skip connections being the interconnection of features at the same scale at different
stages [21,22]. The study designs a cross-scale and cross-stage feature fusion network as shown
in Fig. 3. This network adds cross-stage fusion on the basis of cross-scale fusion structure,
allowing the feature layer to retain the original feature information while obtaining
information from other scale feature layers.
Fig. 3. Cross scale and cross-stage feature fusion network.
The cross-scale fusion path includes sampling, cascading, and convolution, the same as the original YOLOv4 fusion path. The cross-stage fusion path is a skip connection, which has a shorter path during backpropagation while retaining the original feature dimensions; it combines the same-scale features obtained after feature fusion with the features extracted from the backbone network. In addition, the skip connection does not introduce extra parameters or increase the computational complexity. The improved feature fusion network adds a cross-stage connection on the downsampling path of low-level features, which fuses the original information with the low-level features before they are propagated to the higher-level feature layers. This makes the fusion of lower- and higher-level features more comprehensive.
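The cross-stage path can be illustrated with the short sketch below, which adds the backbone feature of a given scale back onto the fused neck feature of the same scale. Element-wise addition is assumed here because the text states that the skip connection introduces no extra parameters or computation; the tensor shapes match the three neck scales quoted in the next subsection and are otherwise arbitrary.

```python
import torch


def cross_stage_fuse(backbone_feat: torch.Tensor, neck_feat: torch.Tensor) -> torch.Tensor:
    """Skip connection between same-scale features before and after cross-scale fusion."""
    assert backbone_feat.shape == neck_feat.shape, "skip connection needs matching shapes"
    return backbone_feat + neck_feat  # no learnable parameters, only an element-wise add


if __name__ == "__main__":
    for channels, size in [(256, 52), (512, 26), (1024, 13)]:
        pre = torch.randn(1, channels, size, size)   # feature extracted by the backbone
        post = torch.randn(1, channels, size, size)  # same-scale feature after cross-scale fusion
        print(cross_stage_fuse(pre, post).shape)
```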
3.2. Image Recognition for AWF-YOLOv4 Object Detection Algorithm
In the YOLO series of algorithms, YOLOv4 and YOLOv3 have similar structures, but YOLOv4 optimizes the network structure through training techniques built on the YOLOv3 algorithm. The detection approach of later versions of the YOLO series is similar to that of the YOLOv4 algorithm, which offers a high degree of customizability and excellent baseline performance. Fig. 4 shows the network structure of the AWF-YOLOv4 object detection algorithm, including the backbone network, neck network, and prediction network. This method enhances the network's ability to fuse and extract features on the basis of the YOLOv4 algorithm: an AWF module is introduced in the backbone network to enhance the learning ability of the residual blocks, and a cross-scale and cross-stage fusion network is adopted in the neck network to reduce information loss during the feature fusion process. This enables the neck network to better utilize feature information. The input image size used in this network structure is $416 \times 416$, and the backbone network consists of 5 large residual blocks. The AWF module is embedded in some of the residual blocks, and each residual block includes 2 branches, corresponding to the stacked layers of the residual block and the edge crossing the residual block. The numbers of residual blocks contained in the five large residual blocks are 1, 2, 8, 8, and 4, respectively. The output features of the last three large residual blocks are passed to the neck network for feature fusion, with dimensions of $52 \times 52 \times 256$, $26 \times 26 \times 512$, and $13 \times 13 \times 1024$, respectively. The neck network fuses the input features of different scales: high-level features containing rich semantic information are first passed to low-level features through a top-down path, and low-level features containing rich spatial information are then passed to high-level features through a bottom-up path, with each input feature having the same dimension as the corresponding final output feature. A cross-stage fusion path is then used to combine the features before and after fusion at the same scale. Finally, the features output by the neck network are fed into the prediction network, and the positions and categories in the image are detected through recognition and regression.
Fig. 4. The network structure of AWF-YOLOv4 object detection algorithm.
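As a quick check of the shape bookkeeping described above, the snippet below walks a $416 \times 416$ input through five large residual blocks containing 1, 2, 8, 8, and 4 residual units, each halving the resolution; the channel widths other than the three quoted neck inputs are assumed CSPDarknet-style values.

```python
# Verify that stages 3-5 produce the 52x52x256, 26x26x512, and 13x13x1024
# neck inputs named in the text.
input_size = 416
stage_units = [1, 2, 8, 8, 4]
stage_channels = [64, 128, 256, 512, 1024]  # assumed widths; only the last three are quoted

size = input_size
for i, (units, channels) in enumerate(zip(stage_units, stage_channels), start=1):
    size //= 2  # each large residual block downsamples by a factor of 2
    tag = " -> neck input" if i >= 3 else ""
    print(f"stage {i}: {units} residual units, {size}x{size}x{channels}{tag}")
```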
Given that the bounding box (BBO) regression loss function (LF) of the YOLOv4 object detection algorithm has a significant impact on the DA and convergence speed (CS), this study conducts in-depth research on the regression LF and improves it on the existing foundation. The loss function Loss contains the classification loss (CL), confidence loss (COL), and BBO loss, represented by $Loss_{cls}$, $Loss_{conf}$, and $Loss_{CIoU}$, respectively. The relevant expression is Eq. (4).
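For reference, the total loss of YOLOv4 is conventionally written as the sum of the three terms named above,
$$Loss = Loss_{cls} + Loss_{conf} + Loss_{CIoU},$$
which is given here as the standard form matching the description of Eq. (4) rather than a verbatim reproduction of the original equation.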
In Eq. (4), $Loss_{CIoU}$ is the regression loss term, and its expression is Eq. (5). In Eq. (5), the Euclidean distance between the two center points is $d$, the diagonal distance of the smallest enclosing box is $c$, and $a$ is the trade-off weight. The intersection over union between the predicted candidate box and the true BBO is $IoU$, and the consistency factor for the aspect ratio is $v$, whose expression $v = \frac{4}{\pi^2} (\arctan \frac{\omega^{gt}}{h^{gt}} - \arctan \frac{\omega}{h})^2$ is Eq. (6).
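For reference, the standard CIoU regression loss consistent with the quantities defined above is commonly written as
$$Loss_{CIoU} = 1 - IoU + \frac{d^2}{c^2} + a v, \qquad a = \frac{v}{(1 - IoU) + v},$$
and this conventional formulation is assumed here for Eq. (5).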
Fig. 5. Three cases of intersection of the RBO and BBO, and the angle calculation (AC) of the SIoU LF.
In Eq. (6), the true height and width of the BBO are $h^{gt}$ and $\omega^{gt}$, and the predicted height and width of the BBO are $h$ and $\omega$. When there is a large difference between the BBO and the real box (RBO), the movement direction of the BBO is decided only by the shape and distance constraints, which still leads to significant errors. Fig. 5(a) shows three situations in which the RBO and BBO intersect. The study uses the SCYLLA-IoU (SIoU) LF for the loss calculation, which takes four terms into consideration when determining the position and incorporates the ideas of the earlier LFs. Fig. 5(b) shows the angle calculation (AC) of the SIoU LF. The points $B$ and $B^{GT}$ denote the BBO and the true box (TBO), respectively. This study assumes that the coordinates of points $B$ and $B^{GT}$ are $(x, y)$ and $(x^{GT}, y^{GT})$, the vertical and horizontal (VH) distances between the BBO and the TBO are $C_h$ and $C_w$, the distance between points $B$ and $B^{GT}$ is $\sigma$, and the angles between the line connecting $B$ and $B^{GT}$ and the VH directions are $\alpha$ and $\beta$, respectively. The relevant expression for the angle loss $\Lambda$ is Eq. (7).
In Eq. (7), $\alpha + \beta = \frac{\pi}{2}$. The relevant expression for the IoU loss term (LT) is given in Eq. (5). Compared with the original Euclidean distance approach, this method reduces the number of distance-related variables and degrees of freedom, improves the training speed and accuracy, and reduces the model complexity. The final expression $L_{SIoU}$ for the SIoU LF is Eq. (8). In Eq. (8), the shape LT is $\Omega$ and the distance LT is $\Delta$. The formula for calculating the distance LT $\Delta$ is Eq. (9).
In Eq. (9), $p_x = (\frac{x^{GT}-x}{C_w})^2$, $p_y = (\frac{y^{GT}-y}{C_h})^2$, and $\gamma = 2 - \Lambda$. Adding the angle term to the distance term reduces the distance between the center points of the two boxes and also makes the BBO and the TBO consistent in the VH directions. The calculation formula for the shape LT $\Omega$ is Eq. (10). In Eq. (10), $\tau_w = \frac{|w^{GT}-w|}{\max(w^{GT},w)}$ and $\tau_h = \frac{|h^{GT}-h|}{\max(h^{GT},h)}$. The SIoU LF introduces the concept of angle to move the BBO towards the x-axis or y-axis of the TBO, improving CS by constraining the degrees of freedom.
4. Image Recognition Processing Effect of AWF-YOLOv4 Object Detection Algorithm
The study analyzed the image recognition and processing performance of the AWF-YOLOv4 object detection algorithm using a public dataset (pattern analysis, statistical modelling and computational learning, visual object classes: PASCAL VOC). The training set and validation set are the training set of the 2007 version of PASCAL VOC and the validation set of the 2012 version of PASCAL VOC, respectively, and the test set is the test set of the 2007 version of PASCAL VOC. The distribution of the sample images is highly similar to real scenes. The experimental environment is as follows: the operating system is Ubuntu 20.04, the deep learning framework is PyTorch 1.8, the programming language is Python 3.7, the system memory is 64 GB, and the CPU is an Intel(R) Core(TM) i9-10980XE @ 3.00 GHz. The evaluation indicators are the loss value and the mean average precision (mAP). The study first tested the effect of adding AWF modules at different positions on image recognition processing. Table 1 lists the addition positions of the different AWF modules; the schemes differ in the placement and number of AWF modules. The model structure, from large to small, is Scheme B, Scheme C, and Scheme A. Scheme A: starting from the second large residual block, AWF modules are added to the last residual block of the 2nd through 5th large residual blocks, for a total of 4 AWF modules. Scheme B: starting from the second large residual block, AWF modules are added to the first and last residual blocks of the 2nd through 5th large residual blocks, for a total of 8 AWF modules. Scheme C: starting from the third large residual block, AWF modules are added to the middle and last residual blocks of the 3rd through 5th large residual blocks, for a total of 6 AWF modules.
Figs. 6(a) and 6(b) show the mAP on the training and testing sets under the different AWF module configurations, respectively. The horizontal axis represents the number of training epochs, and the vertical axis represents the mAP values of each scheme. Overall, the average accuracy of the models, in descending order, is Scheme C, Scheme B, and Scheme A. Within the same scheme, the AWF module with the shared convolutional structure uses fewer parameters but achieves lower average accuracy, especially in the first two schemes. Adding the AWF module to the second large residual block does not affect the final detection result. Scheme C has a faster CS and higher mAP values of 0.9256 and 0.9156, while the mAP values of the other schemes are all around 85%.
Fig. 6. mAP of relevant sets in AWF module configuration.
Table 1. Different AWF module addition positions.
| Large residual block | Position | A | B | C |
| 1 | First place | 0 | 0 | 0 |
| 1 | Median | 0 | 0 | 0 |
| 1 | Last place | 0 | 0 | 0 |
| 2 | First place | 0 | 1 | 0 |
| 2 | Median | 0 | 0 | 0 |
| 2 | Last place | 1 | 1 | 0 |
| 3 | First place | 0 | 1 | 0 |
| 3 | Median | 0 | 0 | 1 |
| 3 | Last place | 1 | 1 | 1 |
| 4 | First place | 0 | 1 | 0 |
| 4 | Median | 0 | 0 | 1 |
| 4 | Last place | 1 | 1 | 1 |
| 5 | First place | 0 | 1 | 0 |
| 5 | Median | 0 | 0 | 1 |
| 5 | Last place | 1 | 1 | 1 |
| Total value | | 4 | 8 | 6 |
Figs. 7(a) and 7(b) show the convergence values and running times of the YOLOv4 and AWF-YOLOv4 algorithms, analyzed experimentally on the test set. The horizontal axis represents the number of iterations of each algorithm, and the vertical axes represent the error value and the running time. The error values of the two algorithms gradually decrease as the number of iterations increases, converging after about 200 and 175 iterations, respectively. Meanwhile, the stable error value of the AWF-YOLOv4 algorithm is 0.028, which is 0.011 lower than that of the YOLOv4 algorithm, and the stable running times of the two algorithms are 3.1 s and 2.9 s, respectively. The CS of the AWF-YOLOv4 algorithm is better than that of the YOLOv4 algorithm, and its running time is shorter. The AWF-YOLOv4 algorithm has advantages in both optimization time and error value because it adopts SIoU, which improves the CS of the proposed method.
Fig. 7. Convergence values and runtime of YOLOv4 and AWF-YOLOv4 algorithms.
VRT uses data from real life to generate electronic signals through computer technology, which are combined with various devices to transform them into phenomena that can be perceived. Figs. 8(a) and 8(b) show the original image data signal before and after applying the image processing technique, respectively. The horizontal axis represents the sampling points of the test, and the vertical axis represents the amplitude of the data signal. The original signal contains a large amount of noise. After processing with the image processing technique, the variations in the original data are largely preserved, and the resulting denoised signal tends to be smooth. Therefore, the wavelet coefficient threshold denoising method can effectively remove noise from the original image data and restore the feature data of the original signal to the maximum extent.
Fig. 8. Vibration signals before and after wavelet coefficient threshold denoising.
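For illustration, a minimal sketch of wavelet coefficient threshold denoising with PyWavelets is shown below; the wavelet family, decomposition level, and soft universal threshold are assumed settings rather than those used in the experiment.

```python
import numpy as np
import pywt


def wavelet_threshold_denoise(signal: np.ndarray, wavelet: str = "db4", level: int = 4) -> np.ndarray:
    """Denoise a 1-D signal by thresholding its wavelet detail coefficients."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Estimate the noise level from the finest detail coefficients and apply
    # the universal threshold with soft thresholding.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]


if __name__ == "__main__":
    t = np.linspace(0, 1, 1024)
    clean = np.sin(2 * np.pi * 5 * t)
    noisy = clean + 0.3 * np.random.randn(t.size)
    print(np.abs(wavelet_threshold_denoise(noisy) - clean).mean())
```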
To verify the application effect of the proposed image recognition method, the study compares it with classic image recognition algorithms, namely SSD, the faster region-based convolutional neural network (Faster R-CNN), and YOLOv8, on images from different categories of scenes. Figs. 9(a)-9(d) show the image detection mAP for four scenarios: natural scenery, urban roads, office environments, and plateau deserts. The horizontal axis represents the number of training epochs, and the vertical axis represents the mAP values of image detection for each algorithm. Overall, the AWF-YOLOv4 image detection algorithm has the highest mAP and SSD has the lowest mAP, followed by Faster R-CNN and YOLOv8. In the image detection of natural scenery, urban roads, office environments, and plateau deserts, the mAP of the AWF-YOLOv4 image detection algorithm is 0.9056, 0.9143, 0.9106, and 0.9812, respectively.
Fig. 9. Image detection results for different categories.
Figs. 10(a)-10(d) show the image data signal processing results of the four image detection techniques, respectively. The horizontal axis represents the sampling points of the test, and the vertical axis represents the amplitude of the data signal. The figure shows that the AWF-YOLOv4 image detection algorithm has the smallest image-signal fluctuation amplitude, while the other methods fluctuate more strongly, with maximum amplitude reduction rates of 14.25%, 17.36%, and 22.36%, respectively. This may be because the image processing in the AWF-YOLOv4 detection algorithm combines multi-scale and multi-stage image features while calculating the loss value through the SIoU LF. Nevertheless, the image data signal processing results of all the image detection techniques fall within the confidence interval. Overall, the AWF-YOLOv4 image detection technique has significant advantages in practical image processing applications.
Fig. 10. Image data signal processing results of four image detection techniques.
Finally, the detection results are analyzed in different environments, including strong
light, backlight, blurry targets, and normal targets. The corresponding sample images
are shown in Fig. 11.
Fig. 11. Sample images in four different environments.
The detection results of each algorithm in the four environments of strong light, backlight, blurred targets, and normal targets are shown in Table 2. In these four environments, the mAP of the AWF-YOLOv4 image detection algorithm is about 91.5%, with a detection time of about 0.120 s per image. The mAP of the other image detection algorithms lies in the range of 80%-88%, with detection times of 0.088-0.105 s per image. Therefore, the AWF-YOLOv4 image detection algorithm maintains high DA in different environments.
Table 2. Four detection results for strong light, backlight, blurred targets, and
normal targets.
| Detection type | Model | Object mAP/% | Background mAP/% | mAP/% | Detection time (s/image) |
| Strong light | AWF-YOLOv4 | 92.0 | 89.9 | 89.0 | 0.123 |
| Strong light | YOLOv8 | 86.3 | 84.9 | 86.0 | 0.105 |
| Strong light | Faster R-CNN | 85.0 | 84.3 | 85.2 | 0.095 |
| Strong light | SSD | 81.7 | 85.9 | 76.0 | 0.089 |
| Backlight | AWF-YOLOv4 | 91.4 | 89.3 | 89.4 | 0.120 |
| Backlight | YOLOv8 | 87.7 | 84.3 | 85.4 | 0.102 |
| Backlight | Faster R-CNN | 84.4 | 83.7 | 84.6 | 0.092 |
| Backlight | SSD | 81.6 | 20.4 | 47.6 | 0.086 |
| Fuzzy target | AWF-YOLOv4 | 91.4 | 89.3 | 90.4 | 0.121 |
| Fuzzy target | YOLOv8 | 87.7 | 84.3 | 85.4 | 0.103 |
| Fuzzy target | Faster R-CNN | 85.4 | 83.7 | 84.6 | 0.093 |
| Fuzzy target | SSD | 82.6 | 20.4 | 47.6 | 0.089 |
| Normal target | AWF-YOLOv4 | 91.7 | 89.5 | 91.6 | 0.122 |
| Normal target | YOLOv8 | 86.0 | 84.5 | 85.6 | 0.104 |
| Normal target | Faster R-CNN | 84.7 | 83.9 | 84.8 | 0.094 |
| Normal target | SSD | 83.9 | 20.6 | 47.8 | 0.086 |
Further, the receiver operating characteristic (ROC) curve is used to evaluate the image detection performance of AWF-YOLOv4, with YOLOv8 and Faster R-CNN as comparisons. The ROC curves of the algorithms are shown in Fig. 12. According to Fig. 12, the area under the curve (AUC) of the AWF-YOLOv4 algorithm reaches 0.8869, while the AUC values of the YOLOv8 and Faster R-CNN algorithms are only 0.8385 and 0.8346, respectively, which are significantly lower than that of the AWF-YOLOv4 algorithm. A higher AUC value indicates that the image detection algorithm maintains high sensitivity and specificity under different classification thresholds. The AWF-YOLOv4 algorithm proposed by the study therefore has significant advantages in image detection.
Fig. 12. ROC curves of various algorithms.
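For reference, the ROC/AUC comparison above can be computed as in the following scikit-learn sketch, given per-detection confidence scores and binary correctness labels; the arrays shown are illustrative placeholders rather than the experimental data.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholder ground-truth correctness labels and detector confidence scores.
labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.92, 0.35, 0.80, 0.66, 0.48, 0.88, 0.55, 0.73])

fpr, tpr, _ = roc_curve(labels, scores)
print(f"AUC = {auc(fpr, tpr):.4f}")
```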
5. Conclusion
To achieve high accuracy and detection speed of image processing technology in virtual
environments, an AWF-YOLOv4 object detection algorithm was proposed and applied to
image recognition processing. In the same scheme, the AWF module used a shared convolutional
structure with fewer parameters, but the average accuracy was lower, especially in
the first two schemes. The error values of the two algorithms gradually decreased
with the increase of iteration times, and the convergence times were about 200 and
175 times respectively. Meanwhile, the stable error value of AWF-YOLOv4 algorithm
was 0.028, which was 0.011 lower than YOLOv4 algorithm. Compared to the YOLOv4 algorithm,
the AWF-YOLOv4 algorithm introduced an adaptive weight fusion mechanism that could
dynamically adjust the weights of feature maps. This allowed the network to more effectively
fuse feature maps at different levels, thereby better capturing information at different
scales and contexts and extracting fine-grained features, making AWF-YOLOv4 better
at detecting small objects and complex backgrounds. The original signal contained a large amount of noise; after processing with the image processing technique, the variations in the original data were largely preserved and the resulting denoised signal tended to be smooth. The AWF-YOLOv4 image detection algorithm had the highest mAP, and SSD had the lowest mAP, followed by Faster R-CNN and YOLOv8. The mAP of the AWF-YOLOv4 image detection algorithm was about 91.5% in the four environments of strong light, backlight, blurred targets, and normal targets, with a detection time of about 0.120 s per image, while the mAP of the other image detection algorithms was in the range of 80%-88%, with detection times of 0.088-0.105 s per image. In comparison to the alternative algorithms, the AWF-YOLOv4 image detection algorithm exhibited superior efficacy. This is because AWF-YOLOv4 employs an enhanced feature fusion mechanism, which is better able to discern minute details and delineate target boundaries in intricate settings; furthermore, the incorporation of SIoU enhances the algorithm's convergence rate. The multi-scale and multi-stage feature fusion network designed by the research
has high feature extraction ability and extremely high detection ability in the field
of image recognition. However, there are still shortcomings in the research. The dataset
selected in the study contains fewer object categories, and the application scenarios
of objects can be expanded in the future for further strengthening the detection performance
of the model.
Acknowledgments
The research is supported by Fujian Province Education Science "The 14th Five-Year
Plan" 2022 Project (FJJKGZ22-051).
References
Tang X., Zhang T., 2021, Facial expression recognition algorithm based on convolution
neural network and multi-feature fusion, Journal of Physics: Conference Series, Vol.
1883, pp. 012018

Wang Y., 2019, Multimodal emotion recognition algorithm based on edge network emotion
element compensation and data fusion, Personal and Ubiquitous Computing, Vol. 23,
No. 3-4, pp. 383-392

Li Y., He Z., Wang S., Wang Z., Huang W., 2021, Multideep feature fusion algorithm
for clothing style recognition, Wireless Communications and Mobile Computing, Vol.
2021, No. 4, pp. 1-14

Huang Y., Tian K., Wu A., Zhang G., 2019, Feature fusion methods research based on
deep belief networks for speech emotion recognition under noise condition, Journal
of Ambient Intelligence and Humanized Computing, Vol. 10, No. 5, pp. 1787-1798

Pei M., Li H. R., Yu H., 2021, A novel three-stage feature fusion methodology and
its application in degradation state identification for hydraulic pumps, Measurement
Science Review, Vol. 21, pp. 123-135

Li X., Du Z., Huang Y., Tan Z., 2021, A deep translation (GAN) based change detection
network for optical and SAR remote sensing images, ISPRS Journal of Photogrammetry
and Remote Sensing, Vol. 179, pp. 14-34

Chen L., Tang W., John N. W., Wan T. R., Zhang J. J., 2019, De-smokeGCN: Generative
cooperative networks for joint surgical smoke detection and removal, IEEE Transactions
on Medical Imaging, Vol. 39, No. 5, pp. 1615-1625

Esfahlani S. S., 2019, Mixed reality and remote sensing application of unmanned aerial
vehicle in fire and smoke detection, Journal of Industrial Information Integration,
Vol. 15, pp. 42-49

Zheng B., Yun D., Liang Y., 2020, Research on behavior recognition based on feature
fusion of automatic coder and recurrent neural network, Journal of Intelligent and
Fuzzy Systems, Vol. 39, No. 6, pp. 8927-8935

Zhang B. T., Wang X. P., Shen Y., Lei T., 2019, Dual-modal physiological feature
fusion-based sleep recognition using CFS and RF algorithm, International Journal of
Automation and Computing, Vol. 16, No. 3, pp. 286-296

Gou Y., Wang K., Wei S., Shi C., 2023, GMDA: GCN-based multi-modal domain adaptation
for real-time disaster detection, International Journal of Uncertainty, Fuzziness
and Knowledge-Based Systems, Vol. 31, No. 6, pp. 957-973

Zhao J., Yu L., Liu Z., 2021, Research based on multimodal deep feature fusion for
the auxiliary diagnosis model of infectious respiratory diseases, Scientific Programming,
Vol. 2021, No. 4, pp. 1-6

Wang Z., Zhen J., Li Y., Li G., Han Q., 2019, Multi-feature multimodal biometric recognition
based on quaternion locality preserving projection, Chinese Journal of Electronics,
Vol. 28, No. 4, pp. 789-796

Zhou H., Dong C., Wu R., Xu X., Guo Z., 2021, Feature fusion based on Bayesian decision
theory for radar deception jamming recognition, IEEE Access, Vol. 9, pp. 16296-16304

Siriwardhana S., Kaluarachchi T., Billinghurst M., Nanayakkara S., 2020, Multimodal
emotion recognition with transformer-based self supervised feature fusion, IEEE Access,
Vol. 8, pp. 176274-176285

Yuan D., Shu X., Liu Q., Zhang X., He Z., 2023, Robust thermal infrared tracking via
an adaptively multi-feature fusion model, Neural Computing and Applications, Vol.
35, No. 4, pp. 3423-3434

Xia S., Zhou X., Shi H., Li S., 2024, Hybrid feature adaptive fusion network for multivariate
time series classification with application in AUV fault detection, Ships and Offshore
Structures, Vol. 19, No. 6, pp. 807-819

Xie J., Pang Y., Pan J., Nie J., Cao K., Han J., 2023, Complementary feature pyramid
network for object detection, ACM Transactions on Multimedia Computing, Communications,
and Applications, Vol. 19, No. 6, pp. 1-15

Huo L., Zhu J., Singh P. K., Pavlovich P. A., 2021, Research on QR image code recognition
system based on artificial intelligence algorithm, Journal of Intelligent Systems,
Vol. 30, No. 1, pp. 855-867

Wang B., Wang Y., Cui L., 2020, Fuzzy clustering recognition algorithm of medical
image with multi-resolution feature, Concurrency and Computation: Practice and Experience,
Vol. 32, No. 1, pp. e4886

Bandewad G., Datta K. P., Gawali B. W., Pawar S. N., 2023, Review on discrimination
of hazardous gases by smart sensing technology, Artificial Intelligence and Applications,
Vol. 1, No. 2, pp. 86-97

Gheisari M., Hamidpour H., Liu Y., Saedi P., Raza A., Jalili A., Rokhsati H., Amin
R., 2023, Data mining techniques for web mining: A survey, Artificial Intelligence
and Applications, Vol. 1, No. 1, pp. 3-10

Author
Daogui Lin is pursuing a Ph.D. degree in computer science from University Malaysia
Sabah. Currently, he serves as an associate professor at the Fujian Polytechnic of
Information Technology and is a leading figure in the field of "Multimedia Design
and Production" in Fujian Province. He has served as an expert consultant for the
Taiwan, Hong Kong, and Macau Affairs Office of the Fujian Provincial People's Government
and a VR judge for the World Vocational College Skills Competition. He has authored
a national-level textbook, "Photoshop CC Visual Design Case Course," and has led a
provincial-level high-quality online course and multiple research projects. He has
been awarded the title of "Excellent Guidance Teacher" in the National Digital Art
Design Competition (NCDA) and as a core member, he has won the Grand Prize in the
Provincial Teaching Achievement Award. His research areas include virtual reality,
image processing, artificial intelligence fundamentals, and computer fundamentals
teaching.