
  1. (The Internet of Things and Artificial Intelligence College, Fujian Polytechnic of Information Technology, Fuzhou, 350003, China)
  2. (Faculty of Computing and Informatics, University Malaysia Sabah, Sabah, 88300, Malaysia)



Keywords: Virtual reality, Image recognition, AWF, YOLOv4, SIoU loss function

1. Introduction

With the gradual advancement and maturation of computer and simulation technology, virtual reality technology (VRT) has driven rapid progress in many industries, such as health care, education, and entertainment. VRT allows people to experience a more realistic virtual world and is applied not only in education, entertainment, and medicine, but also in military and simulation fields. Virtual environment construction refers to using computer technology to combine the physical environment of the real world with the virtual world to build a virtual one. It can be divided into two main aspects: physical simulation and image recognition (IR). Physical simulation constructs a virtual world by simulating physical phenomena in the real world. IR uses computer technology to compare input images with pre-trained models, identify features in the images, and construct a virtual world from them. IR directly influences the application: higher IR accuracy results in higher accuracy in detecting objects. Traditional image processing techniques place high demands on datasets and domain expertise [1-3]. They offer strong interpretability and high accuracy when identifying and detecting practical problems with small data scales. However, feature extraction and selection rely heavily on manual design, hand-crafted features struggle to capture high-level semantic features, and stability is low in complex multi-object environments [4,5]. In contrast, one-stage object detection algorithms such as the single shot multibox detector (SSD) and YOLO have strong advantages in IR processing. This research designs an IR method for the VRT setting that is based on YOLOv4 and optimized with an adaptive weighted fusion (AWF) module, namely the AWF-YOLOv4 object detection algorithm. The contribution of the research is the proposal of the AWF-YOLOv4 object detection algorithm, which improves the accuracy and stability of image recognition, addresses the challenges of image recognition in virtual reality applications, and optimizes the user experience in virtual environments. More accurate and efficient image recognition, in turn, enhances the overall application effectiveness and user interaction experience of VRT.

The research elaborates on the content in four parts. The first part reviews the literature on IR, adaptive features, and VRT. The second part proposes the IR method based on the AWF-YOLOv4 object detection algorithm. The third part analyzes the application effect of the IR method. The fourth part summarizes the research results and discusses the shortcomings of the research and future directions.

2. Related Works

The utilization of VRT in image recognition can achieve effective interaction between virtual and real environments and has been applied in cultural heritage protection, medicine, and health care. As a key technology for applying VRT, image recognition has been studied by many researchers. Li et al. analyzed the change detection of remote sensing images under adverse weather conditions and proposed a deep translation image change detection network, which converts images from one domain to another through a cyclic structure. Dataset testing showed that this image change detection method is robust and effective [6]. Chen et al. proposed an image recognition processing method based on deep neural networks and applied it to video image processing in robot vision tasks. Validation on the test set showed that the proposed neural network structure outperformed many image processing methods in both quantitative and qualitative measurements. The method not only eliminated obvious image noise but also preserved the true content of medical images well [7]. Esfahlani proposed a drone system combining the robot operating system and computer image recognition, which considered temporal changes in fire intensity, motion attributes, colors, etc., and utilized state-of-the-art indoor and outdoor simultaneous localization and mapping. The system could achieve inter-frame motion estimation and avoid problems such as motion failure caused by image data loss, thus achieving the detection of dangerous signals and abnormal data in natural environments [8]. Zheng et al. demonstrated the accuracy of behavior recognition for human activities through an algorithm that combines autoencoder and recurrent neural network features [9].

Zhang et al. proposed a dual-modal physiological fusion feature sleep recognition technique, which considered the correlation between each feature and category. Ten-fold cross-validation experiments demonstrated the high recognition rate of this method [10]. Gou et al. analyzed real-time disaster detection technology based on an end-to-end model, which included an attention fusion module, a feature extraction module, a convolutional neural network feature extraction module, and a maximum mean discrepancy domain adaptation module; the results verified its superior performance [11]. Zhao et al. proposed a deep feature fusion algorithm for multimodal data and applied it to the auxiliary diagnosis of infectious respiratory diseases, achieving automated and intelligent diagnosis [12]. Wang et al. proposed a multi-feature and multimodal biometric recognition technique, which identified the essential manifold structure of quaternion fusion features. The experimental results showed that, compared with other feature fusion algorithms, this method had better performance [13]. Zhou et al. proposed a feature fusion algorithm based on Bayesian decision theory to identify radar deception jamming signals; the algorithm could identify radar deception jamming with high recognition accuracy [14]. Siriwardhana et al. designed a multimodal emotion recognition algorithm based on self-supervised feature fusion, which combined an attention fusion mechanism; benchmark results verified the robustness of the model [15]. To deal with the declining accuracy of target tracking methods in thermal infrared tracking scenarios, Yuan et al. designed an adaptive multi-feature fusion model that adaptively integrates deep convolutional neural network features and hand-crafted features and adopts a model update strategy to track target changes adaptively. The results showed that this method significantly improved the tracking effect [16]. In view of the complexity and time-variability of large-scale multivariate time series data, Xia et al. proposed a hybrid feature adaptive fusion network and adopted an attention mechanism to deal with redundancy and conflicts between different scales for feature recognition and classification. The results showed that the accuracy of this method could reach 96%, a significant performance improvement [17].

Based on the research results of domestic and foreign scholars, it can be observed that many object detection algorithms have achieved notable results in various fields. However, there is still significant room for improvement in detection accuracy (DA) and detection speed. This research therefore builds on VRT and proposes the AWF-YOLOv4 object detection algorithm, aiming to contribute to the improvement of image processing technology.

3. Image Recognition for AWF-YOLOv4 Object Detection Algorithm

For feature fusion, many object detection algorithms use feature pyramid networks (FPN) or path aggregation networks (PAN) as the neck network [18]. While these methods can achieve cross-scale feature fusion, they fail to account for the differences in information content between features of the same level at different stages, and they do not address the information loss caused by semantic discrepancies. Therefore, this research proposes the AWF-YOLOv4 object detection algorithm, which is based on the YOLOv4 object detection algorithm. It applies AWF to two feature maps and adds a cross-stage fusion path to the FPN feature fusion network to prevent the loss of feature layer information during the feature fusion stage. The fused feature network is then combined with the basic object detection algorithm.

Fig. 1. AWF for independent convolution and shared convolution.


3.1. AWF and Cross-scale and Cross-stage Feature Fusion Networks in Image Recognition

The residual blocks in the backbone network of the object detection algorithm do not consider the differences in semantic levels of features during the feature fusion stage. To make more efficient use of the residual blocks, this study embeds AWF into them to achieve adaptive weighted fusion of features, avoiding treating features that cover different spatial information equally. Feature extraction thus becomes more flexible and refined, strengthening regions containing effective information while weakening regions containing invalid information. Figs. 1(a) and 1(b) show the AWF with independent convolution and with shared convolution, respectively.

The AWF module has two variants, each consisting of three parts: compression, extraction, and allocation; the variants differ only in the type of convolution used in the compression stage. The compression part processes the two feature maps with $1 \times 1$ convolutions, compressing the number of channels to a constant $T$ while extracting feature information; the compressed feature maps are then used to extract weight information. The extraction part first concatenates the two intermediate feature maps along the channel dimension to obtain a combined feature map. This combined feature map contains information from both input feature maps (IFMs) and has $2T$ channels. A further $1 \times 1$ convolution then compresses the channel number of the combined feature map and extracts the spatial weights of the IFMs, yielding a weight feature map with two channels, each carrying the weight information of one IFM. It should be noted that $1 \times 1$ convolution is used throughout because it does not change the spatial size of the IFM, preserves all the information of the original feature map, and passes the positional information completely to the next layer. In addition, $1 \times 1$ convolution realizes a linear combination of multiple feature maps, so that the output feature map integrates information across channels and enriches the features extracted by the network. Since the two sets of weights are generated by convolving the same combined feature map, there is a dependency between them, and this dependency is used to control the weighted fusion of the two IFMs. Finally, the weight values are mapped using Eq. (1).

(1)
$\left\{\begin{aligned} \omega _{i,j}^{\alpha } ={e^{\alpha _{i,j} }/ \left(e^{\alpha _{i,j} } +e^{\beta _{i,j} } \right)},\\ \omega _{i,j}^{\beta } ={e^{\beta _{i,j} }/ \left(e^{\alpha _{i,j} } +e^{\beta _{i,j} } \right)}. \end{aligned}\right. $

In Eq. (1), the parameter values at the two channels are $\alpha_{i,j}$ and $\beta_{i,j}$, and the two spatial weight values are $\omega^\alpha_{i,j}$ and $\omega^\beta_{i,j}$, respectively. After the spatial weights are obtained, the allocation operation is carried out: the obtained spatial weights are matched with the original IFMs. By multiplying these weights with the corresponding IFMs, the dependencies between the weights are passed to the IFMs, establishing the correlation between them. The sum of these correlated IFMs forms the final output feature map, thereby realizing AWF of the features. The weight feature map is calculated as Eq. (2).

(2)
$Z=Conv\left(Concat\left[Conv\left(C1\right),Conv\left(C2\right)\right]\right) .$

In Eq. (2), the two IFMs are denoted $C1$ and $C2$, the convolution operation is $Conv(\cdot)$, and channel-wise concatenation is $Concat(\cdot)$. The output feature map is calculated as Eq. (3).

(3)
$E=Z\left[0\right]\times C1+Z\left[1\right]\times C2.$

In Eq. (3), the first and second channels of the weight feature map are $Z[0]$ and $Z[1]$, respectively. In the AWF structure with independent convolution, there are significant semantic differences between features at different levels, so each convolution branch learns only the features of a specific semantic level, which is beneficial for learning the distribution characteristics of semantic information at that level. Meanwhile, the semantic gap between features can be reduced through the convolution operations, enabling the learned features to be fused with subsequent features. Because of the semantic difference between the output features of the stacked layers and the identity mapping of the IFM, this study also investigates the use of shared convolution to simultaneously learn the characteristics of the two feature maps and their spatial relationships [19,20]. This study embeds the AWF module into the residual blocks of the backbone network; Figs. 2(a) and 2(b) show the original residual block and the improved residual block, respectively. The improved residual block feeds the stacked-layer output and the identity mapping of the input features into the AWF module, instead of adding them directly, and the AWF module extracts and assigns weights to the two feature maps to establish the correlation between them. Therefore, the learning ability of the residual module is further improved.
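To make the fusion concrete, the following is a minimal PyTorch sketch of the AWF module described by Eqs. (1)-(3); the channel constant $T$, the choice between independent and shared $1 \times 1$ compression convolutions, and all module and variable names are illustrative assumptions rather than the authors' implementation.

# Minimal AWF sketch (assumed implementation, not the authors' code).
import torch
import torch.nn as nn


class AWF(nn.Module):
    """Adaptive weighted fusion of two same-sized feature maps."""

    def __init__(self, channels, t=16, shared=False):
        super().__init__()
        # Compression: 1x1 convolutions squeeze each input to T channels.
        self.compress1 = nn.Conv2d(channels, t, kernel_size=1)
        self.compress2 = self.compress1 if shared else nn.Conv2d(channels, t, kernel_size=1)
        # Extraction: a 1x1 convolution turns the 2T-channel concatenation
        # into a 2-channel spatial weight map, as in Eq. (2).
        self.extract = nn.Conv2d(2 * t, 2, kernel_size=1)

    def forward(self, c1, c2):
        z = self.extract(torch.cat([self.compress1(c1), self.compress2(c2)], dim=1))
        # Eq. (1): softmax over the two channels yields the spatial weights.
        w = torch.softmax(z, dim=1)
        # Eq. (3): allocation, i.e. the weighted sum of the two input feature maps.
        return w[:, 0:1] * c1 + w[:, 1:2] * c2


# Usage inside an improved residual block (Fig. 2(b)): fuse the stacked-layer
# output with the identity branch instead of adding them directly.
fuse = AWF(channels=256)
identity = torch.randn(1, 256, 52, 52)
stacked = torch.randn(1, 256, 52, 52)
out = fuse(stacked, identity)  # shape: (1, 256, 52, 52)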

Fig. 2. Original residual block and improved residual block.


The utilization of FPN and PAN in the neck feature fusion component of the YOLOv4 network enables feature fusion at different scales throughout the multi-scale feature fusion process. However, this approach does not encompass the fusion of features across different stages of the same scale. Low-level features can obtain rich semantic information from other feature layers during feature fusion, but the original spatial information is lost during the cascading and convolution operations. Meanwhile, the GiraffeDet network adopts a combination of cross-scale connections and skip-layer connections in its neck network. This enables the model to exchange low-level and high-level information, with skip connections interconnecting features of the same scale at different stages [21,22]. The study designs the cross-scale and cross-stage feature fusion network shown in Fig. 3. This network adds cross-stage fusion on top of the cross-scale fusion structure, allowing each feature layer to retain its original feature information while obtaining information from feature layers at other scales.

Fig. 3. Cross-scale and cross-stage feature fusion network.


The cross-scale fusion path includes sampling, cascading, and convolution, the same as the original YOLOv4 fusion path. The cross-stage fusion path is a skip connection, which has a shorter distance during backpropagation while retaining the original feature dimensions. It fuses the same-scale features obtained after feature fusion with the features extracted from the backbone network. In addition, the skip connection introduces no additional parameters and does not increase the computational complexity. The improved feature fusion network has a cross-stage connection for downsampling low-level features, which fuses the original information with the low-level features and then propagates it to the high-level feature layers. This makes the fusion of lower-level and higher-level features more comprehensive.
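As a sketch of the cross-stage fusion path, the snippet below assumes a parameter-free element-wise addition between the backbone feature and the neck feature of the same scale, consistent with the statement that the skip connection introduces no extra parameters; the function and variable names are illustrative, not taken from the paper.

# Assumed form of the cross-stage skip connection (illustrative only).
import torch


def cross_stage_fuse(backbone_feat, neck_feat):
    """Parameter-free skip connection between two stages of the same scale."""
    # Both tensors have identical shapes, e.g. (N, 256, 52, 52), so a plain
    # addition lets the layer keep its original spatial information while
    # retaining the semantic information gathered during cross-scale fusion.
    return backbone_feat + neck_feat


# Example at the 52x52 level of the neck network.
p3_backbone = torch.randn(1, 256, 52, 52)
p3_neck = torch.randn(1, 256, 52, 52)
p3_out = cross_stage_fuse(p3_backbone, p3_neck)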

3.2. Image Recognition for AWF-YOLOv4 Object Detection Algorithm

In the YOLO series of algorithms, YOLOv4 has a structure similar to YOLOv3, but it optimizes the network structure and training techniques on the basis of the YOLOv3 algorithm. The detection method of later YOLO versions is similar to that of YOLOv4, which offers high customizability and excellent baseline performance. Fig. 4 shows the network structure of the AWF-YOLOv4 object detection algorithm, which comprises the backbone network, the neck network, and the prediction network. This method enhances the network's ability to fuse and extract features on the basis of the YOLOv4 algorithm: an AWF module is introduced in the backbone network to enhance the learning ability of the residual blocks, and a cross-scale and cross-stage fusion network is adopted in the neck network to reduce information loss during feature fusion, enabling the neck network to make better use of feature information. The input image size of this network is $416 \times 416$, and the backbone network consists of 5 large residual blocks. The AWF module is embedded in some of the residual blocks, and each residual block includes 2 branches, corresponding to the stacked layers and the shortcut edge crossing the residual block. The numbers of residual blocks contained in the five large residual blocks are 1, 2, 8, 8, and 4, respectively. The output features of the last three large residual blocks are fed to the neck network for feature fusion, with dimensions of $52 \times 52 \times 256$, $26 \times 26 \times 512$, and $13 \times 13 \times 1024$, respectively. The neck network fuses the input features of different scales: it first passes high-level features containing rich semantic information to the low-level features through a top-down path, then passes low-level features containing rich spatial information to the high-level features through a bottom-up path, with the output features having the same dimensions as the input features. A cross-stage fusion path is then used to combine features of the same scale before and after fusion. Finally, the features output by the neck network are fed into the prediction network, and the positions and categories of objects in the image are detected through classification and regression.
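For reference, the dimensions and block counts stated above can be summarized in the following illustrative configuration sketch; it is not the authors' code, and only the quantities given in the text are encoded.

# Illustrative configuration summary of AWF-YOLOv4 for a 416x416 input.
AWF_YOLOV4 = {
    "input_size": (416, 416, 3),
    "backbone": {
        "large_residual_blocks": 5,
        "residual_blocks_per_stage": [1, 2, 8, 8, 4],
        "awf_placement": "selected residual blocks (see Table 1)",
        "outputs_to_neck": [(52, 52, 256), (26, 26, 512), (13, 13, 1024)],
    },
    "neck": ["top-down path (FPN)", "bottom-up path (PAN)", "cross-stage skip connections"],
    "head": "YOLO prediction heads (classification + confidence + SIoU box regression)",
}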

Fig. 4. The network structure of AWF-YOLOv4 object detection algorithm.


Given that the bounding box (BBO) regression loss function (LF) of the YOLOv4 object detection algorithm has a significant impact on the DA and convergence speed (CS), this study conducts an in-depth investigation of the regression LF and improves it on the existing foundation. The total LF $Loss$ contains the classification loss (CL), confidence loss (COL), and BBO regression loss, represented by $Loss_{cls}$, $Loss_{conf}$, and $Loss_{CIoU}$, respectively. The relevant expression is Eq. (4).

(4)
$Loss=Loss_{CIoU} +Loss_{conf} +Loss_{cls} . $

In Eq. (4), $Loss_{CIoU}$ is the BBO regression loss, expressed as Eq. (5).

(5)
$Loss_{CIoU} =1-IoU+\frac{d^{2} }{c^{2} } +av .$

In Eq. (5), the Euclidean distance between the two center points is $d$, the diagonal distance of the smallest enclosing box is $c$, and $a$ is a trade-off coefficient. $IoU$ is the intersection over union between the predicted candidate box and the true BBO, and $v$ is the consistency factor for the aspect ratio, expressed as Eq. (6).

(6)
$v=\frac{4}{\pi ^{2} } \left(\arctan \frac{\omega ^{gt} }{h^{gt} } -\arctan \frac{\omega }{h} \right)^{2} .$

Fig. 5. Three cases of intersection of the RBO and BBO, and the AC of the SIoU LF.


In Eq. (6), the true height and width of the box are $h^{gt}$ and $\omega^{gt}$, and the predicted height and width of the BBO are $h$ and $\omega$. When the difference between the BBO and the real box (RBO) is large, the movement direction of the BBO is decided only by the shape and distance constraints, which still yields significant errors. Fig. 5(a) shows three situations in which the RBO and BBO intersect. The study therefore uses the SCYLLA-IoU (SIoU) LF for the loss calculation, which considers four terms in determining the box position and incorporates the ideas of the earlier LFs. Fig. 5(b) illustrates the angle calculation (AC) of the SIoU LF. Points $B$ and $B^{GT}$ denote the centers of the BBO and the true box (TBO), with coordinates $(x, y)$ and $(x^{GT}, y^{GT})$, respectively. The vertical and horizontal (VH) distances between the BBO and TBO centers are $C_h$ and $C_w$, the distance between points $B$ and $B^{GT}$ is $\sigma$, and the angles between the line connecting $B$ and $B^{GT}$ and the horizontal and vertical directions are $\alpha$ and $\beta$, respectively. The angle loss is expressed as Eq. (7).

(7)
$\left\{\begin{aligned} C_{h} =\max \left(y,y^{GT} \right)-\min \left(y,y^{GT} \right),\\ \sigma =\sqrt{\left(x^{GT} -x\right)^{2} +\left(y^{GT} -y\right)^{2} },\\ \sin \alpha ={C_{h}/ \sigma }, \\ \Lambda =\sin 2\alpha. \end{aligned}\right. $

In Eq. (7), $\alpha + \beta = \frac{\pi}{2}$. The IoU loss term (LT) uses the same intersection over union as in Eq. (5). Compared with the original Euclidean-distance approach, this method reduces the number of distance-related variables and the degrees of freedom, improves the training speed and accuracy, and lowers the model complexity. The final SIoU LF $L_{SIoU}$ is expressed as Eq. (8).

(8)
$L_{SIoU} =1-IoU+\frac{\Delta +\Omega }{2} .$

In Eq. (8), the shape LT is $\Omega$ and the distance LT is $\Delta$. The distance LT $\Delta$ is calculated as Eq. (9).

(9)
$\Delta =2-e^{-\gamma p_{x} } -e^{-\gamma p_{y} } . $

In Eq. (9), $p_x = (\frac{x^{GT}-x}{C_x})^2$, $p_y = (\frac{y^{GT}-y}{C_y})^2$, $\gamma = 2 - \Lambda$. Adding an angle to the distance term can reduce the distance between the center points of two boxes, and also make the BBO and the TBO consistent in both VH directions. The calculation formula for the shape LT $\Omega$ is Eq. (10).

(10)
$\Omega =\left(1-e^{-\tau_{w}} \right)^{\theta } +\left(1-e^{-\tau_{h}} \right)^{\theta } . $

In Eq. (10), $\tau_w = \frac{|w^{GT}-w|}{\max(w^{GT},w)}$ and $\tau_h = \frac{|h^{GT}-h|}{\max(h^{GT},h)}$. The SIoU LF introduces the concept of angle so that the BBO first moves toward the x-axis or y-axis of the TBO, improving the CS by constraining the degrees of freedom.
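The following is a minimal NumPy sketch of the complete SIoU computation in Eqs. (7)-(10). Boxes are assumed to be given as center coordinates and sizes $(c_x, c_y, w, h)$; since the text does not define $C_x$ and $C_y$ in Eq. (9) or the exponent $\theta$, they are taken here, as in common SIoU implementations, as the width and height of the smallest enclosing box and $\theta = 4$, which are assumptions rather than settings stated in the paper.

# Assumed SIoU implementation following Eqs. (7)-(10); illustrative only.
import numpy as np


def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # IoU term between the two axis-aligned boxes.
    ix1, iy1 = max(px - pw / 2, gx - gw / 2), max(py - ph / 2, gy - gh / 2)
    ix2, iy2 = min(px + pw / 2, gx + gw / 2), min(py + ph / 2, gy + gh / 2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Smallest enclosing box of the two boxes (assumed meaning of C_x, C_y).
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)
    # Eq. (7): angle cost Lambda = sin(2*alpha), with sin(alpha) = C_h / sigma.
    sigma = np.sqrt((gx - px) ** 2 + (gy - py) ** 2) + eps
    sin_alpha = abs(gy - py) / sigma
    lam = np.sin(2 * np.arcsin(min(sin_alpha, 1.0)))
    # Eq. (9): distance cost with gamma = 2 - Lambda.
    gamma = 2 - lam
    delta = (1 - np.exp(-gamma * ((gx - px) / cw) ** 2)) + \
            (1 - np.exp(-gamma * ((gy - py) / ch) ** 2))
    # Eq. (10): shape cost.
    tau_w = abs(pw - gw) / max(pw, gw)
    tau_h = abs(ph - gh) / max(ph, gh)
    omega = (1 - np.exp(-tau_w)) ** theta + (1 - np.exp(-tau_h)) ** theta
    # Eq. (8): final SIoU loss.
    return 1 - iou + (delta + omega) / 2


# Example: loss of a slightly shifted prediction against its ground-truth box.
print(siou_loss((50.0, 50.0, 20.0, 40.0), (48.0, 52.0, 22.0, 38.0)))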

4. Image Recognition Processing Effect of AWF-YOLOv4 Object Detection Algorithm

The study analyzed the image recognition and processing performance of the AWF-YOLOv4 object detection algorithm on the public PASCAL VOC (pattern analysis, statistical modelling and computational learning, visual object classes) dataset. The training set and validation set are the training set of PASCAL VOC 2007 and the validation set of PASCAL VOC 2012, respectively, and the test set is the test set of PASCAL VOC 2007. The distribution of the sample images is highly similar to real scenes. The experimental environment is as follows: the operating system is Ubuntu 20.04, the deep learning framework is PyTorch 1.8, the programming language is Python 3.7, the system memory is 64 GB, and the hardware is an Intel(R) Core(TM) i9-10980XE CPU @ 3.00 GHz. The evaluation indicators are the loss value and mean average precision (mAP). The study first tests the effect of adding AWF modules at different positions on image recognition processing. Table 1 lists the different AWF module placement schemes, which differ in the placement and number of modules. The model structure, from large to small, is Scheme B, Scheme C, and Scheme A. Scheme A: starting from the second large residual block, add AWF modules to the last residual block of the 2nd through 5th large residual blocks, for a total of 4 AWF modules. Scheme B: starting from the second large residual block, add AWF modules to the first and last residual blocks of the 2nd through 5th large residual blocks, for a total of 8 AWF modules. Scheme C: starting with the third large residual block, add AWF modules to the middle and last residual blocks of the 3rd through 5th large residual blocks, for a total of 6 AWF modules.

Figs. 6(a) and 6(b) show the mAP of the training and test sets for each AWF module configuration, respectively. The horizontal axis represents the number of training epochs, and the vertical axis represents the mAP value of each scheme. Overall, the average accuracy of the model, in descending order, is Scheme C, Scheme B, and Scheme A. Within the same scheme, the AWF module with the shared convolutional structure has fewer parameters but lower average accuracy, especially in the first two schemes. Adding the AWF module to the second large residual block does not affect the final detection result. Scheme C has a faster CS and higher mAP values of 0.9256 and 0.9156, while the mAP values of the other schemes are all around 85%.

Fig. 6. mAP of the training and test sets under different AWF module configurations.


Table 1. Different AWF module addition positions.

Large residual block   Position      A   B   C
1                      First place   0   0   0
                       Median        0   0   0
                       Last place    0   0   0
2                      First place   0   1   0
                       Median        0   0   0
                       Last place    1   1   0
3                      First place   0   1   0
                       Median        0   0   1
                       Last place    1   1   1
4                      First place   0   1   0
                       Median        0   0   1
                       Last place    1   1   1
5                      First place   0   1   0
                       Median        0   0   1
                       Last place    1   1   1
Total                                4   8   6

Figs. 7(a) and 7(b) show the convergence values and runtime of the YOLOv4 and AWF-YOLOv4 algorithms, respectively, analyzed experimentally on the test set. The horizontal axis represents the number of iterations of the algorithms, and the vertical axis represents the error value and the running time. The error values of both algorithms gradually decrease as the number of iterations increases, and they converge after about 200 and 175 iterations, respectively. The stable error value of the AWF-YOLOv4 algorithm is 0.028, which is 0.011 lower than that of the YOLOv4 algorithm. The stable running times of the two algorithms are 3.1 s and 2.9 s, respectively. Thus, the AWF-YOLOv4 algorithm converges faster than the YOLOv4 algorithm and has a shorter running time. The AWF-YOLOv4 algorithm has advantages in both optimization time and error value because it adopts the SIoU loss, which improves the CS of the proposed method.

Fig. 7. Convergence values and runtime of YOLOv4 and AWF-YOLOv4 algorithms.


VRT uses data from real life to generate electronic signals through computer technology, which are combined with various devices to transform them into phenomena that can be perceived. Figs. 8(a) and 8(b) show the image data signals before and after processing with the image processing technique, respectively. The horizontal axis represents the sampling points of the test, and the vertical axis represents the amplitude of the data signal. The original signal contains a large amount of noise. After processing, the variations in the original data are largely preserved, and the resulting denoised signal tends to be smooth. Therefore, the wavelet coefficient threshold denoising method can effectively remove noise from the original image data while restoring the characteristic data of the original signal to the greatest extent.
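As an illustration of the wavelet coefficient threshold denoising mentioned above, the sketch below applies the PyWavelets library to a one-dimensional signal; the wavelet ('db4'), the decomposition level, and the universal-threshold rule are assumptions chosen for demonstration, not settings taken from the paper.

# Assumed wavelet coefficient threshold denoising (illustrative parameters).
import numpy as np
import pywt


def wavelet_denoise(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Universal threshold estimated from the finest detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    # Soft-threshold every detail level; keep the approximation coefficients.
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]


# Example: a noisy sine wave is smoothed while its overall shape is preserved.
t = np.linspace(0, 1, 1024)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)
clean = wavelet_denoise(noisy)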

Fig. 8. Vibration signals before and after wavelet coefficient threshold denoising.


To verify the application effect of the image recognition method, the study selects classic image detection algorithms for comparison, namely SSD, the faster region-based convolutional neural network (Faster R-CNN), and YOLOv8, and sets up different categories of images for detection. Figs. 9(a)-9(d) show the image detection mAP for four scenarios: natural scenery, urban roads, office environments, and plateau deserts. The horizontal axis represents the number of training epochs, and the vertical axis represents the mAP value of image detection for each algorithm. Overall, the AWF-YOLOv4 image detection algorithm has the highest mAP and SSD has the lowest, followed by Faster R-CNN and YOLOv8. In the image detection of natural scenery, urban roads, office environments, and plateau deserts, the mAP of the AWF-YOLOv4 image detection algorithm is 0.9056, 0.9143, 0.9106, and 0.9812, respectively.

Fig. 9. Image detection results for different categories.


Figs. 10(a)-10(d) show the image data signal processing results of the four image detection techniques, respectively. The horizontal axis represents the sampling points of the test, and the vertical axis represents the amplitude of the data signal. The figure shows that the AWF-YOLOv4 image detection algorithm has the smallest amplitude of image signal fluctuation, while the other methods exhibit larger fluctuations, with maximum amplitude reduction rates of 14.25%, 17.36%, and 22.36%, respectively. This may be because the image processing of the AWF-YOLOv4 image detection algorithm combines multi-scale and multi-stage image data features while calculating the loss value through the SIoU LF. Nevertheless, the image data signal processing results of all image detection techniques are within the confidence interval. Overall, the AWF-YOLOv4 image detection technique has significant advantages in practical image processing applications.

Fig. 10. Image data signal processing results of four image detection techniques.


Finally, the detection results are analyzed in different environments, including strong light, backlight, blurry targets, and normal targets. The corresponding sample images are shown in Fig. 11.

Fig. 11. Sample images in four different environments.


The detection results of each algorithm in the four environments of strong light, backlight, blurred targets, and normal targets are shown in Table 2. In these four environments, the mAP of the AWF-YOLOv4 image detection algorithm is about 91.5%, with a detection speed of about 0.120 s per image. The mAP of the other image detection algorithms is in the range of 80%-88%, with detection speeds of 0.088-0.105 s per image. Therefore, the AWF-YOLOv4 image detection algorithm has high DA in different environments.

Table 2. Four detection results for strong light, backlight, blurred targets, and normal targets.

Detection type   Model         Object mAP/%   Background mAP/%   mAP/%   Detection rate (s/piece)
Strong light     AWF-YOLOv4    92.0           89.9               89.0    0.123
                 YOLOv8        86.3           84.9               86.0    0.105
                 Faster R-CNN  85.0           84.3               85.2    0.095
                 SSD           81.7           85.9               76.0    0.089
Backlight        AWF-YOLOv4    91.4           89.3               89.4    0.120
                 YOLOv8        87.7           84.3               85.4    0.102
                 Faster R-CNN  84.4           83.7               84.6    0.092
                 SSD           81.6           20.4               47.6    0.086
Fuzzy target     AWF-YOLOv4    91.4           89.3               90.4    0.121
                 YOLOv8        87.7           84.3               85.4    0.103
                 Faster R-CNN  85.4           83.7               84.6    0.093
                 SSD           82.6           20.4               47.6    0.089
Normal target    AWF-YOLOv4    91.7           89.5               91.6    0.122
                 YOLOv8        86.0           84.5               85.6    0.104
                 Faster R-CNN  84.7           83.9               84.8    0.094
                 SSD           83.9           20.6               47.8    0.086

The study further uses the receiver operating characteristic (ROC) curve to evaluate the image detection performance of AWF-YOLOv4, with YOLOv8 and Faster R-CNN compared against it. The ROC curves of the algorithms are shown in Fig. 12. According to Fig. 12, the area under the curve (AUC) of the AWF-YOLOv4 algorithm is as high as 0.8869, while the AUC values of the YOLOv8 and Faster R-CNN algorithms are only 0.8385 and 0.8346, respectively, significantly lower than that of the AWF-YOLOv4 algorithm. A higher AUC value indicates that the image detection algorithm maintains high sensitivity and specificity under different classification thresholds. The AWF-YOLOv4 algorithm proposed in this study therefore has significant advantages in image detection.
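For reproducibility, an ROC curve and AUC of the kind reported above can be computed from per-detection confidence scores and binary correctness labels, as in the scikit-learn sketch below; the arrays are illustrative placeholders, not the study's data.

# Illustrative ROC/AUC computation from detection scores and labels.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])        # 1 = correct detection
y_score = np.array([0.91, 0.40, 0.78, 0.66, 0.52, 0.88, 0.31, 0.45])  # confidence

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))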

Fig. 12. ROC curves of various algorithms.


5. Conclusion

To achieve high accuracy and detection speed for image processing technology in virtual environments, an AWF-YOLOv4 object detection algorithm was proposed and applied to image recognition processing. Within the same placement scheme, the AWF module with a shared convolutional structure used fewer parameters but achieved lower average accuracy, especially in the first two schemes. The error values of the two algorithms gradually decreased as the number of iterations increased, converging after about 200 and 175 iterations, respectively. Meanwhile, the stable error value of the AWF-YOLOv4 algorithm was 0.028, which was 0.011 lower than that of the YOLOv4 algorithm. Compared with the YOLOv4 algorithm, the AWF-YOLOv4 algorithm introduced an adaptive weight fusion mechanism that could dynamically adjust the weights of feature maps. This allowed the network to fuse feature maps at different levels more effectively, thereby better capturing information at different scales and contexts and extracting fine-grained features, making AWF-YOLOv4 better at detecting small objects and handling complex backgrounds. The original signal contained a large amount of noise; after processing with the image processing technique, the variations in the original data were largely preserved and the resulting denoised signal was smooth. The AWF-YOLOv4 image detection algorithm had the highest mAP, while SSD had the lowest, followed by Faster R-CNN and YOLOv8. The mAP of the AWF-YOLOv4 image detection algorithm was about 91.5% in the four environments of strong light, backlight, blurred targets, and normal targets, with a detection speed of about 0.120 s per image, while the mAP of the other image detection algorithms was in the range of 80%-88%, with detection speeds of 0.088-0.105 s per image. In comparison with the alternative algorithms, the AWF-YOLOv4 image detection algorithm exhibits superior efficacy. This is because AWF-YOLOv4 employs an enhanced feature fusion mechanism, which is better able to discern minute details and delineate target boundaries in intricate settings; furthermore, the incorporation of SIoU enhances the algorithm's convergence rate. The multi-scale and multi-stage feature fusion network designed in this research has a strong feature extraction ability and high detection ability in the field of image recognition. However, there are still shortcomings: the dataset selected in the study contains relatively few object categories, and the application scenarios can be expanded in the future to further strengthen the detection performance of the model.

Acknowledgments

The research is supported by Fujian Province Education Science "The 14th Five-Year Plan" 2022 Project (FJJKGZ22-051).

References

[1] Tang X., Zhang T., 2021, Facial expression recognition algorithm based on convolution neural network and multi-feature fusion, Journal of Physics: Conference Series, Vol. 1883, pp. 012018.
[2] Wang Y., 2019, Multimodal emotion recognition algorithm based on edge network emotion element compensation and data fusion, Personal and Ubiquitous Computing, Vol. 23, No. 3-4, pp. 383-392.
[3] Li Y., He Z., Wang S., Wang Z., Huang W., 2021, Multideep feature fusion algorithm for clothing style recognition, Wireless Communications and Mobile Computing, Vol. 2021, No. 4, pp. 1-14.
[4] Huang Y., Tian K., Wu A., Zhang G., 2019, Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition, Journal of Ambient Intelligence and Humanized Computing, Vol. 10, No. 5, pp. 1787-1798.
[5] Pei M., Li H. R., Yu H., 2021, A novel three-stage feature fusion methodology and its application in degradation state identification for hydraulic pumps, Measurement Science Review, Vol. 21, pp. 123-135.
[6] Li X., Du Z., Huang Y., Tan Z., 2021, A deep translation (GAN) based change detection network for optical and SAR remote sensing images, ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 179, pp. 14-34.
[7] Chen L., Tang W., John N. W., Wan T. R., Zhang J. J., 2019, De-smokeGCN: Generative cooperative networks for joint surgical smoke detection and removal, IEEE Transactions on Medical Imaging, Vol. 39, No. 5, pp. 1615-1625.
[8] Esfahlani S. S., 2019, Mixed reality and remote sensing application of unmanned aerial vehicle in fire and smoke detection, Journal of Industrial Information Integration, Vol. 15, pp. 42-49.
[9] Zheng B., Yun D., Liang Y., 2020, Research on behavior recognition based on feature fusion of automatic coder and recurrent neural network, Journal of Intelligent and Fuzzy Systems, Vol. 39, No. 6, pp. 8927-8935.
[10] Zhang B. T., Wang X. P., Shen Y., Lei T., 2019, Dual-modal physiological feature fusion-based sleep recognition using CFS and RF algorithm, International Journal of Automation and Computing, Vol. 16, No. 3, pp. 286-296.
[11] Gou Y., Wang K., Wei S., Shi C., 2023, GMDA: GCN-based multi-modal domain adaptation for real-time disaster detection, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 31, No. 6, pp. 957-973.
[12] Zhao J., Yu L., Liu Z., 2021, Research based on multimodal deep feature fusion for the auxiliary diagnosis model of infectious respiratory diseases, Scientific Programming, Vol. 2021, No. 4, pp. 1-6.
[13] Wang Z., Zhen J., Li Y., Li G., Han Q., 2019, Multi-feature multimodal biometric recognition based on quaternion locality preserving projection, Chinese Journal of Electronics, Vol. 28, No. 4, pp. 789-796.
[14] Zhou H., Dong C., Wu R., Xu X., Guo Z., 2021, Feature fusion based on Bayesian decision theory for radar deception jamming recognition, IEEE Access, Vol. 9, pp. 16296-16304.
[15] Siriwardhana S., Kaluarachchi T., Billinghurst M., Nanayakkara S., 2020, Multimodal emotion recognition with transformer-based self supervised feature fusion, IEEE Access, Vol. 8, pp. 176274-176285.
[16] Yuan D., Shu X., Liu Q., Zhang X., He Z., 2023, Robust thermal infrared tracking via an adaptively multi-feature fusion model, Neural Computing and Applications, Vol. 35, No. 4, pp. 3423-3434.
[17] Xia S., Zhou X., Shi H., Li S., 2024, Hybrid feature adaptive fusion network for multivariate time series classification with application in AUV fault detection, Ships and Offshore Structures, Vol. 19, No. 6, pp. 807-819.
[18] Xie J., Pang Y., Pan J., Nie J., Cao K., Han J., 2023, Complementary feature pyramid network for object detection, ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 19, No. 6, pp. 1-15.
[19] Huo L., Zhu J., Singh P. K., Pavlovich P. A., 2021, Research on QR image code recognition system based on artificial intelligence algorithm, Journal of Intelligent Systems, Vol. 30, No. 1, pp. 855-867.
[20] Wang B., Wang Y., Cui L., 2020, Fuzzy clustering recognition algorithm of medical image with multi-resolution feature, Concurrency and Computation: Practice and Experience, Vol. 32, No. 1, pp. e4886.
[21] Bandewad G., Datta K. P., Gawali B. W., Pawar S. N., 2023, Review on discrimination of hazardous gases by smart sensing technology, Artificial Intelligence and Applications, Vol. 1, No. 2, pp. 86-97.
[22] Gheisari M., Hamidpour H., Liu Y., Saedi P., Raza A., Jalili A., Rokhsati H., Amin R., 2023, Data mining techniques for web mining: A survey, Artificial Intelligence and Applications, Vol. 1, No. 1, pp. 3-10.

Author

Daogui Lin

Daogui Lin is pursuing a Ph.D. degree in computer science from University Malaysia Sabah. Currently, he serves as an associate professor at the Fujian Polytechnic of Information Technology and is a leading figure in the field of "Multimedia Design and Production" in Fujian Province. He has served as an expert consultant for the Taiwan, Hong Kong, and Macau Affairs Office of the Fujian Provincial People's Government and a VR judge for the World Vocational College Skills Competition. He has authored a national-level textbook, "Photoshop CC Visual Design Case Course," and has led a provincial-level high-quality online course and multiple research projects. He has been awarded the title of "Excellent Guidance Teacher" in the National Digital Art Design Competition (NCDA) and as a core member, he has won the Grand Prize in the Provincial Teaching Achievement Award. His research areas include virtual reality, image processing, artificial intelligence fundamentals, and computer fundamentals teaching.