
  1. (Computer Graphics and Vision Lab, Handong Global University, South Korea)
  2. (School of Computer Science and Electrical Engineering, Handong Global University, South Korea)



Keywords: Visual SLAM, Deep learning, Neural networks, Computer vision

1. Introduction

SLAM (Simultaneous Localization and Mapping) is a technology in which agents such as robots or autonomous vehicles navigate unknown environments while simultaneously building maps of those environments. It plays a crucial role in autonomous driving and smart-industry applications and is useful in scenarios ranging from indoor spaces to outdoor environments. In indoor environments, where GPS signals may be limited, SLAM enables autonomous vehicles or robots to accurately determine their positions and navigate. In outdoor settings such as building exteriors or roadways, it likewise supports precise localization and map generation.

SLAM utilizes a variety of sensors such as cameras, LiDAR, IMU, and GPS to acquire data. Among these sensors, SLAM that employs vision-based sensors like cameras is referred to as Visual SLAM. Utilizing vision-based sensors offers advantages such as reducing hardware costs, facilitating object detection and tracking, and providing rich visual and semantic information.

However, this approach suffers from inherent challenges. Visual SLAM is susceptible to environmental factors such as changes in lighting conditions, the presence of moving dynamic objects, rapid camera movements, and environments with irregular structures or limited textures, all of which can adversely affect performance. Efforts to address these challenges generally fall into three main categories.

The first leverages features. This involves identifying salient regions in images or video frames and extracting features from them. Features are typically selected based on visual attributes such as intensity, color, and edges. More recently, geometric features such as lines and planes have been used in addition to points, and deep learning has been applied both to extract features and to exploit other visual elements as features.

Another approach utilizes objects. VSLAM typically assumes a largely static environment, but the real world often contains moving or changing elements such as people and cars. Dynamic objects can reduce accuracy during camera pose estimation and map creation. Consequently, researchers are developing methods to distinguish and handle static structure and dynamic objects; these efforts include not only removing dynamic objects but also integrating them appropriately to maintain the environmental model.

Utilizing 3D scene understanding is a third approach. Researchers analyze images or video frames captured by cameras and infer the meaning and structure of 3D space using techniques such as semantic segmentation and Neural Radiance Fields (NeRF) [45], aiming to provide VSLAM with richer semantics and thereby enhance its understanding and perception of the environment.

This paper is structured as follows: Section 2 provides an overview of the overall structure of Visual SLAM and the factors that hinder its performance. Section 3 introduces research conducted from the perspectives of features, object detection, and 3D scene understanding to address the persistent problems of Visual SLAM mentioned earlier. Section 4 draws conclusions.

2. Related Work

2.1. Visual SLAM System

Fig. 1 depicts the structure of the Visual SLAM system. It consists of five key components: Camera Sensor, Tracking, Optimization, Loop Closure, and Mapping.

Fig. 1. Visual SLAM flowchart.


• Camera sensor

The camera sensor is responsible for collecting image data. Cameras are categorized into monocular, stereo, and RGB-D types.

A monocular camera uses a single lens, offering low cost and a lightweight design. However, it cannot directly estimate landmark depth, and it may suffer from scale ambiguity during map construction.

A stereo camera uses two lenses and obtains depth through calibration, rectification, matching, and triangulation. While it can acquire depth both indoors and outdoors, it suffers from a high computational load.
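For reference, the basic relation used to recover depth from a rectified stereo pair is given below, where f is the focal length (in pixels), B the baseline between the two lenses, and d the disparity of a matched pixel; this is a standard relation rather than a detail of any particular system discussed here.

```latex
% Depth of a matched point from a rectified stereo pair
% f: focal length (pixels), B: baseline (meters), d: disparity (pixels)
Z = \frac{f \, B}{d}
```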

An RGB-D camera measures depth in real time using structured light or Time-of-Flight (ToF) sensors. However, its measurement range can be limited, and it can be difficult to use in outdoor environments.

• Tracking

Tracking includes the concept of visual odometry (VO), which determines the position and orientation of the camera by analyzing camera images. It operates through stages such as feature extraction, matching, and motion/pose estimation, and is primarily divided into two approaches: the indirect (feature-based) method and the direct method. Table 1 summarizes the differences between the two.

Table 1. Comparison of the indirect and direct methods.

Criterion | Indirect method | Direct method
Approach | Feature | Pixel value
Illumination change | Robust | Vulnerable
Computational cost | Low | High
Motion blur | Vulnerable | Robust
Low-texture area | Vulnerable | Robust
Dynamic scene | Robust | Vulnerable
Cost function | Reprojection error | Photometric error

The indirect method performs analysis using features, i.e., distinctive points in images. It is robust to changes in lighting conditions, object motion, and dynamic environmental changes, and it demands fewer computations. However, it may become unstable under significant motion blur or in texture-less environments, and the reconstructed 3D maps are relatively sparse.

The direct method processes images using the pixel values themselves, analyzing brightness variations between pixels. It is robust to motion blur and low-texture environments and can generate denser 3D maps. However, it is vulnerable to changes in lighting conditions, is unstable in dynamic scenes, and demands more computational resources because it uses all pixel values in the image.
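To make the contrast in Table 1 concrete, the two cost functions can be written as follows (a standard formulation, not specific to any single system surveyed here). Here T is the relative camera pose, pi the camera projection, X_i a 3D landmark matched to image point x_i, and d_p the depth associated with pixel p.

```latex
% Indirect (feature-based): reprojection error over matched features
E_{\text{reproj}} = \sum_{i} \left\| \mathbf{x}_i - \pi\!\left(\mathbf{T}\,\mathbf{X}_i\right) \right\|^2

% Direct: photometric error over pixels, comparing intensities of two images
E_{\text{photo}} = \sum_{\mathbf{p}} \left\| I_1(\mathbf{p}) - I_2\!\left(\pi\!\left(\mathbf{T}\,\pi^{-1}(\mathbf{p}, d_{\mathbf{p}})\right)\right) \right\|^2
```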

• Optimization

Optimization refines the camera position and orientation estimated during tracking so that the predicted pose becomes more accurate. Two families of methods are used: filter-based methods and graph-based methods.

Filter-based optimization: This method deals with probabilistic system state estimation. The most prominent filter is the Kalman Filter. Fig. 2 illustrates a Flow Chart for the Kalman Filter. The Kalman Filter is applied to linear systems. It predicts the state using a linear model after setting the initial estimate and uncertainty and updates using observed data. The Extended Kalman Filter is an extended version applicable to nonlinear systems, approximating nonlinear functions to perform prediction and update. The Particle Filter represents probability distributions with particles for state estimation and is flexible for application to nonlinear and non-Gaussian systems.

Fig. 2. Kalman filter.

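As a minimal illustration of the predict-update cycle shown in Fig. 2, the sketch below implements a linear Kalman filter with NumPy; the matrices and noise values are illustrative placeholders rather than parameters of any SLAM system discussed here.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict-update cycle of a linear Kalman filter.
    x: state estimate, P: state covariance, z: new measurement,
    F: state transition, H: measurement model, Q/R: process/measurement noise."""
    # Prediction: propagate the state and its uncertainty through the linear model
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the observed data
    y = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Illustrative usage: constant-velocity model with a position-only measurement
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # position-velocity transition
H = np.array([[1.0, 0.0]])              # only position is observed
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x, P = np.zeros(2), np.eye(2)
x, P = kalman_step(x, P, np.array([0.05]), F, H, Q, R)
```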

Graph-based optimization: This includes Factor Graph Optimization (FGO), Pose Graph Optimization (PGO), and Bundle Adjustment (BA). These methods utilize graph structures composed of nodes and edges to perform optimization, employing nonlinear optimization techniques. FGO utilizes a structure consisting of variables and factors, while PGO models camera poses as nodes and movements as edges. Fig. 3 describes the composition of the Pose Graph.

Fig. 3. Pose graph.


BA adjusts the camera poses and landmark positions using input images and feature points, minimizing reprojection errors.
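The objective minimized by bundle adjustment can be stated compactly as below, using the same reprojection-error notation as before; the camera poses T_i and 3D landmarks X_j are refined jointly over all observations x_ij (a standard formulation).

```latex
% Bundle adjustment: jointly refine camera poses T_i and landmarks X_j
% by minimizing the reprojection error of all observations x_{ij}
\min_{\{\mathbf{T}_i\},\,\{\mathbf{X}_j\}} \; \sum_{i,j} \left\| \mathbf{x}_{ij} - \pi\!\left(\mathbf{T}_i \mathbf{X}_j\right) \right\|^2
```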

• Loop closure

In Fig. 4, the role of Loop Closure in closed-loop scenarios is illustrated. Loop Closure verifies whether the current location has been visited previously and corrects accumulated position estimation errors as the robot moves. Typically, Loop Closure is performed using the Bag-of-Words (BoW) model. The BoW model clusters extracted keypoints from images to form groups of similar keypoints, known as “visual words.” Each image is represented by a vector recording the frequency of occurrence of corresponding visual words, allowing measurement of similarities between images. If the current location is close to a previously visited place, it is considered a Loop Closure candidate, and if necessary, the robot’s position estimation is corrected.

Fig. 4. Loop closure.

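The sketch below illustrates the Bag-of-Words idea described above: local descriptors are quantized against a vocabulary of visual words, each image becomes a word-frequency vector, and cosine similarity between vectors flags loop-closure candidates. The vocabulary size, threshold, and helper names are illustrative assumptions, not part of any specific BoW library.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary (k x d array)
    and return an L2-normalized word-frequency histogram."""
    # Assign each descriptor to its nearest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

def is_loop_candidate(hist_query, hist_visited, threshold=0.8):
    """Cosine similarity between word histograms; high similarity suggests a revisit."""
    return float(np.dot(hist_query, hist_visited)) >= threshold

# Illustrative usage with random data standing in for real ORB/SIFT descriptors
vocab = np.random.rand(100, 32)          # 100 visual words of dimension 32
desc_a = np.random.rand(500, 32)         # descriptors from the current frame
desc_b = np.random.rand(480, 32)         # descriptors from a stored keyframe
candidate = is_loop_candidate(bow_histogram(desc_a, vocab), bow_histogram(desc_b, vocab))
```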

• Mapping

Mapping is the process of creating a detailed representation of the surrounding environment, known as a map. Maps are broadly classified into metric maps and topological maps. Fig. 5 illustrates the two types.

A metric map accurately models the surrounding environment in 3D space, representing actual physical distances and directions. It primarily uses 3D point clouds, grids, or other geometric structures to represent the environment. Because it contains detailed spatial information, it is useful for determining precise locations and exploring the surroundings. Metric maps can be classified into sparse maps, which contain limited information, and dense maps, which provide detailed and comprehensive information. Typically, map creation involves setting the intrinsic and extrinsic parameters of the camera, finding corresponding point pairs in two images, restricting possible locations using epipolar constraints, computing 3D coordinates through triangulation, and collecting these 3D points into a point cloud that reconstructs the structure of the environment.
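Assuming OpenCV is available, the fragment below sketches the last steps of the metric-map pipeline just described: given the projection matrices of two calibrated views and a set of matched pixel coordinates, points are triangulated into 3D and collected into a point cloud. Matching and epipolar filtering are assumed to have been done beforehand; the intrinsics and poses shown are illustrative values.

```python
import numpy as np
import cv2

def triangulate_matches(K, pose1, pose2, pts1, pts2):
    """Triangulate matched pixels (N x 2 arrays) from two calibrated views.
    K: 3x3 intrinsics; pose1/pose2: 3x4 [R|t] extrinsics of each camera."""
    P1 = K @ pose1                      # projection matrix of view 1
    P2 = K @ pose2                      # projection matrix of view 2
    # OpenCV expects 2xN arrays of pixel coordinates
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T.astype(np.float64),
                                  pts2.T.astype(np.float64))
    return (pts4d[:3] / pts4d[3]).T     # homogeneous -> Euclidean, N x 3 point cloud

# Illustrative setup: identity first camera, second camera translated along x
K = np.array([[525.0, 0, 320.0], [0, 525.0, 240.0], [0, 0, 1.0]])
pose1 = np.hstack([np.eye(3), np.zeros((3, 1))])
pose2 = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])
pts1 = np.array([[300.0, 200.0], [350.0, 260.0]])
pts2 = np.array([[310.0, 200.0], [361.0, 260.0]])
cloud = triangulate_matches(K, pose1, pose2, pts1, pts2)
```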

Fig. 5. Metric and topological map comparison.


A topological map represents the surrounding environment using connected nodes and edges, primarily depicting the structure of and relationships within the environment. Nodes typically represent places or locations, while edges indicate connections between them, allowing robots to identify paths from one place to another. Topological maps provide abstracted information about the surrounding environment and do not include detailed spatial information.

Semantic mapping combines geometric information with semantic knowledge of the environment. This involves identifying objects through object recognition and segmenting images into object instances via semantic segmentation, providing object or structural information corresponding to each pixel. This information is integrated with SLAM systems to better understand the environment.

Implicit mapping involves implicitly representing the neural map of the environment to encode both geometric and semantic information. It captures environmental characteristics using two methods: deep autoencoders and neural rendering-based scene representation. Deep autoencoders compress input data into high-level abstract representations to implicitly capture key features of the environment. In contrast, neural rendering learns and models the 3D structure of the environment to reconstruct scenes. These methods are useful for understanding the surrounding environment and extracting valuable information without explicitly representing the environment’s features.
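As a minimal, illustrative sketch of the neural-rendering style of implicit mapping described above (assuming PyTorch; the layer sizes are arbitrary choices, not taken from any cited system), a small MLP maps a 3D coordinate to a color and an occupancy value:

```python
import torch
import torch.nn as nn

class ImplicitMap(nn.Module):
    """Tiny coordinate network: 3D point -> (RGB color, occupancy).
    Illustrative only; real systems add positional encodings, grids, etc."""
    def __init__(self, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.color_head = nn.Linear(hidden, 3)       # predicted RGB
        self.occupancy_head = nn.Linear(hidden, 1)   # predicted occupancy

    def forward(self, xyz):
        feat = self.backbone(xyz)
        color = torch.sigmoid(self.color_head(feat))
        occupancy = torch.sigmoid(self.occupancy_head(feat))
        return color, occupancy

# Query the implicit map at a batch of 3D points (placeholder values)
model = ImplicitMap()
points = torch.rand(1024, 3)
color, occupancy = model(points)
```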

2.2. Degrading factors of Visual SLAM

In the field of SLAM, several challenges arise primarily from factors such as lighting changes, the presence of dynamic objects, rapid camera movements, low-texture environments, and unstructured environments.

Natural phenomena or artificial factors can induce lighting changes, causing abrupt alterations in image brightness. These changes can compromise the consistency of feature matching, potentially leading to errors in the SLAM process.

Dynamic objects in images also pose a challenge. While static elements provide a stable reference, dynamic objects can disrupt this stability with rapid or unpredictable movements, making camera pose tracking difficult and distorting the map.

Rapid camera motion introduces shaking and distortion in images, leading to data loss and inconsistency between consecutive images, necessitating high computational capabilities to adapt to swift environmental changes.

Low-texture environments, like indoor settings, lack discernible color or brightness differences and are characterized by repetitive or monotonous patterns. Techniques such as line detection are used to overcome these limitations.

Unstructured environments, typically encountered outdoors, lack fixed roads, paths, or structures. Irregular terrain, various obstacles, and absence of structure make it challenging for robots or autonomous systems to navigate or operate. Interpreting sensor data and modeling the environment for SLAM systems become difficult in such environments.

3. Recent Advancements in Visual SLAM

This section introduces recent research efforts aimed at addressing the challenges that degrade the performance of Visual SLAM, as discussed in Subsection 2.2.

3.1. Improvement Using Various Features

In early research on VSLAM, point features were widely utilized [1-6]. Point features offer simplicity and fast processing, making them suitable for real-time applications. Well-known detectors for such point features are referenced in [7-14]. However, point features have limitations: in particular, they struggle to yield sufficient features in environments with rapid changes in lighting conditions or fast camera rotations.

Accordingly, approaches that utilize line features have been proposed [15-20]. Line features enable feature extraction even in texture-less environments, but several challenges must be addressed to use them. Processing line features requires a significant amount of time, so some studies introduce the Line Segment Detector (LSD) to minimize the extraction time [15,16,20]. Efforts have also been made to reduce the number of parameters required for optimization by utilizing the Plücker matrix when employing line features [17]; representing the geometric information of 3D lines in Plücker coordinates increases computational efficiency and stability compared to the traditional endpoint representation. In addition, during mapping with line features, degeneracy occurs when lines have identical or similar direction vectors, resulting in a loss of information diversity; studies have addressed this by utilizing vanishing points to correct direction vectors [18,19].
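For reference, a 3D line through two points A and B can be written in Plücker coordinates as the pair of a direction vector and a moment vector, a standard six-parameter representation of which only four degrees of freedom are independent:

```latex
% Plücker coordinates of the 3D line through points A and B:
% direction vector d and moment vector m
\mathbf{L} = (\mathbf{d}, \mathbf{m}), \qquad
\mathbf{d} = \mathbf{B} - \mathbf{A}, \qquad
\mathbf{m} = \mathbf{A} \times \mathbf{B}
```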

Additionally, attempts to incorporate plane features have been proposed [21-23]. Planes are relatively stable features, making them effective in indoor or artificial environments. However, plane detection is limited by the irregularity of natural terrain and the variety of elements in the environment. UPLP-SLAM [21] considers both homogeneous feature correspondences (point-to-point, line-to-line, plane-to-plane) and heterogeneous correspondences (point-to-line, point-to-plane, line-to-plane). Structure PLP-SLAM [22] proposes various optimization techniques to overcome the challenge of reconstructing geometric elements at a consistent scale. Fig. 6 shows a structured scene with 2D features, orthogonal lines, and planes; the results are obtained by running PlanarSLAM [58] on the TUM RGB-D [60] dataset.

Fig. 6. Example with point, line, plane features in a structured scene.


The advancement of deep learning has led to remarkable achievements in various image processing tasks. In particular, convolutional neural networks (CNNs) demonstrate outstanding performance in image feature extraction: a CNN efficiently learns local patterns and features of images through convolution and pooling operations and is widely used for image recognition, classification, object detection, and segmentation. With these developments, attempts have been made to replace the keypoint extraction stage of traditional VSLAM. Deep learning-based VSLAM uses trained neural networks to extract features from images, exploiting more information than the traditional keypoint extraction process and thereby demonstrating robust performance under various conditions.
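To illustrate the general idea of learning-based feature extraction (assuming PyTorch; this is not the architecture of any of the methods cited in the following paragraph), the sketch below shows a small convolutional network that produces a dense descriptor map, from which per-keypoint descriptors can be sampled:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDescriptorNet(nn.Module):
    """Illustrative CNN that turns an RGB image into a dense map of
    L2-normalized local descriptors (one 64-D vector per spatial cell)."""
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, padding=1),
        )

    def forward(self, image):
        desc = self.features(image)              # B x dim x H/4 x W/4
        return F.normalize(desc, dim=1)          # unit-length descriptors

# A descriptor for a keypoint is read off at its (downsampled) location
net = TinyDescriptorNet()
img = torch.rand(1, 3, 480, 640)                 # placeholder image tensor
desc_map = net(img)
kp_desc = desc_map[0, :, 60, 80]                 # descriptor at one keypoint cell
```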

There are attempts to replace the feature extraction part in the ORB-SLAM2 [6] framework with CNNs [24-28]. In DF-SLAM [24], a TFeat network is proposed, designed with a thin and efficient structure to extract feature vectors of fixed size from image patches. In DXSLAM [25], an HF-Net is proposed, which is robust and efficient in extracting both local and global features. Reference [26] introduces SAFT, a learning-based descriptor that uses dynamic feature transformation. Lift-SLAM [27] achieves denser and more accurate matching results using the LIFT network. Reference [28] proposes a multitask feature extraction network that generates both keypoints and their descriptors simultaneously to enhance performance.

Research has also been proposed to improve performance by utilizing artificial structures or information [29-32]. This enhances the diversity and strength of the features extracted in challenging environments, enabling the system to recognize and track more accurately; the drawback is that artificial visual information must be prepared in the environment in advance. TagSLAM [29] employs markers called AprilTags, which consist of unique square patterns easily detectable by cameras. SPM-SLAM [30] uses square planar markers, with the constraint that at least two markers must be present in a single frame for operation. UcoSLAM [31] combines natural landmarks (keypoints) with artificial landmarks (square markers), automatically computing the scale of the surrounding environment. Additionally, TextSLAM [32] leverages text information, detecting text in images and generating high-quality 3D text maps. Table 2 summarizes the feature types used in SLAM, including their strengths, limitations, and key techniques.

Table 2. Comparison of feature types in VSLAM.

Feature type | Ref | Advantages | Disadvantages | Key technologies | Applications
Point features | [1-6] | Simple and fast processing | Struggles with lighting changes, fast rotations, and textureless environments | Well-known feature detectors referenced in [7-14] | Real-time applications
Line features | [15-20] | Extract features in texture-less environments | Time-consuming processing, similar direction vector issues | LSD, Plücker matrix, vanishing points | Mapping in VSLAM under texture-less conditions
Plane features | [21-23] | Stable features, effective in artificial environments | Limited by irregularities of natural terrain | - | SLAM in indoor or structured environments
Deep learning-based features | [24-28] | Superior performance in image processing tasks | Requires extensive training and computational resources | CNN | Image recognition, classification, object detection
Artificial structures | [29-32] | Enhanced feature diversity and strength | Requires prior preparation | AprilTags, square planar markers, text detection | SLAM in environments with prepared visual markers

3.2. Improvement Using Object Detection

In the early stages of Visual SLAM, object-related research mostly relied on methods using optical flow. Optical flow is a technique for tracking the motion of objects at the pixel level in images, primarily used for detecting and tracking moving objects. However, optical flow has the drawback of operating at the individual pixel level, thereby not considering the structure or semantics of objects. The advancement of deep learning has overcome these limitations and introduced new methods for more accurate object detection and tracking. With the introduction of technologies such as semantic segmentation and instance segmentation, objects can now be recognized and distinguished more accurately.
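A minimal example of the pixel-level optical-flow approach described above (assuming OpenCV; the threshold is an arbitrary illustrative value) estimates dense flow between two grayscale frames and marks pixels whose motion magnitude is large as potentially dynamic:

```python
import numpy as np
import cv2

def moving_pixel_mask(prev_gray, curr_gray, mag_threshold=2.0):
    """Dense Farneback optical flow between two grayscale frames; pixels with
    flow magnitude above the threshold are flagged as potentially dynamic."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    return magnitude > mag_threshold     # boolean mask of moving pixels

# Illustrative usage with two synthetic frames standing in for camera images
prev_gray = np.zeros((240, 320), dtype=np.uint8)
curr_gray = np.roll(prev_gray, 3, axis=1)
mask = moving_pixel_mask(prev_gray, curr_gray)
```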

Objects detected in images are classified into static and dynamic objects. Static objects represent entities that do not change, typically providing stability in position estimation and enhancing the accuracy of SLAM. On the other hand, dynamic objects are those that move or change, and the motion of such objects can degrade the consistency of the map and introduce errors in position estimation. Consequently, there are various studies focusing on handling dynamic objects.

Efforts to minimize the influence of dynamic objects by removing them have been proposed [33-36]. DS-SLAM [33] combines a real-time semantic segmentation network, SegNet, with an optical flow algorithm to remove dynamic objects. Dynamic-SLAM [34] detects dynamic objects using the Single Shot MultiBox Detector (SSD), treats them as outliers, and removes the corresponding keypoints in those areas. DP-SLAM [35] employs a Bayesian probabilistic propagation model to iteratively update probabilities in real time for dynamic object removal. SimVODIS++ [36] estimates the relative poses and depth maps between input image frames to generate static background information, which is then compared with the object detection results of the current frame to remove dynamic objects.
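The common removal step shared by the systems above can be sketched generically as follows (an illustrative distillation, not the implementation of any cited system): keypoints that fall inside a dynamic-object mask, obtained from a detector or segmentation network, are discarded before tracking.

```python
import numpy as np
import cv2

def filter_dynamic_keypoints(gray, dynamic_mask):
    """Detect ORB keypoints and drop those lying on pixels marked as dynamic.
    dynamic_mask: boolean H x W array from a segmentation/detection front end."""
    orb = cv2.ORB_create()
    keypoints = orb.detect(gray, None)
    static_kps = [kp for kp in keypoints
                  if not dynamic_mask[int(kp.pt[1]), int(kp.pt[0])]]
    return static_kps

# Illustrative usage: a random image and a mask covering its left half
gray = np.random.randint(0, 255, (240, 320), dtype=np.uint8)
mask = np.zeros((240, 320), dtype=bool)
mask[:, :160] = True                     # pretend a moving object occupies this region
static_keypoints = filter_dynamic_keypoints(gray, mask)
```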

There are also attempts to utilize dynamic objects without removing them [37-39]. CubeSLAM [37] improves camera pose estimation accuracy by leveraging motion model constraints that track the movements of individual objects represented as 3D cuboids, rather than treating dynamic objects as outliers. DynaSLAM II [38] tracks dynamic objects using instance semantic segmentation and ORB features, jointly optimizing the structure of the static scene, the trajectories of dynamic objects, and the camera poses. Moving SLAM [39] assumes the entire scene to be non-rigid but approximates small regions as rigid to infer and predict the motion of moving objects.

There are also attempts to utilize objects to create maps [40-42]. Object maps enable robots to effectively understand and interact with their surroundings, accurately represent structured indoor environments, enhance the precision of SLAM systems, and enable efficient data management for accurate localization. Reference [40] generates object landmarks using semantic information to obtain consistently maintained points, estimating the positions and depth information of objects through alignment between camera images and the model. Reference [41] combines ORB-SLAM2 [6] and Mask R-CNN [43] to recognize objects in 2D images, model them in 3D space, and create semantic point cloud maps. SO-SLAM [42] generates object-level maps that represent the environment using quadrics. Table 3 summarizes these methods for handling dynamic objects in SLAM, which either remove dynamic objects or exploit them for more robust map creation and object tracking.

Table 3. Dynamic object handling in VSLAM.

Aspect | Ref | Description | Key technologies
Dynamic object removal | DS-SLAM [33] | Combines SegNet and optical flow algorithms to remove dynamic objects | SegNet, optical flow
| Dynamic-SLAM [34] | Uses SSD to detect dynamic objects and treats them as outliers, removing keypoints in those areas | SSD
| DP-SLAM [35] | Uses a Bayesian probabilistic propagation model to update probabilities in real time for removal | Bayesian probabilistic propagation model
| SimVODIS++ [36] | Estimates relative poses and depth maps between input image frames to generate static background information and compares it with current-frame object detection results to remove dynamic objects | Relative pose and depth map estimation
Dynamic object utilization | CubeSLAM [37] | Does not treat dynamic objects as outliers; represents them as 3D cuboids and leverages motion model constraints | 3D cuboid representation
| DynaSLAM II [38] | Tracks dynamic objects using instance semantic segmentation and ORB features, optimizing static scene structure and dynamic object trajectories along with camera poses | Instance semantic segmentation, ORB features
| Moving SLAM [39] | Assumes the entire scene to be non-rigid but approximates rigid regions in small areas to infer and predict moving object motion | Rigid region approximation
Map creation using objects | [40] | Object landmark creation | Semantic information
| [41] | Semantic point cloud map creation | ORB-SLAM2, Mask R-CNN
| [42] | Object-level map creation | Quadrics

3.3. 3D Scene Understanding

In SLAM technology, 3D point clouds have traditionally been employed to model the surrounding environment and estimate one’s own position. However, this approach is limited by the resolution and density of the sensor data and is vulnerable to environmental factors.

Accordingly, various efforts have been made to understand the semantics and structure of 3D space. Initially, attempts were made to do so through semantic segmentation [33,44]. Semantic segmentation classifies each pixel in an image into specific classes or meaningful regions to segment the environment, allowing the elements in a space to be identified and used to estimate more accurate positions and orientations. In [33], a semantic octree map is used to hierarchically store 3D space, where each cube contains information such as position, color, and texture. Reference [44] generates maps including semantic elements such as lane markings, crosswalks, ground traffic signals, and stop lines on road surfaces, and updates them through crowdsourcing.

Recently, neural network-based models like NeRF have emerged to enhance the representation of 3D maps. NeRF offers several advantages over traditional 3D point clouds. It can generate high-resolution 3D maps without being constrained by sensor resolution or data density. Additionally, it is more robust to environmental factors such as noise and lighting variations, estimating detailed information about light sources and surface properties for enhanced visual effects. Consequently, incorporating NeRF into SLAM has been proposed to represent scenes, aiming to build 3D models and color information while maintaining accurate positional information from consecutive image frames. SLAM research utilizing NeRF employs implicit neural representation to construct 3D maps and perform tracking [46-57]. Neural implicit representation uses neural networks to represent spatial information implicitly: MLPs map 3D points to occupancy or color values, and the scene is optimized through these networks. To achieve this, grid structures such as voxel grids are often used, enabling the network to predict attributes for each point.

iMAP [46] employs a single MLP to represent the entire scene, optimizing computational speed and resource usage by processing only the necessary data in required areas through guided pixel sampling based on dynamic information. This approach enables tracking speeds of up to 10 Hz and global map updates at 2 Hz. Nice-SLAM [47] organizes multiple MLPs into a hierarchical structure, with each MLP corresponding to different levels and types of local information, representing scenes at multiple levels of detail. Furthermore, it enhances tracking accuracy and stability by reducing the impact of dynamic objects, filtering out pixels with large depth/color rendering loss.

iSDF [48] utilizes Signed Distance Fields (SDF) to represent 3D maps, which are then used for collision detection and path planning in robotics. By employing neural networks, it estimates signed distances for input 3D coordinates, enabling the understanding of surrounding environment shapes and distances to obstacles. Additionally, it provides a more accurate map through adaptive levels of detail and by reducing noise in the observations, thereby improving navigation and tracking performance. iDF-SLAM [49] predicts a T-SDF (Truncated Signed Distance Function) using a single MLP, considering the 3D positions and viewing directions of input points. The T-SDF encodes distances from points and truncates values beyond a certain distance, reducing memory usage while aiding in the creation of accurate 3D models. Additionally, it utilizes pairwise point cloud registration networks to align point clouds captured at different times or locations, enhancing camera tracking accuracy. ESLAM [50] enables accurate and fast 3D map generation through a hybrid representation consisting of multi-scale axis-aligned feature planes and shallow decoders. This approach leverages implicit neural representation and TSDF to convert each point's features into TSDF and RGB values, providing high-quality 3D reconstruction. Additionally, it improves tracking performance by utilizing a global loss function and the Adam optimizer, while excluding outlier pixels and rays without ground-truth depth during optimization.

MeSLAM [51] introduces a network distribution strategy in which multiple MLPs are assigned to different regions to process specialized information. This allows each neural network to update and adjust its designated area independently, enabling adaptation to new information or changes. Moreover, by decomposing large scenes and stitching them during reconstruction, it maintains high accuracy and consistency throughout the map-building process. Vox-fusion [52] combines neural implicit representation with traditional volumetric fusion methods to encode and optimize scenes within each voxel, constructing a 3D map. Additionally, it utilizes an octree-based structure to dynamically expand scenes and detects object instances to enhance tracking performance. vMAP [53] constructs a SLAM system using an MLP neural field model, enabling efficient and accurate modeling of each object. Additionally, it leverages vectorized training to simultaneously optimize multiple objects, and it detects object instances within the scene and continuously updates them to enhance tracking performance. Co-SLAM [54] utilizes a multi-resolution hash-grid and one-blob encoding to achieve high convergence speed and accurate 3D map generation. Additionally, by combining joint coordinate and sparse-parametric encodings, it performs global bundle adjustment, achieving efficient memory usage and robust tracking performance without the need to maintain keyframe selection.

NeRF-SLAM [55] combines dense monocular SLAM with NeRF. By accurately estimating camera poses and generating depth maps, it constructs the scene's neural radiance field, providing enhanced 3D scene reconstruction by exploiting the associated uncertainty information. Orbeez-SLAM [56] integrates ORB features with NeRF to build 3D maps: ORB features contribute to initial pose estimation by identifying distinctive points in the camera's surroundings, and NeRF then predicts color and density values for these points to construct a dense 3D map. Nicer-SLAM [57] enables high-fidelity 3D scene reconstruction and novel view synthesis by utilizing hierarchical neural implicit representation to construct 3D maps, and it enhances tracking performance by incorporating additional information such as monocular geometric cues, optical flow, and warping loss to strengthen geometric consistency. Fig. 7 compares rendering results from NeRF-SLAM, Nice-SLAM, and Nicer-SLAM, highlighting differences in scene reconstruction quality. Table 4 presents a comparative overview of recent work integrating SLAM with NeRF, categorized by core functionality: scene representation and optimization, distance fields and collision detection, volumetric and implicit representations, and integration with feature-based methods.
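For context, the per-ray rendering rule that these NeRF-style SLAM systems build on composites the colors c_i and densities sigma_i predicted by the network at samples along a camera ray (with delta_i the spacing between samples), following the original NeRF formulation [45]:

```latex
% Volume rendering along a ray with samples i = 1..N:
% c_i, \sigma_i are the color and density at sample i, \delta_i the sample spacing
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)
```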

Fig. 7. Novel view synthesis results on the Replica dataset [59].


Table 4. Integration of SLAM and NeRF.

Aspect | Ref | Description | Key technologies
Scene representation and optimization | iMAP [46] | Employs a single MLP to represent the entire scene, optimizing computational speed and resource usage through guided pixel sampling based on dynamic information | Single MLP, guided pixel sampling based on dynamic information
| Nice-SLAM [47] | Organizes multiple MLPs into a hierarchical structure, representing scenes at multiple levels of detail and enhancing tracking accuracy and stability | Hierarchical MLPs, multi-level detail
| MeSLAM [51] | Introduces a network distribution strategy where multiple MLPs process specialized information independently, maintaining high accuracy and consistency | Network distribution strategy
Distance fields and collision detection | iSDF [48] | Utilizes Signed Distance Fields (SDF) for collision detection and path planning, estimating signed distances for input 3D coordinates | Signed Distance Fields (SDF)
| iDF-SLAM [49] | Predicts a T-SDF using a single MLP, reducing memory usage while aiding in the creation of accurate 3D models and enhancing camera tracking accuracy | T-SDF, pairwise point cloud registration network
| ESLAM [50] | Enables accurate and fast 3D map generation through a hybrid representation and improves tracking performance using a global loss function and the Adam optimizer | Hybrid representation, T-SDF, global loss function, Adam optimizer
Volumetric and implicit representations | Vox-fusion [52] | Combines neural implicit representation with traditional volumetric fusion methods to construct 3D maps and dynamically expand scenes | Volumetric fusion
| vMAP [53] | Constructs a SLAM system using an MLP neural field model, enabling efficient modeling of each object and continuously updating them | MLP neural field model, object modeling
| Co-SLAM [54] | Utilizes multi-resolution hash-grid and one-blob encoding for high convergence speed and accurate 3D map generation, achieving efficient memory usage | Multi-resolution hash-grid, one-blob encoding
Integration with NeRF and feature-based methods | NeRF-SLAM [55] | Combines dense monocular SLAM with NeRF to provide enhanced 3D scene reconstruction capabilities using associated uncertainty information | Dense monocular SLAM, NeRF, uncertainty information
| Orbeez-SLAM [56] | Integrates ORB features with NeRF to build 3D maps, predicting color and density values for these points to construct a dense 3D map | ORB features, NeRF, dense 3D map construction
| Nicer-SLAM [57] | Enables high-fidelity 3D scene reconstruction and novel view synthesis by utilizing hierarchical neural implicit representation, enhancing tracking performance | Hierarchical neural implicit representation

4. Conclusion

This paper discussed Visual SLAM, which relies on visual sensors for data collection. Visual SLAM faces challenges such as variations in lighting, moving objects, rapid camera movements, and environments lacking texture or regular structure, and researchers are exploring different approaches to address them. One approach utilizes various features, including geometric, deep learning-based, and artificial features: geometric approaches combine points, lines, and planes to cope with complex environments; deep learning approaches use CNN-based models to extract high-level features, improving performance across diverse conditions; and artificial features entail adding structures to the environment, which requires pre-installation. Another approach employs object detection techniques to deal with dynamic objects: technologies such as semantic segmentation, instance segmentation, and YOLO are used to detect and track dynamic objects, either removing them or incorporating them into the map. Lastly, there are efforts to enhance 3D scene understanding and representation. Initially, semantic segmentation was used for scene understanding, but recent advancements include NeRF-based SLAM, which employs implicit neural representation for constructing 3D maps and tracking. Table 5 summarizes evaluation results from various papers on the TUM RGB-D dataset. The evaluation uses the Absolute Trajectory Error Root Mean Square Error (ATE RMSE), with units in meters; the values are taken from the most recent evaluations reported in the cited papers.

Table 5. ATE RMSE results on the TUM RGB-D dataset (unit: m).


In future research, deep learning-based feature extraction and learning methods need to be advanced further to enhance SLAM performance in various environments, including challenging conditions such as lighting changes, texture absence, and structural irregularity. More advanced detection, tracking, and modeling techniques for dynamic objects are also necessary to effectively handle their influence within the environment. Furthermore, 3D scene understanding and processing should be pursued to increase real-world applicability, facilitating accurate and efficient real-time 3D modeling across various fields.

Acknowledgement

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2022R1C1C1011084).

References

[1] Davison A. J., Reid I. D., Molton N. D., Stasse O., 2007, MonoSLAM: Real-time single camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, pp. 1052-1067.
[2] Klein G., Murray D., 2009, Parallel tracking and mapping on a camera phone, Proc. of IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 83-86.
[3] Newcombe R. A., Lovegrove S. J., Davison A. J., 2011, DTAM: Dense tracking and mapping in real-time, Proc. of the 2011 International Conference on Computer Vision, pp. 2320-2327.
[4] Engel J., Schöps T., Cremers D., 2014, LSD-SLAM: Large-scale direct monocular SLAM, Proc. of European Conference on Computer Vision (ECCV), pp. 834-849.
[5] Mur-Artal R., Montiel J. M. M., Tardós J. D., 2015, ORB-SLAM: A versatile and accurate monocular SLAM system, IEEE Transactions on Robotics, Vol. 31, No. 5, pp. 1147-1163.
[6] Mur-Artal R., 2017, ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras, IEEE Transactions on Robotics, Vol. 33, No. 5.
[7] Harris C., Stephens M., 1988, A combined corner and edge detector, Proc. of Alvey Vision Conference, Vol. 15, No. 50, pp. 10-5244.
[8] Shi J., Tomasi C., 1994, Good features to track, Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 593-600.
[9] Lowe D. G., 2004, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, Vol. 60, pp. 91-110.
[10] Bay H., Tuytelaars T., Gool L. V., 2006, SURF: Speeded up robust features, Proc. of European Conference on Computer Vision (ECCV).
[11] Rosten E., Drummond T., 2006, Machine learning for high-speed corner detection, Proc. of European Conference on Computer Vision (ECCV).
[12] Mair E., Häger G. D., Burschka D., Suppa M., Hirzinger G., 2010, Adaptive and generic corner detection based on the accelerated segment test, Proc. of European Conference on Computer Vision (ECCV).
[13] Rublee E., Rabaud V., Konolige K., Bradski G., 2011, ORB: An efficient alternative to SIFT or SURF, Proc. of 2011 International Conference on Computer Vision, pp. 2564-2571.
[14] Alcantarilla P. F., Bartoli A., Davison A. J., 2012, KAZE features, Proc. of European Conference on Computer Vision (ECCV).
[15] Pumarola A., Vakhitov A., Agudo A., Sanfeliu A., Moreno-Noguer F., 2017, PL-SLAM: Real-time monocular visual SLAM with points and lines, Proc. of 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4503-4508.
[16] Gomez-Ojeda R., Moreno F.-A., Zuniga-Noël D., Scaramuzza D., Gonzalez-Jimenez J., 2019, PL-SLAM: A stereo SLAM system through the combination of points and line segments, IEEE Transactions on Robotics, Vol. 35, No. 3, pp. 734-746.
[17] Lee S. J., Hwang S. S., 2019, Elaborate monocular point and line SLAM with robust initialization, Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 1121-1129.
[18] Lim H., Kim Y., Jung K., Hu S., Myung H., 2021, Avoiding degeneracy for monocular visual SLAM with point and line features, Proc. of 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 11675-11681.
[19] Lim H., Jeon J., Myung H., 2022, UV-SLAM: Unconstrained line-based SLAM using vanishing points for structural mapping, IEEE Robotics and Automation Letters, Vol. 7, No. 2, pp. 1518-1525.
[20] Islam R., Habibullah H., Hossain T., 2023, AGRI-SLAM: A real-time stereo visual SLAM for agricultural environment, Autonomous Robots, Vol. 47, pp. 649-668.
[21] Yang H., Juan J., Gao Y., Sun X., Zhang X., 2023, UPLP-SLAM: Unified point-line-plane feature fusion for RGB-D visual SLAM, Information Fusion, Vol. 96, pp. 51-65.
[22] Shu F., Wang J., Pagani A., Stricker D., 2023, Structure PLP-SLAM: Efficient sparse mapping and localization using point, line and plane for monocular, RGB-D and stereo cameras, Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 2105-2112.
[23] Yan J., Zheng Y., Yang J., Mihaylova L., Yuan W., Gu F., 2024, PLPF-VSLAM: An indoor visual SLAM with adaptive fusion of point-line-plane features, Journal of Field Robotics, Vol. 41, No. 1, pp. 50-67.
[24] Kang R., Shi J., Li X., Liu Y., Liu X., 2019, DF-SLAM: A deep-learning enhanced visual SLAM system based on deep local features, arXiv preprint arXiv:1901.07223.
[25] Li D., Shi X., Long Q., Liu S., Wang W., Wang F., 2020, DXSLAM: A robust and efficient visual SLAM system with deep features, Proc. of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[26] Xu L., Feng C., Kamat V. R., Menassa C. C., 2020, A scene-adaptive descriptor for visual SLAM-based locating applications in built environments, Automation in Construction, Vol. 112.
[27] Bruno H. M. S., Colombini E. L., 2021, LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method, Neurocomputing, Vol. 455, pp. 97-110.
[28] Li G., Yu L., Fei S., 2021, A deep-learning real-time visual SLAM system based on multi-task feature extraction network and self-supervised feature points, Measurement, Vol. 168, pp. 108403.
[29] Pfrommer B., Daniilidis K., 2019, TagSLAM: Robust SLAM with fiducial markers, arXiv preprint arXiv:1910.00679.
[30] Munoz-Salinas R., Marín-Jimenez M. J., Medina-Carnicer R., 2019, SPM-SLAM: Simultaneous localization and mapping with squared planar markers, Pattern Recognition, Vol. 86, pp. 156-171.
[31] Munoz-Salinas R., Medina-Carnicer R., 2020, UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers, Pattern Recognition, Vol. 101, pp. 107193.
[32] Li B., Zou D., Sartori D., Pei L., Yu W., 2020, TextSLAM: Visual SLAM with planar text features, Proc. of 2020 IEEE International Conference on Robotics and Automation (ICRA).
[33] Yu C., Liu Z., Liu X.-J., Xie F., Yang Y., Wei Q., 2018, DS-SLAM: A semantic visual SLAM towards dynamic environments, Proc. of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[34] Xiao L., Wang J., Qiu X., Rong Z., Zou X., 2019, Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment, Robotics and Autonomous Systems, Vol. 117, pp. 1-16.
[35] Li A., Wang J., Xu M., Chen Z., 2021, DP-SLAM: A visual SLAM with moving probability towards dynamic environments, Information Sciences, Vol. 556, pp. 128-142.
[36] Kim U. H., Kim S.-H., Kim J.-H., 2022, SimVODIS++: Neural semantic visual odometry in dynamic environments, IEEE Robotics and Automation Letters, Vol. 7, No. 2, pp. 4244-4251.
[37] Yang S., Scherer S., 2019, CubeSLAM: Monocular 3-D object SLAM, IEEE Transactions on Robotics, Vol. 35, No. 4, pp. 925-938.
[38] Bescos B., Campos C., Tardós J. D., Neira J., 2021, DynaSLAM II: Tightly-coupled multi-object tracking and SLAM, IEEE Robotics and Automation Letters, Vol. 6, No. 3, pp. 5191-5198.
[39] Xu D., Vedaldi A., Henriques J. F., 2021, Moving SLAM: Fully unsupervised deep learning in non-rigid scenes, Proc. of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[40] Sharma A., Dong W., Kaess M., 2021, Compositional and scalable object SLAM, Proc. of 2021 IEEE International Conference on Robotics and Automation (ICRA).
[41] Sun Y., Hu J., Yun J., Liu Y., Bai D., Liu X., Zhao G., Liang G., Kong J., Chen B., 2022, Multi-objective location and mapping based on deep learning and visual SLAM, Sensors, Vol. 22, No. 19, pp. 7576.
[42] Liao Z., Hu Y., Zhang J., Qi X., Zhang X., Wang W., 2022, SO-SLAM: Semantic object SLAM with scale proportional and symmetrical texture constraints, IEEE Robotics and Automation Letters, Vol. 7, No. 2, pp. 4008-4015.
[43] He K., Gkioxari G., Dollár P., Girshick R., 2017, Mask R-CNN, Proc. of the IEEE International Conference on Computer Vision.
[44] Qin T., Yen Y., Zheng T., Chen Y., Chen Q., Su Q., 2021, A light-weight semantic map for visual localization towards autonomous driving, Proc. of 2021 IEEE International Conference on Robotics and Automation (ICRA).
[45] Mildenhall B., 2021, NeRF: Representing scenes as neural radiance fields for view synthesis, Communications of the ACM, Vol. 65, No. 1, pp. 99-106.
[46] Sucar E., Liu S., Ortiz J., Davison A. J., 2021, iMAP: Implicit mapping and positioning in real-time, Proc. of 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[47] Zhu Z., Peng S., Larsson V., Xu W., Bao H., Cui Z., 2022, Nice-SLAM: Neural implicit scalable encoding for SLAM, Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Ortiz J., Clegg A., Dong J., Sucar E., Novotny D., Zollhoefer M., Mukadam M., 2022, iSDF: Real-time neural signed distance fields for robot perception, arXiv preprint arXiv:2204.02296.
[49] Ming Y., Ye W., Calway A., 2022, iDF-SLAM: End-to-end RGB-D SLAM with neural implicit mapping and deep feature tracking, arXiv preprint arXiv:2209.07919.
[50] Johari M. M., Carta C., Fleuret F., 2023, ESLAM: Efficient dense SLAM system based on hybrid representation of signed distance fields, Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Kruzhkov E., Savinykh A., Karpyshev P., Murenkov M., Yudin E., Potapov A., 2022, MeSLAM: Memory efficient SLAM based on neural fields, Proc. of 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC).
[52] Yang X., Li H., Zhai H., Ming Y., Liu Y., Zhang G., 2022, Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation, Proc. of IEEE International Symposium on Mixed and Augmented Reality (ISMAR).
[53] Kong X., Liu S., Taher M., Davison A. J., 2023, vMAP: Vectorised object mapping for neural field SLAM, arXiv preprint arXiv:2302.01838.
[54] Wang H., Wang J., Agapito L., 2023, Co-SLAM: Joint coordinate and sparse parametric encodings for neural real-time SLAM, Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[55] Rosinol A., Leonard J. J., Carlone L., 2023, NeRF-SLAM: Real-time dense monocular SLAM with neural radiance fields, Proc. of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[56] Chung C.-M., Tseng C.-M., Hsu Y.-C., Shi X.-Q., Hua Y.-H., Yeh J.-F., 2023, Orbeez-SLAM: A real-time monocular visual SLAM with ORB features and NeRF-realized mapping, Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA).
[57] Zhu Z., Peng S., Larsson V., Cui Z., Oswald M. R., Geiger A., 2024, Nicer-SLAM: Neural implicit scene encoding for RGB SLAM, Proc. of 2024 International Conference on 3D Vision (3DV).
[58] Li Y., Yunus R., Brasch N., Navab N., Tombari F., 2021, RGB-D SLAM with structural regularities, Proc. of 2021 IEEE International Conference on Robotics and Automation (ICRA).

Author

Yungu Won

Yungu Won received his B.S. degree in artificial intelligence, computer science and engineering from Handong Global University, Pohang, South Korea, in 2023. He is currently pursuing an M.S. degree at the Computer Graphics and Vision Lab at Handong Global University. His research interests include SLAM and computer vision.

Sung Soo Hwang

Sung Soo Hwang received his B.S. degree in computer science and electrical engineering from Handong Global University, Pohang, South Korea, in 2008, and his M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2010 and 2015, respectively. He is currently working as an associate professor with the School of Computer Science and Electrical Engineering, Handong Global University, South Korea. His current research interests include Visual SLAM and neural rendering-based 3D reconstruction.