2.1. Visual SLAM System
Fig. 1 depicts the structure of the Visual SLAM system. It consists of five key components:
Camera Sensor, Tracking, Optimization, Loop Closure, and Mapping.
Fig. 1. Visual SLAM flowchart.
• Camera sensor
The camera sensor is responsible for collecting image data. Cameras are categorized
into monocular, stereo, and RGB-D types.
A monocular camera uses a single lens, offering advantages such as low cost and a
lightweight design. However, it poses challenges in accurately estimating landmark
depth and may suffer from scale ambiguity during map construction.
A stereo camera uses two lenses and obtains depth through calibration, rectification,
matching, and depth computation. While it can acquire depth information both indoors
and outdoors, it suffers from a high computational load.
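The matching-to-depth step of a stereo camera can be illustrated with a minimal OpenCV sketch. The file names, focal length, and baseline below are placeholder assumptions, and block matching (StereoBM) is only one of several possible matchers.

```python
import cv2
import numpy as np

# Hypothetical rectified stereo pair; file names and calibration values are placeholders.
left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

focal_px = 700.0    # focal length in pixels (assumed)
baseline_m = 0.12   # distance between the two lenses in meters (assumed)

# Block matching over the rectified images yields a disparity map (fixed-point, scaled by 16).
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Depth follows from the stereo geometry: Z = f * B / disparity.
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = focal_px * baseline_m / disparity[valid]
```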
An RGB-D camera measures depth in real time using structured light or Time-of-Flight
(ToF) sensors. However, it has a limited measurement range and can be difficult to
use in outdoor environments.
• Tracking
Tracking includes visual odometry (VO), which determines the position and orientation
of the camera by analyzing camera images. It operates through stages such as feature
extraction, feature matching, and motion and pose estimation. Tracking is primarily
divided into two approaches, the indirect method and the direct method; Table 1 summarizes
the differences between them.
Table 1. Comparison of the indirect and direct methods.

| | Indirect method | Direct method |
| --- | --- | --- |
| Approach | Feature | Pixel value |
| Illumination change | Robust | Vulnerable |
| Computational cost | Low | High |
| Motion blur | Vulnerable | Robust |
| Low-texture area | Vulnerable | Robust |
| Dynamic scene | Robust | Vulnerable |
| Cost function | Reprojection error | Photometric error |
The indirect method analyzes features, i.e., distinctive points in images. It is robust
to changes in lighting conditions, object motion, and dynamic environmental changes,
and it demands fewer computations. However, it can become unstable under significant
motion blur or in texture-less environments, and the reconstructed 3D maps are relatively
sparse.
The direct method processes images using the pixel values themselves, analyzing brightness
variations between pixels. It is robust to motion blur and to low-texture environments,
and it can generate denser 3D maps. However, it is vulnerable to changes in lighting
conditions, unstable in dynamic scenes, and demands higher computational resources
because it uses all pixel values in the image.
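As an illustration of the indirect method, the following minimal sketch estimates the relative camera motion between two frames with OpenCV. The frame file names and the intrinsic matrix K are placeholder assumptions; a production VO pipeline would add keyframe selection, outlier handling, and scale management.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames and a pinhole intrinsic matrix K are assumed.
frame1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

# Feature extraction: ORB keypoints and binary descriptors.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(frame1, None)
kp2, des2 = orb.detectAndCompute(frame2, None)

# Feature matching with Hamming distance and cross-checking.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Motion estimation: essential matrix with RANSAC, then the relative rotation and
# (up-to-scale) translation between the two camera poses.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
```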
• Optimization
Optimization refines the camera position and orientation estimated by tracking to
obtain a more accurate pose. Two classes of methods are used: filter-based methods
and graph-based methods.
Filter-based optimization: This method deals with probabilistic system state estimation.
The most prominent filter is the Kalman Filter. Fig. 2 illustrates a Flow Chart for the Kalman Filter. The Kalman Filter is applied to linear
systems. It predicts the state using a linear model after setting the initial estimate
and uncertainty and updates using observed data. The Extended Kalman Filter is an
extended version applicable to nonlinear systems, approximating nonlinear functions
to perform prediction and update. The Particle Filter represents probability distributions
with particles for state estimation and is flexible for application to nonlinear and
non-Gaussian systems.
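A minimal, self-contained Kalman filter sketch for a 1-D constant-velocity state illustrates the predict-update cycle described above; the noise covariances and measurements are arbitrary illustrative values, not taken from any particular SLAM system.

```python
import numpy as np

# Minimal linear Kalman filter for a 1-D constant-velocity state [position, velocity].
F = np.array([[1.0, 1.0],   # state transition model (unit time step)
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])  # only the position is observed
Q = np.eye(2) * 1e-3        # process noise covariance (assumed)
R = np.array([[0.1]])       # measurement noise covariance (assumed)

x = np.zeros((2, 1))        # initial state estimate
P = np.eye(2)               # initial uncertainty

def kalman_step(x, P, z):
    # Predict: propagate the state and its uncertainty through the linear model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the observation z.
    y = z - H @ x_pred                   # innovation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    return x_pred + K @ y, (np.eye(2) - K @ H) @ P_pred

for z in [np.array([[1.02]]), np.array([[2.10]]), np.array([[2.95]])]:
    x, P = kalman_step(x, P, z)
```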
Graph-based optimization: This includes Factor Graph Optimization (FGO), Pose Graph
Optimization (PGO), and Bundle Adjustment (BA). These methods utilize graph structures
composed of nodes and edges to perform optimization, employing nonlinear optimization
techniques. FGO utilizes a structure consisting of variables and factors, while PGO
models camera poses as nodes and movements as edges. Fig. 3 describes the composition of the Pose Graph.
BA adjusts the camera poses and landmark positions using input images and feature
points, minimizing reprojection errors.
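The reprojection error minimized by BA can be sketched as follows for a simple pinhole model; the data structures (pose list, landmark array, observation tuples) are illustrative assumptions. In practice this cost is minimized jointly over all poses and landmarks with a nonlinear least-squares solver such as Levenberg-Marquardt.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3-D landmark X into the image of a camera with pose (R, t)."""
    x_cam = R @ X + t            # world -> camera coordinates
    x_img = K @ x_cam            # camera -> homogeneous pixel coordinates
    return x_img[:2] / x_img[2]

def reprojection_error(K, poses, landmarks, observations):
    """Sum of squared reprojection errors, the cost BA minimizes.

    observations: list of (camera_index, landmark_index, observed_pixel) tuples.
    """
    total = 0.0
    for cam_idx, lm_idx, uv_obs in observations:
        R, t = poses[cam_idx]
        uv_pred = project(K, R, t, landmarks[lm_idx])
        total += float(np.sum((uv_pred - np.asarray(uv_obs)) ** 2))
    return total
```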
• Loop closure
In Fig. 4, the role of Loop Closure in closed-loop scenarios is illustrated. Loop Closure verifies
whether the current location has been visited previously and corrects accumulated
position estimation errors as the robot moves. Typically, Loop Closure is performed
using the Bag-of-Words (BoW) model. The BoW model clusters extracted keypoints from
images to form groups of similar keypoints, known as “visual words.” Each image is
represented by a vector recording the frequency of occurrence of corresponding visual
words, allowing measurement of similarities between images. If the current location
is close to a previously visited place, it is considered a Loop Closure candidate,
and if necessary, the robot’s position estimation is corrected.
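A minimal sketch of the BoW representation and similarity test is given below; the vocabulary is assumed to have been built beforehand (e.g., by clustering descriptors with k-means), and the loop-closure threshold is an arbitrary illustrative value.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word and count occurrences."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return hist / (np.linalg.norm(hist) + 1e-12)   # L2-normalized BoW vector

def similarity(hist_a, hist_b):
    """Cosine similarity between two BoW vectors."""
    return float(np.dot(hist_a, hist_b))

# A new keyframe whose similarity to a stored keyframe exceeds a threshold (0.8 here,
# an assumed value) is treated as a loop-closure candidate and then verified geometrically.
LOOP_THRESHOLD = 0.8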
• Mapping
Mapping is the process of creating a detailed representation of the surrounding environment,
known as a map. Maps are broadly classified into metric maps and topological maps.
Fig. 5 illustrates two types of maps.
A metric map accurately models the surrounding environment in 3D space, representing
actual physical distances and directions. It primarily uses 3D point clouds, grids,
or various geometric structures to represent the environment. As it contains detailed
spatial information, it is useful for determining precise locations and exploring
the surrounding environment. Metric maps can be classified into sparse maps, which
contain limited information, and dense maps, which provide detailed and comprehensive
information. Typically, the process of creating a map involves setting the intrinsic
and extrinsic parameters of the camera, finding corresponding pairs in two images,
restricting possible locations using epipolar constraints, calculating 3D coordinates
through triangulation, collecting these 3D points to create a 3D point cloud, and
reconstructing the structure of the environment.
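The triangulation step of this pipeline can be sketched with OpenCV as follows; the intrinsics, relative pose, and matched points are placeholder values standing in for the outputs of the tracking stage.

```python
import cv2
import numpy as np

# Intrinsics K and the relative pose (R, t) are assumed known from tracking;
# pts1 and pts2 are matched pixel coordinates satisfying the epipolar constraint.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([[0.1], [0.0], [0.0]])
pts1 = np.array([[320.0, 240.0], [400.0, 250.0]])
pts2 = np.array([[310.0, 240.0], [388.0, 249.0]])

# Projection matrices P = K [R | t] of the two views.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at the origin
P2 = K @ np.hstack([R, t])                          # second camera relative to the first

# Triangulate matched points, then convert the homogeneous 4-D output to 3-D points.
points_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points_3d = (points_4d[:3] / points_4d[3]).T        # Nx3 landmark coordinates (point cloud)
```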
Fig. 5. Metric and topological map comparison.
A topological map represents the surrounding environment using connected nodes
and edges, primarily depicting the structure and relationships within the environment.
Nodes typically represent places or locations, while edges indicate connections between
these places. This allows robots to identify paths for moving from one place to another.
Topological maps provide abstracted information about the surrounding environment
and do not include detailed spatial information.
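Because a topological map is essentially a graph, a minimal sketch can represent it as an adjacency list and find a route with breadth-first search; the place names and connections below are purely illustrative.

```python
from collections import deque

# Nodes are places, edges are traversable connections (illustrative layout).
topo_map = {
    "kitchen": ["hallway"],
    "hallway": ["kitchen", "office", "lab"],
    "office":  ["hallway"],
    "lab":     ["hallway", "storage"],
    "storage": ["lab"],
}

def find_path(graph, start, goal):
    """Breadth-first search over the topological graph; returns a list of places."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path(topo_map, "kitchen", "storage"))  # ['kitchen', 'hallway', 'lab', 'storage']
```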
Semantic mapping combines geometric information with semantic knowledge of the environment.
This involves identifying objects through object recognition and segmenting images
into object instances via semantic segmentation, providing object or structural information
corresponding to each pixel. This information is integrated with SLAM systems to better
understand the environment.
Implicit mapping represents the environment implicitly as a neural map that encodes
both geometric and semantic information. It captures environmental characteristics
using two methods: deep autoencoders and neural rendering-based scene representation.
Deep autoencoders compress input data into high-level abstract representations to
implicitly capture key features of the environment. In contrast, neural rendering
learns and models the 3D structure of the environment to reconstruct scenes. These
methods are useful for understanding the surrounding environment and extracting valuable
information without explicitly representing the environment’s features.
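A minimal sketch of a deep autoencoder of the kind described above is shown below in PyTorch; the layer sizes, latent dimension, and training data are arbitrary illustrative choices rather than any specific published model.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Compresses a flattened grayscale image into a small latent code and reconstructs it."""
    def __init__(self, input_dim=64 * 64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),           # implicit, abstract representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()             # reconstruction loss drives the compression
x = torch.rand(8, 64 * 64)         # a batch of flattened images (placeholder data)
loss = loss_fn(model(x), x)
loss.backward()
```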
2.2. Degrading factors of Visual SLAM
In the field of SLAM, several challenges primarily arise from factors such as lighting
changes, the presence of dynamic objects, rapid camera movements, low-texture environments,
and unstructured environments.
Natural phenomena or artificial factors can induce lighting changes, causing abrupt
alterations in image brightness. These changes can compromise the consistency of feature
matching, potentially leading to errors in the SLAM process.
Dynamic objects in images also pose a challenge. While static elements provide a stable
reference, dynamic objects can disrupt this stability with rapid or unpredictable movements,
making camera pose tracking difficult and distorting the map.
Rapid camera motion introduces shaking and distortion in images, leading to data loss
and inconsistency between consecutive images, necessitating high computational capabilities
to adapt to swift environmental changes.
Low-texture environments, like indoor settings, lack discernible color or brightness
differences and are characterized by repetitive or monotonous patterns. Techniques
such as line detection are used to overcome these limitations.
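As a sketch of such a technique, the probabilistic Hough transform in OpenCV extracts line segments that can supplement scarce point features; the image file name and detector parameters below are placeholder assumptions.

```python
import cv2
import numpy as np

# In a low-texture frame (e.g., a corridor), edges and line segments can supplement
# scarce point features; the file name is a placeholder.
frame = cv2.imread("corridor.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(frame, 50, 150)

# Probabilistic Hough transform returns line segments as (x1, y1, x2, y2).
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=50,
                        minLineLength=30, maxLineGap=5)
```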
Unstructured environments, typically encountered outdoors, lack fixed roads, paths,
or structures. Irregular terrain, various obstacles, and absence of structure make
it challenging for robots or autonomous systems to navigate or operate. Interpreting
sensor data and modeling the environment for SLAM systems become difficult in such
environments.