FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry
Abstract
This paper proposes FAST-LIVO2: a fast, direct LiDAR-inertial-visual odometry framework to achieve accurate and robust state estimation in Simultaneous Localization and Mapping (SLAM) tasks and provide great potential in real-time, onboard robotic applications. FAST-LIVO2 fuses the IMU, LiDAR and image measurements, efficiently through an error-state iterated kalman filter (ESIKF). To address the dimension mismatch between the heterogeneous LiDAR and image measurements, we use a sequential update strategy in the Kalman filter. To enhance the efficiency, we use direct methods for both the visual and LiDAR fusion, where the LiDAR module registers raw points without extracting edge or plane features and the visual module minimizes direct photometric errors without extracting ORB or FAST corner features. The fusion of both visual and LiDAR measurements is based on a single unified voxel map where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points (i.e., visual map points) enabling new image alignment. To enhance the accuracy of image alignment, we use plane priors from the LiDAR points in the voxel map (and even refine the plane prior in the alignment process) and update the reference patch dynamically after new images are aligned. Furthermore, to enhance the robustness of image alignment, FAST-LIVO2 employs an on-demanding raycast operation and estimates the image exposure time in real time. We conduct extensive experiments on both benchmark and private datasets, demonstrating that our proposed system significantly outperforms other state-of-the-art odometry systems in terms of accuracy, robustness, and computation efficiency. Moreover, the effectiveness of key modules in the system is also validated. Lastly, we detail three applications of FAST-LIVO2: UAV onboard navigation demonstrating the system’s computation efficiency for real-time onboard navigation, airborne mapping showcasing the system’s mapping accuracy, and 3D model rendering (mesh-based and NeRF-based) underscoring the suitability of our reconstructed dense map for subsequent rendering tasks. We open source our code, dataset and application of this work on GitHub111https://github.com/hku-mars/FAST-LIVO2 to benefit the robotics community.
Index Terms:
Simultaneous Localization and Mapping (SLAM), Sensor Fusion, 3D Reconstruction, Aerial Navigation.I Introduction
In recent years, simultaneous localization and mapping (SLAM) technology has seen significant advancements, particularly in real-time 3D reconstruction and localization in unknown environments. Due to its ability to estimate poses and reconstruct maps in real time, SLAM has become indispensable for various robot navigation tasks. The localization process delivers crucial state feedback for the robot’s onboard controllers, while the dense 3D map provides key environmental information, such as free spaces and obstacles, essential for effective trajectory planning. A colored map also carries substantial semantic information, enabling a vivid representation of the real world that opens up vast potential applications, such as virtual and augmented reality, 3D modeling, and robot-human interactions.
Currently, several SLAM frameworks have been successfully implemented with single-measurement sensors, primarily cameras [1, 2, 3, 4] or LiDAR [5, 6, 7]. Although visual and LiDAR SLAM have shown promise in their own domains, each has inherent limitations that constrain their performance in various scenarios.
Visual SLAM, leveraging cost-effective CMOS sensors and lenses, is capable of establishing accurate data associations, thereby achieving a certain level of localization accuracy. The abundance of color information further enriches the semantic perception. Further leveraging this enhanced scene comprehension, deep learning methods are employed for robust feature extraction and dynamic object filtering. However, the lack of direct depth measurement in visual SLAM necessitates concurrent optimization of map points via operations such as triangulation or depth filtering, which introduces significant computational overhead that often limits map accuracy and density. Visual SLAM also encounters numerous other limitations, such as varying measurement noise across different scales, sensitivity to illumination changes, and the impact of texture-less environments on data association.
LiDAR SLAM, utilizing LiDAR sensors, obtains precise depth measurements directly, offering superior precision and efficiency in localization and mapping tasks compared with visual SLAM. Despite these strengths, LiDAR SLAM exhibits several significant shortcomings. On one hand, the point cloud maps it reconstructs, albeit detailed, lack color information, thereby reducing their information scale. On the other hand, LiDAR SLAM performance tends to deteriorate in environments presenting insufficient geometric constraints, such as narrow tunnels, a single and extended wall, etc.
As the demand to operate intelligent robots in the real world grows, especially in environments that often lack structure or texture, it is becoming clear that existing systems relying on a single sensor cannot provide the accurate and robust pose estimation as required. To address this issue, the fusion of commonly-used sensors such as LiDAR, camera, and IMU is gaining increasing attention. This strategy not only combines the strengths of these sensors to provide enhanced pose estimation, but also aids in the construction of accurate, dense, and colored point cloud maps, even in environments where the performance of individual sensors degenerate.
Efficient and accurate LiDAR-inertial-visual odometry (LIVO) and mapping are still challenging problems: 1) The entire LIVO system is tasked with processing LiDAR measurements, consisting of hundreds to thousands of points per second, as well as high-rate, high-resolution images. The challenge of fully utilizing such a vast amount of data, particularly with limited onboard resources, necessitates exceptional computational efficiency; 2) Many existing systems typically incorporate a LiDAR-Inertial Odometry (LIO) subsystem and a Visual-Inertial Odometry (VIO) subsystem, each necessitating the extraction of features from visual and LiDAR data respectively to reduce computational load. In environments that lack structure or texture, this extraction process often results in limited feature points. Furthermore, to optimize feature extraction, extensive engineering adaptations are essential to accommodate the variability in LiDAR scanning patterns and point densities; 3) To reduce computational demands and achieve tighter integration between camera and LiDAR measurements, a unified map is essential to manage sparse points and the observed high-resolution image measurements simultaneously. However, designing and maintaining such maps are particularly challenging considering the heterogeneous measurements of LiDAR and cameras; 4) To ensure the accuracy of the reconstructed colored point cloud, pose estimation needs to achieve pixel-level accuracy. Meeting this standard presents considerable challenges: proper hardware synchronization, rigorous pre-calibration of extrinsic parameters between LiDAR and cameras, precise recovery of exposure time, and a fusion strategy capable of reaching pixel-level accuracy in real time.
Motivated by these issues, we propose FAST-LIVO2, a high-efficiency LIVO system that tightly integrates LiDAR, image and IMU measurements through a sequentially updated error-state iterated Kalman filter (ESIKF). With the prior from IMU propagation, the system state is updated sequentially, first by the LiDAR measurements and then by the image measurements, both utilizing direct methods based on a single unified voxel map. Specifically, in the LiDAR update, the system registers raw points to the map to construct and update its geometric structure, and in the visual update, the system reuses LiDAR map points as the visual map points directly without extracting, triangulating, or optimizing any visual features from images. The chosen visual map points in the map are attached with reference image patches previously observed and then projected to the current image to align its pose by minimizing the direct photometric errors (i.e., sparse image alignment). To improve the accuracy in the image alignment, FAST-LIVO2 dynamically updates the reference patches and uses the plane priors obtained from LiDAR points. For improved computation efficiency, FAST-LIVO2 uses LiDAR points to identify visual map points visible from the current image and conduct an on-demanding voxel raycast in case of no LiDAR points. FAST-LIVO2 also estimates the exposure time in real time to handle illumination variation.
FAST-LIVO2 is developed based on FAST-LIVO first proposed in our previous work [8]. The new contributions compared to FAST-LIVO are listed below:
-
1.
We propose an efficient ESIKF framework with sequential update to address the dimension mismatch between LiDAR and visual measurements, improving the robustness of FAST-LIVO that uses asynchronous updates.
-
2.
We use (and even refine) plane priors from LiDAR points for improved accuracy. In contrast, FAST-LIVO assumes all pixels in a patch share the same depth, a wild assumption significantly reducing the accuracy of affine warping in image alignment.
-
3.
We propose a reference patch update strategy to improve the accuracy of image alignment, by selecting high-quality, inlier reference patches that have large parallax and sufficient texture details. FAST-LIVO selects the reference patch based on proximity to the current view, often resulting in low-quality reference patches degrading the accuracy.
-
4.
We conduct online exposure time estimation for handling environment illumination variation. FAST-LIVO did not address this issue, leading to poor convergence in image alignment under significant lighting changes.
-
5.
We propose on-demand voxel raycasting to enhance the system robustness in the absence of LiDAR point measurements caused by LiDAR close proximity blind zones, an issue not considered in FAST-LIVO.
Each of the above contributions are evaluated in comprehensive ablation studies to verify their effectiveness. We implement the proposed system as practical open software, meticulously optimized for real-time operation on both Intel and ARM processors. The system is versatile, supporting multi-line spinning LiDARs, emerging solid-state LiDARs with unconventional scanning patterns, as well as both pinhole cameras and various fisheye cameras.
Besides, we conduct extensive experiments on 25 sequences of public datasets (i.e., Hilti and NTU-VIRAL datasets), alongside various representative private datasets, enabling a comparison with other state-of-the-art SLAM systems (e.g., R3LIVE, LVI-SAM, FAST-LIO2, etc). Both qualitative and quantitative results demonstrate that our proposed system significantly outpaces other counterparts in terms of accuracy and robustness at a reduced computation cost.
Taking a step further to underline the real-world applicability and versatility of our system, we deploy three distinctive applications. Firstly, fully onboard autonomous UAV navigation, demonstrating the system’s real-time capabilities, marks a pioneering instance of employing a LiDAR-inertial-visual system for real-world autonomous UAV flights. Secondly, airborne mapping showcases the system’s pixel-level precision under structure-less environments in practical use. Lastly, the high-quality generation of mesh, texturing, and NeRF models underscores the system’s suitability for rendering tasks. We make our code and dataset available on GitHub.
II Related Works
II-A Direct Methods
Direct methods stand out as a prominent approach for fast pose estimation in both visual and LiDAR SLAM. Unlike feature-based methods [9, 10, 5, 6] which necessitate the extraction of salient feature points (e.g., corners and edge pixels in images; plane and edge points in LiDAR scans) and the generation of robust descriptors for matching, direct methods directly leverage raw measurements to optimize the sensor pose [11] by minimizing an error function based on photometric error or point-to-plane residuals, e.g., [3, 12, 13, 14]. By eliminating the time-consuming feature extraction and matching, direct methods offer fast pose estimation. Nonetheless, the absence of feature matching requires fairly accurate state prior estimation to avoid local minima.
Direct methods in visual SLAM can be broadly categorized into dense direct, semi-dense direct, and sparse direct methods. Dense direct methods, predominantly adopted for RGB-D cameras with full depth measurements as exemplified by [15, 16, 17], apply image-to-model alignment for pose estimation. In contrast, semi-dense direct methods [18, 3] implement direct image alignment by capitalizing on pixels with significant gray-level gradients for estimation. Sparse direct methods [12, 2] focus on delivering accurate state estimation through only a few well-selected raw patches, thus further diminishing the computational burden in comparison to both dense and semi-dense direct methods.
Unlike direct visual SLAM methods, direct LiDAR SLAM systems [13, 14, 19, 20] do not distinguish between dense and sparse approaches and commonly use spatially-downsampled or temporally-downsampled raw points in each scan to construct constraints for pose optimization.
In our work, we harness the principles of the direct method for both LiDAR and visual modules. The LiDAR module of our system is adapted from VoxelMap [14], and the visual model is based on a variant of sparse direct method [12]. While drawing inspiration from sparse direct image alignment in [12], our visual module differs by re-utilizing the LiDAR points as visual map points, thus mitigating the intensive backend computations (i.e., feature alignment, sliding window optimization and/or depth filtering).
II-B LiDAR-visual(-inertial) SLAM
The incorporation of multiple sensors in LiDAR-visual-inertial SLAM equips the system with the capability to handle a wide range of challenging environments, particularly when one sensor experiences failure or partial degeneration. Motivated by this, the research community has seen the emergence of various LiDAR-visual-inertial SLAM systems. Existing methods can generally be divided into two categories: loosely coupled and tightly coupled. The classification can be determined from two perspectives: the state estimation level and the raw measurement level. At the state estimation level, the key is whether the estimate from one sensor serves as an optimization objective in another sensor’s model. At the raw measurement level, it involves whether raw data from different sensors are combined.
Zhang et al. propose a LiDAR-Visual-Inertial SLAM system [21] that is loosely coupled at the state estimation level. In this system, VIO subsystem only provides the initial pose for the scan registration in LIO subsystem, instead of being optimized jointly with the scan registration. VIL-SLAM [22] employs a similar loosely-coupled method, not utilizing joint optimization of LiDAR, camera, and IMU measurements.
Some systems (e.g., DEMO [23], LIMO [24], CamVox [25], [26]) use 3D LiDAR points to provide depth measurements for the visual module [1, 27, 4]. While these systems exhibit measurement-level tight coupling, they remain loosely-coupled in state estimation, primarily due to the absence of constraints directly derived from LiDAR measurements at state estimation. Another issue arises as 3D LiDAR points do not have a one-to-one correspondence with 2D image feature points and/or lines due to mismatched resolutions. This mismatch requires interpolation in depth association, introducing potential errors. To address this, DVL-SLAM [28] employs a direct method for visual tracking, wherein the LiDAR points are directly projected into the image to ascertain the depth of corresponding pixel positions.
The works mentioned above have not achieved tight coupling at the state estimation level. In pursuit of higher accuracy and robustness, many recent studies have emerged that jointly optimize sensor data in a tightly-coupled manner. To name a few, LIC-Fusion [29] tightly fuses IMU measurements, sparse visual features, and LiDAR plane and edge features based on the MSCKF [30] framework. The subsequent LIC-Fusion2.0 [31] enhances LiDAR pose estimation by implementing plane-feature tracking within a sliding window. VILENS [32] offers a joint optimization of visual, LiDAR, and inertial data through a unified factor graph, relying on fixed lag smoothing. R2LIVE [33] tightly fuses the LiDAR, camera and IMU measurements in an on-manifold iterated Kalman filter [34]. For the VIO subsystem in R2LIVE, a sliding window optimization is used to triangulate the locations of visual features in the map.
Several systems achieve complete tight coupling at both the measurement and state estimation levels. LVI-SAM [35] fuses the LiDAR, visual and inertial sensors in a tightly-coupled smoothing and mapping framework, which is built atop a factor graph. The VIO subsystem performs visual feature tracking and extracts feature depth using LiDAR scans. R3LIVE [36] constructs the geometric structure of the global map by LIO and renders map texture by VIO. These two subsystems estimate the system state jointly by fusing their respective LiDAR or visual data with IMUs. The advanced version, R3LIVE++ [37], estimates exposure time in real time and conducts photometric calibration in advance[38], which enables the system to recover the radiance of map points. Unlike most previously mentioned LiDAR-inertial-visual systems that rely on feature-based methods for both LIO and VIO subsystems, R3LIVE series [36, 37] adopt direct methods for both without feature extraction, enabling them to capture subtle environmental features even in texture-less or structure-less scenarios.
Our system also jointly estimates the state using LiDAR, image and IMU data, and maintains a tightly-coupled voxel map at the measurement level. Furtheremore, our system uses direct methods, harnessing raw LiDAR points for LiDAR scan registration and employing raw image patches for visual tracking. The key difference between our system and R3LIVE (or R3LIVE++) is that R3LIVE (and R3LIVE++) operate at an individual pixel level in the VIO, while our system operates at image patch levels. This difference bestows our system with marked advantages. Firstly, in terms of robustness, our methodology uses a simplified, one-step frame-to-map sparse image alignment for pose estimation, mitigating the heavy reliance on an accurate initial state that is has to be obtained by a frame-to-frame optical flow in R3LIVEs. Consequently, our system simplifies and improves upon the two-stage frame-to-frame and frame-to-map operations in R3LIVE. Secondly, from a computational standpoint, the VIO in R3LIVE predominantly employs a dense direct method which is computationally expensive, necessitating extensive points for residual construction and rendering. In contrast, our sparse direct method provides enhanced computational efficiency. Lastly, our system exploits information at the resolution of raw image patches, whereas R3LIVE is capped at the resolution of its point map.
The visual module of our system is most similar to DV-LOAM [39], SDV-LOAM [40] and LVIO-Fusion [41], which projects LiDAR points attached with patches into a new image and tracks the image by minimizing the direct photometric error. However, they have several key differences, such as the use of separate maps for vision and LiDAR, reliance on the assumption of constant depth in patch warping in the visual module, loosely-coupling at the state estimation level, and two stages of frame-to-frame and frame-to-keyframes for image alignment. In contrast, our system tightly integrates frame-to-map image alignment, LiDAR scan registration, and IMU measurements in an iterated Kalman filter. Besides, thanks to the single unified map for both LiDAR and visual modules, our system can directly employ the plane priors provided by the LiDAR points to accelerate the image alignment.
Notations | Explanation |
/ | The encapsulated “boxplus” and |
“boxminus” operations on the state manifold | |
A vector in global world frame | |
A vector in camera frame | |
The extrinsic of LiDAR frame w.r.t. IMU frame | |
The extrinsic of IMU frame w.r.t. camera frame | |
The pose of IMU frame at time w.r.t. the global frame | |
, | The ground-truth, predicted and updated estimation of |
The -th update of | |
The error state between ground-truth and its estimation |
III System Overview
The overview of our system is shown in Fig. 1, which contains four sections: ESIKF (Section IV), Local Mapping (Section V), LiDAR Measurement Model (Section VI), and Visual Measurement Model (Section VII).
The asynchronously-sampled LiDAR points are first recombined into scans at the camera’s sampling time through scan recombination. Then, we tightly couple the LiDAR, image and inertial measurements via an ESIKF with sequential state update, where the system state is updated sequentially, first by the LiDAR measurements and then by the image measurements, both utilizing direct methods based on a single unified voxel map (Section IV). To construct the LiDAR measurement model in the ESIKF update (Section VI), we compute the frame-to-map point-to-plane residual. To establish visual measurement model (Section VII), we extract the visual map points within the current FoV from the map, making use of visible voxel query and on-demand raycasting; after the extraction, we identify and discard outlier visual map points (e.g., points that are occluded or exhibit depth discontinuity); we then compute frame-to-map image photometric errors for visual update.
The local map for both visual and LiDAR updates is a voxel-map structure (Section V): the LiDAR points construct and update the map’s geometric structure, while the visual images append image patches to selected map points (i.e., visual map points) and update reference patches dynamically. The updated reference patches have their normal vectors further refined in a separate thread.
IV Error-State Iterated Kalman Filter with Sequential State Update
This section outlines the system’s architecture, based on the sequentially-updated Error-State Iterated Kalman Filter (ESIKF) framework.
IV-A Notations and State Transition Model
In our system, we assume the time offsets among the three sensors (LiDAR, IMU and camera) are known, which can be calibrated or synchronized in advance. We take IMU frame (denoted as ) as the body frame and the first body frame as the global frame (denoted as ). Besides, we assume that the three sensors are rigidly attached and the extrinsic, defined in Table I, are pre-calibrated. Then, the discrete state transition model at the -th IMU measurement is:
(1) |
where is the IMU sample period, the state , input , process noise , and function are defined as follows:
(2) | ||||
where , , and respectively denote the IMU attitude, position, and velocity in the global frame, is the gravity vector in the global frame, is the inverse camera exposure time relative to the first frame, is the Gaussian noise that models as a random walk, and are the raw IMU measurements, and are measurement noises in and , and are IMU bias, which are modeled as random walk driven by Gaussian noise and , respectively.
IV-B Scan Recombination
We employ the scan recombination to segment the high-frequency, sequentially-sampled LiDAR raw points into distinct LiDAR scans at the camera sampling moments, as depicted in Fig. 2. This ensures both camera and LiDAR data are synchronized at the same frequency (e.g., 10 Hz), allowing for update of state at the same time.
IV-C Propagation
In the ESIKF framework, the state and covariance are propagated from time , when the last LiDAR scan and image frame are received, to time , when the current LiDAR scan and image frame are received. This forward propagation predicts the state at each IMU input during and , by setting the process noise in (1) to zero. Denote the propagated state as and covariance as , which will serve as a prior distribution for the subsequent update in Section IV-D. Moreover, to compensate for motion distortion, we conduct a backward propagation as in [42], ensuring points in a LiDAR scan are “measured” at the scan-end time . Note that for notation simplification, we omit the subscript in all state vectors.
IV-D Sequential Update
The IMU propagated state and covariance impose a prior distribution for , the system state at time , as follows:
(3) |
We denote the above prior distribution as and the measurement models for the LiDAR and camera as:
(4) |
where and respectively denote the measurement noises for the LiDAR and camera.
A standard ESIKF [43] would update the state using all the current measurements, including both LiDAR measurement and image measurements . However, LiDAR and image measurements are two different sensing modalities, whose data dimensions do not match. Furthermore, the fusion of image measurement may be performed at various levels of the image pyramid. To address the dimension mismatch and give more flexibility for each module, we propose a sequential update strategy. This strategy is theoretically equivalent to the standard update using all measurements, assuming statistical independence of LiDAR measurements and image measurements given the state vector (i.e., measurements corrupted by statistically independent noise).
To introduce the sequential update, we rewrite the total conditional distribution for the current state as:
(5) |
Equation (5) implies that the total conditional distribution can be obtained by two sequential Bayesian updates. The first step fuses only the LiDAR measurement with the IMU-propagated prior distribution to obtain the distribution :
(6) |
The second step then fuses the camera measurement with to obtain the final posterior distribution of :
(7) |
Interestingly, the two fusion in (6) and (7) follow the same form:
(8) |
To conduct the fusion in (8) for either LiDAR or image measurements, we detail the prior distribution and measurement model as follows. For the prior distribution , denote it as with . In case of the LiDAR update (i.e., the first step), is the state and covariance obtained from the propagation step. In case of the visual update (i.e., the second step), is the converged state and covariance obtained from the LiDAR update.
To obtain the measurement model distribution , denote state estimated at the -th iteration as , where . Approximating the measurement model (4) (either the LiDAR or camera measurement) through its first-order Taylor expansion made at leads to:
(9) | ||||
(10) |
where , is the residual, is the lumped measurement noise, and are the Jacobian matrixes of with respect to and , evaluated at zero, respectively.
Then, substituting the prior distribution and the measurement distribution in (10) into the posterior distribution (8) and performing maximum likelihood estimation (MLE), we can obtain the maximum a-posterior estimation (MAP) of (and hence ) from the standard update step in the ESIKF framework [43]:
(11) |
The converged state and covariance matrix then makes the mean and covariance of the posterior distribution .
Kalman filter with sequential update has been investigated in the literature, such as in [44, 45]. This paper adopts this approach for ESIKF for LiDAR and camera systems. The implementation of the ESIKF with sequential update is detailed in Algorithm 1. In the first step (Line 6-10), the error state is updated from the LiDAR measurements (Section VI-A) iteratively until convergence. The converged state and covariance estimates, denoted again as and , are used to update the geometry of the map (Section V-B), and subsequently refined in the second step visual update (Line 13 - 23) on each level of the image pyramid (Section VII-B) until convergence. The optimal state and covariance, denoted as and , are employed for propagating incoming IMU measurements (Section IV-C) and update the visual structures of the map (Section V-D and V-E).
V Local Mapping
V-A Map Structure
Our map employs an adaptive voxel structure presented in [14], which is organized by a Hash table and an octree for each Hash entry (Fig. 1). The hash table manages root voxels, each with a fixed dimension of meters. Each root voxel encapsulates an octree structure to further organize leaf voxels of varying sizes. A leaf voxel represents a local plane and stores a plane feature (i.e., plane center, normal vector, and uncertainty) along with a set of LiDAR raw points situated on this plane. Some of these points are attached with three-level image patches ( patch size), which we refer to as visual map points. Converged visual map points are only attached with reference patches, while non-converged ones are attached with reference patches and other visible patches (see Section V-E). The varying size of the leaf voxel allows it to represent local planes of different scales, thus being adaptable to environments with different structure [14].
To prevent the size of the map from going unbound, we keep only a local map within a large local region of length around the LiDAR’s current position, as illustrated in a 2D example in Fig. 3. Initially, the map is a cube centered at the LiDAR’s starting position . The LiDAR’s detection area is visualized as a sphere centered at its current position, with its radius defined by the LiDAR’s detection range. When the LiDAR moves to a new position where the detection area touches the boundaries of the map, we move the map away from the boundaries by a distance . As the map moves, the memory containing the area moved out of the local map will be reset to store new areas moved in the local map. This ring-buffer approach ensures that our local map is maintained within a fixed size of memory. The implementation of the ring-buffer Hash map is detailed in [46]. The map move check is performed after each ESIKF update step.
V-B Geometry Construction and Update
The geometry of the map is constructed and updated from LiDAR point measurements. Specifically, after the LiDAR update in ESIKF (Section IV), we register all points from the LiDAR scan to the global frame. For each registered LiDAR point, we determine its located root voxel in the Hash map. If the voxel does not exist, we initialize the voxel with the new point and index it into the Hash map. If the determined voxel already exists in the map, we append the point to the existing voxel. After all points in a scan are distributed, we conduct geometry construction and update as follows.
For newly-created voxels, we determine if all the contained points lie on a plane based on the singular value decomposition. If so, we calculate the center point , plane normal , and the covariance matrix of , denoted as , of the plane. is used to characterize the plane uncertainty, which arises from both pose estimation uncertainty and the point measurement noise. The detailed plane criteria and calculation of the plane parameters and uncertainties can be referred to our previous work [14]. If the contained points do not lie on a plane, the voxel is continuously subdivided into eight smaller octants until either the points in the sub-voxel are determined to form a plane or the maximum layer (eg., 3) is reached. In the latter case, the points in the leaf voxel will be discarded. As a result, the map only contains voxels (either root or sub) identified as planes.
For existing voxels that have new points appended, we assess if the new points still form a plane with the existing points in the root voxel or sub-voxel. If not, we conduct voxel sub-division as above. If yes, we update the plane’s parameters (,) and covariance as above too. Once the plane parameters converge (see [14]), the plane will be considered as mature and new points on this plane will be discarded. Moreover, mature planes will have their estimated plane parameters (,) and covariance fixed.
LiDAR points on planes (either in the root voxel or sub-voxel) will be used for generating visual map points in the subsequent section. For mature planes, 50 most recent LiDAR points are candidates for visual map point generation, while for unmature planes, all LiDAR points are the candidates. The visual map point generation process will identify some of these candidate points as visual map points and attach them with image patches for image alignment.
V-C Visual Map Point Generation and Update
To generate and update visual map points, we select the candidate LiDAR points in the map that (1) are visible from the current frame (detailed in Section VII-A), and (2) exhibit significant gray-level gradients in the current image. We project these candidate points, after the visual update (Section IV-D), onto the current image and retain the candidate point with the smallest depth for the local plane in each voxel. Then, we divide the current image into uniform grid cells each with pixels. If a grid cell does not contain any visual map point projected here, we generate a new visual map point using the candidate point with the highest gray-level gradient and associate it with the current image patch, estimated current state (i.e., frame pose and exposure time), and the plane normal calculated from the LiDAR points as in the previous section. The patch attached to the visual map points has three layers of the same size (e.g., pixels), each layer is half sampled from the previous layer, forming a patch pyramid. If a grid cell contains visual map points projected here, we add new patch (all three layers of pyramid) to the existing visual map point if (1) more than 20 frames have passed since its last patch addition, or (2) its pixel position in the current frame deviates by more than 40 pixels from its position at the last patch addition. As a result, the map points will likely have effective patches with uniformly distributed viewing angles. Along with the patch pyramid, we also attach the estimated current state (i.e.,pose and exposure time to the map point.
V-D Reference Patch Update
A visual map point could have more than one patch due to the addition of new patches. We need to choose one reference patch for image alignment in the visual update. In detail, we score each patch based on photometric similarity and viewing angle as follows:
(12) | ||||
where represents the Normalized Cross-Correlation (NCC) used to measure the similarity between patch and at the -th pyramid level (the level with the highest resolution) of both patch, with mean subtraction applied to both patches, denotes the cosine similarity between the normal vector and view direction of patch under evaluation. When the patch is directly facing the plane where the map point is located, the value of is 1. The overall score is calculated by summing the weighted NCC and , where the former represents the average similarity between the patch under evaluation and all other patches and represents the trace of the covariance matrix of the normal vector.
Among all the patches attached to a visual map point, the one with the highest score is updated as the reference patch. The above scoring mechanism tends to choose reference patches whose (1) appearance is similar (in terms of NCC) to most of the rest of the patches, a technique used by MVS [47] to avoid patches on dynamic objects; (2) view direction is orthogonal to the plane, thereby maintaining texture details at a high resolution. In contrast, the reference patches update strategy in our previous work FAST-LIVO [8] and prior arts [4] directly select the patch with the smallest view direction difference from the current frame, causing the selected reference patch to be very close to the current frame, hence imposing weak constraints on the current pose update.
V-E Normal Refine
Each visual map point is assumed to lie on a small local plane. Existing works [4, 2, 8] assumed that all pixels in a patch have the same depth, a wild assumption that does not hold in general. We use plane parameters computed from the LiDAR points as detailed in Section V-B to achieve greater accuracy. This plane normal is crucial for performing affine warping for image alignment in the visual update process. To further enhance the accuracy of the affine warping, the plane normal could be further refined from the patches attached to the visual map point. Specifically, we refine the plane normal in the reference patch by minimizing the photometric error with respect to the other patches attached to the visual map point.
V-E1 Affine warping
Affine warping is used to transform patch pixels from the reference frame (i.e., the source patch) to patch pixels in the rest of the frames (i.e., the target patch), illustrated in Fig. 4(a). Let be the -th pixel coordinates in the source patch and be the -th pixel coordinates in the -th target patch. Assuming all pixels in the patch lie in a local plane with normal and visual map point position (which corresponds to the center pixel for both source and target patches), both represented in the source patch frame, we have:
(13) | ||||
where represents the affine warping matrix that transforms the pixel coordinates from the source (or reference) patch to the -th target patch, and denote the relative pose of the reference frame w.r.t. the target frame . To use fisheye images directly without rectifying them to pinhole images, we implement projection matrix and back projection matrix based on different camera models (e.g., is the camera intrinsic matrix for the pinhole camera model).
V-E2 Normal Optimization
To refine the plane normal , we minimize photometric errors between the reference patch and other image patches at the -th pyramid level (i.e., the highest-resolution level):
(14) |
where is the path size, and are the inverse exposure times of the reference frame and the -th target frame, respectively. denote -th patch pixel in the reference frame, denotes the -th path pixel in the -th target frame, and is the set of all target frames.
V-E3 Optimization Variable Transformation
To enhance the computational efficiency, we reparameterize the least squares problem in (14). Note that the optimization variable only appeared in in (13), the optimization over can be conducted over . Moreover, the vector is subject to constraint , meaning that can be parameterized as follows:
(15) | ||||
where since no such reference patch could be chosen for the visual map point. The relation among , , and are shown in Fig. 4 (b).
Finally, the optimization in (14) is conducted over the vector without any constraints. This optimization can be performed in a separate thread to avoid blocking the main odometry thread. The optimized parameter can then be used to recover the optimal normal vector :
(16) |
Once the plane normal converges, the reference patch and normal vector for this visual map point are fixed without further refinement, and all other patches are deleted.
VI LiDAR Measurement Model
This section details the LiDAR measurement model used in the LiDAR update of ESIKF in Section IV-D.
VI-A Point-to-plane LiDAR Measurement Model
After obtaining the undistorted points in a scan, we project them to the global frame using the estimated state at the -th iteration of LiDAR update:
(17) |
We then identify the root or sub voxel where lies in the Hash map. If no voxel is found or the voxel does not contain a plane, the point is discarded. Otherwise, we use the plane in the voxel to establish a measurement equation for the LiDAR point. Specifically, we assume the true LiDAR point , given the accurate LiDAR pose , should lie on the plane with normal and center point in the voxel. i.e.,
(18) |
Since the ground-true point is measured as with ranging and bearing noises , we have . Likewise, the plane parameters are estimated as with covariance (Section V-B), so we have: . Therefore,
(19) |
where the measurement noise consisting of the noise associated with the LiDAR point, the normal vector, and the plane center respectively.
VI-B LiDAR Measurement Noise with Beam Divergence
The uncertainty of a LiDAR point in the local LiDAR frame is decomposed into two components in [14], the ranging uncertainty caused by laser time of flight (TOF), and the bearing direction uncertainty originated from encoders. Besides these uncertainties, we also consider uncertainties caused by the laser beam divergence angle , as illustrated in Fig. 5. As the angle between the bearing direction and normal vector increases, the ranging uncertainty of the LiDAR point increases significantly, while the bearing direction uncertainty remains unaffected. The due to the laser beam divergence angle can be modeled as:
(20) |
Considering influenced by TOF and laser beam divergence, when our system selects more points from the ground or walls (see Fig. 5 (c, d)), it achieves a more precise pose estimation than that not considering such effect.
VII Visual Measurement Model
This section details the visual measurement model used in the visual update of ESIKF in Section IV-D.
VII-A Visual Map Point Selection
To perform sparse image alignment in the visual update, we begin by selecting appropriate visual map points. We first extract the set of map points (termed as visual submap) that is visible in the current camera FoV, using voxel and raycasting queries. Then, the visual map points from this submap are selected and outliers are rejected. This process yields a refined set of visual map points ready for constructing visual photometric errors in the visual measurement model.
VII-A1 Visible Voxel Query
Identifying map voxels within the current frame FoV is challenging due to the large number of voxels in the map. To address this issue, we poll the voxels hit by LiDAR points in the current scan. This can be done efficiently by inquiring the voxel Hash table using the measured point position. If the camera FoV is largely overlapped with the LiDAR FoV, map points in the camera FoV likely lie in these voxels as well. We also poll voxels hit by map points identified visible (through the same voxel query and raycasting) in the previous image frame, assuming that two consecutive image frames have large FoV overlaps. Finally, the current visual submap can be obtained as map points contained in these two types of voxels followed by a FoV check.
VII-A2 Raycasting on Demand
In most cases, the visual submap can be obtained through the voxel queries above. However, a LiDAR sensor could return no points when it is too close to an object (known as the close proximity blind zones). Also, the camera FoV may not be completely covered by the LiDAR FoV. To recall more visual map points in these cases, we employ a raycasting strategy as illustrated in Fig. 6. We divide the image into uniform grid cells, each with pixels, and project the visual map points obtained from the voxel query onto the grid cells. For each image grid cell that is not occupied by these visual map points, a ray is cast backward along the central pixel, where sample points are uniformly distributed along the ray in the depth direction from to . In order to reduce computation load, the positions of sample points on each ray in the camera body frame are pre-computed. For each sampled point, we evaluate the corresponding voxel’s status: if the voxel contains map points that lie in this grid cell after projection, we incorporate these map points into the visual submap and cease for this ray. Otherwise, we continue to the next sample point on the ray until reaching the maximum depth . After processing all the unoccupied image grid cells through raycasting, we obtain a set of visual map points that distribute among the whole image.
VII-A3 Outlier Rejection
After voxel query and raycasting, we obtain all visual map points in the current frame FoV. However, these visual map points could be occluded in the current frame, have discontinuous depth, have their reference patch taken at large view angles, or have large view angles in the current frame, all of which can severely degrade image alignment accuracy. To address the first issue, we project all the visual map points in the submap into the current frame using the pose after the LiDAR update and keep the lowest-depth points in each grid cell of pixels. To address the second issue, we project the LiDAR points in the current LiDAR scan to the current frame producing a depth map. By comparing the depth of visual map points with their neighbor in the depth map, we determine their occlusion and depth variation. Occluded and depth-discontinuous map points are rejected (see Fig. 7). To address the third and fourth issue, we remove points where the view angle (i.e., the angle between normal vector and direction from the visual map point to the patch optical center) of the reference patch or current patch is too large (e.g. over 80°). The remaining visual map points will be used to align the current image.
VII-B Sparse-Direct Visual Measurement Model
The visual map points extracted above are used to construct the visual measurement model. The underlying principle is that, when transforming the map point to the current image with the ground-truth state (i.e., pose) , the photometric error between the reference patch and the current patch should be zero:
(21) |
where is the common camera projection model (i.e., Pinhole, MEI, ATAN, Scaramuzza, Equidistant), is the pose of the global frame w.r.t. reference frame , which has been estimated when receiving and fusing the reference frame, is the affine warping matrix that transforms pixels from the -th current patch to the reference patch, is the relative pixel position to the center within the current patch, denote the ground-true pixel values of the reference and current frames, respectively. They are measured as the actual image pixel values with measurement noise , which originate from various sources (e.g., shot noise and the Analog-to-Digital Converter (ADC) noise of the camera CMOS). Hence,
(22) |
To enhance computational efficiency, we employ an inverse compositional formulation [48, 4], where the pose incremental parameterizing in (see (21)), is moved from to as follows:
(23) |
Given that in the reference frame remains unchanged during each iteration, we only require a one-time computation of the Jacobian matrices w.r.t. , rather than re-calculating them for every iteration.
To estimate the inverse exposure time from the measurement equation (22), we fix the initial inverse exposure time to eliminate the degeneration of equation (22) when all inverse exposure time are zeros. The estimated inverse exposure times of subsequent frames are therefore the exposure time relative to the first frame.
The equation (22) is used in the visual update step across three levels (see Algorithm 1); the visual update starts from the coarsest level, after the convergence of a level, it proceeds to the next finer level. The estimated state is then used to generate visual map points (Section V-C) and update reference patch (Section V-D).
VIII Datasets for Evaluation
In this section, we introduce datasets for performance evaluation, including public datasets NTU-VIRAL [49], Hilti’22 [50], Hilti’23 [51], and MARS-LVIG [52], as well as our self-collected FAST-LIVO2 private dataset. Specifically, the NTU-VIRAL and Hilti datasets are used to conduct a quantitative benchmark comparison of our system against other state-of-the-art (SOTA) SLAM systems (Section IX-B). The FAST-LIVO2 private dataset is primarily used to evaluate our system across various extremely challenging scenarios (Section IX-C), to demonstrate its capability for high-precision mapping (Section IX-D), and to validate the functionality of the individual modules within our system (Sections I-A through I-D in the Supplementary Material[53]). MARS-LVIG dataset is employed for application demonstrations (Section X) and ablation study (Section I-E in the Supplementary Material[53]).
VIII-A NTU-VIRAL, Hilti and MARS-LVIG Dataset
The NTU-VIRAL dataset, collected at the Nanyang Technological University campus using an aerial platform, presents diverse scenarios embodying unique aerial operational challenges. Specifically, the “sbs” sequences can only provide noisy visual features from distant objects. The “nya” sequences present challenges to LiDAR SLAM due to semi-transparent surfaces and to visual SLAM owing to intricate flight dynamics and low lighting conditions. The dataset is equipped with a 16-channel OS1 gen1333https://ouster.com/products/os1-lidar-sensor LiDAR sampled at 10 Hz and with a built-in IMU at 100 Hz, and two synchronized pinhole cameras triggered at 10 Hz. The left camera is used for evaluation.
The Hilti’22 and Hilti’23 datasets, collected by handheld and robot devices, encompass indoor and outdoor sequences from environments like construction sites, offices, labs and parking areas. These sequences introduce numerous challenges from long corridors, basements, and stairs, with textureless features, varying illumination conditions, and insufficient LiDAR plane constraints. Handheld sequences use a Hesai PandarXT-32444https://www.hesaitech.com/product/xt32/ LiDAR at 10 Hz, five wide-angle cameras at 40 Hz, which is downsampled into 10 Hz, and an external Bosch BMI085 IMU at 400 Hz. Meanwhile, the robot-mounted sequences feature a Robosense BPearl555https://www.robosense.ai/en/rslidar/RS-Bpearl LiDAR at 10 Hz, eight omnidirectional cameras at 10 Hz, and an Xsens MTi-670 IMU at 200 Hz. In both cases, the front-facing camera is used for all systems under evaluation. Millimeter-accurate ground truth, obtained through a motion capture system (MoCap) or a Total Station [54], is provided for each sequence. Note that the ground truth of the Hilti datasets is not open-source; therefore, algorithmic results on these datasets are evaluated via the Hilti official website. Since “Site 3” in Hilti’23 does not provide in-depth analysis plots (e.g., RMSE), we exclude these four sequences, but our scoring results for these sequences can still be found on their official website666https://hilti-challenge.com/leader-board-2022.html, https://hilti-challenge.com/leader-board-2023.html. NTU-VIRAL and Hilti contribute a total of 25 sequences.
The MARS-LVIG dataset provides high-altitude, ground-facing mapping data that encompasses diverse unstructured terrains such as jungles, mountains, and islands. The dataset was collected via a DJI M300 RTK quadrotor, which is equipped with a Livox Avia777https://www.livoxtech.com/avia LiDAR (with built-in BMI088 IMU) and a high-resolution global-shutter camera, both triggered at 10 Hz. This is notably distinct from the aforementioned NTU-VIRAL and Hilti datasets, which use grayscale images, while the MARS dataset employs RGB images, thereby facilitating the generation of clear, dense colored point clouds. Therefore, we leverage this public dataset to validate our capabilities in high-altitude aerial mapping applications.
VIII-B FAST-LIVO2 Private Dataset
To validate the system’s performance under more extreme conditions (e.g., LiDAR degeneration, low illumination, drastic exposure changes, and cases of no LiDAR measurements), we make a new dataset named FAST-LIVO2 private dataset. The dataset, hardware device, and hardware synchronization scheme are released with the codes of this work to facilitate the reproduction of our work.
VIII-B1 Platform
Our data collection platform, illustrated in Fig. 8, is equipped with an industrial camera (MV-CA013-21UC), a Livox Avia LiDAR, and a DJI manifold-2c (Intel i7-8550u CPU and 8 GB RAM) as onboard computers. The camera FoV is and the LiDAR FoV is . All sensors are hard synchronized with a 10 Hz trigger signal, generated by STM32 synchronized timers.
VIII-B2 Sequence Description
As summarized in Table S1 in the Supplementary Material[53], FAST-LIVO2 private dataset comprises 20 sequences across various scenes (e.g., campus buildings, corridors, basements, mining tunnel, etc.) characterized by structure-less, cluttered, dim, variable-lighting, and weakly textured environments, with a total duration of 66.9 min. Most sequences exhibit visual and/or LiDAR degeneration, such as facing a single and/or texture-less plane, traversing an extremely narrow and/or dark tunnel, and experiencing varying light conditions from indoor to outdoor (see Fig. S7 in the Supplementary Material[53]). To guarantee enhanced synchronous data collection between the camera and LiDAR, we configure the camera with fixed exposure time but auto-gain mode in most scenarios. For the remaining sequences with auto-exposure, we record their ground truth exposure times. In all sequences, the platform returns to the starting point, which enables the drift evaluation.
IX Experiment Results
In this section, we conduct extensive experiments to evaluate our proposed system.
IX-A Implementation and System Configurations
We implemented the proposed FAST-LIVO2 system in C++ and Robots Operating System (ROS). In the default configuration, the exposure time estimation is enabled, while normal vector refinement is turned off. LiDAR points in a scan are downsampled temporally at a 1:3 ratio. The root voxel size for the voxel map is set at 0.5 m, and the maximum layer of the internal octree is 3. The image patch size is for image alignment and for normal refinement. Within the sequential ESIKF settings, for all the experiments, the camera photometric noise is set to a constant value of 100. The LiDAR depth error and bearing angle error are adjusted to 0.02 m and 0.05° for Livox Avia LiDAR and OS1-16, 0.001 m and 0.001° for PandarXT-32, 0.008 m and 0.01° for Robosense BPearl LiDAR. The laser beam divergence angle is set at 0.15° for Livox Avia LiDAR and OS1-16, and at 0.001° for the PandarXT-32 and Robosense BPearl LiDAR. Our system uses the same parameters in all sequences of all datasets with the same sensor setup. The computation platform for all experiments is a desktop PC equipped with an Intel i7-10700K CPU and 32 GB RAM. For FAST-LIVO2, we also test it on an ARM processor that is commonly used in embedded systems with reduced power and cost. The ARM platform is RB5888https://www.adlinktech.com/Products/Computer_on_Modules/SMARC/LEC-RB5?lang=en. with a Qualcomm Kryo585 CPU and 8 GB RAM. We refer to the implementation of FAST-LIVO2 on the ARM-based platform as “FAST-LIVO2 (ARM)”.
Dataset | Sequence | SDV- LOAM | Our LIO | FAST- LIO2 | R3LIVE | LVI- SAM | FAST- LIVO | Ours (w/o expo) | Ours (w normal) | Ours (w/o update) | Ours |
Hilti’22 | Construction Ground | 25.121 | 0.011 | 0.013 | 0.021 | 0.022 | 0.011 | 0.008 | 0.015 | 0.010 | |
Construction Multilevel | 12.561 | 0.031 | 0.044 | 0.024 | 0.052 | 0.021 | 0.018 | 0.025 | 0.020 | ||
Construction Stairs | 9.212 | 0.221 | 0.320 | 0.784 | 9.142 | 0.241 | 0.049 | 0.027 | 0.151 | 0.016 | |
Long Corridor | 19.531 | 0.061 | 0.064 | 0.061 | 6.312 | 0.065 | 0.069 | 0.059 | 0.071 | 0.067 | |
Cupola | 9.321 | 0.221 | 0.250 | 2.142 | 0.182 | 0.161 | 0.122 | 0.179 | 0.121 | ||
Lower Gallery | 11.232 | 0.014 | 0.024 | 0.008 | 2.281 | 0.022 | 0.010 | 0.008 | 0.010 | 0.007 | |
Attic to Upper Gallery | 4.551 | 0.223 | 0.720 | 2.412 | 0.621 | 0.101 | 0.077 | 0.221 | 0.069 | ||
Outside Building | 2.622 | 0.030 | 0.028 | 0.029 | 0.952 | 0.052 | 0.042 | 0.033 | 0.050 | 0.035 | |
Hilti’23 | Floor 0 | 4.621 | 0.028 | 0.031 | 0.018 | 0.021 | 0.025 | 0.023 | 0.023 | 0.022 | |
Floor 1 | 7.951 | 0.025 | 0.031 | 0.024 | 8.682 | 0.022 | 0.024 | 0.022 | 0.031 | 0.023 | |
Floor 2 | 7.912 | 0.041 | 0.083 | 0.046 | 0.048 | 0.023 | 0.021 | 0.051 | 0.022 | ||
Basement | 6.151 | 0.021 | 0.038 | 0.024 | 0.035 | 0.020 | 0.018 | 0.018 | 0.016 | ||
Stairs | 9.032 | 0.110 | 0.170 | 0.110 | 3.584 | 0.152 | 0.025 | 0.020 | 0.132 | 0.018 | |
Parking 3x floors down | 19.952 | 0.162 | 0.320 | 0.462 | 0.356 | 0.035 | 0.022 | 0.112 | 0.032 | ||
Large room | 16.781 | 0.121 | 0.028 | 0.035 | 0.563 | 0.031 | 0.033 | 0.027 | 0.118 | 0.026 | |
Large room (dark) | 15.012 | 0.051 | 0.040 | 0.059 | 0.053 | 0.049 | 0.051 | 0.058 | 0.046 | ||
NTU VIRAL | eee_01 | 0.301 | 0.122 | 0.212 | 0.072 | 3.901 | 0.191 | 0.069 | 0.066 | 0.109 | 0.068 |
eee_02 | 1.842 | 0.131 | 0.172 | 0.059 | 0.182 | 0.132 | 0.051 | 0.055 | 0.112 | 0.051 | |
eee_03 | 0.301 | 0.124 | 0.213 | 0.078 | 0.287 | 0.192 | 0.068 | 0.070 | 0.099 | 0.068 | |
nya_01 | 0.202 | 0.084 | 0.141 | 0.080 | 0.205 | 0.121 | 0.075 | 0.078 | 0.106 | 0.073 | |
nya_02 | 0.214 | 0.153 | 0.212 | 0.084 | 1.296 | 0.182 | 0.076 | 0.081 | 0.118 | 0.075 | |
nya_03 | 0.251 | 0.082 | 0.133 | 0.079 | 0.176 | 0.112 | 0.060 | 0.060 | 0.092 | 0.059 | |
sbs_01 | 0.212 | 0.112 | 0.184 | 0.075 | 0.254 | 0.253 | 0.064 | 0.063 | 0.098 | 0.062 | |
sbs_02 | 0.233 | 0.123 | 0.161 | 0.076 | 0.221 | 0.134 | 0.062 | 0.048 | 0.116 | 0.061 | |
sbs_03 | 0.281 | 0.122 | 0.142 | 0.070 | 0.309 | 0.132 | 0.061 | 0.047 | 0.119 | 0.060 | |
Average | 7.416 | 0.097 | 0.151 | 0.278 | 1.928 | 0.137 | 0.051 | 0.044 | 0.089 | 0.045 |
-
1
denotes the system totally failed.
IX-B Benchmark Experiments
In this experiment, we conduct quantitative evaluations on 25 sequences from the NTU-VIRAL, Hilti’22, and 23 open datasets. Our approach is benchmarked against several state-of-the-art open-source odometry systems, including R3LIVE [36], a dense direct LiDAR-inertial-visual odometry system; FAST-LIO2 [13], a direct LiDAR-inertial odometry system; SDV-LOAM [40], a semi-direct LiDAR-visual odometry system; LVI-SAM [35], a feature-based LiDAR-inertial-visual SLAM system; and our previous work FAST-LIVO [8].
These systems are downloaded from their respective GitHub repositories. For FAST-LIO2, FAST-LIVO, and LVI-SAM, we use the recommended settings for indoor and outdoor scenes equipped with multi-line LiDAR sensors. For R3LIVE, we adapt the system to work with fisheye camera models and multi-line LiDARs equipped with external IMUs (the default configuration only supports internal IMUs). We disable the real-time optimization of the camera intrinsic and the extrinsic due to adverse optimization caused by insufficient IMU excitation in the datasets. Other parameters, including the window size and pyramid level for optical flow tracking, the resolution for downsampling the point cloud of the current scan and the global map, are fine-tuned to achieve optimal performance. Since only the vision module of SDV-LOAM is open-sourced, we integrate it with LeGO-LOAM[7] in a loosely coupled manner, following the methodology described in the original paper[40]. This enhanced system continues to refine poses obtained from the vision module and we also open this implementation on GitHub999https://github.com/xuankuzcr/SDV-LOAM_reimplementation. Given that all compared systems are odometry without loop closure, except for LVI-SAM, we remove the loop-closure module of LVI-SAM to ensure a fair comparison. Additionally, we conduct an ablation study on the exposure time estimation module, the normal refine module, and the reference patch update strategy. The default FAST-LIVO2 has real-time exposure estimation and reference patch update, but no normal refinement.
The results of all methods are shown in Table II. It is seen that our method achieves the highest overall accuracy across all sequences with an average RMSE of 0.044 m, which is three times more accurate than the second-place FAST-LIVO at 0.137 m. Our system delivers the best results in most sequences, except for ”Outside Building” and ”Large Room (dark)”, where our system exhibits a slightly (millimeter-level) higher error compared to the LiDAR-inertial only odometry FAST-LIO2. This discrepancy can be attributed to the rich structural features but poor lighting conditions of these sequences, resulting in dim and blurred images. Consequently, fusing these low-quality images does not enhance odometry accuracy. Excluding these two sequences, our method, which leverages tightly-coupled LiDAR, inertial, and visual information, outperforms FAST-LIO2, our LIO subsystem, and the LiDAR-visual only odometry, SDV-LOAM, significantly. Notably, SDV-LOAM performs particularly poorly on the Hilti datasets due to its lack of tight integration with IMU measurements, leading to drift in the LO subsystem. Additionally, the loose coupling between LiDAR and visual observations, along with poor initial values for VO, often results in local optima or even negative optimization. Our LIO subsystem generally surpasses FAST-LIO2 due to our more accurate noise modeling for each LiDAR point. In a few sequences where FAST-LIO2 outperforms slightly, the differences are minimal, at the millimeter level, and negligible. Moreover, our system’s accuracy significantly exceeds that of other tightly coupled LiDAR-inertial-visual systems across all sequences. Among them, LVI-SAM fails in nine sequences primarily due to its feature-based LIO and VIO subsystems not fully utilizing raw measurements, which degrades its robustness in environments with subtle geometric or texture features. R3LIVE generally performs well, but struggles in “Construction Stairs”, “Cupola”, and “Attic to Upper Gallery” sequences, where its performance is even worse than FAST-LIO2. This is because intense rotations at structure-less staircases result in inadequate pose priors, causing local optima when aligning colored map points with the current frame, and ultimately leading to negative optimization. FAST-LIVO and FAST-LIVO2 overcome such challenges in these sequences by the patch-based image alignments. Additionally, situations where the sensors are close to walls in these sequences highlight the effectiveness of raycasting in FAST-LIVO2, with the mapping results in these large-scale scenes shown in Fig. S8 in the Supplementary Material [53]. On the other hand, FAST-LIVO is outperformed by R3LIVE and FAST-LIVO2 on the NTU-VIRAL dataset, especially in unstructured scenes like “nya” sequences, where the effects of affine warping based on constant depth assumptions are inaccurate. In contrast, the pixel-level alignment of R3LIVE and the plane prior (or refinement) of FAST-LIVO2 do not encounter such issues.
Comparing the different variants of FAST-LIVO2, we observe that the average accuracy without real-time exposure time estimation decreases by 6 mm compared to the default, as the exposure time estimation can actively compensate illumination changes in the environment. On the other hand, the average accuracy without the reference patch update decreases by 44 mm compared to the default, as the reference patch update strategy effectively selects patches with higher resolution and avoids selecting outlier patches. Finally, the normal refinement increases the average accuracy by 1 mm, and the accuracy improvement is not consistent in all sequences. The limited improvement is mainly because normal vector refinement yields positive optimization only in simple structured scenes with nice image observations. In the NTU-VIRAL dataset, images from the “eee” and “nya” sequences are extremely dim and blurry, where negative optimization is particularly severe. To further study the effectiveness of the different modules, including exposure time estimation, affine warping, reference patch update, normal convergence, on-demand raycasting, and ESIKF sequential update, we conducted a thorough study on our private dataset and MARS-LVIG dataset. The results are presented in Section I (System Module Validation) in the Supplementary Material[53] due to the space limit. As confirmed in the results, our system can achieve robust and accurate pose estimation in both structured and unstructured environments, under severe light variations, in remarkably large-scale scenarios with long-term, high-speed data collection, and even in extremely narrow spaces with few LiDAR measurements.
IX-C LiDAR Degenerated and Visually Challenging Environments
In this experiment, we evaluate the robustness of our system under environments experiencing LiDAR degeneration and/or visual challenges, comparing it with the qualitative mapping results of FAST-LIVO and R3LIVE in 8 sequences as shown in Fig. 9 and 10. Fig. 9 showcases LiDAR degeneration sequences where the LiDAR is facing a big wall while moving along the wall from one side to the other. Due to the absence of geometrical constraints since only one wall plane is being observed by the LiDAR, LIO methods would fail. It’s worth mentioning that the “HIT Graffiti Wall” sequence spans nearly 800 meters with LiDAR continuously facing the wall, leading to considerable degeneration. In all sequences, FAST-LIVO2 distinctly showcases its robustness against even long-term degeneration and its capability to deliver high-precision colored point maps. In contrast, FAST-LIVO managed to obtain the geometric structure but with completely blurred texture. R3LIVE struggles with both geometric structure and texture clarity. Fig. 10 showcases tests in more complicated scenarios where LiDAR and/or camera both degenerate occasionally. The degeneration directions are indicated by respective arrows. “HKU Cultural Center” (Fig. 10 (a)) showcases the mapping results of FAST-LIVO2, R3LIVE, and FAST-LIVO. As can be seen, R3LIVE and FAST-LIVO have distorted point maps, blurred textures, and drifts exceeding 1 m. In contrast, FAST-LIVO2 successfully returns to the starting point, achieving an impressive end-to-end error of less than 0.01 m, while achieving a consistent point map with clear textures. “CBD Building 03” (Fig. 10 (b)) and “Mining Tunnel” (Fig. 10 (c)) display only FAST-LIVO2 results, as R3LIVE and FAST-LIVO failed. In Fig. 10 (b), the blue arrow represents movement towards a pure black screen, indicating concurrent LiDAR and camera degeneration. In Fig. 10 (c1) and (c2), the red points represent the LiDAR scan at that location, illustrating the areas of LiDAR degeneration due to the single plane being observed. Furthermore, the “Mining Tunnel” exhibits very dim lighting throughout the whole sequence, coupled with frequent visual and LiDAR degeneration. Despite of these challenges, FAST-LIVO2 still returns to the starting point with an end-to-end error of less than 0.01 m in both sequences.
IX-D High-precision Mapping
In this experiment, we validate the high-precision mapping capabilities of our system. To explore the mapping accuracy across different algorithms and ensure fairness, we compare our system with FAST-LIO2, R3LIVE, and FAST-LIVO in scenes characterized by rich texture and structured environments. We take the sequences “SYSU 01”, “HKU Landmark” and “CBD Building 01” as examples. Fig. S9 in the Supplementary Material[53] shows the colored point maps of these sequences reconstructed in real time. We can clearly observe that the point maps generated by FAST-LIVO2 retain the finest details among all the systems, the enlarged views of the colored point maps are akin to those in the actual RGB image. In the “SYSU 01” sequence, our algorithm produces fewer white noise dots on the signboard because we normalize the image colors to a reasonable exposure time using the recovered exposure time before coloring, resulting in rarely overexposed colored point maps. The reconstruction of the human and motorcycle in “CBD Building 01” also exemplifies our ability to rebuild the details of unstructured objects. In all sequences, the estimated final position returns to the starting point with an end-to-end error of less than 0.01 m. We also tested FAST-LIVO2 in the remaining sequences of the prviate dataset, with mapping results shown in Fig. S10-S13 in the Supplementary Material[53].
IX-E Run Time Analysis
In this section, we evaluate the average computational time per LiDAR scan and image frame of our proposed system, tested on a desktop PC equipped with an Intel i7-10700K CPU and 32 GB RAM. Our evaluations span public datasets including Hilti’22, Hilti’23, and NTU-VIRAL, and our private dataset. As shown in Table III, our system exhibits the lowest processing time across all sequences. The average computation time consumption on an Intel i7 processor is only 30.03 ms (17.13 ms per LiDAR scan and 12.90 ms per image frame), fulfilling real-time operation at 10 Hz. Besides, our system can even operate in real time on ARM processors with an average processing time per frame of just 78.44 ms. LVI-SAM’s LiDAR and visual feature extraction modules in LIO and VIO are time-consuming. In addition to the time consumed by LIO and VIO, LVI-SAM integrates IMU pre-integration constraints, visual odometry constraints, and LiDAR odometry constraints within a factor graph, further increasing the overall processing time. For R3LIVE, although also employing a direct method, its pixel-wise image alignment necessitates the use of a large number of visual map points. In contrast, our approach uses sparse points with reference patches, enabling efficient alignment. Additionally, R3LIVE maintains a colored map that undergoes Bayesian updating, significantly increasing the computational load as the map resolution increases. For FAST-LIO2, the average processing time per frame (Table S3 in the Supplementary Material[53] due to space constraints) is approximately 10.35 ms less than FAST-LIVO2 due to not processing additional image measurements.
FAST-LIVO2 also shows noticeable improvements over the predecessor FAST-LIVO. The primary enhancement stems from our application of inverse compositional formulation in the sparse image alignment. Employing affine warping based on the plane prior from LiDAR points further enhances the convergence efficiency of our method. Consequently, FAST-LIVO2 reduces the number of iterations per pyramid level from to , while still achieving superior accuracy.
Dataset | R3LIVE | LVI- SAM | FAST- LIVO | FAST-LIVO2 (LiDAR / Image) | FAST-LIVO2 (ARM) |
Hilti’22 | |||||
Construction Ground | 105.03 | 52.33 | 36.52 (20.44 / 15.05) | 96.12 | |
Construction Multilevel | 112.13 | 56.12 | 38.77 (21.38 / 17.39) | 95.38 | |
Construction Stairs | 125.41 | 138.47 | 51.34 | 39.33 (24.32 / 15.01) | 98.43 |
Long Corridor | 120.42 | 109.97 | 48.68 | 41.42 (26.21 / 15.21) | 94.33 |
Cupola | 151.52 | 59.42 | 43.54 (26.53 / 17.01) | 98.12 | |
Lower Gallery | 119.74 | 131.37 | 51.15 | 41.11 (25.69 / 15.42) | 92.13 |
Attic to Upper Gallery | 144.39 | 58.61 | 44.21 (27.12 / 17.09) | 91.15 | |
Outside Building | 105.17 | 107.92 | 44.25 | 33.82 (18.91 / 14.91) | 85.43 |
Hilti’23 | |||||
Floor 0 | 117.01 | 52.23 | 42.22 (25.13 / 17.09) | 92.24 | |
Floor 1 | 106.11 | 106.98 | 50.14 | 43.12 (27.43 / 15.69) | 93.52 |
Floor 2 | 154.65 | 53.24 | 41.78 (26.12 / 15.66) | 94.43 | |
Basement | 118.21 | 48.23 | 39.65 (24.53 / 15.12) | 93.42 | |
Stairs | 122.94 | 114.29 | 48.55 | 38.42 (22.64 / 15.78) | 95.53 |
Parking 3x floors down | 142.43 | 51.89 | 43.62 (26.99 / 16.63) | 94.22 | |
Large room | 125.78 | 182.18 | 55.23 | 43.23 (26.43 / 16.80) | 93.28 |
Large room (dark) | 131.31 | 51.46 | 44.21 (29.12 / 15.09) | 92.29 | |
NTU VIRAL | |||||
eee_01 | 105.89 | 113.61 | 38.22 | 31.45 (17.24 / 14.21) | 78.03 |
eee_02 | 112.27 | 119.19 | 39.43 | 30.24 (16.23 / 14.01) | 79.22 |
eee_03 | 108.11 | 108.77 | 37.53 | 29.44 (16.12 / 13.32) | 77.43 |
nya_01 | 122.73 | 124.54 | 39.66 | 34.52 (18.14 / 16.38) | 79.44 |
nya_02 | 111.96 | 117.18 | 35.11 | 32.23 (17.33 / 14.90) | 80.52 |
nya_03 | 115.96 | 111.24 | 38.42 | 33.42 (18.12 / 15.30) | 78.95 |
sbs_01 | 112.77 | 115.67 | 39.23 | 31.92 (14.68 / 17.24) | 78.32 |
sbs_02 | 113.01 | 110.37 | 35.39 | 32.56 (17.62 / 14.94) | 72.43 |
sbs_03 | 119.91 | 120.73 | 37.41 | 33.62 (18.02 / 15.60) | 75.56 |
Private Dataset | |||||
Retail Street | 85.32 | 70.23 | 31.22 | 19.44 (10.42 / 9.02) | 64.12 |
CBD Building 01 | 79.22 | 65.32 | 29.14 | 18.22 (10.32 / 7.90) | 62.33 |
CBD Building 02 | 21.53 (11.50 / 10.03) | 69.98 | |||
CBD Building 03 | 21.89 (11.98 / 9.91) | 69.92 | |||
HKU Landmark | 75.43 | 68.77 | 28.33 | 17.42 (10.12 / 7.30) | 63.21 |
HKU Lecture Center | 70.42 | 31.45 | 18.12 (10.99 / 7.13) | 62.88 | |
HKU Centennial Garden | 74.33 | 70.12 | 33.46 | 19.22 (10.80 / 8.42) | 64.52 |
HKU Cultural Center | 102.34 | 35.62 | 21.43 (11.42 / 10.01) | 72.43 | |
HKU Main Building | 75.62 | 30.21 | 19.43 (10.33 / 9.10) | 65.31 | |
HKUST Red Sculpture | 84.22 | 90.13 | 32.43 | 20.14 (11.12 / 9.02) | 65.42 |
HIT Graffiti Wall | 112.56 | 39.98 | 21.71 (10.55 / 11.16) | 68.99 | |
Banner Wall | 74.13 | 30.22 | 19.32 (10.01 / 9.31) | 63.43 | |
Bright Screen Wall | 73.23 | 28.43 | 18.22 (9.92 / 8.30) | 61.21 | |
Black Screen Wall | 71.42 | 27.66 | 17.99 (9.10 / 8.89) | 60.91 | |
Office Building Wall | 19.42 (10.11 / 9.31) | 62.33 | |||
Narrow Corridor | 18.44 (8.32 / 10.12) | 61.42 | |||
Long Corridor | 19.53 (10.43 / 9.10) | 62.55 | |||
Mining Tunnel | 29.43 (15.42 / 14.01) | 79.32 | |||
SYSU 01 | 112.66 | 88.95 | 32.23 | 23.65 (12.51 / 11.14) | 77.43 |
SYSU 02 | 110.12 | 32.12 | 22.63 (11.82 / 10.81) | 72.31 | |
Average | 108.36 | 108.45 | 41.43 | 30.03 (17.13 / 12.90) | 78.44 |
-
1
denotes the system totally failed.
X Applications
To showcase the superior performance and versatility of FAST-LIVO2 in real-world applications, we develop multiple solutions, including fully onboard autonomous UAV navigation, airborne mapping, textured mesh generation, and 3D Gaussian splatting reconstruction for 3D scene representation.
X-A Fully Onboard Autonomous UAV Navigation
Given the high precision and robust localization performance of FAST-LIVO2, along with its real-time capabilities, we conduct closed-loop autonomous UAV flights.
X-A1 System Configurations
The hardware and software setup are illustrated in Fig. 11. For hardware, we use a NUC (Intel i7-1360P CPU and 32 GB RAM) as the onboard computer. In terms of software, the localization component is powered by FAST-LIVO2, which provides position feedback at 10Hz. The localizationr result is fed to the flight controller to achieve 200Hz feedback on position, velocity, and attitude. Besides localization, FAST-LIVO2 supplies a dense registered point cloud to the planning module, the Bubble planner [55], which plans a smooth trajectory that is then tracked by an on-manifold Model Predictive Control (MPC) [56]. The MPC calculates the desired angular rates and thrust, which are tracked by respective low-level angular rate controllers running on the flight controller. Importantly, the MPC, Planner, and FAST-LIVO2 all operate on the onboard computer in real time.
X-A2 UAV Autonomous Navigation
We conduct 4 fully onboard autonomous UAV navigation experiments, “Basement”, “Woods”, “Narrow Opening”, and “SYSU Campus” (Table S2 in the Supplementary Material[53]). “Basement” and “Woods” experiments are fully autonomous flights incorporating all planning, MPC, and FAST-LIVO2 modules, while “Narrow Opening” and “SYSU Campus” are manual flights with only MPC and FAST-LIVO2 (without the planning component). As can be seen, “Basement” and “Woods” showcase the UAV’s successful autonomous navigation and obstacle avoidance. In “Narrow Opening”, the UAV is commanded to fly close proximity to a wall leading to few LiDAR points measurements. Nevertheless, the raycasting module recalls a greater number of visual map points, providing abundant constraints for localization, which allows for stable localization. Moreover, “Basement” and “Narrow Opening” experience LiDAR degeneration, observing only a single wall (see Fig. LABEL:fig_cover (e1) and (e4), Fig. 12 (b1-b4)), along with significant exposure variations (see Fig. LABEL:fig_cover (e5-e6)). Despite these challenges, our UAV system performed exceptionally well. “Woods” involves the UAV moving at high speeds up to 3 m/s, demanding rapid response from the entire UAV system (see Fig. 12 (a1-a4)). “SYSU Campus”, a non-degenerated scene, primarily demonstrates the onboard high-precision mapping capabilities (see Fig. S14 in the Supplementary Material[53]). Finally, it is worth mentioning that in all these four UAV flights, severe lighting variation occurred. FAST-LIVO2 is able to estimate exposure time that closely follows the ground-truth values (see Fig. S15 in the Supplementary Material[53]).
Regarding onboard computational time, the need to run MPC (at 100 Hz) and Planning (at 10 Hz) on the onboard computer consumes computational resources and memory, limiting the computation resources available to FAST-LIVO2. Despite of the concurrent execution of control and planning, as illustrated in Fig. 13, the average onboard processing time per LiDAR scan and image frame for FAST-LIVO2, approximately 53.47 ms, is still well below the frame period 100 ms. The average processing times for planning and MPC are 8.43 ms and 18.5 ms, respectively. The total average processing time of 80.4 ms meets very well the real-time requirements for onboard operations.
X-B Airborne Mapping
Airborne mapping represents a crucial task in surveying and mapping applications. To evaluate the suitability of FAST-LIVO2 for this application, we conduct an aerial mapping experiment using the public dataset MARS-LVIG [52] whose hardware configuration is detailed in Section VIII-A. We evaluate the two sequences “HKairport01” and “HKisland01”, whose real-time mapping results are illustrated in Fig. LABEL:fig_cover (a-c), with (a) and (c) corresponding to “HKisland01”, and (b) depicting “HKairport01”. The results demonstrate the effectiveness of FAST-LIVO2 in unstructured environments such as forests and islands. The system successfully captures many fine structures and sharp coloring effects, including buildings, lane marks on roads, road curbs, tree crowns, and rocks, all of which are clearly visible. The APE (RMSE) for these sequences are 0.64 m and 0.27 m for FAST-LIVO2, respectively, compared to 2.76 m and 0.52 m for R3LIVE. The average processing times on the desktop PC (Section IX-A), are approximately 25.2 ms and 21.8 ms, respectively, compared to 110.5 ms and 100.2 ms for R3LIVE.
X-C Supporting 3D Scene Applications: Mesh Generation, Texture, and Gaussian Splatting
Leveraging the high-precision sensor localization and dense 3D colored point map obtained from FAST-LIVO2, we develop software applications for rendering pipelines including meshing and texturing, as well as emerging NeRF-like rendering pipeline such as 3D Gaussian Splatting (3DGS). For meshing, we employ VDBFusion [57] based on the Truncated Signed Distance Function (TSDF) in “CBD Building 01”, shown in Fig. 14 (a). The sharp edges on the columns and the distinct structure of the roof are clearly visible, demonstrating the high quality of the mesh. This level of detail is achieved due to the high density of FAST-LIVO2’s point clouds and the exceptional accuracy of structural reconstruction. After mesh construction, we use OpenMVS [58] to perform texture mapping using the estimated camera poses in “CBD Building 01” and “Retail Street”, shown in Fig. 14 (b-c). In Fig. 14 (c1-c2), the texture images applied on the triangular facets are seamless and accurately aligned, resulting in a highly clear and precise texture mapping. This is attributed to pixel-level image alignment achieved by FAST-LIVO2.
The dense color point clouds from FAST-LIVO2 can also directly serve as the input of 3DGS. We conduct tests on the sequence “CBD Building 01” utilizing 300 frames out of a total of 1,180 images. The results are shown in Fig. 15. Compared to COLMAP[59], our method significantly reduces the time required to obtain dense point clouds and poses from 9 hours to 21 s. However, the training time increases from 10 min and 59 s to 15 min and 30 s. This increase is attributed to the denser point clouds (downsampled to 5 cm), which introduce more parameters to optimize. Nonetheless, the increased density and precision of our point clouds result in a slightly higher Peak Signal-to-Noise Ratio (PSNR) compared to the PSNR obtained from COLMAP inputs.
XI Conclusion and Future Work
This paper proposed FAST-LIVO2, a direct LIVO framework achieving fast, accurate, and robust state estimation while reconstructing the map on the fly. FAST-LIVO2 can achieve high localization accuracy while being robust to severe LiDAR and/or visual degeneration.
The gain in speed is attributed to the use of raw LiDAR, inertial, and camera measurements within an efficient ESIKF framework with sequential update. In the image update, an inverse compositional formulation along with a sparse patch-based image alignment is further adopted to boost the efficiency. The gain in accuracy is attributed to the use (and even refine) of plane priors from LiDAR points to enhance accuracy of image alignment. Besides, a single unified voxel map is used to manage simultaneously the map points and the observed high-resolution image measurements. The voxel map structure, which supports geometry construction and update, visual map point generation and update, and reference patch update, is developed and validated. The gain in robustness is due to real-time estimation of exposure time, which effectively handles environment illumination variation, and on-demand voxel raycasting to cope with LiDARs’ close proximity blind zones. The efficiency and accuracy of FAST-LIVO2 were evaluated on extensive public datasets, while the robustness and effectiveness of each system module were evaluated on private dataset. The applications of FAST-LIVO2 in real-world robotics applications, such as UAV navigation, 3D mapping, and model rendering, were also demonstrated.
As an odometry, FAST-LIVO2 may have drifts over long distances. In the future, we could integrate loop closure and the sliding window optimization into FAST-LIVO2 to mitigate this long-term drift. Moreover, the accurate and dense colored point maps could be used to extract semantic information for object-level semantic mapping.
References
- [1] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
- [2] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
- [3] J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European conference on computer vision. Springer, 2014, pp. 834–849.
- [4] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2016.
- [5] J. Zhang and S. Singh, “Loam: Lidar odometry and mapping in real-time.” in Robotics: Science and Systems, vol. 2, no. 9, 2014.
- [6] J. Lin and F. Zhang, “Loam livox: A fast, robust, high-precision lidar odometry and mapping package for lidars of small fov,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 3126–3131.
- [7] T. Shan and B. Englot, “Lego-loam: Lightweight and ground-optimized lidar odometry and mapping on variable terrain,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4758–4765.
- [8] C. Zheng, Q. Zhu, W. Xu, X. Liu, Q. Guo, and F. Zhang, “Fast-livo: Fast and tightly-coupled sparse-direct lidar-inertial-visual odometry,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 4003–4009.
- [9] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
- [10] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE transactions on robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
- [11] M.Irani and P.Anandan, “All about direct methods,” in Proc. Workshop Vis. Algorithms, Theory Pract, 1999, pp. 267–277.
- [12] C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semi-direct monocular visual odometry,” in 2014 IEEE international conference on robotics and automation (ICRA). IEEE, 2014, pp. 15–22.
- [13] W. Xu, Y. Cai, D. He, J. Lin, and F. Zhang, “Fast-lio2: Fast direct lidar-inertial odometry,” IEEE Transactions on Robotics, pp. 1–21, 2022.
- [14] C. Yuan, W. Xu, X. Liu, X. Hong, and F. Zhang, “Efficient and probabilistic adaptive voxel mapping for accurate online lidar odometry,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 8518–8525, 2022.
- [15] M. Meilland, A. I. Comport, and P. Rives, “Real-time dense visual tracking under large lighting variations,” in British Machine Vision Conference. British Machine Vision Association, 2011, pp. 45–1.
- [16] T. Tykkälä, C. Audras, and A. I. Comport, “Direct iterative closest point for real-time visual odometry,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011, pp. 2050–2056.
- [17] C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for rgb-d cameras,” in 2013 IEEE international conference on robotics and automation. IEEE, 2013, pp. 3748–3754.
- [18] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocular camera,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 1449–1456.
- [19] K. Chen, R. Nemiroff, and B. T. Lopez, “Direct lidar-inertial odometry: Lightweight lio with continuous-time motion correction,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 3983–3989.
- [20] Z. Wang, L. Zhang, Y. Shen, and Y. Zhou, “D-liom: Tightly-coupled direct lidar-inertial odometry and mapping,” IEEE Transactions on Multimedia, 2022.
- [21] J. Zhang and S. Singh, “Laser–visual–inertial odometry and mapping with high robustness and low drift,” Journal of field robotics, vol. 35, no. 8, pp. 1242–1264, 2018.
- [22] W. Shao, S. Vijayarangan, C. Li, and G. Kantor, “Stereo visual inertial lidar simultaneous localization and mapping,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 370–377.
- [23] J. Zhang, M. Kaess, and S. Singh, “A real-time method for depth enhanced visual odometry,” Autonomous Robots, vol. 41, pp. 31–43, 2017.
- [24] J. Graeter, A. Wilczynski, and M. Lauer, “Limo: Lidar-monocular visual odometry,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2018, pp. 7872–7879.
- [25] Y. Zhu, C. Zheng, C. Yuan, X. Huang, and X. Hong, “Camvox: A low-cost and accurate lidar-assisted visual slam system,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5049–5055.
- [26] S.-S. Huang, Z.-Y. Ma, T.-J. Mu, H. Fu, and S.-M. Hu, “Lidar-monocular visual odometry using point and line features,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 1091–1097.
- [27] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
- [28] Y.-S. Shin, Y. S. Park, and A. Kim, “Dvl-slam: Sparse depth enhanced direct visual-lidar slam,” Autonomous Robots, vol. 44, no. 2, pp. 115–130, 2020.
- [29] X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, “Lic-fusion: Lidar-inertial-camera odometry,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 5848–5854.
- [30] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for fast autonomous flight,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 965–972, 2018.
- [31] X. Zuo, Y. Yang, P. Geneva, J. Lv, Y. Liu, G. Huang, and M. Pollefeys, “Lic-fusion 2.0: Lidar-inertial-camera odometry with sliding-window plane-feature tracking,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5112–5119.
- [32] D. Wisth, M. Camurri, S. Das, and M. Fallon, “Unified multi-modal landmark tracking for tightly coupled lidar-visual-inertial odometry,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1004–1011, 2021.
- [33] J. Lin, C. Zheng, W. Xu, and F. Zhang, “R2live: A robust, real-time, lidar-inertial-visual tightly-coupled state estimator and mapping,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7469–7476, 2021.
- [34] C. F. W. Bell B.M., “The iterated kalman filter update as a gauss-newton method,” Automatic Control IEEE Transactions, vol. 38, no. 2, pp. 294–297, 1993.
- [35] T. Shan, B. Englot, C. Ratti, and D. Rus, “Lvi-sam: Tightly-coupled lidar-visual-inertial odometry via smoothing and mapping,” in 2021 IEEE international conference on robotics and automation (ICRA). IEEE, 2021, pp. 5692–5698.
- [36] J. Lin and F. Zhang, “R 3 live: A robust, real-time, rgb-colored, lidar-inertial-visual tightly-coupled state estimation and mapping package,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 10 672–10 678.
- [37] ——, “R 3 live++: A robust, real-time, radiance reconstruction package with a tightly-coupled lidar-inertial-visual state estimator,” arXiv preprint arXiv:2209.03666, 2022.
- [38] J. Engel, V. Usenko, and D. Cremers, “A photometrically calibrated benchmark for monocular visual odometry,” arXiv preprint arXiv:1607.02555, 2016.
- [39] W. Wang, J. Liu, C. Wang, B. Luo, and C. Zhang, “Dv-loam: Direct visual lidar odometry and mapping,” Remote Sensing, vol. 13, no. 16, p. 3340, 2021.
- [40] Z. Yuan, Q. Wang, K. Cheng, T. Hao, and X. Yang, “Sdv-loam: Semi-direct visual-lidar odometry and mapping,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [41] H. Zhang, L. Du, S. Bao, J. Yuan, and S. Ma, “Lvio-fusion:tightly-coupled lidar-visual-inertial odometry and mapping in degenerate environments,” IEEE Robotics and Automation Letters, vol. 9, no. 4, pp. 3783–3790, 2024.
- [42] W. Xu and F. Zhang, “Fast-lio: A fast, robust lidar-inertial odometry package by tightly-coupled iterated kalman filter,” IEEE Robotics and Automation Letters, pp. 1–1, 2021.
- [43] D. He, W. Xu, and F. Zhang, “Symbolic representation and toolkit development of iterated error-state extended kalman filters on manifolds,” IEEE Transactions on Industrial Electronics, 2023.
- [44] D. Willner, C.-B. Chang, and K.-P. Dunn, “Kalman filter algorithms for a multi-sensor system,” in 1976 IEEE conference on decision and control including the 15th symposium on adaptive processes. IEEE, 1976, pp. 570–574.
- [45] J. Ma and S. Sun, “Globally optimal distributed and sequential state fusion filters for multi-sensor systems with correlated noises,” Information Fusion, p. 101885, 2023.
- [46] Y. Ren, Y. Cai, F. Zhu, S. Liang, and F. Zhang, “Rog-map: An efficient robocentric occupancy grid map for large-scene and high-resolution lidar-based motion planning,” arXiv preprint arXiv:2302.14819, 2023.
- [47] R. M. Stereopsis, “Accurate, dense, and robust multiview stereopsis,” IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 32, no. 8, 2010.
- [48] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International journal of computer vision, vol. 56, pp. 221–255, 2004.
- [49] T.-M. Nguyen, S. Yuan, M. Cao, Y. Lyu, T. H. Nguyen, and L. Xie, “Ntu viral: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint,” The International Journal of Robotics Research, vol. 41, no. 3, pp. 270–280, 2022.
- [50] M. Helmberger, K. Morin, B. Berner, N. Kumar, G. Cioffi, and D. Scaramuzza, “The hilti slam challenge dataset,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7518–7525, 2022.
- [51] L. Zhang, M. Helmberger, L. F. T. Fu, D. Wisth, M. Camurri, D. Scaramuzza, and M. Fallon, “Hilti-oxford dataset: A millimeter-accurate benchmark for simultaneous localization and mapping,” IEEE Robotics and Automation Letters, vol. 8, no. 1, pp. 408–415, 2022.
- [52] H. Li, Y. Zou, N. Chen, J. Lin, X. Liu, W. Xu, C. Zheng, R. Li, D. He, F. Kong, et al., “Mars-lvig dataset: A multi-sensor aerial robots slam dataset for lidar-visual-inertial-gnss fusion,” The International Journal of Robotics Research, p. 02783649241227968, 2024.
- [53] “Supplementary material: Fast-livo2: Fast, direct lidar-inertial-visual odometry,” available online: https://github.com/hku-mars/FAST-LIVO2/blob/main/Supplementary/LIVO2˙supplementary.pdf.
- [54] C. Klug, C. Arth, D. Schmalstieg, and T. Gloor, “Measurement uncertainty analysis of a robotic total station simulation,” in IECON 2018-44th Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2018, pp. 2576–2582.
- [55] Y. Ren, F. Zhu, W. Liu, Z. Wang, Y. Lin, F. Gao, and F. Zhang, “Bubble planner: Planning high-speed smooth quadrotor trajectories using receding corridors,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 6332–6339.
- [56] G. Lu, W. Xu, and F. Zhang, “On-manifold model predictive control for trajectory tracking on robotic systems,” IEEE Transactions on Industrial Electronics, vol. 70, no. 9, pp. 9192–9202, 2022.
- [57] I. Vizzo, T. Guadagnino, J. Behley, and C. Stachniss, “Vdbfusion: Flexible and efficient tsdf integration of range sensor data,” Sensors, vol. 22, no. 3, 2022. [Online]. Available: https://www.mdpi.com/1424-8220/22/3/1296
- [58] D. Cernea, “Openmvs: multi-view stereo reconstruction library. 2020,” URL: https://cdcseacave. github. io/openMVS, vol. 5, no. 6, p. 7, 2020.
- [59] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4104–4113.
Supplementary Material for FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry
I. System Module Validation
In this section, we validate the key modules of our system, including the affine warping, normal refinement, reference patch update, on-demanding raycasting query, exposure time estimation, and ESIKF sequential update utilizing the FAST-LIVO2 private dataset and MARS-LVIG dataset.
A. Evaluation of Affine Warping
In this experiment, we aim to comprehensively evaluate the various affine warping effects based on the constant depth assumption (a common technique utilized in semi-dense methods), the plane prior from point clouds, and the refined plane normal of our proposed system (denoted as “Constant depth”, “Plane prior”, and “Plane normal refined“, respectively). To achieve this, we compare the mapping results and drift metrics of the three methods on “CBD Building 02” and “Office Building Wall”. As depicted in Fig. S2, “Plane normal refined” delivers the most clear and accurate mapping results and the next best is “Plane prior”. Notably, “Plane normal refined” renders text and patterns on the ground and walls, as well as lane markings, with remarkable clarity. Moreover, the drift associated with “Plane prior” and “Plane normal refined” remains below , whereas “Constant depth” does not return to the starting position, experiencing a drift of . Such results confirm the enhanced performance of affine warping based on the plane prior and enhancements by the plane normal refinement.
Besides, we compare warped projection effects based on “Constant depth” and “Plane prior” on the sequence “CBD Building 02” and “Office Building Wall”. We randomly select several image frames from these two sequences for the qualitative analysis. For each frame, we project the reference patches attached to the visual map points visible in the frame onto a blank image of the current frame. This process yields a novel RGB image. If the affine warping and pose estimation are both performed well, areas with patch projections will produce a seamless and minimally distorted appearance, closely resembling the raw RGB image. The comparison of warped patches is presented in Fig. S1. The results indicate that the pose accuracy and warped performance under “Plane prior” significantly outperform those under “Constant depth”
.
B. Evaluation of Reference Patch Update and Normal Convergence
In this experiment, we validate the effects of the reference patch update strategy and normal convergence on “HIT Graffiti Wall” and “HKU Centennial Garden”. As shown in Fig. S3, (a) and (b) are the reconstructed point clouds of these two sequences. On the right, for each region from A to H, we respectively present five patch observations captured at different poses, each with a patch size of pixels for visualization. These patches are observed in camera frames located at corresponding numbers on the left. As can be seen, our reference path update strategy tends to choose a high-resolution reference patch that faces the plane along its normal. It is also noticed that, despite of visual map points and patches generated at non-planar locations (e.g., tree leaves, trunk, and a lamp stand), the overall mapping quality is still high.
We also evaluate the convergence of our proposed normal estimation across patches in regions A to H. Each patch is in the size of pixels. The initial normal vector is estimated from the LiDAR points. The convergence curves, which represent the angle change between the initial and optimized normal vectors at each iteration number, are shown in Fig. S3. Regions A, C, D, and E are structured areas, whereas regions B, F, G, and H are unstructured ones. It can be observed that normal vectors for structured areas converge faster (within 6 iterations) with small normal refinement (2 to 4 degrees) because the initial normal provided by point clouds is relatively accurate. In unstructured areas, such as shrubs and tree leaves (i.e., B and F), the normal refinement is significant (up to 9 degrees) and requires 9 iterations to converge. Overall, normal refinement across these 8 regions demonstrates good convergence properties.
C. Evaluation of On-demand Raycasting
In this experiment, we assess the performance of the on-demand raycasting module under extreme conditions where the current and recent LiDAR scans have few or even no points due to LiDARs’ close proximity blind zone111https://www.livoxtech.com/avia/specs. We use the sequence “Narrow Corridor” for an in-depth analysis illustrated in Fig. S4. In this sequence, we traverse an extremely narrow tunnel, approximately in width, and turn to face the weakly textured wall on one side. Due to the limited points in the LiDAR scans when facing the wall, we can only acquire few visual map points through voxel query (yellow dots in Fig. S4 (b)). In this case, raycasting offers sufficient visual constraints to mitigate degeneration (blue dots in Fig. S4 (b)). The visualization results demonstrate that the on-demand raycasting module works well under challenging conditions with few points in LiDAR scans.
D. Evaluation of Exposure Time Estimation
In this experiment, we validate the exposure time estimation module in two parts: 1) For the sequence with fixed exposure and gain, we multiply each pixel of the received raw image by an exposure factor that changes sinusoidally over time. We verify the effectiveness of our estimation by comparing the estimated exposure times against the sinusoidal function applied. 2) For sequences with auto-exposure and either fixed or auto-gain settings, we evaluate the accuracy of our estimated exposure times by comparing them with the ground truth values retrieved from the camera’s API.
In part one, we test on the sequence “Retail Street”, applying an exposure factor to images with fixed exposure and gain. As shown in Fig. S5, the estimated relative inverse exposure time matches the true values very well, evidencing the convergence of our exposure estimation in synthetic conditions. In part two, we use the sequences “HKU Centennial Garden”, “HKU Cultural Garden” and “HKU Main Building”, which have significant exposure time changes, for testing. We scale the estimated relative exposure times by the first frame to recover the actual exposure time (ms) of each frame. As shown in Fig. S5, the estimated exposure time follows closely the ground-truth values, which validates the effectiveness of our exposure time estimation module. The occasional mismatches are possibly due to the unmodeled response function and vignetting factor [38].
E. Evaluation of ESIKF Sequential Update
In this experiment, we evaluate different ESIKF update strategies for LiDAR and camera states. We compare asynchronous versus synchronous updates, as well as standard versus sequential updates. Specifically, we assess three strategies: “asynchronous (standard update)”, where the camera and LiDAR states are updated at their respective sampling times without scan recombination; “synchronous (standard update)”, where LiDAR scans are recombined to sync with camera images and the state is updated with both LiDAR and camera measurements within a standard ESIKF; and “synchronous (sequential update)”, where the LiDAR and camera are synced but the state is first updated by LiDAR measurements and then updated by camera measurements. These strategies are evaluated in terms of accuracy, robustness, and efficiency using the “AMvalley03” sequence of the MARS-LVIG dataset. We select this sequence for several key reasons:
-
(1)
This sequence includes slopes that lead to both LiDAR and visual degenerations, making it a challenging test case.
-
(2)
This sequence represents an extremely large-scale scenario (approximately ) with long-term and high-speed data collection (covering at a speed of ), where pose deviations are prone to occur (due to the long-term and high-speed conditions), and even slight drifts can cause significant blurring in the colored point clouds (due to the large scale), leading to more pronounced comparative results.
-
(3)
This sequence provides the RTK ground truth data, allowing for more accurate quantitative comparisons.
We compare the qualitative mapping results, quantitative APE, and the average processing time of the three update strategies. The experimental configuration is as follows: LiDAR updates involve up to 5 iterations, visual updates use a three-level pyramid with up to 5 iterations per level and no more than 3 iterations per level when the camera and LiDAR are updated simultaneously in a standard ESIKF, and the scale normalization factor (from visual photometric error to LiDAR point-to-plane distance) for the “synchronous (standard update)” is set to 0.0032, which has been meticulously tuned for optimal performance.
Fig. S6, (a-c) present the reconstructed colored point clouds for this sequence. It is evident that the “synchronous (sequential update)” strategy produces accurate mapping results, particularly in the areas highlighted by the blue and orange boxes, where the mountain roads are reconstructed without any layering. In contrast, the other two strategies exhibit misalignments in these areas, although the “synchronous (standard update)” performs slightly better than the “asynchronous (standard update)”. The superior performance of the “synchronous (sequential update)” strategy is mainly attributed to its robustness in handling significant LiDAR and visual degenerations, as seen in the white box (c3). This area features a large, textureless slope, and the UAV passes over it at a high speed, heavily relying on a strong prior. The other two methods, which rely solely on the IMU prior, struggle to compute a relatively accurate image gradient descent direction, leading to significant linearization errors.
The APE (RMSE) metrics for the “AMvalley03” sequence are , , and for the “asynchronous (standard update)”, “synchronous (standard update)”, and “synchronous (sequential update)”, respectively. The average processing times on a desktop PC (Section IX-A) are approximately , , and . Our proposed “synchronous (sequential update)” achieves the highest efficiency and accuracy, while the “asynchronous (standard update)” has the lowest accuracy. The “synchronous (standard update)” is the most time-consuming, primarily because it requires fusing all LiDAR measurements at each level of the image pyramid.
Overall, our proposed “synchronous (sequential update)” offers superior accuracy and efficiency, while the “asynchronous (standard update)” has the lowest accuracy, and the “synchronous (standard update)” is the most time-consuming.