FASTLIVO2: Fast, Direct LiDARInertialVisual Odometry
Abstract
This paper proposes FASTLIVO2: a fast, direct LiDARinertialvisual odometry framework to achieve accurate and robust state estimation in Simultaneous Localization and Mapping (SLAM) tasks and provide great potential in realtime, onboard robotic applications. FASTLIVO2 fuses the IMU, LiDAR and image measurements, efficiently through an errorstate iterated kalman filter (ESIKF). To address the dimension mismatch between the heterogeneous LiDAR and image measurements, we use a sequential update strategy in the Kalman filter. To enhance the efficiency, we use direct methods for both the visual and LiDAR fusion, where the LiDAR module registers raw points without extracting edge or plane features and the visual module minimizes direct photometric errors without extracting ORB or FAST corner features. The fusion of both visual and LiDAR measurements is based on a single unified voxel map where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points (i.e., visual map points) enabling new image alignment. To enhance the accuracy of image alignment, we use plane priors from the LiDAR points in the voxel map (and even refine the plane prior in the alignment process) and update the reference patch dynamically after new images are aligned. Furthermore, to enhance the robustness of image alignment, FASTLIVO2 employs an ondemanding raycast operation and estimates the image exposure time in real time. We conduct extensive experiments on both benchmark and private datasets, demonstrating that our proposed system significantly outperforms other stateoftheart odometry systems in terms of accuracy, robustness, and computation efficiency. Moreover, the effectiveness of key modules in the system is also validated. Lastly, we detail three applications of FASTLIVO2: UAV onboard navigation demonstrating the system’s computation efficiency for realtime onboard navigation, airborne mapping showcasing the system’s mapping accuracy, and 3D model rendering (meshbased and NeRFbased) underscoring the suitability of our reconstructed dense map for subsequent rendering tasks. We open source our code, dataset and application of this work on GitHub^{1}^{1}1https://github.com/hkumars/FASTLIVO2 to benefit the robotics community.
Index Terms:
Simultaneous Localization and Mapping (SLAM), Sensor Fusion, 3D Reconstruction, Aerial Navigation.I Introduction
In recent years, simultaneous localization and mapping (SLAM) technology has seen significant advancements, particularly in realtime 3D reconstruction and localization in unknown environments. Due to its ability to estimate poses and reconstruct maps in real time, SLAM has become indispensable for various robot navigation tasks. The localization process delivers crucial state feedback for the robot’s onboard controllers, while the dense 3D map provides key environmental information, such as free spaces and obstacles, essential for effective trajectory planning. A colored map also carries substantial semantic information, enabling a vivid representation of the real world that opens up vast potential applications, such as virtual and augmented reality, 3D modeling, and robothuman interactions.
Currently, several SLAM frameworks have been successfully implemented with singlemeasurement sensors, primarily cameras [1, 2, 3, 4] or LiDAR [5, 6, 7]. Although visual and LiDAR SLAM have shown promise in their own domains, each has inherent limitations that constrain their performance in various scenarios.
Visual SLAM, leveraging costeffective CMOS sensors and lenses, is capable of establishing accurate data associations, thereby achieving a certain level of localization accuracy. The abundance of color information further enriches the semantic perception. Further leveraging this enhanced scene comprehension, deep learning methods are employed for robust feature extraction and dynamic object filtering. However, the lack of direct depth measurement in visual SLAM necessitates concurrent optimization of map points via operations such as triangulation or depth filtering, which introduces significant computational overhead that often limits map accuracy and density. Visual SLAM also encounters numerous other limitations, such as varying measurement noise across different scales, sensitivity to illumination changes, and the impact of textureless environments on data association.
LiDAR SLAM, utilizing LiDAR sensors, obtains precise depth measurements directly, offering superior precision and efficiency in localization and mapping tasks compared with visual SLAM. Despite these strengths, LiDAR SLAM exhibits several significant shortcomings. On one hand, the point cloud maps it reconstructs, albeit detailed, lack color information, thereby reducing their information scale. On the other hand, LiDAR SLAM performance tends to deteriorate in environments presenting insufficient geometric constraints, such as narrow tunnels, a single and extended wall, etc.
As the demand to operate intelligent robots in the real world grows, especially in environments that often lack structure or texture, it is becoming clear that existing systems relying on a single sensor cannot provide the accurate and robust pose estimation as required. To address this issue, the fusion of commonlyused sensors such as LiDAR, camera, and IMU is gaining increasing attention. This strategy not only combines the strengths of these sensors to provide enhanced pose estimation, but also aids in the construction of accurate, dense, and colored point cloud maps, even in environments where the performance of individual sensors degenerate.
Efficient and accurate LiDARinertialvisual odometry (LIVO) and mapping are still challenging problems: 1) The entire LIVO system is tasked with processing LiDAR measurements, consisting of hundreds to thousands of points per second, as well as highrate, highresolution images. The challenge of fully utilizing such a vast amount of data, particularly with limited onboard resources, necessitates exceptional computational efficiency; 2) Many existing systems typically incorporate a LiDARInertial Odometry (LIO) subsystem and a VisualInertial Odometry (VIO) subsystem, each necessitating the extraction of features from visual and LiDAR data respectively to reduce computational load. In environments that lack structure or texture, this extraction process often results in limited feature points. Furthermore, to optimize feature extraction, extensive engineering adaptations are essential to accommodate the variability in LiDAR scanning patterns and point densities; 3) To reduce computational demands and achieve tighter integration between camera and LiDAR measurements, a unified map is essential to manage sparse points and the observed highresolution image measurements simultaneously. However, designing and maintaining such maps are particularly challenging considering the heterogeneous measurements of LiDAR and cameras; 4) To ensure the accuracy of the reconstructed colored point cloud, pose estimation needs to achieve pixellevel accuracy. Meeting this standard presents considerable challenges: proper hardware synchronization, rigorous precalibration of extrinsic parameters between LiDAR and cameras, precise recovery of exposure time, and a fusion strategy capable of reaching pixellevel accuracy in real time.
Motivated by these issues, we propose FASTLIVO2, a highefficiency LIVO system that tightly integrates LiDAR, image and IMU measurements through a sequentially updated errorstate iterated Kalman filter (ESIKF). With the prior from IMU propagation, the system state is updated sequentially, first by the LiDAR measurements and then by the image measurements, both utilizing direct methods based on a single unified voxel map. Specifically, in the LiDAR update, the system registers raw points to the map to construct and update its geometric structure, and in the visual update, the system reuses LiDAR map points as the visual map points directly without extracting, triangulating, or optimizing any visual features from images. The chosen visual map points in the map are attached with reference image patches previously observed and then projected to the current image to align its pose by minimizing the direct photometric errors (i.e., sparse image alignment). To improve the accuracy in the image alignment, FASTLIVO2 dynamically updates the reference patches and uses the plane priors obtained from LiDAR points. For improved computation efficiency, FASTLIVO2 uses LiDAR points to identify visual map points visible from the current image and conduct an ondemanding voxel raycast in case of no LiDAR points. FASTLIVO2 also estimates the exposure time in real time to handle illumination variation.
FASTLIVO2 is developed based on FASTLIVO first proposed in our previous work [8]. The new contributions compared to FASTLIVO are listed below:

1.
We propose an efficient ESIKF framework with sequential update to address the dimension mismatch between LiDAR and visual measurements, improving the robustness of FASTLIVO that uses asynchronous updates.

2.
We use (and even refine) plane priors from LiDAR points for improved accuracy. In contrast, FASTLIVO assumes all pixels in a patch share the same depth, a wild assumption significantly reducing the accuracy of affine warping in image alignment.

3.
We propose a reference patch update strategy to improve the accuracy of image alignment, by selecting highquality, inlier reference patches that have large parallax and sufficient texture details. FASTLIVO selects the reference patch based on proximity to the current view, often resulting in lowquality reference patches degrading the accuracy.

4.
We conduct online exposure time estimation for handling environment illumination variation. FASTLIVO did not address this issue, leading to poor convergence in image alignment under significant lighting changes.

5.
We propose ondemand voxel raycasting to enhance the system robustness in the absence of LiDAR point measurements caused by LiDAR close proximity blind zones, an issue not considered in FASTLIVO.
Each of the above contributions are evaluated in comprehensive ablation studies to verify their effectiveness. We implement the proposed system as practical open software, meticulously optimized for realtime operation on both Intel and ARM processors. The system is versatile, supporting multiline spinning LiDARs, emerging solidstate LiDARs with unconventional scanning patterns, as well as both pinhole cameras and various fisheye cameras.
Besides, we conduct extensive experiments on 25 sequences of public datasets (i.e., Hilti and NTUVIRAL datasets), alongside various representative private datasets, enabling a comparison with other stateoftheart SLAM systems (e.g., R3LIVE, LVISAM, FASTLIO2, etc). Both qualitative and quantitative results demonstrate that our proposed system significantly outpaces other counterparts in terms of accuracy and robustness at a reduced computation cost.
Taking a step further to underline the realworld applicability and versatility of our system, we deploy three distinctive applications. Firstly, fully onboard autonomous UAV navigation, demonstrating the system’s realtime capabilities, marks a pioneering instance of employing a LiDARinertialvisual system for realworld autonomous UAV flights. Secondly, airborne mapping showcases the system’s pixellevel precision under structureless environments in practical use. Lastly, the highquality generation of mesh, texturing, and NeRF models underscores the system’s suitability for rendering tasks. We make our code and dataset available on GitHub.
II Related Works
IIA Direct Methods
Direct methods stand out as a prominent approach for fast pose estimation in both visual and LiDAR SLAM. Unlike featurebased methods [9, 10, 5, 6] which necessitate the extraction of salient feature points (e.g., corners and edge pixels in images; plane and edge points in LiDAR scans) and the generation of robust descriptors for matching, direct methods directly leverage raw measurements to optimize the sensor pose [11] by minimizing an error function based on photometric error or pointtoplane residuals, e.g., [3, 12, 13, 14]. By eliminating the timeconsuming feature extraction and matching, direct methods offer fast pose estimation. Nonetheless, the absence of feature matching requires fairly accurate state prior estimation to avoid local minima.
Direct methods in visual SLAM can be broadly categorized into dense direct, semidense direct, and sparse direct methods. Dense direct methods, predominantly adopted for RGBD cameras with full depth measurements as exemplified by [15, 16, 17], apply imagetomodel alignment for pose estimation. In contrast, semidense direct methods [18, 3] implement direct image alignment by capitalizing on pixels with significant graylevel gradients for estimation. Sparse direct methods [12, 2] focus on delivering accurate state estimation through only a few wellselected raw patches, thus further diminishing the computational burden in comparison to both dense and semidense direct methods.
Unlike direct visual SLAM methods, direct LiDAR SLAM systems [13, 14, 19, 20] do not distinguish between dense and sparse approaches and commonly use spatiallydownsampled or temporallydownsampled raw points in each scan to construct constraints for pose optimization.
In our work, we harness the principles of the direct method for both LiDAR and visual modules. The LiDAR module of our system is adapted from VoxelMap [14], and the visual model is based on a variant of sparse direct method [12]. While drawing inspiration from sparse direct image alignment in [12], our visual module differs by reutilizing the LiDAR points as visual map points, thus mitigating the intensive backend computations (i.e., feature alignment, sliding window optimization and/or depth filtering).
IIB LiDARvisual(inertial) SLAM
The incorporation of multiple sensors in LiDARvisualinertial SLAM equips the system with the capability to handle a wide range of challenging environments, particularly when one sensor experiences failure or partial degeneration. Motivated by this, the research community has seen the emergence of various LiDARvisualinertial SLAM systems. Existing methods can generally be divided into two categories: loosely coupled and tightly coupled. The classification can be determined from two perspectives: the state estimation level and the raw measurement level. At the state estimation level, the key is whether the estimate from one sensor serves as an optimization objective in another sensor’s model. At the raw measurement level, it involves whether raw data from different sensors are combined.
Zhang et al. propose a LiDARVisualInertial SLAM system [21] that is loosely coupled at the state estimation level. In this system, VIO subsystem only provides the initial pose for the scan registration in LIO subsystem, instead of being optimized jointly with the scan registration. VILSLAM [22] employs a similar looselycoupled method, not utilizing joint optimization of LiDAR, camera, and IMU measurements.
Some systems (e.g., DEMO [23], LIMO [24], CamVox [25], [26]) use 3D LiDAR points to provide depth measurements for the visual module [1, 27, 4]. While these systems exhibit measurementlevel tight coupling, they remain looselycoupled in state estimation, primarily due to the absence of constraints directly derived from LiDAR measurements at state estimation. Another issue arises as 3D LiDAR points do not have a onetoone correspondence with 2D image feature points and/or lines due to mismatched resolutions. This mismatch requires interpolation in depth association, introducing potential errors. To address this, DVLSLAM [28] employs a direct method for visual tracking, wherein the LiDAR points are directly projected into the image to ascertain the depth of corresponding pixel positions.
The works mentioned above have not achieved tight coupling at the state estimation level. In pursuit of higher accuracy and robustness, many recent studies have emerged that jointly optimize sensor data in a tightlycoupled manner. To name a few, LICFusion [29] tightly fuses IMU measurements, sparse visual features, and LiDAR plane and edge features based on the MSCKF [30] framework. The subsequent LICFusion2.0 [31] enhances LiDAR pose estimation by implementing planefeature tracking within a sliding window. VILENS [32] offers a joint optimization of visual, LiDAR, and inertial data through a unified factor graph, relying on fixed lag smoothing. R2LIVE [33] tightly fuses the LiDAR, camera and IMU measurements in an onmanifold iterated Kalman filter [34]. For the VIO subsystem in R2LIVE, a sliding window optimization is used to triangulate the locations of visual features in the map.
Several systems achieve complete tight coupling at both the measurement and state estimation levels. LVISAM [35] fuses the LiDAR, visual and inertial sensors in a tightlycoupled smoothing and mapping framework, which is built atop a factor graph. The VIO subsystem performs visual feature tracking and extracts feature depth using LiDAR scans. R3LIVE [36] constructs the geometric structure of the global map by LIO and renders map texture by VIO. These two subsystems estimate the system state jointly by fusing their respective LiDAR or visual data with IMUs. The advanced version, R3LIVE++ [37], estimates exposure time in real time and conducts photometric calibration in advance[38], which enables the system to recover the radiance of map points. Unlike most previously mentioned LiDARinertialvisual systems that rely on featurebased methods for both LIO and VIO subsystems, R3LIVE series [36, 37] adopt direct methods for both without feature extraction, enabling them to capture subtle environmental features even in textureless or structureless scenarios.
Our system also jointly estimates the state using LiDAR, image and IMU data, and maintains a tightlycoupled voxel map at the measurement level. Furtheremore, our system uses direct methods, harnessing raw LiDAR points for LiDAR scan registration and employing raw image patches for visual tracking. The key difference between our system and R3LIVE (or R3LIVE++) is that R3LIVE (and R3LIVE++) operate at an individual pixel level in the VIO, while our system operates at image patch levels. This difference bestows our system with marked advantages. Firstly, in terms of robustness, our methodology uses a simplified, onestep frametomap sparse image alignment for pose estimation, mitigating the heavy reliance on an accurate initial state that is has to be obtained by a frametoframe optical flow in R3LIVEs. Consequently, our system simplifies and improves upon the twostage frametoframe and frametomap operations in R3LIVE. Secondly, from a computational standpoint, the VIO in R3LIVE predominantly employs a dense direct method which is computationally expensive, necessitating extensive points for residual construction and rendering. In contrast, our sparse direct method provides enhanced computational efficiency. Lastly, our system exploits information at the resolution of raw image patches, whereas R3LIVE is capped at the resolution of its point map.
The visual module of our system is most similar to DVLOAM [39], SDVLOAM [40] and LVIOFusion [41], which projects LiDAR points attached with patches into a new image and tracks the image by minimizing the direct photometric error. However, they have several key differences, such as the use of separate maps for vision and LiDAR, reliance on the assumption of constant depth in patch warping in the visual module, looselycoupling at the state estimation level, and two stages of frametoframe and frametokeyframes for image alignment. In contrast, our system tightly integrates frametomap image alignment, LiDAR scan registration, and IMU measurements in an iterated Kalman filter. Besides, thanks to the single unified map for both LiDAR and visual modules, our system can directly employ the plane priors provided by the LiDAR points to accelerate the image alignment.
Notations  Explanation 
$\u229e$ / $\boxminus $  The encapsulated “boxplus” and 
“boxminus” operations on the state manifold  
${}_{}{}^{G}(\cdot )$  A vector $(\cdot )$ in global world frame 
${}_{}{}^{C}(\cdot )$  A vector $(\cdot )$ in camera frame 
${}_{}{}^{I}\mathbf{T}_{L}^{}$  The extrinsic of LiDAR frame w.r.t. IMU frame 
${}_{}{}^{C}\mathbf{T}_{I}^{}$  The extrinsic of IMU frame w.r.t. camera frame 
${}_{}{}^{G}\mathbf{T}_{I}^{}$  The pose of IMU frame at time $k$ w.r.t. the global frame 
$\mathbf{x},\widehat{\mathbf{x}}$, $\overline{\mathbf{x}}$  The groundtruth, predicted and updated estimation of $\mathbf{x}$ 
${\widehat{\mathbf{x}}}^{\kappa}$  The $\kappa $th update of $\mathbf{x}$ 
$\bm{\delta}\mathbf{x}$  The error state between groundtruth $\mathbf{x}$ and its estimation 
III System Overview
The overview of our system is shown in Fig. 1, which contains four sections: ESIKF (Section IV), Local Mapping (Section V), LiDAR Measurement Model (Section VI), and Visual Measurement Model (Section VII).
The asynchronouslysampled LiDAR points are first recombined into scans at the camera’s sampling time through scan recombination. Then, we tightly couple the LiDAR, image and inertial measurements via an ESIKF with sequential state update, where the system state is updated sequentially, first by the LiDAR measurements and then by the image measurements, both utilizing direct methods based on a single unified voxel map (Section IV). To construct the LiDAR measurement model in the ESIKF update (Section VI), we compute the frametomap pointtoplane residual. To establish visual measurement model (Section VII), we extract the visual map points within the current FoV from the map, making use of visible voxel query and ondemand raycasting; after the extraction, we identify and discard outlier visual map points (e.g., points that are occluded or exhibit depth discontinuity); we then compute frametomap image photometric errors for visual update.
The local map for both visual and LiDAR updates is a voxelmap structure (Section V): the LiDAR points construct and update the map’s geometric structure, while the visual images append image patches to selected map points (i.e., visual map points) and update reference patches dynamically. The updated reference patches have their normal vectors further refined in a separate thread.
IV ErrorState Iterated Kalman Filter with Sequential State Update
This section outlines the system’s architecture, based on the sequentiallyupdated ErrorState Iterated Kalman Filter (ESIKF) framework.
IVA Notations and State Transition Model
In our system, we assume the time offsets among the three sensors (LiDAR, IMU and camera) are known, which can be calibrated or synchronized in advance. We take IMU frame (denoted as $I$) as the body frame and the first body frame as the global frame (denoted as $G$). Besides, we assume that the three sensors are rigidly attached and the extrinsic, defined in Table I, are precalibrated. Then, the discrete state transition model at the $i$th IMU measurement is:
$${\mathbf{x}}_{i+1}={\mathbf{x}}_{i}\u229e\left(\mathrm{\Delta}t\mathbf{f}({\mathbf{x}}_{i},{\mathbf{u}}_{i},{\mathbf{w}}_{i})\right)$$  (1) 
where $\mathrm{\Delta}t$ is the IMU sample period, the state $\mathbf{x}$, input $\mathbf{u}$, process noise $\mathbf{w}$, and function $\mathbf{f}$ are defined as follows:
$\mathcal{M}$  $\triangleq SO(3)\times {\mathbb{R}}^{16},\text{dim}(\mathcal{M})=19$  (2)  
$\mathbf{x}$  $\triangleq {\left[\begin{array}{ccccccc}{}_{}{}^{G}\mathbf{R}_{I}^{T}& {}_{}{}^{G}\mathbf{p}_{I}^{T}& {}_{}{}^{G}\mathbf{v}_{I}^{T}& {\mathbf{b}}_{g}^{T}& {\mathbf{b}}_{a}^{T}& {}_{}{}^{G}\mathbf{g}_{}^{T}& \tau \end{array}\right]}^{T}\in \mathcal{M}$  
$\mathbf{u}$  $\triangleq {\left[\begin{array}{cc}{\bm{\omega}}_{m}^{T}& {\mathbf{a}}_{m}^{T}\end{array}\right]}^{T},\mathbf{w}\triangleq {\left[\begin{array}{ccccc}{\mathbf{n}}_{g}^{T}& {\mathbf{n}}_{a}^{T}& {\mathbf{n}}_{{\mathbf{b}}_{g}}^{T}& {\mathbf{n}}_{{\mathbf{b}}_{a}}^{T}& {n}_{\tau}\end{array}\right]}^{T}$  
$\mathbf{f}$  $(\mathbf{x},\mathbf{u},\mathbf{w})=\left[\begin{array}{c}{\bm{\omega}}_{m}{\mathbf{b}}_{g}{\mathbf{n}}_{g}\\ {}_{}{}^{G}\mathbf{v}_{I}^{}+\frac{1}{2}{(}^{G}{\mathbf{R}}_{I}({\mathbf{a}}_{m}{\mathbf{b}}_{a}{\mathbf{n}}_{a})+{}_{}{}^{G}\mathbf{g})\mathrm{\Delta}t\\ {}_{}{}^{G}\mathbf{R}_{I}^{}\left({\mathbf{a}}_{m}{\mathbf{b}}_{a}{\mathbf{n}}_{a}\right)+{}_{}{}^{G}\mathbf{g}\\ {\mathbf{n}}_{{\mathbf{b}}_{g}}\\ {\mathbf{n}}_{{\mathbf{b}}_{a}}\\ {\mathrm{\U0001d7ce}}_{3\times 1}\\ {n}_{\tau}\end{array}\right]$ 
where ${}_{}{}^{G}\mathbf{R}_{I}^{}$, ${}_{}{}^{G}\mathbf{p}_{I}^{}$, and ${}_{}{}^{G}\mathbf{v}_{I}^{}$ respectively denote the IMU attitude, position, and velocity in the global frame, ${}_{}{}^{G}\mathbf{g}$ is the gravity vector in the global frame, $\tau $ is the inverse camera exposure time relative to the first frame, ${n}_{\tau}$ is the Gaussian noise that models $\tau $ as a random walk, ${\bm{\omega}}_{m}$ and ${\mathbf{a}}_{m}$ are the raw IMU measurements, ${\mathbf{n}}_{g}$ and ${\mathbf{n}}_{a}$ are measurement noises in ${\bm{\omega}}_{m}$ and ${\mathbf{a}}_{m}$, ${\mathbf{b}}_{a}$ and ${\mathbf{b}}_{g}$ are IMU bias, which are modeled as random walk driven by Gaussian noise ${\mathbf{n}}_{{\mathbf{b}}_{g}}$ and ${\mathbf{n}}_{{\mathbf{b}}_{a}}$, respectively.
IVB Scan Recombination
We employ the scan recombination to segment the highfrequency, sequentiallysampled LiDAR raw points into distinct LiDAR scans at the camera sampling moments, as depicted in Fig. 2. This ensures both camera and LiDAR data are synchronized at the same frequency (e.g., 10 Hz), allowing for update of state at the same time.
IVC Propagation
In the ESIKF framework, the state and covariance are propagated from time ${t}_{k1}$, when the last LiDAR scan and image frame are received, to time ${t}_{k}$, when the current LiDAR scan and image frame are received. This forward propagation predicts the state at each IMU input ${\mathbf{u}}_{i}$ during ${t}_{k1}$ and ${t}_{k}$, by setting the process noise ${\mathbf{w}}_{i}$ in (1) to zero. Denote the propagated state as $\widehat{\mathbf{x}}$ and covariance as $\widehat{\mathbf{P}}$, which will serve as a prior distribution for the subsequent update in Section IVD. Moreover, to compensate for motion distortion, we conduct a backward propagation as in [42], ensuring points in a LiDAR scan are “measured” at the scanend time ${t}_{k}$. Note that for notation simplification, we omit the subscript $k$ in all state vectors.
IVD Sequential Update
The IMU propagated state $\widehat{\mathbf{x}}$ and covariance $\widehat{\mathbf{P}}$ impose a prior distribution for $\mathbf{x}$, the system state at time ${t}_{k}$, as follows:
$$\mathbf{x}\boxminus \widehat{\mathbf{x}}\sim \mathcal{N}(\mathrm{\U0001d7ce},\widehat{\mathbf{P}})$$  (3) 
We denote the above prior distribution as $p(\mathbf{x})$ and the measurement models for the LiDAR and camera as:
$$\begin{array}{c}\hfill \left[\begin{array}{c}{\mathbf{y}}_{l}\\ {\mathbf{y}}_{c}\end{array}\right]=\left[\begin{array}{c}{\mathbf{h}}_{l}(\mathbf{x},{\mathbf{v}}_{l})\\ {\mathbf{h}}_{c}(\mathbf{x},{\mathbf{v}}_{c})\end{array}\right]\end{array}$$  (4) 
where ${\mathbf{v}}_{l}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\mathbf{\Sigma}}_{{\mathbf{v}}_{l}})$ and ${\mathbf{v}}_{c}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\mathbf{\Sigma}}_{{\mathbf{v}}_{c}})$ respectively denote the measurement noises for the LiDAR and camera.
A standard ESIKF [43] would update the state $\mathbf{x}$ using all the current measurements, including both LiDAR measurement ${\mathbf{y}}_{l}$ and image measurements ${\mathbf{y}}_{c}$. However, LiDAR and image measurements are two different sensing modalities, whose data dimensions do not match. Furthermore, the fusion of image measurement may be performed at various levels of the image pyramid. To address the dimension mismatch and give more flexibility for each module, we propose a sequential update strategy. This strategy is theoretically equivalent to the standard update using all measurements, assuming statistical independence of LiDAR measurements ${\mathbf{y}}_{l}$ and image measurements ${\mathbf{y}}_{c}$ given the state vector $\mathbf{x}$ (i.e., measurements corrupted by statistically independent noise).
To introduce the sequential update, we rewrite the total conditional distribution for the current state $\mathbf{x}$ as:
$\begin{array}{cc}\hfill p(\mathbf{x}\mid {\mathbf{y}}_{l},{\mathbf{y}}_{c})& \propto p(\mathbf{x},{\mathbf{y}}_{l},{\mathbf{y}}_{c})=p({\mathbf{y}}_{c}\mid \mathbf{x},{\mathbf{y}}_{l})p(\mathbf{x},{\mathbf{y}}_{l})\hfill \\ & =p({\mathbf{y}}_{c}\mid \mathbf{x})\underset{\propto p(\mathbf{x}\mid {\mathbf{y}}_{l})}{\underset{\u23df}{p({\mathbf{y}}_{l}\mid \mathbf{x})p(\mathbf{x})}}\hfill \end{array}$  (5) 
Equation (5) implies that the total conditional distribution $p(\mathbf{x}\mid {\mathbf{y}}_{l},{\mathbf{y}}_{c})$ can be obtained by two sequential Bayesian updates. The first step fuses only the LiDAR measurement ${\mathbf{y}}_{l}$ with the IMUpropagated prior distribution $p(\mathbf{x})$ to obtain the distribution $p(\mathbf{x}\mid {\mathbf{y}}_{l})$:
$$p(\mathbf{x}\mid {\mathbf{y}}_{l})\propto p({\mathbf{y}}_{l}\mid \mathbf{x})p(\mathbf{x})$$  (6) 
The second step then fuses the camera measurement ${\mathbf{y}}_{c}$ with $p(\mathbf{x}\mid {\mathbf{y}}_{l})$ to obtain the final posterior distribution of $\mathbf{x}$:
$$p(\mathbf{x}\mid {\mathbf{y}}_{l},{\mathbf{y}}_{c})\propto p({\mathbf{y}}_{c}\mid \mathbf{x})p(\mathbf{x}\mid {\mathbf{y}}_{l})$$  (7) 
Interestingly, the two fusion in (6) and (7) follow the same form:
$$q(\mathbf{x}\mid \mathbf{y})\propto q(\mathbf{y}\mid \mathbf{x})q(\mathbf{x})$$  (8) 
To conduct the fusion in (8) for either LiDAR or image measurements, we detail the prior distribution $q(\mathbf{x})$ and measurement model $q(\mathbf{y}\mid \mathbf{x})$ as follows. For the prior distribution $q(\mathbf{x})$, denote it as $\mathbf{x}=\widehat{\mathbf{x}}\u229e\bm{\delta}\mathbf{x}$ with $\bm{\delta}\mathbf{x}\sim \mathcal{N}(\mathrm{\U0001d7ce},\widehat{\mathbf{P}})$. In case of the LiDAR update (i.e., the first step), $(\widehat{\mathbf{x}},\widehat{\mathbf{P}})$ is the state and covariance obtained from the propagation step. In case of the visual update (i.e., the second step), $(\widehat{\mathbf{x}},\widehat{\mathbf{P}})$ is the converged state and covariance obtained from the LiDAR update.
To obtain the measurement model distribution $q(\mathbf{y}\mid \mathbf{x})$, denote state estimated at the $\kappa $th iteration as ${\widehat{\mathbf{x}}}^{\kappa}$, where ${\widehat{\mathbf{x}}}^{0}=\widehat{\mathbf{x}}$. Approximating the measurement model (4) (either the LiDAR or camera measurement) through its firstorder Taylor expansion made at ${\widehat{\mathbf{x}}}^{\kappa}$ leads to:
$\mathbf{y}\mid \mathbf{x}\simeq \underset{{\mathbf{z}}^{\kappa}}{\underset{\u23df}{\mathbf{h}({\widehat{\mathbf{x}}}^{\kappa},\mathrm{\U0001d7ce})}}+{\mathbf{H}}^{\kappa}\bm{\delta}{\mathbf{x}}^{\kappa}+{\mathbf{L}}^{\kappa}\mathbf{v}$  (9)  
$q(\mathbf{y}\mid \mathbf{x})\simeq \mathcal{N}(\mathbf{h}({\widehat{\mathbf{x}}}^{\kappa},\mathrm{\U0001d7ce})+{\mathbf{H}}^{\kappa}\bm{\delta}{\mathbf{x}}^{\kappa},\mathbf{R})$  (10) 
where $\bm{\delta}{\mathbf{x}}^{\kappa}=\mathbf{x}\boxminus {\widehat{\mathbf{x}}}^{\kappa}$, ${\mathbf{z}}^{\kappa}$ is the residual, ${\mathbf{L}}^{\kappa}\mathbf{v}\sim \mathcal{N}(\mathrm{\U0001d7ce},\mathbf{R})$ is the lumped measurement noise, ${\mathbf{H}}^{\kappa}$ and ${\mathbf{L}}^{\kappa}$ are the Jacobian matrixes of $\mathbf{h}({\widehat{\mathbf{x}}}^{\kappa}\u229e\bm{\delta}{\mathbf{x}}^{\kappa},\mathbf{v})$ with respect to $\bm{\delta}{\mathbf{x}}^{\kappa}$ and $\mathbf{v}$, evaluated at zero, respectively.
Then, substituting the prior distribution $q(\mathbf{x})$ and the measurement distribution $q(\mathbf{y}\mid \mathbf{x})$ in (10) into the posterior distribution (8) and performing maximum likelihood estimation (MLE), we can obtain the maximum aposterior estimation (MAP) of $\bm{\delta}{\mathbf{x}}^{\kappa}$ (and hence ${\mathbf{x}}^{\kappa}$) from the standard update step in the ESIKF framework [43]:
$$\begin{array}{cc}\hfill \mathbf{K}& ={\left({({\mathbf{H}}^{\kappa})}^{T}{\mathbf{R}}^{1}{\mathbf{H}}^{\kappa}+{\widehat{\mathbf{P}}}^{1}\right)}^{1}{({\mathbf{H}}^{\kappa})}^{T}{\mathbf{R}}^{1},\hfill \\ \hfill {\widehat{\mathbf{x}}}^{\kappa +1}& ={\widehat{\mathbf{x}}}^{\kappa}\u229e\left({\mathrm{\mathbf{K}\mathbf{z}}}^{\kappa}(\mathbf{I}{\mathrm{\mathbf{K}\mathbf{H}}}^{\kappa})\left({\widehat{\mathbf{x}}}^{\kappa}\boxminus \widehat{\mathbf{x}}\right)\right)\hfill \end{array}$$  (11) 
The converged state and covariance matrix then makes the mean and covariance of the posterior distribution $q(\mathbf{x}\mathbf{y})$.
Kalman filter with sequential update has been investigated in the literature, such as in [44, 45]. This paper adopts this approach for ESIKF for LiDAR and camera systems. The implementation of the ESIKF with sequential update is detailed in Algorithm 1. In the first step (Line 610), the error state is updated from the LiDAR measurements (Section VIA) iteratively until convergence. The converged state and covariance estimates, denoted again as $\widehat{\mathbf{x}}$ and $\widehat{\mathbf{P}}$, are used to update the geometry of the map (Section VB), and subsequently refined in the second step visual update (Line 13  23) on each level of the image pyramid (Section VIIB) until convergence. The optimal state and covariance, denoted as $\overline{\mathbf{x}}$ and $\overline{\mathbf{P}}$, are employed for propagating incoming IMU measurements (Section IVC) and update the visual structures of the map (Section VD and VE).
V Local Mapping
VA Map Structure
Our map employs an adaptive voxel structure presented in [14], which is organized by a Hash table and an octree for each Hash entry (Fig. 1). The hash table manages root voxels, each with a fixed dimension of $0.5\times 0.5\times 0.5$ meters. Each root voxel encapsulates an octree structure to further organize leaf voxels of varying sizes. A leaf voxel represents a local plane and stores a plane feature (i.e., plane center, normal vector, and uncertainty) along with a set of LiDAR raw points situated on this plane. Some of these points are attached with threelevel image patches ($8\times 8$ patch size), which we refer to as visual map points. Converged visual map points are only attached with reference patches, while nonconverged ones are attached with reference patches and other visible patches (see Section VE). The varying size of the leaf voxel allows it to represent local planes of different scales, thus being adaptable to environments with different structure [14].
To prevent the size of the map from going unbound, we keep only a local map within a large local region of length $L$ around the LiDAR’s current position, as illustrated in a 2D example in Fig. 3. Initially, the map is a cube centered at the LiDAR’s starting position ${\mathbf{p}}_{0}$. The LiDAR’s detection area is visualized as a sphere centered at its current position, with its radius defined by the LiDAR’s detection range. When the LiDAR moves to a new position ${\mathbf{p}}_{1}$ where the detection area touches the boundaries of the map, we move the map away from the boundaries by a distance $d$. As the map moves, the memory containing the area moved out of the local map will be reset to store new areas moved in the local map. This ringbuffer approach ensures that our local map is maintained within a fixed size of memory. The implementation of the ringbuffer Hash map is detailed in [46]. The map move check is performed after each ESIKF update step.
VB Geometry Construction and Update
The geometry of the map is constructed and updated from LiDAR point measurements. Specifically, after the LiDAR update in ESIKF (Section IV), we register all points from the LiDAR scan to the global frame. For each registered LiDAR point, we determine its located root voxel in the Hash map. If the voxel does not exist, we initialize the voxel with the new point and index it into the Hash map. If the determined voxel already exists in the map, we append the point to the existing voxel. After all points in a scan are distributed, we conduct geometry construction and update as follows.
For newlycreated voxels, we determine if all the contained points lie on a plane based on the singular value decomposition. If so, we calculate the center point $\mathbf{q}=\overline{\mathbf{p}}$, plane normal $\mathbf{n}$, and the covariance matrix of $(\mathbf{q},\mathbf{n})$, denoted as ${\mathbf{\Sigma}}_{\mathbf{n},\mathbf{q}}$, of the plane. ${\mathbf{\Sigma}}_{\mathbf{n},\mathbf{q}}$ is used to characterize the plane uncertainty, which arises from both pose estimation uncertainty and the point measurement noise. The detailed plane criteria and calculation of the plane parameters and uncertainties can be referred to our previous work [14]. If the contained points do not lie on a plane, the voxel is continuously subdivided into eight smaller octants until either the points in the subvoxel are determined to form a plane or the maximum layer (eg., 3) is reached. In the latter case, the points in the leaf voxel will be discarded. As a result, the map only contains voxels (either root or sub) identified as planes.
For existing voxels that have new points appended, we assess if the new points still form a plane with the existing points in the root voxel or subvoxel. If not, we conduct voxel subdivision as above. If yes, we update the plane’s parameters ($\mathbf{q}$,$\mathbf{n}$) and covariance ${\mathbf{\Sigma}}_{\mathbf{n},\mathbf{q}}$ as above too. Once the plane parameters converge (see [14]), the plane will be considered as mature and new points on this plane will be discarded. Moreover, mature planes will have their estimated plane parameters ($\mathbf{q}$,$\mathbf{n}$) and covariance ${\mathbf{\Sigma}}_{\mathbf{n},\mathbf{q}}$ fixed.
LiDAR points on planes (either in the root voxel or subvoxel) will be used for generating visual map points in the subsequent section. For mature planes, 50 most recent LiDAR points are candidates for visual map point generation, while for unmature planes, all LiDAR points are the candidates. The visual map point generation process will identify some of these candidate points as visual map points and attach them with image patches for image alignment.
VC Visual Map Point Generation and Update
To generate and update visual map points, we select the candidate LiDAR points in the map that (1) are visible from the current frame (detailed in Section VIIA), and (2) exhibit significant graylevel gradients in the current image. We project these candidate points, after the visual update (Section IVD), onto the current image and retain the candidate point with the smallest depth for the local plane in each voxel. Then, we divide the current image into uniform grid cells each with $30\times 30$ pixels. If a grid cell does not contain any visual map point projected here, we generate a new visual map point using the candidate point with the highest graylevel gradient and associate it with the current image patch, estimated current state (i.e., frame pose and exposure time), and the plane normal calculated from the LiDAR points as in the previous section. The patch attached to the visual map points has three layers of the same size (e.g., $11\times 11$ pixels), each layer is half sampled from the previous layer, forming a patch pyramid. If a grid cell contains visual map points projected here, we add new patch (all three layers of pyramid) to the existing visual map point if (1) more than 20 frames have passed since its last patch addition, or (2) its pixel position in the current frame deviates by more than 40 pixels from its position at the last patch addition. As a result, the map points will likely have effective patches with uniformly distributed viewing angles. Along with the patch pyramid, we also attach the estimated current state (i.e.,pose and exposure time to the map point.
VD Reference Patch Update
A visual map point could have more than one patch due to the addition of new patches. We need to choose one reference patch for image alignment in the visual update. In detail, we score each patch $\mathbf{f}$ based on photometric similarity and viewing angle as follows:
$\text{NCC}(\mathbf{f},\mathbf{g})$  $={\displaystyle \frac{{\sum}_{x,y}[\mathbf{f}(x,y)\overline{\mathbf{f}}][\mathbf{g}(x,y)\overline{\mathbf{g}}]}{\sqrt{{\sum}_{x,y}{[\mathbf{f}(x,y)\overline{\mathbf{f}}]}^{2}{\sum}_{x,y}{[\mathbf{g}(x,y)\overline{\mathbf{g}}]}^{2}}}}$  
$c$  $={\displaystyle \frac{\mathbf{n}\cdot \mathbf{p}}{\Vert \mathbf{p}\Vert}},{\omega}_{1}={\displaystyle \frac{1}{1+{e}^{\text{tr}({\mathbf{\Sigma}}_{\mathbf{n}})}}}$  (12)  
$S$  $=(1{\omega}_{1})\cdot {\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}\text{NCC}(\mathbf{f},{\mathbf{g}}_{i})+{\omega}_{1}\cdot c$ 
where $\text{NCC}(\mathbf{f},\mathbf{g})$ represents the Normalized CrossCorrelation (NCC) used to measure the similarity between patch $\mathbf{f}$ and $\mathbf{g}$ at the $0$th pyramid level (the level with the highest resolution) of both patch, with mean subtraction applied to both patches, $c$ denotes the cosine similarity between the normal vector $\mathbf{n}$ and view direction $\mathbf{p}/\Vert \mathbf{p}\Vert $ of patch $\mathbf{f}$ under evaluation. When the patch is directly facing the plane where the map point is located, the value of $c$ is 1. The overall score $S$ is calculated by summing the weighted NCC and $c$, where the former represents the average similarity between the patch $\mathbf{f}$ under evaluation and all other patches ${\mathbf{g}}_{i}$ and $\text{tr}({\mathbf{\Sigma}}_{\mathbf{n}})$ represents the trace of the covariance matrix of the normal vector.
Among all the patches attached to a visual map point, the one with the highest score is updated as the reference patch. The above scoring mechanism tends to choose reference patches whose (1) appearance is similar (in terms of NCC) to most of the rest of the patches, a technique used by MVS [47] to avoid patches on dynamic objects; (2) view direction is orthogonal to the plane, thereby maintaining texture details at a high resolution. In contrast, the reference patches update strategy in our previous work FASTLIVO [8] and prior arts [4] directly select the patch with the smallest view direction difference from the current frame, causing the selected reference patch to be very close to the current frame, hence imposing weak constraints on the current pose update.
VE Normal Refine
Each visual map point is assumed to lie on a small local plane. Existing works [4, 2, 8] assumed that all pixels in a patch have the same depth, a wild assumption that does not hold in general. We use plane parameters computed from the LiDAR points as detailed in Section VB to achieve greater accuracy. This plane normal is crucial for performing affine warping for image alignment in the visual update process. To further enhance the accuracy of the affine warping, the plane normal could be further refined from the patches attached to the visual map point. Specifically, we refine the plane normal in the reference patch by minimizing the photometric error with respect to the other patches attached to the visual map point.
VE1 Affine warping
Affine warping is used to transform patch pixels from the reference frame (i.e., the source patch) to patch pixels in the rest of the frames (i.e., the target patch), illustrated in Fig. 4(a). Let ${\mathbf{u}}_{r}^{j}$ be the $j$th pixel coordinates in the source patch and ${\mathbf{u}}_{i}^{j}$ be the $j$th pixel coordinates in the $i$th target patch. Assuming all pixels in the patch lie in a local plane with normal ${}_{}{}^{{I}_{r}}\mathbf{n}$ and visual map point position ${}_{}{}^{{I}_{r}}\mathbf{p}$ (which corresponds to the center pixel for both source and target patches), both represented in the source patch frame, we have:
${\mathbf{u}}_{i}^{j}$  $={\mathbf{A}}_{r}^{i}{\mathbf{u}}_{r}^{j}$  (13)  
${\mathbf{A}}_{r}^{i}$  $=\mathbf{P}({}_{}{}^{{I}_{i}}\mathbf{R}_{{I}_{r}}^{}+{}_{}{}^{{I}_{i}}\mathbf{t}_{{I}_{r}}^{}{\displaystyle \frac{1}{{}_{}{}^{{I}_{r}}\mathbf{n}_{}^{T}\cdot {}_{}{}^{{I}_{r}}\mathbf{p}}}{}_{}{}^{{I}_{r}}\mathbf{n}_{}^{T}){\mathbf{P}}^{1}$ 
where ${\mathbf{A}}_{r}^{i}$ represents the affine warping matrix that transforms the pixel coordinates from the source (or reference) patch to the $i$th target patch, ${}_{}{}^{{I}_{i}}\mathbf{R}_{{I}_{r}}^{}$ and ${}_{}{}^{{I}_{i}}\mathbf{t}_{{I}_{r}}^{}$ denote the relative pose of the reference frame ${I}_{r}$ w.r.t. the target frame ${I}_{i}$. To use fisheye images directly without rectifying them to pinhole images, we implement projection matrix $\mathbf{P}$ and back projection matrix ${\mathbf{P}}^{1}$ based on different camera models (e.g., $\mathbf{P}$ is the camera intrinsic matrix for the pinhole camera model).
VE2 Normal Optimization
To refine the plane normal ${}_{}{}^{{I}_{r}}\mathbf{n}$, we minimize photometric errors between the reference patch and other image patches at the $0$th pyramid level (i.e., the highestresolution level):
${}^{{I}_{r}}{\mathbf{n}}^{\ast}$  $=\mathrm{arg}\underset{{}_{}{}^{{I}_{r}}\mathbf{n}\in {\mathbb{S}}^{2}}{\mathrm{min}}{\displaystyle \sum _{i\in S}}{\displaystyle \sum _{j=1}^{{N}^{2}}}{\Vert {\tau}_{i}{\mathbf{I}}_{i}({\mathbf{A}}_{r}^{i}{\mathbf{u}}_{r}^{j}){\tau}_{r}{\mathbf{I}}_{r}({\mathbf{u}}_{r}^{j})\Vert}_{2}$  (14) 
where $N$ is the path size, ${\tau}_{r}$ and ${\tau}_{i}$ are the inverse exposure times of the reference frame and the $i$th target frame, respectively. ${\mathbf{I}}_{r}({\mathbf{u}}_{r}^{j})$ denote $j$th patch pixel in the reference frame, ${\mathbf{I}}_{i}({\mathbf{A}}_{r}^{i}{\mathbf{u}}_{r}^{j})$ denotes the $j$th path pixel in the $i$th target frame, and $S$ is the set of all target frames.
VE3 Optimization Variable Transformation
To enhance the computational efficiency, we reparameterize the least squares problem in (14). Note that the optimization variable ${}_{}{}^{{I}_{r}}\mathbf{n}$ only appeared in $\mathbf{M}\triangleq {\displaystyle \frac{1}{{}_{}{}^{{I}_{r}}\mathbf{n}_{}^{T}\cdot {}_{}{}^{{I}_{r}}\mathbf{p}}}{}_{}{}^{{I}_{r}}\mathbf{n}\in {\mathbb{R}}^{3}$ in (13), the optimization over ${}_{}{}^{{I}_{r}}\mathbf{n}$ can be conducted over $\mathbf{M}$. Moreover, the vector $\mathbf{M}$ is subject to constraint ${}_{}{}^{{I}_{r}}\mathbf{p}\cdot \mathbf{M}=1$, meaning that $\mathbf{M}$ can be parameterized as follows:
$\mathbf{M}$  $=\left[\begin{array}{c}{\mathbf{M}}_{x}\\ {\mathbf{M}}_{y}\\ \frac{1}{{}_{}{}^{{I}_{r}}\mathbf{p}_{z}^{}}\frac{{}_{}{}^{{I}_{r}}\mathbf{p}_{x}^{}}{{}_{}{}^{{I}_{r}}\mathbf{p}_{z}^{}}{\mathbf{M}}_{x}\frac{{}_{}{}^{{I}_{r}}\mathbf{p}_{y}^{}}{{}_{}{}^{{I}_{r}}\mathbf{p}_{z}^{}}{\mathbf{M}}_{y}\end{array}\right]=\mathrm{\mathbf{B}\mathbf{m}}+\mathbf{b}$  (15)  
$\mathbf{B}$  $=\left[\begin{array}{cc}1& 0\\ 0& 1\\ \frac{{}_{}{}^{{I}_{r}}\mathbf{p}_{x}^{}}{{}_{}{}^{{I}_{r}}\mathbf{p}_{z}^{}}& \frac{{}_{}{}^{{I}_{r}}\mathbf{p}_{y}^{}}{{}_{}{}^{{I}_{r}}\mathbf{p}_{z}^{}}\end{array}\right],\mathbf{b}=\left[\begin{array}{c}0\\ 0\\ \frac{1}{{}_{}{}^{{I}_{r}}\mathbf{p}_{z}^{}}\end{array}\right],\mathbf{m}=\left[\begin{array}{c}{\mathbf{M}}_{x}\\ {\mathbf{M}}_{y}\end{array}\right]\in {\mathbb{R}}^{2}$ 
where ${}_{}{}^{{I}_{r}}\mathbf{p}_{z}^{}\ne 0$ since no such reference patch could be chosen for the visual map point. The relation among ${}_{}{}^{{I}_{r}}\mathbf{n}$, $\mathbf{M}$, and $\mathbf{m}$ are shown in Fig. 4 (b).
Finally, the optimization in (14) is conducted over the vector $\mathbf{m}\in {\mathbb{R}}^{2}$ without any constraints. This optimization can be performed in a separate thread to avoid blocking the main odometry thread. The optimized parameter ${\mathbf{m}}^{\ast}$ can then be used to recover the optimal normal vector ${}_{}{}^{{I}_{r}}\mathbf{n}_{}^{\ast}$:
$${}_{}{}^{{I}_{r}}\mathbf{n}_{}^{\ast}=\frac{{\mathbf{M}}^{\ast}}{\Vert {\mathbf{M}}^{\ast}\Vert},{\mathbf{M}}^{\ast}={\mathrm{\mathbf{B}\mathbf{m}}}^{\ast}+\mathbf{b}$$  (16) 
Once the plane normal converges, the reference patch and normal vector for this visual map point are fixed without further refinement, and all other patches are deleted.
VI LiDAR Measurement Model
This section details the LiDAR measurement model ${\mathbf{y}}_{l}={\mathbf{h}}_{l}(\mathbf{x},{\mathbf{v}}_{l})$ used in the LiDAR update of ESIKF in Section IVD.
VIA Pointtoplane LiDAR Measurement Model
After obtaining the undistorted points $\{{}_{}{}^{L}\mathbf{p}_{j}^{}\}$ in a scan, we project them to the global frame using the estimated state ${\widehat{\mathbf{x}}}^{\kappa}$ at the $\kappa $th iteration of LiDAR update:
$${}^{G}{\widehat{\mathbf{p}}}_{j}^{\kappa}={}_{}{}^{G}\widehat{\mathbf{T}}_{I}^{\kappa}{}_{}{}^{I}\mathbf{T}_{L}^{}{}_{}{}^{L}\mathbf{p}_{j}^{}$$  (17) 
We then identify the root or sub voxel where ${}_{}{}^{G}\widehat{\mathbf{p}}_{j}^{\kappa}$ lies in the Hash map. If no voxel is found or the voxel does not contain a plane, the point is discarded. Otherwise, we use the plane in the voxel to establish a measurement equation for the LiDAR point. Specifically, we assume the true LiDAR point ${}_{}{}^{L}\mathbf{p}_{j}^{gt}$, given the accurate LiDAR pose ${}_{}{}^{G}\mathbf{T}_{I}^{}$, should lie on the plane with normal ${\mathbf{n}}_{j}^{gt}$ and center point ${\mathbf{q}}_{j}^{gt}$ in the voxel. i.e.,
$\mathrm{\U0001d7ce}$  $={({\mathbf{n}}_{j}^{gt})}^{T}({}_{}{}^{G}\mathbf{T}_{I}^{}{}_{}{}^{I}\mathbf{T}_{L}^{L}{\mathbf{p}}_{j}^{gt}{\mathbf{q}}_{j}^{gt})$  (18) 
Since the groundtrue point ${}_{}{}^{L}\mathbf{p}_{j}^{gt}$ is measured as ${}_{}{}^{L}\mathbf{p}_{j}^{}$ with ranging and bearing noises $\bm{\delta}{}_{}{}^{L}\mathbf{p}_{j}^{}$, we have ${}_{}{}^{L}\mathbf{p}_{j}^{gt}={}_{}{}^{L}\mathbf{p}_{j}^{}\bm{\delta}{}_{}{}^{L}\mathbf{p}_{j}^{}$. Likewise, the plane parameters $({\mathbf{n}}_{j}^{gt},{\mathbf{q}}_{j}^{gt})$ are estimated as $({\mathbf{n}}_{j},{\mathbf{q}}_{j})$ with covariance ${\mathbf{\Sigma}}_{\mathbf{n},\mathbf{q}}$ (Section VB), so we have: ${\mathbf{n}}_{j}^{gt}={\mathbf{n}}_{j}\boxminus \bm{\delta}{\mathbf{n}}_{j},{\mathbf{q}}_{j}^{gt}={\mathbf{q}}_{j}\bm{\delta}{\mathbf{q}}_{j}$. Therefore,
$\underset{{\mathbf{y}}_{l}}{\underset{\u23df}{\mathrm{\U0001d7ce}}}$  $=\underset{{\mathbf{h}}_{l}(\mathbf{x},{\mathbf{v}}_{l})}{\underset{\u23df}{{({\mathbf{n}}_{j}\boxminus \bm{\delta}{\mathbf{n}}_{j})}^{T}({}_{}{}^{G}\mathbf{T}_{I}^{}{}_{}{}^{I}\mathbf{T}_{L}^{}({}_{}{}^{L}\mathbf{p}_{j}^{}\bm{\delta}{}_{}{}^{L}\mathbf{p}_{j}^{})({\mathbf{q}}_{j}\bm{\delta}{\mathbf{q}}_{j}))}}$  (19) 
where the measurement noise ${\mathbf{v}}_{l}=(\bm{\delta}{}_{}{}^{L}\mathbf{p}_{j}^{},\bm{\delta}{\mathbf{n}}_{j},\bm{\delta}{\mathbf{q}}_{j})$ consisting of the noise associated with the LiDAR point, the normal vector, and the plane center respectively.
VIB LiDAR Measurement Noise with Beam Divergence
The uncertainty of a LiDAR point $\bm{\delta}{}_{}{}^{L}\mathbf{p}_{j}^{}$ in the local LiDAR frame is decomposed into two components in [14], the ranging uncertainty $\delta d$ caused by laser time of flight (TOF), and the bearing direction uncertainty $\bm{\delta}\bm{\omega}$ originated from encoders. Besides these uncertainties, we also consider uncertainties caused by the laser beam divergence angle $\theta $, as illustrated in Fig. 5. As the angle $\phi $ between the bearing direction and normal vector increases, the ranging uncertainty of the LiDAR point increases significantly, while the bearing direction uncertainty remains unaffected. The $\delta d$ due to the laser beam divergence angle can be modeled as:
$$\delta d={L}_{2}{L}_{1}=d\left(\frac{\mathrm{cos}\phi}{\mathrm{cos}(\theta +\phi )}\frac{\mathrm{cos}\phi}{\mathrm{cos}(\theta \phi )}\right)$$  (20) 
Considering $\delta d$ influenced by TOF and laser beam divergence, when our system selects more points from the ground or walls (see Fig. 5 (c, d)), it achieves a more precise pose estimation than that not considering such effect.
VII Visual Measurement Model
This section details the visual measurement model ${\mathbf{y}}_{c}={\mathbf{h}}_{c}(\mathbf{x},{\mathbf{v}}_{c})$ used in the visual update of ESIKF in Section IVD.
VIIA Visual Map Point Selection
To perform sparse image alignment in the visual update, we begin by selecting appropriate visual map points. We first extract the set of map points (termed as visual submap) that is visible in the current camera FoV, using voxel and raycasting queries. Then, the visual map points from this submap are selected and outliers are rejected. This process yields a refined set of visual map points ready for constructing visual photometric errors in the visual measurement model.
VIIA1 Visible Voxel Query
Identifying map voxels within the current frame FoV is challenging due to the large number of voxels in the map. To address this issue, we poll the voxels hit by LiDAR points in the current scan. This can be done efficiently by inquiring the voxel Hash table using the measured point position. If the camera FoV is largely overlapped with the LiDAR FoV, map points in the camera FoV likely lie in these voxels as well. We also poll voxels hit by map points identified visible (through the same voxel query and raycasting) in the previous image frame, assuming that two consecutive image frames have large FoV overlaps. Finally, the current visual submap can be obtained as map points contained in these two types of voxels followed by a FoV check.
VIIA2 Raycasting on Demand
In most cases, the visual submap can be obtained through the voxel queries above. However, a LiDAR sensor could return no points when it is too close to an object (known as the close proximity blind zones). Also, the camera FoV may not be completely covered by the LiDAR FoV. To recall more visual map points in these cases, we employ a raycasting strategy as illustrated in Fig. 6. We divide the image into uniform grid cells, each with $30\times 30$ pixels, and project the visual map points obtained from the voxel query onto the grid cells. For each image grid cell that is not occupied by these visual map points, a ray is cast backward along the central pixel, where sample points are uniformly distributed along the ray in the depth direction from ${d}_{\mathrm{min}}$ to ${d}_{\mathrm{max}}$. In order to reduce computation load, the positions of sample points on each ray in the camera body frame are precomputed. For each sampled point, we evaluate the corresponding voxel’s status: if the voxel contains map points that lie in this grid cell after projection, we incorporate these map points into the visual submap and cease for this ray. Otherwise, we continue to the next sample point on the ray until reaching the maximum depth ${d}_{\mathrm{max}}$. After processing all the unoccupied image grid cells through raycasting, we obtain a set of visual map points that distribute among the whole image.
VIIA3 Outlier Rejection
After voxel query and raycasting, we obtain all visual map points in the current frame FoV. However, these visual map points could be occluded in the current frame, have discontinuous depth, have their reference patch taken at large view angles, or have large view angles in the current frame, all of which can severely degrade image alignment accuracy. To address the first issue, we project all the visual map points in the submap into the current frame using the pose after the LiDAR update and keep the lowestdepth points in each grid cell of $30\times 30$ pixels. To address the second issue, we project the LiDAR points in the current LiDAR scan to the current frame producing a depth map. By comparing the depth of visual map points with their $9\times 9$ neighbor in the depth map, we determine their occlusion and depth variation. Occluded and depthdiscontinuous map points are rejected (see Fig. 7). To address the third and fourth issue, we remove points where the view angle (i.e., the angle between normal vector and direction from the visual map point to the patch optical center) of the reference patch or current patch is too large (e.g. over 80°). The remaining visual map points will be used to align the current image.
VIIB SparseDirect Visual Measurement Model
The visual map points ${\{}^{G}{\mathbf{p}}_{i}\}$ extracted above are used to construct the visual measurement model. The underlying principle is that, when transforming the map point ${}_{}{}^{G}\mathbf{p}_{i}^{}$ to the current image ${\mathbf{I}}_{k}(\cdot )$ with the groundtruth state (i.e., pose) ${\mathbf{x}}_{k}$, the photometric error between the reference patch and the current patch should be zero:
$\begin{array}{cc}\hfill \mathrm{\U0001d7ce}& ={\tau}_{k}{\mathbf{I}}_{k}^{gt}(\underset{{\mathbf{u}}_{i}}{\underset{\u23df}{\bm{\pi}({}_{}{}^{C}\mathbf{T}_{I}^{}{{(}^{G}{\mathbf{T}}_{I})}^{1}{}_{}{}^{G}\mathbf{p}_{i}^{})}}+\mathrm{\Delta}\mathbf{u})\hfill \\ & {\tau}_{r}{\mathbf{I}}_{r}^{gt}(\underset{{\mathbf{u}}_{i}^{\prime}}{\underset{\u23df}{\bm{\pi}({}_{}{}^{{C}_{r}}\mathbf{T}_{G}^{}{}_{}{}^{G}\mathbf{p}_{i}^{})}}+{\mathbf{A}}_{i}^{r}\mathrm{\Delta}\mathbf{u})\hfill \end{array}$  (21) 
where $\bm{\pi}(\cdot )$ is the common camera projection model (i.e., Pinhole, MEI, ATAN, Scaramuzza, Equidistant), ${}_{}{}^{{C}_{r}}\mathbf{T}_{G}^{}$ is the pose of the global frame $G$ w.r.t. reference frame ${C}_{r}$, which has been estimated when receiving and fusing the reference frame, ${\mathbf{A}}_{i}^{r}$ is the affine warping matrix that transforms pixels from the $i$th current patch to the reference patch, $\mathrm{\Delta}\mathbf{u}$ is the relative pixel position to the center ${\mathbf{u}}_{i}$ within the current patch, ${\mathbf{I}}_{k}^{gt},{\mathbf{I}}_{r}^{gt}$ denote the groundtrue pixel values of the reference and current frames, respectively. They are measured as the actual image pixel values ${\mathbf{I}}_{k},{\mathbf{I}}_{r}$ with measurement noise ${\mathbf{v}}_{c}=(\bm{\delta}{\mathbf{I}}_{k},\bm{\delta}{\mathbf{I}}_{r})$, which originate from various sources (e.g., shot noise and the AnalogtoDigital Converter (ADC) noise of the camera CMOS). Hence,
$\underset{{\mathbf{y}}_{c}}{\underset{\u23df}{\mathrm{\U0001d7ce}}}=\underset{{\mathbf{h}}_{c}(\mathbf{x},{\mathbf{v}}_{c})}{\underset{\u23df}{{\tau}_{k}({\mathbf{I}}_{k}({\mathbf{u}}_{i}+\mathrm{\Delta}\mathbf{u})\bm{\delta}{\mathbf{I}}_{k}){\tau}_{r}({\mathbf{I}}_{r}({\mathbf{u}}_{i}^{\prime}+{\mathbf{A}}_{i}^{r}\mathrm{\Delta}\mathbf{u})\bm{\delta}{\mathbf{I}}_{r})}}$  (22) 
To enhance computational efficiency, we employ an inverse compositional formulation [48, 4], where the pose incremental $\bm{\delta}\mathbf{T}\in {\mathbb{R}}^{6}$ parameterizing ${}_{}{}^{G}\mathbf{T}_{I}^{}{=}^{G}{\widehat{\mathbf{T}}}_{I}^{\kappa}\text{Exp}(\bm{\delta}\mathbf{T})$ in ${\mathbf{u}}_{i}$ (see (21)), is moved from ${\mathbf{u}}_{i}$ to ${\mathbf{u}}_{i}^{\prime}$ as follows:
$\begin{array}{cc}\hfill {\mathbf{u}}_{i}& =\bm{\pi}({}_{}{}^{C}\mathbf{T}_{I}^{}{{(}^{G}{\widehat{\mathbf{T}}}_{I}^{\kappa})}^{1}{}_{}{}^{G}\mathbf{p}_{i}^{})\hfill \\ \hfill {\mathbf{u}}_{i}^{\prime}& =\bm{\pi}({}_{}{}^{{C}_{r}}\mathbf{T}_{G}^{}\text{Exp}{(\bm{\delta}\mathbf{T})}^{G}{\mathbf{p}}_{i})\hfill \end{array}$  (23) 
Given that ${\mathbf{u}}_{i}^{\prime}$ in the reference frame remains unchanged during each iteration, we only require a onetime computation of the Jacobian matrices w.r.t. $\bm{\delta}\mathbf{T}$, rather than recalculating them for every iteration.
To estimate the inverse exposure time ${\tau}_{k}$ from the measurement equation (22), we fix the initial inverse exposure time ${\tau}_{0}=1$ to eliminate the degeneration of equation (22) when all inverse exposure time are zeros. The estimated inverse exposure times of subsequent frames are therefore the exposure time relative to the first frame.
The equation (22) is used in the visual update step across three levels (see Algorithm 1); the visual update starts from the coarsest level, after the convergence of a level, it proceeds to the next finer level. The estimated state is then used to generate visual map points (Section VC) and update reference patch (Section VD).
VIII Datasets for Evaluation
In this section, we introduce datasets for performance evaluation, including public datasets NTUVIRAL [49], Hilti’22 [50], Hilti’23 [51], and MARSLVIG [52], as well as our selfcollected FASTLIVO2 private dataset. Specifically, the NTUVIRAL and Hilti datasets are used to conduct a quantitative benchmark comparison of our system against other stateoftheart (SOTA) SLAM systems (Section IXB). The FASTLIVO2 private dataset is primarily used to evaluate our system across various extremely challenging scenarios (Section IXC), to demonstrate its capability for highprecision mapping (Section IXD), and to validate the functionality of the individual modules within our system (Sections IA through ID in the Supplementary Material[53]). MARSLVIG dataset is employed for application demonstrations (Section X) and ablation study (Section IE in the Supplementary Material[53]).
VIIIA NTUVIRAL, Hilti and MARSLVIG Dataset
The NTUVIRAL dataset, collected at the Nanyang Technological University campus using an aerial platform, presents diverse scenarios embodying unique aerial operational challenges. Specifically, the “sbs” sequences can only provide noisy visual features from distant objects. The “nya” sequences present challenges to LiDAR SLAM due to semitransparent surfaces and to visual SLAM owing to intricate flight dynamics and low lighting conditions. The dataset is equipped with a 16channel OS1 gen1^{3}^{3}3https://ouster.com/products/os1lidarsensor LiDAR sampled at 10 Hz and with a builtin IMU at 100 Hz, and two synchronized pinhole cameras triggered at 10 Hz. The left camera is used for evaluation.
The Hilti’22 and Hilti’23 datasets, collected by handheld and robot devices, encompass indoor and outdoor sequences from environments like construction sites, offices, labs and parking areas. These sequences introduce numerous challenges from long corridors, basements, and stairs, with textureless features, varying illumination conditions, and insufficient LiDAR plane constraints. Handheld sequences use a Hesai PandarXT32^{4}^{4}4https://www.hesaitech.com/product/xt32/ LiDAR at 10 Hz, five wideangle cameras at 40 Hz, which is downsampled into 10 Hz, and an external Bosch BMI085 IMU at 400 Hz. Meanwhile, the robotmounted sequences feature a Robosense BPearl^{5}^{5}5https://www.robosense.ai/en/rslidar/RSBpearl LiDAR at 10 Hz, eight omnidirectional cameras at 10 Hz, and an Xsens MTi670 IMU at 200 Hz. In both cases, the frontfacing camera is used for all systems under evaluation. Millimeteraccurate ground truth, obtained through a motion capture system (MoCap) or a Total Station [54], is provided for each sequence. Note that the ground truth of the Hilti datasets is not opensource; therefore, algorithmic results on these datasets are evaluated via the Hilti official website. Since “Site 3” in Hilti’23 does not provide indepth analysis plots (e.g., RMSE), we exclude these four sequences, but our scoring results for these sequences can still be found on their official website^{6}^{6}6https://hiltichallenge.com/leaderboard2022.html, https://hiltichallenge.com/leaderboard2023.html. NTUVIRAL and Hilti contribute a total of 25 sequences.
The MARSLVIG dataset provides highaltitude, groundfacing mapping data that encompasses diverse unstructured terrains such as jungles, mountains, and islands. The dataset was collected via a DJI M300 RTK quadrotor, which is equipped with a Livox Avia^{7}^{7}7https://www.livoxtech.com/avia LiDAR (with builtin BMI088 IMU) and a highresolution globalshutter camera, both triggered at 10 Hz. This is notably distinct from the aforementioned NTUVIRAL and Hilti datasets, which use $752\times 480$ grayscale images, while the MARS dataset employs $2448\times 2048$ RGB images, thereby facilitating the generation of clear, dense colored point clouds. Therefore, we leverage this public dataset to validate our capabilities in highaltitude aerial mapping applications.
VIIIB FASTLIVO2 Private Dataset
To validate the system’s performance under more extreme conditions (e.g., LiDAR degeneration, low illumination, drastic exposure changes, and cases of no LiDAR measurements), we make a new dataset named FASTLIVO2 private dataset. The dataset, hardware device, and hardware synchronization scheme are released with the codes of this work to facilitate the reproduction of our work.
VIIIB1 Platform
Our data collection platform, illustrated in Fig. 8, is equipped with an industrial camera (MVCA01321UC), a Livox Avia LiDAR, and a DJI manifold2c (Intel i78550u CPU and 8 GB RAM) as onboard computers. The camera FoV is ${70.6}^{\circ}\times {68.5}^{\circ}$ and the LiDAR FoV is ${70.4}^{\circ}\times {77.2}^{\circ}$. All sensors are hard synchronized with a 10 Hz trigger signal, generated by STM32 synchronized timers.
VIIIB2 Sequence Description
As summarized in Table S1 in the Supplementary Material[53], FASTLIVO2 private dataset comprises 20 sequences across various scenes (e.g., campus buildings, corridors, basements, mining tunnel, etc.) characterized by structureless, cluttered, dim, variablelighting, and weakly textured environments, with a total duration of 66.9 min. Most sequences exhibit visual and/or LiDAR degeneration, such as facing a single and/or textureless plane, traversing an extremely narrow and/or dark tunnel, and experiencing varying light conditions from indoor to outdoor (see Fig. S7 in the Supplementary Material[53]). To guarantee enhanced synchronous data collection between the camera and LiDAR, we configure the camera with fixed exposure time but autogain mode in most scenarios. For the remaining sequences with autoexposure, we record their ground truth exposure times. In all sequences, the platform returns to the starting point, which enables the drift evaluation.
IX Experiment Results
In this section, we conduct extensive experiments to evaluate our proposed system.
IXA Implementation and System Configurations
We implemented the proposed FASTLIVO2 system in C++ and Robots Operating System (ROS). In the default configuration, the exposure time estimation is enabled, while normal vector refinement is turned off. LiDAR points in a scan are downsampled temporally at a 1:3 ratio. The root voxel size for the voxel map is set at 0.5 m, and the maximum layer of the internal octree is 3. The image patch size is $8\times 8$ for image alignment and $11\times 11$ for normal refinement. Within the sequential ESIKF settings, for all the experiments, the camera photometric noise is set to a constant value of 100. The LiDAR depth error and bearing angle error are adjusted to 0.02 m and 0.05° for Livox Avia LiDAR and OS116, 0.001 m and 0.001° for PandarXT32, 0.008 m and 0.01° for Robosense BPearl LiDAR. The laser beam divergence angle is set at 0.15° for Livox Avia LiDAR and OS116, and at 0.001° for the PandarXT32 and Robosense BPearl LiDAR. Our system uses the same parameters in all sequences of all datasets with the same sensor setup. The computation platform for all experiments is a desktop PC equipped with an Intel i710700K CPU and 32 GB RAM. For FASTLIVO2, we also test it on an ARM processor that is commonly used in embedded systems with reduced power and cost. The ARM platform is RB5^{8}^{8}8https://www.adlinktech.com/Products/Computer_on_Modules/SMARC/LECRB5?lang=en. with a Qualcomm Kryo585 CPU and 8 GB RAM. We refer to the implementation of FASTLIVO2 on the ARMbased platform as “FASTLIVO2 (ARM)”.
Dataset  Sequence  SDV LOAM  Our LIO  FAST LIO2  R3LIVE  LVI SAM  FAST LIVO  Ours (w/o expo)  Ours (w normal)  Ours (w/o update)  Ours 
Hilti’22  Construction Ground  25.121  0.011  0.013  0.021  $\times $  0.022  0.011  0.008  0.015  0.010 
Construction Multilevel  12.561  0.031  0.044  0.024  $\times $  0.052  0.021  0.018  0.025  0.020  
Construction Stairs  9.212  0.221  0.320  0.784  9.142  0.241  0.049  0.027  0.151  0.016  
Long Corridor  19.531  0.061  0.064  0.061  6.312  0.065  0.069  0.059  0.071  0.067  
Cupola  9.321  0.221  0.250  2.142  $\times $  0.182  0.161  0.122  0.179  0.121  
Lower Gallery  11.232  0.014  0.024  0.008  2.281  0.022  0.010  0.008  0.010  0.007  
Attic to Upper Gallery  4.551  0.223  0.720  2.412  $\times $  0.621  0.101  0.077  0.221  0.069  
Outside Building  2.622  0.030  0.028  0.029  0.952  0.052  0.042  0.033  0.050  0.035  
Hilti’23  Floor 0  4.621  0.028  0.031  0.018  $\times $  0.021  0.025  0.023  0.023  0.022 
Floor 1  7.951  0.025  0.031  0.024  8.682  0.022  0.024  0.022  0.031  0.023  
Floor 2  7.912  0.041  0.083  0.046  $\times $  0.048  0.023  0.021  0.051  0.022  
Basement  6.151  0.021  0.038  0.024  $\times $  0.035  0.020  0.018  0.018  0.016  
Stairs  9.032  0.110  0.170  0.110  3.584  0.152  0.025  0.020  0.132  0.018  
Parking 3x floors down  19.952  0.162  0.320  0.462  $\times $  0.356  0.035  0.022  0.112  0.032  
Large room  16.781  0.121  0.028  0.035  0.563  0.031  0.033  0.027  0.118  0.026  
Large room (dark)  15.012  0.051  0.040  0.059  $\times $  0.053  0.049  0.051  0.058  0.046  
NTU VIRAL  eee_01  0.301  0.122  0.212  0.072  3.901  0.191  0.069  0.066  0.109  0.068 
eee_02  1.842  0.131  0.172  0.059  0.182  0.132  0.051  0.055  0.112  0.051  
eee_03  0.301  0.124  0.213  0.078  0.287  0.192  0.068  0.070  0.099  0.068  
nya_01  0.202  0.084  0.141  0.080  0.205  0.121  0.075  0.078  0.106  0.073  
nya_02  0.214  0.153  0.212  0.084  1.296  0.182  0.076  0.081  0.118  0.075  
nya_03  0.251  0.082  0.133  0.079  0.176  0.112  0.060  0.060  0.092  0.059  
sbs_01  0.212  0.112  0.184  0.075  0.254  0.253  0.064  0.063  0.098  0.062  
sbs_02  0.233  0.123  0.161  0.076  0.221  0.134  0.062  0.048  0.116  0.061  
sbs_03  0.281  0.122  0.142  0.070  0.309  0.132  0.061  0.047  0.119  0.060  
Average  7.416  0.097  0.151  0.278  1.928  0.137  0.051  0.044  0.089  0.045 

1
$\times $ denotes the system totally failed.
IXB Benchmark Experiments
In this experiment, we conduct quantitative evaluations on 25 sequences from the NTUVIRAL, Hilti’22, and 23 open datasets. Our approach is benchmarked against several stateoftheart opensource odometry systems, including R3LIVE [36], a dense direct LiDARinertialvisual odometry system; FASTLIO2 [13], a direct LiDARinertial odometry system; SDVLOAM [40], a semidirect LiDARvisual odometry system; LVISAM [35], a featurebased LiDARinertialvisual SLAM system; and our previous work FASTLIVO [8].
These systems are downloaded from their respective GitHub repositories. For FASTLIO2, FASTLIVO, and LVISAM, we use the recommended settings for indoor and outdoor scenes equipped with multiline LiDAR sensors. For R3LIVE, we adapt the system to work with fisheye camera models and multiline LiDARs equipped with external IMUs (the default configuration only supports internal IMUs). We disable the realtime optimization of the camera intrinsic and the extrinsic ${}_{}{}^{C}\mathbf{T}_{I}^{}$ due to adverse optimization caused by insufficient IMU excitation in the datasets. Other parameters, including the window size and pyramid level for optical flow tracking, the resolution for downsampling the point cloud of the current scan and the global map, are finetuned to achieve optimal performance. Since only the vision module of SDVLOAM is opensourced, we integrate it with LeGOLOAM[7] in a loosely coupled manner, following the methodology described in the original paper[40]. This enhanced system continues to refine poses obtained from the vision module and we also open this implementation on GitHub^{9}^{9}9https://github.com/xuankuzcr/SDVLOAM_reimplementation. Given that all compared systems are odometry without loop closure, except for LVISAM, we remove the loopclosure module of LVISAM to ensure a fair comparison. Additionally, we conduct an ablation study on the exposure time estimation module, the normal refine module, and the reference patch update strategy. The default FASTLIVO2 has realtime exposure estimation and reference patch update, but no normal refinement.
The results of all methods are shown in Table II. It is seen that our method achieves the highest overall accuracy across all sequences with an average RMSE of 0.044 m, which is three times more accurate than the secondplace FASTLIVO at 0.137 m. Our system delivers the best results in most sequences, except for ”Outside Building” and ”Large Room (dark)”, where our system exhibits a slightly (millimeterlevel) higher error compared to the LiDARinertial only odometry FASTLIO2. This discrepancy can be attributed to the rich structural features but poor lighting conditions of these sequences, resulting in dim and blurred images. Consequently, fusing these lowquality images does not enhance odometry accuracy. Excluding these two sequences, our method, which leverages tightlycoupled LiDAR, inertial, and visual information, outperforms FASTLIO2, our LIO subsystem, and the LiDARvisual only odometry, SDVLOAM, significantly. Notably, SDVLOAM performs particularly poorly on the Hilti datasets due to its lack of tight integration with IMU measurements, leading to drift in the LO subsystem. Additionally, the loose coupling between LiDAR and visual observations, along with poor initial values for VO, often results in local optima or even negative optimization. Our LIO subsystem generally surpasses FASTLIO2 due to our more accurate noise modeling for each LiDAR point. In a few sequences where FASTLIO2 outperforms slightly, the differences are minimal, at the millimeter level, and negligible. Moreover, our system’s accuracy significantly exceeds that of other tightly coupled LiDARinertialvisual systems across all sequences. Among them, LVISAM fails in nine sequences primarily due to its featurebased LIO and VIO subsystems not fully utilizing raw measurements, which degrades its robustness in environments with subtle geometric or texture features. R3LIVE generally performs well, but struggles in “Construction Stairs”, “Cupola”, and “Attic to Upper Gallery” sequences, where its performance is even worse than FASTLIO2. This is because intense rotations at structureless staircases result in inadequate pose priors, causing local optima when aligning colored map points with the current frame, and ultimately leading to negative optimization. FASTLIVO and FASTLIVO2 overcome such challenges in these sequences by the patchbased image alignments. Additionally, situations where the sensors are close to walls in these sequences highlight the effectiveness of raycasting in FASTLIVO2, with the mapping results in these largescale scenes shown in Fig. S8 in the Supplementary Material [53]. On the other hand, FASTLIVO is outperformed by R3LIVE and FASTLIVO2 on the NTUVIRAL dataset, especially in unstructured scenes like “nya” sequences, where the effects of affine warping based on constant depth assumptions are inaccurate. In contrast, the pixellevel alignment of R3LIVE and the plane prior (or refinement) of FASTLIVO2 do not encounter such issues.
Comparing the different variants of FASTLIVO2, we observe that the average accuracy without realtime exposure time estimation decreases by 6 mm compared to the default, as the exposure time estimation can actively compensate illumination changes in the environment. On the other hand, the average accuracy without the reference patch update decreases by 44 mm compared to the default, as the reference patch update strategy effectively selects patches with higher resolution and avoids selecting outlier patches. Finally, the normal refinement increases the average accuracy by 1 mm, and the accuracy improvement is not consistent in all sequences. The limited improvement is mainly because normal vector refinement yields positive optimization only in simple structured scenes with nice image observations. In the NTUVIRAL dataset, images from the “eee” and “nya” sequences are extremely dim and blurry, where negative optimization is particularly severe. To further study the effectiveness of the different modules, including exposure time estimation, affine warping, reference patch update, normal convergence, ondemand raycasting, and ESIKF sequential update, we conducted a thorough study on our private dataset and MARSLVIG dataset. The results are presented in Section I (System Module Validation) in the Supplementary Material[53] due to the space limit. As confirmed in the results, our system can achieve robust and accurate pose estimation in both structured and unstructured environments, under severe light variations, in remarkably largescale scenarios with longterm, highspeed data collection, and even in extremely narrow spaces with few LiDAR measurements.
IXC LiDAR Degenerated and Visually Challenging Environments
In this experiment, we evaluate the robustness of our system under environments experiencing LiDAR degeneration and/or visual challenges, comparing it with the qualitative mapping results of FASTLIVO and R3LIVE in 8 sequences as shown in Fig. 9 and 10. Fig. 9 showcases LiDAR degeneration sequences where the LiDAR is facing a big wall while moving along the wall from one side to the other. Due to the absence of geometrical constraints since only one wall plane is being observed by the LiDAR, LIO methods would fail. It’s worth mentioning that the “HIT Graffiti Wall” sequence spans nearly 800 meters with LiDAR continuously facing the wall, leading to considerable degeneration. In all sequences, FASTLIVO2 distinctly showcases its robustness against even longterm degeneration and its capability to deliver highprecision colored point maps. In contrast, FASTLIVO managed to obtain the geometric structure but with completely blurred texture. R3LIVE struggles with both geometric structure and texture clarity. Fig. 10 showcases tests in more complicated scenarios where LiDAR and/or camera both degenerate occasionally. The degeneration directions are indicated by respective arrows. “HKU Cultural Center” (Fig. 10 (a)) showcases the mapping results of FASTLIVO2, R3LIVE, and FASTLIVO. As can be seen, R3LIVE and FASTLIVO have distorted point maps, blurred textures, and drifts exceeding 1 m. In contrast, FASTLIVO2 successfully returns to the starting point, achieving an impressive endtoend error of less than 0.01 m, while achieving a consistent point map with clear textures. “CBD Building 03” (Fig. 10 (b)) and “Mining Tunnel” (Fig. 10 (c)) display only FASTLIVO2 results, as R3LIVE and FASTLIVO failed. In Fig. 10 (b), the blue arrow represents movement towards a pure black screen, indicating concurrent LiDAR and camera degeneration. In Fig. 10 (c1) and (c2), the red points represent the LiDAR scan at that location, illustrating the areas of LiDAR degeneration due to the single plane being observed. Furthermore, the “Mining Tunnel” exhibits very dim lighting throughout the whole sequence, coupled with frequent visual and LiDAR degeneration. Despite of these challenges, FASTLIVO2 still returns to the starting point with an endtoend error of less than 0.01 m in both sequences.
IXD Highprecision Mapping
In this experiment, we validate the highprecision mapping capabilities of our system. To explore the mapping accuracy across different algorithms and ensure fairness, we compare our system with FASTLIO2, R3LIVE, and FASTLIVO in scenes characterized by rich texture and structured environments. We take the sequences “SYSU 01”, “HKU Landmark” and “CBD Building 01” as examples. Fig. S9 in the Supplementary Material[53] shows the colored point maps of these sequences reconstructed in real time. We can clearly observe that the point maps generated by FASTLIVO2 retain the finest details among all the systems, the enlarged views of the colored point maps are akin to those in the actual RGB image. In the “SYSU 01” sequence, our algorithm produces fewer white noise dots on the signboard because we normalize the image colors to a reasonable exposure time using the recovered exposure time before coloring, resulting in rarely overexposed colored point maps. The reconstruction of the human and motorcycle in “CBD Building 01” also exemplifies our ability to rebuild the details of unstructured objects. In all sequences, the estimated final position returns to the starting point with an endtoend error of less than 0.01 m. We also tested FASTLIVO2 in the remaining sequences of the prviate dataset, with mapping results shown in Fig. S10S13 in the Supplementary Material[53].
IXE Run Time Analysis
In this section, we evaluate the average computational time per LiDAR scan and image frame of our proposed system, tested on a desktop PC equipped with an Intel i710700K CPU and 32 GB RAM. Our evaluations span public datasets including Hilti’22, Hilti’23, and NTUVIRAL, and our private dataset. As shown in Table III, our system exhibits the lowest processing time across all sequences. The average computation time consumption on an Intel i7 processor is only 30.03 ms (17.13 ms per LiDAR scan and 12.90 ms per image frame), fulfilling realtime operation at 10 Hz. Besides, our system can even operate in real time on ARM processors with an average processing time per frame of just 78.44 ms. LVISAM’s LiDAR and visual feature extraction modules in LIO and VIO are timeconsuming. In addition to the time consumed by LIO and VIO, LVISAM integrates IMU preintegration constraints, visual odometry constraints, and LiDAR odometry constraints within a factor graph, further increasing the overall processing time. For R3LIVE, although also employing a direct method, its pixelwise image alignment necessitates the use of a large number of visual map points. In contrast, our approach uses sparse points with reference patches, enabling efficient alignment. Additionally, R3LIVE maintains a colored map that undergoes Bayesian updating, significantly increasing the computational load as the map resolution increases. For FASTLIO2, the average processing time per frame (Table S3 in the Supplementary Material[53] due to space constraints) is approximately 10.35 ms less than FASTLIVO2 due to not processing additional image measurements.
FASTLIVO2 also shows noticeable improvements over the predecessor FASTLIVO. The primary enhancement stems from our application of inverse compositional formulation in the sparse image alignment. Employing affine warping based on the plane prior from LiDAR points further enhances the convergence efficiency of our method. Consequently, FASTLIVO2 reduces the number of iterations per pyramid level from $10$ to $3$, while still achieving superior accuracy.
Dataset  R3LIVE  LVI SAM  FAST LIVO  FASTLIVO2 (LiDAR / Image)  FASTLIVO2 (ARM) 
Hilti’22  
Construction Ground  105.03  $\times $  52.33  36.52 (20.44 / 15.05)  96.12 
Construction Multilevel  112.13  $\times $  56.12  38.77 (21.38 / 17.39)  95.38 
Construction Stairs  125.41  138.47  51.34  39.33 (24.32 / 15.01)  98.43 
Long Corridor  120.42  109.97  48.68  41.42 (26.21 / 15.21)  94.33 
Cupola  151.52  $\times $  59.42  43.54 (26.53 / 17.01)  98.12 
Lower Gallery  119.74  131.37  51.15  41.11 (25.69 / 15.42)  92.13 
Attic to Upper Gallery  144.39  $\times $  58.61  44.21 (27.12 / 17.09)  91.15 
Outside Building  105.17  107.92  44.25  33.82 (18.91 / 14.91)  85.43 
Hilti’23  
Floor 0  117.01  $\times $  52.23  42.22 (25.13 / 17.09)  92.24 
Floor 1  106.11  106.98  50.14  43.12 (27.43 / 15.69)  93.52 
Floor 2  154.65  $\times $  53.24  41.78 (26.12 / 15.66)  94.43 
Basement  118.21  $\times $  48.23  39.65 (24.53 / 15.12)  93.42 
Stairs  122.94  114.29  48.55  38.42 (22.64 / 15.78)  95.53 
Parking 3x floors down  142.43  $\times $  51.89  43.62 (26.99 / 16.63)  94.22 
Large room  125.78  182.18  55.23  43.23 (26.43 / 16.80)  93.28 
Large room (dark)  131.31  $\times $  51.46  44.21 (29.12 / 15.09)  92.29 
NTU VIRAL  
eee_01  105.89  113.61  38.22  31.45 (17.24 / 14.21)  78.03 
eee_02  112.27  119.19  39.43  30.24 (16.23 / 14.01)  79.22 
eee_03  108.11  108.77  37.53  29.44 (16.12 / 13.32)  77.43 
nya_01  122.73  124.54  39.66  34.52 (18.14 / 16.38)  79.44 
nya_02  111.96  117.18  35.11  32.23 (17.33 / 14.90)  80.52 
nya_03  115.96  111.24  38.42  33.42 (18.12 / 15.30)  78.95 
sbs_01  112.77  115.67  39.23  31.92 (14.68 / 17.24)  78.32 
sbs_02  113.01  110.37  35.39  32.56 (17.62 / 14.94)  72.43 
sbs_03  119.91  120.73  37.41  33.62 (18.02 / 15.60)  75.56 
Private Dataset  
Retail Street  85.32  70.23  31.22  19.44 (10.42 / 9.02)  64.12 
CBD Building 01  79.22  65.32  29.14  18.22 (10.32 / 7.90)  62.33 
CBD Building 02  $\times $  $\times $  $\times $  21.53 (11.50 / 10.03)  69.98 
CBD Building 03  $\times $  $\times $  $\times $  21.89 (11.98 / 9.91)  69.92 
HKU Landmark  75.43  68.77  28.33  17.42 (10.12 / 7.30)  63.21 
HKU Lecture Center  70.42  $\times $  31.45  18.12 (10.99 / 7.13)  62.88 
HKU Centennial Garden  74.33  70.12  33.46  19.22 (10.80 / 8.42)  64.52 
HKU Cultural Center  102.34  $\times $  35.62  21.43 (11.42 / 10.01)  72.43 
HKU Main Building  75.62  $\times $  30.21  19.43 (10.33 / 9.10)  65.31 
HKUST Red Sculpture  84.22  90.13  32.43  20.14 (11.12 / 9.02)  65.42 
HIT Graffiti Wall  112.56  $\times $  39.98  21.71 (10.55 / 11.16)  68.99 
Banner Wall  74.13  $\times $  30.22  19.32 (10.01 / 9.31)  63.43 
Bright Screen Wall  73.23  $\times $  28.43  18.22 (9.92 / 8.30)  61.21 
Black Screen Wall  71.42  $\times $  27.66  17.99 (9.10 / 8.89)  60.91 
Office Building Wall  $\times $  $\times $  $\times $  19.42 (10.11 / 9.31)  62.33 
Narrow Corridor  $\times $  $\times $  $\times $  18.44 (8.32 / 10.12)  61.42 
Long Corridor  $\times $  $\times $  $\times $  19.53 (10.43 / 9.10)  62.55 
Mining Tunnel  $\times $  $\times $  $\times $  29.43 (15.42 / 14.01)  79.32 
SYSU 01  112.66  88.95  32.23  23.65 (12.51 / 11.14)  77.43 
SYSU 02  110.12  $\times $  32.12  22.63 (11.82 / 10.81)  72.31 
Average  108.36  108.45  41.43  30.03 (17.13 / 12.90)  78.44 

1
$\times $ denotes the system totally failed.
X Applications
To showcase the superior performance and versatility of FASTLIVO2 in realworld applications, we develop multiple solutions, including fully onboard autonomous UAV navigation, airborne mapping, textured mesh generation, and 3D Gaussian splatting reconstruction for 3D scene representation.
XA Fully Onboard Autonomous UAV Navigation
Given the high precision and robust localization performance of FASTLIVO2, along with its realtime capabilities, we conduct closedloop autonomous UAV flights.
XA1 System Configurations
The hardware and software setup are illustrated in Fig. 11. For hardware, we use a NUC (Intel i71360P CPU and 32 GB RAM) as the onboard computer. In terms of software, the localization component is powered by FASTLIVO2, which provides position feedback at 10Hz. The localizationr result is fed to the flight controller to achieve 200Hz feedback on position, velocity, and attitude. Besides localization, FASTLIVO2 supplies a dense registered point cloud to the planning module, the Bubble planner [55], which plans a smooth trajectory that is then tracked by an onmanifold Model Predictive Control (MPC) [56]. The MPC calculates the desired angular rates and thrust, which are tracked by respective lowlevel angular rate controllers running on the flight controller. Importantly, the MPC, Planner, and FASTLIVO2 all operate on the onboard computer in real time.
XA2 UAV Autonomous Navigation
We conduct 4 fully onboard autonomous UAV navigation experiments, “Basement”, “Woods”, “Narrow Opening”, and “SYSU Campus” (Table S2 in the Supplementary Material[53]). “Basement” and “Woods” experiments are fully autonomous flights incorporating all planning, MPC, and FASTLIVO2 modules, while “Narrow Opening” and “SYSU Campus” are manual flights with only MPC and FASTLIVO2 (without the planning component). As can be seen, “Basement” and “Woods” showcase the UAV’s successful autonomous navigation and obstacle avoidance. In “Narrow Opening”, the UAV is commanded to fly close proximity to a wall leading to few LiDAR points measurements. Nevertheless, the raycasting module recalls a greater number of visual map points, providing abundant constraints for localization, which allows for stable localization. Moreover, “Basement” and “Narrow Opening” experience LiDAR degeneration, observing only a single wall (see Fig. LABEL:fig_cover (e1) and (e4), Fig. 12 (b1b4)), along with significant exposure variations (see Fig. LABEL:fig_cover (e5e6)). Despite these challenges, our UAV system performed exceptionally well. “Woods” involves the UAV moving at high speeds up to 3 m/s, demanding rapid response from the entire UAV system (see Fig. 12 (a1a4)). “SYSU Campus”, a nondegenerated scene, primarily demonstrates the onboard highprecision mapping capabilities (see Fig. S14 in the Supplementary Material[53]). Finally, it is worth mentioning that in all these four UAV flights, severe lighting variation occurred. FASTLIVO2 is able to estimate exposure time that closely follows the groundtruth values (see Fig. S15 in the Supplementary Material[53]).
Regarding onboard computational time, the need to run MPC (at 100 Hz) and Planning (at 10 Hz) on the onboard computer consumes computational resources and memory, limiting the computation resources available to FASTLIVO2. Despite of the concurrent execution of control and planning, as illustrated in Fig. 13, the average onboard processing time per LiDAR scan and image frame for FASTLIVO2, approximately 53.47 ms, is still well below the frame period 100 ms. The average processing times for planning and MPC are 8.43 ms and 18.5 ms, respectively. The total average processing time of 80.4 ms meets very well the realtime requirements for onboard operations.
XB Airborne Mapping
Airborne mapping represents a crucial task in surveying and mapping applications. To evaluate the suitability of FASTLIVO2 for this application, we conduct an aerial mapping experiment using the public dataset MARSLVIG [52] whose hardware configuration is detailed in Section VIIIA. We evaluate the two sequences “HKairport01” and “HKisland01”, whose realtime mapping results are illustrated in Fig. LABEL:fig_cover (ac), with (a) and (c) corresponding to “HKisland01”, and (b) depicting “HKairport01”. The results demonstrate the effectiveness of FASTLIVO2 in unstructured environments such as forests and islands. The system successfully captures many fine structures and sharp coloring effects, including buildings, lane marks on roads, road curbs, tree crowns, and rocks, all of which are clearly visible. The APE (RMSE) for these sequences are 0.64 m and 0.27 m for FASTLIVO2, respectively, compared to 2.76 m and 0.52 m for R3LIVE. The average processing times on the desktop PC (Section IXA), are approximately 25.2 ms and 21.8 ms, respectively, compared to 110.5 ms and 100.2 ms for R3LIVE.
XC Supporting 3D Scene Applications: Mesh Generation, Texture, and Gaussian Splatting
Leveraging the highprecision sensor localization and dense 3D colored point map obtained from FASTLIVO2, we develop software applications for rendering pipelines including meshing and texturing, as well as emerging NeRFlike rendering pipeline such as 3D Gaussian Splatting (3DGS). For meshing, we employ VDBFusion [57] based on the Truncated Signed Distance Function (TSDF) in “CBD Building 01”, shown in Fig. 14 (a). The sharp edges on the columns and the distinct structure of the roof are clearly visible, demonstrating the high quality of the mesh. This level of detail is achieved due to the high density of FASTLIVO2’s point clouds and the exceptional accuracy of structural reconstruction. After mesh construction, we use OpenMVS [58] to perform texture mapping using the estimated camera poses in “CBD Building 01” and “Retail Street”, shown in Fig. 14 (bc). In Fig. 14 (c1c2), the texture images applied on the triangular facets are seamless and accurately aligned, resulting in a highly clear and precise texture mapping. This is attributed to pixellevel image alignment achieved by FASTLIVO2.
The dense color point clouds from FASTLIVO2 can also directly serve as the input of 3DGS. We conduct tests on the sequence “CBD Building 01” utilizing 300 frames out of a total of 1,180 images. The results are shown in Fig. 15. Compared to COLMAP[59], our method significantly reduces the time required to obtain dense point clouds and poses from 9 hours to 21 s. However, the training time increases from 10 min and 59 s to 15 min and 30 s. This increase is attributed to the denser point clouds (downsampled to 5 cm), which introduce more parameters to optimize. Nonetheless, the increased density and precision of our point clouds result in a slightly higher Peak SignaltoNoise Ratio (PSNR) compared to the PSNR obtained from COLMAP inputs.
XI Conclusion and Future Work
This paper proposed FASTLIVO2, a direct LIVO framework achieving fast, accurate, and robust state estimation while reconstructing the map on the fly. FASTLIVO2 can achieve high localization accuracy while being robust to severe LiDAR and/or visual degeneration.
The gain in speed is attributed to the use of raw LiDAR, inertial, and camera measurements within an efficient ESIKF framework with sequential update. In the image update, an inverse compositional formulation along with a sparse patchbased image alignment is further adopted to boost the efficiency. The gain in accuracy is attributed to the use (and even refine) of plane priors from LiDAR points to enhance accuracy of image alignment. Besides, a single unified voxel map is used to manage simultaneously the map points and the observed highresolution image measurements. The voxel map structure, which supports geometry construction and update, visual map point generation and update, and reference patch update, is developed and validated. The gain in robustness is due to realtime estimation of exposure time, which effectively handles environment illumination variation, and ondemand voxel raycasting to cope with LiDARs’ close proximity blind zones. The efficiency and accuracy of FASTLIVO2 were evaluated on extensive public datasets, while the robustness and effectiveness of each system module were evaluated on private dataset. The applications of FASTLIVO2 in realworld robotics applications, such as UAV navigation, 3D mapping, and model rendering, were also demonstrated.
As an odometry, FASTLIVO2 may have drifts over long distances. In the future, we could integrate loop closure and the sliding window optimization into FASTLIVO2 to mitigate this longterm drift. Moreover, the accurate and dense colored point maps could be used to extract semantic information for objectlevel semantic mapping.
References
 [1] R. MurArtal and J. D. Tardós, “Orbslam2: An opensource slam system for monocular, stereo, and rgbd cameras,” IEEE transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 [2] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
 [3] J. Engel, T. Schöps, and D. Cremers, “Lsdslam: Largescale direct monocular slam,” in European conference on computer vision. Springer, 2014, pp. 834–849.
 [4] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2016.
 [5] J. Zhang and S. Singh, “Loam: Lidar odometry and mapping in realtime.” in Robotics: Science and Systems, vol. 2, no. 9, 2014.
 [6] J. Lin and F. Zhang, “Loam livox: A fast, robust, highprecision lidar odometry and mapping package for lidars of small fov,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 3126–3131.
 [7] T. Shan and B. Englot, “Legoloam: Lightweight and groundoptimized lidar odometry and mapping on variable terrain,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4758–4765.
 [8] C. Zheng, Q. Zhu, W. Xu, X. Liu, Q. Guo, and F. Zhang, “Fastlivo: Fast and tightlycoupled sparsedirect lidarinertialvisual odometry,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 4003–4009.
 [9] T. Qin, P. Li, and S. Shen, “Vinsmono: A robust and versatile monocular visualinertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
 [10] R. MurArtal, J. M. M. Montiel, and J. D. Tardos, “Orbslam: a versatile and accurate monocular slam system,” IEEE transactions on robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 [11] M.Irani and P.Anandan, “All about direct methods,” in Proc. Workshop Vis. Algorithms, Theory Pract, 1999, pp. 267–277.
 [12] C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semidirect monocular visual odometry,” in 2014 IEEE international conference on robotics and automation (ICRA). IEEE, 2014, pp. 15–22.
 [13] W. Xu, Y. Cai, D. He, J. Lin, and F. Zhang, “Fastlio2: Fast direct lidarinertial odometry,” IEEE Transactions on Robotics, pp. 1–21, 2022.
 [14] C. Yuan, W. Xu, X. Liu, X. Hong, and F. Zhang, “Efficient and probabilistic adaptive voxel mapping for accurate online lidar odometry,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 8518–8525, 2022.
 [15] M. Meilland, A. I. Comport, and P. Rives, “Realtime dense visual tracking under large lighting variations,” in British Machine Vision Conference. British Machine Vision Association, 2011, pp. 45–1.
 [16] T. Tykkälä, C. Audras, and A. I. Comport, “Direct iterative closest point for realtime visual odometry,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011, pp. 2050–2056.
 [17] C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for rgbd cameras,” in 2013 IEEE international conference on robotics and automation. IEEE, 2013, pp. 3748–3754.
 [18] J. Engel, J. Sturm, and D. Cremers, “Semidense visual odometry for a monocular camera,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 1449–1456.
 [19] K. Chen, R. Nemiroff, and B. T. Lopez, “Direct lidarinertial odometry: Lightweight lio with continuoustime motion correction,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 3983–3989.
 [20] Z. Wang, L. Zhang, Y. Shen, and Y. Zhou, “Dliom: Tightlycoupled direct lidarinertial odometry and mapping,” IEEE Transactions on Multimedia, 2022.
 [21] J. Zhang and S. Singh, “Laser–visual–inertial odometry and mapping with high robustness and low drift,” Journal of field robotics, vol. 35, no. 8, pp. 1242–1264, 2018.
 [22] W. Shao, S. Vijayarangan, C. Li, and G. Kantor, “Stereo visual inertial lidar simultaneous localization and mapping,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 370–377.
 [23] J. Zhang, M. Kaess, and S. Singh, “A realtime method for depth enhanced visual odometry,” Autonomous Robots, vol. 41, pp. 31–43, 2017.
 [24] J. Graeter, A. Wilczynski, and M. Lauer, “Limo: Lidarmonocular visual odometry,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2018, pp. 7872–7879.
 [25] Y. Zhu, C. Zheng, C. Yuan, X. Huang, and X. Hong, “Camvox: A lowcost and accurate lidarassisted visual slam system,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5049–5055.
 [26] S.S. Huang, Z.Y. Ma, T.J. Mu, H. Fu, and S.M. Hu, “Lidarmonocular visual odometry using point and line features,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 1091–1097.
 [27] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orbslam3: An accurate opensource library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
 [28] Y.S. Shin, Y. S. Park, and A. Kim, “Dvlslam: Sparse depth enhanced direct visuallidar slam,” Autonomous Robots, vol. 44, no. 2, pp. 115–130, 2020.
 [29] X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, “Licfusion: Lidarinertialcamera odometry,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 5848–5854.
 [30] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for fast autonomous flight,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 965–972, 2018.
 [31] X. Zuo, Y. Yang, P. Geneva, J. Lv, Y. Liu, G. Huang, and M. Pollefeys, “Licfusion 2.0: Lidarinertialcamera odometry with slidingwindow planefeature tracking,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5112–5119.
 [32] D. Wisth, M. Camurri, S. Das, and M. Fallon, “Unified multimodal landmark tracking for tightly coupled lidarvisualinertial odometry,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1004–1011, 2021.
 [33] J. Lin, C. Zheng, W. Xu, and F. Zhang, “R2live: A robust, realtime, lidarinertialvisual tightlycoupled state estimator and mapping,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7469–7476, 2021.
 [34] C. F. W. Bell B.M., “The iterated kalman filter update as a gaussnewton method,” Automatic Control IEEE Transactions, vol. 38, no. 2, pp. 294–297, 1993.
 [35] T. Shan, B. Englot, C. Ratti, and D. Rus, “Lvisam: Tightlycoupled lidarvisualinertial odometry via smoothing and mapping,” in 2021 IEEE international conference on robotics and automation (ICRA). IEEE, 2021, pp. 5692–5698.
 [36] J. Lin and F. Zhang, “R 3 live: A robust, realtime, rgbcolored, lidarinertialvisual tightlycoupled state estimation and mapping package,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 10 672–10 678.
 [37] ——, “R 3 live++: A robust, realtime, radiance reconstruction package with a tightlycoupled lidarinertialvisual state estimator,” arXiv preprint arXiv:2209.03666, 2022.
 [38] J. Engel, V. Usenko, and D. Cremers, “A photometrically calibrated benchmark for monocular visual odometry,” arXiv preprint arXiv:1607.02555, 2016.
 [39] W. Wang, J. Liu, C. Wang, B. Luo, and C. Zhang, “Dvloam: Direct visual lidar odometry and mapping,” Remote Sensing, vol. 13, no. 16, p. 3340, 2021.
 [40] Z. Yuan, Q. Wang, K. Cheng, T. Hao, and X. Yang, “Sdvloam: Semidirect visuallidar odometry and mapping,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
 [41] H. Zhang, L. Du, S. Bao, J. Yuan, and S. Ma, “Lviofusion:tightlycoupled lidarvisualinertial odometry and mapping in degenerate environments,” IEEE Robotics and Automation Letters, vol. 9, no. 4, pp. 3783–3790, 2024.
 [42] W. Xu and F. Zhang, “Fastlio: A fast, robust lidarinertial odometry package by tightlycoupled iterated kalman filter,” IEEE Robotics and Automation Letters, pp. 1–1, 2021.
 [43] D. He, W. Xu, and F. Zhang, “Symbolic representation and toolkit development of iterated errorstate extended kalman filters on manifolds,” IEEE Transactions on Industrial Electronics, 2023.
 [44] D. Willner, C.B. Chang, and K.P. Dunn, “Kalman filter algorithms for a multisensor system,” in 1976 IEEE conference on decision and control including the 15th symposium on adaptive processes. IEEE, 1976, pp. 570–574.
 [45] J. Ma and S. Sun, “Globally optimal distributed and sequential state fusion filters for multisensor systems with correlated noises,” Information Fusion, p. 101885, 2023.
 [46] Y. Ren, Y. Cai, F. Zhu, S. Liang, and F. Zhang, “Rogmap: An efficient robocentric occupancy grid map for largescene and highresolution lidarbased motion planning,” arXiv preprint arXiv:2302.14819, 2023.
 [47] R. M. Stereopsis, “Accurate, dense, and robust multiview stereopsis,” IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 32, no. 8, 2010.
 [48] S. Baker and I. Matthews, “Lucaskanade 20 years on: A unifying framework,” International journal of computer vision, vol. 56, pp. 221–255, 2004.
 [49] T.M. Nguyen, S. Yuan, M. Cao, Y. Lyu, T. H. Nguyen, and L. Xie, “Ntu viral: A visualinertialranginglidar dataset, from an aerial vehicle viewpoint,” The International Journal of Robotics Research, vol. 41, no. 3, pp. 270–280, 2022.
 [50] M. Helmberger, K. Morin, B. Berner, N. Kumar, G. Cioffi, and D. Scaramuzza, “The hilti slam challenge dataset,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7518–7525, 2022.
 [51] L. Zhang, M. Helmberger, L. F. T. Fu, D. Wisth, M. Camurri, D. Scaramuzza, and M. Fallon, “Hiltioxford dataset: A millimeteraccurate benchmark for simultaneous localization and mapping,” IEEE Robotics and Automation Letters, vol. 8, no. 1, pp. 408–415, 2022.
 [52] H. Li, Y. Zou, N. Chen, J. Lin, X. Liu, W. Xu, C. Zheng, R. Li, D. He, F. Kong, et al., “Marslvig dataset: A multisensor aerial robots slam dataset for lidarvisualinertialgnss fusion,” The International Journal of Robotics Research, p. 02783649241227968, 2024.
 [53] “Supplementary material: Fastlivo2: Fast, direct lidarinertialvisual odometry,” available online: https://github.com/hkumars/FASTLIVO2/blob/main/Supplementary/LIVO2˙supplementary.pdf.
 [54] C. Klug, C. Arth, D. Schmalstieg, and T. Gloor, “Measurement uncertainty analysis of a robotic total station simulation,” in IECON 201844th Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2018, pp. 2576–2582.
 [55] Y. Ren, F. Zhu, W. Liu, Z. Wang, Y. Lin, F. Gao, and F. Zhang, “Bubble planner: Planning highspeed smooth quadrotor trajectories using receding corridors,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 6332–6339.
 [56] G. Lu, W. Xu, and F. Zhang, “Onmanifold model predictive control for trajectory tracking on robotic systems,” IEEE Transactions on Industrial Electronics, vol. 70, no. 9, pp. 9192–9202, 2022.
 [57] I. Vizzo, T. Guadagnino, J. Behley, and C. Stachniss, “Vdbfusion: Flexible and efficient tsdf integration of range sensor data,” Sensors, vol. 22, no. 3, 2022. [Online]. Available: https://www.mdpi.com/14248220/22/3/1296
 [58] D. Cernea, “Openmvs: multiview stereo reconstruction library. 2020,” URL: https://cdcseacave. github. io/openMVS, vol. 5, no. 6, p. 7, 2020.
 [59] J. L. Schönberger and J.M. Frahm, “Structurefrommotion revisited,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4104–4113.
Supplementary Material for FASTLIVO2: Fast, Direct LiDARInertialVisual Odometry
I. System Module Validation
In this section, we validate the key modules of our system, including the affine warping, normal refinement, reference patch update, ondemanding raycasting query, exposure time estimation, and ESIKF sequential update utilizing the FASTLIVO2 private dataset and MARSLVIG dataset.
A. Evaluation of Affine Warping
In this experiment, we aim to comprehensively evaluate the various affine warping effects based on the constant depth assumption (a common technique utilized in semidense methods), the plane prior from point clouds, and the refined plane normal of our proposed system (denoted as “Constant depth”, “Plane prior”, and “Plane normal refined“, respectively). To achieve this, we compare the mapping results and drift metrics of the three methods on “CBD Building 02” and “Office Building Wall”. As depicted in Fig. S2, “Plane normal refined” delivers the most clear and accurate mapping results and the next best is “Plane prior”. Notably, “Plane normal refined” renders text and patterns on the ground and walls, as well as lane markings, with remarkable clarity. Moreover, the drift associated with “Plane prior” and “Plane normal refined” remains below $0.01\text{}\mathrm{m}$, whereas “Constant depth” does not return to the starting position, experiencing a drift of $0.22\text{}\mathrm{m}$. Such results confirm the enhanced performance of affine warping based on the plane prior and enhancements by the plane normal refinement.
Besides, we compare warped projection effects based on “Constant depth” and “Plane prior” on the sequence “CBD Building 02” and “Office Building Wall”. We randomly select several image frames from these two sequences for the qualitative analysis. For each frame, we project the reference patches attached to the visual map points visible in the frame onto a blank image of the current frame. This process yields a novel RGB image. If the affine warping and pose estimation are both performed well, areas with patch projections will produce a seamless and minimally distorted appearance, closely resembling the raw RGB image. The comparison of warped patches is presented in Fig. S1. The results indicate that the pose accuracy and warped performance under “Plane prior” significantly outperform those under “Constant depth”
.
B. Evaluation of Reference Patch Update and Normal Convergence
In this experiment, we validate the effects of the reference patch update strategy and normal convergence on “HIT Graffiti Wall” and “HKU Centennial Garden”. As shown in Fig. S3, (a) and (b) are the reconstructed point clouds of these two sequences. On the right, for each region from A to H, we respectively present five patch observations captured at different poses, each with a patch size of $40\times 40$ pixels for visualization. These patches are observed in camera frames located at corresponding numbers on the left. As can be seen, our reference path update strategy tends to choose a highresolution reference patch that faces the plane along its normal. It is also noticed that, despite of visual map points and patches generated at nonplanar locations (e.g., tree leaves, trunk, and a lamp stand), the overall mapping quality is still high.
We also evaluate the convergence of our proposed normal estimation across patches in regions A to H. Each patch is in the size of $11\times 11$ pixels. The initial normal vector is estimated from the LiDAR points. The convergence curves, which represent the angle change between the initial and optimized normal vectors at each iteration number, are shown in Fig. S3. Regions A, C, D, and E are structured areas, whereas regions B, F, G, and H are unstructured ones. It can be observed that normal vectors for structured areas converge faster (within 6 iterations) with small normal refinement (2 to 4 degrees) because the initial normal provided by point clouds is relatively accurate. In unstructured areas, such as shrubs and tree leaves (i.e., B and F), the normal refinement is significant (up to 9 degrees) and requires 9 iterations to converge. Overall, normal refinement across these 8 regions demonstrates good convergence properties.
C. Evaluation of Ondemand Raycasting
In this experiment, we assess the performance of the ondemand raycasting module under extreme conditions where the current and recent LiDAR scans have few or even no points due to LiDARs’ close proximity blind zone^{1}^{1}1https://www.livoxtech.com/avia/specs. We use the sequence “Narrow Corridor” for an indepth analysis illustrated in Fig. S4. In this sequence, we traverse an extremely narrow tunnel, approximately $1.9\text{}\mathrm{m}$ in width, and turn to face the weakly textured wall on one side. Due to the limited points in the LiDAR scans when facing the wall, we can only acquire few visual map points through voxel query (yellow dots in Fig. S4 (b)). In this case, raycasting offers sufficient visual constraints to mitigate degeneration (blue dots in Fig. S4 (b)). The visualization results demonstrate that the ondemand raycasting module works well under challenging conditions with few points in LiDAR scans.
D. Evaluation of Exposure Time Estimation
In this experiment, we validate the exposure time estimation module in two parts: 1) For the sequence with fixed exposure and gain, we multiply each pixel of the received raw image by an exposure factor that changes sinusoidally over time. We verify the effectiveness of our estimation by comparing the estimated exposure times against the sinusoidal function applied. 2) For sequences with autoexposure and either fixed or autogain settings, we evaluate the accuracy of our estimated exposure times by comparing them with the ground truth values retrieved from the camera’s API.
In part one, we test on the sequence “Retail Street”, applying an exposure factor to images with fixed exposure and gain. As shown in Fig. S5, the estimated relative inverse exposure time matches the true values very well, evidencing the convergence of our exposure estimation in synthetic conditions. In part two, we use the sequences “HKU Centennial Garden”, “HKU Cultural Garden” and “HKU Main Building”, which have significant exposure time changes, for testing. We scale the estimated relative exposure times by the first frame to recover the actual exposure time (ms) of each frame. As shown in Fig. S5, the estimated exposure time follows closely the groundtruth values, which validates the effectiveness of our exposure time estimation module. The occasional mismatches are possibly due to the unmodeled response function and vignetting factor [38].
E. Evaluation of ESIKF Sequential Update
In this experiment, we evaluate different ESIKF update strategies for LiDAR and camera states. We compare asynchronous versus synchronous updates, as well as standard versus sequential updates. Specifically, we assess three strategies: “asynchronous (standard update)”, where the camera and LiDAR states are updated at their respective sampling times without scan recombination; “synchronous (standard update)”, where LiDAR scans are recombined to sync with camera images and the state is updated with both LiDAR and camera measurements within a standard ESIKF; and “synchronous (sequential update)”, where the LiDAR and camera are synced but the state is first updated by LiDAR measurements and then updated by camera measurements. These strategies are evaluated in terms of accuracy, robustness, and efficiency using the “AMvalley03” sequence of the MARSLVIG dataset. We select this sequence for several key reasons:

(1)
This sequence includes slopes that lead to both LiDAR and visual degenerations, making it a challenging test case.

(2)
This sequence represents an extremely largescale scenario (approximately $901\text{}\mathrm{m}$$\times $$500\text{}\mathrm{m}$$\times $$130\text{}\mathrm{m}$) with longterm and highspeed data collection (covering $600\text{}\mathrm{s}$ at a speed of $12\text{}\mathrm{m}/\mathrm{s}$), where pose deviations are prone to occur (due to the longterm and highspeed conditions), and even slight drifts can cause significant blurring in the colored point clouds (due to the large scale), leading to more pronounced comparative results.

(3)
This sequence provides the RTK ground truth data, allowing for more accurate quantitative comparisons.
We compare the qualitative mapping results, quantitative APE, and the average processing time of the three update strategies. The experimental configuration is as follows: LiDAR updates involve up to 5 iterations, visual updates use a threelevel pyramid with up to 5 iterations per level and no more than 3 iterations per level when the camera and LiDAR are updated simultaneously in a standard ESIKF, and the scale normalization factor (from visual photometric error to LiDAR pointtoplane distance) for the “synchronous (standard update)” is set to 0.0032, which has been meticulously tuned for optimal performance.
Fig. S6, (ac) present the reconstructed colored point clouds for this sequence. It is evident that the “synchronous (sequential update)” strategy produces accurate mapping results, particularly in the areas highlighted by the blue and orange boxes, where the mountain roads are reconstructed without any layering. In contrast, the other two strategies exhibit misalignments in these areas, although the “synchronous (standard update)” performs slightly better than the “asynchronous (standard update)”. The superior performance of the “synchronous (sequential update)” strategy is mainly attributed to its robustness in handling significant LiDAR and visual degenerations, as seen in the white box (c3). This area features a large, textureless slope, and the UAV passes over it at a high speed, heavily relying on a strong prior. The other two methods, which rely solely on the IMU prior, struggle to compute a relatively accurate image gradient descent direction, leading to significant linearization errors.
The APE (RMSE) metrics for the “AMvalley03” sequence are $3.12\text{}\mathrm{m}$, $2.45\text{}\mathrm{m}$, and $0.68\text{}\mathrm{m}$ for the “asynchronous (standard update)”, “synchronous (standard update)”, and “synchronous (sequential update)”, respectively. The average processing times on a desktop PC (Section IXA) are approximately $27.6\text{}\mathrm{ms}$, $49.9\text{}\mathrm{ms}$, and $23.1\text{}\mathrm{ms}$. Our proposed “synchronous (sequential update)” achieves the highest efficiency and accuracy, while the “asynchronous (standard update)” has the lowest accuracy. The “synchronous (standard update)” is the most timeconsuming, primarily because it requires fusing all LiDAR measurements at each level of the image pyramid.
Overall, our proposed “synchronous (sequential update)” offers superior accuracy and efficiency, while the “asynchronous (standard update)” has the lowest accuracy, and the “synchronous (standard update)” is the most timeconsuming.