Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM

University College London

CVPR 2023

Co-SLAM is a real-time SLAM system that achieves accurate, high-fidelity scene reconstruction and completion with efficient memory usage.

Video


Abstract

We present Co-SLAM, a real-time RGB-D SLAM system based on a neural implicit representation that performs robust camera tracking and high-fidelity surface reconstruction.

Co-SLAM represents the scene as a multi-resolution hash grid to exploit its extremely fast convergence and its ability to represent high-frequency local features. In addition, to incorporate surface coherence priors, Co-SLAM adds a one-blob encoding, which we show enables powerful scene completion in unobserved areas. Our joint encoding brings the best of both worlds to Co-SLAM: speed, high-fidelity reconstruction, and surface coherence priors, enabling robust real-time online performance. Moreover, our ray sampling strategy allows Co-SLAM to perform global bundle adjustment over all keyframes, rather than maintaining a small set of actively selected keyframes as competing neural SLAM approaches do.
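
The joint encoding is compact enough to sketch directly. Below is a minimal, illustrative implementation, not the released Co-SLAM code: it assumes the tiny-cuda-nn (tcnn) bindings for the hash-grid and one-blob encodings (tcnn requires a CUDA device), and all hyper-parameters, layer sizes, and names are placeholders chosen for readability.

    # Minimal sketch of a joint coordinate + sparse parametric encoding feeding
    # two shallow MLPs (SDF + colour). Illustrative only; hyper-parameters and
    # layer sizes are assumptions, not the authors' configuration.
    import torch
    import torch.nn as nn
    import tinycudann as tcnn  # requires a CUDA-capable GPU

    class JointEncodingField(nn.Module):
        def __init__(self, hidden=32, geo_feat_dim=15):
            super().__init__()
            # Sparse parametric encoding: multi-resolution hash grid
            # (fast convergence, high-frequency local detail).
            self.hash_grid = tcnn.Encoding(
                n_input_dims=3,
                encoding_config={
                    "otype": "HashGrid", "n_levels": 16,
                    "n_features_per_level": 2, "log2_hashmap_size": 19,
                    "base_resolution": 16, "per_level_scale": 2.0,
                })
            # Coordinate encoding: one-blob encoding of the (normalised)
            # position, providing the smoothness / coherence prior that
            # supports hole filling in unobserved areas.
            self.one_blob = tcnn.Encoding(
                n_input_dims=3,
                encoding_config={"otype": "OneBlob", "n_bins": 16})
            in_dim = self.hash_grid.n_output_dims + self.one_blob.n_output_dims
            # Two shallow MLPs: geometry (SDF + feature vector) and colour.
            self.sdf_mlp = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1 + geo_feat_dim))
            self.color_mlp = nn.Sequential(
                nn.Linear(self.one_blob.n_output_dims + geo_feat_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 3), nn.Sigmoid())

        def forward(self, x):
            # x: (N, 3) sample points normalised to the unit cube [0, 1]^3.
            h_grid = self.hash_grid(x).float()
            h_coord = self.one_blob(x).float()
            h = torch.cat([h_grid, h_coord], dim=-1)
            out = self.sdf_mlp(h)
            sdf, geo_feat = out[..., :1], out[..., 1:]
            rgb = self.color_mlp(torch.cat([h_coord, geo_feat], dim=-1))
            return sdf, rgb

The concatenation is what we refer to as the joint encoding: the hash-grid branch contributes speed and high-frequency detail, while the one-blob branch contributes the coherence prior that drives completion.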

Experimental results show that Co-SLAM runs at 10Hz and achieves state-of-the-art scene reconstruction and competitive tracking performance on a variety of datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGB-D).


Method



Co-SLAM consists of three major parts: 1) The scene representation maps an input position to colour and SDF values using a joint coordinate and sparse parametric encoding followed by two shallow MLPs. 2) Tracking estimates the camera pose of each new frame by minimising our objective functions with respect to the learnable camera parameters. 3) Mapping uses the selected pixels together with their tracked poses to perform bundle adjustment, jointly optimising the scene representation and the camera poses by minimising our objective functions with respect to both.
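
Tracking and mapping are both gradient-based minimisations of the same objective. The sketch below illustrates the tracking idea only and is not the authors' code: it assumes a generic objective(pose) closure (an illustrative name) that samples pixels in the current frame, renders them through the frozen scene representation, and returns the combined colour/depth/SDF loss. Mapping runs the analogous optimisation, but also updates the scene representation and the poses of all keyframes from which rays are sampled.

    # Minimal tracking sketch: minimise the objective w.r.t. a small learnable
    # pose correction (axis-angle rotation + translation) applied on top of the
    # previous-frame pose. `objective` is an assumed, illustrative closure.
    import torch

    def axis_angle_to_matrix(w):
        # Rodrigues' formula: axis-angle (3,) -> rotation matrix (3, 3),
        # built differentiably so gradients reach the pose correction.
        theta = w.norm() + 1e-8
        k = w / theta
        zero = torch.zeros((), dtype=w.dtype, device=w.device)
        K = torch.stack([
            torch.stack([zero, -k[2], k[1]]),
            torch.stack([k[2], zero, -k[0]]),
            torch.stack([-k[1], k[0], zero])])
        I = torch.eye(3, dtype=w.dtype, device=w.device)
        return I + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

    def pose_from_delta(init_pose, dw, dt):
        # Compose the initial 4x4 camera-to-world pose with the correction.
        R = axis_angle_to_matrix(dw)
        bottom = torch.tensor([[0., 0., 0., 1.]],
                              dtype=dw.dtype, device=dw.device)
        delta = torch.cat([torch.cat([R, dt[:, None]], dim=1), bottom], dim=0)
        return init_pose @ delta

    def track_frame(objective, init_pose, iters=10, lr=1e-3):
        dw = torch.zeros(3, device=init_pose.device, requires_grad=True)
        dt = torch.zeros(3, device=init_pose.device, requires_grad=True)
        opt = torch.optim.Adam([dw, dt], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            loss = objective(pose_from_delta(init_pose, dw, dt))
            loss.backward()
            opt.step()
        with torch.no_grad():
            return pose_from_delta(init_pose, dw, dt)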

Visualization

Qualitative comparison


Reference
iMAP
NICE-SLAM
Co-SLAM (Ours)

Tracking and Mapping


NICE-SLAM (1Hz)

Ours (17Hz)

(Black and red lines show the ground-truth and predicted camera trajectories, respectively.)


ScanNet Dataset

NICE-SLAM (1Hz)

Ours (12Hz)


NICE-SLAM Apartment

NICE-SLAM (1Hz)

Ours (12Hz)



Mesh viewer

NICE-SLAM (1Hz)

Ours (12Hz)


Comparison with ESLAM [CVPR'23]

ESLAM

Ours


Compared to ESLAM, our method performs online incremental mapping without forgetting previously reconstructed regions, which makes it scalable to larger scenes.


BibTeX


      @article{wang2023coslam,
        title={Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM},
        author={Wang, Hengyi and Wang, Jingwen and Agapito, Lourdes},
        journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        year={2023}
      }
    

Acknowledgement

Research presented here has been supported by the UCL Centre for Doctoral Training in Foundational AI under UKRI grant number EP/S021566/1. This project made use of time on the Tier 2 HPC facility JADE2, funded by EPSRC (EP/T022205/1). Hengyi Wang was supported by a sponsored research award from Cisco Research. We thank Edgar Sucar for providing the iMAP meshes and Zihan Zhu for providing additional details of NICE-SLAM.