AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend


Department of Computer Science,
University College London

ArXiv 2025


Abstract

We present AMB3R, a multi-view feed-forward model for dense, metric-scale 3D reconstruction that addresses diverse 3D vision tasks. The key idea is to leverage a sparse yet compact volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose estimation, depth estimation, metric-scale estimation, and 3D reconstruction, and even surpasses optimization-based SLAM and SfM with dense reconstruction priors on common benchmarks.


TL;DR: 1) Spatial representations matter for feed-forward reconstruction; 2) A multi-view transformer by itself can serve as a feed-forward VO/SfM system without task-specific fine-tuning or test-time optimization.


AMB3R (Scene Representations)


AMB3R consists of a front-end that predicts pointmaps and geometric features, and a back-end that fuses them into sparse voxels, which are serialized into a 1D sequence, processed by a transformer, and unserialized back to 3D. Per-pixel features are obtained via KNN interpolation and injected into the frozen front-end via zero-convolution for final prediction.
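The backend pipeline described above (pool pointmap features into sparse voxels, serialize them into a 1D sequence, then scatter features back to pixels via KNN interpolation) can be sketched in numpy. This is an illustrative sketch only: the pooling, ordering, and interpolation functions are simplified stand-ins (mean pooling, lexicographic ordering instead of a learned serialization curve, inverse-distance KNN weights), not the paper's actual implementation.

```python
import numpy as np

def voxelize(points, feats, voxel_size=0.05):
    """Pool per-pixel features into sparse voxels (mean pooling, an assumed choice)."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)  # guard against numpy versions returning 2-D inverse
    pooled = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(pooled, inv, feats)          # scatter-add features per voxel
    counts = np.bincount(inv, minlength=len(uniq)).astype(float)
    pooled /= counts[:, None]
    return uniq, pooled

def serialize_order(voxels):
    """Order sparse voxels into a 1D sequence. Lexicographic z-y-x order is a
    stand-in for a space-filling curve (e.g. Z-order/Hilbert)."""
    return np.lexsort((voxels[:, 0], voxels[:, 1], voxels[:, 2]))

def knn_interpolate(query, voxel_centers, voxel_feats, k=3):
    """Gather per-pixel features from the k nearest voxels,
    inverse-distance weighted."""
    d = np.linalg.norm(query[:, None, :] - voxel_centers[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + 1e-8)
    w /= w.sum(axis=1, keepdims=True)
    return (voxel_feats[idx] * w[..., None]).sum(axis=1)
```

In the full model, the serialized voxel sequence would be processed by a transformer before deserialization, and the interpolated per-pixel features are injected into the frozen front-end via zero-convolution.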


AMB3R (VO)

No test-time optimization, no task-specific fine-tuning


Input frames are mapped with keyframes stored in the active keyframe memory to predict camera poses and geometry. Coordinate alignment is done by 1) transforming the active keyframe map from global to local coordinates; 2) estimating the relative scale of the corresponding keyframe geometry; and 3) transforming the local map back to global coordinates via the weighted average of the relative poses of each corresponding keyframe. We then select new keyframes from the newly mapped frames and update the global keyframe memory. If the active keyframe memory has not reached its capacity, we append the new keyframe; otherwise, we refresh the entire active keyframe memory by resampling from the global keyframe memory.
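The three alignment steps and the keyframe-memory policy above can be sketched as follows. All function names and specific choices here (median depth ratio for scale, normalized weighted quaternion sum for pose averaging, uniform resampling from global memory) are illustrative assumptions, not the paper's actual procedure; the quaternion mean in particular is only a reasonable approximation when the averaged rotations are close to one another.

```python
import numpy as np

def relative_scale(local_depths, keyframe_depths):
    """Robust relative-scale estimate: median ratio of corresponding depths
    (an assumed estimator, chosen for outlier robustness)."""
    return np.median(np.asarray(keyframe_depths) / np.asarray(local_depths))

def fuse_relative_poses(quats, trans, weights):
    """Weighted average of relative poses (rotation as unit quaternion wxyz)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    quats = np.asarray(quats, dtype=float).copy()
    # resolve the q / -q sign ambiguity against the first quaternion
    quats *= np.sign(quats @ quats[0])[:, None]
    q = (quats * w[:, None]).sum(axis=0)
    q /= np.linalg.norm(q)
    t = (np.asarray(trans, dtype=float) * w[:, None]).sum(axis=0)
    return q, t

def update_keyframe_memory(active, global_mem, new_kf, capacity, rng):
    """Append to global memory; append to active memory while below capacity,
    otherwise refresh the active memory by resampling from the global memory."""
    global_mem = global_mem + [new_kf]
    if len(active) < capacity:
        active = active + [new_kf]
    else:
        idx = rng.choice(len(global_mem), size=capacity, replace=False)
        active = [global_mem[i] for i in idx]
    return active, global_mem
```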

Demo scenes: sf_gold, room, taylor, office, pumpkin, ethshake.

AMB3R (SfM)

No test-time optimization, no task-specific fine-tuning


Our SfM consists of three main stages: 1) image clustering, which groups images into small clusters; 2) coarse registration, which registers each cluster incrementally; and 3) global mapping, which refines keyframes and non-keyframes via mapping.
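The first stage, grouping images into small clusters, can be illustrated with a minimal sketch. This greedy cosine-similarity grouping over image embeddings is an assumed stand-in for the paper's clustering method; the function name, similarity threshold, and size cap are all hypothetical.

```python
import numpy as np

def cluster_images(embeddings, max_cluster_size=4, sim_threshold=0.8):
    """Greedily group images into small clusters by cosine similarity of
    their embeddings (illustrative; not the paper's actual clustering)."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T
    unassigned = set(range(len(e)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        members = [seed]
        # attach the most similar unassigned images, up to the size cap
        for j in sorted(unassigned, key=lambda j: -sim[seed, j]):
            if len(members) >= max_cluster_size:
                break
            if sim[seed, j] >= sim_threshold:
                members.append(j)
                unassigned.remove(j)
        clusters.append(members)
    return clusters
```

Each resulting cluster would then be registered incrementally (stage 2) before the global mapping pass (stage 3) refines all frames.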

Demo scenes: museum, truck, south, mipnerf.

BibTeX



Acknowledgement

Research presented here has been supported by the UCL Centre for Doctoral Training in Foundational AI under UKRI grant number EP/S021566/1. This project was also supported by the UKRI/EPSRC AI Hub in Generative Models under grant number EP/Y028805/1. Hengyi Wang was supported by a sponsored research award from Cisco Research. The page design was inspired by Nerfies, Gaussian Splatting SLAM, and World Models.