AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend


Department of Computer Science,
University College London

ArXiv 2025


Abstract

We present AMB3R, a multi-view feed-forward model for dense, metric-scale 3D reconstruction that addresses diverse 3D vision tasks. The key idea is to leverage a sparse yet compact volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose estimation, depth estimation, metric-scale estimation, and 3D reconstruction, and even surpasses optimization-based SLAM and SfM with dense reconstruction priors on common benchmarks.


TL;DR: 1) Spatial representations matter for feed-forward reconstruction; 2) A multi-view transformer by itself can serve as a feed-forward VO/SfM system without task-specific fine-tuning or test-time optimization.


AMB3R (Scene Representations)


AMB3R consists of a front-end that predicts pointmaps and geometric features, and a back-end that fuses them into sparse voxels, which are serialized into a 1D sequence, processed by a transformer, and unserialized back to 3D. Per-pixel features are obtained via KNN interpolation and injected into the frozen front-end via zero-convolution for final prediction.
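The backend pipeline described above (pool pointmap features into sparse voxels, serialize them into a 1D sequence, then scatter features back to pixels via KNN interpolation) can be sketched in numpy. This is an illustrative sketch only: the pooling, ordering, and interpolation functions are simplified stand-ins (mean pooling, lexicographic ordering instead of a learned serialization curve, inverse-distance KNN weights), not the paper's actual implementation.

```python
import numpy as np

def voxelize(points, feats, voxel_size=0.05):
    """Pool per-pixel features into sparse voxels (mean pooling, an assumed choice)."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)  # guard against numpy versions returning 2-D inverse
    pooled = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(pooled, inv, feats)          # scatter-add features per voxel
    counts = np.bincount(inv, minlength=len(uniq)).astype(float)
    pooled /= counts[:, None]
    return uniq, pooled

def serialize_order(voxels):
    """Order sparse voxels into a 1D sequence. Lexicographic z-y-x order is a
    stand-in for a space-filling curve (e.g. Z-order/Hilbert)."""
    return np.lexsort((voxels[:, 0], voxels[:, 1], voxels[:, 2]))

def knn_interpolate(query, voxel_centers, voxel_feats, k=3):
    """Gather per-pixel features from the k nearest voxels,
    inverse-distance weighted."""
    d = np.linalg.norm(query[:, None, :] - voxel_centers[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + 1e-8)
    w /= w.sum(axis=1, keepdims=True)
    return (voxel_feats[idx] * w[..., None]).sum(axis=1)
```

In the full model, the serialized voxel sequence would be processed by a transformer before deserialization, and the interpolated per-pixel features are injected into the frozen front-end via zero-convolution.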


AMB3R (VO)

No test-time optimization, no task-specific fine-tuning


Input frames are mapped with keyframes stored in the active keyframe memory to predict camera poses and geometry. Coordinate alignment is done by 1) transforming the active keyframe map from global to local coordinates; 2) estimating the relative scale of the corresponding keyframe geometry; and 3) transforming the local map back to global coordinates via the weighted average of the relative poses of each corresponding keyframe. We then select new keyframes from the newly mapped frames and update the global keyframe memory. If the active keyframe memory has not reached its capacity, we append the new keyframe; otherwise, we refresh the entire active keyframe memory by resampling from the global keyframe memory.
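The three alignment steps and the keyframe-memory policy above can be sketched as follows. All function names and specific choices here (median depth ratio for scale, normalized weighted quaternion sum for pose averaging, uniform resampling from global memory) are illustrative assumptions, not the paper's actual procedure; the quaternion mean in particular is only a reasonable approximation when the averaged rotations are close to one another.

```python
import numpy as np

def relative_scale(local_depths, keyframe_depths):
    """Robust relative-scale estimate: median ratio of corresponding depths
    (an assumed estimator, chosen for outlier robustness)."""
    return np.median(np.asarray(keyframe_depths) / np.asarray(local_depths))

def fuse_relative_poses(quats, trans, weights):
    """Weighted average of relative poses (rotation as unit quaternion wxyz)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    quats = np.asarray(quats, dtype=float).copy()
    # resolve the q / -q sign ambiguity against the first quaternion
    quats *= np.sign(quats @ quats[0])[:, None]
    q = (quats * w[:, None]).sum(axis=0)
    q /= np.linalg.norm(q)
    t = (np.asarray(trans, dtype=float) * w[:, None]).sum(axis=0)
    return q, t

def update_keyframe_memory(active, global_mem, new_kf, capacity, rng):
    """Append to global memory; append to active memory while below capacity,
    otherwise refresh the active memory by resampling from the global memory."""
    global_mem = global_mem + [new_kf]
    if len(active) < capacity:
        active = active + [new_kf]
    else:
        idx = rng.choice(len(global_mem), size=capacity, replace=False)
        active = [global_mem[i] for i in idx]
    return active, global_mem
```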

Demo scenes: sf_gold, room, taylor, office, pumpkin, ethshake.

AMB3R (SfM)

No test-time optimization, no task-specific fine-tuning


Our SfM consists of three main stages: 1) image clustering, which groups images into small clusters; 2) coarse registration, which registers each cluster incrementally; and 3) global mapping, which refines keyframes and non-keyframes via mapping.
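The first stage, grouping images into small clusters, can be illustrated with a minimal sketch. This greedy cosine-similarity grouping over image embeddings is an assumed stand-in for the paper's clustering method; the function name, similarity threshold, and size cap are all hypothetical.

```python
import numpy as np

def cluster_images(embeddings, max_cluster_size=4, sim_threshold=0.8):
    """Greedily group images into small clusters by cosine similarity of
    their embeddings (illustrative; not the paper's actual clustering)."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T
    unassigned = set(range(len(e)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        members = [seed]
        # attach the most similar unassigned images, up to the size cap
        for j in sorted(unassigned, key=lambda j: -sim[seed, j]):
            if len(members) >= max_cluster_size:
                break
            if sim[seed, j] >= sim_threshold:
                members.append(j)
                unassigned.remove(j)
        clusters.append(members)
    return clusters
```

Each resulting cluster would then be registered incrementally (stage 2) before the global mapping pass (stage 3) refines all frames.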

Demo scenes: museum, truck, south, mipnerf.

BibTeX



Acknowledgement

Research presented here has been supported by the UCL Centre for Doctoral Training in Foundational AI under UKRI grant number EP/S021566/1. This project was also supported by the UKRI/EPSRC AI Hub in Generative Models under grant number EP/Y028805/1. Hengyi Wang was supported by a sponsored research award from Cisco Research. The page design was inspired by Nerfies, Gaussian Splatting SLAM, and World Models.