MAP-ADAPT

Real-Time Quality-Adaptive Semantic 3D Maps

European Conference on Computer Vision (ECCV) 2024

Jianhao Zheng1     Daniel Barath2     Marc Pollefeys2,3     Iro Armeni1    
1Stanford University     2ETH Zürich   3Microsoft  

Abstract

Creating 3D semantic reconstructions of environments is fundamental to many applications, especially those related to autonomous agent operation (e.g. goal-oriented navigation or object interaction and manipulation). Commonly, 3D semantic reconstruction systems capture the entire scene at the same level of detail. However, certain tasks (e.g. object interaction) require a fine-grained and high-resolution map, particularly if the objects to interact with are small or have intricate geometry. In current practice, this leads to the entire map being stored at the same high resolution, which increases computational and storage costs. To address this challenge, we propose MAP-ADAPT, a real-time method for quality-adaptive semantic 3D reconstruction from RGBD frames. MAP-ADAPT is the first adaptive semantic 3D mapping algorithm that, unlike prior work, directly generates a single map with regions of different quality based on both the semantic information and the geometric complexity of the scene. Leveraging a semantic SLAM pipeline for pose and semantic estimation, we achieve comparable or superior results to state-of-the-art methods on synthetic and real-world data, while significantly reducing storage and computation requirements.

Video

Method

MAP-ADAPT: (a) Given RGBD frames, we estimate (b-i) semantic segmentation and (b-iv) camera pose, and compute (b-ii) geometric complexity. (c-i) We integrate the geometric and semantic information (b-iii) into the TSDF voxel map. The geometric complexity and the semantic label determine the voxel size of each region of the map. (c-ii) shows the multi-resolution mesh output, and (c-iii) shows the adaptive structure we use.
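To make the voxel-size decision concrete, here is a minimal sketch of how a region's quality level could be derived from its semantic label and geometric complexity, with the finer of the two cues winning. This is not the authors' implementation; the class sets, thresholds, and voxel sizes below are illustrative assumptions.

```python
# Illustrative quality levels: voxel edge lengths in meters (assumed values).
COARSE, MEDIUM, FINE = 0.08, 0.04, 0.01

# Hypothetical class sets; in practice these would come from the task at hand
# (e.g. small or intricate objects an agent must interact with).
HIGH_QUALITY_CLASSES = {"cup", "handle", "keyboard"}
MEDIUM_QUALITY_CLASSES = {"chair", "table"}

def voxel_size(semantic_label: str, geometric_complexity: float,
               complexity_thresholds=(0.2, 0.6)) -> float:
    """Pick the voxel size for a map region from two cues.

    The semantic cue requests finer voxels for object classes that need
    detailed geometry; the geometric cue requests finer voxels where the
    local surface complexity is high. The final size is the finer
    (smaller) of the two requests.
    """
    # Semantic cue.
    if semantic_label in HIGH_QUALITY_CLASSES:
        sem_size = FINE
    elif semantic_label in MEDIUM_QUALITY_CLASSES:
        sem_size = MEDIUM
    else:
        sem_size = COARSE

    # Geometric cue: map a scalar complexity score to a quality level.
    low, high = complexity_thresholds
    if geometric_complexity >= high:
        geo_size = FINE
    elif geometric_complexity >= low:
        geo_size = MEDIUM
    else:
        geo_size = COARSE

    return min(sem_size, geo_size)
```

For example, a flat wall with low complexity would be integrated at the coarse resolution, while a cup, or any region with high geometric complexity, would be integrated at the fine resolution, which is what yields the storage savings reported below.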

Qualitative Results

HSSD Dataset

Overall Reconstruction

Fixed-size (1cm) [819 MB] MAP-ADAPT-S (Ours) [297 MB]

Masks of semantics per quality level (GT)

Fixed-size (1cm) Multi-TSDFs MAP-ADAPT-S (Ours)

ScanNet Dataset

Overall Reconstruction

Fixed-size (1cm) [463 MB] MAP-ADAPT-S (Ours) [132 MB]

Masks of semantics per quality level (EST)

Fixed-size (1cm) Multi-TSDFs MAP-ADAPT-S (Ours)

The ScanNet dataset does not provide semantic masks for the entire scene. The semantic masks in the above visualization come from each method's estimated semantic segmentation. No mask is shown for Multi-TSDFs, since multiple submaps with different semantic labels may occupy the same spatial regions. Refer to the masks of the fixed-size method or ours for an indication of which regions should be reconstructed at which quality level.

Additional Results

The top example is from the HSSD dataset and the bottom one from ScanNet. Geometric and completion errors are shown as heatmaps; the darker the color, the closer the result is to the GT geometry. For the semantic map, results are colorized per quality level; different semantic classes within the same quality level range from brighter to darker shades. Another heatmap shows the estimated geometric complexity. We highlight regions that are classified as high-quality semantics (red block) or have large geometric variance (orange block).

BibTeX

@inproceedings{zheng2024map,
      title={MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps},
      author={Zheng, Jianhao and Barath, Daniel and Pollefeys, Marc and Armeni, Iro},
      booktitle={European Conference on Computer Vision},
      year={2024},
      organization={Springer}
    }