Real-world multi-view stereo via learning RGB-D structural consistency from depth super-resolution
Learning-based Multi-View Stereo (MVS) methods, typically reliant on cascaded cost volume formulations, perform well on small-scale scenes. However, as the depth range of captured images becomes broader and more varied, the coarse-to-fine depth sampling process, which depends solely on feature matching, is increasingly prone to local optima. Despite recent advancements in feature representation, depth sampling patterns, and cost aggregation techniques, challenges related to model generalization and computational efficiency persist. In this paper, we propose SR-MVSNet, a novel framework that integrates multi-view feature matching and RGB-D cross-modal structural consistency learning to achieve high-quality 3D reconstruction. Our approach begins with the construction of Low-Resolution (LR) cost volumes for initial LR depth estimation, which are then enhanced to full resolution via a tailored uncertainty-aware guided depth super-resolution module. To ensure cross-view consistency, the depth maps undergo further refinement through multi-view feature matching. By avoiding high-resolution cost volume processing, our framework improves depth estimation robustness and efficiency. Additionally, we introduce an iterative depth fusion post-processing strategy during inference to improve reconstruction in ambiguous matching regions, a critical challenge for MVS methods. Experiments show that our method achieves top-3 performance on the DTU and Tanks & Temples datasets and ranks first on the ETH3D dataset. Furthermore, it uses significantly fewer GPU resources than most high-performing methods, offering a favorable trade-off between reconstruction quality and computational efficiency.
3D reconstruction, Multi-view stereo, depth estimation, guided depth super-resolution
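The abstract's central idea — lifting a low-resolution depth estimate to full resolution under RGB guidance, while down-weighting uncertain depth samples — can be illustrated generically. The sketch below is a minimal joint-bilateral upsampling pass in NumPy; it is not the paper's learned super-resolution module, and the function name, window size, Gaussian kernels, and confidence weighting are illustrative assumptions only.

```python
import numpy as np

def guided_upsample(depth_lr, conf_lr, guide_hr, scale, radius=2,
                    sigma_s=1.0, sigma_r=0.1):
    """Joint-bilateral upsampling of a low-resolution depth map.

    depth_lr : (h, w) low-resolution depth
    conf_lr  : (h, w) confidence in [0, 1], e.g. 1 - predicted uncertainty
    guide_hr : (H, W) grayscale high-resolution guidance image, H = h*scale
    """
    H, W = guide_hr.shape
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            cy, cx = y / scale, x / scale          # position in LR grid
            g = guide_hr[y, x]
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ly = int(round(cy)) + dy
                    lx = int(round(cx)) + dx
                    if not (0 <= ly < depth_lr.shape[0]
                            and 0 <= lx < depth_lr.shape[1]):
                        continue
                    # spatial closeness, measured in LR coordinates
                    ws = np.exp(-((ly - cy) ** 2 + (lx - cx) ** 2)
                                / (2 * sigma_s ** 2))
                    # photometric similarity against the target HR guide pixel
                    g_l = guide_hr[min(int(ly * scale), H - 1),
                                   min(int(lx * scale), W - 1)]
                    wr = np.exp(-((g_l - g) ** 2) / (2 * sigma_r ** 2))
                    # confidence term down-weights uncertain LR samples
                    w = ws * wr * conf_lr[ly, lx]
                    num += w * depth_lr[ly, lx]
                    den += w
            out[y, x] = num / den if den > 0 else depth_lr[int(cy), int(cx)]
    return out
```

The guidance-similarity term keeps depth discontinuities aligned with RGB edges, which is the structural-consistency intuition the abstract describes; a learned module would replace these hand-set kernels with predicted weights.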
Liu, Yimei, Cao, Jingchao, Fan, Hao, Dong, Junyu and Chen, Sheng (2025) Real-world multi-view stereo via learning RGB-D structural consistency from depth super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. (doi:10.1109/TCSVT.2025.3571940).
Text: TCSVT_SR-MVS_main_final - Accepted Manuscript
More information
Accepted/In Press date: 18 May 2025
e-pub ahead of print date: 20 May 2025
Identifiers
Local EPrints ID: 502668
URI: http://eprints.soton.ac.uk/id/eprint/502668
ISSN: 1558-2205
PURE UUID: 70be4338-e9fb-48d7-9070-39d64849cad6
Catalogue record
Date deposited: 04 Jul 2025 16:32
Last modified: 04 Jul 2025 16:40
Contributors
Author: Yimei Liu
Author: Jingchao Cao
Author: Hao Fan
Author: Junyu Dong
Author: Sheng Chen