University of Southampton Institutional Repository

Real-world multi-view stereo via learning RGB-D structural consistency from depth super-resolution


Liu, Yimei, Cao, Jingchao, Fan, Hao, Dong, Junyu and Chen, Sheng (2025) Real-world multi-view stereo via learning RGB-D structural consistency from depth super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. (doi:10.1109/TCSVT.2025.3571940).

Record type: Article

Abstract

Learning-based Multi-View Stereo (MVS) methods, typically reliant on cascaded cost volume formulations, perform well on small-scale scenes. However, as the depth range of captured images becomes broader and more varied, the coarse-to-fine depth sampling process, which depends solely on feature matching, is increasingly prone to local optima. Despite recent advancements in feature representation, depth sampling patterns, and cost aggregation techniques, challenges related to model generalization and computational efficiency persist. In this paper, we propose SR-MVSNet, a novel framework that integrates multi-view feature matching and RGB-D cross-modal structural consistency learning to achieve high-quality 3D reconstruction. Our approach begins with the construction of Low-Resolution (LR) cost volumes for initial LR depth estimation, which are then enhanced to full resolution via a tailored uncertainty-aware guided depth super-resolution module. To ensure cross-view consistency, the depth maps undergo further refinement through multi-view feature matching. By avoiding high-resolution cost volume processing, our framework improves depth estimation robustness and efficiency. Additionally, we introduce an iterative depth fusion post-processing strategy during inference to improve reconstruction in ambiguous matching regions, a critical challenge for MVS methods. Experiments show that our method achieves top-3 performance on the DTU and Tanks & Temples datasets and ranks first on the ETH3D dataset. Furthermore, it uses significantly fewer GPU resources than most high-performing methods, offering a favorable trade-off between reconstruction quality and computational efficiency.
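The core idea of guided depth super-resolution in the abstract, upsampling an LR depth map so its discontinuities follow edges in the full-resolution RGB image, can be illustrated with a classical (non-learned) joint bilateral upsampling sketch. This is not the paper's uncertainty-aware module; the function name, parameters, and synthetic inputs below are illustrative assumptions only.

```python
import numpy as np

def joint_bilateral_upsample(depth_lr, rgb_hr, sigma_s=2.0, sigma_r=0.1, radius=2):
    """Upsample an LR depth map to the resolution of a guidance RGB image.

    Each HR output pixel is a weighted average of nearby LR depth samples:
    spatial weights fall off with LR-grid distance, range weights fall off
    with RGB dissimilarity, so depth edges align with colour edges.
    """
    H, W = rgb_hr.shape[:2]
    h, w = depth_lr.shape
    sy, sx = h / H, w / W  # scale from HR coordinates to LR coordinates
    out = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            cy, cx = y * sy, x * sx  # projection of (y, x) onto the LR grid
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ly = int(round(cy)) + dy
                    lx = int(round(cx)) + dx
                    if not (0 <= ly < h and 0 <= lx < w):
                        continue
                    # spatial weight, measured in LR pixel units
                    ws = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_s ** 2))
                    # range weight: RGB similarity between the HR pixel and
                    # the HR pixel nearest this LR sample
                    gy = min(int(round(ly / sy)), H - 1)
                    gx = min(int(round(lx / sx)), W - 1)
                    diff = rgb_hr[y, x] - rgb_hr[gy, gx]
                    wr = np.exp(-np.dot(diff, diff) / (2.0 * sigma_r ** 2))
                    weight = ws * wr
                    num += weight * depth_lr[ly, lx]
                    den += weight
            out[y, x] = num / den if den > 0 else 0.0
    return out
```

On a toy scene with a single depth step aligned to a colour edge, the range weights suppress LR samples from the "wrong" side, so the upsampled depth stays sharp instead of blurring across the boundary; the paper's learned module additionally weights samples by estimated depth uncertainty, which this sketch omits.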

Text
TCSVT_SR-MVS_main_final - Accepted Manuscript
Available under License Creative Commons Attribution.
Download (9MB)

More information

Accepted/In Press date: 18 May 2025
e-pub ahead of print date: 20 May 2025
Keywords: 3D reconstruction, Multi-view stereo, depth estimation, guided depth super-resolution

Identifiers

Local EPrints ID: 502668
URI: http://eprints.soton.ac.uk/id/eprint/502668
ISSN: 1558-2205
PURE UUID: 70be4338-e9fb-48d7-9070-39d64849cad6

Catalogue record

Date deposited: 04 Jul 2025 16:32
Last modified: 04 Jul 2025 16:40


Contributors

Author: Yimei Liu
Author: Jingchao Cao
Author: Hao Fan
Author: Junyu Dong
Author: Sheng Chen



Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.
