University of Southampton Institutional Repository

Real-world multi-view stereo via learning RGB-D structural consistency from depth super-resolution


Liu, Yimei, Cao, Jingchao, Fan, Hao, Dong, Junyu and Chen, Sheng (2025) Real-world multi-view stereo via learning RGB-D structural consistency from depth super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. (doi:10.1109/TCSVT.2025.3571940).

Record type: Article

Abstract

Learning-based Multi-View Stereo (MVS) methods, typically reliant on cascaded cost volume formulations, perform well on small-scale scenes. However, as the depth range of captured images becomes broader and more varied, the coarse-to-fine depth sampling process, which depends solely on feature matching, is increasingly prone to local optima. Despite recent advancements in feature representation, depth sampling patterns, and cost aggregation techniques, challenges related to model generalization and computational efficiency persist. In this paper, we propose SR-MVSNet, a novel framework that integrates multi-view feature matching and RGB-D cross-modal structural consistency learning to achieve high-quality 3D reconstruction. Our approach begins with the construction of Low-Resolution (LR) cost volumes for initial LR depth estimation, which are then enhanced to full resolution via a tailored uncertainty-aware guided depth super-resolution module. To ensure cross-view consistency, the depth maps undergo further refinement through multi-view feature matching. By avoiding high-resolution cost volume processing, our framework improves depth estimation robustness and efficiency. Additionally, we introduce an iterative depth fusion post-processing strategy during inference to improve reconstruction in ambiguous matching regions, a critical challenge for MVS methods. Experiments show that our method achieves top-3 performance on the DTU and Tanks & Temples datasets and ranks first on the ETH3D dataset. Furthermore, it uses significantly fewer GPU resources than most high-performing methods, offering a favorable trade-off between reconstruction quality and computational efficiency.
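The core idea of guided depth super-resolution in the abstract, upsampling an LR depth map so its discontinuities follow edges in the full-resolution RGB image, can be illustrated with a classical (non-learned) joint bilateral upsampling sketch. This is not the paper's uncertainty-aware module; the function name, parameters, and synthetic inputs below are illustrative assumptions only.

```python
import numpy as np

def joint_bilateral_upsample(depth_lr, rgb_hr, sigma_s=2.0, sigma_r=0.1, radius=2):
    """Upsample an LR depth map to the resolution of a guidance RGB image.

    Each HR output pixel is a weighted average of nearby LR depth samples:
    spatial weights fall off with LR-grid distance, range weights fall off
    with RGB dissimilarity, so depth edges align with colour edges.
    """
    H, W = rgb_hr.shape[:2]
    h, w = depth_lr.shape
    sy, sx = h / H, w / W  # scale from HR coordinates to LR coordinates
    out = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            cy, cx = y * sy, x * sx  # projection of (y, x) onto the LR grid
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ly = int(round(cy)) + dy
                    lx = int(round(cx)) + dx
                    if not (0 <= ly < h and 0 <= lx < w):
                        continue
                    # spatial weight, measured in LR pixel units
                    ws = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_s ** 2))
                    # range weight: RGB similarity between the HR pixel and
                    # the HR pixel nearest this LR sample
                    gy = min(int(round(ly / sy)), H - 1)
                    gx = min(int(round(lx / sx)), W - 1)
                    diff = rgb_hr[y, x] - rgb_hr[gy, gx]
                    wr = np.exp(-np.dot(diff, diff) / (2.0 * sigma_r ** 2))
                    weight = ws * wr
                    num += weight * depth_lr[ly, lx]
                    den += weight
            out[y, x] = num / den if den > 0 else 0.0
    return out
```

On a toy scene with a single depth step aligned to a colour edge, the range weights suppress LR samples from the "wrong" side, so the upsampled depth stays sharp instead of blurring across the boundary; the paper's learned module additionally weights samples by estimated depth uncertainty, which this sketch omits.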

Text
TCSVT_SR-MVS_main_final - Accepted Manuscript
Available under License Creative Commons Attribution.
Download (9MB)

More information

Accepted/In Press date: 18 May 2025
e-pub ahead of print date: 20 May 2025
Keywords: 3D reconstruction, Multi-view stereo, depth estimation, guided depth super-resolution

Identifiers

Local EPrints ID: 502668
URI: http://eprints.soton.ac.uk/id/eprint/502668
ISSN: 1558-2205
PURE UUID: 70be4338-e9fb-48d7-9070-39d64849cad6

Catalogue record

Date deposited: 04 Jul 2025 16:32
Last modified: 04 Jul 2025 16:40


Contributors

Author: Yimei Liu
Author: Jingchao Cao
Author: Hao Fan
Author: Junyu Dong
Author: Sheng Chen



Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.
