3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image
Alawadh, Mona (2025) 3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image. Doctoral Thesis, University of Southampton, 156pp.
Record type: Thesis (Doctoral)
Abstract
In this research, we propose a novel method for generating an audio-visual scene in 3D virtual space from a single panoramic RGB-D input. Our investigation begins with the reconstruction of a 3D model from RGB panoramic data alone, developing a semantic geometry model that combines estimated monocular depth with material information for spatial sound rendering. Building upon these preliminary results, we extend our approach to construct a comprehensive virtual reality (VR) environment from 360° RGB-D input. The proposed method enables the creation of an immersive VR space by generating a complete 3D voxelized model that incorporates scene semantics from a single panoramic input.

Our methodology employs a deep 3D convolutional neural network integrated with transfer learning for RGB semantic features, coupled with a re-weighting strategy in the 3D weighted cross-entropy loss function. The proposed re-weighting method uniquely combines two class re-balancing techniques (re-sampling and class-sensitive learning) while smoothing the weights through an unsupervised clustering algorithm. This approach addresses critical challenges in semantic scene completion (SSC), including the inherent class imbalance in indoor 3D spatial representations. Furthermore, we quantify the performance uncertainty in our results to ensure an unbiased assessment across trials, contributing to more reliable benchmarking in the SSC field.

We design a hybrid architecture featuring a dual-head model that processes RGB and depth data simultaneously. Depth information is encoded using a Flipped Truncated Signed Distance Function (F-TSDF), capturing essential geometric shape characteristics, while RGB features are projected from 2D to 3D space using depth maps. We explored early, middle, and late RGB semantics fusion strategies and, based on performance evaluations using K-fold cross-validation, selected the late fusion approach. This method downsamples features using planar convolutions to match the 3D resolution, then fuses RGB semantic features with geometric information through element-wise addition. The hybrid encoder-decoder architecture incorporates an Identity Transformation within a full pre-activation Residual Module (ITRM), enabling effective handling of the diverse signals within the F-TSDF representation.

The inference methodology of the proposed SSC model is extended to accommodate 360° RGB-D input through cubic projection and 3D rotation, enabling VR space design with comprehensive spatial coverage. We propose a streamlined, computer-vision-based approach capable of reconstructing a 3D SSC model from a single panoramic input, facilitating plausible sound environment simulation. The proposed method also reduces the complexity of estimating room impulse responses (RIRs), which typically require extensive equipment and multiple recordings in real space. We implement the audio-visual VR reconstructions in the Unity 3D game engine combined with the Steam Audio plug-in for spatial sound rendering, and evaluate acoustic properties by measuring parameters such as early decay time (EDT) and reverberation time (RT60).
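The abstract names the ingredients of the re-weighting scheme (re-sampling, class-sensitive learning, and clustering-based smoothing) but not its exact formulation. The following is an illustrative Python sketch only: inverse-frequency weights stand in for the class-sensitive term, and k-means over log-frequencies stands in for the unsupervised smoother; both choices are assumptions, not the thesis' definitive recipe.

import numpy as np
from sklearn.cluster import KMeans

def clustered_class_weights(voxel_counts, n_clusters=4):
    """Illustrative per-class weights for a 3D weighted cross-entropy loss.

    Starts from inverse-frequency weights (a standard class-sensitive
    term), then groups classes with similar log-frequencies by k-means
    and replaces each weight with its cluster mean, smoothing out
    extreme weights for the rarest classes.
    """
    freq = voxel_counts / voxel_counts.sum()                 # class frequency over voxels
    weights = 1.0 / np.maximum(freq, 1e-8)                   # class-sensitive (inverse-frequency) term
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        np.log(np.maximum(freq, 1e-8)).reshape(-1, 1))       # cluster classes by frequency
    for c in range(n_clusters):
        weights[labels == c] = weights[labels == c].mean()   # one smoothed weight per cluster
    return weights / weights.mean()                          # normalise around 1

In training, such weights would multiply the per-class cross-entropy terms; the re-sampling half of the scheme would act on the data loader rather than on the loss.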
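The abstract likewise does not give the F-TSDF formula. In SSC pipelines descended from SSCNet, a TSDF value d, truncated and normalised to [-1, 1], is flipped to sign(d)·(1 − |d|), concentrating the strongest responses at surfaces rather than in free space. A minimal sketch under that assumption:

import numpy as np

def flipped_tsdf(tsdf):
    """Flip a truncated signed distance volume normalised to [-1, 1].

    Voxels on a surface (d near 0) map to values near +/-1, while free
    and occluded space decay towards 0, giving the 3D CNN strong
    gradients exactly where geometry lives.
    """
    return np.sign(tsdf) * (1.0 - np.abs(tsdf))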
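The late-fusion step, as described, downsamples 2D RGB semantic features with planar convolutions, projects them into the voxel grid via the depth map, and adds them element-wise to the geometric branch. A hypothetical PyTorch sketch; the layer widths, the pixel-to-voxel index computation, and the module name are assumptions, not the thesis' implementation:

import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Hypothetical late-fusion head: planar (2D) convolutions shrink the
    RGB semantic features to the 3D grid resolution, the depth map scatters
    them into the voxel volume, and the result is added element-wise to the
    geometric (F-TSDF) features."""

    def __init__(self, rgb_ch, vox_ch):
        super().__init__()
        self.down = nn.Sequential(                      # planar downsampling
            nn.Conv2d(rgb_ch, vox_ch, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(vox_ch, vox_ch, 3, stride=2, padding=1),
        )

    def forward(self, rgb_feat, vox_index, geo_feat):
        # rgb_feat:  (B, C_rgb, H, W) 2D semantic features
        # vox_index: (B, ~H/4, ~W/4) long tensor; flat voxel index per pixel, from depth
        # geo_feat:  (B, C_vox, D, Hv, Wv) geometric features
        f2d = self.down(rgb_feat)                       # (B, C_vox, ~H/4, ~W/4)
        b, c = f2d.shape[:2]
        vol = geo_feat.new_zeros(b, c, geo_feat[0, 0].numel())
        idx = vox_index.reshape(b, 1, -1).expand(-1, c, -1)
        vol.scatter_(2, idx, f2d.reshape(b, c, -1))     # project 2D features into 3D
        return geo_feat + vol.view_as(geo_feat)         # element-wise fusion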
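EDT and RT60 are conventionally estimated from a room impulse response via Schroeder backward integration: the energy-decay curve is fitted with a line over a level range and extrapolated to a 60 dB decay. A compact sketch, assuming a mono RIR array rir sampled at fs Hz (the thesis' exact measurement procedure may differ):

import numpy as np

def decay_time(rir, fs, hi_db, lo_db):
    """Time for a 60 dB decay, extrapolated from a line fitted to the
    Schroeder energy-decay curve between hi_db and lo_db."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]               # Schroeder backward integration
    edc_db = 10.0 * np.log10(edc / edc[0])              # decay curve in dB, starts at 0
    i0 = int(np.argmax(edc_db <= hi_db))                # first sample at/below hi_db
    i1 = int(np.argmax(edc_db <= lo_db))                # first sample at/below lo_db
    t = np.arange(len(rir)) / fs
    slope, _ = np.polyfit(t[i0:i1], edc_db[i0:i1], 1)   # decay rate in dB/s
    return -60.0 / slope

# rt60 = decay_time(rir, fs, hi_db=-5.0, lo_db=-35.0)   # T30-style RT60 estimate
# edt  = decay_time(rir, fs, hi_db=0.0,  lo_db=-10.0)   # early decay time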
Comparative analysis indicates that our approach achieves better VR space reconstruction, producing more realistic scene representations and more immersive acoustic characteristics than existing methods reported in the literature. The proposed method contributes to the design of enhanced VR environments by integrating audio and visual signals into a unified framework. Our results support the development of datasets that combine audio and 3D SSC models, encouraging the application of AI in VR spaces. This advancement has the potential to drive progress in VR applications across domains such as gaming, education, and tourism.
Text: Mona Alawadh_PhD_Thesis - Version of Record
Text: Final-thesis-submission-Examination-Mrs-Mona-Alawadh (1) - Restricted to Repository staff only
More information
Published date: 4 June 2025
Keywords: semantic scene completion, virtual reality, 3D reconstruction
Identifiers
Local EPrints ID: 501810
URI: http://eprints.soton.ac.uk/id/eprint/501810
PURE UUID: c54de745-2974-47c3-93c3-1675c1d39616
Catalogue record
Date deposited: 10 Jun 2025 16:52
Last modified: 11 Sep 2025 03:18
Contributors
Author: Mona Alawadh
Thesis advisor: Hansung Kim
Thesis advisor: Mahesan Niranjan