University of Southampton Institutional Repository

3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image


Alawadh, Mona (2025) 3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image. Doctoral Thesis, 156pp.

Record type: Thesis (Doctoral)

Abstract

In this research, we propose a novel method for generating an audio-visual scene in 3D virtual space using a single panoramic RGB-D input. Our investigation begins with the reconstruction of a 3D model from RGB panoramic data alone, developing a semantic geometry model by combining estimated monocular depth with material information for spatial sound rendering. Building upon these preliminary results, we extend our approach to construct a comprehensive virtual reality (VR) environment using 360° RGB-D input. The proposed method enables the creation of an immersive VR space by generating a complete 3D voxelized model that incorporates scene semantics from a single panoramic input.

Our methodology employs a deep 3D convolutional neural network integrated with transfer learning for RGB semantic features, coupled with a re-weighting strategy in the 3D weighted cross-entropy loss function. The proposed re-weighting method uniquely combines two class re-balancing techniques (re-sampling and class-sensitive learning) while smoothing the weights through an unsupervised clustering algorithm. This approach addresses critical challenges in semantic scene completion (SSC), including the inherent class imbalance in indoor 3D spatial representations. Furthermore, we quantify the performance uncertainty in our results to ensure an unbiased assessment across trials, contributing to more reliable benchmarking in the SSC field.

We design a hybrid architecture featuring a dual-head model that simultaneously processes RGB and depth data. Depth information is encoded using a Flipped Truncated Signed Distance Function (F-TSDF), capturing essential geometric shape characteristics, while RGB features are projected from 2D to 3D space using depth maps. We explore various RGB semantics fusion strategies, including early, middle, and late fusion. Based on performance evaluations using K-fold cross-validation, we select the late fusion approach, which downsamples features using planar convolutions to align with the 3D resolution and then fuses RGB semantic features with geometric information through element-wise addition. The hybrid encoder-decoder architecture incorporates an Identity Transformation within a full pre-activation Residual Module (ITRM), enabling effective management of the diverse signals within the F-TSDF representation.

The inference methodology of the proposed SSC model is extended to accommodate 360° RGB-D input through cubic projection and 3D rotation, enabling VR space design with comprehensive spatial coverage. We propose a streamlined computer-vision-based approach capable of reconstructing a 3D SSC model from a single panoramic input, facilitating plausible sound environment simulation. Additionally, the proposed method reduces the complexity of estimating room impulse responses (RIRs), which typically require extensive equipment and multiple recordings in real space. We implement the audio-visual VR reconstructions in the Unity 3D gaming platform combined with the Steam Audio plug-in for spatial sound rendering. Acoustic properties are evaluated by measuring parameters such as early decay time (EDT) and reverberation time (RT60).
Comparative analysis indicates that our approach achieves better VR space reconstruction, producing more realistic scene representations and more immersive acoustic characteristics than existing methods reported in the literature. The proposed method contributes to the design of enhanced VR environments by integrating audio and visual signals into a unified framework. Our results support the development of datasets that combine audio with 3D SSC models, encouraging the application of AI in VR spaces. This advancement has the potential to drive progress in VR applications across various domains, such as gaming, education, and tourism.
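As a brief illustration of the acoustic evaluation mentioned in the abstract, the sketch below shows one common way to estimate EDT and RT60 from a room impulse response using Schroeder backward integration. This is a minimal, generic example for context only, not the exact pipeline used in the thesis; the function names and the impulse response `ir` and sample rate `fs` are illustrative assumptions.

```python
import numpy as np

def schroeder_edc_db(ir, eps=1e-12):
    """Energy decay curve (EDC) in dB via Schroeder backward integration:
    the remaining squared-IR energy from each sample onwards, normalised
    to the total energy."""
    tail_energy = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(tail_energy / (tail_energy[0] + eps) + eps)

def decay_time(edc_db, fs, hi_db, lo_db):
    """Fit a straight line to the EDC between hi_db and lo_db (both in dB),
    then report the time a full 60 dB decay would take at the fitted slope."""
    idx = np.where((edc_db <= hi_db) & (edc_db >= lo_db))[0]
    t = idx / fs
    slope, _ = np.polyfit(t, edc_db[idx], 1)   # dB per second (negative)
    return -60.0 / slope

# Illustrative usage with a synthetic exponentially decaying impulse response.
fs = 48_000
t = np.arange(fs) / fs
ir = np.exp(-6.9 * t / 0.5) * np.random.randn(fs)   # ~0.5 s target RT60
edc = schroeder_edc_db(ir)
edt = decay_time(edc, fs, 0.0, -10.0)    # EDT: fit the 0 to -10 dB range
rt60 = decay_time(edc, fs, -5.0, -35.0)  # RT60: T30 fit range, extrapolated
print(f"EDT ~ {edt:.2f} s, RT60 ~ {rt60:.2f} s")
```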

Text: Mona Alawadh_PhD_Thesis - Version of Record. Available under License University of Southampton Thesis Licence. Download (66MB)
Text: Final-thesis-submission-Examination-Mrs-Mona-Alawadh (1). Restricted to Repository staff only. Available under License University of Southampton Thesis Licence.

More information

Published date: 4 June 2025
Keywords: semantic scene completion, virtual reality, 3D reconstruction

Identifiers

Local EPrints ID: 501810
URI: http://eprints.soton.ac.uk/id/eprint/501810
PURE UUID: c54de745-2974-47c3-93c3-1675c1d39616
ORCID for Mona Alawadh: orcid.org/0000-0001-5354-7681
ORCID for Hansung Kim: orcid.org/0000-0003-4907-0491
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X

Catalogue record

Date deposited: 10 Jun 2025 16:52
Last modified: 11 Sep 2025 03:18


Contributors

Author: Mona Alawadh
Thesis advisor: Hansung Kim
Thesis advisor: Mahesan Niranjan

