3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image
Alawadh, Mona (2025) 3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image. Doctoral Thesis, University of Southampton, 156pp.
Record type: Thesis (Doctoral)
Abstract
In this research, we propose a novel method for generating an audio-visual scene in 3D virtual space from a single panoramic RGB-D input. Our investigation begins with the reconstruction of a 3D model from RGB panoramic data alone, developing a semantic geometry model that combines estimated monocular depth with material information for spatial sound rendering. Building upon these preliminary results, we extend our approach to construct a comprehensive virtual reality (VR) environment from 360° RGB-D input. The proposed method enables the creation of an immersive VR space by generating a complete 3D voxelized model that incorporates scene semantics from a single panoramic input.

Our methodology employs a deep 3D convolutional neural network integrated with transfer learning for RGB semantic features, coupled with a re-weighting strategy in the 3D weighted cross-entropy loss function. The proposed re-weighting method uniquely combines two class re-balancing techniques (re-sampling and class-sensitive learning) while smoothing the weights through an unsupervised clustering algorithm. This approach addresses critical challenges in semantic scene completion (SSC), including the inherent class imbalance in indoor 3D spatial representations. Furthermore, we quantify the performance uncertainty in our results to ensure an unbiased assessment across trials, contributing to more reliable benchmarking in the SSC field.

We design a hybrid architecture featuring a dual-head model that processes RGB and depth data simultaneously. Depth information is encoded using a Flipped Truncated Signed Distance Function (F-TSDF), capturing essential geometric shape characteristics, while RGB features are projected from 2D to 3D space using depth maps. We explored early, middle, and late RGB semantics fusion strategies and, based on performance evaluations using K-fold cross-validation, selected the late fusion approach. This method downsamples features using planar convolutions to match the 3D resolution, then fuses RGB semantic features with geometric information through element-wise addition. The hybrid encoder-decoder architecture incorporates an Identity Transformation within a full pre-activation Residual Module (ITRM), enabling effective handling of the diverse signals within the F-TSDF representation.

The inference methodology of the proposed SSC model is extended to accommodate 360° RGB-D input through cubic projection and 3D rotation, enabling VR space design with comprehensive spatial coverage. We propose a streamlined, computer-vision-based approach capable of reconstructing a 3D SSC model from a single panoramic input, facilitating plausible sound environment simulation. The proposed method also reduces the complexity of estimating room impulse responses (RIRs), which typically require extensive equipment and multiple recordings in real space. We implement the audio-visual VR reconstructions in the Unity 3D game engine combined with the Steam Audio plug-in for spatial sound rendering, and evaluate acoustic properties by measuring parameters such as early decay time (EDT) and reverberation time (RT60).
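The abstract names the ingredients of the re-weighting scheme (re-sampling, class-sensitive learning, and clustering-based smoothing) but not its exact formulation. The following is an illustrative Python sketch only: inverse-frequency weights stand in for the class-sensitive term, and k-means over log-frequencies stands in for the unsupervised smoother; both choices are assumptions, not the thesis' definitive recipe.

import numpy as np
from sklearn.cluster import KMeans

def clustered_class_weights(voxel_counts, n_clusters=4):
    """Illustrative per-class weights for a 3D weighted cross-entropy loss.

    Starts from inverse-frequency weights (a standard class-sensitive
    term), then groups classes with similar log-frequencies by k-means
    and replaces each weight with its cluster mean, smoothing out
    extreme weights for the rarest classes.
    """
    freq = voxel_counts / voxel_counts.sum()                 # class frequency over voxels
    weights = 1.0 / np.maximum(freq, 1e-8)                   # class-sensitive (inverse-frequency) term
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        np.log(np.maximum(freq, 1e-8)).reshape(-1, 1))       # cluster classes by frequency
    for c in range(n_clusters):
        weights[labels == c] = weights[labels == c].mean()   # one smoothed weight per cluster
    return weights / weights.mean()                          # normalise around 1

In training, such weights would multiply the per-class cross-entropy terms; the re-sampling half of the scheme would act on the data loader rather than on the loss.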
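The abstract likewise does not give the F-TSDF formula. In SSC pipelines descended from SSCNet, a TSDF value d, truncated and normalised to [-1, 1], is flipped to sign(d)·(1 − |d|), concentrating the strongest responses at surfaces rather than in free space. A minimal sketch under that assumption:

import numpy as np

def flipped_tsdf(tsdf):
    """Flip a truncated signed distance volume normalised to [-1, 1].

    Voxels on a surface (d near 0) map to values near +/-1, while free
    and occluded space decay towards 0, giving the 3D CNN strong
    gradients exactly where geometry lives.
    """
    return np.sign(tsdf) * (1.0 - np.abs(tsdf))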
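The late-fusion step, as described, downsamples 2D RGB semantic features with planar convolutions, projects them into the voxel grid via the depth map, and adds them element-wise to the geometric branch. A hypothetical PyTorch sketch; the layer widths, the pixel-to-voxel index computation, and the module name are assumptions, not the thesis' implementation:

import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Hypothetical late-fusion head: planar (2D) convolutions shrink the
    RGB semantic features to the 3D grid resolution, the depth map scatters
    them into the voxel volume, and the result is added element-wise to the
    geometric (F-TSDF) features."""

    def __init__(self, rgb_ch, vox_ch):
        super().__init__()
        self.down = nn.Sequential(                      # planar downsampling
            nn.Conv2d(rgb_ch, vox_ch, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(vox_ch, vox_ch, 3, stride=2, padding=1),
        )

    def forward(self, rgb_feat, vox_index, geo_feat):
        # rgb_feat:  (B, C_rgb, H, W) 2D semantic features
        # vox_index: (B, ~H/4, ~W/4) long tensor; flat voxel index per pixel, from depth
        # geo_feat:  (B, C_vox, D, Hv, Wv) geometric features
        f2d = self.down(rgb_feat)                       # (B, C_vox, ~H/4, ~W/4)
        b, c = f2d.shape[:2]
        vol = geo_feat.new_zeros(b, c, geo_feat[0, 0].numel())
        idx = vox_index.reshape(b, 1, -1).expand(-1, c, -1)
        vol.scatter_(2, idx, f2d.reshape(b, c, -1))     # project 2D features into 3D
        return geo_feat + vol.view_as(geo_feat)         # element-wise fusion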
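EDT and RT60 are conventionally estimated from a room impulse response via Schroeder backward integration: the energy-decay curve is fitted with a line over a level range and extrapolated to a 60 dB decay. A compact sketch, assuming a mono RIR array rir sampled at fs Hz (the thesis' exact measurement procedure may differ):

import numpy as np

def decay_time(rir, fs, hi_db, lo_db):
    """Time for a 60 dB decay, extrapolated from a line fitted to the
    Schroeder energy-decay curve between hi_db and lo_db."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]               # Schroeder backward integration
    edc_db = 10.0 * np.log10(edc / edc[0])              # decay curve in dB, starts at 0
    i0 = int(np.argmax(edc_db <= hi_db))                # first sample at/below hi_db
    i1 = int(np.argmax(edc_db <= lo_db))                # first sample at/below lo_db
    t = np.arange(len(rir)) / fs
    slope, _ = np.polyfit(t[i0:i1], edc_db[i0:i1], 1)   # decay rate in dB/s
    return -60.0 / slope

# rt60 = decay_time(rir, fs, hi_db=-5.0, lo_db=-35.0)   # T30-style RT60 estimate
# edt  = decay_time(rir, fs, hi_db=0.0,  lo_db=-10.0)   # early decay time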
Comparative analysis indicates that our approach achieves better VR space reconstruction, producing more realistic scene representations and more immersive acoustic characteristics than existing methods reported in the literature. The proposed method contributes to the design of enhanced VR environments by integrating audio and visual signals into a unified framework. Our results support the development of datasets that combine audio and 3D SSC models, encouraging the application of AI in VR spaces. This advancement has the potential to drive progress in VR applications across domains such as gaming, education, and tourism.
Text: Mona Alawadh_PhD_Thesis - Version of Record
Text: Final-thesis-submission-Examination-Mrs-Mona-Alawadh (1) - Restricted to Repository staff only
More information
Published date: 4 June 2025
Keywords: semantic scene completion, virtual reality, 3D reconstruction
Identifiers
Local EPrints ID: 501810
URI: http://eprints.soton.ac.uk/id/eprint/501810
PURE UUID: c54de745-2974-47c3-93c3-1675c1d39616
Catalogue record
Date deposited: 10 Jun 2025 16:52
Last modified: 11 Sep 2025 03:18
Contributors
Author: Mona Alawadh
Thesis advisor: Hansung Kim
Thesis advisor: Mahesan Niranjan