Speech enhancement by using deep learning algorithms
University of Southampton
Cui, Jianqiao
Bleeck, Stefan
Nelson, Philip
2024
Cui, Jianqiao (2024) Speech enhancement by using deep learning algorithms. University of Southampton, Doctoral Thesis, 159pp.
Record type: Thesis (Doctoral)
Abstract
Speech signals are often degraded by ambient noise, which significantly hampers speech intelligibility and quality, posing challenges for both human communication and speech-related technologies. Over the past decade, the advent of deep learning has catalysed remarkable progress in the field of speech enhancement. With the proliferation of smart devices demanding real-time processing capabilities, the development of real-time deep learning-based speech enhancement systems has become increasingly pertinent.
The primary objective of this thesis is to advance the state-of-the-art in real-time speech enhancement algorithms, with a focus on improving the intelligibility and quality of speech in noisy environments. Our research commences with an exploration into the intricacies of auditory perception and the impact of hearing loss on speech comprehension, setting the stage for the development of sophisticated speech enhancement techniques.
Traditional speech enhancement methods are reviewed in Chapter 2, leading to an in-depth discussion of the features critical for distinguishing speech from noise. The chapter then transitions to deep neural networks, detailing architectures such as LSTM-RNNs and CNNs and their application to speech enhancement, and emphasizing the importance of quantitative evaluation.
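As an illustration of the masking approach reviewed in Chapter 2, the sketch below shows a minimal LSTM-based estimator of a time-frequency mask applied to noisy magnitude spectra. It is written in PyTorch with placeholder layer sizes and STFT settings; it is not the thesis implementation.

```python
# Minimal sketch (assumed PyTorch, placeholder sizes): an LSTM predicts a bounded
# time-frequency mask from the noisy magnitude spectrogram.
import torch
import torch.nn as nn

class LSTMMaskEstimator(nn.Module):
    """Predicts a [0, 1] mask over magnitude spectra and applies it."""
    def __init__(self, n_freq=257, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag):            # (batch, frames, n_freq)
        h, _ = self.lstm(noisy_mag)
        mask = self.proj(h)                  # bounded mask in [0, 1]
        return mask * noisy_mag              # enhanced magnitude

# Usage: STFT -> magnitude -> mask -> inverse STFT with the noisy phase.
noisy = torch.randn(1, 16000)                                  # 1 s of audio at 16 kHz
window = torch.hann_window(512)
spec = torch.stft(noisy, n_fft=512, hop_length=128,
                  window=window, return_complex=True)
mag, phase = spec.abs(), spec.angle()
model = LSTMMaskEstimator()
enhanced_mag = model(mag.transpose(1, 2)).transpose(1, 2)      # back to (1, n_freq, frames)
enhanced_spec = torch.polar(enhanced_mag, phase)               # reuse the noisy phase
enhanced = torch.istft(enhanced_spec, n_fft=512, hop_length=128,
                       window=window, length=noisy.shape[-1])
```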
Chapter 3 delves into the application of Generative Adversarial Networks (GANs) to speech enhancement, building upon existing research to further refine the use of these models. The chapter focuses on the integration of the magnitude spectrum as an input feature, which significantly improves GAN performance. It also explores various deep learning architectures as potential generators within the GAN framework, showcasing the adaptability and continuing improvement potential of GANs for speech enhancement.
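The sketch below illustrates the general adversarial set-up on magnitude spectra: a generator enhances noisy frames and a conditional discriminator judges (noisy, candidate) pairs. The network sizes, the least-squares adversarial loss, and the L1 weight are illustrative assumptions, not the configuration used in Chapter 3.

```python
# Hedged sketch (assumed PyTorch): one LSGAN-style update on frame-wise magnitude spectra.
import torch
import torch.nn as nn

n_freq = 257
G = nn.Sequential(nn.Linear(n_freq, 512), nn.ReLU(),
                  nn.Linear(512, n_freq), nn.Softplus())       # non-negative magnitude output
D = nn.Sequential(nn.Linear(2 * n_freq, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 1))                           # realness score per frame

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(noisy_mag, clean_mag):
    """One adversarial update on batches of shape (batch, n_freq)."""
    # Discriminator: push (noisy, clean) pairs toward 1, (noisy, enhanced) pairs toward 0.
    fake = G(noisy_mag).detach()
    d_loss = ((D(torch.cat([noisy_mag, clean_mag], dim=-1)) - 1) ** 2).mean() + \
             (D(torch.cat([noisy_mag, fake], dim=-1)) ** 2).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator while staying close to the clean magnitude (L1 term).
    enhanced = G(noisy_mag)
    g_loss = ((D(torch.cat([noisy_mag, enhanced], dim=-1)) - 1) ** 2).mean() \
             + 100 * (enhanced - clean_mag).abs().mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

train_step(torch.rand(8, n_freq), torch.rand(8, n_freq))       # toy call with random spectra
```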
Attention mechanisms are presented as a driving force for innovation in speech enhancement, with the novel 'Mask First, Compensation Last' topology aiming to reduce speech distortion and residual noise. Motivated by these mechanisms, Chapter 4 further explores a new cascaded architecture operating on raw waveform input to address the complexity of auditory perception.
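A rough sketch of the 'Mask First, Compensation Last' idea on raw waveforms follows: a first stage suppresses noise with a bounded mask, and a second stage predicts an additive correction for the speech lost in masking. The two small convolutional networks are placeholders, not the cascaded architecture developed in Chapter 4.

```python
# Hedged sketch (assumed PyTorch): two-stage cascade on raw waveforms.
import torch
import torch.nn as nn

class MaskStage(nn.Module):
    """Stage 1: estimate a bounded sample-wise mask that suppresses noise."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=15, padding=7), nn.Sigmoid())

    def forward(self, wav):                     # (batch, 1, samples)
        return wav * self.net(wav)              # masked (denoised) waveform

class CompensationStage(nn.Module):
    """Stage 2: predict an additive correction from the masked and noisy signals."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, channels, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=15, padding=7))

    def forward(self, masked, noisy):
        return masked + self.net(torch.cat([masked, noisy], dim=1))

noisy = torch.randn(4, 1, 16000)
masked = MaskStage()(noisy)                     # mask first...
enhanced = CompensationStage()(masked, noisy)   # ...compensate last
```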
Chapter 5 introduces a new combination method for speech enhancement, contrasting mapping-based and masking-based approaches and proposing a parallel dual-module system, the Compensation for Complex Domain Network (CCDN), that unifies the magnitude spectrum with complex-domain details.
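The following sketch illustrates the parallel dual-module idea under simplifying assumptions: a masking branch estimates the magnitude while a mapping branch estimates the real and imaginary parts of the spectrum, and the two estimates are fused. The fusion rule and layer sizes shown are illustrative stand-ins, not the CCDN design itself.

```python
# Hedged sketch (assumed PyTorch): parallel magnitude-masking and complex-mapping branches.
import torch
import torch.nn as nn

class ParallelDualModule(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.mag_branch = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_freq), nn.Sigmoid())    # mask
        self.cplx_branch = nn.Sequential(nn.Linear(2 * n_freq, hidden), nn.ReLU(),
                                         nn.Linear(hidden, 2 * n_freq))             # maps Re/Im

    def forward(self, noisy_spec):                          # complex, (batch, frames, n_freq)
        mag = noisy_spec.abs()
        masked_mag = self.mag_branch(mag) * mag             # masking-based magnitude estimate
        ri = torch.cat([noisy_spec.real, noisy_spec.imag], dim=-1)
        re, im = self.cplx_branch(ri).chunk(2, dim=-1)      # mapping-based complex estimate
        # Illustrative fusion: magnitude from the masking branch, phase from the complex branch.
        fused_phase = torch.atan2(im, re)
        return torch.polar(masked_mag, fused_phase)

spec = torch.randn(2, 100, 257, dtype=torch.cfloat)         # placeholder noisy spectrogram
enhanced_spec = ParallelDualModule()(spec)
```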
The final chapter addresses the challenge of data mismatch in traditional supervised methods. We propose a strategy that combines unsupervised pre-training with supervised fine-tuning. This approach not only enhances speech quality in complex noise environments but also approximates the advantages of supervised learning without requiring paired data. The model's adaptability to real-world noise conditions and its effectiveness across various speech enhancement tasks are validated through rigorous experimental evaluations and subjective listening tests. The chapter culminates in a robust and practical speech enhancement model fit for real-world application, distinguished by its integration of unsupervised learning strategies for greater robustness and versatility. In doing so, this thesis aims to enhance the quality of human communication and to address the challenges faced by individuals with hearing impairments or those in noisy environments.
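The two-phase strategy can be sketched as follows, with a simple reconstruction pretext task standing in for the unsupervised objective and random tensors standing in for real recordings; the actual objectives, model, and hyper-parameters used in the thesis differ.

```python
# Hedged sketch (assumed PyTorch): unsupervised pre-training followed by supervised fine-tuning.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv1d(1, 32, kernel_size=15, padding=7), nn.ReLU(),
                      nn.Conv1d(32, 1, kernel_size=15, padding=7))
loss_fn = nn.L1Loss()

# Phase 1: unsupervised pre-training -- no clean targets required.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):                                    # toy number of steps
    unpaired = torch.randn(8, 1, 16000)                # stands in for real-world noisy recordings
    loss = loss_fn(model(unpaired), unpaired)          # reconstruction pretext task (assumed)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: supervised fine-tuning on paired data, starting from the pre-trained weights.
opt = torch.optim.Adam(model.parameters(), lr=1e-4)    # smaller learning rate for fine-tuning
for _ in range(10):
    noisy, clean = torch.randn(8, 1, 16000), torch.randn(8, 1, 16000)  # placeholder pairs
    loss = loss_fn(model(noisy), clean)
    opt.zero_grad(); loss.backward(); opt.step()
```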
Text: Jianqiao_Cui_PhD_thesis - Version of Record
Text: Final-thesis-submission-Examination-Mr-Jianqiao-Cui (1) - Restricted to Repository staff only
More information
Published date: 2024
Identifiers
Local EPrints ID: 492126
URI: http://eprints.soton.ac.uk/id/eprint/492126
PURE UUID: 6b41058a-5274-4fb2-8637-3f57f3f2bc5d
Catalogue record
Date deposited: 17 Jul 2024 16:37
Last modified: 15 Aug 2024 02:12
Contributors
Author: Jianqiao Cui