The University of Southampton
University of Southampton Institutional Repository

Bio-inspired voice recognition for speaker identification

Iliadi, Konstantina
Bleeck, Stefan

Iliadi, Konstantina (2016) Bio-inspired voice recognition for speaker identification. University of Southampton, Doctoral Thesis, 203pp.

Record type: Thesis (Doctoral)

Abstract

Speaker identification (SID) aims to identify the speaker(s) of a given speech utterance. The first component of a speaker identification system is the front-end, or feature extractor. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Because the front-end comes first in the processing chain, the performance of all later components depends strongly on its quality. Existing approaches have used several feature extraction methods adopted directly from speech recognition. However, the two tasks have conflicting goals: speaker variability is one of the major sources of error in speech recognition, whereas in speaker recognition it is precisely the information that we wish to extract.

This thesis examines the possible benefits of adapting a biologically inspired model of human auditory processing as part of the front-end of a SID system. This auditory model, the Auditory Image Model (AIM), generates the stabilized auditory image (SAI). Features are extracted from the SAI by dividing it into boxes of different scales. Vector quantization (VQ) is used to build the speaker database of reference templates, against which the features of the target speakers to be identified are pattern-matched. These features are also compared with Mel-frequency cepstral coefficients (MFCCs), the most prominent example of a feature set that is used extensively in speaker recognition but was originally developed for speech recognition.
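The VQ matching step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say how the codebooks are trained, so plain k-means is assumed here, and the function names (`train_codebook`, `average_distortion`, `identify`), the codebook size, and the Euclidean distance are all hypothetical choices made for illustration. Any fixed-length feature vectors could be passed in, whether SAI-based or MFCC-based.

```python
import numpy as np


def train_codebook(features, n_codewords=4, n_iter=20, seed=0):
    """Fit a VQ codebook to one speaker's feature vectors.

    Assumption: plain k-means; the thesis may use a different algorithm.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    # Initialise the codewords with randomly chosen training vectors.
    idx = rng.choice(len(features), size=n_codewords, replace=False)
    codebook = features[idx].copy()
    for _ in range(n_iter):
        # Assign every vector to its nearest codeword (Euclidean distance).
        dists = np.linalg.norm(
            features[:, None, :] - codebook[None, :, :], axis=2
        )
        nearest = dists.argmin(axis=1)
        # Move each codeword to the mean of the vectors assigned to it.
        for k in range(n_codewords):
            members = features[nearest == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook


def average_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    features = np.asarray(features, dtype=float)
    dists = np.linalg.norm(
        features[:, None, :] - codebook[None, :, :], axis=2
    )
    return float(dists.min(axis=1).mean())


def identify(features, codebooks):
    """Return the enrolled speaker whose codebook fits the features best."""
    return min(
        codebooks,
        key=lambda name: average_distortion(features, codebooks[name]),
    )
```

In use, one codebook is trained per enrolled speaker and stored in a dictionary keyed by speaker name; `identify` then picks the speaker whose codebook yields the lowest average distortion for the test utterance's feature vectors.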

Another important parameter in SID systems is the dimensionality of the features. This study addresses the issue by identifying the most speaker-specific features and by further refining the system configuration to obtain a lower-dimensional representation of the auditory features.

After the system performance has been evaluated in quiet conditions, another primary topic of speaker recognition is investigated. SID systems can perform well under matched training and test conditions, but their performance degrades significantly under the mismatch caused by background noise in real-world environments. Making SID systems robust to such noise is therefore an important research problem. In the second experimental part of this thesis, the developed version of the system is assessed on speaker data sets of different sizes. Clean speech is used for training, while speech in babble noise is used for testing. The results suggest that the extracted auditory feature vectors lead to much better performance, i.e. higher SID accuracy, than the MFCC-based recognition system, especially at low signal-to-noise ratios (SNRs). Lastly, the system performance is examined with regard to parameters of the training and test speech data, such as the duration of the spoken material. In these experiments, the system produces satisfactory identification scores for relatively short training and test speech segments.
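The mismatched test condition (clean training speech, noisy test speech) relies on mixing noise into speech at a controlled SNR. The sketch below shows one common way to do this and is only an assumption about the procedure: the abstract does not describe how the babble noise was scaled or which SNR values were used. The function name `mix_at_snr` is hypothetical, and SNR is computed here from average power over the whole signal.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB.

    Assumption: SNR is defined on mean power over the full utterance.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power needed for the target SNR: P_speech / 10^(SNR/10).
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise
```

A test set for a given SNR would then be built by passing each clean test utterance and a babble recording through `mix_at_snr`, while the training utterances are left clean.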

Text
ILIADI 22903402 final e-thesis for e-prints - Version of Record
Available under License University of Southampton Thesis Licence.

More information

Published date: October 2016

Identifiers

Local EPrints ID: 413949
URI: https://eprints.soton.ac.uk/id/eprint/413949
PURE UUID: 1e286c77-14c7-4026-af03-a34d6232b5bd
ORCID for Stefan Bleeck: orcid.org/0000-0003-4378-3394

Catalogue record

Date deposited: 11 Sep 2017 16:31
Last modified: 14 Mar 2019 01:41
