University of Southampton Institutional Repository

Improving automatic speech recognition transcription through signal processing

University of Southampton
Shah, Afnan Arafat
65a6047d-04f4-4aed-b13d-052c1a0b5529
Wald, Michael
90577cfd-35ae-4e4a-9422-5acffecd89d5

Shah, Afnan Arafat (2017) Improving automatic speech recognition transcription through signal processing. University of Southampton, Doctoral Thesis, 158pp.

Record type: Thesis (Doctoral)

Abstract

Automatic speech recognition (ASR) in the educational environment could help the many students who find lectures hard to understand, such as those whose mother tongue is not English or who have a hearing impairment, to gain access to the spoken words of a lecture. In such an environment, it is difficult for ASR to provide transcripts with Word Error Rates (WER) of less than 25% across the wide range of speakers. Reducing the WER reduces the time, and therefore the cost, of correcting errors in the transcripts.
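WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal Python sketch of this standard measure (illustrative only, not code from the thesis):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one inserted word against a three-word reference gives WER = 1/3.
# word_error_rate("the cat sat", "the cat sat down")  ->  0.333...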

To deal with the variation of acoustic features between speakers, ASR systems implement automatic vocal tract normalisation (VTN) that warps the formants (resonant frequencies) of the speaker to better match the formants of the speakers in the training set. The ASR also implements automatic dynamic time warping (DTW) to deal with variation in the speaker’s rate of speaking, by aligning the time series of the new spoken words with the time series of the matching spoken words of the training set.
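The abstract does not detail the ASR internals, but the idea behind DTW can be illustrated briefly: it finds the minimum-cost monotonic alignment between two feature sequences, so a slow and a fast rendition of the same words can still be matched. The one-dimensional features in this sketch are placeholders for real acoustic feature vectors:

import numpy as np

def dtw_cost(x, y):
    """Minimum cumulative cost of aligning sequence x with sequence y."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])              # frame-to-frame distance
            cost[i, j] = local + min(cost[i - 1, j],      # x frame repeated
                                     cost[i, j - 1],      # y frame repeated
                                     cost[i - 1, j - 1])  # frames matched
    return cost[n, m]

# The same utterance spoken at half the rate still aligns with zero cost:
# dtw_cost([1, 2, 3, 4], [1, 1, 2, 2, 3, 3, 4, 4]) == 0.0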

This research investigates whether the ASR’s automatic estimation of VTN and DTW can be enhanced by pre-processing the recordings: the formants and speaking rate are manually warped using sound processing libraries (Rubber Band and SoundTouch) before the pre-processed recordings are transcribed by the ASR.
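As a rough illustration of this pre-processing step, the sketch below uses the pyrubberband bindings to the Rubber Band library (the thesis also uses SoundTouch) together with the soundfile package for audio I/O. The warp factors shown are arbitrary placeholders, and pitch shifting is used here only as a stand-in for the formant warping described in the thesis, whose exact settings the abstract does not give:

import soundfile as sf          # audio file I/O
import pyrubberband as pyrb     # Python bindings to the Rubber Band library

def preprocess_recording(in_path, out_path, rate_factor=1.05, semitones=-1.0):
    """Warp speaking rate and pitch before handing the audio to the ASR."""
    audio, sr = sf.read(in_path)
    audio = pyrb.time_stretch(audio, sr, rate_factor)  # change speaking rate
    audio = pyrb.pitch_shift(audio, sr, semitones)     # shift pitch (stand-in for formant warping)
    sf.write(out_path, audio, sr)

# preprocess_recording("lecture.wav", "lecture_warped.wav")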

An initial experiment, performed with the recordings of two male and two female speakers, showed that pre-processing the recordings could improve the WER by an average of 39.5% for male speakers and 36.2% for female speakers. However, the best warp factors were selected through an iterative ‘trial and error’ approach that took many hours of calculating the WER for each warp factor setting.
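The ‘trial and error’ search can be pictured as an exhaustive sweep over candidate warp settings, transcribing the pre-processed recording at each setting and keeping the one with the lowest WER. This is an illustrative reconstruction: transcribe stands in for pre-processing plus the ASR, and word_error_rate is the sketch above.

from itertools import product

def best_warp_setting(recording, reference, rate_factors, formant_warps, transcribe):
    """Evaluate every (rate, formant) pair and return the one with the lowest WER."""
    best_setting, best_wer = None, float("inf")
    for rate, formant in product(rate_factors, formant_warps):
        hypothesis = transcribe(recording, rate=rate, formant=formant)  # pre-process + ASR
        wer = word_error_rate(reference, hypothesis)
        if wer < best_wer:
            best_setting, best_wer = (rate, formant), wer
    return best_setting, best_wer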

A more efficient approach to selecting the warp factors for pre-processing was then investigated.

The second experiment developed a modification function that estimates the modification percentage required to improve the WER of a recording, using as its training set the best warp factors from the ‘trial and error’ approach. The resulting modification function improved the WER by an average of 16% for female speakers and 7% for male speakers.
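The abstract does not state the form of the modification function or its inputs. One simple possibility, shown only as a hedged sketch, is a least-squares linear fit from a speaker feature (mean fundamental frequency is a hypothetical choice here, not taken from the thesis) to the best modification percentage found in the first experiment:

import numpy as np

def fit_modification_function(speaker_features, best_percentages):
    """Fit percentage ≈ a * feature + b by least squares and return a predictor."""
    a, b = np.polyfit(speaker_features, best_percentages, deg=1)
    return lambda feature: a * feature + b

# Placeholder values for illustration, not data from the thesis:
predict = fit_modification_function([110.0, 120.0, 205.0, 220.0],
                                     [-4.0, -3.0, 5.0, 6.0])
print(round(predict(200.0), 1))  # estimated modification percentage for a new speaker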

Text
Final_Thesis - Version of Record
Available under License University of Southampton Thesis Licence.
Download (3MB)

More information

Published date: June 2017

Identifiers

Local EPrints ID: 418970
URI: http://eprints.soton.ac.uk/id/eprint/418970
PURE UUID: 3aa3e354-32a2-46cf-b2d2-50841eda8840
ORCID for Afnan Arafat Shah: orcid.org/0000-0003-1851-3878

Catalogue record

Date deposited: 27 Mar 2018 16:30
Last modified: 15 Mar 2024 19:05

Contributors

Author: Afnan Arafat Shah
Thesis advisor: Michael Wald
