Improving automatic speech recognition transcription through signal processing

Automatic speech recognition (ASR) in educational settings could give access to the spoken words of a lecture to the many students who find lectures hard to follow, such as those whose first language is not English or who have a hearing impairment. In such settings, however, it is difficult for ASR to produce transcripts with a Word Error Rate (WER) below 25% across the wide range of speakers. Reducing the WER reduces the time, and therefore the cost, of correcting errors in the transcripts.
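As an illustration of the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of words in the reference. The sketch below assumes simple whitespace tokenisation; the function name `word_error_rate` is hypothetical and is not taken from the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Under this definition a WER of 25% corresponds to roughly one word in four needing correction.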

To deal with the variation in acoustic features between speakers, ASR systems implement automatic vocal tract normalisation (VTN), which warps the formants (resonant frequencies) of the speaker to better match the formants of the speakers in the training set. ASR systems also implement automatic dynamic time warping (DTW) to deal with variation in the speaker’s rate of speaking, by aligning the time series of the new spoken words with the time series of the matching spoken words in the training set.
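The alignment idea behind DTW can be sketched with the textbook dynamic-programming recurrence. This is a minimal illustration on one-dimensional sequences with an absolute-difference local cost, both of which are assumptions made for brevity; it is not the implementation used inside any particular ASR system, where the sequences would be frames of acoustic feature vectors.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Minimum cumulative cost of aligning sequence a with sequence b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays (b element stretched)
                cost[i][j - 1],      # b advances while a stays (a element stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same contour spoken more slowly (a repeated sample) aligns at zero cost.
slow_vs_fast = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

Because a sample may be matched to several samples in the other sequence, a slower or faster rendition of the same contour can still align closely.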

This research investigates whether the ASR’s automatic VTN and DTW can be enhanced by pre-processing: manually warping the formants and speaking rate of a recording with sound processing libraries (Rubber Band and SoundTouch) before the pre-processed recording is transcribed by the ASR.
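A pre-processing step of this kind might be scripted as below. This is a sketch under stated assumptions: it assumes the Rubber Band command-line utility `rubberband` is installed, uses its `--time` (duration ratio) and `--pitch` (semitone shift) options, and treats a pitch shift as a stand-in for the formant warp. The abstract does not state which tool options or parameter values the thesis used, and the helper name `build_rubberband_command` and the file names are hypothetical.

```python
from typing import List


def build_rubberband_command(
    src: str, dst: str, time_ratio: float, pitch_semitones: float
) -> List[str]:
    """Build the argument list for one warped copy of a recording.

    time_ratio > 1 lengthens the recording (slower speech);
    pitch_semitones shifts the spectrum up (positive) or down (negative).
    """
    return [
        "rubberband",
        "--time", str(time_ratio),
        "--pitch", str(pitch_semitones),
        src,
        dst,
    ]


command = build_rubberband_command("lecture.wav", "lecture_warped.wav", 1.1, 0.5)

# To run the command (requires the rubberband binary on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping the command construction separate from its execution makes it straightforward to generate many warped variants of the same recording before passing each to the ASR.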

An initial experiment, performed with the recordings of two male and two female speakers, showed that pre-processing the recording could improve the WER by an average of 39.5% for male speakers and 36.2% for female speakers. However, the best warp factors were selected through an iterative ‘trial and error’ approach that involved many hours of calculating the WER for each warp factor setting.
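The cost of such a search can be seen from a sketch of an exhaustive sweep over candidate settings. The grid values, the function name `best_warp_setting` and the callable `score_wer` (which would pre-process the recording, run the ASR and compute the WER against a reference transcript) are all hypothetical; the abstract does not describe the exact procedure followed.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def best_warp_setting(
    formant_factors: Iterable[float],
    rate_factors: Iterable[float],
    score_wer: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Try every (formant, rate) pair and return the one with the lowest WER."""
    best = None
    for formant, rate in product(formant_factors, rate_factors):
        wer = score_wer(formant, rate)  # one full transcription per setting
        if best is None or wer < best[2]:
            best = (formant, rate, wer)
    if best is None:
        raise ValueError("no candidate settings were supplied")
    return best


# Stand-in scorer for demonstration only; a real scorer would call the ASR.
def fake_score(formant: float, rate: float) -> float:
    return abs(formant - 1.1) + abs(rate - 0.9)


result = best_warp_setting([0.9, 1.0, 1.1], [0.9, 1.0], fake_score)
```

Every candidate pair requires a complete transcription and WER calculation, so the number of ASR runs grows with the product of the two grid sizes.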

A more efficient approach to selecting the warp factors for pre-processing was therefore investigated.

The second experiment investigated the development of a modification function, trained on the best warp factors found by the ‘trial and error’ approach, to estimate the modification percentage required to improve the WER of a recording. A modification function was found that improved the WER by an average of 16% for female speakers and 7% for male speakers.

University of Southampton

Shah, Afnan Arafat

June 2017

Wald, Michael

Shah, Afnan Arafat (2017) Improving automatic speech recognition transcription through signal processing. University of Southampton, Doctoral Thesis, 158pp.

Record type: Thesis (Doctoral)