Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
Williams, Jennifer
3a1568b4-8a0b-41d2-8635-14fe69fbb360
Rownicka, Joanna
73b0f5ec-36a7-4774-a957-7c4a9d6b6aa1
Gallegos, Pilar Oplustil
68d49830-5c04-44a3-b868-7da6c8bec0f4
King, Simon
ddf6b68a-e917-4ed9-b8ed-80608d89f113
5 November 2020
Williams, Jennifer, Rownicka, Joanna, Gallegos, Pilar Oplustil and King, Simon (2020) Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis. In Speaker and Language Recognition Workshop (Workshop, 2020). 8 pp. (doi:10.21437/Odyssey.2020-32).
Record type: Conference or Workshop Item (Paper)
Abstract
We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech (TTS) synthesis. We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings. First, we train and evaluate our NN model on 13 different TTS and voice conversion (VC) systems from the ASVSpoof 2019 Logical Access (LA) Dataset. Since it is not known how best to represent speech for this task, we compare 8 different representations alongside MOSNet frame-based features. Our representations include image-based spectrogram features and x-vector embeddings that explicitly model different types of noise such as T60 reverberation time. Our NN predicts MOS with a high correlation to human judgments. We report prediction correlation and error. A key finding is that the quality achieved for certain speakers seems consistent, regardless of the TTS or VC system. It is widely accepted that some speakers give higher quality than others for building a TTS system; our method provides an automatic way to identify such speakers. Finally, to see whether our quality prediction models generalize, we predict quality scores for synthetic speech using a separate multi-speaker TTS system that was trained on LibriTTS data, and conduct our own MOS listening test to compare human ratings with our NN predictions.
This record has no associated files available for download.
More information
Published date: 5 November 2020
Identifiers
Local EPrints ID: 470319
URI: http://eprints.soton.ac.uk/id/eprint/470319
PURE UUID: 36e5048f-cc82-45f5-bd10-d236943aae61
Catalogue record
Date deposited: 06 Oct 2022 16:41
Last modified: 17 Mar 2024 04:12
Contributors
Author:
Jennifer Williams
Author:
Joanna Rownicka
Author:
Pilar Oplustil Gallegos
Author:
Simon King