The University of Southampton
University of Southampton Institutional Repository

Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis


Williams, Jennifer, Rownicka, Joanna, Gallegos, Pilar Oplustil and King, Simon (2020) Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis. In Speaker and Language Recognition Workshop (Workshop, 2020). 8 pp. (doi:10.21437/Odyssey.2020-32).

Record type: Conference or Workshop Item (Paper)

Abstract

We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech (TTS) synthesis. We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings. First, we train and evaluate our NN model on 13 different TTS and voice conversion (VC) systems from the ASVspoof 2019 Logical Access (LA) Dataset. Since it is not known how best to represent speech for this task, we compare 8 different representations alongside MOSNet frame-based features. Our representations include image-based spectrogram features and x-vector embeddings that explicitly model different types of noise such as T60 reverberation time. Our NN predicts MOS with a high correlation to human judgments. We report prediction correlation and error. A key finding is that the quality achieved for certain speakers seems consistent, regardless of the TTS or VC system. It is widely accepted that some speakers give higher quality than others for building a TTS system: our method provides an automatic way to identify such speakers. Finally, to see if our quality prediction models generalize, we predict quality scores for synthetic speech using a separate multi-speaker TTS system that was trained on LibriTTS data, and conduct our own MOS listening test to compare human ratings with our NN predictions.
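
The abstract mentions reporting prediction correlation and error and using predicted quality to identify well-suited speakers. The following is a minimal sketch of how such an evaluation could look, not the authors' implementation. The toy scores, the speaker labels, and the helper names (pearson, mse, per_speaker_means) are all hypothetical, and the choice of Pearson correlation and mean squared error is an assumption, since the record does not state which metrics the paper uses.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Linear (Pearson) correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mse(x, y):
    """Mean squared error between two equal-length score lists."""
    return mean((a - b) ** 2 for a, b in zip(x, y))


def per_speaker_means(speakers, scores):
    """Average the utterance-level scores of each speaker."""
    grouped = defaultdict(list)
    for spk, score in zip(speakers, scores):
        grouped[spk].append(score)
    return {spk: mean(vals) for spk, vals in grouped.items()}


# Hypothetical toy data: one entry per synthesized utterance.
speakers = ["spk_a", "spk_a", "spk_b", "spk_b", "spk_c", "spk_c"]
human_mos = [4.0, 3.5, 2.0, 2.5, 3.0, 3.5]      # listening-test ratings
predicted_mos = [3.8, 3.6, 2.4, 2.2, 3.1, 3.3]  # model outputs

# Utterance-level agreement between predictions and human ratings.
utt_lcc = pearson(human_mos, predicted_mos)
utt_mse = mse(human_mos, predicted_mos)

# Speaker-level view: rank speakers by their mean predicted quality.
pred_by_spk = per_speaker_means(speakers, predicted_mos)
ranking = sorted(pred_by_spk, key=pred_by_spk.get, reverse=True)

print(f"utterance-level LCC: {utt_lcc:.3f}, MSE: {utt_mse:.3f}")
print("speakers ranked by predicted MOS:", ranking)
```

Averaging per speaker before ranking reflects the abstract's focus on speaker-level quality. A rank-based correlation could be substituted for Pearson if only the ordering of speakers matters.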

This record has no associated files available for download.

More information

Published date: 5 November 2020

Identifiers

Local EPrints ID: 470319
URI: http://eprints.soton.ac.uk/id/eprint/470319
PURE UUID: 36e5048f-cc82-45f5-bd10-d236943aae61
ORCID for Jennifer Williams: orcid.org/0000-0003-1410-0427

Catalogue record

Date deposited: 06 Oct 2022 16:41
Last modified: 17 Mar 2024 04:12

Contributors

Author: Jennifer Williams
Author: Joanna Rownicka
Author: Pilar Oplustil Gallegos
Author: Simon King
