University of Southampton Institutional Repository

An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets

Gallegos, Pilar Oplustil, Williams, Jennifer, Rownicka, Joanna and King, Simon (2020) An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets. Proceedings of the Annual Conference of the International Speech Communication Association: INTERSPEECH 2020, Shanghai, China, 25-29 Oct 2020. p. 1758.

Record type: Conference or Workshop Item (Paper)

Abstract

Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully-chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
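
The abstract does not say which acoustic representations, clustering algorithm, or cluster-selection criterion the authors use, so the following Python sketch only illustrates the general idea: derive one embedding per speaker, cluster the speakers, and keep a single cluster as the training subset. It assumes scikit-learn is available, and the compactness-based cluster choice is a hypothetical stand-in for the paper's actual criterion.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_speaker_subset(speaker_embeddings, n_clusters=4):
        """Cluster per-speaker acoustic representations and return the
        speaker IDs belonging to one chosen cluster.

        speaker_embeddings: dict of speaker_id -> 1-D numpy array,
        e.g. an utterance-averaged acoustic embedding per speaker.
        """
        ids = list(speaker_embeddings)
        X = np.stack([speaker_embeddings[s] for s in ids])

        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

        # Distance from each speaker to its own cluster centroid.
        dists = km.transform(X)[np.arange(len(ids)), km.labels_]

        # Choose the most compact cluster as a proxy for an acoustically
        # consistent group of speakers (a hypothetical criterion, not
        # necessarily the one used in the paper).
        compactness = [dists[km.labels_ == c].mean() for c in range(n_clusters)]
        best = int(np.argmin(compactness))

        return [s for s, label in zip(ids, km.labels_) if label == best]

For example, calling select_speaker_subset on per-speaker embeddings extracted from LibriTTS would return the speaker IDs in the tightest of the four clusters, which could then define the reduced training set.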

This record has no associated files available for download.

More information

Published date: 31 October 2020
Additional Information: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. ISSN (Print): 2308-457X
Venue - Dates: Proceedings of the Annual Conference of the International Speech Communication Association: INTERSPEECH 2020, Shanghai, China, 2020-10-25 - 2020-10-29

Identifiers

Local EPrints ID: 467438
URI: http://eprints.soton.ac.uk/id/eprint/467438
PURE UUID: ee576c20-e51a-4883-adb2-182c55fd48c5
ORCID for Jennifer Williams: orcid.org/0000-0003-1410-0427

Catalogue record

Date deposited: 08 Jul 2022 16:39
Last modified: 17 Mar 2024 04:12

Contributors

Author: Pilar Oplustil Gallegos
Author: Jennifer Williams
Author: Joanna Rownicka
Author: Simon King
