University of Southampton Institutional Repository

Improved prosody from Learned F0 Codebook representations for VQ-VAE speech waveform reconstruction


Zhao, Yi, Li, Haoyu, Lai, Cheng-I, Williams, Jennifer, Cooper, Erica and Yamagishi, Junichi (2020) Improved prosody from Learned F0 Codebook representations for VQ-VAE speech waveform reconstruction. Proceedings of the Annual Conference of the International Speech Communication Association: INTERSPEECH 2020, Shanghai, China, 25-29 Oct 2020, pp. 4417-4421.

Record type: Conference or Workshop Item (Paper)

Abstract

Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, VQ-VAE architectures have modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related suprasegmental information jointly with traditional phone features. The proposed framework uses two encoders, one taking the F0 trajectory and one taking the speech waveform as input, so that two separate codebooks are learned. We used a WaveRNN vocoder as the decoder component of the VQ-VAE. Our speaker-independent VQ-VAE was trained on raw speech waveforms from multi-speaker Japanese speech databases. Experimental results show that the proposed extension reduces the F0 distortion of reconstructed speech for all unseen test speakers and yields significantly higher preference scores in a listening test. We additionally conducted experiments on single-speaker Mandarin speech to demonstrate the advantages of our architecture in another language that relies heavily on F0.

Index Terms: VQ-VAE, speech synthesis, prosody, representation learning
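The abstract describes, at its core, a VQ-VAE with two parallel quantisation paths. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration, with assumed layer sizes and codebook sizes, of how a waveform encoder and an F0 encoder can each feed their own codebook before a shared decoder (here a placeholder GRU rather than the WaveRNN vocoder used in the paper, and without the commitment/codebook losses needed for training).

```python
# Minimal sketch (assumed hyperparameters, not the paper's exact model):
# two encoders -> two separate codebooks -> concatenated codes -> shared decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):                                 # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))                  # (batch*time, dim)
        dist = torch.cdist(flat, self.codebook.weight)    # distance to every code
        idx = dist.argmin(dim=1)                          # nearest code per frame
        quantised = self.codebook(idx).view_as(z)
        quantised = z + (quantised - z).detach()          # straight-through gradient
        return quantised, idx.view(z.shape[:-1])


class DualCodebookVQVAE(nn.Module):
    """Waveform encoder and F0 encoder, each with its own learned codebook."""

    def __init__(self, code_dim: int = 64, phone_codes: int = 256, f0_codes: int = 64):
        super().__init__()
        # Waveform branch: strided 1-D convolutions downsample raw samples.
        self.wave_enc = nn.Sequential(
            nn.Conv1d(1, code_dim, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(code_dim, code_dim, kernel_size=4, stride=2, padding=1),
        )
        # F0 branch: the frame-level F0 trajectory is already low-rate.
        self.f0_enc = nn.Sequential(
            nn.Conv1d(1, code_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(code_dim, code_dim, kernel_size=3, padding=1),
        )
        self.phone_vq = VectorQuantizer(phone_codes, code_dim)
        self.f0_vq = VectorQuantizer(f0_codes, code_dim)
        # Placeholder decoder; the paper conditions a WaveRNN vocoder instead.
        self.decoder = nn.GRU(2 * code_dim, 2 * code_dim, batch_first=True)
        self.out = nn.Linear(2 * code_dim, 1)

    def forward(self, wave, f0):
        # wave: (batch, samples), f0: (batch, frames)
        zw = self.wave_enc(wave.unsqueeze(1))             # (batch, dim, T)
        zf = self.f0_enc(f0.unsqueeze(1))                 # (batch, dim, frames)
        # Resample the F0 features so both branches share one time axis.
        zf = F.interpolate(zf, size=zw.size(-1), mode="linear", align_corners=False)
        qw, phone_ids = self.phone_vq(zw.transpose(1, 2))
        qf, f0_ids = self.f0_vq(zf.transpose(1, 2))
        hidden, _ = self.decoder(torch.cat([qw, qf], dim=-1))
        return self.out(hidden), phone_ids, f0_ids


if __name__ == "__main__":
    model = DualCodebookVQVAE()
    wave = torch.randn(2, 16000)   # 1 s of audio at an assumed 16 kHz
    f0 = torch.randn(2, 100)       # an assumed 100 F0 frames for that second
    recon, phone_ids, f0_ids = model(wave, f0)
    print(recon.shape, phone_ids.shape, f0_ids.shape)
```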

This record has no associated files available for download.

More information

Published date: 29 October 2020
Venue - Dates: Proceedings of the Annual Conference of the International Speech Communication Association: INTERSPEECH 2020, Shanghai, China, 2020-10-25 - 2020-10-29

Identifiers

Local EPrints ID: 467452
URI: http://eprints.soton.ac.uk/id/eprint/467452
PURE UUID: 04ca6e33-c783-430b-bf8d-53489407badc
ORCID for Jennifer Williams: orcid.org/0000-0003-1410-0427

Catalogue record

Date deposited: 08 Jul 2022 16:43
Last modified: 17 Mar 2024 04:12


Contributors

Author: Yi Zhao
Author: Haoyu Li
Author: Cheng-I Lai
Author: Jennifer Williams (orcid.org/0000-0003-1410-0427)
Author: Erica Cooper
Author: Junichi Yamagishi



Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2
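
As a concrete illustration of that endpoint, the following sketch fetches this record's metadata with the standard OAI-PMH GetRecord verb. The base URL comes from the line above; the identifier format (oai:eprints.soton.ac.uk:467452) and the oai_dc metadata prefix are assumptions based on common EPrints conventions, not values stated on this page.

```python
# Hedged example: query the OAI-PMH endpoint named above for this record.
# The identifier format and metadata prefix are assumptions, not values
# confirmed by this page.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://eprints.soton.ac.uk/cgi/oai2"
params = {
    "verb": "GetRecord",
    "identifier": "oai:eprints.soton.ac.uk:467452",  # assumed OAI identifier
    "metadataPrefix": "oai_dc",                      # assumed Dublin Core prefix
}

with urlopen(f"{BASE_URL}?{urlencode(params)}") as response:
    tree = ET.parse(response)

# Print any Dublin Core titles returned for the record.
for title in tree.getroot().iter("{http://purl.org/dc/elements/1.1/}title"):
    print(title.text)
```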

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.
