Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

This work examines the content and usefulness of disentangled phone
and speaker representations from two separately trained VQ-VAE
systems: one trained on multilingual data and another trained on
monolingual data. We explore the multi- and monolingual models using
four small proof-of-concept tasks: copy-synthesis, voice
transformation, linguistic code-switching, and content-based privacy
masking. From these tasks, we reflect on how disentangled phone and
speaker representations can be used to manipulate speech in a
meaningful way. Our experiments demonstrate that the VQ
representations are suitable for these tasks, including creating new
voices by mixing speaker representations together. We also present a
novel technique to conceal the content of targeted words within an
utterance by manipulating phone VQ codes, while retaining speaker
identity and the intelligibility of surrounding words. Finally, we
discuss recommendations for further increasing the viability of
disentangled representations.

Williams, Jennifer

Fong, Jason

Cooper, Erica

Yamagishi, Junichi

28 August 2021