A new approach to voice authenticity
Müller, Nicolas M.
Kawa, Piotr
Hu, Shen
Neu, Matthias
Williams, Jennifer
Sperl, Philip
Böttinger, Konstantin
1 September 2024
Müller, Nicolas M., Kawa, Piotr, Hu, Shen, Neu, Matthias, Williams, Jennifer, Sperl, Philip and Böttinger, Konstantin (2024) A new approach to voice authenticity. Interspeech 2024, Kos Island, Greece, 01-05 Sep 2024. 5 pp.
Record type: Conference or Workshop Item (Paper)
Abstract
Voice faking poses significant societal challenges. Currently, the prevailing assumption is that unaltered human speech can always be considered genuine, while fake speech usually comes from text-to-speech (TTS) synthesis. We argue that this type of binary distinction is oversimplified. For instance, altered playback speeds can maliciously deceive listeners, as in the ‘Drunken Nancy Pelosi’ incident. Similarly, editing of audio clips can be done ethically, e.g. for brevity or summarization in news reporting or podcasts, but editing can also create misleading narratives. In this paper, we propose a conceptual shift away from the longstanding binary paradigm of speech audio being either ‘fake’ or ‘real’. Instead, we focus on pinpointing ‘voice edits’, which encompass traditional modifications like filters and cuts, as well as neural synthesis. We delineate six categories of voice edits and curate a new challenge dataset, for which we present baseline voice edit detection systems.
Text: muller24_interspeech - Accepted Manuscript
More information
Published date: 1 September 2024
Venue - Dates: Interspeech 2024, Kos Island, Greece, 2024-09-01 - 2024-09-05
Identifiers
Local EPrints ID: 502676
URI: http://eprints.soton.ac.uk/id/eprint/502676
PURE UUID: 01d256ee-ee75-4fa0-b8f0-7be4a9f4d473
Catalogue record
Date deposited: 04 Jul 2025 16:34
Last modified: 22 Aug 2025 02:34
Contributors
Author: Nicolas M. Müller
Author: Piotr Kawa
Author: Shen Hu
Author: Matthias Neu
Author: Jennifer Williams
Author: Philip Sperl
Author: Konstantin Böttinger