Multiple hypothesis tracking for overlapping speaker segmentation
Speaker segmentation is an essential part of any diarization system. Applications of diarization include speaker indexing, improving automatic speech recognition (ASR) performance, and making single-speaker algorithms usable in multi-speaker environments. This paper proposes a multiple hypothesis tracking (MHT) method that exploits the harmonic structure associated with the pitch of voiced speech to locate the onsets and end-points of speech from multiple, overlapping speakers. The proposed method is evaluated against a segmentation system from the literature that uses a spectral representation and is based on bidirectional long short-term memory (BLSTM) networks. Using only the pitch harmonic information within the MHT framework, the proposed method is shown to achieve comparable performance for segmenting overlapping speakers.
Hogg, Aidan
Evers, Christine
Naylor, Patrick A.
23 December 2019
Hogg, Aidan, Evers, Christine and Naylor, Patrick A. (2019) Multiple hypothesis tracking for overlapping speaker segmentation. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE. (doi:10.1109/WASPAA.2019.8937185).
Record type: Conference or Workshop Item (Paper)
This record has no associated files available for download.
More information
Published date: 23 December 2019
Identifiers
Local EPrints ID: 439390
URI: http://eprints.soton.ac.uk/id/eprint/439390
PURE UUID: 8ec1af6a-07f9-4e73-9f6f-6aa7a45ebcf2
Catalogue record
Date deposited: 21 Apr 2020 16:30
Last modified: 17 Mar 2024 04:01
Contributors
Author: Aidan Hogg
Author: Christine Evers
Author: Patrick A. Naylor