Generating language distance metrics by language recognition using acoustic features
A language recognition system is used to build a quantitative measure of language distance. The OpenEAR toolkit is used to extract more than 6,000 features per speech sample. The features consist of 56 low-level descriptors (LLDs), their delta and delta-delta values, and the corresponding 39 functionals. The language model training component is based on the Gentle AdaBoost algorithm. When tested on a group of 10 principally Indo-European languages, the language recognition system performs comparably to other language recognizers.
The UPGMA tree built from the interlanguage distances identifies the major subgroups of Indo-European. Genetic algorithms are also implemented to generate a language map on the 2D plane. Although some errors remain, the resulting language tree and map are indicators of language relationships. We discuss errors in our system and, more generally, the prospects for using sound-file classifiers in historical linguistics.
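As an illustration of the tree-building step named in the abstract, the following is a minimal sketch of UPGMA (average-linkage) clustering over a matrix of interlanguage distances. It is not the authors' implementation: the function name `upgma`, the four languages, and the distance values are hypothetical and chosen only to make the example run.

```python
from itertools import combinations


def upgma(labels, dist):
    """Average-linkage (UPGMA) clustering.

    labels: list of language names.
    dist:   dict mapping frozenset({name_a, name_b}) -> distance.
    Returns (tree, merges): a nested-tuple tree and the list of
    (left_subtree, right_subtree, merge_distance) in merge order.
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(labels)}
    # Pairwise distances between active clusters, keyed by id pair.
    d = {
        frozenset((i, j)): dist[frozenset((labels[i], labels[j]))]
        for i, j in combinations(range(len(labels)), 2)
    }
    next_id = len(labels)
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merges.append((tree_a, tree_b, d.pop(pair)))
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances (the UPGMA rule).
        for k in clusters:
            d_ak = d.pop(frozenset((a, k)))
            d_bk = d.pop(frozenset((b, k)))
            d[frozenset((next_id, k))] = (n_a * d_ak + n_b * d_bk) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    ((tree, _),) = clusters.values()
    return tree, merges


# Hypothetical distances, for illustration only (not results from the paper).
languages = ["Spanish", "Italian", "German", "Dutch"]
toy_dist = {
    frozenset(("Spanish", "Italian")): 0.2,
    frozenset(("Spanish", "German")): 0.8,
    frozenset(("Spanish", "Dutch")): 0.9,
    frozenset(("Italian", "German")): 0.7,
    frozenset(("Italian", "Dutch")): 0.8,
    frozenset(("German", "Dutch")): 0.3,
}
tree, merges = upgma(languages, toy_dist)
print(tree)
```

With these toy values the two closest pairs are merged first and then joined at the root, so the printed tree groups Spanish with Italian and German with Dutch.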
Sun, Le
Hu, Roland
Yu, Huimin
Sluckin, T. J.
Sun, Le, Hu, Roland, Yu, Huimin and Sluckin, T. J.
(2016)
Generating language distance metrics by language recognition using acoustic features.
In 2016 8th International Conference on Wireless Communications & Signal Processing (WCSP).
IEEE.
(doi:10.1109/WCSP.2016.7752528).
Record type:
Conference or Workshop Item
(Paper)
More information
Accepted/In Press date: 1 September 2016
e-pub ahead of print date: 24 November 2016
Additional Information:
Electronic ISBN: 978-1-5090-2860-3
USB ISBN: 978-1-5090-2859-7
Print on Demand(PoD) ISBN: 978-1-5090-2861-0
Venue - Dates:
8th International Conference on Wireless Communications & Signal Processing, Yangzhou, China, 2016-10-13 - 2016-10-15
Organisations:
Applied Mathematics
Identifiers
Local EPrints ID: 406381
URI: http://eprints.soton.ac.uk/id/eprint/406381
PURE UUID: abe756dc-14bf-4fbe-898b-babf6733d96d
Catalogue record
Date deposited: 10 Mar 2017 10:46
Last modified: 16 Mar 2024 02:32
Contributors
Author:
Le Sun
Author:
Roland Hu
Author:
Huimin Yu