University of Southampton Institutional Repository

Effects of varying noise levels and lighting levels on multimodal speech and visual gesture interaction with aerobots

Abioye, Ayodeji, Opeyemi
dbab5a92-c958-442f-b102-8ca2e4cf874f
Prior, Stephen
9c753e49-092a-4dc5-b4cd-6d5ff77e9ced
Saddington, Peter
17445ed8-ebf1-48fa-a3f5-b30cdc61f5d3
Ramchurn, Sarvapali
1d62ae2a-a498-444e-912d-a6082d3aaea3

Abioye, Ayodeji, Opeyemi, Prior, Stephen, Saddington, Peter and Ramchurn, Sarvapali (2019) Effects of varying noise levels and lighting levels on multimodal speech and visual gesture interaction with aerobots. Applied Sciences, 9 (10), 1-29. (doi:10.3390/app9102066).

Record type: Article

Abstract

This paper investigated the effects of varying noise levels and varying lighting levels on speech and gesture control command interfaces for aerobots. The aim was to determine the practical suitability of the multimodal combination of speech and visual gesture in human aerobotic interaction by investigating the limits and feasibility of use of the individual components. To determine this, a custom multimodal speech and visual gesture interface was developed using the CMU (Carnegie Mellon University) Sphinx and OpenCV (Open Source Computer Vision) libraries, respectively. An experimental study was designed to measure the individual effects of each of the two main components, speech and gesture, and 37 participants were recruited for the experiment. The ambient noise level was varied from 55 dB to 85 dB. The ambient lighting level was varied from 10 lux to 1400 lux, under different lighting colour temperature mixtures of yellow (3500 K) and white (5500 K), and with different backgrounds for capturing the finger gestures. The results of the experiment, which comprised around 3108 speech utterance observations and 999 gesture quality observations, were presented and discussed. Speech recognition accuracy (success rate) fell as noise levels rose, with 75 dB marking the aerobot's practical application limit, as speech control interaction became very unreliable beyond this point due to poor recognition. It was concluded that multi-word speech commands were more reliable and effective than single-word speech commands. In addition, some speech command words (e.g., land) were more noise resistant than others (e.g., hover) at higher noise levels, owing to their articulation. From the results of the gesture-lighting experiment, the effects of both the lighting conditions and the environment background on the quality of gesture recognition were almost insignificant, at less than 0.5%.
The implication is that other factors, such as the gesture capture system design and technology (camera and computer hardware), the type of gesture being captured (upper body, whole body, hand, finger, or facial gestures), and the image processing technique (gesture classification algorithms), are more important in developing a successful gesture recognition system. Further work was suggested based on these conclusions, including using alternative ASR (Automatic Speech Recognition) speech models and developing more robust gesture recognition algorithms.
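The abstract's finding that multi-word commands were more reliable than single-word ones can be illustrated with a minimal sketch of command-phrase matching. The vocabulary, command names, and `parse_command` helper below are hypothetical illustrations, not the authors' implementation; in a real pipeline the utterance string would come from an ASR engine such as CMU Sphinx.

```python
from typing import Optional

# Hypothetical command vocabulary (illustrative names only). A multi-word
# phrase such as "take off" requires every word to be recognised, so a
# spurious match on a single noisy word is not enough to trigger it,
# mirroring the paper's observation about multi-word command reliability.
AEROBOT_COMMANDS = {
    ("take", "off"): "TAKEOFF",
    ("move", "forward"): "FORWARD",
    ("turn", "left"): "YAW_LEFT",
    ("land",): "LAND",    # single-word commands are more exposed to noise
    ("hover",): "HOVER",
}

def parse_command(utterance: str) -> Optional[str]:
    """Map a recognised utterance to an aerobot command, or return None
    when no exact phrase in the vocabulary matches."""
    words = tuple(utterance.lower().split())
    return AEROBOT_COMMANDS.get(words)

print(parse_command("take off"))  # -> TAKEOFF
print(parse_command("take"))      # -> None (partial phrase rejected)
```

An unrecognised or partial phrase returns None rather than a command, so a misheard fragment fails safe instead of moving the vehicle.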

Text
abioye_et_al_applsci_v10_resized - Accepted Manuscript
Restricted to Repository staff only
Request a copy
Text
applsci-09-02066 - Version of Record
Available under License Creative Commons Attribution.
Download (20MB)

More information

Submitted date: 15 March 2019
Accepted/In Press date: 14 May 2019
Published date: 19 May 2019
Keywords: Speech, Visual Gesture, UAV, mSVG, Aerobot

Identifiers

Local EPrints ID: 431055
URI: https://eprints.soton.ac.uk/id/eprint/431055
ISSN: 2076-3417
PURE UUID: 8ecb2573-7b9c-49f4-b9a6-afe7843707cf
ORCID for Ayodeji Opeyemi Abioye: ORCID iD orcid.org/0000-0003-4637-3278
ORCID for Stephen Prior: ORCID iD orcid.org/0000-0002-4993-4942
ORCID for Sarvapali Ramchurn: ORCID iD orcid.org/0000-0001-9686-4302

Catalogue record

Date deposited: 22 May 2019 16:30
Last modified: 24 Jul 2019 00:35


Contributors

Author: Ayodeji Opeyemi Abioye ORCID iD
Author: Stephen Prior ORCID iD
Author: Peter Saddington
Author: Sarvapali Ramchurn ORCID iD



