Recognizing Emotions in Video Using Multimodal DNN Feature Fusion
Williams, Jennifer, Kleinegesse, Steven, Comanescu, Ramona and Radu, Oana (2018) Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In ACL 2018 Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pp. 11-19. Association for Computational Linguistics (ACL). (doi:10.18653/v1/W18-3302).
Record type: Conference or Workshop Item (Paper)
Abstract
We present our system description of input-level multimodal fusion of audio, video, and text for recognition of emotions and their intensities for the 2018 First Grand Challenge on Computational Modeling of Human Multimodal Language. Our proposed approach is based on input-level feature fusion with sequence learning from Bidirectional Long Short-Term Memory (BLSTM) deep neural networks (DNNs). We show that our fusion approach outperforms unimodal predictors. Our system performs 6-way simultaneous classification and regression, allowing for overlapping emotion labels in a video segment. This leads to an overall binary accuracy of 90%, an overall 4-class accuracy of 89.2%, and an overall mean absolute error (MAE) of 0.12. Our work shows that an early fusion technique can effectively predict the presence of multi-label emotions as well as their coarse-grained intensities. The presented multimodal approach creates a simple and robust baseline on this new Grand Challenge dataset. Furthermore, we provide a detailed analysis of emotion intensity distributions as output from our DNN, as well as a related discussion concerning the inherent difficulty of this task.
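
To make the approach described in the abstract concrete, the sketch below shows one plausible way to set up input-level (early) fusion with a BLSTM and joint multi-label classification and regression heads. It is a minimal illustration under stated assumptions, not the authors' code: the time-step count, per-modality feature sizes, layer width, and output activations are all illustrative choices.

# Sketch (not the authors' released code) of early multimodal fusion with a
# BLSTM, assuming per-timestep audio, video, and text features have already
# been extracted and time-aligned. All dimensions below are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

T = 20                                    # assumed time steps per segment
D_AUDIO, D_VIDEO, D_TEXT = 74, 35, 300    # assumed per-modality feature sizes
N_EMOTIONS = 6                            # the paper predicts 6 emotions

audio = layers.Input(shape=(T, D_AUDIO), name="audio")
video = layers.Input(shape=(T, D_VIDEO), name="video")
text = layers.Input(shape=(T, D_TEXT), name="text")

# Early (input-level) fusion: concatenate modality features at each time step
# before any sequence modeling takes place.
fused = layers.Concatenate(axis=-1)([audio, video, text])

# A Bidirectional LSTM summarizes the fused feature sequence.
h = layers.Bidirectional(layers.LSTM(64))(fused)

# Joint outputs: multi-label presence (sigmoid allows overlapping emotions)
# and a per-emotion intensity regression (relu is an assumed choice for
# non-negative intensities).
presence = layers.Dense(N_EMOTIONS, activation="sigmoid", name="presence")(h)
intensity = layers.Dense(N_EMOTIONS, activation="relu", name="intensity")(h)

model = Model([audio, video, text], [presence, intensity])
model.compile(
    optimizer="adam",
    loss={"presence": "binary_crossentropy", "intensity": "mae"},
)
model.summary()

Training such a model on (presence, intensity) label pairs optimizes both heads simultaneously, which matches the "6-way simultaneous classification and regression" framing above; the specific losses and optimizer here are assumptions.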
This record has no associated files available for download.
More information
Published date: 1 July 2018
Identifiers
Local EPrints ID: 470338
URI: http://eprints.soton.ac.uk/id/eprint/470338
PURE UUID: c1498a59-76e5-4423-8c06-a31c03e3bd1f
Catalogue record
Date deposited: 06 Oct 2022 16:55
Last modified: 20 Jul 2024 02:07
Contributors
Author: Jennifer Williams
Author: Steven Kleinegesse
Author: Ramona Comanescu
Author: Oana Radu