Recognizing Emotions in Video Using Multimodal DNN Feature Fusion
Williams, Jennifer, Kleinegesse, Steven, Comanescu, Ramona and Radu, Oana (2018) Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In ACL 2018 Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pp. 11-19. Association for Computational Linguistics (ACL). (doi:10.18653/v1/W18-3302).
Record type: Conference or Workshop Item (Paper)
Abstract
We present our system description of input-level multimodal fusion of audio, video, and text for recognition of emotions and their intensities for the 2018 First Grand Challenge on Computational Modeling of Human Multimodal Language. Our proposed approach is based on input-level feature fusion with sequence learning from Bidirectional Long Short-Term Memory (BLSTM) deep neural networks (DNNs). We show that our fusion approach outperforms unimodal predictors. Our system performs 6-way simultaneous classification and regression, allowing for overlapping emotion labels in a video segment. This leads to an overall binary accuracy of 90%, an overall 4-class accuracy of 89.2%, and an overall mean absolute error (MAE) of 0.12. Our work shows that an early fusion technique can effectively predict the presence of multi-label emotions as well as their coarse-grained intensities. The presented multimodal approach creates a simple and robust baseline on this new Grand Challenge dataset. Furthermore, we provide a detailed analysis of emotion intensity distributions as output from our DNN, as well as a related discussion concerning the inherent difficulty of this task.
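
To make the approach described in the abstract concrete, the sketch below shows one plausible way to set up input-level (early) fusion with a BLSTM and joint multi-label classification and regression heads. It is a minimal illustration under stated assumptions, not the authors' code: the time-step count, per-modality feature sizes, layer width, and output activations are all illustrative choices.

# Sketch (not the authors' released code) of early multimodal fusion with a
# BLSTM, assuming per-timestep audio, video, and text features have already
# been extracted and time-aligned. All dimensions below are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

T = 20                                    # assumed time steps per segment
D_AUDIO, D_VIDEO, D_TEXT = 74, 35, 300    # assumed per-modality feature sizes
N_EMOTIONS = 6                            # the paper predicts 6 emotions

audio = layers.Input(shape=(T, D_AUDIO), name="audio")
video = layers.Input(shape=(T, D_VIDEO), name="video")
text = layers.Input(shape=(T, D_TEXT), name="text")

# Early (input-level) fusion: concatenate modality features at each time step
# before any sequence modeling takes place.
fused = layers.Concatenate(axis=-1)([audio, video, text])

# A Bidirectional LSTM summarizes the fused feature sequence.
h = layers.Bidirectional(layers.LSTM(64))(fused)

# Joint outputs: multi-label presence (sigmoid allows overlapping emotions)
# and a per-emotion intensity regression (relu is an assumed choice for
# non-negative intensities).
presence = layers.Dense(N_EMOTIONS, activation="sigmoid", name="presence")(h)
intensity = layers.Dense(N_EMOTIONS, activation="relu", name="intensity")(h)

model = Model([audio, video, text], [presence, intensity])
model.compile(
    optimizer="adam",
    loss={"presence": "binary_crossentropy", "intensity": "mae"},
)
model.summary()

Training such a model on (presence, intensity) label pairs optimizes both heads simultaneously, which matches the "6-way simultaneous classification and regression" framing above; the specific losses and optimizer here are assumptions.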
This record has no associated files available for download.
More information
Published date: 1 July 2018
Identifiers
Local EPrints ID: 470338
URI: http://eprints.soton.ac.uk/id/eprint/470338
PURE UUID: c1498a59-76e5-4423-8c06-a31c03e3bd1f
Catalogue record
Date deposited: 06 Oct 2022 16:55
Last modified: 20 Jul 2024 02:07
Contributors
Author: Jennifer Williams
Author: Steven Kleinegesse
Author: Ramona Comanescu
Author: Oana Radu