A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset
Metadata
Publisher
MDPI
Subject
Audio-visual emotion recognition; Human-computer interaction; Computational paralinguistics; xlsr-Wav2Vec2.0 Transformer; Transfer learning; Action Units; RAVDESS; Speech emotion recognition; Facial emotion recognition
Date
2021-12-30
Bibliographic reference
Luna-Jiménez, C. [et al.]. A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset. Appl. Sci. 2022, 12, 327. https://doi.org/10.3390/app12010327
Sponsorship
Spanish Government PID2020-118112RB-C21, PID2020-118112RB-C22, MCIN/AEI/10.13039/501100011033, TEC2017-84593-C2-1-R, MCIN/AEI/10.13039/501100011033/FEDER, PDC2021-120846-C42; European Union "NextGenerationEU/PRTR"; European Union's Horizon 2020 research and innovation programme 823907; German Research Foundation (DFG); PRE2018-083225
Abstract
Emotion recognition is attracting the attention of the research community due to its multiple
applications in different fields, such as medicine or autonomous driving. In this paper, we proposed
an automatic emotion recognition system consisting of a speech emotion recognizer (SER) and a
facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer
using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy
was achieved when we fine-tuned the whole model with a multilayer perceptron appended
on top of it, confirming that training is more robust when it does not start from scratch and the
network's prior knowledge is related to the target task. Regarding the facial emotion
recognizer, we extracted the Action Units of the videos and compared the performance of
static models against sequential models. The results showed that sequential models outperformed
static models by a narrow margin. Error analysis indicated that the visual system could be improved
with a detector of frames with high emotional load, which opens a new line of research into
ways of learning from videos. Finally, by combining these two modalities with a late fusion strategy, we
achieved 86.70% accuracy on the RAVDESS dataset under a subject-wise 5-fold cross-validation (5-CV),
classifying eight emotions. The results demonstrate that both modalities carry relevant information about the
users' emotional state and that their combination improves the final system performance.
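The sketch below illustrates, under stated assumptions, the two ideas summarized in the abstract: adapting a pre-trained xlsr-Wav2Vec2.0 encoder with a small MLP head for eight-way speech emotion classification, and combining the aural and visual posteriors with a late (score-level) weighted average. It is not the authors' implementation; the backbone checkpoint name, head sizes, pooling strategy, and fusion weight are illustrative assumptions.

```python
# Minimal sketch (assumptions: checkpoint name, MLP sizes, mean pooling, fusion weight).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

NUM_EMOTIONS = 8  # RAVDESS: neutral, calm, happy, sad, angry, fearful, disgust, surprised


class SpeechEmotionRecognizer(nn.Module):
    def __init__(self, backbone_name="facebook/wav2vec2-large-xlsr-53", freeze_backbone=False):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(backbone_name)
        if freeze_backbone:
            # "Embedding extraction" variant: keep the encoder weights fixed.
            for p in self.backbone.parameters():
                p.requires_grad = False
        hidden = self.backbone.config.hidden_size
        # MLP appended on top of the transformer, as described in the abstract.
        self.head = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, NUM_EMOTIONS),
        )

    def forward(self, waveform):
        # waveform: (batch, samples), 16 kHz mono audio.
        hidden_states = self.backbone(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden_states.mean(dim=1)                         # average pooling over time
        return self.head(pooled)                                   # emotion logits


def late_fusion(speech_logits, face_logits, alpha=0.5):
    """Combine the two modalities by weighted averaging of their class posteriors."""
    p_speech = torch.softmax(speech_logits, dim=-1)
    p_face = torch.softmax(face_logits, dim=-1)
    return alpha * p_speech + (1 - alpha) * p_face  # fused class probabilities
```

Setting freeze_backbone=True corresponds to the embedding-extraction variant, while leaving it False fine-tunes the whole encoder together with the MLP head; the fusion weight alpha would in practice be tuned on validation data.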