Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning
Metadata
Author
Luna Jiménez, Cristina; Griol Barres, David; Callejas Carrión, Zoraida; Kleinlein, Ricardo; Montero, Juan M.; Fernández Martínez, Fernando
Publisher
MDPI
Subject
Audio–visual emotion recognition; Human–computer interaction; Computational paralinguistics; Spatial transformers; Transfer learning; Speech emotion recognition; Facial emotion recognition
Date
2021
Bibliographic reference
Luna-Jiménez, C.; Griol, D.; Callejas, Z.; Kleinlein, R.; Montero, J.M.; Fernández-Martínez, F. Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning. Sensors 2021, 21, 7665. https://doi.org/10.3390/s21227665
Abstract
Emotion recognition is attracting the attention of the research community due to the
many areas where it can be applied, such as healthcare or road safety systems. In this paper,
we propose a multimodal emotion recognition system that relies on speech and facial information.
For the speech-based modality, we evaluated several transfer-learning techniques, specifically
embedding extraction and fine-tuning. The best accuracy was achieved when we fine-tuned
the CNN-14 of the PANNs framework, confirming that the training was more robust when it did not
start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a
framework that consists of a Spatial Transformer Network pre-trained on saliency maps and facial
images, followed by a bi-LSTM with an attention mechanism. The error analysis showed that the
frame-based systems could present some problems when used directly to solve a video-based task despite
the domain adaptation, which opens a new line of research into ways to correct this mismatch and take
advantage of the knowledge embedded in these pre-trained models.
Finally, by combining these two modalities with a late fusion strategy, we achieved 80.08%
accuracy on the RAVDESS dataset in a subject-wise 5-fold cross-validation (5-CV) evaluation, classifying
eight emotions. The results revealed that both modalities carry relevant information about the users'
emotional state and that their combination improves system performance.
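
The sketch below (illustrative only, not taken from the paper) contrasts the two transfer-learning strategies evaluated for the speech modality, embedding extraction and fine-tuning, using a stand-in PyTorch backbone. The class name, layer sizes, and learning rates are assumptions for the example, not the actual PANNs CNN-14 configuration.

import torch
import torch.nn as nn

class PretrainedAudioCNN(nn.Module):
    # Stand-in for a PANNs-style CNN that maps a log-mel spectrogram
    # of shape (batch, 1, time, mels) to a fixed-size embedding.
    def __init__(self, embed_dim: int = 2048):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return self.features(x)

backbone = PretrainedAudioCNN()   # in practice, weights loaded from a pre-trained checkpoint
classifier = nn.Linear(2048, 8)   # 8 RAVDESS emotion classes

# (a) Embedding extraction: freeze the backbone and train only the classifier.
for p in backbone.parameters():
    p.requires_grad = False
optimizer_extraction = torch.optim.Adam(classifier.parameters(), lr=1e-3)

# (b) Fine-tuning: update backbone and classifier jointly,
# typically with a smaller learning rate for the pre-trained weights.
for p in backbone.parameters():
    p.requires_grad = True
optimizer_finetune = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": classifier.parameters(), "lr": 1e-3},
])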
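
A minimal sketch of the video-level aggregation described for the facial modality: per-frame embeddings (assumed here to come from the pre-trained Spatial Transformer Network) are fed to a bi-LSTM whose hidden states are pooled with a simple attention mechanism. The dimensions and the attention form are assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    # Classifies a sequence of per-frame embeddings into 8 emotions.
    def __init__(self, feat_dim: int = 512, hidden: int = 128, n_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scores each time step
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):                          # frames: (batch, T, feat_dim)
        h, _ = self.lstm(frames)                        # (batch, T, 2 * hidden)
        weights = torch.softmax(self.attn(h), dim=1)    # (batch, T, 1) attention weights
        pooled = (weights * h).sum(dim=1)               # attention-weighted sum over time
        return self.out(pooled)                         # (batch, n_classes) logits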
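
Finally, a minimal sketch of a late-fusion step, assuming the simplest case of a weighted average of the per-modality class posteriors; the weight alpha and the helper name are hypothetical, and in practice the fusion weights would be tuned on validation data.

import numpy as np

EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

def late_fusion(speech_probs: np.ndarray, face_probs: np.ndarray, alpha: float = 0.5) -> int:
    # speech_probs, face_probs: softmax posteriors of shape (8,) produced
    # independently by the speech and facial recognizers.
    fused = alpha * speech_probs + (1.0 - alpha) * face_probs
    return int(np.argmax(fused))

# Example: the facial model is confident about "happy", the speech model is uncertain.
speech = np.full(8, 1 / 8)
face = np.zeros(8)
face[2] = 1.0
print(EMOTIONS[late_fusion(speech, face)])   # -> happy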