Show simple item record

dc.contributor.author     Luna Jiménez, Cristina
dc.contributor.author     Griol Barres, David
dc.contributor.author     Callejas Carrión, Zoraida
dc.contributor.author     Kleinlein, Ricardo
dc.contributor.author     Montero, Juan M.
dc.contributor.author     Fernández Martínez, Fernando
dc.date.accessioned       2021-11-19T08:40:57Z
dc.date.available         2021-11-19T08:40:57Z
dc.date.issued            2021
dc.identifier.citation    Luna-Jiménez, C.; Griol, D.; Callejas, Z.; Kleinlein, R.; Montero, J.M.; Fernández-Martínez, F. Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning. Sensors 2021, 21, 7665. https://doi.org/10.3390/s21227665  es_ES
dc.identifier.uri         http://hdl.handle.net/10481/71614
dc.description.abstract   Emotion recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as healthcare or road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images, followed by a bi-LSTM with an attention mechanism. The error analysis reported that the frame-based systems could present some problems when used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, by combining these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset in a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users’ emotional state and that their combination improves system performance.  es_ES
dc.language.iso           eng  es_ES
dc.publisher              MDPI  es_ES
dc.rights                 Attribution 3.0 Spain
dc.rights.uri             http://creativecommons.org/licenses/by/3.0/es/
dc.subject                Audio–visual emotion recognition  es_ES
dc.subject                Human-computer interaction  es_ES
dc.subject                Computational paralinguistics  es_ES
dc.subject                Spatial transformers  es_ES
dc.subject                Transfer learning  es_ES
dc.subject                Speech emotion recognition  es_ES
dc.subject                Facial emotion recognition  es_ES
dc.title                  Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning  es_ES
dc.type                   journal article  es_ES
dc.rights.accessRights    open access  es_ES
dc.identifier.doi         10.3390/s21227665
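
The late fusion strategy named in the abstract combines the posteriors that the speech and facial recognizers produce independently. Below is a minimal sketch of that idea, assuming a simple weighted average over the eight RAVDESS emotion classes; the `late_fusion` helper, the weight `w_speech`, and the example probability vectors are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# The eight emotion classes annotated in the RAVDESS dataset.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def late_fusion(speech_probs, face_probs, w_speech=0.5):
    """Fuse per-modality class posteriors with a weighted average.

    `speech_probs` and `face_probs` are length-8 probability vectors
    produced independently by the speech and facial recognizers.
    The weight `w_speech` is a hypothetical tuning parameter.
    """
    fused = (w_speech * np.asarray(speech_probs)
             + (1.0 - w_speech) * np.asarray(face_probs))
    return EMOTIONS[int(np.argmax(fused))]

# Example: the speech model leans "angry", the facial model leans "happy";
# with equal weights, the stronger posterior wins.
speech = [0.05, 0.05, 0.10, 0.05, 0.55, 0.05, 0.05, 0.10]
face   = [0.05, 0.05, 0.45, 0.05, 0.20, 0.05, 0.05, 0.10]
print(late_fusion(speech, face))  # -> "angry"
```

In a late fusion scheme like this, each modality is trained and evaluated on its own, and only the final class probabilities are combined, which is what allows the authors to reuse independently pre-trained speech and facial models.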


Files in this item

[PDF]

This item appears in the following collection(s)


Attribution 3.0 Spain
Except where otherwise noted, this item's license is described as Attribution 3.0 Spain