Show simple item record

dc.contributor.author: Luna Jiménez, Cristina
dc.contributor.author: Griol Barres, David
dc.contributor.author: Callejas Carrión, Zoraida
dc.date.accessioned: 2022-03-18T08:38:33Z
dc.date.available: 2022-03-18T08:38:33Z
dc.date.issued: 2021-12-30
dc.identifier.citation: Luna-Jiménez, C... [et al.]. A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset. Appl. Sci. 2022, 12, 327. https://doi.org/10.3390/app12010327
dc.identifier.uri: http://hdl.handle.net/10481/73537
dc.description: The work leading to these results was supported by the Spanish Ministry of Science and Innovation through the projects GOMINOLA (PID2020-118112RB-C21 and PID2020-118112RB-C22, funded by MCIN/AEI/10.13039/501100011033), CAVIAR (TEC2017-84593-C2-1-R, funded by MCIN/AEI/10.13039/501100011033/FEDER "Una manera de hacer Europa"), and AMIC-PoC (PDC2021-120846-C42, funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR"). This research also received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 823907 (http://menhir-project.eu, accessed on 17 November 2021). Furthermore, R.K.'s research was supported by the Spanish Ministry of Education (FPI grant PRE2018-083225).
dc.description.abstract: Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we propose an automatic emotion recognition system consisting of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy was achieved when we fine-tuned the whole model with a multilayer perceptron appended on top of it, confirming that training is more robust when it does not start from scratch and the network's prior knowledge is close to the target task. For the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance of static models against sequential models. Results showed that sequential models beat static models by a narrow margin. Error analysis suggested that the visual system could improve with a detector of frames with high emotional load, opening a new line of research into ways of learning from videos. Finally, combining these two modalities with a late-fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset in a subject-wise 5-CV evaluation, classifying eight emotions. These results demonstrate that both modalities carry relevant information for detecting users' emotional state and that their combination improves the final system's performance.
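The late-fusion strategy described in the abstract can be sketched as a weighted average of the per-class posteriors produced by the speech and facial recognizers. This is a minimal illustration, not the paper's implementation: the emotion label order, the fusion weight, and the posterior values below are illustrative assumptions.

```python
import numpy as np

# The eight RAVDESS emotion classes.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def late_fusion(ser_probs, fer_probs, w_ser=0.5):
    """Fuse class posteriors from the speech (SER) and facial (FER)
    recognizers by weighted averaging, then renormalize.
    w_ser is an assumed hyperparameter, not a value from the paper."""
    ser = np.asarray(ser_probs, dtype=float)
    fer = np.asarray(fer_probs, dtype=float)
    fused = w_ser * ser + (1.0 - w_ser) * fer
    return fused / fused.sum()

# Hypothetical posteriors for a single clip.
ser = [0.05, 0.05, 0.60, 0.05, 0.10, 0.05, 0.05, 0.05]
fer = [0.10, 0.05, 0.40, 0.05, 0.25, 0.05, 0.05, 0.05]
fused = late_fusion(ser, fer)
print(EMOTIONS[int(np.argmax(fused))])  # prints "happy": both modalities agree
```

A late fusion like this keeps the two modality-specific models fully decoupled, so each can be trained and tuned independently before their decisions are combined.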
dc.description.sponsorship: Spanish Government PID2020-118112RB-C21, PID2020-118112RB-C22, TEC2017-84593-C2-1-R, PDC2021-120846-C42 (MCIN/AEI/10.13039/501100011033, FEDER)
dc.description.sponsorship: European Union "NextGenerationEU/PRTR"
dc.description.sponsorship: European Union's Horizon 2020 research and innovation program 823907
dc.description.sponsorship: Spanish Ministry of Education (FPI grant) PRE2018-083225
dc.language.iso: eng
dc.publisher: MDPI
dc.rights: Atribución 3.0 España (Attribution 3.0 Spain)
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/
dc.subject: Audio-visual emotion recognition
dc.subject: Human-computer interaction
dc.subject: Computational paralinguistics
dc.subject: xlsr-Wav2Vec2.0 transformer
dc.subject: Transformer
dc.subject: Transfer learning
dc.subject: Action Units
dc.subject: RAVDESS
dc.subject: Speech emotion recognition
dc.subject: Facial emotion recognition
dc.title: A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset
dc.type: journal article
dc.relation.projectID: info:eu-repo/grantAgreement/EC/H2020/823907
dc.rights.accessRights: open access
dc.identifier.doi: 10.3390/app12010327
dc.type.hasVersion: VoR

