ChatSubs: A dataset of dialogues in Spanish, Catalan, Basque and Galician extracted from movie subtitles for developing advanced conversational models

Kharitonova, Ksenia; Callejas Carrión, Zoraida; Pérez Fernández, David; Gutiérrez Fandiño, Asier; Griol Barres, David

doi:10.1016/j.dib.2023.109565

Identifiers

URI: https://hdl.handle.net/10481/85546

DOI: 10.1016/j.dib.2023.109565

Publisher

Elsevier

Subject

Dialogue

Conversation

Chatbots

Conversational AI

Speech

Natural language processing

Date

2023-09-14

Bibliographic reference

K. Kharitonova, Z. Callejas and D. Pérez-Fernández et al. / Data in Brief 50 (2023) 109565 [https://doi.org/10.1016/j.dib.2023.109565]

Sponsorship

CONVERSA (TED2021-132470B-I00) funded by MCIN/AEI/10.13039/501100011033; European Union NextGenerationEU/PRTR

Abstract

The ChatSubs dataset [5] contains dialogue data in Spanish and three of Spain’s co-official languages (Catalan, Basque, and Galician). It was obtained from OpenSubtitles: we gathered the movie subtitles in our languages of interest and processed them to generate clearly segmented dialogues and their turns. The data processing code is publicly accessible. The result is 206,706 JSON files with more than 20 million dialogues and 96 million turns, which makes it one of the largest dialogue corpora available, as other similar datasets in better-resourced languages do not reach 500k dialogues or present less clearly defined conversations. Thus, the ChatSubs dataset is an ideal resource for research teams interested in training dialogue models in Spanish, Catalan, Basque, and Galician.

Collections

OpenAIRE (Open Access Infrastructure for Research in Europe)

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International