<rdf:RDF xmlns:rdf="http://www.openarchives.org/OAI/2.0/rdf/" xmlns:ow="http://www.ontoweb.org/ontology/1#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ds="http://dspace.org/ds/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:doc="http://www.lyncode.com/xoai" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/rdf/ http://www.openarchives.org/OAI/2.0/rdf.xsd">
   <ow:Publication rdf:about="oai:digibug.ugr.es:10481/88558">
      <dc:title>esCorpius-m: A massive multilingual crawling corpus with a focus on Spanish</dc:title>
      <dc:creator>Gutiérrez Fandiño, Asier</dc:creator>
      <dc:creator>Pérez Fernández, David</dc:creator>
      <dc:creator>Armengol-Estapé, Jordi</dc:creator>
      <dc:creator>Griol Barres, David</dc:creator>
      <dc:creator>Kharitonova, Ksenia</dc:creator>
      <dc:creator>Callejas Carrión, Zoraida</dc:creator>
      <dc:subject>Corpus</dc:subject>
      <dc:subject>Dataset</dc:subject>
      <dc:subject>Massive</dc:subject>
      <dc:description>In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling. However, there are notable limitations in the results for some languages, including Spanish. These datasets are either smaller compared to other languages or suffer from lower quality due to insufficient cleaning and deduplication. In this paper, we present ESCORPIUS-M, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. It is the most extensive corpus for some languages with such a level of high-quality content extraction, cleanliness, and deduplication. Our data curation process involves an efficient cleaning pipeline and various deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared origin URL.</dc:description>
      <dc:date>2024-02-07T10:58:59Z</dc:date>
      <dc:date>2023</dc:date>
      <dc:type>journal article</dc:type>
      <dc:identifier>https://hdl.handle.net/10481/88558</dc:identifier>
      <dc:identifier>10.3390/app132212155</dc:identifier>
      <dc:language>eng</dc:language>
      <dc:relation>MCIN/AEI/10.13039/501100011033</dc:relation>
      <dc:relation>NextGenerationEU/PRTR</dc:relation>
      <dc:rights>http://creativecommons.org/licenses/by-nc/4.0/</dc:rights>
      <dc:rights>open access</dc:rights>
      <dc:rights>Attribution-NonCommercial 4.0 International</dc:rights>
      <dc:publisher>MDPI</dc:publisher>
   </ow:Publication>
</rdf:RDF>