A linguistically-aware computational approach to microtext location detection

Fernández Martínez, Nicolás José

Identifiers

URI: http://hdl.handle.net/10481/64577

ISBN: 978-84-1306-680-6

Publisher

Universidad de Granada

Supervisor

Felices Lago, Ángel Miguel

Department

Universidad de Granada. Programa de Doctorado en Lenguas, Textos y Contextos

Subject

Location detection

Location extraction

Geolocation

Named-entity recognition

Natural language processing

Deep learning

Emergencies

Disasters

Date

2020

Defense date

2020-10-21

Bibliographic reference

Fernández Martínez, Nicolás José. A linguistically-aware computational approach to microtext location detection. Granada: Universidad de Granada, 2020. [http://hdl.handle.net/10481/64577]

Sponsor

Thesis, Univ. Granada.

Abstract

Extracting geospatially rich knowledge from microtexts such as tweets is of utmost importance for location-based systems in emergency services, which use it to raise situational awareness about a given emergency (i.e. natural or man-made disasters), such as earthquakes, floods, pandemics, car accidents, terrorist attacks, shootings, etc. (Vieweg et al., 2010; Crooks et al., 2013; Imran et al., 2014; Jongman et al., 2015; Martínez-Rojas et al., 2018; C. Zhang et al., 2019; Siriaraya et al., 2019). In other words, emergency responders and competent authorities need to understand where the incident happened, where people are in need of help, and/or which areas were affected, with the aim of coordinating effective and immediate aid and allocating resources in the affected areas and/or to the affected persons. Such systems could potentially help save lives and/or prevent further damage to environmental or urban areas in emergency- and crisis-related contexts. The problem is that the vast majority of tweets are not geotagged (Middleton et al., 2014), so we need to resort to the message text in search of geospatial evidence (Wallgrün et al., 2018). In this context, we present LORE, a multilingual, rule-based location-detection system for English, Spanish, and French tweets that leverages lexical datasets of place names and location-indicative words together with linguistic knowledge through Natural Language Processing and computational techniques. We also present nLORE, a Deep Learning model that draws on the linguistic knowledge provided by LORE. One of the main contributions of our models is to capture fine-grained complex locative references, ranging from geopolitical entities (e.g. towns, cities, regions, countries, etc.) and natural landforms (e.g. mountains, rivers, lakes, hills, valleys, etc.) to points of interest (e.g. squares, cathedrals, universities, residences, restaurants, museums, etc.) and traffic ways (e.g. streets, avenues, roads, highways, etc.).
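To make the idea of combining place-name lexicons with location-indicative words more concrete, the following is a minimal Python sketch. It is not LORE's implementation: the tiny `GAZETTEER` and `LOCATIVE_NOUNS` sets, the function name `detect_locations`, and both rules are hypothetical and chosen purely for illustration, and the thesis itself should be consulted for the actual rules and resources.

```python
import re

# Hypothetical toy resources for illustration only; not LORE's actual datasets.
GAZETTEER = {"granada", "madrid", "paris", "sierra nevada"}
LOCATIVE_NOUNS = {"street", "avenue", "road", "highway", "square",
                  "river", "lake", "cathedral", "university", "museum"}

TOKEN_RE = re.compile(r"\w+")


def detect_locations(text, max_ngram=3):
    """Return candidate locative references found by two simple rules."""
    tokens = TOKEN_RE.findall(text)
    found = []
    i = 0
    while i < len(tokens):
        # Rule 1: longest n-gram that appears in the place-name lexicon.
        matched = False
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in GAZETTEER:
                found.append(phrase)
                i += n
                matched = True
                break
        if matched:
            continue

        # Rule 2: one or more capitalized tokens followed by a
        # location-indicative noun, e.g. "Baker Street".
        j = i
        while (j < len(tokens) and tokens[j][0].isupper()
               and tokens[j].lower() not in LOCATIVE_NOUNS):
            j += 1
        if i < j < len(tokens) and tokens[j].lower() in LOCATIVE_NOUNS:
            found.append(" ".join(tokens[i:j + 1]))
            i = j + 1
            continue

        i += 1
    return found


print(detect_locations("Flooding near Baker Street in Granada, roads closed"))
```

In this sketch the lexicon lookup catches known place names, while the second rule picks up finer-grained references such as traffic ways and points of interest that no lexicon is likely to list exhaustively. A real system would also need language-specific patterns (for instance, Spanish and French typically place the locative noun before the name) and more careful tokenization.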
LORE outperforms well-known, general-purpose, off-the-shelf entity-recognizer systems typically used in benchmarking (Schmitt et al., 2019): Stanford NER, spaCy, NLTK, OpenNLP, Google Natural Language Cloud, and Stanza. LORE achieves an unprecedented trade-off between precision and recall, while showing similar performance when applied to other corpora. nLORE outperforms LORE by a slight margin, and confirms the usefulness of linguistic-based feature engineering in Artificial Intelligence (Linzen, 2019). Therefore, our models provide not only a quantitative advantage over other well-known entity-recognizer systems in terms of performance and accuracy but also a qualitative advantage in terms of the diversity and semantic granularity of the locative references extracted from the tweets.
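As a reminder of what a precision/recall trade-off measures, the sketch below computes precision, recall, and their harmonic mean (F1) for a set of predicted locative references against a gold-standard set. The function name `precision_recall_f1`, the exact-match criterion, and the example data are assumptions made for illustration; they do not reproduce the evaluation protocol or the results of the thesis.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of references."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted references that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold references that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of the two, zero when both are zero.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example data: two of three predictions are correct,
# and two of four gold references are recovered.
predicted = ["Baker Street", "Granada", "Flooding"]
gold = ["Baker Street", "Granada", "Sierra Nevada", "Madrid"]
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A system that only trusts a small lexicon tends toward high precision and low recall, while one that accepts any capitalized phrase tends toward the opposite; F1 summarizes how well the two are balanced.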

Collections

Theses

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Spain