How Much Training Data Is Enough? A Case Study for HTTP Anomaly-Based Intrusion Detection

Estepa, Rafael; Díaz Verdejo, Jesús Esteban; Estepa, Antonio; Madinabeitia, Germán

doi:10.1109/ACCESS.2020.2977591

dc.contributor.author	Estepa, Rafael
dc.contributor.author	Díaz Verdejo, Jesús Esteban
dc.contributor.author	Estepa, Antonio
dc.contributor.author	Madinabeitia, Germán
dc.date.accessioned	2020-05-06T12:03:26Z
dc.date.available	2020-05-06T12:03:26Z
dc.date.issued	2020-03-02
dc.identifier.citation	Estepa, R., Díaz-Verdejo, J. E., Estepa, A., & Madinabeitia, G. (2020). How Much Training Data is Enough? A Case Study for HTTP Anomaly-Based Intrusion Detection. IEEE Access, 8, 44410-44425.	es_ES
dc.identifier.uri	http://hdl.handle.net/10481/61835
dc.description.abstract	Most anomaly-based intrusion detectors rely on models that learn from training datasets whose quality is crucial to their performance. Although the properties of suitable datasets have been formulated, the influence of the dataset size on the performance of the anomaly-based detector has received scarce attention so far. In this work, we investigate the optimal size of a training dataset. This size should be large enough so that training data is representative of normal behavior, but after that point, collecting more data may result in unnecessary waste of time and computational resources, not to mention an increased risk of overtraining. In this spirit, we provide a method to find out when the amount of data collected at the production environment is representative of normal behavior in the context of a detector of HTTP URI attacks based on 1-grammar. Our approach is founded on a set of indicators related to the statistical properties of the data. These indicators are periodically calculated during data collection, producing time series that stabilize when more training data is not expected to translate to better system performance, which indicates that data collection can be stopped. We present a case study with real-life datasets collected at the University of Seville (Spain) and a public dataset from the University of Saskatchewan. The application of our method to these datasets showed that more than 42% of one trace and almost 20% of another were unnecessarily collected, thereby showing that our proposed method can be an efficient approach for collecting training data at the production environment.	es_ES
dc.description.sponsorship	This work was supported in part by the Corporación Tecnológica de Andalucía and the University of Seville through the Projects under Grant CTA 1669/22/2017, Grant PI-1786/22/2018, and Grant PI-1736/22/2017.	es_ES
dc.language.iso	eng	es_ES
dc.publisher	IEEE	es_ES
dc.rights	Atribución 3.0 España	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/	*
dc.subject	Anomaly-based intrusion detection	es_ES
dc.subject	Dataset assessment	es_ES
dc.title	How Much Training Data Is Enough? A Case Study for HTTP Anomaly-Based Intrusion Detection	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es_ES
dc.identifier.doi	10.1109/ACCESS.2020.2977591
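
The stopping idea summarized in the abstract (compute indicators periodically while collecting, and stop once their time series stabilizes) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the indicators, so this example assumes a single hypothetical indicator, the total variation distance between the cumulative character distribution of the URIs before and after each batch. The names `batches_needed`, `is_stable`, `window`, and `tol` are likewise illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def distribution(counts: Counter) -> Dict[str, float]:
    """Turn raw symbol counts into relative frequencies."""
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()} if total else {}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    symbols = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)


def is_stable(series: List[float], window: int, tol: float) -> bool:
    """True when the last `window` indicator values are all at most `tol`."""
    if len(series) < window:
        return False
    return all(value <= tol for value in series[-window:])


def batches_needed(
    batches: Iterable[List[str]], window: int = 3, tol: float = 0.01
) -> Optional[int]:
    """Return the number of batches consumed when the indicator stabilizes.

    Assumption: each batch is a list of URI strings gathered during one
    collection period. Returns None if the series never stabilizes.
    """
    counts: Counter = Counter()
    previous: Dict[str, float] = {}
    drift: List[float] = []
    for i, batch in enumerate(batches, start=1):
        for uri in batch:
            counts.update(uri)  # count single characters of the URI
        current = distribution(counts)
        drift.append(total_variation(previous, current))
        previous = current
        if is_stable(drift, window, tol):
            return i
    return None
```

With ten identical batches, the first batch produces a large drift and every later batch produces none, so with `window=3` the sketch reports that collection could stop after the fourth batch. Real traffic would call for several indicators and a tolerance chosen empirically.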

Files in this item

Name: 09019687.pdf
Size: 5.780 MB
Format: PDF

This item appears in the following collection(s)

DTSTC - Artículos

Except where otherwise noted, this item's license is described as Atribución 3.0 España