How Much Training Data Is Enough? A Case Study for HTTP Anomaly-Based Intrusion Detection
Metadata
Publisher
IEEE
Subject
Anomaly-based intrusion detection; Dataset assessment
Date
2020-03-02
Bibliographic reference
Estepa, R., Díaz-Verdejo, J. E., Estepa, A., & Madinabeitia, G. (2020). How Much Training Data is Enough? A Case Study for HTTP Anomaly-Based Intrusion Detection. IEEE Access, 8, 44410-44425.
Sponsor
This work was supported in part by the Corporación Tecnológica de Andalucía and the University of Seville through the Projects under Grant CTA 1669/22/2017, Grant PI-1786/22/2018, and Grant PI-1736/22/2017.
Abstract
Most anomaly-based intrusion detectors rely on models that learn from training datasets whose
quality is crucial to their performance. Although the properties of suitable datasets have been formulated,
the influence of dataset size on the performance of anomaly-based detectors has received little
attention so far. In this work, we investigate the optimal size of a training dataset. This size should be
large enough so that training data is representative of normal behavior, but after that point, collecting more
data may result in unnecessary waste of time and computational resources, not to mention an increased
risk of overtraining. In this spirit, we provide a method to find out when the amount of data collected at
the production environment is representative of normal behavior in the context of a detector of HTTP URI
attacks based on 1-grammar. Our approach is founded on a set of indicators related to the statistical properties
of the data. These indicators are periodically calculated during data collection, producing time series that
stabilize when more training data is not expected to translate to better system performance, which indicates
that data collection can be stopped. We present a case study with real-life datasets collected at the University
of Seville (Spain) and a public dataset from the University of Saskatchewan. The application of our method
to these datasets showed that more than 42% of one trace and almost 20% of another were unnecessarily
collected, thereby showing that our proposed method can be an efficient approach for collecting training
data in the production environment.
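
The stopping idea described in the abstract (compute an indicator periodically during collection and stop once its time series stabilizes) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the indicator (total variation distance between consecutive cumulative character-frequency distributions of the URIs), the function names, and the parameters `batch_size`, `tolerance`, and `patience` are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def total_variation(p: Counter, q: Counter) -> float:
    """Total variation distance between two count vectors, each normalized to sum to 1."""
    p_total = sum(p.values())
    q_total = sum(q.values())
    if p_total == 0 or q_total == 0:
        return 1.0
    keys = set(p) | set(q)
    # Counter returns 0 for missing keys, so unseen characters contribute their full mass.
    return 0.5 * sum(abs(p[k] / p_total - q[k] / q_total) for k in keys)


def stabilization_point(
    uris: Iterable[str],
    batch_size: int = 100,   # hypothetical: URIs between two indicator evaluations
    tolerance: float = 0.01,  # hypothetical: drift below this counts as "stable"
    patience: int = 3,        # hypothetical: consecutive stable checkpoints required
) -> Optional[int]:
    """Return the number of URIs after which the indicator is considered stable.

    Returns None if the stream ends before the stability criterion is met.
    """
    counts: Counter = Counter()   # cumulative 1-gram (single character) counts
    previous: Optional[Counter] = None
    stable_checkpoints = 0
    seen = 0
    for uri in uris:
        counts.update(uri)        # updating with a string counts its characters
        seen += 1
        if seen % batch_size == 0:
            if previous is not None:
                drift = total_variation(previous, counts)
                stable_checkpoints = stable_checkpoints + 1 if drift < tolerance else 0
                if stable_checkpoints >= patience:
                    return seen
            previous = counts.copy()
    return None
```

With a stream of identical URIs, for example `stabilization_point(["/index.html?id=1"] * 1000)`, the drift is zero from the second checkpoint onward, so the function reports stability after 400 URIs (the first checkpoint only sets the baseline, and three stable checkpoints follow). Real traffic would need thresholds tuned to the site, and the paper's own indicators may differ from the one assumed here.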