kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors Classifier for Big Data

Maillo Hidalgo, Jesús; Ramírez-Gallego, Sergio; Triguero, Isaac; Herrera Triguero, Francisco

doi:10.1016/j.knosys.2016.06.012

dc.contributor.author	Maillo Hidalgo, Jesús
dc.contributor.author	Ramírez-Gallego, Sergio
dc.contributor.author	Triguero, Isaac
dc.contributor.author	Herrera Triguero, Francisco
dc.date.accessioned	2020-12-14T09:09:19Z
dc.date.available	2020-12-14T09:09:19Z
dc.date.issued	2017-02
dc.identifier.citation	Jesus Maillo, Sergio Ramírez, Isaac Triguero, Francisco Herrera, kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors Classifier for Big Data, Knowledge-Based Systems (2016), [doi: 10.1016/j.knosys.2016.06.012]	es_ES
dc.identifier.uri	http://hdl.handle.net/10481/64867
dc.description	This work has been supported by the Spanish National Research Project TIN2014-57251-P and the Andalusian Research Plan P11-TIC-7765. J. Maillo and S. Ramirez hold FPU scholarships from the Spanish Ministry of Education. I. Triguero held a BOF postdoctoral fellowship from Ghent University during part of the development of this work.	es_ES
dc.description.abstract	The k-Nearest Neighbors classifier is a simple yet effective and widely renowned method in data mining. The actual application of this model in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit with newly arising technologies. In this work we provide a new solution to perform an exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify large amounts of unseen cases against a big training dataset. The map phase computes the k-nearest neighbors in different training data splits. Afterwards, multiple reducers process the definitive neighbors from the list obtained in the map phase. The key point of this proposal lies in the management of the test set, keeping it in memory when possible. Otherwise, it is split into a minimum number of pieces, applying a MapReduce per chunk, using the caching skills of Spark to reuse the previously partitioned training set. In our experiments we study the differences between Hadoop and Spark implementations with datasets up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work an open-source Spark package is available.	es_ES
dc.description.sponsorship	Spanish National Research Project TIN2014-57251-P	es_ES
dc.description.sponsorship	Andalusian Research Plan P11-TIC-7765	es_ES
dc.description.sponsorship	Spanish Government	es_ES
dc.description.sponsorship	Ghent University	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Elsevier	es_ES
dc.rights	Atribución-NoComercial-SinDerivadas 3.0 España	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject	K-nearest neighbors	es_ES
dc.subject	Big Data	es_ES
dc.subject	Apache Hadoop	es_ES
dc.subject	Apache Spark	es_ES
dc.subject	MapReduce	es_ES
dc.title	kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors Classifier for Big Data	es_ES
dc.type	journal article	es_ES
dc.rights.accessRights	open access	es_ES
dc.identifier.doi	10.1016/j.knosys.2016.06.012
dc.type.hasVersion	AM	es_ES
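
The map and reduce phases described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' Spark package: the function names (`map_local_knn`, `reduce_merge`, `knn_classify`, `knn_iterative`), the Euclidean distance, the majority vote, and the `chunk_size` parameter are all assumptions chosen for illustration. The training splits stand in for the cached training partitions, and the loop over test chunks stands in for running one MapReduce per chunk.

```python
import heapq
from collections import Counter
from functools import reduce
from math import dist  # Euclidean distance, Python 3.8+


def map_local_knn(train_split, test_points, k):
    """Map phase (sketch): for each test point, keep the k closest
    (distance, label) pairs found inside ONE training split."""
    return [
        heapq.nsmallest(k, ((dist(x, t), label) for x, label in train_split))
        for t in test_points
    ]


def reduce_merge(left, right, k):
    """Reduce phase (sketch): merge two candidate lists per test point
    and keep only the k closest overall."""
    return [heapq.nsmallest(k, a + b) for a, b in zip(left, right)]


def knn_classify(train_splits, test_points, k):
    """One map/reduce pass over every training split for one test chunk."""
    partial = [map_local_knn(split, test_points, k) for split in train_splits]
    merged = reduce(lambda a, b: reduce_merge(a, b, k), partial)
    # Majority vote among the k definitive neighbors (an assumption here).
    return [
        Counter(label for _, label in neighbors).most_common(1)[0][0]
        for neighbors in merged
    ]


def knn_iterative(train_splits, test_points, k, chunk_size):
    """Iterate over test chunks, reusing the same training splits each time."""
    predictions = []
    for start in range(0, len(test_points), chunk_size):
        chunk = test_points[start:start + chunk_size]
        predictions.extend(knn_classify(train_splits, chunk, k))
    return predictions
```

Because every split contributes its own k best candidates and the merge keeps the k closest overall, the result matches an exact kNN over the full training set, whatever the number of splits or the chunk size.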

Files in this item

Name: acceptedVersion.pdf
Size: 1.417 MB
Format: PDF

This item appears in the following collection(s)

DCCIA - Artículos

Except where otherwise noted, this item's license is described as Atribución-NoComercial-SinDerivadas 3.0 España