A Multilingual Spam Reviews Detection Based on Pre-Trained Word Embedding and Weighted Swarm Support Vector Machines

Al-Zoubi, Ala' M.; Mora, Antonio M.; Faris, Hossam

doi:10.1109/ACCESS.2023.3293641

A_Multilingual_Spam_Reviews_Detection.pdf (2.243Mb)

Identifiers

URI: https://hdl.handle.net/10481/84803

DOI: 10.1109/ACCESS.2023.3293641

Publisher

IEEE Xplore

Subject

Security

Detection

Spam reviews

Pre-trained

Word embedding

Weighted SVM

COVID-19

Multilingual

Date

2023-07-10

Bibliographic reference

A. M. Al-Zoubi, A. M. Mora and H. Faris, "A Multilingual Spam Reviews Detection Based on Pre-Trained Word Embedding and Weighted Swarm Support Vector Machines," in IEEE Access, vol. 11, pp. 72250-72271, 2023, [doi: 10.1109/ACCESS.2023.3293641]

Sponsor

Projects TED2021-129938B-I0; PID2020-113462RB-I00; PDC2022-133900-I00; PID2020-115570GB-C22, granted by Ministerio Español de Ciencia e Innovación; MCIN/AEI/10.13039/501100011033; MCIN/AEI; NextGenerationEU/PRTR

Abstract

Online reviews are important information that customers seek when deciding to buy products or services. Organizations also benefit from these reviews as essential feedback on their products or services. Such information must be reliable, especially since the COVID-19 pandemic, which brought a massive increase in online reviews as people stayed at home under quarantine. Not only did the number of reviews grow, but their context and the preferences they express also changed during the pandemic. Spam reviewers have adapted to these changes and improved their deception techniques. Spam reviews usually consist of misleading, fake, or fraudulent reviews that aim to deceive customers in order to make money or harm competitors. This work therefore presents a Weighted Support Vector Machine (WSVM) combined with Harris Hawks Optimization (HHO) for spam review detection. HHO serves as the algorithm that optimizes the hyperparameters and the feature weights. Corpora in three languages, namely English, Spanish, and Arabic, have been used as datasets in order to address the multilingual problem in spam reviews. Moreover, a pre-trained word embedding (BERT) has been applied alongside three word representation methods (NGram-3, TFIDF, and One-hot encoding). Four experiments have been conducted, each focused on solving and demonstrating a different aspect. In all experiments, the proposed approach showed excellent results compared with other state-of-the-art algorithms. Specifically, WSVM-HHO achieved an accuracy of 88.163%, 71.913%, 89.565%, and 84.270% for the English, Spanish, Arabic, and Multilingual datasets, respectively. Further, a deep analysis has been conducted to investigate the context of reviews before and after the COVID-19 situation. In addition, a new dataset has been generated that combines statistical features with the earlier textual features to improve detection performance.
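The abstract says that HHO tunes both the SVM hyperparameters and the feature weights. As a rough illustration only, the sketch below shows one way a single candidate solution could be decoded and used inside a feature-weighted RBF kernel. The encoding (a vector in [0, 1] holding C, gamma, and one weight per feature), the value ranges, and the function names `decode_candidate` and `weighted_rbf_kernel` are all assumptions made for this example; they are not taken from the paper, and the HHO search itself is not implemented here.

```python
import math


def decode_candidate(position, n_features,
                     c_range=(0.01, 100.0), gamma_range=(1e-4, 10.0)):
    """Decode one candidate solution into (C, gamma, feature weights).

    Hypothetical encoding: position[0] -> C, position[1] -> gamma,
    position[2:] -> one weight in [0, 1] per feature. The ranges are
    illustrative defaults, not values reported in the paper.
    """
    if len(position) != n_features + 2:
        raise ValueError("expected 2 hyperparameters plus one weight per feature")
    # Keep every component inside [0, 1], as a bounded optimizer would.
    clipped = [min(1.0, max(0.0, p)) for p in position]
    c = c_range[0] + clipped[0] * (c_range[1] - c_range[0])
    gamma = gamma_range[0] + clipped[1] * (gamma_range[1] - gamma_range[0])
    return c, gamma, clipped[2:]


def weighted_rbf_kernel(x, z, weights, gamma):
    """RBF kernel in which each squared feature difference is scaled by its weight."""
    dist = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, z))
    return math.exp(-gamma * dist)


# Usage: decode a candidate for a 2-feature problem and compare two vectors.
c, gamma, weights = decode_candidate([0.5, 0.1, 1.0, 0.0], n_features=2)
similarity = weighted_rbf_kernel([1.0, 5.0], [1.0, 9.0], weights, gamma)
```

In this sketch a weight of 0 removes a feature from the distance entirely, so the two vectors in the usage example are treated as identical even though their second components differ. An optimizer such as HHO would score each decoded candidate by training a classifier with it and measuring validation performance.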

Collections

DTSTC - Articles

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International