Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings
Metadata
Publisher
IEEE
Subject
Arabic harassment dataset; Deep learning; Evolutionary algorithm; Fine-tuned word embedding; Hate speech; Offensive language; Optimization
Date
2022-07-14
Bibliographic reference
F. Shannaq... [et al.]. "Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings," in IEEE Access, vol. 10, pp. 75018-75039, 2022, doi: [10.1109/ACCESS.2022.3190960]
Sponsorship
Ministerio Espanol de Ciencia e Innovacion (DemocratAI::UGR) PID2020-115570GB-C22
Abstract
Social networks facilitate communication between people from all over the world.
Unfortunately, the excessive use of social networks has led to a rise in antisocial behaviors such as the
spread of online offensive language, cyberbullying (CB), and hate speech (HS). Therefore, detecting abusive/offensive
language and hate speech has become a crucial part of combating cyberharassment. Manual detection of cyberharassment is
cumbersome, slow, and infeasible for rapidly growing data. In this study, we addressed the challenges
of automatically detecting offensive tweets in the Arabic language. The main contribution of this study is
to design and implement an intelligent prediction system encompassing a two-stage optimization approach
to distinguish offensive from non-offensive text. In the first stage, the proposed approach
fine-tuned the pre-trained word embedding models by training them for several epochs on the training dataset.
Embeddings for the vocabulary of the new dataset were trained and added to the existing embeddings.
In the second stage, it employed a hybrid of two classifiers, namely XGBoost and SVM, and a
genetic algorithm (GA) to mitigate the classifiers' difficulty in finding the optimal hyperparameter
values to run the proposed approach. We tested the proposed approach on the Arabic Cyberbullying Corpus
(ArCybC), which contains tweets collected from four Twitter domains: gaming, sports, news, and celebrities.
The ArCybC dataset has four categories: sexual, racial, intelligence, and appearance. The proposed approach
produced superior results, in which the SVM algorithm with the Aravec SkipGram word embedding model
achieved an accuracy rate of 88.2% and an F1-score rate of 87.8%.
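
The second stage described above, a genetic algorithm searching for classifier hyperparameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name genetic_search, the choice of tournament selection, uniform crossover, Gaussian mutation and elitism, and all parameter values are assumptions made for illustration. The fitness function here is a synthetic stand-in; in a real setting it would presumably train the classifier (for example an SVM with the candidate C and gamma) and return a validation metric such as accuracy or F1.

```python
import random


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, elite=2, seed=0):
    """Maximize `fitness` over real-valued genes constrained by `bounds`.

    bounds: list of (low, high) pairs, one per gene (hypothetical example:
    log10(C) and log10(gamma) for an SVM). Returns (best_genes, best_score).
    """
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def tournament(scored, k=3):
        # Pick the fittest of k randomly chosen individuals.
        return max(rng.sample(scored, k), key=lambda s: s[0])[1]

    def crossover(a, b):
        # Uniform crossover: each gene is copied from one of the two parents.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    def mutate(ind):
        # Gaussian perturbation, clipped back into the search bounds.
        out = []
        for g, (lo, hi) in zip(ind, bounds):
            if rng.random() < mutation_rate:
                g += rng.gauss(0.0, 0.1 * (hi - lo))
            out.append(min(hi, max(lo, g)))
        return out

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda s: s[0], reverse=True)
        # Elitism: carry the best individuals over unchanged.
        next_gen = [ind for _, ind in scored[:elite]]
        while len(next_gen) < pop_size:
            child = crossover(tournament(scored), tournament(scored))
            next_gen.append(mutate(child))
        population = next_gen

    best = max(population, key=fitness)
    return best, fitness(best)


def toy_fitness(genes):
    # Synthetic stand-in for a validation score (an assumption, not real
    # data): it peaks at log10(C) = 1 and log10(gamma) = -2.
    log_c, log_gamma = genes
    return 1.0 - 0.05 * ((log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2)


BOUNDS = [(-3.0, 3.0), (-5.0, 1.0)]
best, score = genetic_search(toy_fitness, BOUNDS)
```

Searching in log space is a common choice for parameters such as C and gamma, whose useful values span several orders of magnitude. Replacing toy_fitness with a function that trains and validates a real classifier would make each fitness call expensive, so the population size and generation count would need to be budgeted accordingly.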