Compact class-conditional attribute category clustering: Amino acid grouping for enhanced HIV-1 protease cleavage classification
Metadata
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Subject
HIV-1 protease; Octamer cleavage; Data representation
Date
2024-08-23
Bibliographic reference
J. A. Sáez and J. F. Vera, "Compact Class-conditional Attribute Category Clustering: Amino Acid Grouping for Enhanced HIV-1 Protease Cleavage Classification," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi: 10.1109/TCBB.2024.3448617
Abstract
Categorical attributes are common in many classification tasks, presenting certain challenges as the number of categories
grows. This situation can affect data handling, negatively impacting the building time of models, their complexity and, ultimately, their
classification performance. In order to mitigate these issues, this research proposes a novel preprocessing technique for grouping
attribute categories in classification datasets. This approach combines the exact representation of the association between categorical
values in a Euclidean space, clustering methods and attribute quality metrics to group similar attribute categories based on their
contribution to the classification task. To estimate its effectiveness, the proposal is evaluated within the context of HIV-1 protease
cleavage site prediction, where each attribute represents an amino acid that can take multiple possible values. The results obtained on
HIV-1 real-world datasets show a significant reduction in the number of categories per attribute, with an average reduction percentage
ranging from 74% to 81%. This reduction leads to simplified data representations and improved classification performance compared
with no preprocessing. Specifically, improvements of up to 0.07 in accuracy and 0.19 in geometric mean are observed across different
datasets and classification algorithms. Additionally, extensive simulations on synthetic datasets with varied characteristics are carried
out, providing consistent and reliable results that validate the robustness of the proposal. These findings highlight the capability of the
developed method to enhance cleavage prediction, which could potentially contribute to understanding viral processes and developing
targeted therapeutic strategies.
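
The general idea of grouping attribute categories by their contribution to the class can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the Euclidean embedding, the clustering method or the quality metric, so the sketch assumes a much simpler stand-in. Each category is represented by its vector of class proportions, and categories are merged greedily by centroid distance until a requested number of groups remains. The function names (class_profiles, group_categories), the n_groups parameter and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def class_profiles(values, labels):
    """Map each category to its vector of class proportions P(class | category)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    return {
        value: tuple(c[k] / sum(c.values()) for k in classes)
        for value, c in counts.items()
    }


def group_categories(values, labels, n_groups):
    """Greedily merge the two closest clusters of categories until n_groups remain.

    Assumption: closeness is the Euclidean distance between cluster centroids,
    where a centroid is the unweighted mean of its members' class profiles.
    """
    profiles = class_profiles(values, labels)
    clusters = [(cat,) for cat in sorted(profiles)]

    def centroid(cluster):
        points = [profiles[cat] for cat in cluster]
        return tuple(sum(col) / len(points) for col in zip(*points))

    while len(clusters) > n_groups:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: math.dist(centroid(pair[0]), centroid(pair[1])),
        )
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]

    # Assign a group id to every original category.
    return {
        cat: group_id
        for group_id, cluster in enumerate(sorted(clusters))
        for cat in cluster
    }


# Toy data (made up): residue observed at one octamer position, and whether
# the octamer was cleaved (1) or not (0).
residues = ["A"] * 4 + ["G"] * 4 + ["L"] * 4 + ["F"] * 4
cleaved = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

groups = group_categories(residues, cleaved, n_groups=2)
print(groups)
```

On this toy data, A and G (mostly cleaved) fall into one group and L and F (mostly not cleaved) into the other, so a four-category attribute is reduced to two categories. In practice the number of groups would have to be chosen per attribute, which is the role the abstract assigns to attribute quality metrics.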