A novel methodology to classify test cases using natural language processing and imbalanced learning
Sahar Tahvili; Herrera Triguero, Francisco
Software testing; Artificial intelligence; Imbalanced classification; Natural language processing; Optimization; IFROWANN; Doc2Vec
Detecting dependencies between integration test cases plays a vital role in software test optimization. Classifying test cases into two main classes (dependent and independent) supports several test optimization purposes, such as parallel test execution, test automation, test case selection and prioritization, and test suite reduction. This task can be seen as an imbalanced classification problem because of the distribution of the test cases: the numbers of dependent and independent test cases are often uneven, which is related to the testing level, the testing environment, and the complexity of the system under test. In this study, we propose a novel methodology that consists of two main steps. First, using natural language processing, we analyze the test cases' specifications and turn them into numeric vectors. Second, using the obtained vectors, we classify each test case as dependent or independent. We follow a supervised learning approach and apply different methods for handling imbalanced datasets. The feasibility and possible generalization of the proposed methodology are evaluated in two industrial projects at Bombardier Transportation, Sweden, with promising results.
2020-11-25T12:16:32Z 2020-11-25T12:16:32Z 2020-08-14
journal article
Tahvili, S., Hatvani, L., Ramentol, E., Pimentel, R., Afzal, W., & Herrera, F. (2020). A novel methodology to classify test cases using natural language processing and imbalanced learning. Engineering Applications of Artificial Intelligence, 95, 103878.
[https://doi.org/10.1016/j.engappai.2020.103878]
http://hdl.handle.net/10481/64485
10.1016/j.engappai.2020.103878
eng
info:eu-repo/grantAgreement/EC/H2020/871319
http://creativecommons.org/licenses/by/3.0/es/
open access
Attribution 3.0 Spain
Elsevier
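
The two steps described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: a hashed bag-of-words vector stands in for Doc2Vec, and a k-nearest-neighbour vote weighted by inverse class frequency stands in for the imbalanced-learning methods (such as IFROWANN) named in the keywords. The specification texts, their labels, and the names `DIM`, `embed`, and `classify` are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

DIM = 64  # hypothetical vector size, not taken from the paper


def embed(text, dim=DIM):
    """Step 1 (stand-in for Doc2Vec): hash tokens into a unit-length count vector."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Deterministic bucket; Python's built-in hash() is salted per process.
        bucket = sum(ord(ch) * (i + 1) for i, ch in enumerate(token)) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(x * y for x, y in zip(a, b))


def classify(query, train_vecs, train_labels, k=3):
    """Step 2 (stand-in for the imbalanced learners): weighted k-NN.

    Each of the k most similar training vectors votes with weight
    similarity / class size, so the minority class is not outvoted
    merely because it has fewer examples.
    """
    class_sizes = Counter(train_labels)
    neighbours = sorted(
        ((cosine(query, v), label) for v, label in zip(train_vecs, train_labels)),
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, label in neighbours:
        votes[label] += max(sim, 0.0) / class_sizes[label]
    return votes.most_common(1)[0][0]


# Hypothetical, hand-written test specifications (4 independent, 1 dependent).
SPECS = [
    ("Power on the unit and check the status lamp", "independent"),
    ("Open the door and verify the alarm sounds", "independent"),
    ("Measure brake pressure at standstill", "independent"),
    ("Check the display brightness setting", "independent"),
    ("Requires brake test passed then release brake and verify pressure", "dependent"),
]
VECS = [embed(text) for text, _ in SPECS]
LABELS = [label for _, label in SPECS]

predicted = classify(VECS[-1], VECS, LABELS)
```

Here `predicted` is the label assigned to the single dependent specification when it is used as the query. The class-size weighting is the design choice of interest: with a plain majority vote over three neighbours, the one dependent example could be outvoted by two independent ones, which is the kind of bias that imbalance-aware methods are meant to counter.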