A Fuzzy Association Rule-Based Classification Model for High-Dimensional Problems With Genetic Rule Selection and Lateral Tuning

The inductive learning of fuzzy rule-based classification systems suffers from exponential growth of the fuzzy rule search space when the number of patterns and/or variables becomes high. This growth makes the learning process more difficult and, in most cases, leads to problems of scalability (in terms of the time and memory consumed) and/or complexity (with respect to the number of rules obtained and the number of variables included in each rule). In this paper, we propose a fuzzy association rule-based classification method for high-dimensional problems, which is based on three stages to obtain an accurate and compact fuzzy rule-based classifier with a low computational cost. This method limits the order of the associations in the association rule extraction and considers the use of subgroup discovery, based on an improved weighted relative accuracy measure, to preselect the most interesting rules before a genetic postprocessing process for rule selection and parameter tuning. The results obtained on 26 real-world datasets of different sizes and with different numbers of variables demonstrate the effectiveness of the proposed approach.


I. INTRODUCTION
FUZZY rule-based classification systems (FRBCSs) [1], [2] are useful and well-known tools in the machine learning framework, since they can provide an interpretable model for the end user [3]-[6]. There are many real applications in which FRBCSs have been employed, including anomaly intrusion detection [7] and image processing [8], among others. In most of these areas, the available or useful data consist of a high number of patterns (instances or examples) and/or variables. In this situation, the inductive learning of FRBCSs suffers from exponential growth of the fuzzy rule search space. This growth makes the learning process more difficult and, in most cases, leads to problems of scalability (in terms of the time and memory consumed) and/or complexity (with respect to the number of rules obtained and the number of variables included in each rule) [9], [10].
Association discovery is one of the most common data mining techniques used to extract interesting knowledge from large datasets [11]. Much effort has been made to use its advantages for classification under the name of associative classification [12]-[19]. Association discovery aims to find interesting relationships between the different items in a database [20], while classification aims to discover a model from training data that can be used to predict the class of test patterns [21]. Both association discovery and classification rule mining are essential in practical data mining applications [11], [22], and their integration could result in greater savings and convenience for the user.
A typical associative classification system is constructed in two stages: 1) discovering the association rules inherent in a database; 2) selecting a small set of relevant association rules to construct a classifier. In order to enhance the interpretability of the obtained classification rules and to avoid unnatural boundaries in the partitioning of the attributes, different studies have been presented to obtain classification systems based on fuzzy association rules [23]-[28]. For instance, in [24], the authors made use of a genetic algorithm (GA) [29], [30] to automatically determine minimum support and confidence thresholds, mining for each chromosome a fuzzy rule set for classification by means of an algorithm based on the Apriori algorithm [31], and adjusting the fuzzy confidence of these rules with the approach proposed by Nozaki et al. in [32]. Because a complete rule set has to be mined for every chromosome, this approach can only be used for small problems: its computational cost is very high for problems that consist of a high number of patterns and/or variables. On the other hand, in [25], the authors used an algorithm based on the Apriori algorithm to mine association rules only up to a certain level and to select the K most confident ones for each class among them, in order to finally employ a genetic rule-selection method that obtains a classifier from them. However, many patterns may remain uncovered if only the confidence measure is used to select the candidate rules.
In this paper, we present a fuzzy association rule-based classification method for high-dimensional problems (FARC-HD) to obtain an accurate and compact fuzzy rule-based classifier with a low computational cost. This method is based on the following three stages:
1) Fuzzy association rule extraction for classification: A search tree is employed to list all possible frequent fuzzy item sets and to generate fuzzy association rules for classification, limiting the depth of the branches in order to find a small number of short (i.e., simple) fuzzy rules.
2) Candidate rule prescreening: Even though the order of the associations is limited in the association rule extraction, the number of rules generated can be very large. In order to decrease the computational cost of the genetic postprocessing stage, we consider the use of subgroup discovery based on an improved weighted relative accuracy measure (wWRAcc) to preselect the most interesting rules by means of a pattern weighting scheme [33].
3) Genetic rule selection and lateral tuning: Finally, we make use of GAs to select and tune a compact set of fuzzy association rules with high classification accuracy in order to consider the known positive synergy that both techniques present (selection and tuning). Several works have successfully combined the selection of rules with the tuning of membership functions (MFs) within the same process [34], [35], taking advantage of the possibility of different coding schemes that GAs provide. The successful application of GAs to identify fuzzy systems has led to the so-called genetic fuzzy systems (GFSs) [36]-[38].
In order to assess the performance of the proposed approach, we have used 26 real-world datasets with a number of variables ranging from 4 to 90 and a number of patterns ranging from 150 to 19 020. We have developed the following studies. First, we have shown the results that are obtained from comparison with three other GFSs [38]. Second, we have compared the performance of our approach with two approaches to obtain fuzzy associative classifiers. Third, we have shown the results that are obtained from the comparison with four other classical approaches for associative classification and with the C4.5 decision tree [39]. Furthermore, in these studies, we have made use of some nonparametric statistical tests for pairwise and multiple comparison [40]-[43] of the performance of these classifiers. Then, we have shown a study on the influence of the depth of the trees and the number of evaluations in the genetic selection and tuning process. Finally, we have analyzed the scalability of the proposed approach.
This paper is arranged as follows. Section II introduces the type of rules, rule weights, and inference model that are used, and the basic definitions for fuzzy association rules and associative classification. Section III describes in detail each stage of the proposed approach. Section IV presents the experimental setup. Section V shows and discusses the results that are obtained on 26 real-world datasets. Finally, in Section VI, some concluding remarks are made.

II. PRELIMINARIES
In this section, we first describe FRBCSs. Then, we introduce the basic definitions for fuzzy association rules. Finally, we present fuzzy association rules for classification.

A. Fuzzy Rule-Based Classification Systems
Any classification problem consists of $N$ training patterns, i.e., $x_p = (x_{p1}, \ldots, x_{pm})$, $p = 1, 2, \ldots, N$, from $S$ classes, where $x_{pi}$ is the $i$th attribute value ($i = 1, 2, \ldots, m$) of the $p$th training pattern. In this paper, we use fuzzy rules of the following form for our classifier:

$$\text{Rule } R_j: \text{ If } x_1 \text{ is } A_{j1} \text{ and } \ldots \text{ and } x_m \text{ is } A_{jm} \text{ then Class} = C_j \text{ with } RW_j \tag{1}$$

where $R_j$ is the label of the $j$th rule, $x = (x_1, \ldots, x_m)$ is an $m$-dimensional pattern vector, $A_{ji}$ is an antecedent fuzzy set, $C_j$ is a class label, and $RW_j$ is the rule weight.
The rule weight of each fuzzy rule $R_j$ has a great effect on the performance of fuzzy rule-based classifiers [44]. Different specifications of the rule weight have been proposed and examined in the literature. In [45], we can find some heuristic methods for rule weight specification. In this paper, we employ the most common one, i.e., the fuzzy confidence value or certainty factor (CF) [46]:

$$RW_j = CF_j = \frac{\sum_{x_p \in \text{Class } C_j} \mu_{A_j}(x_p)}{\sum_{p=1}^{N} \mu_{A_j}(x_p)} \tag{2}$$

where $\mu_{A_j}(x_p)$ is the matching degree of the pattern $x_p$ with the antecedent part of the fuzzy rule $R_j$. We use the fuzzy reasoning method of the weighted vote or additive combination [46] to classify new patterns by the rule base (RB). With this method, each fuzzy rule casts a vote for its consequent class. The total strength of the vote for each class is computed as follows:

$$V_{\text{Class } h}(x_p) = \sum_{R_j \in RB;\ C_j = h} \mu_{A_j}(x_p) \cdot CF_j, \quad h = 1, 2, \ldots, S \tag{3}$$

The new pattern $x_p$ is classified as the class with the maximum total strength of the vote. If multiple class labels have the same maximum value for $x_p$ or no fuzzy rule is compatible with $x_p$, this pattern is classified as the class with the most patterns in the training data.
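The certainty factor and the weighted-vote reasoning method described above can be illustrated with a minimal Python sketch. The triangular MFs, the product T-norm for the antecedent, and all function names are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from collections import defaultdict
from math import prod


def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def matching_degree(antecedent, x):
    """Product T-norm over the conditions {attribute index: (a, b, c)}."""
    return prod(triangular(x[i], *mf) for i, mf in antecedent.items())


def certainty_factor(antecedent, rule_class, patterns, labels):
    """Fuzzy confidence: class-restricted matching over total matching."""
    degrees = [matching_degree(antecedent, x) for x in patterns]
    total = sum(degrees)
    in_class = sum(d for d, y in zip(degrees, labels) if y == rule_class)
    return in_class / total if total > 0 else 0.0


def classify(rules, x, default_class):
    """Weighted vote; ties and uncovered patterns fall back to default_class."""
    votes = defaultdict(float)
    for antecedent, rule_class, weight in rules:
        votes[rule_class] += matching_degree(antecedent, x) * weight
    best = max(votes.values(), default=0.0)
    winners = [c for c, v in votes.items() if v == best]
    return winners[0] if best > 0.0 and len(winners) == 1 else default_class


# Hypothetical one-attribute data; LOW and HIGH are two labels on [0, 1].
LOW, HIGH = (-0.5, 0.0, 0.5), (0.5, 1.0, 1.5)
patterns = [(0.1,), (0.2,), (0.9,), (0.8,)]
labels = [0, 0, 1, 1]
rules = [({0: mf}, cls, certainty_factor({0: mf}, cls, patterns, labels))
         for mf, cls in [(LOW, 0), (HIGH, 1)]]
```

In this sketch, `default_class` plays the role of the majority class of the training data, which is returned whenever the vote is tied or no rule fires.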

B. Fuzzy Association Rules
Association rules are used to represent and identify dependences between items in a database [11], [20]. They are expressions of the type $A \to B$, where $A$ and $B$ are sets of items, and $A \cap B = \emptyset$. This means that if all the items in $A$ exist in a transaction, then all the items in $B$ are, with a high probability, also in the transaction, and $A$ and $B$ should have no common items [31]. Many previous studies on mining association rules have focused on databases with binary or discrete values; however, data in real-world applications usually consist of quantitative values. Designing data mining algorithms that are able to deal with various types of data presents a challenge to researchers in this field.
Fuzzy set theory has been used more and more frequently in data mining because of its simplicity and similarity to human reasoning [1]. The use of fuzzy sets to describe associations between data extends the types of relationships that may be represented, facilitates the interpretation of rules in linguistic terms, and avoids unnatural boundaries in the partitioning of the attribute domains. For this reason, in recent years, different studies have proposed methods to mine fuzzy association rules from quantitative data [47]-[54].
Let us consider a simple database $T$ with two attributes ($X_1$ and $X_2$) and three linguistic terms with their associated MFs (see Fig. 1). Based on this definition, a simple example of a fuzzy association rule is: $X_1$ is Middle $\to$ $X_2$ is High. Support and confidence are the most common measures of interest of an association rule. These measures can be defined for fuzzy association rules as follows:

$$\text{Support}(A \to B) = \frac{\sum_{x_p \in T} \mu_{AB}(x_p)}{|N|} \tag{4}$$

$$\text{Confidence}(A \to B) = \frac{\sum_{x_p \in T} \mu_{AB}(x_p)}{\sum_{x_p \in T} \mu_A(x_p)} \tag{5}$$

where $|N|$ is the number of transactions in $T$, $\mu_A(x_p)$ is the matching degree of the transaction $x_p$ with the antecedent part of the rule, and $\mu_{AB}(x_p)$ is the matching degree of the transaction $x_p$ with the antecedent and consequent of the rule.
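As a small numerical illustration of fuzzy support and confidence, the following Python sketch works on precomputed matching degrees. The four transactions are hypothetical, and combining antecedent and consequent degrees with the product is an assumption of this sketch.

```python
def fuzzy_support(mu_ab, n_transactions):
    """Sum of joint matching degrees divided by the number of transactions."""
    return sum(mu_ab) / n_transactions


def fuzzy_confidence(mu_a, mu_ab):
    """Sum of joint matching degrees divided by the antecedent matching."""
    covered = sum(mu_a)
    return sum(mu_ab) / covered if covered > 0 else 0.0


# Hypothetical matching degrees of four transactions with A and with B.
mu_a = [0.8, 0.5, 0.0, 0.2]
mu_b = [0.5, 1.0, 0.9, 0.0]
mu_ab = [a * b for a, b in zip(mu_a, mu_b)]  # product conjunction (assumption)

support = fuzzy_support(mu_ab, len(mu_a))
confidence = fuzzy_confidence(mu_a, mu_ab)
```

Here the joint degrees sum to 0.9, so the support is 0.9/4 and the confidence is 0.9/1.5.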

C. Fuzzy Association Rules for Classification
Over the past few years, different studies have proposed methods to obtain fuzzy association rule-based classifiers [23]-[28]. The task of classification is to find a set of rules in order to identify the classes of undetermined patterns. A fuzzy association rule can be considered to be a classification rule if the antecedent contains fuzzy item sets, and the consequent part contains only one class label ($C = \{C_1, \ldots, C_j, \ldots, C_S\}$). A fuzzy associative classification rule, i.e., $A \to C_j$, could be measured directly in terms of support and confidence as follows:

$$\text{Support}(A \to C_j) = \frac{\sum_{x_p \in \text{Class } C_j} \mu_A(x_p)}{|N|} \tag{6}$$

$$\text{Confidence}(A \to C_j) = \frac{\sum_{x_p \in \text{Class } C_j} \mu_A(x_p)}{\sum_{x_p \in T} \mu_A(x_p)} \tag{7}$$

III. FUZZY ASSOCIATION RULE-BASED CLASSIFIER FOR HIGH-DIMENSIONAL PROBLEMS
In this section, we will describe our proposal to obtain a fuzzy association rule-based classifier for high-dimensional problems. This method is based on the following three stages:
1) Fuzzy association rule extraction for classification: A search tree is employed to list all the possible frequent fuzzy item sets and to generate fuzzy association rules for classification.
2) Candidate rule prescreening: A rule evaluation criterion is used to preselect candidate fuzzy association rules.
3) Genetic rule selection and lateral tuning: The best cooperative rules are selected and tuned by means of a GA, considering the positive synergy between both techniques within the same process.
Finally, we add a default rule considering the class with the most patterns in the training data. In the following, we will introduce the three mentioned stages, explaining in detail all their characteristics (see Sections III-A-C), and present a flowchart of the algorithm (see Section III-D).

A. Stage 1. Fuzzy Association Rule Extraction for Classification
To generate the RB, we employ a search tree to list all the possible fuzzy item sets of a class. The root or level 0 of a search tree is an empty set. All attributes are assumed to have an order (in our case, the order of appearance in the training data), and the one-item sets that correspond to the attributes are listed in the first level of the search tree according to their order. If an attribute has j possible outcomes (q j linguistic terms for each quantitative attribute), it will have j one-item sets that are listed in the first level. The children of a one-item node for an attribute A are the two-item sets that include the one-item set of attribute A and a one-item set for another attribute behind attribute A in the order, and so on. If an attribute has j > 2 possible outcomes, it can be replaced by j binary variables to ensure that no more than one of these j binary attributes can appear in the same node in a search tree. An example with two attributes V 1 and V 2 with two linguistic terms L and H is detailed in Fig. 2.
An item set with a support higher than the minimum support is a frequent item set. If the support of an n-item set in a node J is less than the minimum support, it does not need to be extended more because the support of any item set in a node in the subtree, which is led by node J, will also be less than the minimum support. Likewise, if a candidate item set generates a classification rule with confidence higher than the maximum confidence, this rule has reached the quality level that is demanded by the user, and it is again unnecessary to extend it further. These properties greatly reduce the number of nodes needed for searching.
The fuzzy support of an item set can be calculated as follows: where $\mu_A(x_p)$ is the matching degree of the pattern $x_p$ with the item set. The matching degree $\mu_A(x_p)$ of $x_p$ to the different fuzzy regions is computed by the use of a conjunction operator, in our case, the product T-norm. Once all frequent fuzzy item sets have been obtained, the candidate fuzzy association rules for classification can be generated, setting the frequent fuzzy item sets in the antecedent of the rules and the corresponding class in the consequent. This process is repeated for each class.
The number of frequent fuzzy item sets that are extracted depends directly on the minimum support. The minimum support is usually calculated considering the total number of patterns in the dataset; however, the number of patterns for each class in a dataset can be different. For this reason, our algorithm determines the minimum support of each class from the distribution of the classes over the dataset. Thus, the minimum support for class $C_j$ is defined as

$$\text{MinimumSupport}_{C_j} = minSup \cdot f_{C_j} \tag{9}$$

where $minSup$ is the minimum support determined by the expert, and $f_{C_j}$ is the pattern ratio of the class $C_j$.
In this stage, we can generate a large number of candidate fuzzy association rules for classification. It is, however, very difficult for human users to handle such a large number of generated fuzzy rules and to intuitively understand long fuzzy rules with many antecedent conditions. For this reason, we only generate short fuzzy rules with a small number of antecedent conditions. Thus, the depth of the trees is limited to a fixed value $Depth_{max}$ that is determined by an expert.
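The depth-limited search of Stage 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: item sets are extended in attribute order, the support of an item set is accumulated over the patterns of the target class and divided by the total number of patterns, the class-specific threshold is minSup times the class ratio, and a branch is not extended once its confidence exceeds the maximum confidence. The function name and data layout are hypothetical.

```python
def mine_class_rules(mu, labels, target, min_sup, max_conf, depth_max):
    """mu maps an item (attribute index, term) to per-pattern memberships."""
    n = len(labels)
    class_ratio = sum(1 for y in labels if y == target) / n
    class_min_sup = min_sup * class_ratio  # class-specific minimum support
    items = sorted(mu)  # attribute order, then term
    rules = []

    def extend(itemset, degrees, start):
        for k in range(start, len(items)):
            if itemset and itemset[-1][0] == items[k][0]:
                continue  # at most one term per attribute in a node
            new_deg = [d * m for d, m in zip(degrees, mu[items[k]])]
            sup = sum(d for d, y in zip(new_deg, labels) if y == target) / n
            if sup < class_min_sup:
                continue  # the whole subtree is infrequent as well
            total = sum(new_deg)
            conf = sup * n / total if total > 0 else 0.0
            new_set = itemset + [items[k]]
            rules.append((tuple(new_set), sup, conf))
            # Stop extending confident rules and respect the depth limit.
            if conf <= max_conf and len(new_set) < depth_max:
                extend(new_set, new_deg, k + 1)

    extend([], [1.0] * n, 0)
    return rules


# Hypothetical crisp memberships: two attributes with terms L and H.
mu = {
    (0, "L"): [1.0, 1.0, 0.0, 0.0], (0, "H"): [0.0, 0.0, 1.0, 1.0],
    (1, "L"): [1.0, 0.0, 1.0, 0.0], (1, "H"): [0.0, 1.0, 0.0, 1.0],
}
labels = ["a", "a", "b", "b"]
strict = mine_class_rules(mu, labels, "a", 0.5, 0.8, 2)
loose = mine_class_rules(mu, labels, "a", 0.5, 1.0, 2)
```

With a maximum confidence of 0.8, the one-item rule on attribute 0 already reaches confidence 1.0 and is not extended; raising the maximum confidence to 1.0 lets the tree grow to depth 2 below it.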

B. Stage 2. Candidate Rule Prescreening
In the previous stage, we can generate a large number of candidate rules. In order to decrease the computational cost of stage 3, we consider the use of subgroup discovery to preselect the most interesting rules from the RB obtained in the previous stage by means of a pattern weighting scheme [33]. This scheme treats the patterns in such a way that covered positive patterns are not deleted when the current best rule is selected. Instead, each time a rule is selected, the algorithm stores a count $i$ for each pattern of how many times (with how many of the selected rules) the pattern has been covered.
Weights of positive patterns covered by the selected rule decrease according to the formula $w(e_j, i) = \frac{1}{i+1}$. In the first iteration, all target class patterns are assigned the same weight, i.e., $w(e_j, 0) = 1$, while in the following iterations the contributions of patterns are inversely proportional to their coverage by previously selected rules. This way, the patterns that are already covered by one or more selected rules decrease their weights, while uncovered target class patterns whose weights have not been decreased will have a greater chance of being covered in the following iterations. Covered patterns are completely eliminated when they have been covered more than $k_t$ times. Thus, in each iteration of the process, the rules are ordered according to a rule evaluation criterion from best to worst. The best rule is selected, covered patterns are reweighted, and the procedure repeats these steps until one of the stopping criteria is satisfied: either all patterns have been covered more than $k_t$ times, or there are no more rules in the RB. This process is repeated for each class.
wWRAcc was used to evaluate the quality of interval rules in APRIORI-SD [33]. This measure was defined as follows:

$$wWRAcc'(A \to C_j) = \frac{n'(A)}{N'} \cdot \left( \frac{n'(A \cdot C_j)}{n'(A)} - \frac{n(C_j)}{N} \right) \tag{10}$$

where $N'$ is the sum of the weights of all patterns, $n'(A)$ is the sum of the weights of all covered patterns, $n'(A \cdot C_j)$ is the sum of the weights of all correctly covered patterns, $n(C_j)$ is the number of patterns of class $C_j$, and $N$ is the number of all patterns. For instance, let us consider a simple database with two attributes $X_1$ and $X_2$, two classes $C_1$ and $C_2$, and five training patterns. Table I shows the five training patterns and their weights in the $p$th iteration of the process. In this iteration, the wWRAcc value of a simple rule,

We have modified this measure to enable the handling of fuzzy rules. The new measure is defined as follows:

$$wWRAcc''(A \to C_j) = \frac{n''(A \cdot C_j)}{n'(C_j)} \cdot \left( \frac{n''(A \cdot C_j)}{n''(A)} - \frac{n(C_j)}{N} \right) \tag{11}$$

where $n''(A)$ is the sum of the products of the weights of all covered patterns by their matching degrees with the antecedent part of the rule, $n''(A \cdot C_j)$ is the sum of the products of the weights of all correctly covered patterns by their matching degrees with the antecedent part of the rule, and $n'(C_j)$ is the sum of the weights of patterns of class $C_j$. Moreover, the first term in the definition of wWRAcc has been replaced by $n''(A \cdot C_j) / n'(C_j)$ to reward rules that cover uneliminated patterns of class $C_j$.
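The fuzzy measure can be illustrated with a short Python sketch. The exact form used here, a coverage term n''(A·C_j)/n'(C_j) multiplied by the difference between the weighted fuzzy precision and the class prior n(C_j)/N, is an assumption reconstructed from the verbal definitions of the terms; the function name and the example data are hypothetical.

```python
def fuzzy_wwracc(mu, labels, weights, target):
    """Weighted relative accuracy of a fuzzy rule A -> target (assumed form).

    mu      : matching degree of each pattern with the antecedent A
    weights : current pattern weights from the weighting scheme
    """
    n_all = len(labels)
    n_class = sum(1 for y in labels if y == target)          # n(C_j)
    w_class = sum(w for w, y in zip(weights, labels) if y == target)
    covered = sum(w * m for w, m in zip(weights, mu))        # n''(A)
    correct = sum(w * m for w, m, y in zip(weights, mu, labels)
                  if y == target)                            # n''(A.C_j)
    if covered == 0 or w_class == 0:
        return 0.0
    return (correct / w_class) * (correct / covered - n_class / n_all)


# Hypothetical rule that mostly matches class "a".
mu = [0.8, 0.6, 0.1, 0.0]
labels = ["a", "a", "b", "b"]
fresh = fuzzy_wwracc(mu, labels, [1.0, 1.0, 1.0, 1.0], "a")
# After the first pattern has been covered once, its weight drops to 1/2.
reweighted = fuzzy_wwracc(mu, labels, [0.5, 1.0, 1.0, 1.0], "a")
```

In this example the score of the rule drops once one of the patterns it covers has been down-weighted, which is the behavior the weighting scheme relies on to favor rules that cover new patterns.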
Let us consider three linguistic terms for the attributes $X_1$ and $X_2$ (see Fig. 1). Based on this definition, a simple example of a fuzzy association rule for classification is: $R$ = If $X_1$ is Low and $X_2$ is High $\to$ $C_1$. This rule covers the training patterns in Table I

C. Stage 3. Rule Selection and Lateral Tuning
We consider the use of GAs to select and tune a compact set of fuzzy association rules with high classification accuracy from the RB obtained in the previous stage. We follow the approach proposed in [35], where rules are based on the linguistic two-tuple representation [55]. This representation allows the lateral displacement of the labels considering only one parameter (the symbolic translation parameter), which simplifies the tuning search space and eases the derivation of optimal models, particularly when it is combined with rule selection within the same process, since this takes advantage of the positive synergy that both techniques present. This way, the contextualization of the MFs enables them to achieve a better covering degree while maintaining their original shapes, which results in accuracy improvements without a significant loss in the interpretability of the fuzzy labels. The symbolic translation parameter of a linguistic term is a number within the interval $[-0.5, 0.5)$ that expresses the domain of a label when it is moving between its two lateral labels. Let us consider a set of labels $S$ representing a fuzzy partition. Formally, we have the pair $(S_i, \alpha_i)$, $S_i \in S$, $\alpha_i \in [-0.5, 0.5)$. An example is illustrated in Fig. 3, where we show the symbolic translation of a label that is represented by the pair $(S_2, -0.3)$.
Let us consider the simple problem presented in the previous section. Based on this definition, examples of classic rule and linguistic two-tuple represented rule are as follows.
Classic Rule:

Two-Tuple Representation:
In [35], two different rule representation approaches were proposed: a global approach and a local approach. In our particular case, the tuning is applied at the level of linguistic partitions (global approach). This way, the pair ($X_i$, label) takes the same $\alpha$ value in all the rules where it is considered, i.e., a global collection of two tuples is considered by all the fuzzy rules. For example, "$X_1$ is (High, 0.3)" will present the same value in all those rules in which the pair "$X_1$ is High" was initially considered. This proposal decreases the tuning problem complexity, greatly easing the derivation of optimal models. Another important issue is that, from the parameters $\alpha$ that are applied to each label, we could obtain the equivalent triangular MFs, by which an FRBCS that is based on linguistic two tuples could be represented as a classical Mamdani FRBCS. Notice that the class label and RW of the rule are not modified.
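The conversion from a two-tuple to its equivalent triangular MF can be sketched as follows. The sketch assumes a uniform partition with triangular labels indexed from 0, so that the distance between the peaks of adjacent labels is constant and a displacement of α moves the whole triangle by α times that distance. The function name and the default domain [0, 1] are illustrative choices.

```python
def displaced_triangle(index, alpha, n_labels, lo=0.0, hi=1.0):
    """Return (left, peak, right) of label S_index shifted laterally by alpha.

    Assumes a uniform triangular partition of [lo, hi] with n_labels labels,
    S_0 peaking at lo and S_{n_labels-1} peaking at hi.
    """
    if not -0.5 <= alpha < 0.5:
        raise ValueError("alpha must lie in [-0.5, 0.5)")
    step = (hi - lo) / (n_labels - 1)       # distance between adjacent peaks
    peak = lo + (index + alpha) * step      # lateral displacement of the peak
    return (peak - step, peak, peak + step) # the shape (width) is preserved


# The pair (S_2, -0.3) in a five-label partition of [0, 1].
shifted = displaced_triangle(2, -0.3, 5)
original = displaced_triangle(2, 0.0, 5)
```

Because only the position changes, the tuned classifier can still be read as a classical rule base with ordinary triangular MFs.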
In the following, the main characteristics of the genetic approach that combines rule selection and lateral tuning are presented: genetic model, codification and initial gene pool, chromosome evaluation, crossover operator, and restarting approach.

1) CHC Genetic Model:
The approach that is proposed in [35] considers the use of a specific GA, the CHC algorithm [56]. The CHC algorithm is a GA that presents a good trade-off between exploration and exploitation, making it a good choice for problems with complex search spaces [57]. This genetic model makes use of a mechanism of selection of populations in order to perform an adequate global search. P parents and their corresponding offspring compete to select the best P individuals to take part in the next population. The CHC approach makes use of an incest prevention mechanism and a restarting process to encourage diversity in the population, instead of the well-known mutation operator.
This incest prevention mechanism is considered when applying the crossover operator, i.e., two parents are crossed only if their Hamming distance divided by 2 is more than a predetermined threshold $L$. This threshold value is initialized as the maximum possible distance between two individuals (the number of genes in the chromosome) divided by 4. Following the original CHC scheme, $L$ is decremented by 1 when there are no new individuals in the population in one generation. In order to make this procedure independent of the number of genes in the chromosome, in our case, $L$ is decremented by $\varphi$% of its initial value (with $\varphi$ determined by the user, usually 10%). When $L$ is below zero, the algorithm restarts the population (for more information, see [58]).
A scheme of this algorithm is shown in Fig. 4.

2) Codification and Initial Gene Pool:
To combine the rule selection with the global lateral tuning, a double coding scheme for both rule selection $C_S$ and lateral tuning $C_T$ is used:
1) For the $C_S$ part, each chromosome is a binary vector that determines whether a rule is selected or not (alleles "1" and "0," respectively). Considering the $M$ rules that are contained in the candidate rule set, the corresponding part, i.e., $C_S = \{c_1, \ldots, c_M\}$, represents a subset of rules composing the final RB so that IF $c_i = 1$ THEN ($R_i \in RB$) ELSE ($R_i \notin RB$), with $R_i$ being the corresponding $i$th rule in the candidate rule set and RB being the final RB.
2) For the $C_T$ part, a real coding is considered. This part is the concatenation of the $\alpha$ parameters of each fuzzy partition. Let us consider the following number of labels per variable: $(m_1, m_2, \ldots, m_n)$, with $n$ being the number of system variables. Then, this part has the following form, where each gene is associated with the tuning value of the corresponding label: $C_T = (c_{11}, \ldots, c_{1m_1}, c_{21}, \ldots, c_{2m_2}, \ldots, c_{n1}, \ldots, c_{nm_n})$.
Finally, a chromosome $C$ is coded in the following way: $C = C_S C_T$. To make use of the available information, all the candidate rules are included in the population as an initial solution. To do this, the initial pool is obtained with the first individual having all genes with value "1" in the $C_S$ part and all genes with value "0.0" in the $C_T$ part. The remaining individuals are generated at random.
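The double coding scheme and the initial gene pool can be sketched in a few lines of Python. Representing a chromosome as a pair of lists and drawing the random tuning genes uniformly from [-0.5, 0.5) are assumptions of this sketch, as is the function name.

```python
import random


def initial_population(n_rules, n_labels, pop_size, rng):
    """Build pop_size chromosomes (C_S, C_T).

    C_S : one binary gene per candidate rule (1 = rule selected)
    C_T : one real gene per label, the lateral displacement alpha
    """
    # First individual: every candidate rule selected, no displacement.
    population = [([1] * n_rules, [0.0] * n_labels)]
    while len(population) < pop_size:
        c_s = [rng.randint(0, 1) for _ in range(n_rules)]
        c_t = [rng.random() - 0.5 for _ in range(n_labels)]  # in [-0.5, 0.5)
        population.append((c_s, c_t))
    return population


# Hypothetical sizes: 5 candidate rules, 2 variables with 3 labels each.
population = initial_population(5, 6, 4, random.Random(0))
```

Seeding the pool with the all-ones, zero-displacement individual guarantees that the search starts from a solution at least as good as the untuned candidate rule set.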

3) Chromosome Evaluation:
To evaluate a determined chromosome penalizing a large number of rules, we compute the classification rate and the following function is maximized: where #Hits is the number of patterns that are correctly classified (see Section II-C), $NR_{initial}$ is the number of candidate rules, $NR$ is the number of selected rules, and $\delta$ is a weighting percentage given by the system expert that determines the trade-off between accuracy and complexity. If there is at least one class without selected rules or if there are no covered patterns, the fitness value of a chromosome will be penalized with the number of classes without selected rules and the number of uncovered patterns.

4) Crossover Operator:
The crossover operator will depend on the chromosome part where it is applied.
1) For the $C_T$ part, we consider the Parent Centric BLX (PCBLX) operator [59] (an operator that is based on BLX-α). This operator is based on the concept of neighborhood, which allows the offspring genes to be around the genes of one parent or around a wide zone that is determined by both parent genes. Let us assume that $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$, where $x_i, y_i \in [a_i, b_i] \subset \mathbb{R}$, $i = 1, \ldots, n$, are two real-coded chromosomes that are going to be crossed. We generate the following two offspring. a) $O_1 = (o_{11}, \ldots, o_{1n})$, where $o_{1i}$ is a randomly (uniformly) chosen number from the interval [l 1
2) In the $C_S$ part, the half-uniform crossover scheme (HUX) is employed [58]. The HUX crossover interchanges exactly half of the alleles that are different in the parents (the genes to be crossed are randomly selected from among those that are different in the parents). This operator ensures the maximum distance of the offspring to their parents (exploration).
In this case, four offspring are generated by the combination of the two from the part $C_T$ with the two from the part $C_S$. The two best offspring obtained in this way are considered as the two corresponding descendants.
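Both crossover operators can be sketched in Python. HUX follows the description above directly. For PCBLX, the interval definition is not fully available in the text, so the sketch assumes a commonly cited parent-centric form in which each offspring gene is drawn uniformly around one parent, within a spread equal to the distance between the parent genes and clipped to the gene bounds. Function names are illustrative.

```python
import random


def hux(p1, p2, rng):
    """Half-uniform crossover: swap exactly half of the differing alleles."""
    differing = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
    swap = rng.sample(differing, len(differing) // 2)
    c1, c2 = list(p1), list(p2)
    for i in swap:
        c1[i], c2[i] = c2[i], c1[i]
    return c1, c2


def pcblx(x, y, bounds, rng):
    """Parent-centric BLX (assumed form): one offspring around each parent."""
    o1, o2 = [], []
    for xi, yi, (a, b) in zip(x, y, bounds):
        spread = abs(xi - yi)  # assumed neighborhood radius
        o1.append(rng.uniform(max(a, xi - spread), min(b, xi + spread)))
        o2.append(rng.uniform(max(a, yi - spread), min(b, yi + spread)))
    return o1, o2


rng = random.Random(0)
# Binary parts differing in four positions: HUX swaps exactly two of them.
s1, s2 = hux([1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0], rng)
# Real-coded parts with every gene bounded by [-0.5, 0.5].
x, y = [-0.2, 0.1], [0.1, 0.3]
t1, t2 = pcblx(x, y, [(-0.5, 0.5)] * 2, rng)
```

Pairing each of the two real-coded offspring with each of the two binary offspring yields the four candidate descendants mentioned above, from which the best two would be kept.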
Notice that, since we consider a real coding scheme for the $C_T$ part, the incest prevention mechanism has to transform each real gene into a Gray code (binary code) with a fixed number of bits per gene (BITSGENE), determined by the expert, in order to calculate the Hamming distance between two individuals before applying the crossover operators.
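The incest prevention test on Gray-coded chromosomes can be illustrated as follows. How a real gene is quantized before Gray coding is not specified here, so the linear quantization of [-0.5, 0.5] into 2^BITSGENE levels is an assumption of this sketch, as are the function names.

```python
def to_gray(n):
    """Standard binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def quantize(value, lo, hi, bits):
    """Map a real gene in [lo, hi] to an integer level (assumed scheme)."""
    levels = (1 << bits) - 1
    n = round((value - lo) / (hi - lo) * levels)
    return min(max(n, 0), levels)


def hamming_distance(c1, c2, bits, lo=-0.5, hi=0.5):
    """Distance over the binary C_S genes plus the Gray-coded C_T genes."""
    (cs1, ct1), (cs2, ct2) = c1, c2
    distance = sum(a != b for a, b in zip(cs1, cs2))
    for x, y in zip(ct1, ct2):
        gx = to_gray(quantize(x, lo, hi, bits))
        gy = to_gray(quantize(y, lo, hi, bits))
        distance += bin(gx ^ gy).count("1")
    return distance


def may_cross(c1, c2, threshold, bits):
    """Parents are crossed only if half their distance exceeds L."""
    return hamming_distance(c1, c2, bits) / 2 > threshold


# Hypothetical parents with three rule genes and one tuning gene (4 bits).
parent_a = ([1, 0, 1], [-0.5])
parent_b = ([1, 1, 0], [0.5])
distance = hamming_distance(parent_a, parent_b, bits=4)
l_initial = (3 + 1 * 4) / 4  # total number of coded bits divided by 4
```

Gray coding is convenient here because adjacent quantization levels differ in a single bit, so small changes in a tuning gene produce small changes in the distance.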

5) Restarting Approach:
To get away from local optima, this algorithm uses a restart approach. In this case, the best chromosome is maintained, and the remaining are generated at random. The restart procedure is applied when the threshold value L is below zero, which means that all the individuals coexisting in the population are very similar.

D. Flowchart
In accordance with the previous description, the proposed algorithm to obtain a fuzzy association rule-based classifier is described in the following.
INPUT: A dataset with size T and m attributes, each with q j predefined linguistic terms.
OUTPUT: A fuzzy associative classifier.

Stage 1. Fuzzy Association Rule Extraction for Classification.
For each class C j :
Step 1: Calculate the minimum support of class C j according to (9).
Step 2: Create the levels 0 and 1 of the tree.
Step 3: Create a new level in the tree.
Step 5: If there are more than two nodes in the new level, and the depth of the tree is less than Depth max , go to Step 3.
Step 6: Generate the rules with class C j on the right-hand side.

Stage 2. Candidate Rule Prescreening.
For each class C j :
Step 7: Set the weight of the patterns as 1.
Step 8: Calculate the wWRAcc value for each rule.
Step 9: Select the best rule as a part of the initial RB for Stage 3 and remove it from the candidate rule set.
Step 10: Decrease the weight of the patterns covered by the selected rule.
Step 11: If any pattern has been covered less than k t times and there are more rules in the candidate rule set, go to Step 8.
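Steps 7 to 11 can be sketched as a Python loop for a single class. The sketch makes several assumptions: a pattern counts as covered when its matching degree with the selected rule is positive, only target-class patterns are reweighted, the loop continues while some target-class pattern has been covered at most k_t times, and the rule evaluation criterion is passed in as a hypothetical `score` callback.

```python
def prescreen(rules, labels, target, k_t, score):
    """Select candidate rules for one class with pattern reweighting.

    rules : dict mapping a rule name to its per-pattern matching degrees
    score : callable(mu, weights) -> float, the rule evaluation criterion
    """
    positives = [p for p, y in enumerate(labels) if y == target]
    counts = {p: 0 for p in positives}
    weights = [1.0] * len(labels)          # Step 7
    remaining = dict(rules)
    selected = []
    while remaining and any(counts[p] <= k_t for p in positives):
        # Steps 8 and 9: evaluate every rule and take the best one.
        best = max(remaining, key=lambda r: score(remaining[r], weights))
        mu = remaining.pop(best)
        selected.append(best)
        # Step 10: decrease the weights of the covered positive patterns.
        for p in positives:
            if mu[p] > 0.0:
                counts[p] += 1
                covered_out = counts[p] > k_t
                weights[p] = 0.0 if covered_out else 1.0 / (counts[p] + 1)
    return selected


# Hypothetical data: three candidate rules for class "a".
labels = ["a", "a", "b"]
rules = {"r1": [1.0, 0.0, 0.0], "r2": [0.9, 0.0, 0.0], "r3": [0.0, 0.8, 0.0]}


def weighted_cover(mu, weights):
    """Toy criterion: weighted matching over the class "a" patterns."""
    return sum(w * m for w, m, y in zip(weights, mu, labels) if y == "a")


picked_k1 = prescreen(rules, labels, "a", 1, weighted_cover)
picked_k0 = prescreen(rules, labels, "a", 0, weighted_cover)
```

In the example, rule r3 overtakes r2 after the first selection because the pattern that r2 covers has already been down-weighted; with k_t = 0 the loop stops as soon as every class "a" pattern has been covered once.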

Stage 3. Rule Selection and Lateral Tuning.
Step 12: Generate the initial population with P chromosomes.
Step 13: Evaluate the population.
Step 14: Initialize the threshold value taking into account Gray codings, i.e., L = L initial .
Step 15: Generate the next population as follows.
1) Shuffle the population.
2) Select the parents two by two. Each pair is crossed if the Hamming distance between the parent Gray codings divided by 2 is more than L.
3) Evaluate the new individuals.
4) Join the parents with their offspring, and select the best P individuals to take part in the next population.
Step 16: If the best chromosome does not change or there are no new individuals in the population, then L = L − (L initial * 0.1).
Step 17: If L < 0, restart the population and initialize L.
Step 18: If the maximum number of evaluations is not reached, go to Step 15.
A scheme of this algorithm is shown in Fig. 5.

IV. EXPERIMENTAL SETUP
Several experiments have been carried out in this paper to evaluate the usefulness of our proposal. In the following, first, we describe the real-world databases that are used in these experiments; second, we introduce a brief description of the methods considered for comparison; third, we show the configuration of the methods (determining all the parameters used); and finally, we describe the statistical analysis that is adopted in this study.

A. Datasets
In order to analyze the performance of the proposed approach, we have considered 26 real-world datasets. Table II summarizes the main characteristics of the 26 datasets and shows the link to the Knowledge Extraction based on Evolutionary Learning (KEEL)-dataset repository [60] from which they can be downloaded, where "Attributes(R/I/N )" is the number of (Real/Integer/Nominal) attributes in the data, "Patterns" is the number of patterns, and "Classes" is the number of classes. Notice that we have removed the instances with any missing value in the datasets (Cleveland and Crx), and 12 datasets have a number of variables greater than or equal to 15.
To develop the different experiments, we consider a tenfold cross-validation model, i.e., we randomly split the dataset into ten folds, each containing 10% of the patterns of the dataset, and use nine folds for training and one for testing. For each of the ten partitions, we executed three trials of the algorithms.
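The partitioning scheme can be sketched in Python as follows. This is only an illustration of a plain random tenfold split; the function name and the fixed seed are assumptions, and no claim is made about how the partitions used in the experiments were actually generated.

```python
import random


def tenfold_splits(n_patterns, n_folds=10, seed=0):
    """Yield (train, test) index lists for a random k-fold partition."""
    indices = list(range(n_patterns))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Hypothetical dataset with 150 patterns.
splits = list(tenfold_splits(150))
```

Each pattern appears in exactly one test fold, and running three trials per partition would then give 30 runs per dataset and algorithm.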

B. Methods Considered for Comparison
In these experiments, we compare the proposed approach with ten other methods, which are available in the KEEL software tool [61]. A brief description of these methods is as follows.
1) C4.5 [39]: This is a well-known algorithm used to generate a decision tree from a set of training data in the same way as the ID3 algorithm [62]. The extensions or improvements with respect to ID3 are that it accounts for unavailable or missing values in data, it handles continuous attribute value ranges, it chooses an appropriate attribute selection measure (maximizing the gain ratio), and it prunes the resulting decision trees.
2) Classification based on associations (CBA) [12]: This method consists of two parts. In the first part, an algorithm based on the Apriori algorithm [31] is used to mine the interval association rules for classification. In the second part, the method sorts the generated rules according to their precedence relation and chooses a set of high precedence rules to cover the training data.
3) CBA2 [13]: This method is the second version of the CBA algorithm, which improves the previous system by the use of multiple class minimum support in rule generation.
4) Classification based on multiple association rules (CMAR) [14]: This method extends an efficient frequent pattern (FP) mining method, i.e., FP-Growth [63], constructs a class distribution-associated FP-tree, and mines large databases efficiently. Moreover, it applies a CR-tree structure to store and retrieve mined interval association rules efficiently, and it prunes rules effectively based on confidence, correlation (by using a weighted chi-square method), and database coverage. The classification is performed based on a weighted chi-square analysis using multiple strong association rules.
5) Structural learning algorithm on vague environment (2SLAVE) [64]: This method is a modification of the GA of the SLAVE algorithm [65] in order to include a feature selection process. This is an inductive learning algorithm based on the iterative rule learning approach, in which each chromosome represents a rule, to obtain a set of disjunctive normal form (DNF) fuzzy rules. Chromosomes compete in every GA run, choosing the best rule per run. The global solution is formed by the best rules obtained when the algorithm is run multiple times.
6) Learning algorithm to discover fuzzy association rules for classification (LAFAR) [24]: This method uses a GA to automatically determine the minimum fuzzy support and the minimum fuzzy confidence. To evaluate a determined chromosome, this method finds frequent fuzzy grids and generates fuzzy classification rules from them. Once the whole classifier is obtained, the fitness value can be calculated, which maximizes the classification accuracy rate and minimizes the number of fuzzy rules. When reaching the termination condition, the chromosome with the maximum fitness value is used to test the performance of the proposed method.
7) Classification based on predictive association rules (CPAR) [15]: This method adopts a greedy algorithm to generate interval association rules directly from training data. In this process, this algorithm selects multiple literals with similar gains to build multiple rules simultaneously in order to avoid missing important rules. To perform the classification, the method uses expected accuracy to evaluate each rule and uses the best k rules in prediction.
8) Fuzzy hybrid genetic-based machine learning algorithm (FH-GBML) [66]: This method follows a genetic cooperative-competitive learning (GCCL) approach and consists of two processes. The first process is used to generate good fuzzy rules while the second one is used to find good combinations of generated fuzzy rules. This method simultaneously uses multiple fuzzy partitions with different granularities for fuzzy rule extraction, using four homogeneous fuzzy partitions with triangular fuzzy sets and a don't-care condition.
9) Steady-state GA for extracting fuzzy classification rules from data (SGERD) [67]: It is a steady-state GA to generate a prespecified number of Q rules per class following a GCCL approach. In each iteration, parents and their corresponding offspring compete to select the best Q rules for each class. This method also simultaneously uses multiple fuzzy partitions with different granularities and a don't-care condition for fuzzy rule extraction.
10) Classification with fuzzy association rules (CFAR) [27]: This method uses the Apriori algorithm to mine all the fuzzy association rules for classification and removes the conflicting and redundant rules to generate a compact set of rules denoted as CompSet. Then, this method selects the best rules to build the classifier by means of two processes. In the first process, for each pattern, CompSet is sorted by matching and confidence degree, rewarding the best rule that classifies this pattern and punishing the rules that do not classify it. In the second process, the worst rules from CompSet are removed. These processes are iterated until the error rate in the training set increases.

C. Parameters of the Methods
The parameters of the analyzed methods are shown in Table III. Notice that only the rules with a number of antecedent conditions less than or equal to 3 are examined for our proposal. This restriction is intended to facilitate the discovery of a small number of short (i.e., simple) fuzzy rules. The parameters of the remaining methods were selected according to the recommendation of the corresponding authors within each proposal, which are the default parameter settings included in the KEEL software tool [61]. Notice that in the FH-GBML algorithm, the authors used three different probabilities of don't care (0.5, 0.8, and 0.95 depending on the size of the dataset) to obtain fuzzy rules with a few antecedent fuzzy sets. In these experiments, we have used these three probabilities of don't care in each dataset and have shown in the tables the best average result obtained for each one. Furthermore, in the CFAR algorithm, the authors used 0.1 as the minimum support and this could be very high for some datasets (we are using 0.05 as the minimum support in our proposal). Likewise, we have used these two minimum supports in each dataset (0.1 and 0.05), and we have shown in the tables the best average result obtained in each one.
The initial linguistic partitions for our proposal and the fuzzy methods analyzed are comprised of five linguistic terms with uniformly distributed triangular MFs giving meaning to them, except in the FH-GBML and SGERD algorithms, where the partitions are comprised of two, three, four, and five linguistic terms for each attribute. The discretization of the continuous attributes for the CBA, CBA2, CMAR, and CPAR algorithms is done by the use of the entropy method [68]. Notice that we use a crisp label for each value of the nominal variables.
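As an illustration of the initial partitions described above, the following sketch builds five uniformly distributed triangular MFs over an attribute domain and evaluates the membership degree of a value. It is a minimal example under assumptions: the function names and the (a, b, c) representation of a triangle by its left foot, peak, and right foot are introduced here and are not taken from the paper.

```python
def triangular_partition(lo, hi, n_terms=5):
    """Return (a, b, c) triples for n_terms uniformly spaced triangular MFs on [lo, hi]."""
    step = (hi - lo) / (n_terms - 1)
    # The peaks lie at lo, lo + step, ..., hi; each foot reaches the neighboring peak.
    return [(lo + (k - 1) * step, lo + k * step, lo + (k + 1) * step)
            for k in range(n_terms)]


def membership(x, tri):
    """Membership degree of x in the triangular fuzzy set tri = (a, b, c)."""
    a, b, c = tri
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)
```

With this construction, any value inside the domain belongs to at most two adjacent linguistic terms, and its membership degrees sum to one.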

D. Statistical Analysis
In order to assess whether significant differences exist among the results, we adopt statistical analysis [41]- [43] and, in particular, nonparametric tests, according to the recommendations made in [40], where a set of simple, safe, and robust nonparametric tests for statistical comparisons of classifiers has been introduced.
For pairwise comparison, we use Wilcoxon's Signed-Ranks test [69], [70], and for multiple comparison we employ Friedman's test [71], Iman and Davenport's test [72], and Holm's method [73]. In order to perform multiple comparisons, it is first necessary to check whether the results obtained by the algorithms present any significant difference (Friedman's test and Iman-Davenport's test); if one is found, we then use a post-hoc test to compare the control algorithm with the remaining algorithms (Holm's test). We use α = 0.05 as the level of significance in all cases. A wider description of these tests, together with software for their use, can also be found at: http://sci2s.ugr.es/sicidm/.

TABLE IV: RESULTS OBTAINED BY THE ANALYZED METHODS
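The multiple-comparison procedure can be sketched as follows. This is a hedged illustration using the standard textbook formulas for the Friedman statistic, the Iman-Davenport correction, and Holm's step-down procedure with a normal approximation for the p-values; the function names and the layout of `results` (one row per dataset, one column per method, higher is better) are assumptions of this sketch and do not come from the software referenced above.

```python
import math


def average_ranks(results):
    """Average rank of each method (rank 1 is best; ties share the mean rank)."""
    n, k = len(results), len(results[0])
    totals = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda m: -row[m])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2 + 1
            for m in order[pos:end + 1]:
                totals[m] += mean_rank
            pos = end + 1
    return [t / n for t in totals]


def friedman_iman_davenport(results):
    """Return the average ranks, the Friedman statistic, and the Iman-Davenport statistic."""
    n, k = len(results), len(results[0])
    ranks = average_ranks(results)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
    denom = n * (k - 1) - chi2
    # The correction is unbounded when every dataset yields the same ordering.
    f_id = math.inf if denom == 0 else (n - 1) * chi2 / denom
    return ranks, chi2, f_id


def holm(ranks, n, control, alpha=0.05):
    """Holm's step-down test of the control method against each other method."""
    k = len(ranks)
    se = math.sqrt(k * (k + 1) / (6 * n))
    tests = []
    for m in range(k):
        if m != control:
            z = (ranks[m] - ranks[control]) / se
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
            tests.append((p, m))
    tests.sort()
    out, rejecting = [], True
    for step, (p, m) in enumerate(tests):
        threshold = alpha / (len(tests) - step)
        rejecting = rejecting and p < threshold  # stop at the first non-rejection
        out.append((m, p, threshold, rejecting))
    return out
```

The Friedman statistic is compared with a chi-square distribution with k − 1 degrees of freedom, and the Iman-Davenport statistic with an F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, where k is the number of methods and N the number of datasets.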

V. EXPERIMENTAL RESULTS
In this section, we analyze the results obtained in the different experiments. This section is organized as follows.
1) In Section V-A, we show a statistical study obtained from the comparison with three other GFSs: FH-GBML [66], 2SLAVE [64], and SGERD [67].
2) In Section V-B, we compare the performance of our approach with two other approaches to obtain a fuzzy associative classifier: the LAFAR algorithm [24] and the CFAR algorithm [27].
3) In Section V-C, we compare the performance of our approach with the C4.5 decision tree [39] and four classical approaches for associative classification: the CBA algorithm [12], the CBA2 algorithm [13], the CMAR algorithm [14], and the CPAR algorithm [15].
4) In Section V-D, we show an analysis of the performance of our approach, depending on the depth of the trees and the number of evaluations in the genetic process.
5) In Section V-E, we analyze the scalability of our proposal.

A. Comparison With Other Genetic Fuzzy Systems
This section analyzes the performance of our model against three recognized GFSs. The results obtained by the analyzed methods are shown in Table IV, where we have the following.
1) #R stands for the average number of rules.
2) #C stands for the average number of conditions in the antecedent of the rules.
3) Tra stands for the average classification percentage obtained over the training data.
4) Tst stands for the average classification percentage obtained over the test data.
The best global result for each one is stressed in boldface in each case.
In order to compare the results, we have used nonparametric tests for multiple comparison to find the best approach (see Section IV-D), considering the average results obtained in test (Tst). First of all, we have used the Friedman and Iman-Davenport tests in order to find out whether significant differences exist among all the mean values. Table V shows the Friedman and Iman-Davenport statistics, and it relates them to the corresponding critical values for each distribution by using a level of significance, i.e., α = 0.05. The p-value obtained is also reported for each test. Given that the statistics of Friedman and Iman-Davenport are clearly greater than their associated critical values, there are significant differences among the observed results with a level of significance α ≤ 0.05. Table VI shows the rankings (which are computed by the use of a Friedman test) of the different methods that are considered in this study.

TABLE V: RESULTS OF THE FRIEDMAN AND IMAN-DAVENPORT TESTS (α = 0.05)
TABLE VI: AVERAGE RANKINGS OF THE METHODS
TABLE VII: HOLM'S TABLE FOR
We now apply Holm's test to compare the best ranking method (FARC-HD) with the remaining methods. Table VII presents these results. In this table, the methods are ordered with respect to the z-value obtained. Holm's test rejects the hypothesis of equality with the rest of the methods (p < α/i). Therefore, by the analysis of the statistical study that is shown in Tables VI and VII, we conclude that our model is a solid approach to deal with high-dimensional datasets, as it has shown itself to be the most accurate method when compared with the remaining GFSs that are applied in this study.
Finally, the results presented in Table IV show that our proposal obtains a higher average number of rules (39.2 rules on average) than the other GFSs, which are good approaches to obtain very compact models. Nevertheless, it shows a good trade-off that favors accuracy, with rules involving no more than three attributes in their antecedent, which gives the advantage of easier understanding with respect to the 2SLAVE and FH-GBML.

B. Comparison With Other Fuzzy Associative Classifiers
In this section, we compare the performance of our model with two other approaches to obtain a fuzzy associative classifier: the LAFAR algorithm [24] and the CFAR algorithm [27]. The results obtained by these methods are shown in Table VIII. (This kind of table was described in Section V-A.) Notice that we show fewer datasets; this is due to scalability problems in the LAFAR and CFAR algorithms, which cannot run on all datasets.
In order to compare the algorithms, we use a Wilcoxon test, whose results are shown in Table IX. We can observe that the null hypothesis for the Wilcoxon test has been rejected (p-value ≤ α), and our proposal has achieved a higher ranking. We may conclude that our proposal also presents the best performance in this case.
On the other hand, the results presented in Table VIII show that our approach obtains an average number of rules lower than the LAFAR and CFAR algorithms. However, the CFAR algorithm obtains fewer rules than our approach in 11 of the 18 datasets.
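The pairwise comparison used in this section can be illustrated with a short sketch of Wilcoxon's signed-ranks test. It is written under assumptions: zero differences are simply discarded, ties in absolute difference share the mean rank, and the p-value comes from the normal approximation, which is only reasonable for a moderately large number of datasets. The function name and return values are hypothetical.

```python
import math


def wilcoxon_signed_ranks(acc_a, acc_b):
    """Return (R+, R-, approximate two-sided p-value) for paired results."""
    diffs = [a - b for a, b in zip(acc_a, acc_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, sharing the mean rank among ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        for i in order[pos:end + 1]:
            ranks[i] = (pos + end) / 2 + 1
        pos = end + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t = min(r_plus, r_minus)
    z = (t - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p = math.erfc(abs(z) / math.sqrt(2))
    return r_plus, r_minus, p
```

Here R+ accumulates the ranks of the datasets on which the first method outperforms the second, and R- the ranks of the opposite case; the null hypothesis is rejected when the p-value does not exceed α.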

C. Comparison With Classical Approaches
This section analyzes the performance of our model against five classical approaches. The results obtained by the analyzed methods are shown in Table X.
In order to compare the results, we have applied the nonparametric tests described in Section V-A. Table XI shows that the statistics of Friedman and Iman-Davenport are clearly greater than their associated critical values, and there are significant differences among the observed results at a level of significance α ≤ 0.05. Table XII shows the rankings (computed using a Friedman test) of the different methods considered in this study. Table XIII shows that Holm's test rejects the hypothesis of equality with the rest of the methods (p < α/i). Therefore, by the analysis of the statistical study shown in Tables XII and XIII, we conclude that our model is the best performing method when compared with the remaining classical approaches applied in this study. Finally, the results presented in Table X show that our proposal obtains a smaller average number of rules than the remaining approaches.

D. Analysis of the Influence of Depth_max and the Number of Evaluations
In this section, several experiments have been carried out to analyze the performance of our approach depending on Depth_max and the number of evaluations in the genetic selection and tuning process (using the experimental setting described in Section IV). In order to make this analysis easier to interpret, we have used four representative datasets in these experiments: Yeast, Vowel, Ringnorm, and Spectfheart (8, 13, 20, and 44 variables, respectively). Table XIV shows the results obtained with three different values for Depth_max (2, 3, and 4), where #R1 stands for the average number of rules obtained at the end of Stage 1, #R2 stands for the average number of rules obtained at the end of Stage 2, #R3 stands for the average number of rules obtained at the end of Stage 3, and "time" stands for the average runtime (in format hh:mm:ss).
By the analysis of the results presented in Table XIV, we can highlight the following facts.
1) Candidate rule prescreening allows the selection of a reduced number of interesting rules with the three values for Depth_max, decreasing the computational cost in the genetic selection and tuning process. Notice that the number of rules obtained in Stage 1 for the dataset Spectfheart is higher than 300 000 rules.
2) When we use Depth_max = 4, the proposed approach does not obtain an important improvement in training for three of the four datasets and only improves the results obtained in test for two of the four datasets. Moreover, the increase of the computational cost is high in all datasets, so that Depth_max = 3 is a value with a good compromise between both properties.
On the other hand, Fig. 6 depicts the accuracy obtained over the training data along with different numbers of evaluations in the genetic process with Depth_max = 4. In this figure, we can highlight how this process obtains the best solution in fewer than 14 000 evaluations in all datasets because the initial RBs consist of a reduced number of rules.
Taking into account both studies, a good neutral choice ensuring the convergence may be to use 3 for Depth_max and 15 000 for the number of evaluations in the genetic process, obtaining a good trade-off between accuracy and computational cost (good accuracy and not too great a computational cost).

E. Analysis of Scalability
Table XV shows the average runtime of the analyzed methods in the previous sections on 26 real-world problems (with a number of variables ranging from 4 to 90 and a number of patterns ranging from 150 to 19 020) and using the tenfold cross-validation model. The methods were implemented using Java, and all of the experiments were performed using a Pentium Core 2 Quad, 2.5-GHz CPU with 4 GB of memory and running Linux.
By the analysis of the results presented in Table XV, we can draw the following conclusions.
1) The SGERD algorithm presents a very low average runtime in all datasets, obtaining a good scalability when we increase the size of the problem. This method, however, is ranked the worst in Friedman's test when we compare the results obtained in the test data (see Table VI).
2) The 2SLAVE, FH-GBML, LAFAR, and CFAR algorithms require a large amount of time when the number of attributes and patterns in the dataset is high. Notice that the CFAR and LAFAR cannot run on 7 and 18 of the 26 datasets, respectively.
3) The remainder of the methods obtain low computational costs in all datasets and present good results in accuracy. However, our proposal obtains the best ranking in Friedman's test when we compare the results that are obtained in the test partitions.
4) Notice that the CBA and CBA2 algorithms present similar runtimes to the CMAR and CPAR algorithms because they limit the total number of candidate rules that are generated in datasets with more than 15 variables, since they cannot be completed without this limit.
5) The FARC-HD approach presents a good computational cost in all datasets, obtaining a good scalability and the best performance in accuracy.

VI. CONCLUDING REMARKS
In this paper, we have proposed a new fuzzy associative classification method for high-dimensional datasets, named FARC-HD. Our aim was to obtain accurate and compact fuzzy associative classifiers with a low computational cost. To do this, we mine fuzzy association rules limiting the order of the associations in order to obtain a reduced set of candidate rules with fewer attributes in the antecedent. We have made use of a pattern weighting scheme in order to reduce the number of candidate rules, preselecting the rules with the best quality. A genetic rule selection and lateral tuning have been applied to select a small set of fuzzy association rules with a high classification accuracy.
Taking into account the results obtained, we can conclude that our model is a solid approach to deal with high-dimensional datasets, as it obtains the best accuracy in the experimental study. Moreover, the FARC-HD obtains models with a reduced number of rules (39.2 rules on average) and, particularly, with few attributes in the antecedent. Finally, the limit in the depth of the trees, along with candidate rule prescreening using the fuzzy measure wWRAcc, allows us to reduce the search space considerably. Thus, the genetic process for selection and tuning does not introduce an excessive computational cost into the whole process.

Jesús Alcalá-Fdez received the M.Sc. and the Ph.D. degrees in computer science, both from the University of Granada, Granada, Spain, in 2002 and 2006, respectively.
He was with the Department of Computer Science, University of Jaén, Jaén, Spain, from 2005 to 2007. He is currently an Associate Professor with the Department of Computer Science and Artificial Intelligence, University of Granada, where he is a member of the Soft Computing and Intelligent Information Systems Research Group. He has published more than 50 papers in international journals, book chapters, and conferences. He was involved in several research projects supported by the Spanish Government and the Andalusian Government. His current research interests include fuzzy association rules, genetic fuzzy systems, and data mining software.
Dr He was with Department of Computer Science, University of Jaén, Jaén, Spain, from 1998 to 2003. He is currently an Associate Professor with the Department of Computer Science and Artificial Intelligence, University of Granada, where he is a member of the Soft Computing and Intelligent Information Systems Research Group. He has published more than 75 papers in international journals, book chapters, and conferences. He was involved in several research projects supported by the Spanish Government and the European Union. His current research interests include multiobjective genetic algorithms and genetic fuzzy systems, particularly, the learning/tuning of fuzzy systems for modeling, and control with a good trade-off between accuracy and interpretability, as well as fuzzy association rules.
Dr. Alcalá co-edited the IEEE TRANSACTIONS ON FUZZY SYSTEMS Special Issue on "Genetic Fuzzy Systems: What's next," the Evolutionary Intelligence Special Issue on "Genetic Fuzzy Systems: New Advances," and the Soft Computing Special Issue on "Evolutionary Fuzzy Systems." He currently serves as a member of the editorial/reviewer board of the International Journal of Computational Intelligence Research, the Journal of Advanced Research in Fuzzy and Uncertain Systems, the Journal of Universal Computer Science, and Applied Intelligence. He is a member of the Fuzzy Systems Technical Committee of the IEEE Computational Intelligence Society and has been the President of the "Genetic Fuzzy Systems" Task Force since January 2009. He was the Program Co-Chair at the Conference of Genetic and Evolutionary Fuzzy Systems in 2010, the General Co-Chair at the Conference of Genetic and Evolutionary Fuzzy Systems in 2011, and the Area Co-Chair at the IEEE International Conference on Fuzzy Systems in 2011.
Francisco Herrera (M'10) received the M.Sc. and the Ph.D. degrees, both in mathematics, from the University of Granada, Granada, Spain, in 1988 and 1991, respectively.
He is currently a Professor with the Department of Computer Science and Artificial Intelligence, the University of Granada. He has published more than 200 papers in international journals. He is a coauthor of the book Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (Singapore: World Scientific, 2001). His current research interests include computing with words and decision making, bibliometrics, data mining, data preparation, instance selection, fuzzy rule-based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, and memetic and genetic algorithms.
Dr. Herrera has co-edited five international books and 21 Special Issues in international journals on different Soft Computing topics. He currently acts as the Editor-in-Chief of the international journal Progress in Artificial Intelligence (Springer) and serves as an Area Editor of the journal Soft Computing (area of evolutionary and bioinspired algorithms). He acts as an Associated Editor of the journals IEEE TRANSACTIONS ON