A Machine Learning Approach to Reveal the NeuroPhenotypes of Autisms

Although much research has been undertaken, the spatial patterns, developmental course, and sexual dimorphism of brain structure associated with autism remains enigmatic. One of the difficulties in investigating differences between the sexes in autism is the small sample sizes of available imaging datasets with mixed sex. Thus, the majority of the investigations have involved male samples, with females somewhat overlooked. This paper deploys machine learning on partial least squares feature extraction to reveal differences in regional brain structure between individuals with autism and typically developing participants. A four-class classification problem (sex and condition) is specified, with theoretical restrictions based on the evaluation of a novel upper bound in the resubstitution estimate. These conditions were imposed on the classifier complexity and feature space dimension to assure generalizable results from the training set to test samples. Accuracies above 80% on gray and white matter tissues estimated from voxel-based morphometry (VBM) features are obtained in a sample of equal-sized high-functioning male and female adults with and without autism (N = 120, n = 30/group). The proposed learning machine revealed how autism is modulated by biological sex using a low-dimensional feature space extracted from VBM. In addition, a spatial overlap analysis on reference maps partially corroborated predictions of the “extreme male brain” theory of autism, in sexual dimorphic areas.


Introduction
The discovery of a characteristic pattern of structural brain differences associated with autism spectrum conditions (ASC) would be a major advance in our understanding of this complex and highly variable developmental disorder. Not only would it be a substrate upon which to build a systems narrative of autism, but from this starting point, it might be possible to run development "in reverse" to disentangle the roles of environmental and genetic risk factors in its etiology. Unfortunately, "inconsistent" is perhaps the most common adjective associated with the extant literature in this area.
Initial MRI observations focused on increased total brain, total tissue, and total lateral ventricle volumes in adults with autism, 1 but subsequent meta-analyses, including MRI-derived measurements as well as head circumference and post-mortem brain weights, were only able to detect a significant casecontrol difference amongst 4-5 year olds. 2 More recent meta-analyses based on fully automated voxel measures of gray and white matter volumes have confirmed the absence of any large difference in adults in either total gray matter (GM) 33 or total white matter (WM), 3 or in cerebellar volume. 4 Thus, if there are differences in overall brain size, then they occur very early in life 81 and are followed by a decrease in volume towards normative values during maturation in adolescence and early adulthood. 2,5

Gray and white matter distributions in the autistic brain
Our knowledge of localized changes in brain anatomy has historically come from GM and WM distributions estimated by voxel-based morphometry (VBM). Undergoing several iterations and improvements over time, VBM pipelines are automated, reliable 6 and statistically well behaved. 7 The success of the VBM technique can be measured by the large number of wide-ranging longitudinally and crosssectionally designed studies, as well as consistent cross-study patterns of tissue differences characterizing disorders like schizophrenia, 8 Alzheimer's disease 78 and major depressive disorder. 9,10 Whilst measures of global volume have generally pointed toward increases in the very early lives of those with autism, local comparisons of gray and white matter volume have variously implicated both regional increases and decreases. The greatest consistency that emerges from metaanalyses [11][12][13][15][16][17][18]33,80 is the large variance of both the primary literature and the outcomes of the derived meta-analyses, where the inclusion or exclusion of just a few primary sources can significantly alter the aggregated pattern of differences.
Explanations for the absence of a coherent narrative on the structural brain differences associated with autism have been suggested to arise from variety of origins. Whilst methodological differences vary between studies, 19 similar variability in VBM pipelines has not hindered the observation of characteristic patterns in other disorders. [8][9][10] Autism is also highly heterogeneous. 20 Indeed, under DSM-IV, four categories were contained within Autism Spectrum Disorder: Asperger's disorder, childhood disintegrative disorder and pervasive developmental disorder. It has been argued that no reliable clinical diagnosis has been made with these sub-types 21 and that no consistent biological substrate has been discovered that differentiates between them leading to a single diagnostic category of Autism Spectrum Disorder in DSM-V 22 that subsumes the sub-types. Nevertheless, it is unlikely that there will be an observation that unites these disorders, and in fact greater diversity in phenotyping, perhaps down to the individual level, is argued to be more likely to identify an underlying neurobiology for autism. 23 Stratification does appears to yield sensitivity improvements; for example, children with regressive autism appear to have increases in brain size, whereas those with nonregressive autism are not associated with a significant size change 24 ; and males and females with autism display different patterns of GM change. 27 What is almost certain is that should there be a true effect, it is spatially diffuse and generally of low effect size.
The main demographic features of ASC are its high prevalence, affecting 1% of the entire population, [29][30][31] and the significant skewing of the gender ratio towards male individuals to give values of 2-3:1 male:female, [30][31][32] and potentially contributing to the heterogeneous patterns of brain structure associated with the disorder. Previously, most studies of the biology of autism have focused predominantly on males 33 with male:female ratios in research samples in the range 8-15:1. Studies of functional imaging modalities that undertook analyses in each sex independently have found widespread significant cross-sectional differences. 34 Structural MRI studies of autism following similar stratification strategies have been successful in assessing the atypical brain areas that are shared and distinct across the sexes. 3 In a recent MRI study comprising highfunctioning male and female adults with autism, 34 exploratory analyses based on a VBM univariate statistical framework uncovered substantial heterogeneity in the neurobiology of autism. Furthermore, stratification by sex can provide the empirical evidence to test predictions from the "extreme male brain" (EMB) theory of autism 71 and other similar theories 74 that predict a relationship between autism and biological sexual differentiation.

The machine learning approach
A univariate analysis statistically processes single features independently. Features may be voxelwise comparisons, 65,66 regions of interest (ROIs), 67 or models of brain features such as multiple hypothesis testing on VBM estimates 44 and intrinsic curvature. 46 However, detecting a putative small magnitude, diffuse spatial pattern of differences in autism with a univariate approach is difficult, even in large datasets, 25,26 due to its low sensitivity to this type of effect.
Recently, effort has been expended on developing multivariate predictive models of biological sex based on univariate differential patterns of brain structure, i.e. cortical thickness. 38 In a sample of neurotypical males and females, these models were subsequently applied to males and females with autism. Multivariate approaches have also been applied to other imaging measures that are hypothesized to be altered in autism, including brain networks, 39,47 texture features, 40,42,43 and other voxel and regionwise decomposition techniques,. 48,49,69,79 In general, multivariate cross-sectional studies of autism with VBM estimates of tissue volumes perform somewhat better than univariate methods. 25,28 One of the major limitations of a multivariate computer-aided diagnosis (CAD) systems 45,[82][83][84] is the need for a sample size (l) that is sufficiently large compared to the number of dimensions d (predictors) 54 ; that is, l d. Neuroimaging rarely satisfies this condition, thus the proposed learning machines must be designed and adapted to this hostile environment, otherwise control of the generalization ability of the learning process is arguably weak. 55,58 A solution to this problem may be achieved by the application of feature extraction/selection (FES) algorithms 54 with linear classifiers rather than defining a specific metric on the data, 85 or by evaluating the role that each dimension plays in the development of the machine learning architecture. 50 The aim is twofold: (i) to reduce the number of input dimensions, retaining the relevant information by measuring the importance of each dimension, and (ii) to control the complexity of the selected classifiers to avoid overfitting, 51,52 reducing the false positive rate (type I error). In this way, linear regressors endowed with regularization (e.g. Lasso) have been very useful in removing variables of reduced relevance from the model architecture, enhancing the contribution of the remaining variables. 53 Nevertheless, regularization approaches work on the input space enforcing an aggressive sparsity that results in a reduction of the set of nonfiltered variables to a value comparable with the number of training instances, 53 which is particularly damaging in neuroimaging.
In this work, we propose a new multivariate methodology based on a one versus one group classification scheme working on a feature space. The proposed system is designed under theoretical conditions imposed on the classifier complexity and the dimensionality of the input pattern. Briefly, the method comprises a subject-specific and regionwise partial least squares (PLS) FES that is then used to identify differences from patterns of GM and WM derived from MRI in a factorial design with factors of sex (male and female) and condition (autism or control), specifically: (i) identification of potential sex-related risk and protective factors for autism; (ii) characterization of the four classes described in terms of sex and condition; and (iii) visualization of sexually dimorphic areas related to the probability of classification of autism in males and females. This paper is organized as follows: Sec. 2 introduces the dataset and individual preprocessing steps implemented. In Sec. 3, the overall methodology is presented, including a distribution-free upper bound for the true error rate of a classifier, using the resubstitution error estimate. These theoretical considerations permit the definition of reliable statistical learning and validation schemes using linear Support Vector Machine (SVM) and FES algorithms. In the context of the previous section, we demonstrate how the use of these techniques provides a meaningful overlap analysis between group-difference maps for VBM comparisons. Later, in Sec. 4, the experimental design is outlined, along with a description of a complete set of experiments, including the results obtained in each experiment, a t-test for significance in the classification experiments, and finally the spatial overlap analysis of the derived PLS-based maps with the same p-value as the t-test maps. In Sec. 5, these results are discussed and critiqued, and finally in Sec. 6, conclusions are drawn.

Materials: A Heterogenous, Multidimensional and Multiclass MRI Dataset
Participants (l = 120) included 30 right-handed premenopausal females and 30 males with autism, along with 30 neurotypical females and 30 neurotypical males (see Table 1 34,37,44 In addition, a larger multicenter male sample from the MRC AIMS project 37 was also used in this paper for spatial overlap analysis. It consisted of 84 neurotypical adult males and 84 males with autism matched for age and full-scale IQ (see Ref. 37 for further details).

Methodology
The main goal of this section is twofold: (i) to provide a statistical framework (see Fig. 1) under the conditions detailed in Sec. 3.1 for a set of linear classifiers; and (ii) to relieve the curse of dimensionality encountered in machine learning algorithms processing high-resolution images of the brain. The aim of using such filter methods 68 is to obtain a lowdimensional set of features and then to reduce the empirical risk without increasing the capacity of the set of classifiers (VC dimension, 57 or the number of separating dichotomies 61 ) by means of a linear SVM.

Upper bounds of error for machine learning in neuroimaging
In a neuroimaging classification problem, a d-dimensional input pattern x is observed and the aim is to determine to which class or condition y belongs to, considered as a random variable k = 1, . . . , K. In general, given a training sample of l pairs (x i , y i ), the parametric set of functional dependencies {F (x, α)}, where α is a parameter defining the set of specific functions with cardinality equal to N ,  is required that minimizes the probability of misclassification. In other words, we minimize on the basis of the empirical data. In this case, the functional is known as the empirical risk and can be computed without knowing P (x, y) as Let the minimum of Eq. (2) be attained for F (x, α emp ). The primary question is to establish when the decision rule F (x, α emp ) is close to F (x, α o ), that which is obtained by minimizing Eq. (1). Following the derivation shown in the appendix, the actual risk obtainedby α emp is bounded by probability 1 − η: In addition, by the use of homogeneous linear threshold functions as decision rules whenever the dimension (i.e. number of voxels/regions) of the input pattern x is far from the number of samples (i.e. scans), d l, P (α emp ) is seldom close to 0 and therefore the bound for the actual risk performs satisfactorily. In other words, for images of the brain where the number of scans is rather small compared to the number of voxels or regions, linear classifiers protect the system from overfitting, thus the variance of the actual risk is not large. Following the theory of homogeneous linear classifiers and applying the classical combinatorial geometry to develop the separating properties of decision surfaces, 55 we can take a step forward and refine the bound in Eq. (3) as where N (l, d) is the number of linearly separable dichotomies of l samples in a d-dimensional space (see further details in the appendix). Therefore, under the aforementioned conditions (d l), the empirical risk P emp (α emp ) is an indicator of the generalization ability of the statistical classifier, and the maximum deviation of the frequencies from the corresponding probability of the empirical solution (∆P = P − P emp ) can be derived with probability 1 − η (see appendix).

PLS-based CAD system
Providing a significant set of features is important to achieving a small empirical risk given a classifier of fixed complexity (capacity). The binary FES and classification stages within the multiclass classification problem are performed in a class patternspecific manner, rather than using other strategies which combine different classes for subsequent fitting of binary statistical classifiers, i.e. the one versus rest-based model. In this way, given K class patterns, a total number of N s = K(K − 1)/2 binary classifiers may be fitted based on the same number of training subsets. This one versus one classification model (see Fig. 1) allows the assessment of which neuro-phenotypes defined in the N s training subsets are dominant in the corresponding N s test subsets. After that, the classification results can be merged to provide an overall classification result, defined as a decoding process in ternary error-correcting output code algorithms. 56 All of these analyses are achieved at a high confidence level given the upper bounds of the actual risk (see Sec. 3.1), the conditions regarding the sample size and the capacity of the selected statistical classifier.

Structural magnetic resonance image acquisition and preprocessing
All 120 participants were scanned using a contemporary 3 T MRI scanner (GE Medical Systems HDx) fitted with an 8-channel receive-only RT head-coil. Simulated T1-weighted inversion recovery images were segmented and normalized to the standard Montreal Neurological Institute (MNI) space using the SPM12 software (Wellcome Trust Centre for Neuroimaging, London, UK) 63 and the CAT12 toolbox. 64 Native space GM, WM and cerebro-spinal fluid (CSF) images were obtained using standard automated segmentation routines. The native space GM and WM images were registered to a studyspecific template using a high-dimensional nonlinear diffeomorphic registration algorithm (DARTEL) 65 and then clustered into 116 standardized areas. 62 A modulation step was included to retain voxelwise information about local tissue volume. The modulated GM and WM maps were smoothed with a 4 mm full-width at half-maximum Gaussian kernel.
For the multicenter male sample, all preprocessing steps were developed in the same way as described earlier for the latter dataset. In the statistical inference part, the general linear model for VBM considered the sites (i.e. scanning machines) as covariates (categorical fixed-effect factors). GM and WM maps X = {x i }, for i = 1, . . . , l, were initially parcellated into r = 1, . . . , 116 standardized regions of a brain anatomical atlas. 62 Thus, our multivariate analysis takes as its input mean gray and white matter volumes from within each atlas region, x i (r), separately. This procedure is appropriate for our purpose of observing tissue-specific  local variations, although partial results may be rearranged by aggregation in a complete volume, as shown in the following sections and Figs. 2 and 3.

Two-sample t-test and PLS extraction
A rank FES based on the standard two-sample t-test with pooled variance in combination with a PLSbased FES 69  versus one classification model (see Fig. 1). Given a training subset comprising two classes j = 1, 2 with balanced samples sizes l 1 = l 2 = l/2, the t statistical vector is defined as where is the pooled standard deviation and s x 1 s x 2 are the unbiased estimators of the variances of the two classes (the assumption of the data in each group following a normal distribution was confirmed by a Kolmogorov-Smirnov test). The effect of this operation on the brain regions is to reduce the computational load prior to feature extraction, and to assure that regional differences in tissue volumes are considered when assessing overall patterns of tissue distribution.
Once the significant regions were identified by ranking the result of the t-test, we extracted the relevant patterns within those regions by a PLS regression between the l × d data matrix X and the l × 1 vector labels Y. Briefly, in the application of PLS to supervised classification, we maximized where the score vectors s = Xω were iteratively extracted and used to deflate the input matrix X by subtracting their rank-one approximations based on s. 69 The deflation process was accomplished by the computation of the vector of loadings p as a coefficient of regressing X on s: The vector of weights ω f , where f indexes the number of extracted features, is a local property in the images, that is, the dimensional components are not mixed in its computation, whilst the score coefficients s(i) = d j=1 X ij ω j (and the matrix of scores), and the deflation term , etc. are global. Therefore, the size of the input data d is crucial to the assessment of the relationship between GM (WM) volume and group membership with heterogeneous variances, where some statistical properties of the involved processes, such as the stationarity or the ergodicity in the correlation, must be assumed for the evaluated ROIs. In the present study, the analysis was carried out on the set regions (see Figs. 2 and 3) selected by feature selection.

Statistical parametric PLS maps
The analysis of functional and structural studies usually entails the construction of spatially extended statistical processes where each voxel of the novel image or map is the result of a statistical test. After that hypothesis testing process, the main problem is to determine the significance of extrema by the use of several statistical field-based approximations, 14 i.e. a three-dimensional (3D) t-statistical field. PLS methods have been demonstrated in the past to be very useful in describing the relation between brain activity and experimental design or behavior measures. 41,69 In this way, PLS analysis of brain activations is able to reveal additional regions of salience that are not identified by typical univariate voxelwise methods such as SPM. Assume that the label vector Y contains the two global conditions in X ({1, −1} indexed as i ∈ C 1 , C 2 ) and x i1 > x i2 , for i 1 ∈ C 1 and i 2 ∈ C 2 , without loss of generalization. The scores s will be ideally located around zero with different signs, thus from Eq. (7) for every loading component, we have for k = 1, . . . , d, and a deviation from zero of this data-weighted score summation would suggest a region associated with a particular group membership (contrast). Remapped into image space, the contents of the singular vector p indicate which pixels are most sensitive to those predefined contrasts and define the so-called PLS map. Comparing the latter expression to Eq. (5), the PLS map can be seen as a multivariate two-sample test weighted by the scores of each subject with unknown distribution, except for the normalization term that depends on the pooled standard deviation. In fact, after some manipulations, it yields wherex j (k) ≡ Ci x i (k)P (x i (k)) and P (x i (k)) = |ŝ(i)| is the frequency of the observation x i (k) that is assumed to be proportional to its score in the computation of the group mean.
The statistical significance of the PLS maps can be assessed is many ways, e.g. by the use of a permutation test. 41 In this work, we proposed the use of a parametric approach, that is very popular in neuroimaging, based on the Gaussian Random Field (GRF) theory. This theory models both the univariate probabilistic features of the resulting SPM and the nonstationary spatial covariance structure of that image. 63 This model can be applied as well to the proposed PLS map as the selected quantity that characterized the PLS computation which has shown to be very similar to the classical two-sample t-test mapping (except for the normalization factor). The methodology can be described as the following. First, given a p-value and the T SPM extracted from the one tailed two-sample t-test with contrast matrix [1, −1] for C 1 and C 2 conditions, we determine the statistically significant positive salience threshold t critic using the inverse cumulative density function of the t-test distribution (the same applies for the negative one). The biased version of the PLSbased two-sample t-test in Eq. (9) makes up a novel P SPM that is linearly projected to a (bias corrected) distributionP with the same parameters of the previous T SPM. a Finally, the positive saliences on this novel map are determined by evaluating the regions, whereP > t critic at the given p-value.

A robust statistical classifier, SVM
The use of SVMs was motivated by the minimization of the VC dimension and has been successfully shown to be a robust solution in classification learning, 58 which minimizes the separation margin between the binary-labeled training data, mapped into a (PLSbased) feature space F, by constructing a hyperplane w whose norm is minimum 58 : subject to a Here, we assume that the t-distribution is approximately normal for a high number of degrees of freedom. Note that if a variable X ∼ N (µ 1 , σ 1 ), then Z = (X − µ 1 )/ σ 1 ∼ N (0, 1). Conversely, given Z, then Y = µ 2 Z + σ 2 ∼ N (µ 2 , σ 2 ). Then, we can connect any pair of distributions by where ξ i are slack variables, C is a constant that allows a trade-off between training error and model complexity (C is usually optimized by several searching methods at the training stage, i.e. Bayesian optimization in a wide range [1e − 3, 1e3]), and the decision rule is defined as F (x, w) in Sec. 3.1. The solution is computed using w = l i=1 a i y i x i , where the multipliers 0 ≤ a i ≤ C were derived from the dual Lagrangian problem in Eq. (10).

Experiments and Results
The spatial representation of the binary classifier output and the corresponding "loading" of the PLS image as the new reference base for the analyzed input patterns is discussed and shown in this section. In this sense, we take a step forward with respect to the majority of exploratory analyses developed so far in the literature 34,44 and propose a specific cross-validation scheme with the purpose of discovering generalizable class features in the GM or WM images. By training the system using binary group differences (gender and/or condition) and under the assumption that they are distributed across statistically significant regions, we aim to generalize these features in the remaining study groups which should share the same magnitude of differences across groups. The complete analysis included a class pattern-specific two-sample test for regionwise FES, an overlap analysis 34 across group-difference PLS maps with a VBM comparison to evaluate the Autism theories, the assessment of the sample Autism label probability with the determination of confidence intervals using the Clopper-Pearson method and finally, permutation tests to check the statistical significance of the classification results obtained by the machine learning-based system.

Experimental setup
As discussed previously in Sec. 3.2, the images were parcellated according to Ref. 62 and FES algorithms were applied to the resulting images obtained from a standardized preprocessing and image registration pipeline. 64

Preprocessing
The application of the FES approach detailed in Sec. 3.2 results in several weight maps for the group comparisons as shown in Fig. 2.
With the aim of increasing the generalization ability of the proposed classification systems, only the first PLS component was considered (a single dimension d = 1), thus the associated upper bound of the actual risk, as shown in Eq. (1), provides a strong connection between the risk and the empirical risk (resubstitution error). It is worth mentioning what is actually shown in Figs. 3 and 4 is the first loading component p extracted from the brain regions, separately. Figure 3 highlights the relevant regions in terms of the PLS regression on the group differences. Group differences G1 and G6 illustrate the regions where the features associated with autism are different in males and females, respectively. In the remainder of the figures, features associated with gender are predominantly found in the PLS maps, e.g. the G2 pattern that refers to the VBM comparison between MC and FC is modulated and found in group differences G3, G4, and G5. The statistical significance of these PLS maps is discussed in Sec. 4.2, where the null distributions are modeled as a two-tailed Gaussian distribution or a t-distribution with large degrees of freedom ν.
In Fig. 4, the PLS global approach is shown on the left and is contrasted against the regionwise comparison for G1 (i.e. MC-MA). Observe that the regionwise comparison provides a noisy p-loading that is used as a reference base in the feature space for the extracted input scores. A one-sample t-test between the score data s obtained from the regionwise approach and the global approach reveals that, at a 95% significance level, 44 out of the 60 subjects present a difference in means in regions including the insula, hippocampus, and cingulum; Fig. 4.

Classification
The overall classification accuracy of each SVM model was estimated using the re-substitution error for the training subset and a l/2-holdout crossvalidation error for the test subset, testing for significance by repeating the validation procedure n = 500 times, after randomly permuting the class labels. In this sense, each of N s = 6 training subsets, under the one versus one classification model, allows us to estimate the probability of the neurophenotype in terms of sex or condition and to evaluate it on the remaining N s test subsets. Therefore, the four classes are considered as a set of participants with different proportions or combinations of these two types of effect (sex and condition).
The selection of other cross-validation procedures, i.e. K-fold, with the aim of "pure" classification leads to poor classification and generalization of results mainly due to the variability of the input patterns, the potentially preponderant role of one effect, and small sample size. Thus, we should consider the former K = 4 classes as tentative labels as they may contain spatially-dependent neurophenotypes with biases towards a particular effect that are nonuniformly distributed in the K classes.

Visualizing the patterns of brain regions representative of normative sex/condition differences
The aim of this section is to analyze the differences of the specific patterns representing sex or condition. Following a similar analysis as in Ref. 34, we undertook a two-sample t-test based on VBM comparisons between groups. Only voxelwise height thresholds with no spatial extent operation were applied under the two logical contrasts, i.e. group 1 > or < group 2 . Three sets of VBM comparisons were performed to test EMB theory predictions 71 ; namely, MC-FC, MC-MA, and FC-FA comparisons. The aim of this procedure is to generate an overlap analysis across group-difference maps 34 and compare them with the PLS maps. The PLS maps can be interpreted as two-tailed statistical tests, unlike the classical one-tailed t-test maps obtained under the SPM framework, 64 and both are almost normally distributed with a high number of degrees of freedom (e.g. ν = 57 for the condition comparison). 70 In order to make them comparable for the computation of group-difference maps, given a set of p values {0.0001, . . . , 0.05}, we linearly transform the PLS maps, P ∼ N (µ p , σ p ) to a new Gaussian distribution with mean and standard deviation of the corresponding comparison group T -maps, which are normally distributed T ∼ N(µ t , σ t ). After this transformation, the PLS maps permit testing of the logical contrasts in a single map by group 1 = group 2 . As an example, in those regions where there is a statistically significant sexual dimorphism MC = FC, we evaluated the significant differences in the neurophenotype of male and female individuals with autism, that is, MA = MC and FA = FC (see Fig. 5). As shown in the following section, the overlap analysis derived from these figures partially support EMB theory predictions by considering both directions of effect at the same time (> and <).

Spatial overlap analysis
Following the discussion in Sec. 4.2, three VBM comparisons, MC-FC, MA-MC, and FA-FC, were evaluated on GM and WM volumes. We carried out the experiments on our dataset (l = 120) and the male sample (l = 168) described in Ref. 37, using our PLS-based FE approach and the set of one-tailed contrasts obtained from the univariate 2 × 2 factorial design analysis provided by the SPM software. 63 For each comparison, a conjunction analysis 34,73 consisting of logical AND masking, were assessed and tested for significance by running Monte carlo simulations (5000 iterations). The distribution of overlaps at the same p-value threshold was sampled, i.e. from 0.0001 to 0.05 in steps of 0.0001. This spatial overlap analysis, which considers synthetic GM (WM) map overlaps, allows us to assess the probability of whether the overlapping voxels of the comparisons occur by chance. In the baseline comparisons, we included an additional spatial overlap analysis using  a multicenter male sample as detailed in Refs. 34 and 37. The purpose of this MA-MC comparison is to conduct the same analysis on a larger number of samples (l = 168) to provide higher power to detect the group differences, that were not found by the use of the current database (l = 60, males only). The spatial overlap analysis between males and females with autism is small in both approaches (T and PLS maps) for GM and WM (e.g. up to 15% and 10% for GM and up to 23% and 10% for WM, respectively) for p ≤ 0.05 as shown in Fig. 6 in magenta color lines. This finding clearly identifies the differences between the neuroanatomy of autism across sexes that was previously demonstrated by the high precision of the four-class learning machine detailed in the previous section, and in Fig. 5. The rest of the curves show a significant difference between approaches, that is, the one-tailed T maps versus the two-tailed-PLS maps. In the latter approach, we find significant evidence for overlap between structures sensitive to diagnosis and sexual dimorphism in both sexes (red and blue solid lines). This evidence is only found for the one-tailed T -based approach in females (T -MC-FC&FC-FA, red dotted line in Fig. 5).

Classification results
The output score pattern of the set of binary classifiers is shown in Fig. 7 for the training and test sets. From this set of figures what is interesting to note is the ability for generalization of the proposed system, under some constrained conditions, over the tests G2, G3, G4, and G5 (same rows on the figure). Another relevant feature found in the output score maps is the presence of vertical bars representing misclassification in training and test sets. This could be identification of the different neurophenotypes of specific individuals, e.g. participant 42 in the FC subset, as they are always located in the wrong hyperplane subspace at both training and test stages (see Table 2). From this table, it is clear that MA is the class with lower performance in classification accuracy in both stages. Surprisingly, the features associated with autism in the training of the SVM to classify G1 (males) cannot be generalized in the test set (females) although these features should be present in one class, and absent in the other (a real binary classification problem). The almost random pattern found in the test set could indicate the different nature of the features of autism in males and females. This situation is repeated in G6 where the females were used in the training stage and males made up the test set, confirming the latter hypothesis.
The quantitative analysis of the output score pattern is found in Fig. 8 where we represent the distribution of accuracies, including the notch analysis to display the variability of the median between samples, for all the brain regions and the accompanying overlay histograms. From this figure, we can observe how the learning systems generalize well with On the left column, we plot the SVM scores obtained from the training set while on the right column, we show the consequent prediction on the test set. Each row contains one of the 6 binary problems to solve the multiclass framework, depicting all the 116 regionwise features (Y -axis) versus each subject (X-axis). The hyperplane derived at each row using the training subsets on the left is employed in the prediction of the test subsets on the right. Note that for the ideal linearly-separable problem, we would obtain two black (from 1 to 30) and white (from 31 to 60) columns in all cases. In this way, the fitted learner should generalize fairly well on the remaining groups, obtaining two additional black and white columns on the test set, in case the features on the right were characterizing the groups on the left. This is clearly not the case in the 1 and 6 rows, where we are trying to predict male/female autistic features from the opposite sex. Table 2. GM (WM) overlap and region-averaged accuracy of the output score maps (%) for the selected subsets acting as training and test sets (Groups 1 and 6 are not considered). respect to the controlled upper bound (around 10%), excluding the G1 and G6 which behave as random classifiers on the test set. This effect indicates the varying nature in the differences between MC and MA (male autism features) which cannot be extrapolated to the FC-FA comparison and vice versa. In the former groups, the gender feature is the most prevalent, providing a high overlap in the output of the systems for several comparisons, see Table 2. Participants with a high or low probability of being classified as having autism may be determined by using a predictive mapping approach. 37 Given the output score map of a participant with sex S and condition C at the test&training stages, we first calculated the predictive class probability with respect to the male typical neurophenotype. For this purpose, the ratio was obtained of the number of regions associated with a classification of "male" to the overall number of brain regions. Then, we collected all these probabilities, one for each individual, into a discrete set of bins from 0 to 1 in steps of 0.25, and computed the sample Autism probability (that is, the probability of the male neurophenotype, denoted in Fig. 9 as P (ph = M | S, C = ASD)) as the ratio of the number of individuals classified with autism to the total number of individuals within the bin. As shown in Fig. 9 for G2-5, where confidence intervals are determined using the Clopper-Pearson method in the binomial test, 72 the probability of an individual being classified as autistic in males (blue) and females (red) evolves differently across the maleneurophenotypic axis. In addition, within gender, this probability increases or decreases depending on whether the group is processed at the test or training stage. This difference in behavior is due to the presence (or absence) of the autism feature that adds to the gender-based difference between groups, during the training stage.
In addition, we studied the binary classification problem using the group comparisons (MC-MA and   9. (Color online) Probability for autism as a function of normative sex-related phenotypic variability in brain morphology: Probability estimates for autism across four discrete bins along the axis of predictive class probabilities for sex are plotted for G2-5 and training (T ) and test (t) stages. For male (blue) and female (red) models, the probability for autism behaves in a different manner with increasingly male-typical class predictions, that is, it increases (decreases) in females (males) at the test stage and contrarily at the training stage.
FC-FA) using a classical cross-validation scheme used in small sample sizes, i.e. leave-one-out cross-validation and l = 60. In this case, we employed the same feature selection and extraction methods with a similar parameter tuning prior to linear SVMbased classification. As an example, for the MC-MA comparison, the few regions extracted on GM by this more restrictive cross-validation scheme are coincident with the regions highlighted in Fig. 5, i.e. right middle frontal gyrus, right pallidum, etc. yielding classification rates from 70% up to 80% in the study groups from GM and WM tissues.
Finally, using a one versus one decoding scheme based on a MAP strategy, the multiclass problem may be undertaken on a regionwise basis and a output score map of the linear SVM may be derived with four class levels. Under the upper bound (∼0.1) on the actual risk for d = 1 and using linear decision functions, the six binary classifiers can be combined obtaining the results shown in Fig. 10. Note in the upper subfigure that individuals with autism are ordered from male to female, and a deep (light) color represents control (autism) participant sex.
A regionwise analysis using this decoding strategy with GM reveals regions with a high accuracy (up to 83.33%), namely, left and right parietal, temporal, occipital lobes, calcarine sulcus, etc. and a mean across region of 62.61 ± 8.23 with confusion depicted in the bottom of the same figure. With WM, an improvement in performance is observed (up to 93.33%) mainly due to the improved classification of the MC group (averaged accuracy across regions of 67.50 ± 9.99), with relevant regions located at middle frontal gyrus, inferior parietal lobe, cerebellar cortical crus II, middle occipital and temporal lobes, etc. Additionally, in this multiclass problem, we employed the output score maps depicted in Fig. 7 to decode the MAP output class for each region and then a majority voting scheme across the regions was applied to determine the final output class for each individual. Overall, the four groups are represented by different image patterns that can be classified with an accuracy up to 86.7% (91.70% on WM) using the information contained in the entirety of the image, as shown in the confusion matrix (Fig. 11).   11. Confusion matrix of the majority voting multiclass classification scheme corresponding to the output score map in Fig. 10. Observe how the overall classification result outperforms the regionwise approach, due to the presence of outliers in the four-class problem (patients containing a high ratio of misclassified regions).

Permutation tests
Although the experimental conditions were rigorously established in the previous sections, that is, the results obtained by linear classifiers in low dimensional spaces are theoretically meaningful at a significance level η (see Sec. 3.1), a permutation test to validate these results was applied to the whole dataset. For the complete set of binary classifiers, the cross-validation procedure was repeated n = 500 times after randomly permuting the class labels to derive p-value maps. As expected, the classification obtained with this analysis for the set of classifiers is almost random and, at the 95% level of confidence, almost all the regions were considered as significant in this analysis as shown in Table 3. The result of this analysis allows the construction of relevance maps for each classification model such as those displayed in Fig. 12, where, as an example, we illustrate the prediction of sex based on VBM features using the sex classifier for G2.

Discussion
From the perspective of this machine learning-based approach, the neuroanatomy of autism in males and females is comprised of separate distinguishing patterns. A distinguishing pattern similarly identifies male and female controls. These group differences are observed even with a small sample set, although some theoretical bounds and suitable methodologies must be imposed to highlight such differences. The interactions between sex and diagnosis observed in this paper extend those previously found in prior work 34,37 that also showed sex × diagnosis group differences on GM and WM tissues. In particular, greater accuracy rates were obtained with the classification apparatus described herein.
In addition, based on this multivariate methodology, we also found that there was a significant overlap between the neuroanatomical features of autism in males and females, unlike the previous approaches, aligning with the predictions of the EMB theory in terms of the directionality of effect, and with the "gender incoherence" hypothesis 74 ). These effects were observed by the use of a regionwise PLS-based activation map transformed into the same parameters as those used in the one-tailed t maps derived from the group differences. In fact, the approach proposed in this paper considers not only the EMB theory 71 directionality, but also its inverse, that is, predicting that males with autism are feminized in terms of neuroanatomy. 74 This effect was clearly observed on GM and WM VBM comparisons and agrees with the results obtained by testing how the effect of autism overlaps with the effect of "feminization" using one-tailed contrasts. 34 Thus, on balance, these results may be a demonstration of gender incoherence of males with autism rather than "masculinization".
The heterogeneity of autism was detected by the assessment of the output score maps derived from the machine learning theory. The use of crossvalidation groups further reduced the limited sample size and degraded the performance of the CAD system. Although some considerations and preventive measures regarding the curse of dimensionality were carried out to be conservative, the observed effects and group differences could be partially samplespecific and the current analysis requires replication on larger datasets. These difficulties are detected in terms of the presence of outliers, that is, participants that are misclassified by resubstitution in almost all the analyzed regions. Those participants that are located in the "wrong" feature subspace present an heterogeneous pattern that affects the performance of the SVM when they are considered in any validation/training fold. The output score maps revealed this misclassified sex/condition-specific neurophenotype on GM and WM tissues across regions and participants.

Conclusions
In summary, the research developed here is a first attempt to describe both sex-typical multivariate neurobiological phenotypes by the use of machine learning and an MRI sample of equal-sized highfunctioning male and female adults with and without autism. Although the main limitation of the current equal-sized gender datasets is the samples sizes, we avoided this difficulty by imposing some restrictions on the learning parameters of the system by the use of a novel upper bound in the resubstitution estimate, obtaining a good trade-off between empirical risk and the variance of the actual risk estimation (upper bound). In addition, the system used a set of features extracted from PLS activation maps that are demonstrated to be statistically significant and in accordance with two theories of the neurobiology of autism. The complete classification system in a one versus one configuration achieves, under these theoretical conditions, high classification results (up to 86%) in a four-label classification problem, thus outperforming the classification rate of the previous published works on this field. 34

A.1. On the upper bound of the actual risk
In terms of the one-sided uniform convergence of the means, we are interested in assessing for a given significance level η: where P (α i ) = P (F (x, α i )). Of course, with a sample size l → ∞, the law of large numbers expressed in terms of the third Hoeffding inequality 59 for any functional α i establishes that where Γ i = sup i (γ i ) and γ i = P (α i ) − P emp (α i ) and N is the finite number of functional dependencies. Since the inequality is valid for all decision functions F (x, α i ), the actual risk obtained by α emp is bounded with probability 1 − η by that is equivalent to Eq. (3). In general, these bounds could be further improved by considering the relative deviations 57,60 under scenarios, where P (α i ) tends to the extremes 0, 1, but that is far from our problem.

Definition.
A set of l vectors is in general position in d-space if every subset of d or fewer vectors is linearly independent.
Consider that the training sample {x i , y i } is distributed randomly in general position, then the number of linearly separable dichotomies of the set of input patterns is equal to N , that is, the number of decision functions F (x, α) when the training sample is not a root of the set (F (x i , α) = 0). As shown in Ref. 55

Appendix B
The MRC AIMS Consortium is a UK collaboration between the Institute of Psychiatry, Psychology and Neuroscience (IoPPN) at Kings College, London, the Autism Research Centre, University of Cambridge, and the Autism Research Group, University of Oxford. The Consortium members are