Introducing instance label correlation in multiple instance learning. Application to cancer detection on histopathological images

In recent years, the weakly supervised paradigm of multiple instance learning (MIL) has become very popular in many different areas. A paradigmatic example is computational pathology, where the lack of patch-level labels for whole-slide images prevents the application of supervised models. Probabilistic MIL methods based on Gaussian Processes (GPs) have obtained promising results due to their excellent uncertainty estimation capabilities. However, these are general-purpose MIL methods that do not take into account one important fact: in (histopathological) images, the labels of neighboring patches are expected to be correlated. In this work, we extend a state-of-the-art GP-based MIL method, called VGPMIL-PR, to exploit such correlation. To do so, we develop a novel coupling term inspired by the statistical physics Ising model. We use variational inference to estimate all the model parameters. Interestingly, the VGPMIL-PR formulation is recovered when the weight that regulates the strength of the Ising term vanishes. The performance of the proposed method is assessed in two real-world problems of prostate cancer detection. We show that our model achieves better results than other state-of-the-art probabilistic MIL methods. We also provide different visualizations and analyses to gain further insights into the behavior of the proposed approach.


Introduction
Multiple instance learning (MIL) has attracted great attention in fields where there is a challenging lack of labelled data. Although it has been applied in many different areas [1], we will focus on the case of computational pathology. In recent years, thanks to the increasing digitization of whole-slide images (WSIs), the field of computational pathology has been developing computer-aided diagnosis systems based on machine learning for cancer detection [2,3]. The goal of computational pathology is to provide a fast and reliable diagnosis for the most prototypical cases, letting pathologists focus on the most challenging ones.
Ultimately, this will enable a much wider access to early cancer diagnosis [4].
In order to make accurate predictions, machine learning classification methods need to be trained using a labelled set of instances [5]. In the case of computational pathology, these instances are typically patches from the WSIs (and not the complete images themselves) [6,7,8]. The reason for this is twofold: i) it is useful to have predictions at patch level in order to know where exactly in the image the cancer is located, and ii) WSIs are extremely large and cannot be directly fed to a classifier. As a consequence, expert pathologists must label every single patch in the training data as cancerous or not (we will consider the binary problem cancer/no-cancer throughout this work). Given the large number of patches and the limited availability of pathologists, this becomes a daunting task in real practice [9].
To address this problem, different weakly supervised learning paradigms have been proposed in recent years. Here we focus on MIL, which has become very popular in the medical domain [10,8]. The idea in MIL is that instances are grouped in bags, and only bag labels are needed for training. In the case of WSIs, all the patches coming from the same image are considered a bag. Therefore, the labelling workload on pathologists decreases enormously: from labelling every single patch, to only labelling the complete WSI as cancerous or not.
Different machine learning algorithms have been developed to learn under the MIL setting. Notice that dealing with uncertainty is essential in MIL models, since instance-level labels are unknown. To deal with uncertainties, different probabilistic methods have been developed, such as Dirichlet Process Mixture Models [11], Markov chains [12], Monte Carlo methods [13,14] and Gaussian Processes (GPs) [15,16]. In particular, GPs have attracted plenty of attention in recent years, due to their expressive power and their capacity to handle uncertainty in a principled manner. Moreover, we are interested in this type of probabilistic models, since they will allow for introducing correlations in a theoretically sound way.
Among GP-based MIL methods, we will focus on the two most successful ones: VGPMIL and VGPMIL-PR. VGPMIL [17] was proposed in 2017 to overcome the limitations of two earlier formulations [15,16] (namely, the use of the inefficient Laplace approximation and the impossibility of obtaining instance-level predictions, respectively). In short, VGPMIL relies on variational inference and allows for closed-form updates of its parameters. However, the use of the logistic function implies that VGPMIL needs to resort to a theoretical approximation during inference (namely, the Jaakkola bound [17, Eq. (10)]). As shown in [18], such approximation hurts predictive performance in practice. As an alternative, the authors of [18] propose the utilization of the probit function, which removes the need for the aforementioned approximation. This method, which will be referred to as VGPMIL-PR, is considered the current state of the art among probabilistic MIL approaches.
Methods such as VGPMIL and VGPMIL-PR are general-purpose MIL models that can be used in any MIL problem (that is, whenever the label is known only at bag level; see different use-cases in [17,19]). However, the underlying MIL assumption that the labels of the instances in a bag are independent of each other is unrealistic in many real problems. For example, in the particular case of WSI images (and in many image-related MIL problems), the labels of neighboring patches are expected to be correlated [20]. We hypothesize that the predictive performance of MIL methods can be enhanced by incorporating this type of prior knowledge into the model.
In this work, we introduce a novel GP-based MIL algorithm that takes into account the correlation between the labels of neighboring patches, and we apply it to the real-world problem of prostate cancer detection on histopathological images. We model the correlation through a coupling term inspired by the Ising model [5, Section 19.4.1], a statistical physics method that has found several applications in computer vision [20,21]. Our GP-MIL modeling builds on VGPMIL-PR, so our method will be referred to as VGPMIL-PR-I (Ising). In VGPMIL-PR-I, a hyperparameter λ regulates the influence of the Ising-inspired terms. Variational inference is used to estimate the model parameters, and the update formulas of VGPMIL-PR are recovered when λ → 0 (that is, when the influence of the coupling term vanishes). In the experimental section, we show that VGPMIL-PR-I outperforms the state-of-the-art GP-based MIL approaches VGPMIL and VGPMIL-PR when predicting at both instance and bag levels, while keeping an analogous computational cost. Moreover, to gain insights into the influence of the new coupling term, we analyze the role of λ, and provide several visualizations of the predictions.
The rest of the paper is organized as follows. Section 2 presents the probabilistic model and inference for the novel VGPMIL-PR-I. Closely related methods such as VGPMIL and VGPMIL-PR are also discussed in this section. Section 3 focuses on the empirical evaluation of the model, including the data description, the experimental framework, and the discussion of results. Section 4 provides the main conclusions and some future outlook.

Probabilistic model and inference
In this section we present the theoretical description of VGPMIL-PR-I. Specifically, Section 2.1 explains the problem formulation and the main notation. Section 2.2 explains the closely related methods VGPMIL and VGPMIL-PR, which are at the base of our formulation. Section 2.3 introduces the novel coupling term that accounts for patch label correlation, which is used to define VGPMIL-PR-I. Section 2.4 shows how to perform variational inference to estimate the parameters in VGPMIL-PR-I. Section 2.5 explains the procedure to make predictions at both instance and bag levels.

Notation and problem formulation
Our notation follows the state-of-the-art work [22]. The training data is given by a set of bags X = {X_b}_{b∈B} and their corresponding labels y = {y_b}_{b∈B}.
We deal with a binary problem, i.e. y_b ∈ {0, 1}, where y_b = 1 denotes a positive (cancerous) bag. Each bag X_b = {x_i}_{i∈b} is given by the set of instances whose indices belong to b ⊂ {1, . . ., N} (N is the total number of instances). Notice that different bags may have different numbers of instances.
Each instance x_i is given by a vector in R^D. In the MIL setting, one assumes that each instance has its own (unknown) label h_i ∈ {0, 1}. We write h_b for the labels of all the instances belonging to bag b. The MIL labelling assumption dictates that a bag is considered positive (class 1) if at least one of its instances is positive. Mathematically, this is

y_b = 1[ Σ_{i∈b} h_i > 0 ] = max_{i∈b} h_i,    (1)

where 1[•] is the indicator function (i.e. it equals one when its argument is true and zero otherwise). Finally, we will collectively denote h = {h_b}_{b∈B}.
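This labelling assumption is easy to state in code. The following minimal sketch (the helper name `bag_label` is ours, not from any library) computes a bag label from its instance labels:

```python
import numpy as np

def bag_label(instance_labels):
    # MIL assumption: a bag is positive iff at least one instance is positive,
    # i.e. y_b = max_{i in b} h_i.
    h = np.asarray(instance_labels)
    return int(h.max())

print(bag_label([0, 0, 1, 0]))  # -> 1 (one positive patch makes the bag positive)
print(bag_label([0, 0, 0, 0]))  # -> 0 (all-negative instances give a negative bag)
```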
In the case of WSIs, each X_b is an image, which is composed of its patches {x_i}_{i∈b}. Each patch has an unknown label h_i (0 for non-cancerous and 1 for cancerous), and we only have access to the bag label y_b (whether the image is cancerous or not, i.e. whether it contains at least one cancerous patch).
The goal in MIL is to train a model based only on the bag labels {y_b}_{b∈B}, and such a model must be able to predict at both instance and bag levels. That is, given a previously unseen instance x_⋆ ∈ R^D, we are interested in the probability p(h_⋆ = 1). Likewise, given a previously unseen complete bag X_⋆, we are interested in p(y_⋆ = 1).

Background: VGPMIL and VGPMIL-PR
As mentioned in the introduction, our model is inspired by two closely related methods: VGPMIL and VGPMIL-PR. To understand our contribution, it is essential to fully understand their formulations, which we explain next.

VGPMIL formulation
VGPMIL was introduced in [17]. The idea is to consider a sparse GP classification model [23] to describe the relationship between instance features X and their (unknown) labels h. Then, an additional bag likelihood must be considered to model the (observed) bag labels y given the instance labels h.
Both components are described next.
The sparse GP classification model. Instances x_i are associated with latent variables f_i ∈ R which are modelled through a GP, f ∼ GP(0, κ). We write κ for the GP kernel, which encodes the properties of the considered functions. Then, the instance labels h_i are defined from f_i through a classification likelihood ν:

p(h_i = 1 | f_i) = ν(f_i).    (2)

Specifically, VGPMIL uses the logistic function ν(x) = (1 + e^{−x})^{−1}. Intuitively, a large (resp. low) value of f_i implies that the class is likely to be one (resp. zero).
Moreover, since standard GPs scale poorly with the number of training instances N, VGPMIL makes use of sparse GPs [23], which summarize the training data through M ≪ N inducing points. These inducing points u = {u_1, . . ., u_M} represent the value of the GP at some inducing point locations Z = {z_1, . . ., z_M} (just like f = {f_1, . . ., f_N} are the GP values at X = {x_1, . . ., x_N}). Therefore, the distributions of u and f|u are:

p(u) = N(u | 0, K_ZZ),    (3)
p(f | u) = N(f | K_XZ K_ZZ^{−1} u, K).    (4)

The expression of K is given by the particular sparse GP approach used in VGPMIL, which is FITC [23], so we have K = diag(K_XX − K_XZ K_ZZ^{−1} K_ZX). As is standard in the GP literature, we are writing K_XX for the N × N covariance matrix whose entries are given by κ evaluated at pairs of training instances in X. The definitions for K_XZ and K_ZZ are analogous.
The bag likelihood. VGPMIL introduces the following parameterization to model the bag labels from the instance labels:

p(y_b | h_b) = H^{y_b G_b + (1−y_b)(1−G_b)} / (H + 1),    (5)

where G_b = max_{i∈b} h_i and H is a large and fixed value (in their examples, they use H = 100). Eq. (5) approximates the MIL assumption introduced in eq. (1): if some instance label h_i is one, then the bag label y_b is one with very high probability (namely, with probability H/(H+1)). Otherwise (that is, if all instance labels h_i are zero), the bag label is one with very low probability (namely, 1/(H+1)). In summary, VGPMIL is given by eqs. (3), (4), (2) and (5). For additional details, the interested reader is referred to the original work [17].
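This parameterization can be sketched numerically. The snippet below is a hedged illustration (the function name is ours; the text only states the probabilities H/(H+1) and 1/(H+1)):

```python
import numpy as np

def vgpmil_bag_likelihood(y_b, h_b, H=100):
    # With G_b = max_i h_i: returns H/(H+1) when the bag label agrees with
    # the MIL rule, and 1/(H+1) otherwise.
    G = int(np.max(h_b))
    return H ** (y_b * G + (1 - y_b) * (1 - G)) / (H + 1)

print(vgpmil_bag_likelihood(1, [0, 1, 0]))  # H/(H+1), close to 1
print(vgpmil_bag_likelihood(1, [0, 0, 0]))  # 1/(H+1), close to 0
```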

VGPMIL-PR formulation
VGPMIL-PR was recently proposed in [18] as an improvement over VGPMIL.
Namely, the logistic function used by VGPMIL in eq. (2) is not conjugate with the Gaussian distribution coming from the GP, recall eqs. (3)-(4). This means that, in order to achieve mathematical tractability, VGPMIL needs to resort to the Jaakkola bound [17, Eq. (10)]. However, the use of this bound introduces an approximation in the training objective. As shown in [18], such approximation damages the predictive performance in practice. Consequently, the authors of [18] introduce an alternative formulation based on the probit function, VGPMIL-PR.
They show that, via a variable augmentation approach, VGPMIL-PR allows for directly optimizing the training objective (without approximations).
More specifically, VGPMIL-PR uses the probit function ν(x) = ∫_{−∞}^{x} N(t|0, 1) dt in eq. (2). Also, the bag likelihood is given by eq. (1) (instead of eq. (5)). Then, to circumvent the need for approximations, VGPMIL-PR leverages a variable augmentation approach [24]. Namely, for each instance we introduce a new variable m_i ∈ R between f_i and h_i, which is defined as m_i ∼ N(f_i, 1). Since we are using a probit likelihood, we have that h_i = 1[m_i > 0]. For each bag, this yields

p(m_b | f_b) = N(m_b | f_b, I),    (6)

where I is the identity matrix (of size |b|), and

p(y_b | m_b) = 1[ y_b = 1[ max_{i∈b} m_i > 0 ] ].    (7)

Importantly, these augmented variables m will prove extremely helpful to introduce the Ising correlation in the next section. In summary, VGPMIL-PR is given by eqs. (3), (4), (6) and (7). Figure 1(a) shows the probabilistic graphical model for VGPMIL-PR.
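The equivalence underlying the augmentation can be checked numerically: the probit probability Φ(f_i) equals the probability that m_i ∼ N(f_i, 1) is positive. The value of f_i below is purely illustrative:

```python
import numpy as np
from scipy.stats import norm

f_i = 0.7  # illustrative latent GP value for one instance
# Probit likelihood: p(h_i = 1 | f_i) = Phi(f_i)
p_direct = norm.cdf(f_i)
# Augmented view: m_i ~ N(f_i, 1) and h_i = 1[m_i > 0]
rng = np.random.default_rng(0)
m = rng.normal(f_i, 1.0, size=1_000_000)
p_augmented = (m > 0).mean()
print(round(p_direct, 3), round(p_augmented, 3))  # both close to 0.758
```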

Correlating patch labels: VGPMIL-PR-I
As mentioned in the introduction, VGPMIL and VGPMIL-PR are general MIL approaches that can be used in any MIL problem. Indeed, there are plenty of applications where MIL methods can be used. For instance, think of a recommendation system where a reviewer has not evaluated every single item in the database, but has reviewed "groups" of them (e.g., he/she likes science-fiction movies, although he/she has not rated individual movies). Consider also a task of anomaly detection in which we do not have labels for individual transactions, but we only know whether there was some anomalous behavior in a certain period of time (which contains many different transactions).
Here we focus on the particular use-case of images, where bags are images and their instances are their patches. In this case, there exists very valuable information coming from the structure of the image itself, which can be exploited in the model. For example, it is natural to think that neighboring patches in the same image are likely to have similar labels. The main goal of this work is to introduce such correlation into the VGPMIL-PR formulation. To do so, we take inspiration from the Ising model.
The novel coupling term. The Ising model arose from statistical physics to describe the behavior of magnets. In some magnets, called ferromagnets, neighboring spins tend to line up in the same direction, whereas in other kinds of magnets, called anti-ferromagnets, the spins are repelled from their neighbors [5]. This type of interaction based on the Ising model has been used previously in machine learning and computer vision to describe relationships between the pixels of an image, see e.g. [20,21,5]. However, to the best of our knowledge, it has never been used in the context of MIL.
Our first idea was to consider an Ising model over the patch labels of each image, {h_i}_{i∈b}. However, inference proved very challenging in this case, due to the non-conjugacy of the Ising model and the GP-based MIL formulation.
As an alternative, we considered a continuous counterpart of the Ising model over the variables m_b = {m_i}_{i∈b} introduced in VGPMIL-PR, which are directly related to the patch labels (recall from Section 2.2.2 that h_i = 1[m_i > 0]). Such a continuous version of the Ising model corresponds to the well-known Conditional Autoregression (CAR) [25]. Importantly, as we will see in the rest of this section, this alternative formulation yields a Gaussian distribution on m_b, which can be treated analytically together with the GP-based MIL model.
Specifically, we consider the following coupling term for each image, which is defined over the augmented variables m_b (recall Section 2.2.2):

C(m_b) = exp( −(λ/2) Σ_{i∼j} (m_i − m_j)² ) = exp( −(λ/2) m_b^⊺ C_b m_b ),    (8)

where the sum runs over pairs of neighboring patches i ∼ j in the image. Notice that C(m_b) is always in the range [0, 1], and it becomes close to zero when the value of m is very different for neighboring patches. For the second equality in eq. (8), notice that the sum only produces quadratic terms in m, so it can be written as m_b^⊺ C_b m_b for some positive semidefinite matrix C_b. Since m determines the label of each patch (recall from Section 2.2.2 that h_i = 1[m_i > 0]), the term C(m_b) can be used to favor "smoothness" in the labels associated to the different patches. Also, the hyperparameter λ, which can be set to any non-negative value, regulates the strength of the coupling term: the larger λ, the more importance is given to differences in m. For example, when λ = 0, C(m_b) becomes constant and does not account for correlation between patch labels.

An example of C(m b ).
To illustrate the proposed coupling term, consider an image with five patches P_1, . . ., P_5 distributed as in Figure 2. Writing down the quadratic terms (m_i − m_j)² for each pair of neighboring patches and collecting them yields the matrix C_b in eq. (10). In general, it is easy to compute the matrix C_b for any given image. Notice that it is always a positive semidefinite matrix, and thus it is associated to a (singular) normal distribution.
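Since Figure 2 is not reproduced here, the following sketch builds C_b for a hypothetical five-patch layout of our own choosing, and checks both the quadratic-form identity and positive semidefiniteness:

```python
import numpy as np

def coupling_matrix(edges, n):
    # Build C_b such that the sum over neighbor pairs (m_i - m_j)^2 equals
    # m^T C_b m. This is the graph Laplacian of the patch-adjacency graph.
    C = np.zeros((n, n))
    for i, j in edges:
        C[i, i] += 1; C[j, j] += 1
        C[i, j] -= 1; C[j, i] -= 1
    return C

# Hypothetical layout: P1-P2-P3 in a row, P4 below P1, P5 below P2.
edges = [(0, 1), (1, 2), (0, 3), (1, 4), (3, 4)]
C = coupling_matrix(edges, 5)
m = np.array([1.0, 0.5, -0.2, 0.9, 0.4])
lhs = sum((m[i] - m[j]) ** 2 for i, j in edges)
print(np.isclose(lhs, m @ C @ m))             # quadratic-form identity holds
print(np.all(np.linalg.eigvalsh(C) > -1e-9))  # C_b is positive semidefinite
```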
The VGPMIL-PR-I formulation. To introduce the new coupling term C(m_b) in the MIL formulation, we modify eq. (6) and define:

p(m_b | f_b) ∝ N(m_b | f_b, I) · C(m_b).    (11)

Since C(m_b) only contributes quadratic terms in m_b, the new distribution can be written as a Gaussian:

p(m_b | f_b) = N(m_b | Σ_b f_b, Σ_b),    (12)

with Σ_b = (λC_b + I)^{−1}. Notice that this new formulation provides a generalization of VGPMIL-PR. Namely, when λ → 0, we have that Σ_b → I and we recover eq. (6).
Notice also that, instead of FITC, in eq. (4) we leverage the more recent sparse GP approach introduced in [26]. Basically, this means that the conditional covariance K in eq. (4) is given by K_XX − K_XZ K_ZZ^{−1} K_ZX (instead of its FITC diagonal approximation). Our method will be referred to as VGPMIL-PR-I (I denotes Ising). The probabilistic graphical model is depicted in Figure 1(b).

Variational inference
In order to make inference in the proposed model, we need to compute the posterior distribution p(u, f, m|y). However, this is not analytically tractable due to the definition of the bag likelihood in eq. (7), which depends on the sign of the m_i's. Following [17] and [18], we leverage standard mean-field variational inference (VI) theory [27, Section 10.1.1] to calculate an approximate posterior distribution that factorizes as q(u, f, m) = q(u) p(f|u) q(m). Applying the well-known mean-field VI update equation [27, Eq. (10.9)], we have that q(u) and q(m) can be iteratively computed in closed form, see eqs. (14)-(18). Here we are writing Σ for the N × N block-diagonal matrix that contains all the Σ_b's, b ∈ B. Also, we are writing K_bZ for the |b| × M matrix of covariances between X_b and Z. Very importantly, notice that these update rules generalize those derived in [18] for VGPMIL-PR. Namely, when λ → 0, we have that Σ_b, Σ → I, and then eqs. (14)-(18) match eqs. (15)-(19) in [18].
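The block-diagonal matrix Σ and the λ → 0 limit can be sketched as follows (toy coupling matrices with illustrative values; Σ_b = (λC_b + I)^{-1} as defined in the text):

```python
import numpy as np
from scipy.linalg import block_diag

def bag_covariance(C_b, lam):
    # Sigma_b = (lam * C_b + I)^(-1); lam -> 0 recovers the identity,
    # i.e. the VGPMIL-PR case.
    n = C_b.shape[0]
    return np.linalg.inv(lam * C_b + np.eye(n))

# Two toy bags with chain-adjacency coupling matrices.
C1 = np.array([[1., -1.], [-1., 1.]])
C2 = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
Sigma = block_diag(bag_covariance(C1, 0.5), bag_covariance(C2, 0.5))
print(Sigma.shape)  # (5, 5): block-diagonal over the two bags
print(np.allclose(bag_covariance(C1, 0.0), np.eye(2)))  # True: VGPMIL-PR limit
```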
All the computations involved in eqs. (14)-(18) are straightforward, except for the expectation E_{q(m_b)}(m_b), since it is well-known that the expectation of a truncated multivariate Gaussian cannot be obtained in closed form [28]. Notice that this is not an issue for VGPMIL-PR [18], where the absence of Ising terms implies dealing with univariate Gaussians, whose expectations can be analytically computed.
To overcome the problem, we first tried to leverage the numerical methods proposed in [29] to approximate the expectation for the multivariate truncated case. However, these methods proved computationally too expensive to be integrated within our iterative calculation of q(u) and q(m). As a more efficient alternative, we approximate q(m_b) by a factorized (product of univariate) truncated Gaussian when computing the expectations in eqs. (16)-(18).
Specifically, the expression for each E_{q(m_b)}(m_b) is as follows. For bags with y_b = 0, eq. (7) implies that every m_i, i ∈ b, is negative, so each q(m_i) is a univariate Gaussian truncated to (−∞, 0). The expectation of such a distribution is well-known and can be obtained in closed form [30]:

E_i = µ_i − σ_i · ϕ(−µ_i/σ_i) / Φ(−µ_i/σ_i),    (19)

where ϕ and Φ are, respectively, the density and cumulative distribution functions of a standard Gaussian N(0, 1) (recall that both are efficiently implemented in standard software packages such as Python's Scipy), and µ_i and σ_i are the mean and standard deviation of the untruncated Gaussian. For bags with y_b = 1, we proceed analogously to [19] to obtain the normalization constant Z of the distribution of interest (that is, the factorized Gaussian truncated to the region where at least one m_i is positive). Then, the expectation of each q(m_i), i ∈ b, is given by:

E_{q(m_i)}(m_i) = ( µ_i − (1 − Z) E_i ) / Z,    (20)

where Z = 1 − ∏_{j∈b} Φ(−µ_j/σ_j) and E_i is given by eq. (19). The full training algorithm is summarized in Algorithm 1. It is an iterative process that alternates the updates between q(u) and q(m). Details on the GP kernel and initializations used in this work are provided in Section 3.1. The code for the proposed method will be publicly available upon acceptance of the paper.
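The closed-form expectation for the negative-bag case can be cross-checked against Scipy's truncated normal. The parameters below are illustrative; the truncation to the negative half-line follows the y_b = 0 case described above:

```python
import numpy as np
from scipy.stats import norm, truncnorm

mu, sigma = 0.4, 1.0   # illustrative parameters of one q(m_i)
beta = (0.0 - mu) / sigma
# Closed form for a Gaussian truncated to the negative half-line:
closed_form = mu - sigma * norm.pdf(beta) / norm.cdf(beta)
# Cross-check against scipy's truncated normal (standardized bounds):
scipy_mean = truncnorm.mean(-np.inf, beta, loc=mu, scale=sigma)
print(np.isclose(closed_form, scipy_mean))  # True
print(closed_form < 0)                      # True: mass lies below zero
```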

Making predictions
Suppose we are given a new bag X_⋆ = {x_⋆i}_{i∈b⋆}. As explained at the end of Section 2.1, we are interested in both instance-level and bag-level predictions.
Algorithm 1 (summary): initialize the GP kernel parameters and inducing point locations, as well as the posterior distributions q(u) and q(m) (details on initializations in the text); compute C_b and Σ_b for each bag, recall eq. (8) and the example at eq. (10); then alternate the updates of q(u) and q(m) until convergence.
For this, we first need to compute the predictive distribution over m_⋆. By using the learned posterior q(u) along with p(f|u), we can obtain the joint distribution over f_⋆:

q(f_⋆) = N(f_⋆ | µ_⋆, S_⋆),    (21)

with µ_⋆ and S_⋆ given by the standard sparse GP predictions:

µ_⋆ = K_⋆Z K_ZZ^{−1} µ_u,    S_⋆ = K_⋆⋆ − K_⋆Z K_ZZ^{−1} (K_ZZ − Σ_u) K_ZZ^{−1} K_Z⋆.    (22)

Here, µ_u and Σ_u are the parameters learned during training, recall eqs. (16) and (17). Naturally, the subscript ⋆ in the kernel matrices K indicates that we are using the new bag X_⋆. Then, since the distribution p(m|f) is also Gaussian, recall eq. (12), we can compute the joint distribution over m_⋆ in closed form:

q(m_⋆) = N(m_⋆ | µ_⋆m, S_⋆m),    (23)

with µ_⋆m and S_⋆m given by

µ_⋆m = Σ_⋆ µ_⋆,    S_⋆m = Σ_⋆ S_⋆ Σ_⋆ + Σ_⋆.    (24)

Here, µ_⋆ and S_⋆ are given by eq. (22), and Σ_⋆ is the matrix that accounts for correlation among instances in the test bag X_⋆, which is defined analogously to the training case, recall the matrix Σ_b in eq. (12).
Once we have the joint distribution over m_⋆, the instance-level and bag-level predictions are given as:

p(h_⋆i = 1) = p(m_⋆i > 0) = Φ( µ_⋆m,i / √(S_⋆m,ii) ),    (25)
p(y_⋆ = 1) = 1 − p(m_⋆ < 0) = 1 − ∫_{m_⋆ < 0} N(m_⋆ | µ_⋆m, S_⋆m) dm_⋆.    (26)

Notice that the integral in eq. (26) can be computed efficiently with the cumulative distribution function of a multivariate Gaussian, which is also available in most standard statistical packages, such as Python's Scipy.
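A minimal sketch of these predictions, with illustrative values for the predictive mean and covariance of m_⋆ (not taken from the paper):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Illustrative predictive distribution over m* for a 3-patch test bag.
mu_m = np.array([-0.5, -0.2, 0.3])
S_m = np.array([[1.2, 0.3, 0.1],
                [0.3, 1.1, 0.2],
                [0.1, 0.2, 1.3]])
# Instance level: p(h_i = 1) = p(m_i > 0), a univariate Gaussian tail.
p_instance = norm.cdf(mu_m / np.sqrt(np.diag(S_m)))
# Bag level: p(y = 1) = 1 - P(all m_i < 0), via the multivariate Gaussian CDF.
p_bag = 1.0 - multivariate_normal(mean=mu_m, cov=S_m).cdf(np.zeros(3))
print(p_instance.round(3), round(p_bag, 3))
```

Note that the bag probability is always at least as large as the largest instance probability, which matches the MIL assumption.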
Indeed, if we do not consider correlation among instances in the test bag, i.e. Σ_⋆ = I, then eqs. (25) and (26) match those in [18] (last two equations before Section 3.5). Finally, although here we have detailed how to make predictions for a complete previously unseen bag X_⋆, the same process can be applied to make predictions on previously unseen individual instances x_⋆ (patches).

Experiments
In this section we thoroughly evaluate VGPMIL-PR-I in a real-world problem of prostate cancer detection. The experimental framework, including data, metrics and baselines, is explained in Section 3.1. The results are discussed in Section 3.2. Finally, in Section 3.3 we evaluate our method on a much larger prostate cancer detection dataset: the well-known PANDA challenge.

Experimental framework
Data description. In this paper we focus on the problem of prostate cancer detection. However, notice that the algorithm can be applied to any other type of cancer (and, more generally, to any other type of image). Prostate cancer is the most commonly occurring cancer in men, and the second most commonly occurring cancer overall, according to the latest 2020 statistics on age-standardized incidence rate from the World Health Organisation (WHO) Global Cancer Observatory [31]. We will use the prostate cancer database presented in [32], which is called SICAPv2 and is publicly available. Although this database includes information on the Gleason score, which is used to evaluate the severity of the disease, in this work we will focus on the binary task of presence/absence of cancer.
We use the original partition of the dataset, which contains 95 training and 31 test WSIs. These very large images are split into 512×512 patches, resulting in a total of 15132 patches for training and 5246 for testing.
Following the MIL paradigm, for the training set we only use binary labels benign/malign at the level of images (bags), but we do not have information at the level of patches (instances). In order to evaluate the predictive performance at instance level, we do have labels for the patches in the test set. The amount of cancerous (resp. non-cancerous) images for the train set is 70 (resp. 25). For the test set, it is 25 (resp. 6). For illustration purposes, a couple of cancerous and non-cancerous patches are shown in Figure 3. In order to train our model, each patch is represented through a 128-dimensional feature vector extracted in previous work [7].
Baselines and metrics. Since our model is framed in the field of probabilistic GP-based MIL methods, we compare with the two most popular approaches, VGPMIL [17] and VGPMIL-PR [18], which were reviewed in Section 2.2. For a fair comparison, the parameters used for the baselines are the same as those used for our method (described in the next paragraph). For those parameters that do not have an analogue in our method (e.g. the initialization of q(y) in VGPMIL), we use the default values proposed in the original papers. To evaluate the performance of the compared methods we use four metrics: accuracy, precision, recall and F1-score (which provides a trade-off between precision and recall).
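For concreteness, the four metrics can be computed as below (toy labels, not the paper's results):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy bag-level ground truth and predictions (illustrative only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # 0.75 (6 of 8 correct)
print(precision_score(y_true, y_pred))  # 0.8  (4 TP out of 5 predicted positives)
print(recall_score(y_true, y_pred))     # 0.8  (4 TP out of 5 actual positives)
print(f1_score(y_true, y_pred))         # 0.8  (harmonic mean of the two above)
```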
For all the metrics, we use the standard implementations in the popular Python scikit-learn library [33].
Experimental details. For the underlying GPs, in this work we use the well-known squared exponential kernel [34], i.e. κ(x, y) = γ · exp(−‖x − y‖² / (2ℓ²)). Following [17] and [18], we use standard values for the kernel hyperparameters, i.e. γ = 1 and ℓ equal to the square root of the number of features of x, y (in this work we set ℓ = 11 ≈ √128). The number of inducing points is set to M = 200, and their locations are initialized through K-means clustering as in previous work [17,18] (namely, 100 of them are obtained by clustering the patches that belong to the positive images, and the other 100 by clustering the patches that belong to the negative ones). The number of iterations is set to T = 200, which was enough to achieve convergence in practice. The expectation of the posterior distribution E_{q(m)}(m) is initialized with a standard Gaussian for each instance independently. Notice that the initialization of q(u) is irrelevant, since it gets updated first in Algorithm 1. As for the value of λ, which regulates the strength of the Ising correlation (recall eq. (8)), we will analyze five different values in the experiments, λ ∈ {0.1, 0.5, 1.0, 5.0, 10.0}. This will allow us to empirically illustrate the effect of λ.
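A sketch of the kernel and the class-wise K-means initialization of the inducing point locations (toy Gaussian features standing in for the 128-dimensional patch features):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_kernel(X, Y, gamma=1.0, ell=None):
    # Squared exponential kernel k(x, y) = gamma * exp(-||x-y||^2 / (2 ell^2));
    # following the text, ell defaults to sqrt(D).
    if ell is None:
        ell = np.sqrt(X.shape[1])
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return gamma * np.exp(-d2 / (2 * ell ** 2))

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(300, 128))  # toy patches from positive images
X_neg = rng.normal(0.0, 1.0, size=(300, 128))  # toy patches from negative images
# Class-wise K-means initialization of the M = 200 inducing point locations.
Z = np.vstack([KMeans(100, n_init=3, random_state=0).fit(X_pos).cluster_centers_,
               KMeans(100, n_init=3, random_state=0).fit(X_neg).cluster_centers_])
K_ZZ = rbf_kernel(Z, Z)
print(Z.shape, K_ZZ.shape)  # (200, 128) (200, 200)
```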

Experimental results
In this section we evaluate the performance of the compared methods on the aforementioned prostate cancer problem. We analyze eight different research questions, which are discussed in the following paragraphs.
Predictions at the level of instances (patches). Although they only use bag labels for training, the compared methods can make predictions at the level of instances, recall Section 2.5. This is important to determine more precisely in which region (patch) the cancer is present. Table 1 shows the results when making predictions at patch level. We observe that VGPMIL-PR-I outperforms both baselines in all the metrics for four out of five values of λ (and for λ = 0.1, the baselines are only better in terms of precision). We also appreciate that VGPMIL-PR-I results are robust across different runs, obtaining low values of standard deviation. This stability is important for real-world applications, where one wants to avoid high sensitivity to random initializations. Finally, as argued in [19], we also observe that VGPMIL-PR (slightly) outperforms VGPMIL.
Predictions at the level of bags (images). Table 2 shows the results when making predictions at the level of images. We observe that VGPMIL-PR-I outperforms both baselines in all the metrics for several values of λ. Moreover, VGPMIL-PR obtains (slightly) better results than VGPMIL, as expected. Also, the predictions at the level of images are very stable across runs (notice the zero standard deviation).
Analyzing the confusion matrices at bag level.
Here we analyze in more detail the results presented in the previous paragraph (i.e. at the level of images).
Table 3 shows the confusion matrices for the compared methods. Notice that both baselines classify all the 25 positive (cancerous) images correctly. However, the difficulties arise at the non-cancerous images. This happens because of the nature of the MIL problem: as soon as a few patches obtain a non-negligible probability of cancer, the image will likely be predicted as cancerous (recall that the MIL formulation establishes that a bag has positive class as soon as one instance inside the bag has positive class).
Interestingly, the Ising model can help to avoid isolated false positive predictions at the patch level (which lead to false positive bag predictions) by using the patch-level correlation. This is reflected in Table 3, where we observe that increasingly more negative images are predicted correctly as λ gets higher. In contrast, notice that strong correlations damage the performance in the positive class, as they penalize the appearance of positive patches (which would break the homogeneity of the bag, where most patches do not contain cancer).
Table 3: Confusion matrices obtained at the level of images for the compared methods.
Therefore, we conclude that instance correlation is beneficial when used with low-to-medium intensity. For instance, in this application λ = 0.5 is the best-performing value, and it will be the one used by default in the sequel.
Analyzing the instance-level predictions for negative bags. In the previous paragraph, we have explained that negative images are incorrectly classified because a few patches inside them get classified as positive. Here we provide a visualization to support this. Figure 4 shows how the patch-level predictions are distributed inside the six negative images available in the test set. We observe that the amount of patches with a non-negligible probability of cancer gets reduced as we move from VGPMIL to VGPMIL-PR, and then to VGPMIL-PR-I. This translates into better performance at bag level (observe that the amount of green-axis subplots increases in the same sequence VGPMIL → VGPMIL-PR → VGPMIL-PR-I). Notice that the improvement from VGPMIL to VGPMIL-PR is larger than from VGPMIL-PR to VGPMIL-PR-I. This may be due to the simplification that was introduced when computing the expectation of the truncated multivariate Gaussian in VGPMIL-PR-I, recall the third-to-last paragraph in Section 2.4.
Figure 4: Patch-level predictions for the six negative images in the test set. Each column is an image (the header is the image identifier in the SICAPv2 dataset). The rows refer to the three compared methods. Each subplot has green/red axes depending on whether the image is correctly classified (green) or not (red) by that method. The blue bars inside the subplots represent the probability of cancer for the different patches inside the image (for ease of visualization, they are sorted in increasing order).
Visualizing the predictions. In the last paragraph, we have analyzed quantitatively how the patch-level predictions are distributed inside non-cancerous images. Indeed, Figure 4 represents each patch through a bar. However, this hampers the qualitative visualization from a medical viewpoint. Here we focus on such qualitative assessment by visualizing the predictions obtained for image 16B0028138, see Figure 5. We have chosen this image because it best illustrates the effect of the coupling term. Notice that, thanks to these terms, the proposed VGPMIL-PR-I manages to keep all patches with a probability closer to zero than VGPMIL and VGPMIL-PR. As a consequence, VGPMIL-PR-I is the only method that correctly classifies this image as non-cancerous. For the other two methods, there are some patches that trigger the image prediction to be cancerous.
An explicit analysis on the role of λ. The hyperparameter λ is at the core of the novel VGPMIL-PR-I. It was introduced in the probabilistic model to regulate the strength of the coupling term, recall eq. (8). This role has been confirmed indirectly in Table 3: when λ gets higher, the predictive performance improves for negative bags and degrades for positive ones. This can be explained because a higher λ homogenizes the patch-level predictions and hinders the appearance of positive patches. Here we perform a more direct measurement to gain insights into the role of λ. Specifically, we define the "variability inside a bag" as the standard deviation of the probability of cancer for all the patches inside that bag. Therefore, this metric measures the dispersion in the patch-level predictions obtained inside a bag. Figure 6 shows the evolution of this metric for VGPMIL-PR-I as λ increases. As theoretically expected, the metric decreases as λ gets higher. Also, notice that the metric value for VGPMIL-PR is higher. This is explained because VGPMIL-PR does not incorporate Ising correlation, i.e. λ = 0.
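The "variability inside a bag" metric is straightforward to compute (toy probabilities; the helper name is ours):

```python
import numpy as np

def bag_variability(patch_probs):
    # "Variability inside a bag": std of the patch-level cancer probabilities.
    return float(np.std(patch_probs))

# Illustrative: stronger coupling homogenizes patch predictions,
# which lowers the variability inside the bag.
weak_coupling = [0.05, 0.10, 0.90, 0.08, 0.85]    # isolated positive patches
strong_coupling = [0.20, 0.25, 0.30, 0.22, 0.28]  # smoothed predictions
print(bag_variability(weak_coupling) > bag_variability(strong_coupling))  # True
```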
The value for VGPMIL is even higher, 0.30, and it is not included in Figure 6 for ease of visualization. This greater value is probably due to the additional approximations that VGPMIL involves, which strengthen the independence among patches.
Computational cost. Finally, we report the computational training and testing times for the compared methods, see Table 4. The results are in the same order of magnitude in all cases, which justifies the practical utility of the novel VGPMIL-PR-I, which obtained better predictive performance, recall Tables 1 and 2. In fact, VGPMIL-PR-I is slightly faster than VGPMIL, since the Jaakkola bound approximation leveraged in the latter introduces additional parameters ξ to be estimated. As theoretically expected, the computational cost of VGPMIL-PR-I and VGPMIL-PR is analogous, since the update equations for the former are just a generalization of those for the latter, recall Sections 2.4 and 2.5. Finally, notice that the value of λ does not affect the computational cost of VGPMIL-PR-I, as λ only regulates the intensity of the Ising terms (but it does not introduce any additional computation).
Comparison to other related MIL approaches. So far we have focused on the comparison of VGPMIL-PR-I with VGPMIL and VGPMIL-PR. Since VGPMIL-PR-I builds on the same type of GP-based modeling, this is the most meaningful comparison to evaluate our main contribution (the Ising term to account for correlations among patches). However, to provide a wider perspective, it is interesting to compare the novel VGPMIL-PR-I to other state-of-the-art and popular families of MIL methods. We consider three families: attention-based methods, where the two algorithms proposed in [35] are the most popular approaches; MIL methods based on pseudo-labels, such as the recent [7]; and classical pooling/aggregation methods, such as the mean aggregation [36]. These will be referred to as Att-MIL, Gated-Att-MIL, PS-MIL and Mean-Agg, respectively.
Let us discuss the results both at instance (patch) and bag (image) levels. For the former, notice that the formulation of attention-based methods (Att-MIL and Gated-Att-MIL) and classical aggregation methods (Mean-Agg) does not allow making predictions at the instance level in a natural way. Namely, the instance-level labels are not modelled explicitly in this type of method, and this is precisely one of their main limitations [37].
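The limitation just mentioned can be seen directly from how these baselines compute a bag representation. The sketch below, in NumPy, shows mean aggregation and an attention pooling in the spirit of [35]: both collapse the instance embeddings into a single bag embedding, so no per-instance label is ever modelled (attention weights indicate relevance, not a calibrated instance probability). The parameter shapes (`V`, `w`) and dimensions are our assumptions for illustration, not the exact architecture of the compared methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_agg(instance_feats):
    # Mean-Agg: the bag embedding is the average of the instance embeddings.
    return instance_feats.mean(axis=0)

def attention_pool(instance_feats, V, w):
    # Attention pooling in the style of [35]: a_i ∝ exp(w^T tanh(V h_i)).
    scores = np.tanh(instance_feats @ V.T) @ w       # one score per instance
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights /= weights.sum()
    return weights @ instance_feats, weights         # weighted bag embedding

H = rng.normal(size=(6, 16))    # 6 patch embeddings of (assumed) dimension 16
V = rng.normal(size=(8, 16))    # hypothetical attention parameters
w = rng.normal(size=8)

bag_mean = mean_agg(H)
bag_att, att_weights = attention_pool(H, V, w)
# att_weights sums to 1, but it is a relevance score over patches, not an
# explicit instance-label probability: instance-level prediction is not modelled.
```

A bag classifier is then trained on `bag_mean` or `bag_att` alone; the GP-based models, in contrast, place explicit latent labels on every instance.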

Evaluation on a larger dataset: PANDA
The SICAPv2 dataset used so far is of medium size (a total of 126 WSIs, leading to 20378 patches; recall Section 3.1). This has allowed us to carry out a very detailed analysis of the results. In this section we show that the novel VGPMIL-PR-I also performs well on larger datasets, such as the well-known PANDA dataset. Although scalability is not an issue from a theoretical perspective, since the model is based on sparse GPs, it is important to verify it in practice.
PANDA also tackles the problem of prostate cancer detection, and was presented as a challenge at the MICCAI 2020 conference. Since the test set of PANDA is not publicly available, we use the train/test split proposed in [38], where each split follows the overall class proportions. Namely, the dataset used here features a total of 10503 WSIs, which leads to 1107931 patches.
Notice that this is much larger than SICAPv2 (83 times larger in terms of WSIs). Additionally, VGPMIL-PR-I is able to provide instance-level predictions, which is not the case for attention-based models. We conclude that, for the PANDA dataset, the proposed method is the best choice in comparison to the other tested approaches.

Conclusions, limitations and future work
In this work we have introduced VGPMIL-PR-I, a novel MIL methodology that incorporates instance label correlation through a coupling term inspired by the Ising model. VGPMIL-PR-I is a generalization of another probabilistic MIL method, whose formulation is theoretically recovered when the influence of the Ising term vanishes. In the experimental section, we have shown that VGPMIL-PR-I outperforms other related state-of-the-art probabilistic MIL approaches in two real-world problems of prostate cancer detection, effectively reducing false positive bag predictions and providing instance-level predictions.
We have also provided different visualizations to better understand the behavior of the proposed model, especially the influence of the new coupling term.
As discussed along the paper, our model presents several limitations, which we summarize next. Firstly, we needed to introduce a diagonal approximation to compute the expectation of the truncated multivariate Gaussian in VGPMIL-PR-I, recall Section 2.4. This is probably reflected in the empirical performance, as the improvement when moving from VGPMIL to VGPMIL-PR is generally larger than when moving from VGPMIL-PR to VGPMIL-PR-I. Secondly, we have observed that the behaviour of VGPMIL-PR-I depends on the value of λ, which regulates the strength of the coupling term. Although we have discussed the role of λ and tested different values, it remains a hyperparameter that has to be found empirically using the validation set. We believe that its value (or even a distribution over it) could be estimated from the data by introducing λ in the probabilistic modeling. Moreover, λ could be estimated per image, since the level of correlation could be image-dependent. Thirdly, we have observed that the image-level performance of VGPMIL-PR-I is not generally better than that of attention-based methods. This is probably due to the different nature of the models. Indeed, the explicit modeling of instance labels in GP-based models, which allows them to provide instance-level predictions, may come at the cost of less accurate bag-level predictions.

Figure 1 :
Figure 1: Probabilistic graphical model for VGPMIL-PR (a) and VGPMIL-PR-I (b). Gray nodes are observed variables, and white ones are latent variables to be estimated. The main difference is that VGPMIL-PR-I introduces correlation between instances in the same bag. Therefore, the distribution of m_b given f_b does not factorize across instances. The correlation is introduced through a novel term inspired by the Ising model, see Section 2.3 for details.
Analogously to the rest of the variables, we write m_b = {m_i}_{i∈b} for all the m_i's inside bag b, and we use m = {m_b}_{b∈B} to collectively denote all the m_i's in the model. Then, by marginalizing out h, we have:
Notice that the probability of a configuration m_b is proportional to the coupling term C(m_b), which favors smoothness across labels of neighboring patches. Crucially, since both C(m_b) and N(m_b|f_b, I) only contain (the exponential of) terms that are at most quadratic in m_b, their product is again proportional to a multivariate Gaussian N(m_b|μ_{m_b}, Σ_b). However, the required expectation of the corresponding truncated multivariate Gaussian does not admit a tractable closed form. Therefore, we decided to approximate the multivariate Gaussian N(m_b|μ_{m_b}, Σ_b) by the factorized N(m_b|μ_{m_b}, diag(Σ_b)) and utilize the expression for one-dimensional truncated Gaussians. Although this approximation reduces the influence of the Ising correlation at this specific computation, notice that the coupling terms, which are included in Σ_b, affect the update equations in several other places.
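The diagonal approximation above can be sketched numerically. The snippet below implements the well-known closed-form mean of a one-dimensional truncated Gaussian and applies it coordinate-wise after replacing Σ_b by diag(Σ_b). The truncation interval [a, b] is a generic placeholder: the actual bounds used in VGPMIL-PR-I depend on the sign of each instance label and are not reproduced here.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def _phi(x):   # standard normal pdf
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def _Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def truncnorm_mean(mu, sigma, a, b):
    # Mean of a 1-D Gaussian N(mu, sigma^2) truncated to [a, b]:
    #   mu + sigma * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha)).
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    return mu + sigma * (_phi(alpha) - _phi(beta)) / (_Phi(beta) - _Phi(alpha))

def diag_truncated_mean(mu_vec, Sigma, a, b):
    # Diagonal approximation: ignore off-diagonal entries of Sigma and apply
    # the one-dimensional formula to each coordinate independently.
    sigmas = np.sqrt(np.diag(Sigma))
    return np.array([truncnorm_mean(m, s, a, b) for m, s in zip(mu_vec, sigmas)])
```

For instance, a standard Gaussian truncated to the positive half-line has mean sqrt(2/π) ≈ 0.798, which `truncnorm_mean(0, 1, 0, 10)` recovers up to the negligible mass beyond 10. The trade-off is the one stated above: the off-diagonal coupling information in Σ_b is discarded only at this expectation, not elsewhere in the updates.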

Algorithm 1
Training procedure for VGPMIL-PR-I. Input: bags X = {X_b}_{b∈B} and bag labels y = {y_b}_{b∈B}. Calculate the matrices C_b that account for the instance correlation inside each bag b ∈ B, recall eq. (
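The precomputation of the matrices C_b amounts to recording which patches of a WSI are neighbours. The sketch below shows one plausible way to do this bookkeeping, assuming patches lie on a regular grid with a 4-neighbourhood; the exact definition of C_b is given by the equation referenced above, so this is only an illustration of the adjacency structure, with λ as the coupling weight.

```python
import numpy as np

def coupling_matrix(coords, lam):
    """Illustrative construction of a per-bag coupling matrix: entry (i, j)
    is lam when patches i and j are 4-neighbours on the WSI grid, else 0.
    (Assumption: the paper's C_b may differ; this sketches the bookkeeping.)"""
    coords = np.asarray(coords)
    n = len(coords)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Manhattan distance 1 on the grid means the patches are adjacent.
            if np.abs(coords[i] - coords[j]).sum() == 1:
                C[i, j] = C[j, i] = lam
    return C

# Patches of a tiny 2x2 WSI grid; each patch has exactly two 4-neighbours.
C = coupling_matrix([(0, 0), (0, 1), (1, 0), (1, 1)], lam=0.5)
```

Note that setting `lam=0.0` yields an all-zero matrix, which is consistent with the fact that VGPMIL-PR is recovered when the Ising term vanishes.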

Figure 3 :
Figure 3: Two examples of cancerous (left) and non-cancerous (right) patches in the SICAPv2 test set.

Figure 4 :
Figure 4: Patch-level predictions inside each one of the six negative (non-cancerous) WSIs in the test set. Each column is an image (the header is the image identifier in the SICAPv2 dataset). The rows refer to the three compared methods. Each subplot has green/red axes depending on whether the image is correctly classified or not by that method. The blue bars inside each subplot represent the probability of cancer for the different patches inside the image (for ease of visualization, they are sorted in increasing order).

Figure 5:
Figure 5: Patch-level predictions obtained by the compared methods for image 16B0028138, which is non-cancerous. The original image is shown in (a). For predictions (b)-(d), the brightness of a patch is proportional to the probability of cancer (the brighter, the higher the probability).

Figure 6:
Figure 6: Evolution of the "variability inside a bag" metric for VGPMIL-PR-I as λ increases.
In addition to the aforementioned ideas, this work opens other future research lines. First, given the performance boost obtained in GP-based methods through the novel coupling term, and taking into account the good results of attention-based methods in bag-level prediction, it is very interesting to explore the modeling of instance label correlations in the context of attention-based methods. Second, notice that we are using mean-field variational inference to estimate the model parameters in VGPMIL-PR-I. A promising alternative is to estimate them by directly optimizing the evidence lower bound (ELBO). Finally, although we have focused on modeling correlation between neighboring patches in histopathological images, we expect that the ideas behind our proposal can boost further research in MIL by exploiting the particular structure of the data used in different applications.

Table 1 :
Predictive performance at the level of patches (instances). In bold, we highlight the values of λ for which VGPMIL-PR-I achieves performance better than (or equal to) both baselines in all the metrics. The results are the mean and standard deviation over five independent runs.

Table 2 :
Predictive performance at the level of images (bags). In bold, we highlight the values of λ for which VGPMIL-PR-I achieves performance better than (or equal to) both baselines in all the metrics. The results are the mean and standard deviation over five independent runs.

Table 4 :
Computational cost for training and testing the compared methods (in seconds). We use 200 iterations in all cases, recall the experimental details in Section 3.1. The results are the mean and standard deviation over five independent runs.

Table 5 :
Comparison with other related MIL methods which are not based on the GP modeling. The predictive performance at the level of images (bags) is shown. The results are the mean and standard deviation over five independent runs. The algorithm PS-MIL was run only once because of its high computational training cost.

Table 6
shows the predictive performance at the image level, an aspect where attention-based methods stood out in the previous dataset. In this case, we observe that VGPMIL-PR-I obtains consistently better results. Additionally, as outlined in the last research question in Section 3.2, VGPMIL-PR-I is able to provide instance-level predictions.

Table 6 :
Predictive performance at the level of images (bags) in the PANDA dataset. The results are the mean and standard deviation over five independent runs. The algorithm PS-MIL was run only once because of its high computational training cost.