References:
Laugel et al. (2019a), Laugel et al. (2019b), Mahajan et al. (2019), Singla et al. (2019), Artelt and Hammer (2020), Dandl et al. (2020), Kanamori et al. (2020), Delaney et al. (2021), Pawelczyk et al. (2021), Rasouli and Yu (2021), Smyth and Keane (2022)
To ensure that counterfactuals remain realistic and trustworthy, they should lie close to the true data manifold. Counterfactuals that deviate significantly from the distribution of training data are unlikely to be plausible. Two principal approaches are used to evaluate this alignment: direct estimation of data conformity and distance-based proximity to the training data.
In the direct approach, statistical or unsupervised outlier-detection methods assess whether the counterfactual is an outlier with respect to the training data. This includes calculating its likelihood under a kernel density estimator [Artelt and Hammer (2020)] or under known data distributions in synthetic setups [Mahajan et al. (2019)]. Other methods include the Local Outlier Factor [Kanamori et al. (2020), Delaney et al. (2021)] and Isolation Forests [Delaney et al. (2021)], which can be applied either across the full dataset or restricted to samples of the counterfactual's target class [Artelt and Hammer (2020)]. [Pawelczyk et al. (2021)] assess how many of a counterfactual's nearest neighbors share its class label.
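The neighbor-based check described last can be sketched in a few lines. The following is an illustrative toy implementation, not the exact procedure of any cited paper: it measures what fraction of a counterfactual's k nearest training neighbors carry the target class label (the function name, toy data, and k are assumptions for the example).

```python
import numpy as np

def neighbor_class_agreement(x_cf, X_train, y_train, target_class, k=5):
    """Fraction of the counterfactual's k nearest training neighbors
    that share its target class (higher = more plausible)."""
    dists = np.linalg.norm(X_train - x_cf, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.mean(y_train[nearest] == target_class)

# Toy data: class 0 clustered near the origin, class 1 near (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# A counterfactual placed inside the class-1 cluster scores high ...
print(neighbor_class_agreement(np.array([5.0, 5.0]), X, y, target_class=1))  # → 1.0
# ... while one between the clusters gets a mixed, less plausible score.
print(neighbor_class_agreement(np.array([2.5, 2.5]), X, y, target_class=1))
```

The same skeleton extends to the other direct methods: swapping the agreement score for a kernel density estimate or an Isolation Forest anomaly score changes only the scoring function, not the evaluation protocol.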
Alternatively, proximity can be evaluated through distance-based approaches, relying on any suitable distance measure (or inverted similarity function, see Similarity Measures). These include distances to the nearest neighbors in the training data [Laugel et al. (2019a), Dandl et al. (2020)], or to the nearest training instances that share the same class label as the counterfactual or the original instance [Rasouli and Yu (2021), Smyth and Keane (2022)]. Specific domains may require tailored metrics such as the Fréchet Inception Distance (FID) for images [Singla et al. (2019)].
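A minimal sketch of the distance-based variant, restricted to same-class training instances as in the works cited above (the function and toy data are assumptions for illustration, not a reference implementation):

```python
import numpy as np

def distance_to_class(x_cf, X_train, y_train, target_class):
    """Euclidean distance from the counterfactual to its nearest
    training instance of the target class (lower = more plausible)."""
    X_class = X_train[y_train == target_class]
    return np.min(np.linalg.norm(X_class - x_cf, axis=1))

# Toy data: class 0 near the origin, class 1 near (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# A counterfactual sitting on an actual class-1 training point has
# distance 0 to that class; one placed far away scores poorly.
print(distance_to_class(X[25], X, y, target_class=1))               # → 0.0
print(distance_to_class(np.array([10.0, 10.0]), X, y, target_class=1))
```

Any suitable metric can replace the Euclidean norm here; domain-specific measures such as FID play the same role for images.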
Raw distance scores are often used directly [Dandl et al. (2020)], but some methods normalize distances, for example, by comparing the counterfactual–input distance to average distances between random pairs in the dataset [Laugel et al. (2019a), Laugel et al. (2019b)], or to the original instance–counterfactual distance [Smyth and Keane (2022)].
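The normalization against random pairs can be sketched as follows. This is a hedged illustration of the general idea, sampling random training pairs to estimate the dataset's typical inter-point distance (the function name, pair count, and data are assumptions, not the cited authors' exact formulation):

```python
import numpy as np

def normalized_cf_distance(x, x_cf, X_train, n_pairs=1000, seed=0):
    """Counterfactual-input distance divided by the average distance
    between randomly sampled pairs of training instances."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_train), n_pairs)
    j = rng.integers(0, len(X_train), n_pairs)
    avg_pair_dist = np.mean(np.linalg.norm(X_train[i] - X_train[j], axis=1))
    return np.linalg.norm(x - x_cf) / avg_pair_dist

# Toy data: 100 standard-normal points in 3 dimensions.
rng = np.random.default_rng(2)
X = rng.normal(0, 1, (100, 3))
x = X[0]
x_cf = x + 0.1  # a small, nearby edit

# Values well below 1 mean the edit is small relative to typical
# distances in the dataset.
print(normalized_cf_distance(x, x_cf, X))
```

Dividing by a dataset-level scale makes scores comparable across datasets with different feature ranges, which is the motivation for normalizing rather than reporting raw distances.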

