Model-Agnostic Explanation Consistency
Contextuality: I
Desiderata: Plausibility (Consistency)
Explanation Type: FA, ExE (CE) (WBS) (NLE)
References:
Fan et al. (2020), Nguyen et al. (2020), Hvilshøj et al. (2021), Jiang et al. (2023)
To assess whether explanations reflect generalizable patterns rather than model-specific artifacts (such as adversarial shortcuts in counterfactual explanations), several authors evaluate explanations across different models trained on the same dataset. In general, the approach involves training additional black-box models (oracles) on the same data and task. These oracles are then used to evaluate explanations from the original model.
One strategy is to directly compute the similarity between explanantia generated by different models for the same input [Fan et al. (2020)]. While this has been reported for FAs, it is applicable to any explanation type, provided a suitable similarity metric is chosen (see Similarity Measures).
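A minimal sketch of this cross-model similarity check for feature attributions is given below. It assumes both models are trained on the same data and that an attribute(model, x) helper (e.g., a SHAP or Integrated Gradients wrapper) returns a flat attribution vector for a single input; cosine similarity is used here purely as one possible choice of similarity metric. All names are illustrative, not taken from the cited paper.

import numpy as np

def attribution_agreement(model_a, model_b, attribute, inputs):
    # Mean cosine similarity between the attributions two models
    # assign to the same inputs; values near 1 indicate agreement.
    sims = []
    for x in inputs:
        a = np.asarray(attribute(model_a, x), dtype=float).ravel()
        b = np.asarray(attribute(model_b, x), dtype=float).ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b / denom) if denom > 0 else 0.0)
    return float(np.mean(sims))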
Alternatively, explanations from the original model are evaluated using the oracles:
[Nguyen et al. (2020)] apply perturbation-based evaluation (e.g., the metric Guided Perturbation Fidelity) to a secondary model to assess whether the explanation highlights features that are generally important across models.
[Hvilshøj et al. (2021)] propose that a counterfactual should change the prediction of both the original model and the oracle; only such counterfactuals are considered plausible (see the sketch after this list).
• An extension of this approach trains one oracle per class and computes the Jensen–Shannon divergence between each class oracle's predictions on the original and the counterfactual input. Ideally, only the target and original classes should exhibit strong divergence [Hvilshøj et al. (2021)].
[Jiang et al. (2023)] define a neighborhood of models via small weight perturbations and count how many counterfactuals remain valid across all neighbors.
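The sketch below illustrates the counterfactual-based checks from the list above. It assumes models expose predict / predict_proba methods (with predict_proba returning a 1-D probability vector for a single input) and that perturb_weights adds small noise to a copy of a model's weights; all function and parameter names are illustrative placeholders rather than APIs from the cited papers.

import copy
from scipy.spatial.distance import jensenshannon

def cross_model_valid(original, oracle, x, x_cf):
    # In the spirit of Hvilshøj et al.: the counterfactual must flip the
    # prediction of the original model and of an oracle trained on the same data.
    return (original.predict(x_cf) != original.predict(x)
            and oracle.predict(x_cf) != oracle.predict(x))

def per_class_divergence(class_oracles, x, x_cf):
    # Extension: one oracle per class; report the Jensen–Shannon divergence
    # between each oracle's predicted distribution on x and on x_cf.
    # jensenshannon returns the distance, i.e. the square root of the divergence.
    return {c: float(jensenshannon(o.predict_proba(x), o.predict_proba(x_cf)) ** 2)
            for c, o in class_oracles.items()}

def neighborhood_validity(model, x, x_cf, perturb_weights, n_neighbors=10):
    # In the spirit of Jiang et al.: fraction of weight-perturbed copies of the
    # model for which the counterfactual still changes the prediction.
    valid = 0
    for _ in range(n_neighbors):
        neighbor = perturb_weights(copy.deepcopy(model))
        if neighbor.predict(x_cf) != neighbor.predict(x):
            valid += 1
    return valid / n_neighbors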
These oracle-based approaches might be extended to other explanation types. However, their interpretive strength remains limited: it is unclear whether explanations should be similar across models, as different model architectures may learn distinct (yet valid) rationales. As a result, high or low agreement does not always reflect explanation quality, making this metric inherently context-dependent.