Internal Faithfulness
Contextuality: II
Desiderata: Fidelity
Explanation Type: WBS
References:
Messalas et al. (2019), Anders et al. (2020), Amparore et al. (2021)
Since WBS models can achieve predictive performance similar to that of the original black-box without relying on the same underlying reasoning [Messalas et al. (2019), Anders et al. (2020)], it is essential to evaluate their internal fidelity, meaning the similarity in how both models justify their predictions. This can be done by comparing post-hoc explanations of the original and the surrogate model for the same inputs. With feature attribution methods (e.g., SHAP from [Lundberg (2017)]), a typical approach is to measure the average overlap of the top-k features between both models' explanantia [Messalas et al. (2019)]. Other similarity metrics and explanation types may also be used.
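As a rough illustration of the top-k overlap idea, the following minimal sketch assumes that attribution vectors for the black-box and the surrogate have already been computed (for example with SHAP) and are aligned per input and per feature. The function names, the ranking by absolute attribution, and the normalisation by k are assumptions made for this example, not necessarily the exact formulation used by Messalas et al. (2019).

```python
from typing import Sequence, Set


def top_k_indices(attributions: Sequence[float], k: int) -> Set[int]:
    """Indices of the k features with the largest absolute attribution."""
    ranked = sorted(
        range(len(attributions)),
        key=lambda i: abs(attributions[i]),
        reverse=True,
    )
    return set(ranked[:k])


def mean_top_k_overlap(
    black_box_attr: Sequence[Sequence[float]],
    surrogate_attr: Sequence[Sequence[float]],
    k: int,
) -> float:
    """Average fraction of top-k features shared by both models, over all inputs."""
    if len(black_box_attr) != len(surrogate_attr) or not black_box_attr:
        raise ValueError("Need the same non-zero number of explanations per model.")
    if k <= 0:
        raise ValueError("k must be positive.")
    overlaps = [
        len(top_k_indices(a, k) & top_k_indices(b, k)) / k
        for a, b in zip(black_box_attr, surrogate_attr)
    ]
    return sum(overlaps) / len(overlaps)


# Hypothetical attribution vectors: two inputs, four features each.
bb = [[0.5, -0.3, 0.1, 0.0], [0.2, 0.9, -0.4, 0.1]]
sg = [[0.4, 0.05, -0.2, 0.0], [0.1, 0.8, -0.5, 0.0]]
score = mean_top_k_overlap(bb, sg, k=2)
print(score)  # 0.75: one of two features shared for input 1, both for input 2
```

A score of 1.0 would mean both models rank the same k features highest for every input. Ranking by absolute value treats strongly negative and strongly positive attributions as equally important, which may or may not be appropriate for a given attribution method.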
Alternatively, [Amparore et al. (2021)] compare counterfactuals generated from each model, treating their similarity as a proxy for the alignment of decision boundaries. This provides a structural view of how well the surrogate captures the black-box model's rationale beyond mere output agreement.
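The counterfactual comparison can be sketched in a similar way. The example below assumes that one counterfactual per input has already been generated for each model by some external method, and it uses the mean Euclidean distance between paired counterfactuals as the similarity proxy. Both the function name and the choice of distance are assumptions for illustration and are not taken from Amparore et al. (2021).

```python
import math
from typing import Sequence


def mean_counterfactual_distance(
    cf_black_box: Sequence[Sequence[float]],
    cf_surrogate: Sequence[Sequence[float]],
) -> float:
    """Mean Euclidean distance between paired counterfactuals (lower = more aligned)."""
    if len(cf_black_box) != len(cf_surrogate) or not cf_black_box:
        raise ValueError("Need the same non-zero number of counterfactuals per model.")
    distances = [math.dist(a, b) for a, b in zip(cf_black_box, cf_surrogate)]
    return sum(distances) / len(distances)


# Hypothetical counterfactuals for two inputs in a two-feature space.
cf_bb = [[1.0, 2.0], [0.0, 0.0]]
cf_sg = [[1.0, 2.0], [3.0, 4.0]]
gap = mean_counterfactual_distance(cf_bb, cf_sg)
print(gap)  # 2.5: distances are 0.0 and 5.0
```

In practice the features would usually need to be scaled to comparable ranges before computing distances, since otherwise a single large-valued feature can dominate the result.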