Internal Faithfulness - Research Metrics Explorer


Internal Faithfulness

Contextuality

Desiderata

Fidelity

Explanation Type

WBS

References:

Messalas et al. (2019), Anders et al. (2020), Amparore et al. (2021)


Because WBS models can achieve predictive performance similar to that of the original black-box model without relying on the same underlying reasoning [Messalas et al. (2019), Anders et al. (2020)], it is essential to evaluate their internal fidelity, i.e., how similarly the two models justify their predictions. This can be done by comparing post-hoc explanations of the original and surrogate models for the same inputs. With feature attribution methods (e.g., SHAP [Lundberg (2017)]), a typical approach is to measure the average overlap of the top-k features between both models' explanantia [Messalas et al. (2019)]. Other similarity metrics and explanation types may also be used.
Alternatively, [Amparore et al. (2021)] compare counterfactuals generated from each model, treating their similarity as a proxy for the alignment of decision boundaries. This provides a structural view of how well the surrogate captures the black-box model's rationale beyond mere output agreement.
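
The top-k overlap described above can be illustrated with a minimal sketch. It assumes that per-instance feature attributions (for example, SHAP values) have already been computed for both models on the same inputs, and that features are ranked by absolute attribution. The function names `top_k_features` and `top_k_overlap` and the toy values are hypothetical and are not taken from the cited works, which may rank or aggregate differently.

```python
from statistics import mean


def top_k_features(attributions, k):
    """Return the indices of the k features with the largest absolute attribution."""
    ranked = sorted(
        range(len(attributions)),
        key=lambda i: abs(attributions[i]),
        reverse=True,
    )
    return set(ranked[:k])


def top_k_overlap(attr_blackbox, attr_surrogate, k):
    """Average fraction of top-k features shared by the two models.

    Both arguments are lists of per-instance attribution vectors for the
    same inputs. Ranking by absolute value is an assumption of this sketch.
    """
    if len(attr_blackbox) != len(attr_surrogate):
        raise ValueError("both models must be explained on the same inputs")
    overlaps = []
    for a, b in zip(attr_blackbox, attr_surrogate):
        if len(a) != len(b) or not 1 <= k <= len(a):
            raise ValueError("mismatched feature counts or invalid k")
        shared = top_k_features(a, k) & top_k_features(b, k)
        overlaps.append(len(shared) / k)
    return mean(overlaps)


# Toy attributions for two inputs and four features (illustrative values only).
blackbox_attr = [[0.9, -0.5, 0.1, 0.0], [0.2, 0.8, -0.7, 0.1]]
surrogate_attr = [[0.8, 0.1, -0.6, 0.0], [0.1, 0.9, -0.6, 0.3]]

score = top_k_overlap(blackbox_attr, surrogate_attr, k=2)
print(score)  # 0.75: one of two features shared on input 1, both on input 2
```

A score of 1.0 means the surrogate and the black-box agree on the k most influential features for every input; lower values indicate that similar outputs are being justified by different features.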

Output Faithfulness

Setup Consistency