Adversarial Model Resilience
Contextuality: IV
Desiderata: Consistency
Explanation Type: FA (ExE) (CE) (WBS) (NLE)
References:
Heo et al. (2019), Pruthi et al. (2019), Viering et al. (2019), Dimanov et al. (2020)
Explanation methods should depend meaningfully on the internal parameters of the black-box model. However, this sensitivity can be exploited to adversarially manipulate the explanations themselves: small but carefully chosen changes to the model's weights can alter the generated explanations without significantly changing its predictions. Such manipulation can serve to obfuscate undesirable behavior or bias within a model.
To assess the robustness of explanation methods against such attacks, various strategies perturb the model parameters and observe the resulting changes in explanantia. These attacks can be:
Targeted, where the modified model is encouraged to produce explanantia similar to a predefined target [Heo et al. (2019), Viering et al. (2019)].
Untargeted, where the goal is to produce explanantia that differ substantially from the original ones [Heo et al. (2019), Pruthi et al. (2019), Dimanov et al. (2020)] (both objectives are sketched below).
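As a rough illustration of these two objectives, the following PyTorch-style sketch uses plain gradient saliency as the feature attribution and a mean-squared-error loss between attribution maps; the model, loss form, and explanation method are illustrative assumptions, not the exact setups of the cited papers.

```python
import torch
import torch.nn.functional as F

def saliency(model, x, y):
    """Simple gradient-based feature attribution: |d logit_y / d x|."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    score = logits.gather(1, y.unsqueeze(1)).sum()
    # create_graph=True so the model weights can later be optimized
    # through the explanation itself.
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs()

def manipulation_loss(model, x, y, original_expl, target_expl=None):
    """Fine-tuning loss that changes explanations.

    Targeted:   pull the explanans toward a predefined target map.
    Untargeted: push the explanans away from the original one.
    """
    expl = saliency(model, x, y)
    if target_expl is not None:                  # targeted attack
        return F.mse_loss(expl, target_expl)
    return -F.mse_loss(expl, original_expl)      # untargeted attack
```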
To preserve prediction behavior, constraints may be imposed on the model modification. Dimanov et al. (2020) bound the change in prediction scores, but other restrictions are conceivable, such as limiting the norm of the weight difference to keep the modified model close to the original.
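Continuing the hypothetical sketch above, one way such a constraint could enter the objective is as a penalty that keeps the manipulated model's prediction scores close to those of a frozen copy of the original model; the concrete bound used by Dimanov et al. (2020) differs in detail.

```python
import torch
import torch.nn.functional as F

def constrained_objective(model, frozen_model, x, y, original_expl,
                          target_expl=None, pred_weight=10.0):
    """Manipulation loss plus a penalty on prediction drift."""
    attack = manipulation_loss(model, x, y, original_expl, target_expl)
    # Output distribution of the unmodified model, held fixed.
    with torch.no_grad():
        original_probs = F.softmax(frozen_model(x), dim=1)
    # Penalize divergence between manipulated and original prediction scores.
    pred_shift = F.kl_div(F.log_softmax(model(x), dim=1), original_probs,
                          reduction="batchmean")
    # An alternative restriction would penalize the weight-difference norm instead.
    return attack + pred_weight * pred_shift
```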
The success of the manipulation is measured according to the attack setup: for targeted attacks, by the similarity between the manipulated explanans and the predefined target [Viering et al. (2019)]; for untargeted attacks, by the dissimilarity between the manipulated and the original explanans [Dimanov et al. (2020), Pruthi et al. (2019)].
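As one possible instantiation of such a measure (the cited papers use their own metrics, e.g. rank correlations or top-k intersections), cosine similarity between flattened attribution maps could quantify attack success in both settings:

```python
import torch.nn.functional as F

def attack_success(manipulated_expl, reference_expl, targeted=True):
    """Score the attack via cosine similarity of flattened attribution maps.

    Targeted:   reference_expl is the predefined target; higher similarity = success.
    Untargeted: reference_expl is the original explanans; lower similarity = success.
    """
    a = manipulated_expl.flatten(start_dim=1)
    b = reference_expl.flatten(start_dim=1)
    sim = F.cosine_similarity(a, b, dim=1).mean()
    return sim if targeted else 1.0 - sim
```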
All approaches were originally introduced for FA, but by applying suitable similarity measures (see Similarity Measures), they can be extended to other explanation types as well.