Adversarial Input Resilience
Explanation Type: FA (ExE) (CE) (WBS) (NLE)
References:
Singh et al. (2018), Wang et al. (2018b), Chen et al. (2019a), Dombrowski et al. (2019), Ghorbani et al. (2019), Subramanya et al. (2019), Boopathy et al. (2020), Kuppa and Le-Khac (2020), Zhang et al. (2020), Huang et al. (2023a)
Explanations can be vulnerable to Adversarial Attacks, where the goal is to manipulate either the explanans [Ghorbani et al. (2019)] or the prediction [Singh et al. (2018), Wang et al. (2018b), Subramanya et al. (2019)]. More sophisticated approaches manipulate one while constraining changes to the other. Two main types exist:
Explanans manipulation: The explanans is altered while the prediction remains fixed [Dombrowski et al. (2019), Boopathy et al. (2020), Kuppa and Le-Khac (2020), Huang et al. (2023a)].
Prediction manipulation: The prediction changes while the explanans remains similar [Zhang et al. (2020), Huang et al. (2023a)].
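The first attack type can be illustrated with a minimal sketch in the spirit of the gradient-based explanans manipulations cited above. Everything here is an illustrative assumption rather than any cited author's exact method: a toy two-layer network, softplus instead of ReLU so the saliency map is smooth enough to attack, an input-gradient explanans, and hand-picked loss weights. The attack ascends on the distance between the current and original saliency map while a penalty term keeps the logit close to its original value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network with random stand-in weights. Softplus is used
# instead of ReLU because ReLU saliency is piecewise constant, so its
# gradient w.r.t. the input vanishes almost everywhere.
d, h = 10, 16
W1 = rng.normal(size=(h, d))
b1 = rng.normal(size=h)
w2 = rng.normal(size=h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    z = W1 @ x + b1
    y = w2 @ np.logaddexp(0.0, z)   # softplus hidden layer, scalar logit
    e = W1.T @ (w2 * sigmoid(z))    # saliency: gradient of the logit w.r.t. x
    return y, e, z

def attack(x0, eps=0.3, lam=50.0, lr=0.05, steps=200):
    """Explanans-manipulation sketch: push the saliency map away from the
    original while a penalty keeps the prediction (the logit) nearly fixed."""
    y0, e0, _ = forward(x0)
    # Small jitter: at x0 both loss gradients are exactly zero.
    x = x0 + 0.01 * rng.normal(size=x0.size)
    for _ in range(steps):
        y, e, z = forward(x)
        sp = w2 * sigmoid(z) * (1.0 - sigmoid(z))        # softplus curvature term
        grad_expl = 2.0 * W1.T @ (sp * (W1 @ (e - e0)))  # d||e - e0||^2 / dx
        grad_pred = 2.0 * (y - y0) * e                   # d(y - y0)^2 / dx
        # Ascend on explanation distance, descend on prediction drift.
        x = x + lr * (grad_expl - lam * grad_pred)
        x = x0 + np.clip(x - x0, -eps, eps)              # bounded input change
    return x

x0 = rng.normal(size=d)
x_adv = attack(x0)
y0, e0, _ = forward(x0)
y_adv, e_adv, _ = forward(x_adv)
```

A prediction-manipulation attack follows the same pattern with the two loss terms swapped: ascend on the prediction change and penalize drift of the explanans.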
Evaluation typically measures the distance between the adversarial output (explanans or prediction) and either its original or its targeted counterpart, with the input perturbation usually norm-bounded. Performance is reported either as aggregated distance metrics [Dombrowski et al. (2019), Ghorbani et al. (2019), Zhang et al. (2020), Huang et al. (2023a)] or as attack success rates based on predefined thresholds [Kuppa and Le-Khac (2020), Zhang et al. (2020), Huang et al. (2023a)].
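Both reporting styles can be sketched in a few lines. The function names, the top-k agreement measure, and the threshold `tau` below are illustrative choices for feature-attribution maps, not metrics prescribed by the cited works:

```python
import numpy as np

def topk_intersection(e1, e2, k=5):
    """Fraction of shared indices among the k most-attributed features;
    a common similarity measure for feature-attribution maps."""
    top1 = set(np.argsort(-np.abs(e1))[:k])
    top2 = set(np.argsort(-np.abs(e2))[:k])
    return len(top1 & top2) / k

def aggregated_distance(originals, adversarials):
    """Mean L2 distance between original and adversarial explanans."""
    return float(np.mean([np.linalg.norm(a - o)
                          for o, a in zip(originals, adversarials)]))

def success_rate(originals, adversarials, tau=0.5, k=5):
    """Attack counts as successful when top-k agreement drops below tau."""
    hits = [topk_intersection(o, a, k) < tau
            for o, a in zip(originals, adversarials)]
    return float(np.mean(hits))

# Toy pair: the adversarial map reverses the feature ranking.
orig = [np.array([1., 2., 3., 4., 5.])]
adv = [np.array([5., 4., 3., 2., 1.])]
```

On this toy pair, `topk_intersection(orig[0], adv[0], k=2)` is 0.0 (the two most-attributed features no longer overlap), so the attack counts as successful under `tau=0.5`.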
While the works above focus on feature attributions (FAs), the approach can readily be adapted to other explanation types by selecting appropriate similarity measures.