
Helper Functions

This page describes key helper functions used across various metrics, as detailed in Appendix B.2 of our paper. These components support modularity and can often be replaced or customized.

Overview

We categorize helper functions into three classes: Perturbation Approaches, Normalization of Explanantia, and Similarity Measures. Each section includes a description of available variants and their intended use.

Perturbation Approaches

Perturbations are small changes typically applied to input features and are recurring components in metrics targeting Fidelity and Continuity. There exists a wide range of perturbation strategies, and the choice of approach can significantly affect both metric results and their interpretation [Brunke et al. (2020), Funke et al. (2022), Rong et al. (2022)].
We distinguish first by the Perturbation Scope, i.e., the parts of the input that are modified. Perturbations can be applied at a fine-grained level (e.g., individual features) or on more structured, high-level groupings. For image data, scope definitions may involve aggregating pixels into fixed grids [Schulz et al. (2020)] or segmenting into superpixels [Ribeiro et al. (2016), Kapishnikov et al. (2019), Rieger and Hansen (2020)]. In time-series data, one may perturb fixed-length windows, with the target time-step at the beginning or middle [Schlegel et al. (2019), Schlegel et al. (2020)]. In topologically ordered domains (e.g., images, time-series, or graphs), adjacent features can be perturbed together, such as modifying the area surrounding a focal pixel [Samek et al. (2016), Brahimi et al. (2019)]. Higher-level approaches include perturbing Concepts [Shawi et al. (2021)] or internal activations associated with object parts [Zhang et al. (2019b)].
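As a minimal sketch of such a scope definition, the following helper partitions a time series into fixed-length windows that are perturbed together, in the spirit of the window-based strategies cited above. The function name and window handling are illustrative assumptions, not part of any cited implementation.

```python
def window_scopes(series_length, window_size):
    """Partition a 1-D time series of the given length into fixed-length
    windows. Each returned scope is a list of feature indices that a
    perturbation function would modify together."""
    return [
        list(range(start, min(start + window_size, series_length)))
        for start in range(0, series_length, window_size)
    ]

# A series of length 10 with window size 4 yields windows of
# indices [0..3], [4..7], and a shorter trailing window [8, 9].
scopes = window_scopes(series_length=10, window_size=4)
```

Analogous helpers could return superpixel segments for images or k-hop neighborhoods for graphs; only the grouping logic changes.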
Once the scope is defined, a variety of Perturbation Functions can be applied. In fact, any perturbation function might be suitable [Hameed et al. (2022), Schlegel and Keim (2023)]; here, however, we present some of the most common in the literature. At the simplest level, features may be removed by setting them to zero or dropping them entirely, especially in structured domains such as graphs, text, or time-series data, as employed by many authors (e.g., [Bach et al. (2015), Ancona et al. (2017), Alvarez-Melis and Jaakkola (2018a), Chu et al. (2018), Arya et al. (2019), DeYoung et al. (2019), Schlegel et al. (2019), Cong et al. (2020), Singh et al. (2020), Warnecke et al. (2020), Bajaj et al. (2021), Faber et al. (2021), Singh et al. (2021), Jin et al. (2023)]). Alternatively, features can be replaced by a fixed value, e.g., the per-channel or per-instance mean [Petsiuk (2018), Schlegel et al. (2019), Schlegel et al. (2020), Hameed et al. (2022), Jin et al. (2023)].
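A minimal sketch of these two simple perturbation functions, zero-substitution and per-instance mean replacement, might look as follows; the function name and `strategy` parameter are illustrative assumptions.

```python
import numpy as np

def perturb(x, indices, strategy="zero"):
    """Perturb the selected features of a 1-D input vector.

    'zero' sets the features to 0; 'mean' replaces them with the
    per-instance mean of the original input.
    """
    x = np.asarray(x, dtype=float).copy()
    if strategy == "zero":
        x[indices] = 0.0
    elif strategy == "mean":
        x[indices] = x.mean()  # mean of the unperturbed instance
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return x

x = np.array([1.0, 2.0, 3.0, 4.0])
perturb(x, [0, 1], "zero")  # -> [0., 0., 3., 4.]
perturb(x, [0], "mean")     # -> [2.5, 2., 3., 4.]
```

Per-channel mean replacement for images would follow the same pattern, averaging over the spatial axes instead of the whole instance.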
To generate less deterministic perturbations, authors propose adding random noise (e.g., Gaussian) or drawing values from uniform distributions [Yeh et al. (2019), Bhatt et al. (2020), Sturmfels et al. (2020), Bajaj et al. (2021), Funke et al. (2022), Veerappa et al. (2022)]. Other strategies leverage the spatial structure of the data: for instance, applying blurring or interpolation [Sturmfels et al. (2020), Rong et al. (2022)], or reordering spatial regions [Schlegel et al. (2019), Chen et al. (2020)]. Instead of applying synthetic noise, values can be resampled from the marginal distribution of a feature, from its nearest neighbor, or even from an opposite-class example [Guo et al. (2018a), Hameed et al. (2022)]. Where influence regions are known, perturbations can be constrained to lie inside or outside these intervals [Velmurugan et al. (2021a)].
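Two of these stochastic variants, additive Gaussian noise and resampling from the marginal distribution of a background dataset, can be sketched as follows. The function names, the `sigma` default, and the background-set interface are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_noise(x, indices, sigma=0.1):
    """Add zero-mean Gaussian noise to the selected features only."""
    x = np.asarray(x, dtype=float).copy()
    x[indices] += rng.normal(0.0, sigma, size=len(indices))
    return x

def perturb_marginal(x, indices, background):
    """Replace each selected feature with a value drawn from its
    marginal distribution, i.e., that feature's column in a
    background dataset (rows sampled independently per feature)."""
    x = np.asarray(x, dtype=float).copy()
    rows = rng.integers(0, background.shape[0], size=len(indices))
    x[indices] = background[rows, indices]
    return x
```

Nearest-neighbor or opposite-class resampling would differ only in how the replacement rows are chosen.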
In the NLP domain, word embeddings can be noised, or tokens substituted using synonym sets and domain knowledge [Yin et al. (2021)]. For image inputs, another strategy is cropping and resizing to emphasize or suppress local information [Dabkowski and Gal (2017)].
Finally, when perturbations are guided by a FA, their intensity can be scaled proportionally to the assigned importance scores [Chattopadhay et al. (2018), Guo et al. (2018a), Jung and Oh (2021)].
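Such attribution-guided scaling can be sketched as below, where per-feature Gaussian noise is weighted by the (max-normalized, absolute) importance scores, so highly attributed features are perturbed more strongly. The weighting scheme shown is one plausible choice, not the specific formulation of the cited works.

```python
import numpy as np

def scaled_noise_perturbation(x, attributions, sigma=0.1, seed=0):
    """Perturb each feature with Gaussian noise whose magnitude is
    proportional to that feature's attribution score."""
    rng = np.random.default_rng(seed)
    weights = np.abs(np.asarray(attributions, dtype=float))
    weights = weights / (weights.max() + 1e-12)  # scale into [0, 1]
    noise = rng.normal(0.0, sigma, size=np.asarray(x).shape)
    return np.asarray(x, dtype=float) + noise * weights
```

Features with zero attribution are left unchanged, while the most important feature receives the full noise magnitude.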

Normalization of Explanantia

Since FAs and CEs are typically represented as real-valued vectors, computed through various mechanisms, their value ranges are not inherently standardized. However, many metrics either explicitly require the explanans to lie within a fixed range or implicitly assume comparability across explanantia, making normalization a necessary preprocessing step. A widely used normalization method is Min-Max Scaling [Binder et al. (2023), Brandt et al. (2023)], which maps all values into a fixed interval (typically [0, 1]). Alternative strategies include normalization based on the square root of the average second-moment estimate, offering robustness to outliers and variance shifts [Binder et al. (2023)]. This limited selection can be extended through any suitable normalization approach.
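Both normalization variants admit straightforward one-line implementations; the following sketch uses a small epsilon to guard against constant explanantia, an implementation detail we assume rather than take from the cited works.

```python
import numpy as np

def min_max(e, eps=1e-12):
    """Min-max scaling: map an explanans into [0, 1]."""
    e = np.asarray(e, dtype=float)
    return (e - e.min()) / (e.max() - e.min() + eps)

def second_moment(e, eps=1e-12):
    """Divide by the square root of the average second moment
    (root mean square), which is less sensitive to single outliers
    than the min/max values."""
    e = np.asarray(e, dtype=float)
    return e / (np.sqrt(np.mean(e ** 2)) + eps)
```

After `second_moment`, the explanans has a root-mean-square value of 1, making magnitudes comparable across explanantia without clamping their range.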

Similarity Measures

Across various metrics, it is necessary to calculate similarities or distances between two explanantia, especially for metrics targeting the desideratum of Continuity and for those relying on Ground-Truth evaluations. The literature reports a variety of approaches, which differ based on the type of explanation. While some measures directly compute similarity, others quantify distance or disparity. In this work, we adopt a similarity-based framing, either directly or by transforming distance measures, so that higher values uniformly indicate greater explanatory agreement. Analogously, the similarity between explananda can be computed using the measures presented in the following.
For FAs, similarity may be measured using arbitrary inverted distance or loss measures (e.g., L_p, MSE, cosine distance, JS-Divergence, or Bhattacharyya Coefficient), potentially normalized (e.g., by standard deviation) [Alvarez-Melis and Jaakkola (2018b), Alvarez-Melis and Jaakkola (2018a), Chu et al. (2018), Wu and Mooney (2018), Jain and Wallace (2019), Jia et al. (2019), Mitsuhara et al. (2019), Pope et al. (2019), Trokielewicz et al. (2019), Yeh et al. (2019), Zhang et al. (2019a), Jia et al. (2020), Agarwal et al. (2022b), Atanasova et al. (2022), Dai et al. (2022), Fouladgar et al. (2022), Agarwal et al. (2023), Huang et al. (2023a), Nematzadeh et al. (2023)]. Alternatively, rank correlation measures such as Spearman's or Kendall's Tau can be applied [Das et al. (2017), Adebayo et al. (2018), Chen et al. (2019a), Dombrowski et al. (2019), Ghorbani et al. (2019), Nguyen and Martínez (2020), Rajapaksha et al. (2020), Sanchez-Lengeling et al. (2020), Liu et al. (2021a), Yin et al. (2021), Krishna et al. (2022), Huang et al. (2023a)]. When binarizing FA outputs through thresholding, feature-wise evaluation measures such as accuracy, precision, F_1, or AUROC are commonly used [Chen et al. (2018b), Yang et al. (2018a), Jia et al. (2019), Jia et al. (2020), Sanchez-Lengeling et al. (2020), Bykov et al. (2021), Joshi et al. (2021), Park and Wallraven (2021), Amoukou et al. (2022), Chen et al. (2022), Funke et al. (2022), Tjoa and Guan (2022), Wilming et al. (2022), Agarwal et al. (2023)]. Similarly, IoU can be calculated over binarized features [Oramas et al. (2017), Fan et al. (2020), Kim et al. (2021), Situ et al. (2021), Vermeire et al. (2022)], or top-k intersection measures may be used [Ghorbani et al. (2019), Mishra et al. (2020), Rajapaksha et al. (2020), Warnecke et al. (2020), Amparore et al. (2021), Bajaj et al. (2021)]. For saliency maps, specialized similarity measures are available, such as SSIM [Adebayo et al. (2018), Dombrowski et al. (2019), Rebuffi et al. (2020), Graziani et al. (2021), Sun et al. (2023)], Earth Mover's Distance [Park et al. (2018), Wu and Mooney (2018)], Normalized Cross-Correlation [Baumgartner et al. (2018), Bass et al. (2020)], or Mutual Information [Sun et al. (2023)]. While primarily established for FAs, these similarity functions can naturally be applied to CE-based explanations as well.
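Two of these ideas, inverting a distance into a similarity and computing a top-k intersection, can be sketched as follows; the 1/(1 + d) inversion is one common convention we assume, not the only valid transformation.

```python
import numpy as np

def l2_similarity(a, b):
    """Invert the L2 distance into a similarity in (0, 1]:
    identical explanantia score 1, diverging ones approach 0."""
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(a) - np.asarray(b)))

def topk_intersection(a, b, k):
    """Fraction of features shared between the top-k most important
    features (by absolute attribution) of two explanantia."""
    top_a = set(np.argsort(-np.abs(np.asarray(a)))[:k])
    top_b = set(np.argsort(-np.abs(np.asarray(b)))[:k])
    return len(top_a & top_b) / k
```

Rank correlations (e.g., Spearman's) follow the same pattern but compare the full importance orderings rather than only the top-k sets.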
For WBSs and ExEs, the choice of similarity measure depends strongly on the underlying model or domain. For linear predictive models, coefficient mismatch is a common choice [Lakkaraju et al. (2020)], whereas rule- and tree-based explanantia may be compared by their rule overlap, feature usage, or node structures [Bastani et al. (2017), Guidotti et al. (2019), Lakkaraju et al. (2020), Rajapaksha et al. (2020), Margot and Luta (2021)].
NLEs can be compared using standard natural language processing measures [Camburu et al. (2018), Chuang et al. (2018), Liu et al. (2018a), Wu and Mooney (2018), Chen et al. (2019d), Rajani et al. (2019), Wickramanayake et al. (2019), Li et al. (2020a), Sun et al. (2020), Jang and Lukasiewicz (2021), Atanasova (2024)]. These include, for instance, BLEU [Papineni et al. (2002)], METEOR [Banerjee and Lavie (2005)], ROUGE [Lin (2004)], CIDEr [Vedantam et al. (2015)], or SPICE [Anderson et al. (2016)]. In addition, several measures have been proposed specifically for evaluating natural language explanations [Xie et al. (2021), Du et al. (2022), Rodis et al. (2024), Park et al. (2018)].
When comparing similarities over multiple instances, the most natural aggregation is to compute the mean similarity [Fan et al. (2020), Fouladgar et al. (2022), Yeh et al. (2019)]. Depending on the evaluation goal, alternative aggregation strategies may offer more informative insights. For example, worst-case stability, defined as the minimum similarity across inputs, can be used to quantify robustness [Alvarez-Melis and Jaakkola (2018b), Alvarez-Melis and Jaakkola (2018a), Yeh et al. (2019), Yin et al. (2021), Fouladgar et al. (2022)].
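The two aggregation strategies reduce to a one-line difference, sketched here with an assumed `mode` parameter:

```python
import numpy as np

def aggregate(similarities, mode="mean"):
    """Aggregate per-instance similarity scores: 'mean' reports
    average agreement, 'min' reports worst-case stability."""
    sims = np.asarray(similarities, dtype=float)
    return float(sims.mean() if mode == "mean" else sims.min())
```

For a robustness-oriented evaluation, a single poorly explained instance drags the `min` aggregate down even when the `mean` remains high.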