
Categorization Scheme

This page describes the categorization scheme introduced in Section 4.1 of our paper. The scheme is structured along three dimensions: Desiderata, Explanation Type, and Contextuality.

Overview

The following table provides a high-level overview of all values in each dimension; each value is described in detail below.

Desiderata

Desiderata describe criteria for what constitutes a good explanation. We propose a set of seven functionality-grounded desiderata.

Parsimony

The explanation should keep the explanans concise to support interpretability.

The primary purpose of an explanation is to convey information about the black-box model or its decision process to humans. Therefore, the resulting explanans must be expressed in a way that the human mind can grasp easily, increasing the explanation's success. While actual interpretability can only be evaluated through human-grounded evaluation, Parsimony is one of the most prevalent proxies defined to serve as a functionality-grounded approximation. Since our mental capacity is limited and we tend to struggle with an overload of information [Miller (1956), Miller (2019), Alangari et al. (2023b)], providing short and simple explanantia helps humans understand more effectively.
[Nauta et al. (2023)] introduce the property of Compactness, arguing that a briefer explanans is easier to understand. Similarly, they use Covariate Complexity to assess how complex the features are that constitute the explanans, where higher interpretability is supported by providing a few high-level concepts rather than a very granular explanans. Both of these aspects are summarized under Parsimony by [Markus et al. (2021), Zhou et al. (2021)], preferring simpler explanantia over longer or more complex ones. The scheme used in the Quantus library [Hedström et al. (2023), Bommer et al. (2024)] defines a group called Complexity. It specifically tests for concise explanantia, aiming for as few features as possible so that they are easier to understand. The associated interpretability desiderata from other authors are defined less explicitly, but similarly favor simpler explanantia [Andrews et al. (1995), Johansson et al. (2004), Guidotti et al. (2018)], proposing that simple explanantia should be short [Alvarez-Melis and Jaakkola (2018a), Jesus et al. (2021), Alangari et al. (2023b)], promoting small explanantia that focus only on relevant parts [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020)], and expecting an explanans with concentrated information to facilitate human understanding [Belaid et al. (2022)].
Following the proposed definitions, we include Parsimony as one of our desiderata. It expects explanantia to be as brief and concise as possible, to ensure that rationales can be understood easily and quickly. We focus Parsimony exclusively on this aspect, as other associated properties are either covered by separate desiderata (such as truthfulness of the explanation) or excluded entirely as they are not functionality-grounded (general understandability of the explanation).
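To make the idea concrete, a functionality-grounded Parsimony proxy can count how many features are needed to cover most of the total attribution mass. The sketch below is illustrative only; the function name parsimony_score and the 90% mass threshold are our own choices, not a metric taken from any of the cited frameworks.

```python
def parsimony_score(attributions, mass=0.9):
    """Fraction of features needed to cover `mass` of the total absolute
    attribution. Lower values indicate a more parsimonious explanans."""
    total = sum(abs(a) for a in attributions)
    if total == 0:
        return 0.0
    # Rank features by absolute relevance, most important first.
    ranked = sorted((abs(a) for a in attributions), reverse=True)
    covered, k = 0.0, 0
    for a in ranked:
        covered += a
        k += 1
        if covered >= mass * total:
            break
    return k / len(attributions)
```

A sparse explanans such as [10, 0.1, 0.1, 0.1] scores low (a single feature carries almost all relevance), while a uniform one such as [1, 1, 1, 1] scores 1.0.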

Plausibility

The explanation should shape the explanans to align with human expectations.

Because it improves the acceptance of explanations and facilitates their interpretation, the Plausibility desideratum is contained in most authors' interpretability desiderata. Coherence as defined by [Nauta et al. (2023)] is the accordance of an explanans with the user's previous knowledge and expectations. The metrics classified under Localization by [Hedström et al. (2023)] evaluate whether an explanation shows a rationale similar to what humans would expect. This is concordant with the definition of Comprehensibility [Alvarez-Melis and Jaakkola (2018a), Jesus et al. (2021), Alangari et al. (2023b)], which states that an explanans should be similar to what a human expert would choose as the correct rationale. Furthermore, [Nauta et al. (2023)] introduce Contrastivity, which supports Plausibility, as an explanans should be specific to the given explanandum. Similarly, Clarity is introduced by [Markus et al. (2021), Zhou et al. (2021)], expecting explanantia to be unambiguous.
We include the Plausibility desideratum, which encompasses the idea that explanations should align with human knowledge and intuition. On one hand, this includes human expectations towards the result (explanans), e.g., “The model focuses on what a human would focus on”. On the other hand, the XAI methods' behavior (explanation) should also be aligned with human intuition, e.g., “The outputs for individual inputs should differ”.
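The Localization idea, comparing the explanans against a human-annotated ground-truth region, can be sketched as an intersection-over-union between binarized masks. This is a deliberate simplification under our own naming (mask_iou is a hypothetical helper), not a specific metric from the Quantus library.

```python
def mask_iou(expl_mask, human_mask):
    """Intersection-over-union between a binarized explanans and a
    human-annotated ground-truth mask. 1.0 means perfect agreement."""
    inter = sum(1 for e, h in zip(expl_mask, human_mask) if e and h)
    union = sum(1 for e, h in zip(expl_mask, human_mask) if e or h)
    # Two empty masks agree trivially.
    return inter / union if union else 1.0
```

For example, an explanation mask [1, 1, 0, 0] against a human mask [1, 0, 0, 0] yields an IoU of 0.5: the explanans covers the expected region but also highlights one additional feature.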

Coverage

The explanation should provide an explanans for every explanandum.

The extent to which an explanation or explanans can be applied is considered by two frameworks. Unfortunately, definitions from both surveys are vague. [Markus et al. (2021), Zhou et al. (2021)] define the Broadness of an explanation as “how generally applicable” it is, without further elaboration on the implications of this definition. More concretely, Representativeness is presented by [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020)]. It reflects the number of explananda that are covered by an individual explanans, although this definition focuses mainly on the distinction between global and local explanation methods.
To add more clarity to these definitions, we include Coverage with an alternative definition. It denotes the number of explananda that are covered by the explanation, i.e., it reflects whether there exists an explanans for every data input or output.
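Under this definition, Coverage reduces to the fraction of explananda for which the method produced an explanans at all. A minimal sketch, assuming missing explanantia are represented as None:

```python
def coverage(explanantia):
    """Fraction of explananda for which an explanans exists.

    `explanantia` is one entry per explanandum; `None` marks inputs
    the explanation method could not (or did not) explain.
    """
    return sum(e is not None for e in explanantia) / len(explanantia)
```

A method that fails to explain one of four inputs thus has a Coverage of 0.75.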

Fidelity

The explanation should make the explanans reflect the model's true reasoning.

Fidelity is one of the most frequently discussed concepts in the literature and combines two closely related aspects: Correctness and Completeness. While some works introduce these as separate desiderata, others group them together under the umbrella of Fidelity.
Correctness refers to whether the explanation truthfully represents the internal logic and decision process of the black-box model. It is one of the most frequently emphasized desiderata across the reviewed frameworks. Without correctness, even the most interpretable or simple explanation may provide no meaningful insight. Terms like Faithfulness, Truthfulness, and Fidelity are often used interchangeably in literature to describe this idea. Correctness encompasses both local fidelity for individual explanantia and global alignment across the dataset [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020)]. The general consensus is that an explanation should reveal what truly drives the model's outputs [Alvarez-Melis and Jaakkola (2018a), Markus et al. (2021), Zhou et al. (2021), Belaid et al. (2022), Alangari et al. (2023b), Nauta et al. (2023)]. It is commonly assessed by how well the explanation reflects or mimics the model's behavior [Andrews et al. (1995), Johansson et al. (2004), Guidotti et al. (2018)].
Completeness, in contrast, describes how much of the model's reasoning is captured by the explanation. According to the Co-12 properties by [Nauta et al. (2023)], an explanation should ideally include the full scope of the model's rationale. Some authors treat Completeness as a sub-aspect of Fidelity [Markus et al. (2021), Zhou et al. (2021)], while others define Fidelity itself as the capacity to capture all of the information embodied in the model [Andrews et al. (1995), Johansson et al. (2004), Guidotti et al. (2018)].
Although it is theoretically possible to have an explanation that is partially correct but incomplete (e.g., providing a heatmap that highlights only one of several relevant features), or complete but partially incorrect (e.g., including all the right features alongside irrelevant ones), neither scenario is desirable. If key features are missing or irrelevant ones are included, the explanans ultimately misrepresents the model's behavior. While Correctness and Completeness can be distinguished conceptually, they are tightly interwoven in practice and difficult to evaluate in isolation. Since our desiderata are intended to capture orthogonal evaluation dimensions, and these two cannot be meaningfully disentangled, we combine them under the unified criterion of Fidelity.
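One common functionality-grounded way to probe Fidelity is single-feature ablation: if the explanans truly reflects the model's reasoning, features with higher attribution should cause larger prediction changes when removed. The sketch below (the name faithfulness_correlation and the zero baseline are our own illustrative choices) correlates each feature's attribution with the observed prediction drop; it assumes non-constant drops and attributions.

```python
def faithfulness_correlation(predict, x, attributions, baseline=0.0):
    """Pearson correlation between each feature's attribution and the
    prediction drop when that feature is replaced by `baseline`.
    High correlation suggests the explanans reflects the model's reasoning."""
    base_pred = predict(x)
    drops = []
    for j in range(len(x)):
        x_abl = list(x)
        x_abl[j] = baseline  # ablate feature j
        drops.append(base_pred - predict(x_abl))
    n = len(drops)
    ma = sum(attributions) / n
    md = sum(drops) / n
    cov = sum((a - ma) * (d - md) for a, d in zip(attributions, drops))
    sa = sum((a - ma) ** 2 for a in attributions) ** 0.5
    sd = sum((d - md) ** 2 for d in drops) ** 0.5
    return cov / (sa * sd)
```

For a linear model, attributions equal to the weighted inputs correlate perfectly (score 1.0) with the ablation drops, while random attributions would not.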

Continuity

The explanation should ensure that similar explananda yield similar explanantia.

Just as the robustness or stability of a standard AI model is of great interest, similar expectations apply to explainability. Most frameworks highlight this desideratum, and various metrics have been proposed to assess how stable or reliable explanations are. However, the terminology used in literature is inconsistent, at times overlapping and at other times diverging in meaning.
[Nauta et al. (2023)] introduce the term Continuity as the smoothness of the explanation, i.e., similar explananda should yield similar explanantia. Others refer to this idea as Stability [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020)], describing it as the resilience against slight variations in input features that do not alter the model's prediction. The term Robustness is used by [Alvarez-Melis and Jaakkola (2018a), Jesus et al. (2021), Alangari et al. (2023b)] to describe the same behavior, which they regard as a key requirement for trustworthy XAI.
The Quantus toolkit reflects the prevalence of this concept, providing a “Robustness” metric category [Hedström et al. (2023), Bommer et al. (2024)], which assesses the similarity of explanantia under minor changes in input. Finally, [Belaid et al. (2022)] cover the same idea under the term Stability. In addition, they assess Fragility, which they define as the resilience of explanations against malicious manipulation, such as adversarial attacks.
Our Continuity desideratum covers both of the mentioned properties. It includes the smoothness of explanations with respect to “naïve” changes in the explanandum that ideally do not affect the model's behavior, as well as the resilience of explanations against malicious manipulation attempts. Note that this includes changes to the input data as well as to the model. We decided to adopt the term Continuity instead of Stability or Robustness to reduce the possible confusion with model robustness.
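The “naïve” half of Continuity is often operationalized as a max-sensitivity style quantity: perturb the explanandum slightly and record the largest change in the explanans. The following is a minimal sketch under our own naming (max_sensitivity, uniform perturbations of a fixed radius), not the exact formulation of any cited metric.

```python
import random

def max_sensitivity(explain, x, radius=0.01, n_samples=10, seed=0):
    """Largest Euclidean change in the explanans under small random
    perturbations of the input. Lower values mean a smoother explanation."""
    rng = random.Random(seed)
    e_ref = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        x_pert = [xi + rng.uniform(-radius, radius) for xi in x]
        e_pert = explain(x_pert)
        dist = sum((a - b) ** 2 for a, b in zip(e_ref, e_pert)) ** 0.5
        worst = max(worst, dist)
    return worst
```

A constant explainer scores 0.0; for a well-behaved explainer the score stays on the order of the perturbation radius.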

Consistency

The explanation should produce stable explanantia across repeated evaluations.

While Continuity investigates the smoothness and similarity between similar but different inputs, the Consistency of explanations for identical inputs also needs to be considered. However, few authors explicitly consider this desideratum.
Consistency is introduced by [Nauta et al. (2023)] as a direct measure of the determinism of an XAI algorithm. Similarly, one part of the definition of Stability by [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020)] considers variations in explanations based on non-determinism. The oldest formulation of Consistency is given by [Andrews et al. (1995)] and considers explanation methods to be consistent when they produce equivalent results under repetition.
However, several frameworks additionally consider the similarity of explanantia generated from different models trained on the same data [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020), Nauta et al. (2023)]. Yet, different models can produce the same prediction while relying on entirely different internal reasoning. This is especially true as there are often multiple valid reasons for the same event, also known as the Rashomon Effect [Breiman (2001), Leventi-Peetz and Weber (2022)].
We include Consistency using the initial formulations, i.e., explanations should be deterministic or self-consistent, always presenting the same explanans for identical explananda. While the latter definition is present in one of the identified metrics, we do not explicitly add it to the definition of our Consistency desideratum, as we do not believe that different explananda (inputs), i.e., different models, necessarily result in identical explanantia (outputs).
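In this narrow formulation, Consistency can be checked directly by calling the explanation method repeatedly on the identical explanandum and comparing the resulting explanantia. A minimal sketch (the helper name is_consistent and the tolerance are our own):

```python
def is_consistent(explain, x, n_runs=5, tol=1e-9):
    """True if repeated calls on the identical explanandum yield the
    same explanans (up to numerical tolerance), i.e., the method is
    deterministic in practice."""
    reference = explain(x)
    for _ in range(n_runs - 1):
        candidate = explain(x)
        if any(abs(a - b) > tol for a, b in zip(reference, candidate)):
            return False
    return True
```

Sampling-based explainers (e.g., those drawing random perturbations without a fixed seed) typically fail this check unless their randomness is controlled.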

Efficiency

The explanation should compute the explanans efficiently and be broadly applicable.

Finally, out of practical considerations, we want explanations to be conveniently applicable. This includes considering the range of models or situations in which the algorithm can be effectively applied. Simultaneously, it also includes the time it takes to compute an individual explanans.
The first property is introduced as Portability and Translucency throughout the literature [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020)]. Portability is the variety of models for which an explanation can be used, while Translucency is the necessity of the explanation algorithm to have access to the internals of the model. Similarly, [Johansson et al. (2004)] measure Generality, given by the restrictions or overhead necessary to apply an explanation to specific models. [Belaid et al. (2022)] refer to Portability as the diverse set of models to which the explanation can be applied.
Secondly, the Algorithmic Complexity [Robnik-Šikonja and Bohanec (2018), Carvalho et al. (2019), Molnar (2020)] considers the time it takes to generate an explanans. Naturally, the amount of necessary time depends not only on the inherent complexity of the explanation algorithm, but also on the Scalability, i.e. its ability to efficiently handle larger models and input spaces [Johansson et al. (2004)]. Using the “Stress test”, [Belaid et al. (2022)] explicitly evaluate the runtime behavior with respect to increasing input size.
We subsume both of these aspects under a general desideratum called Efficiency. It includes the algorithmic or computational properties of the explanation, which might influence the choice of a specific XAI algorithm over another.
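The runtime aspect of Efficiency can be probed with straightforward wall-clock measurement, in the spirit of the “Stress test” of [Belaid et al. (2022)] when repeated over increasing input sizes. The sketch below (the helper explanation_runtime is our own illustrative wrapper) times a single explanans computation.

```python
import time

def explanation_runtime(explain, x, repeats=3):
    """Median wall-clock time in seconds to compute one explanans.
    Repeating and taking the median reduces measurement noise."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        explain(x)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]
```

Plotting this quantity against input or model size gives a simple empirical view of the algorithm's scalability.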

Explanation Type

We categorize VXAI metrics based on the accepted input. Apart from a few exceptions, all metrics are agnostic to the underlying black-box model or data format (e.g., tabular or image). Therefore, we do not consider this dimension separately.

Feature Attributions (FA)

A FA explanation returns a vector $e \in \mathbb{R}^d$, which usually (but not necessarily) has the same dimension as the input $x$. Each dimension represents an input feature, e.g., column in tabular data, (super-)pixel in images, or node in graphs. The value $e_j \in e$ then represents the relevance of the given feature towards the explained prediction. Depending on the underlying explanation algorithm, values can be positive or negative, and they may be inherently bounded to a given range or unbounded. Some FA methods assign continuous importance scores to each feature, while others produce binary or thresholded outputs that identify a subset of important features. Furthermore, Saliency Maps (which highlight important regions in the input, typically used in image-based tasks) present a special case of FAs, as features are not completely independent but exhibit spatial relationships. Usually, FAs are used as local explanations [Bach et al. (2015), Ribeiro et al. (2016), Lundberg (2017), Shrikumar et al. (2017), Sundararajan et al. (2017)], assigning feature importance for a single input prediction. However, they can also be global, e.g., showing the global impact of features [Lundberg (2017), Molnar (2020)].
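A very simple way to obtain such a vector $e$ is single-feature occlusion: the relevance of feature $j$ is the prediction change when $x_j$ is replaced by a baseline value. This toy sketch (our own illustration, not one of the cited methods) shows the structure of an FA explanans:

```python
def occlusion_attribution(predict, x, baseline=0.0):
    """Toy feature attribution: e_j is the prediction change when x_j
    is replaced by `baseline`. Returns one relevance per input feature."""
    base = predict(x)
    e = []
    for j in range(len(x)):
        x_occ = list(x)
        x_occ[j] = baseline  # occlude feature j
        e.append(base - predict(x_occ))
    return e
```

For a linear model the result recovers the weighted inputs; for nonlinear models it is only a crude local approximation, which is why the more refined FA methods cited above exist.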

Concept Explanations (CE)

Similar to FAs, CEs can also be seen as a vector of importance scores. However, they are conceptually different as they represent a higher-level idea, above individual features, e.g., visual patterns rather than individual pixels. Therefore, there are usually fewer concepts than input features. Further, concepts are usually extracted from some intermediary representation or embedding inside the black-box model. While low-level features in FA are usually only meaningful for a given input (e.g., pixel), concepts are meaningful on their own, and the same concept can be, and usually is, present in multiple inputs. Hence, CEs strike a middle ground between local and global explanations, where the concepts themselves and their average contribution to predictions constitute a global explanation, while the detection of concepts in a single input is a local explanation [Kim et al. (2018)].

Example Explanations (ExE)

ExEs are located in the input space, and therefore may consist of any data type, including tabular data, images, graphs, and more. While the most prominent type of ExEs are Counterfactuals (minimally altered inputs that lead to a different prediction [Wachter et al. (2017)]), other types such as Prototypes (typical examples representing a class [Kim et al. (2016)]) or “Factuals” (altered instances that lead to the same prediction [Dhurandhar et al. (2019)]) also belong to this category. ExEs may consist of a single explaining instance or can be a list of instances. Local ExEs are common in the form of Counterfactuals, showing what would need to change to alter the individual prediction of a model to a desired outcome [Wachter et al. (2017), Karimi et al. (2020), Mothilal et al. (2020), Verma et al. (2024)]. Conversely, a global ExE could be a list of class prototypes or influential training samples [Kim et al. (2016), Koh and Liang (2017), Molnar (2020)].

White-Box Surrogates (WBS)

WBSs aim to approximate the underlying black-box model. They achieve this through a surrogate model which is considered interpretable itself and therefore can serve as the explanans. This includes reconstructions of the full black-box model over the entire data space [Craven and Shavlik (1995), Friedman and Popescu (2008)] but also local surrogates, which only serve as explanans for a given subset of instances [Ribeiro et al. (2016), Ribeiro et al. (2018)]. Common types of inherently interpretable WBSs are decision trees, rule sets, and linear models.

Natural Language Explanations (NLE)

NLEs have been explored well before the rise of Large Language Models (LLMs), particularly through joint training setups that generate textual justifications alongside model predictions [Ras et al. (2022)]. However, the growing capabilities of LLMs have made NLEs increasingly prominent. Leveraging such models, the explanans may be generated alongside the prediction [Camburu et al. (2018), Wei et al. (2022)] or post-hoc, potentially using a separate model [Bills et al. (2023)]. A more classical approach is template-based NLEs, where predefined building blocks are selected based on the results of other explanation methods (e.g., FAs) [Lucieri et al. (2022), Das et al. (2023)]. However, since these template-based approaches merely wrap an existing explanans in textual form, we do not consider them to be genuine NLEs. Instead, we propose to evaluate the underlying explanantia and explanations directly.
Notably, similar to the formulation of desiderata, our categorization scheme based on explanation types can be extended horizontally. We note that there is an overlap between different categories, as they represent both the final given explanans and the explanation process. LIME [Ribeiro et al. (2016)] is a typical example, as its explanation fits a local WBS to generate a corresponding FA explanans. Similarly, there is a connection between ExEs and other types, as WBSs can be leveraged to generate counterfactuals [Pornprasit et al. (2021)], as can FAs [Ge et al. (2021b), Albini et al. (2022)]. Rather than being a limitation, this overlap benefits our framework: it enables metrics designed for one explanation type to be applicable to others, facilitating broader reuse and comparison.

Contextuality

We propose to distinguish metrics based on their evaluation context, which defines how strongly they depend on or intervene in the underlying model or data. We identify five levels, each introducing progressively deeper contextual interaction.

Level I: Explanans-Centric
Evaluates only the explanans in relation to the raw input instance, fully independent of the model.
Level II: Model Observation
Relies on access to model outputs or internal activations to assess behavior.
Level III: Input Intervention
Perturbs input data and observes resulting changes in predictions or explanantia.
Level IV: Model Intervention
Alters the model itself, e.g., by retraining or parameter randomization.
Level V: A Priori Constrained
Requires specific data, architectures, or experimental setups.