Autoencoder Plausibility
Contextuality: I · Desiderata: Plausibility · Explanation Type: ExE
References:
Looveren and Klaise (2021)
Autoencoders are trained to capture the underlying structure of the dataset. Based on this idea, [Looveren and Klaise (2021)] propose evaluating the plausibility of counterfactual examples via reconstruction errors from class-specific and general-purpose autoencoders.
Two specific scores are introduced:

IM1 (Class-specific reconstruction comparison): This metric compares how well a counterfactual $z$ is reconstructed by an autoencoder trained on the target class ($\Phi_{y^*_z}$) versus one trained on the originally predicted class ($\Phi_{\hat{y}_x}$): $\mathrm{IM1} = \frac{\|z - \Phi_{y^*_z}(z)\|_2^2}{\|z - \Phi_{\hat{y}_x}(z)\|_2^2 + \epsilon}$.
A low IM1 score implies that the counterfactual is more representative of the target class than of its original class, indicating class-specific plausibility.
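As a concrete sketch, IM1 can be computed with NumPy once the two class-specific autoencoders are available; here `ae_target` and `ae_original` are hypothetical stand-ins for trained reconstruction models, used only to illustrate the formula:

```python
import numpy as np

def im1(z, ae_target, ae_original, eps=1e-8):
    """IM1: ratio of squared reconstruction errors under the
    target-class and original-class autoencoders (lower is better)."""
    num = np.sum((z - ae_target(z)) ** 2)    # ||z - Phi_{y*_z}(z)||_2^2
    den = np.sum((z - ae_original(z)) ** 2)  # ||z - Phi_{hat y_x}(z)||_2^2
    return num / (den + eps)

# Toy example: the target-class AE reconstructs z almost perfectly,
# while the original-class AE reconstructs it poorly, so IM1 is near 0.
z = np.array([1.0, 2.0, 3.0])
score = im1(z, ae_target=lambda x: 0.99 * x, ae_original=lambda x: 0.5 * x)
```

In practice `ae_target` and `ae_original` would be the forward passes of autoencoders trained only on samples of the respective class.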

IM2 (General manifold plausibility): This score compares the reconstructions from a general-purpose autoencoder $\Phi_\mathcal{Y}$ and a class-specific one: $\mathrm{IM2} = \frac{\|\Phi_\mathcal{Y}(z) - \Phi_{y^*_z}(z)\|_2^2}{\|z\|_1 + \epsilon}$.
A low IM2 score implies that the counterfactual aligns well with both the general data manifold and the target class distribution, thus indicating higher plausibility.
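IM2 follows the same pattern, measuring the disagreement between the general and the class-specific reconstructions; `ae_global` and `ae_target` are again hypothetical placeholders for trained models:

```python
import numpy as np

def im2(z, ae_global, ae_target, eps=1e-8):
    """IM2: squared distance between the global and target-class
    reconstructions, scaled by ||z||_1 (lower is better)."""
    num = np.sum((ae_global(z) - ae_target(z)) ** 2)  # ||Phi_Y(z) - Phi_{y*_z}(z)||_2^2
    return num / (np.sum(np.abs(z)) + eps)            # ||z||_1 + eps

# Toy example: the two AEs nearly agree on z, so IM2 is small.
z = np.array([1.0, 2.0, 3.0])
score = im2(z, ae_global=lambda x: 0.98 * x, ae_target=lambda x: 0.97 * x)
```

The $\|z\|_1$ denominator makes the score scale-invariant with respect to the magnitude of the counterfactual, so IM2 values are comparable across instances.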

While these metrics provide valuable insight into how realistic and semantically valid counterfactuals are, they come with caveats. Training multiple autoencoders (e.g., one per class) introduces significant computational overhead. Furthermore, reconstruction quality may vary across classes, and [Hvilshøj et al. (2021)] showed that autoencoders can be sensitive to small perturbations, potentially undermining the consistency of the plausibility assessment.