References:
Krishnan and Wu (2017), Guo et al. (2020)
To evaluate the fidelity of ExEs, the model is retrained on a dataset modified according to the explanation, and the effect of these modifications on the model's predictions is used to assess explanatory quality.
[Guo et al. (2020)] remove the identified influential training points from the dataset, retrain the model, and measure the change in loss for the explanandum. If the removed instances were truly helpful, model performance should degrade. Conversely, if they were misleading, removal should lead to an improvement. Thus, a greater change in loss signals a more correct and complete set of influential instances.
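This removal-and-retrain check can be sketched as follows. The "influential" set below is a placeholder (nearest neighbours of the explanandum); in practice it would come from whatever ExE method is being evaluated, and the model and loss are illustrative choices, not those of Guo et al.

```python
# Hedged sketch of the removal-and-retrain fidelity check.
# The influence scores are a stand-in: any ExE method that
# ranks training points can supply the `influential` indices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

x_test, y_test = X[:1], y[:1]          # the explanandum
X_train, y_train = X[1:], y[1:]

def loss_on_explanandum(Xtr, ytr):
    model = LogisticRegression().fit(Xtr, ytr)
    proba = model.predict_proba(x_test)
    return log_loss(y_test, proba, labels=[0, 1])

# Placeholder "explanation": treat the 20 training points
# nearest to the explanandum as the influential instances.
dists = np.linalg.norm(X_train - x_test, axis=1)
influential = np.argsort(dists)[:20]

base = loss_on_explanandum(X_train, y_train)
keep = np.setdiff1d(np.arange(len(X_train)), influential)
ablated = loss_on_explanandum(X_train[keep], y_train[keep])

# A larger absolute change in loss after removal suggests a
# more correct and complete set of influential instances.
print(f"loss change after removal: {ablated - base:+.4f}")
```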
In an alternative approach, [Krishnan and Wu (2017)] retain the identified influential instances but randomly flip the labels of all remaining training examples before retraining. If the explanation captures all relevant information, the prediction should remain stable. Hence, lower prediction variance after retraining indicates a more complete and accurate explanans.
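The label-flip stability check can be sketched in the same way. Again the "influential" set is a placeholder stand-in for an ExE method's output, and the model, flip probability, and number of retraining trials are illustrative assumptions rather than the exact setup of Krishnan and Wu.

```python
# Hedged sketch of the label-flip stability check: keep the
# influential instances' labels, randomly flip the rest,
# retrain repeatedly, and measure prediction variance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

x_test = X[:1]                         # the explanandum
X_train, y_train = X[1:], y[1:]

# Placeholder explanation: 30 nearest neighbours of x_test.
dists = np.linalg.norm(X_train - x_test, axis=1)
influential = set(np.argsort(dists)[:30])

preds = []
for trial in range(10):
    y_mod = y_train.copy()
    for i in range(len(y_mod)):
        # flip each non-influential label with probability 0.5
        if i not in influential and rng.random() < 0.5:
            y_mod[i] = 1 - y_mod[i]
    model = LogisticRegression().fit(X_train, y_mod)
    preds.append(model.predict_proba(x_test)[0, 1])

# Low variance across retrains suggests the explanans captured
# the information the prediction actually depends on.
print(f"prediction variance: {np.var(preds):.6f}")
```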

