References:
Lakkaraju et al. (2020)
A WBS is only useful if its behavior remains consistent with the black-box model even under slight perturbations to the input.
To assess this, [Lakkaraju et al. (2020)] propose measuring the difference in the metric “Output Faithfulness” between the original and perturbed inputs. Specifically, for each input, a perturbed variant is created (e.g., through noise or adversarial modification). The metric then compares the agreement between the black-box model and the surrogate model before and after the perturbation.
A lower difference indicates a more robust surrogate, as it preserves faithfulness across perturbations. High differences may signal that the surrogate captures only superficial model behavior or overfits to specific input patterns.
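The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the reference implementation from the cited work: it assumes that output faithfulness is measured as the fraction of inputs on which both models predict the same label, that perturbations are zero-mean Gaussian noise, and that the difference is taken as an absolute value. The names `output_faithfulness`, `perturb`, and `faithfulness_difference` are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], int]  # any callable mapping a feature vector to a label


def output_faithfulness(black_box: Model, surrogate: Model,
                        inputs: List[Vector]) -> float:
    """Fraction of inputs on which the surrogate agrees with the black box.

    Assumption: agreement on predicted labels stands in for "Output Faithfulness".
    """
    matches = sum(1 for x in inputs if black_box(x) == surrogate(x))
    return matches / len(inputs)


def perturb(x: Vector, scale: float, rng: random.Random) -> List[float]:
    """Hypothetical perturbation: add zero-mean Gaussian noise to every feature."""
    return [v + rng.gauss(0.0, scale) for v in x]


def faithfulness_difference(black_box: Model, surrogate: Model,
                            inputs: List[Vector], scale: float = 0.1,
                            seed: int = 0) -> float:
    """Absolute change in output faithfulness after perturbing every input once."""
    rng = random.Random(seed)
    perturbed = [perturb(x, scale, rng) for x in inputs]
    original_score = output_faithfulness(black_box, surrogate, inputs)
    perturbed_score = output_faithfulness(black_box, surrogate, perturbed)
    return abs(original_score - perturbed_score)


if __name__ == "__main__":
    # Toy models, purely for illustration.
    def black_box(x: Vector) -> int:
        return int(x[0] + x[1] > 1.0)

    def surrogate(x: Vector) -> int:
        return int(x[0] > 0.5)

    data_rng = random.Random(42)
    data = [[data_rng.random(), data_rng.random()] for _ in range(200)]
    print(faithfulness_difference(black_box, surrogate, data, scale=0.05))
```

A value near 0 means the surrogate agrees with the black-box model about as often on perturbed inputs as on the original ones. Adversarial perturbations could be substituted for `perturb` without changing the rest of the sketch.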

