Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing
Interpretability methods such as Integrated Gradients and LIME are popular choices for explaining NLP model predictions with word importance scores. These interpretations need to be robust for trustworthy applications of NLP in high-stakes areas like medicine or finance. Our work demonstrates that an interpretation itself can be fragile even when the prediction is robust. Under simple, minor word perturbations of an input text, Integrated Gradients and LIME produce substantially different explanations, yet the perturbed input receives the same prediction label as the seed input. Because only a few words are swapped, the perturbed input remains semantically and spatially similar to its seed input, so the interpretations should have been similar as well. Empirically, we observe that the average rank-order correlation between a seed input's interpretation and those of its perturbed inputs drops by over 20% when fewer than 10% of words are perturbed, and the correlation keeps decreasing as more words are gradually perturbed. We demonstrate this on four text classification datasets (SST-2, AG-News, IMDb, and Yelp) across two models (DistilBERT and RoBERTa).
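The fragility measurement above rests on rank-order (Spearman) correlation between two word-importance vectors, one from the seed input and one from a perturbed input. A minimal sketch of that comparison, using made-up attribution scores (not numbers from this work) and assuming no tied scores:

```python
def rank(scores):
    """Return 1-based ranks of scores (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos
    return ranks

def spearman(a, b):
    """Spearman rank-order correlation via the classic d^2 formula."""
    n = len(a)
    ra, rb = rank(a), rank(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical word-importance scores over the same token positions,
# e.g. from Integrated Gradients on the seed vs. the perturbed input.
seed_scores      = [0.92, 0.41, 0.08, 0.65, 0.30]
perturbed_scores = [0.15, 0.88, 0.45, 0.10, 0.72]

print(f"rank-order correlation: {spearman(seed_scores, perturbed_scores):.2f}")
# → rank-order correlation: -0.50
```

A correlation near 1 means the two explanations rank words almost identically; a sharp drop, as in this toy example, is the signature of a fragile interpretation.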
- Yangfeng Ji, Chair
- Yanjun Qi, Advisor
- Vicente Ordóñez Román