Balanced Adversarial Training: Balancing Tradeoffs Between Oversensitivity and Undersensitivity in NLP Models
Traditional (oversensitive) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. Undersensitive adversarial examples are the opposite---the adversary's goal is to find a small perturbation that changes the true label of an input while preserving the classifier's prediction. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine learning models to oversensitive adversarial examples. However, recent work has shown that using these techniques to improve robustness for image classifiers may make a model more vulnerable to undersensitive adversarial examples. We demonstrate that the same phenomenon applies to NLP models, showing that training methods which improve robustness to synonym-based attacks (oversensitive adversarial examples) tend to increase a model's vulnerability to antonym-based attacks (undersensitive adversarial examples) on both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both over- and undersensitive adversarial examples.
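The contrastive idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact objective, so the triplet-style loss, the function name, and the toy embeddings below are all assumptions chosen for illustration: a synonym-perturbed input (which should keep the model's prediction) is pulled toward the original's embedding, while an antonym-perturbed input (which should change it) is pushed at least a margin away.

```python
import numpy as np

def balanced_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Hypothetical triplet-style contrastive loss for Balanced
    Adversarial Training: penalize the model unless the synonym-perturbed
    (positive) embedding is closer to the anchor than the
    antonym-perturbed (negative) embedding by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to synonym variant
    d_neg = np.linalg.norm(anchor - negative)   # distance to antonym variant
    return max(0.0, d_pos - d_neg + margin)

# Toy 3-d "sentence embeddings" (hypothetical, for illustration only).
anchor  = np.array([1.0, 0.0, 0.0])    # original input
synonym = np.array([0.9, 0.1, 0.0])    # label-preserving perturbation
antonym = np.array([-1.0, 0.0, 0.0])   # label-changing perturbation

# Synonym near, antonym far: the margin is satisfied and the loss is zero.
loss = balanced_contrastive_loss(anchor, synonym, antonym)
```

In a real training loop this term would be added to the standard classification loss, so the model is simultaneously discouraged from being oversensitive to synonym swaps and undersensitive to antonym swaps.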
- Matthew Dwyer, Chair, CS/SEAS/UVA
- David Evans (Co-Advisor), CS/SEAS/UVA
- Yangfeng Ji (Co-Advisor), CS/SEAS/UVA
- Tom Fletcher, CS/SEAS/UVA