Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers
Although various techniques have been proposed to generate adversarial samples for white-box attacks on text, little attention has been paid to black-box attacks, which are the more realistic scenario. Our project presents a novel approach, DeepWordBug, that effectively generates small text perturbations in a black-box setting, forcing a deep-learning classifier to misclassify a text input. We introduce novel scoring strategies to identify the tokens that are most important to the classifier's prediction. Simple character-level transformations are then applied to the highest-ranked tokens, minimizing the edit distance of the perturbation while still changing the original classification. We evaluated DeepWordBug on eight real-world text datasets spanning text classification, sentiment analysis, and spam detection, and compared it against two baselines: Random (black-box) and Gradient (white-box). Our experimental results indicate that DeepWordBug reduces classification accuracy by up to 63\% on average for a Word-LSTM model and up to 46\% on average for a Char-CNN model, both state-of-the-art architectures.
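The approach above can be illustrated with a minimal sketch. It assumes a simple proxy scoring rule (drop in the target-class probability when a token is removed, one of several plausible black-box scores) and four character-level edits (swap, substitute, delete, insert); the function names, the toy keyword classifier, and the `budget` parameter are all illustrative inventions, not the authors' actual implementation.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def score_tokens(tokens, target_prob):
    # Black-box importance proxy: score of token i is how much the
    # target-class probability drops when token i is removed.
    base = target_prob(tokens)
    return [base - target_prob(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

def perturb(token, rng):
    # Apply one random character-level edit (swap / substitute / delete /
    # insert), keeping the edit distance of the change to 1-2.
    op = rng.choice(["swap", "sub", "del", "ins"])
    i = rng.randrange(len(token))
    if op == "swap" and len(token) > 1:
        j = min(i, len(token) - 2)
        return token[:j] + token[j + 1] + token[j] + token[j + 2:]
    if op == "sub":
        c = rng.choice([ch for ch in ALPHABET if ch != token[i]])
        return token[:i] + c + token[i + 1:]
    if op == "del" and len(token) > 1:
        return token[:i] + token[i + 1:]
    return token[:i] + rng.choice(ALPHABET) + token[i:]  # insert (fallback)

def deepwordbug_sketch(text, target_prob, budget, seed=0):
    # Score every token, then perturb only the `budget` highest-ranked ones.
    rng = random.Random(seed)
    tokens = text.split()
    scores = score_tokens(tokens, target_prob)
    for i in sorted(range(len(tokens)), key=lambda k: -scores[k])[:budget]:
        tokens[i] = perturb(tokens[i], rng)
    return " ".join(tokens)

def toy_prob(tokens):
    # Stand-in "classifier": positive-class probability is the fraction
    # of tokens drawn from a small positive-keyword set.
    positive = {"great", "excellent", "love"}
    return sum(t in positive for t in tokens) / max(len(tokens), 1)
```

For example, `deepwordbug_sketch("i love this great movie", toy_prob, budget=2)` mangles only the two sentiment-bearing words, so the toy classifier's positive-class probability drops while the rest of the sentence is untouched.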
Committee: Prof. Alfred Weaver (Chair), Prof. Yanjun Qi (Advisor), Prof. Mohammed Mahmoody, and Prof. Hongning Wang