Abstract:
Deep neural networks for natural language processing tasks are vulnerable to adversarial input perturbations. Existing work improves robustness against specific perturbations, such as token substitutions, but does not consider more general perturbations such as token insertions, deletions, and swaps. To fill this gap, we present a technique for training models that are robust to user-defined string transformations. Our technique combines data augmentation, which detects worst-case transformed inputs, with verifiable training using abstract interpretation, which further increases the model's robustness on those worst-case transformed inputs. We use our technique to train models on the AG and SST2 datasets and show that the resulting models are robust to combinations of user-defined transformations that mimic spelling mistakes and other meaning-preserving edits.
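
Below is a minimal, hypothetical PyTorch sketch of the combined objective the abstract describes: an augmentation term computed on a sampled worst-case transformed input, plus a certified term computed from abstract (interval) bounds on the logits. Every name here is an illustrative assumption rather than the paper's actual implementation; in particular, `propagate_intervals` stands in for an interval-bound-propagation pass the model is assumed to expose, and the two token-level transformations are stand-ins for arbitrary user-defined ones.

```python
# Hypothetical sketch: augmentation loss + abstract-interpretation (interval) loss.
# None of these names come from the paper; they only illustrate the idea.

import random
import torch
import torch.nn.functional as F

# Two example user-defined string transformations on 1-D token-id tensors.

def delete_token(tokens: torch.Tensor) -> torch.Tensor:
    """Remove one randomly chosen token (mimics a deletion typo)."""
    i = random.randrange(tokens.numel())
    return torch.cat([tokens[:i], tokens[i + 1:]])

def swap_adjacent(tokens: torch.Tensor) -> torch.Tensor:
    """Swap two adjacent tokens (mimics a transposition typo)."""
    i = random.randrange(tokens.numel() - 1)
    out = tokens.clone()
    out[i], out[i + 1] = tokens[i + 1], tokens[i]
    return out

TRANSFORMATIONS = [delete_token, swap_adjacent]

def worst_case_augment(model, tokens, label, n_samples=8):
    """Data augmentation: sample transformed inputs, keep the highest-loss one.

    Assumes `model` maps a batch of token-id sequences to logits.
    """
    worst, worst_loss = tokens, -float("inf")
    for _ in range(n_samples):
        candidate = random.choice(TRANSFORMATIONS)(tokens)
        loss = F.cross_entropy(model(candidate.unsqueeze(0)), label).item()
        if loss > worst_loss:
            worst, worst_loss = candidate, loss
    return worst

def robust_loss(model, tokens, label, alpha=0.5):
    """Combine the augmentation loss with a certified interval-bound loss."""
    aug = worst_case_augment(model, tokens, label)
    aug_loss = F.cross_entropy(model(aug.unsqueeze(0)), label)

    # Assumed hook: the model propagates interval bounds over the whole
    # perturbation space and returns elementwise logit bounds (lo, hi).
    lo, hi = model.propagate_intervals(tokens.unsqueeze(0))

    # Worst-case logits: upper bounds for wrong classes, lower bound for the
    # true class; a small loss here certifies the whole perturbation space.
    worst_logits = hi.clone()
    worst_logits[0, label] = lo[0, label]
    cert_loss = F.cross_entropy(worst_logits, label)

    return alpha * aug_loss + (1 - alpha) * cert_loss
```

Here `alpha` trades off the empirical (augmentation) term against the certified term; the actual weighting, and which transformations are handled by augmentation versus by abstraction, are design choices the abstract leaves open.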