Abstract:
Sign Language Production (SLP) aims to translate spoken language expressions into sign language expressions, such as a sequence of sign poses or a sign video. Previous SLP works have used an autoregressive approach to learn the relationship between spoken words and sign poses. However, because these approaches decode autoregressively, the decoder tends to regress to the mean and suffers from error propagation. In this work, we propose Non-Autoregressive Sign Language Production with Gaussian space (NSLP-G), a novel SLP model that uses non-autoregressive decoding to generate sign poses. To avoid direct regression, NSLP-G works in two phases. In the first phase, we build a pose generator capable of generating various sign poses in a continuous sign pose space. In the second phase, we use a non-autoregressive Transformer to map the source sentence to the target distribution. To validate our model, we assess the quality of the produced sign poses using Fréchet Gesture Distance, Mean Absolute Error of joint coordinates, and back-translation evaluation. Experimental results show that NSLP-G outperforms the state-of-the-art model on the RWTH-PHOENIX-Weather 2014T dataset.
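
For readers unfamiliar with the two pose-level metrics named above, the following is a minimal sketch of how such quantities are commonly computed, not the evaluation code used in this work. The function names `joint_mae` and `frechet_distance`, the array shapes, and the use of precomputed feature vectors are all assumptions made for illustration. In particular, a Fréchet Gesture Distance is normally computed on features from a trained pose encoder, which is not reproduced here; the sketch only shows the Fréchet distance between two Gaussians fitted to whatever features are supplied.

```python
import numpy as np


def joint_mae(pred, target):
    """Mean absolute error over all frames, joints, and coordinates.

    Assumed shape (hypothetical): (frames, joints, coords) for both inputs.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(np.mean(np.abs(pred - target)))


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Assumed shape (hypothetical): (samples, feature_dim) for both inputs.
    Computes ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)).
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # The product of two PSD matrices has real, non-negative eigenvalues,
    # so the trace of its square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Lower values indicate closer agreement for both quantities: `joint_mae` compares a produced pose sequence with its reference frame by frame, while `frechet_distance` compares the overall distributions of two feature sets.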