Tejaswini Ananthanarayana (Rochester Institute of Technology, USA), Lipisha Chaudhary (Rochester Institute of Technology, USA), Ifeoma Nwogu (Rochester Institute of Technology, USA)
Sign language translation without transcription has only recently started to gain attention. In this work, we aim to improve on state-of-the-art translation by introducing a multi-feature fusion architecture with enhanced input features. Because sign language is difficult to segment, we obtain the input features by extracting overlapping, scaled segments across the video and computing their 3D CNN representations. The fusion architecture exploits the attention mechanism in two stages: it first learns dependencies between different frames of the same video, and then fuses the resulting representations to learn the relations between different features from the same video. In addition to 3D CNN features, we also analyze pose-based features. Our method outperforms the state-of-the-art sign language translation model, achieving higher BLEU-3 and BLEU-4 scores, and also outperforms state-of-the-art sequence attention models with a 43.54% increase in BLEU-4 score. We conclude that the combined effects of feature scaling and feature fusion make our model more robust in predicting longer n-grams, which are crucial in continuous sign language translation.
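The extraction of overlapping, scaled segments can be illustrated with a minimal sketch. The abstract does not state the window lengths, the overlap ratio, or how trailing frames are handled, so the values below (windows of 8, 12 and 16 frames with 50% overlap, plus a final window aligned to the end of the video) and the function names are assumptions made for illustration only. Each resulting frame range would then be passed to a 3D CNN to obtain one feature vector per segment.

```python
from typing import Dict, List, Sequence, Tuple


def overlapping_segments(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges of length `window`, taken every `stride` frames."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= 0:
        return []
    if num_frames <= window:
        # Video shorter than one window: use the whole video as a single segment.
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Assumption: add a last window flush with the end so no trailing frames are dropped.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def multi_scale_segments(
    num_frames: int,
    windows: Sequence[int] = (8, 12, 16),  # hypothetical scales, not from the paper
    overlap: float = 0.5,                  # hypothetical overlap ratio
) -> Dict[int, List[Tuple[int, int]]]:
    """Compute overlapping segments at several temporal scales."""
    segments = {}
    for w in windows:
        stride = max(1, int(w * (1.0 - overlap)))
        segments[w] = overlapping_segments(num_frames, w, stride)
    return segments


if __name__ == "__main__":
    for scale, ranges in multi_scale_segments(20).items():
        print(scale, ranges)
```

Using several window lengths gives one feature sequence per scale for the same video, and it is sequences of this kind that a fusion stage could relate to one another.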