In Flanders, only about 13,000 people can communicate in Flemish Sign Language (Vlaamse Gebarentaal, VGT). For many of them, VGT is their first language.
Since most hearing people do not understand sign language, signers and non-signers mostly communicate through interpreters or through written language. Neither is practical for day-to-day interaction or for getting to know each other informally: interpreters are only available by appointment and must be paid, and not all signers are equally fluent in written communication.
If each person could communicate in the language they are most comfortable with, interaction would become a lot easier. In the European SignON project, we leverage machine learning and AI to automatically translate between different European sign languages and different spoken or written languages. While automatic translation from or into sign languages for open, informal communication is still a far-off goal, we believe that a first step, in the form of communication in specific use cases or scenarios, is feasible within the next few years.
The development of the SignON platform is user-driven, with strong participation of native signers and the deaf communities of different countries. For the technical development, IDLab-AIRO investigates the use of deep learning techniques to build a sign language recognition and understanding system. Other partners contribute the translation and language generation components.
Sign languages are complex visual languages with specific properties that are not present in spoken or written languages. To understand these properties and incorporate them into our model development, several experts in sign language linguistics are also involved in the project.
From a deep learning perspective, another difficulty is that only very small labeled datasets are available, at least in comparison to those for speech recognition and natural language processing on written text. Furthermore, sign languages have their own grammar and dialects. This makes sign language recognition and translation a very challenging and very exciting problem from the perspective of data efficiency.
At IDLab-AIRO, our current research builds on the pioneering work of one of our alumni, Dr. Lionel Pigou. His research into the use of convolutional neural networks for sign language recognition is still highly cited in the domain today.
Our goal is to use domain and task knowledge to increase the performance of sign language recognition models to the point of usability.
Publications
Frozen pretrained transformers for neural sign language translation
De Coster, Mathieu, D’Oosterlinck, Karel, Pizurica, Marija, Rabaey, Paloma, Verlinden, Severine, Van Herreweghe, Mieke, and Dambre, Joni
In Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)
2021
One of the major challenges in sign language translation from a sign language to a spoken language is the lack of parallel corpora. Recent works have achieved promising results on the RWTH-PHOENIX-Weather 2014T dataset, which consists of over eight thousand parallel sentences between German sign language and German. However, from the perspective of neural machine translation, this is still a tiny dataset. To improve the performance of models trained on small datasets, transfer learning can be used. While this has been previously applied in sign language translation for feature extraction, to the best of our knowledge, pretrained language models have not yet been investigated. We use pretrained BERT-base and mBART-50 models to initialize our sign language video to spoken language text translation model. To mitigate overfitting, we apply the frozen pretrained transformer technique: we freeze the majority of parameters during training. Using a pretrained BERT model, we outperform a baseline trained from scratch by 1 to 2 BLEU-4. Our results show that pretrained language models can be used to improve sign language translation performance and that the self-attention patterns in BERT transfer in zero-shot to the encoder and decoder of sign language translation models.
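To make the frozen pretrained transformer idea concrete, the sketch below shows one possible way to wrap a Hugging Face BERT encoder around pre-extracted sign language video features: almost all BERT parameters are frozen, while the layer norms, a new input projection, and a new output head remain trainable. The class name, the dimensions, and the exact set of trainable parameters are illustrative assumptions, not necessarily the configuration used in the paper.

```python
# Minimal sketch of the frozen pretrained transformer (FPT) idea with a
# Hugging Face BERT model. Which parameters stay trainable here (layer norms,
# a new input projection, a new output head) is an illustrative assumption.
import torch.nn as nn
from transformers import BertModel


class FrozenBertSignEncoder(nn.Module):
    def __init__(self, feature_dim=1024, hidden_dim=768, vocab_size=2000):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Freeze everything, then re-enable only the layer norm parameters.
        for name, param in self.bert.named_parameters():
            param.requires_grad = "LayerNorm" in name
        # New, trainable layers around the frozen transformer body.
        self.input_proj = nn.Linear(feature_dim, hidden_dim)   # video features -> BERT width
        self.output_head = nn.Linear(hidden_dim, vocab_size)   # BERT width -> target vocabulary

    def forward(self, video_features, attention_mask=None):
        # video_features: (batch, frames, feature_dim) pre-extracted per-frame embeddings
        embeddings = self.input_proj(video_features)
        outputs = self.bert(inputs_embeds=embeddings, attention_mask=attention_mask)
        return self.output_head(outputs.last_hidden_state)
```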
Gesture and sign language recognition with temporal residual networks
In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW 2017)
2017
Isolated sign recognition from RGB video using pose flow and self-attention
In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
2021
Automatic sign language recognition lies at the intersection of natural language processing (NLP) and computer vision. The highly successful transformer architectures, based on multi-head attention, originate from the field of NLP. The Video Transformer Network (VTN) is an adaptation of this concept for tasks that require video understanding, e.g., action recognition. However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition, the VTN cannot reach its full potential in this domain. In this work, we reduce the impact of this data limitation by automatically pre-extracting useful information from the sign language videos. In our approach, different types of information are offered to a VTN in a multi-modal setup. This information includes per-frame human pose keypoints (extracted by OpenPose) to capture the body movement and hand crops to capture the (evolution of) hand shapes. We evaluate our method on the recently released AUTSL dataset for isolated sign recognition and obtain 92.92% accuracy on the test set using only RGB data. For comparison: the VTN architecture without hand crops and pose flow achieved 82% accuracy. A qualitative inspection of our model hints at further potential of multi-modal multi-head attention in a sign language recognition context.
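As an illustration of such a multi-modal setup, the sketch below fuses per-frame pose keypoints with pre-extracted hand-crop features and passes the fused sequence through a self-attention encoder, which is the general pattern described above. All layer sizes, the concatenation-based fusion, and the class count are assumptions for illustration and do not reproduce the published architecture.

```python
# Hypothetical multi-modal sign recognition model: pose keypoints and hand-crop
# features are embedded separately, fused per frame, and encoded over time with
# self-attention. Sizes and fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class MultiModalSignClassifier(nn.Module):
    def __init__(self, num_keypoints=55, crop_feat_dim=512, d_model=256, num_classes=226):
        super().__init__()
        # Pose stream: flattened (x, y) keypoints per frame.
        self.pose_embed = nn.Linear(num_keypoints * 2, d_model)
        # Hand stream: pre-extracted CNN features for the left and right hand crops
        # (the CNN itself is omitted in this sketch).
        self.hand_embed = nn.Linear(2 * crop_feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=8, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, pose, hand_feats):
        # pose: (batch, frames, num_keypoints * 2); hand_feats: (batch, frames, 2 * crop_feat_dim)
        fused = torch.cat([self.pose_embed(pose), self.hand_embed(hand_feats)], dim=-1)
        encoded = self.temporal_encoder(fused)        # self-attention over time
        return self.classifier(encoded.mean(dim=1))   # average over frames, then classify
```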
Leveraging frozen pretrained written language models for neural sign language translation
Information
2022
We consider neural sign language translation: machine translation from signed to written languages using encoder-decoder neural networks. Translating sign language videos to written language text is especially complex because of the difference in modality between source and target language and, consequently, the required video processing. At the same time, sign languages are low-resource languages, their datasets dwarfed by those available for written languages. Recent advances in written language processing and success stories of transfer learning raise the question of how pretrained written language models can be leveraged to improve sign language translation. We apply the Frozen Pretrained Transformer (FPT) technique to initialize the encoder, decoder, or both, of a sign language translation model with parts of a pretrained written language model. We observe that the attention patterns transfer in zero-shot to the different modality and, in some experiments, we obtain higher scores (from 18.85 to 21.39 BLEU-4). Especially when gloss annotations are unavailable, FPTs can increase performance on unseen data. However, current models appear to be limited primarily by data quality and only then by data quantity, limiting potential gains with FPTs. Therefore, in further research, we will focus on improving the representations used as inputs to translation models.
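The snippet below illustrates one of the configurations mentioned above: reusing only the decoder of a pretrained mBART-50 checkpoint and freezing most of its weights. The choice of which modules stay trainable, and the way the decoder would be attached to a video encoder, are assumptions for illustration rather than the exact setup reported in the article.

```python
# Rough illustration of the "pretrained decoder" FPT configuration: load a
# pretrained mBART-50 model, keep only its decoder, and freeze everything
# except the layer norm parameters. The selection of trainable modules is an
# assumption, not the paper's exact recipe.
from transformers import MBartForConditionalGeneration

mbart = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
decoder = mbart.model.decoder  # reuse only the pretrained decoder stack

for name, param in decoder.named_parameters():
    # Keep layer norms trainable; freeze self-attention and feed-forward weights.
    param.requires_grad = "layer_norm" in name or "layernorm" in name.lower()

# A sign language video encoder trained from scratch would then feed its outputs
# into this decoder through cross-attention, e.g.
# decoder(input_ids=target_tokens, encoder_hidden_states=video_encoder_outputs).
```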
Machine translation from signed to spoken languages: state of the art and challenges
Universal Access in the Information Society
2024
Automatic translation from signed to spoken languages is an interdisciplinary research domain at the intersection of computer vision, machine translation (MT), and linguistics. While the domain is growing in popularity, with the majority of scientific papers on sign language (SL) translation published in the past five years, research in this domain is performed mostly by computer scientists in isolation. This article presents an extensive and cross-domain overview of the work on SL translation. We first give a high-level introduction to SL linguistics and MT to illustrate the requirements of automatic SL translation. Then, we present a systematic literature review of the state of the art in the domain. Finally, we outline important challenges for future research. We find that significant advances have been made on the shoulders of spoken language MT research. However, current approaches often lack linguistic motivation or are not adapted to the different characteristics of SLs. We explore challenges related to the representation of SL data, the collection of datasets and the evaluation of SL translation models. We advocate for interdisciplinary research and for grounding future research in linguistic analysis of SLs. Furthermore, the inclusion of deaf and hearing end users of SL translation applications in use case identification, data collection, and evaluation, is of utmost importance in the creation of useful SL translation models.
Sign language recognition with transformer networks
In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)
2020
Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid research into sign languages and sign language recognition. Previous research has approached sign language recognition in various ways, using feature extraction techniques or end-to-end deep learning. In this work, we apply a combination of feature extraction using OpenPose for human keypoint estimation and end-to-end feature learning with Convolutional Neural Networks. The proven multi-head attention mechanism used in transformers is applied to recognize isolated signs in the Flemish Sign Language corpus. Our proposed method significantly outperforms the previous state of the art of sign language recognition on the Flemish Sign Language corpus: we obtain an accuracy of 74.7% on a vocabulary of 100 classes. Our results will be implemented as a suggestion system for sign language corpus annotation.
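As a small companion example, the sketch below shows how per-frame OpenPose JSON output could be converted into a keypoint sequence suitable as input for a transformer-based isolated sign classifier. The file naming pattern and the decision to keep only the first detected person are assumptions about a typical OpenPose export, not details from the paper.

```python
# Sketch of turning per-frame OpenPose JSON files into a keypoint sequence
# array. Directory layout and the "first detected person" heuristic are
# assumptions about a typical OpenPose export.
import json
from pathlib import Path

import numpy as np


def load_pose_sequence(keypoint_dir):
    """Stack the (x, y) body keypoints of the first person across frames."""
    frames = []
    for json_path in sorted(Path(keypoint_dir).glob("*_keypoints.json")):
        data = json.loads(json_path.read_text())
        if not data["people"]:
            continue  # no person detected in this frame
        kp = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(-1, 3)
        frames.append(kp[:, :2])  # drop the per-keypoint confidence score
    return np.stack(frames)  # shape: (num_frames, num_keypoints, 2)
```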