A Data Augmentation Approach for Sign-Language-To-Text Translation In-The-Wild

Authors Fabrizio Nunnari, Cristina España-Bonet, Eleftherios Avramidis


Author Details

Fabrizio Nunnari
  • German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus D3.2, Saarbrücken, Germany
Cristina España-Bonet
  • German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus D3.2, Saarbrücken, Germany
Eleftherios Avramidis
  • German Research Center for Artificial Intelligence (DFKI), Berlin, Germany

Cite As

Fabrizio Nunnari, Cristina España-Bonet, and Eleftherios Avramidis. A Data Augmentation Approach for Sign-Language-To-Text Translation In-The-Wild. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 36:1-36:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


In this paper, we describe the main current approaches to sign language translation that use deep neural networks with videos as input and text as output. In our view, their main weakness is a lack of generalization to daily-life contexts. Our goal is to build a state-of-the-art system for the automatic interpretation of sign language under unpredictable video framing conditions. Our main contribution is the shift from image features to landmark positions, which reduces the size of the input data and makes it easier to combine data augmentation techniques for landmarks. We describe the set of hypotheses for building such a system and the list of experiments that will lead to their verification.
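
To illustrate why landmark positions lend themselves to augmentation, here is a minimal sketch and not the authors' implementation. It assumes each video frame has already been reduced to a list of 2D landmarks with coordinates normalized to [0, 1]. The function names, the choice of rotation, scaling and translation as transforms, and the parameter ranges are all hypothetical and chosen only for illustration.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # landmarks extracted from one video frame


def augment_sequence(frames: List[Frame], angle_deg: float, scale: float,
                     shift: Point, center: Point = (0.5, 0.5)) -> List[Frame]:
    """Apply the same rotation, scaling and translation to every landmark.

    Using one transform for the whole clip mimics a change in camera
    framing while leaving the motion of the signer unchanged.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    dx, dy = shift
    out: List[Frame] = []
    for frame in frames:
        new_frame: Frame = []
        for x, y in frame:
            rx, ry = x - cx, y - cy  # coordinates relative to the pivot
            nx = scale * (cos_t * rx - sin_t * ry) + cx + dx
            ny = scale * (sin_t * rx + cos_t * ry) + cy + dy
            new_frame.append((nx, ny))
        out.append(new_frame)
    return out


def random_augment(frames: List[Frame], rng: random.Random,
                   max_angle: float = 15.0,
                   scale_range: Tuple[float, float] = (0.8, 1.2),
                   max_shift: float = 0.1) -> List[Frame]:
    """Draw one random framing change (assumed ranges) and apply it."""
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)
    shift = (rng.uniform(-max_shift, max_shift),
             rng.uniform(-max_shift, max_shift))
    return augment_sequence(frames, angle, scale, shift)


# Toy clip: two frames with two landmarks each.
clip = [[(0.5, 0.5), (0.6, 0.5)], [(0.5, 0.5), (0.6, 0.4)]]
augmented = random_augment(clip, random.Random(0))
```

Because the transform acts on a handful of coordinates per frame rather than on pixels, several such augmentations can be chained cheaply, which is the kind of combination the abstract alludes to.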

Subject Classification

ACM Subject Classification
  • Computing methodologies → Machine learning
  • Human-centered computing → Accessibility technologies
  • Computing methodologies → Computer graphics

Keywords
  • sign language
  • video recognition
  • end-to-end translation
  • data augmentation



