Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications (Short Paper)

Authors Simone Wills, Yu Bai, Cristian Tejedor-García, Catia Cucchiarini, Helmer Strik



Author Details

Simone Wills
  • Radboud University, Nijmegen, The Netherlands
Yu Bai
  • Radboud University, Nijmegen, The Netherlands
  • NovoLearning, Nijmegen, The Netherlands
Cristian Tejedor-García
  • Radboud University, Nijmegen, The Netherlands
Catia Cucchiarini
  • Radboud University, Nijmegen, The Netherlands
Helmer Strik
  • Radboud University, Nijmegen, The Netherlands


Special thanks go to all the children who participated, their parents, their teachers, and the schools.

Cite As

Simone Wills, Yu Bai, Cristian Tejedor-García, Catia Cucchiarini, and Helmer Strik. Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications (Short Paper). In 12th Symposium on Languages, Applications and Technologies (SLATE 2023). Open Access Series in Informatics (OASIcs), Volume 113, pp. 7:1-7:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Abstract

Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. We evaluated their performance on read and extemporaneous speech of native and non-native Dutch children. We also investigated the utility of using ASR technology to provide insight into the children’s pronunciation and fluency. The results show that recent, pre-trained ASR transformer-based models achieve acceptable performance from which detailed feedback on phoneme pronunciation quality can be extracted, despite the challenging nature of child and non-native speech.

Subject Classification

ACM Subject Classification
  • Human-centered computing → Human computer interaction (HCI)
  • Applied computing → Education
Keywords
  • Automatic Speech Recognition
  • ASR
  • Child Speech
  • Non-Native Speech
  • Human-computer Interaction
  • Whisper
  • Wav2Vec2.0



