Robust Phoneme Recognition with Little Data

Authors: Christopher Dane Shulby, Martha Dais Ferreira, Rodrigo F. de Mello, Sandra Maria Aluisio



File

  • OASIcs.SLATE.2019.4.pdf
  • Filesize: 448 kB
  • Pages: 11

Document Identifiers

  • DOI: 10.4230/OASIcs.SLATE.2019.4

Author Details

Christopher Dane Shulby
  • Institute of Mathematical and Computer Sciences - University of São Paulo, Brazil
  • Samsung SIDI Institute, São Paulo, Brazil
  • www.nilc.icmc.usp.br
Martha Dais Ferreira
  • Institute of Mathematical and Computer Sciences - University of São Paulo, Brazil
Rodrigo F. de Mello
  • Institute of Mathematical and Computer Sciences - University of São Paulo, Brazil
Sandra Maria Aluisio
  • Institute of Mathematical and Computer Sciences - University of São Paulo, Brazil

Acknowledgements

We would like to thank our NILC colleagues for their input and support. Special thanks go to CEMAI; running these experiments was only possible thanks to the Euler supercomputer cluster at the ICMC - University of São Paulo.

Cite As

Christopher Dane Shulby, Martha Dais Ferreira, Rodrigo F. de Mello, and Sandra Maria Aluisio. Robust Phoneme Recognition with Little Data. In 8th Symposium on Languages, Applications and Technologies (SLATE 2019). Open Access Series in Informatics (OASIcs), Volume 74, pp. 4:1-4:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/OASIcs.SLATE.2019.4
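
Since the page's BibTeX export is not reproduced here, the entry below is hand-assembled from the citation above. All field values come from this page's metadata; the citation key merely follows the usual Dagstuhl pattern and is an assumption, as is the exact field layout of the official export.

    % NOTE: citation key assumed from the DOI; the official export may differ.
    @InProceedings{shulby_et_al:OASIcs.SLATE.2019.4,
      author    = {Shulby, Christopher Dane and Ferreira, Martha Dais and de Mello, Rodrigo F. and Aluisio, Sandra Maria},
      title     = {{Robust Phoneme Recognition with Little Data}},
      booktitle = {8th Symposium on Languages, Applications and Technologies (SLATE 2019)},
      series    = {Open Access Series in Informatics (OASIcs)},
      volume    = {74},
      pages     = {4:1--4:11},
      publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
      year      = {2019},
      doi       = {10.4230/OASIcs.SLATE.2019.4}
    }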

Abstract

A common belief in the community is that deep learning requires large datasets to be effective. We show that, with careful parameter selection, deep feature extraction can be applied even to small datasets. We also explore exactly how much data is necessary to guarantee learning, through a convergence analysis and by calculating the shattering coefficient for the algorithms used. Another problem is that state-of-the-art results are rarely reproducible, because they rely on proprietary datasets, pretrained networks, and/or weight initializations from other, larger networks. We present a two-fold contribution for this situation: a carefully designed CNN architecture, together with a knowledge-driven classifier, achieves nearly state-of-the-art phoneme recognition results with absolutely no pretraining or external weight initialization. We also beat the best replication study of the state of the art with a 28% FER (frame error rate). More importantly, we achieve transparent, reproducible frame-level accuracy and, additionally, perform a convergence analysis to show the generalization capacity of the model, providing statistical evidence that our results are not obtained by chance. Furthermore, we show that algorithms with strong learning guarantees can not only benefit from raw data extraction but also contribute more robust results.
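
The shattering-coefficient analysis mentioned above follows the uniform-convergence framework of Vapnik's statistical learning theory. As a minimal sketch of the kind of guarantee involved (illustrative only, not the paper's exact derivation; the constants vary across textbook presentations):

    % Uniform convergence of the empirical risk R_emp(f) to the true risk R(f)
    % over a class F whose shattering coefficient (growth function) is N(F, n):
    P\Big(\sup_{f \in \mathcal{F}} \big|R(f) - R_{\mathrm{emp}}(f)\big| > \varepsilon\Big)
        \;\leq\; 2\,\mathcal{N}(\mathcal{F}, 2n)\, e^{-n\varepsilon^{2}/4}

Learning is guaranteed whenever (1/n) ln N(F, 2n) vanishes as n grows; fixing a confidence level and solving the right-hand side for n gives the number of training frames required, which is the sense in which the abstract quantifies "exactly how much data is necessary".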

Subject Classification

ACM Subject Classification
  • Computing methodologies → Speech recognition
Keywords
  • feature extraction
  • acoustic modeling
  • phoneme recognition
  • statistical learning theory

