Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages

Chakravarthi, Bharathi Raja; Arcan, Mihael; McCrae, John P.

doi:10.4230/OASIcs.LDK.2019.6

Abstract

Under-resourced languages pose a significant challenge for statistical approaches to machine translation, and it has recently been shown that training data from closely related languages can improve machine translation quality for these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes it difficult to take advantage of these similarities. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration to the Latin script with fine-grained IPA transliteration. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show that transliteration improves the BLEU, METEOR, and chrF scores, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.
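As a rough illustration of what mapping several native scripts into one common representation can look like, the sketch below converts a few Tamil, Telugu, and Kannada characters into a shared Latin form. It is not the tooling used in the paper. It relies on the fact that the Unicode blocks for these three scripts place corresponding letters at the same offset from the start of each block. The tables `CONSONANTS`, `VOWELS`, and `VOWEL_SIGNS` and the function names are hypothetical, cover only a handful of letters, and the Latin spellings are assumptions chosen for readability.

```python
# Start code points of the Unicode blocks for the three scripts.
BLOCK_START = {"tamil": 0x0B80, "telugu": 0x0C00, "kannada": 0x0C80}
BLOCK_SIZE = 0x80

# Illustrative, deliberately incomplete tables: block offset -> Latin string.
CONSONANTS = {0x15: "k", 0x2E: "m", 0x32: "l"}  # KA, MA, LA
VOWELS = {0x05: "a"}                            # independent vowel A
VOWEL_SIGNS = {0x3E: "aa", 0x3F: "i"}           # dependent vowel signs AA, I
VIRAMA = 0x4D                                   # suppresses the inherent vowel


def block_offset(ch):
    """Return the offset of ch inside its script block, or None if outside."""
    cp = ord(ch)
    for start in BLOCK_START.values():
        if start <= cp < start + BLOCK_SIZE:
            return cp - start
    return None


def to_latin(text):
    """Coarse-grained Latin transliteration for the letters in the tables."""
    out = []
    after_consonant = False
    for ch in text:
        off = block_offset(ch)
        if off in CONSONANTS:
            # A bare consonant letter carries an inherent "a".
            out.append(CONSONANTS[off] + "a")
            after_consonant = True
            continue
        if after_consonant and off in VOWEL_SIGNS:
            # A vowel sign replaces the inherent vowel.
            out[-1] = out[-1][:-1] + VOWEL_SIGNS[off]
        elif after_consonant and off == VIRAMA:
            # A virama removes the inherent vowel.
            out[-1] = out[-1][:-1]
        elif off in VOWELS:
            out.append(VOWELS[off])
        else:
            # Unknown characters pass through unchanged.
            out.append(ch)
        after_consonant = False
    return "".join(out)


def same_letters(script, offsets=(0x15, 0x3E, 0x32)):
    """Build the letter sequence KA + sign AA + LA in the given script."""
    return "".join(chr(BLOCK_START[script] + o) for o in offsets)


for script in BLOCK_START:
    print(script, same_letters(script), to_latin(same_letters(script)))
```

The three input strings share no code points, yet all of them map to the same Latin string, which is the property that would let a translation model share vocabulary across related languages. A fine-grained IPA transcription would instead need language-specific pronunciation rules, which this sketch does not attempt.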

