Improving NLTK for Processing Portuguese

Ferreira, João; Gonçalo Oliveira, Hugo; Rodrigues, Ricardo

doi:10.4230/OASIcs.SLATE.2019.18

Abstract

Python has a growing community of users, especially in the AI and ML fields. Yet computational processing of Portuguese in Python is limited, in both available tools and results. This paper describes NLPyPort, an NLP pipeline in Python, primarily based on NLTK and focused on Portuguese. It is mostly assembled from pre-existing resources or adaptations of them, but it outperforms existing Python alternatives in tokenization, PoS tagging, lemmatization and NER.

Cite As

João Ferreira, Hugo Gonçalo Oliveira, and Ricardo Rodrigues. Improving NLTK for Processing Portuguese. In 8th Symposium on Languages, Applications and Technologies (SLATE 2019). Open Access Series in Informatics (OASIcs), Volume 74, pp. 18:1-18:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/OASIcs.SLATE.2019.18

Author Details

João Ferreira

Department of Informatics Engineering of the University of Coimbra, Portugal

Hugo Gonçalo Oliveira

Centre for Informatics and Systems of the University of Coimbra, Portugal
Department of Informatics Engineering of the University of Coimbra, Portugal

Ricardo Rodrigues

Centre for Informatics and Systems of the University of Coimbra, Portugal
College of Education of the Polytechnic Institute of Coimbra, Portugal

Funding

This work was funded by FCT’s INCoDe 2030 initiative, in the scope of the demonstration project AIA, "Apoio Inteligente a Empreendedores (Chatbots)". We also thank Fábio Lopes for his help with NER based on CRF.
