Improving NLTK for Processing Portuguese

Authors: João Ferreira, Hugo Gonçalo Oliveira, Ricardo Rodrigues




File

OASIcs.SLATE.2019.18.pdf
  • Filesize: 450 kB
  • 9 pages

Author Details

João Ferreira
  • Department of Informatics Engineering of the University of Coimbra, Portugal
Hugo Gonçalo Oliveira
  • Centre for Informatics and Systems of the University of Coimbra, Portugal
  • Department of Informatics Engineering of the University of Coimbra, Portugal
Ricardo Rodrigues
  • Centre for Informatics and Systems of the University of Coimbra, Portugal
  • College of Education of the Polytechnic Institute of Coimbra, Portugal

Cite As

João Ferreira, Hugo Gonçalo Oliveira, and Ricardo Rodrigues. Improving NLTK for Processing Portuguese. In 8th Symposium on Languages, Applications and Technologies (SLATE 2019). Open Access Series in Informatics (OASIcs), Volume 74, pp. 18:1-18:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/OASIcs.SLATE.2019.18

Abstract

Python has a growing community of users, especially in the AI and ML fields. Yet the computational processing of Portuguese in this programming language is limited, in both available tools and results. This paper describes NLPyPort, an NLP pipeline in Python, primarily based on NLTK and focused on Portuguese. It is mostly assembled from pre-existing resources or their adaptations, but it improves on the performance of existing alternatives in Python, namely in the tasks of tokenization, PoS tagging, lemmatization and NER.

Subject Classification

ACM Subject Classification
  • Computing methodologies → Natural language processing
Keywords
  • NLP
  • Tokenization
  • PoS tagging
  • Lemmatization
  • Named Entity Recognition
