Automatic Classification of Portuguese Proverbs

Authors: Jorge Baptista, Sónia Reis




File

OASIcs.SLATE.2022.2.pdf
  • Filesize: 0.74 MB
  • 8 pages

Author Details

Jorge Baptista
  • University of Algarve, Faro, Portugal
  • INESC-ID Lisbon, Portugal
Sónia Reis
  • University of Algarve, Faro, Portugal
  • INESC-ID Lisbon, Portugal

Cite As

Jorge Baptista and Sónia Reis. Automatic Classification of Portuguese Proverbs. In 11th Symposium on Languages, Applications and Technologies (SLATE 2022). Open Access Series in Informatics (OASIcs), Volume 104, pp. 2:1-2:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/OASIcs.SLATE.2022.2

Abstract

In this paper, natural language processing (NLP) and machine learning methods and tools are applied to the task of topic (thematic or semantic) classification of Portuguese proverbs. This is a difficult task, since proverbs are usually very short sentences. Such a classification should make it easier to select the most relevant proverbs for a given situation, considering their context in discourse or within a text. To this end, we used, on the one hand, a collection of more than 32,000 proverbial expressions organized "thematically" into a large set of previously attributed topics (more than 2,200) and, on the other hand, the Orange data mining toolkit, along with the NLP and machine learning tools it provides. Since the classification provided in the collection is, for the most part, based only on a keyword in the body of each proverb, two experiments were set up to determine the feasibility of the task with a modicum of effort and to identify the most promising configurations. Two sample sizes were contrasted, 100 and 50 proverbs randomly selected per topic, corresponding to Scenario 1 and Scenario 2, respectively; several preprocessing strategies were explored, and different data representation methods were tested against several learning algorithms. Results show that Neural Networks is the best-performing model, achieving its best classification accuracy of 70% in Scenario 1 and 61% in Scenario 2. Some of the misclassified cases seem to indicate that the machine learning approach can sometimes do a better job than a human classifier, especially considering the manual attribution of the topics by the collection's author, the sheer number of topics involved, and the very unbalanced distribution of proverbs per topic. Based on the results achieved, the paper presents some proposals for future work to cope with such difficulties.
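
The per-topic random sampling mentioned in the abstract (100 or 50 proverbs per topic) can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' pipeline, which relies on the Orange toolkit. The function name `sample_per_topic`, the fixed seed, the toy data, and the choice to skip topics with fewer than `n` proverbs are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_per_topic(pairs, n, seed=0):
    """Pick n proverbs at random for every topic that has at least n.

    pairs: iterable of (proverb_text, topic) tuples.
    Topics with fewer than n proverbs are skipped (an assumption of this
    sketch, not a detail stated in the abstract).
    """
    by_topic = defaultdict(list)
    for text, topic in pairs:
        by_topic[topic].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        topic: rng.sample(texts, n)
        for topic, texts in sorted(by_topic.items())
        if len(texts) >= n
    }


# Hypothetical placeholder data, NOT taken from the proverb collection.
toy = [
    (f"proverb {i} about {topic}", topic)
    for topic, count in [("water", 5), ("dog", 3), ("bread", 1)]
    for i in range(count)
]

sample = sample_per_topic(toy, n=3, seed=42)
print({topic: len(texts) for topic, texts in sample.items()})
# prints {'dog': 3, 'water': 3}
```

With such a helper, the two scenarios would correspond to calls with `n=100` and `n=50`. How under-populated topics were actually handled is not described in the abstract.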

Subject Classification

ACM Subject Classification
  • Human-centered computing
Keywords
  • Portuguese Proverbs
  • Automatic Topic Classification
  • Machine Learning


References

  1. José João Almeida. Dicionário aberto de calão e expressões idiomáticas [online]. Available at http://natura.di.uminho.pt/~jj/pln/calao/dicionario.pdf, 2014.
  2. Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, and Blaž Zupan. Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 14:2349-2353, 2013.
  3. Ana Lopes. Texto Proverbial Português: elementos para uma análise semântica e pragmática. PhD thesis, Universidade de Coimbra, Coimbra, Portugal, 1992.
  4. José Pedro Machado. O Grande Livro dos Provérbios. Editorial Notícias, Lisboa, 1996.
  5. José Ricardo Marques da Costa. O Livro dos Provérbios Portugueses. Editorial Presença, Lisboa, 1999.
  6. Rui Mendes and Hugo Gonçalo Oliveira. Comparing different methods for assigning Portuguese proverbs to news headlines. In 11th International Conference on Computational Creativity (ICCC'20), pages 153-160, 2020.
  7. António Moreira. Provérbios Portugueses. Editorial Notícias, 1996.
  8. S.A. Noah and F. Ismail. Automatic classifications of Malay proverbs using naïve Bayesian algorithm. Information Technology Journal, 7:1016-1022, 2008.
  9. Sónia Reis. Expressões proverbiais do português: Usos, variação formal e identificação automática. PhD thesis, Universidade do Algarve, Faro, Portugal, 2020.
  10. Sónia Reis and Jorge Baptista. Determinação de um mínimo paremiológico do português europeu. Acta Scientiarum. Language and Culture, 2(42):e52114, 2020.
  11. Sónia Reis, Jorge Baptista, and Nuno Mamede. Provérbios portugueses usuais: distribuição em corpora. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 325-334. Sociedade Brasileira de Computação, 2021.
  12. Fernando Ribeiro de Mello. Nova Recolha e Provérbios Portugueses e outros lugares-comuns. Edições Afrodite, 2nd edition, 1986.
  13. João Rodrigues, António Branco, Steven Neale, and João Silva. LX-DSemVectors: Distributional Semantics Models for Portuguese. In International Conference on Computational Processing of the Portuguese Language (PROPOR 2016), pages 259-270. Springer, 2016.
  14. Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In Ricardo Cerri and Ronaldo C. Prati, editors, Intelligent Systems. BRACIS 2020, pages 403-417. Springer, 2020.