Automatic Classification of Portuguese Proverbs

Baptista, Jorge; Reis, Sónia

doi:10.4230/OASIcs.SLATE.2022.2

Abstract

In this paper, natural language processing (NLP) and machine learning methods and tools are applied to the task of topic (thematic or semantic) classification of Portuguese proverbs. This is a difficult task, since proverbs are usually very short sentences. Such classification should make it easier to select the most relevant proverbs for a given situation, considering their context in discourse or within a text. To that end, we used, on the one hand, a collection of more than 32,000 proverbial expressions organized "thematically" into a large set of previously attributed topics (more than 2,200) and, on the other hand, the Orange data mining toolkit, along with the NLP and machine learning tools it provides. Since the classification provided in the collection is, for the most part, based only on a keyword in the body of each proverb, two experiments were set up to determine the feasibility of the task with a modicum of effort and to identify the most promising configurations. Two sample sizes were contrasted, 100 and 50 proverbs randomly selected per topic, corresponding to Scenarios 1 and 2, respectively; several preprocessing strategies were explored, and different data representation methods were tested against several learning algorithms. Results show that the Neural Network model performs best, achieving the highest classification accuracy in both scenarios: 70% in Scenario 1 and 61% in Scenario 2. Some of the misclassified cases seem to indicate that the machine learning approach can sometimes do a better job than a human classifier, especially considering the manual attribution of the topics by the collection’s author, the sheer number of topics involved, and the very unbalanced distribution of proverbs per topic. Based on the results achieved, the paper presents some proposals for future work to cope with such difficulties.
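The per-topic random sampling mentioned in the abstract (100 proverbs per topic in Scenario 1, 50 in Scenario 2) can be illustrated with a minimal sketch in plain Python. This is not the authors' Orange workflow: the function name `sample_per_topic`, the toy data, the fixed seed, and the choice to skip topics with too few proverbs are all assumptions made for illustration, and the sample sizes are scaled down to 3 and 2 so the example stays small.

```python
import random
from collections import defaultdict


def sample_per_topic(pairs, n_per_topic, seed=0):
    """Draw exactly n_per_topic proverbs at random from each topic.

    `pairs` is an iterable of (proverb, topic) tuples.
    Assumption: topics with fewer than n_per_topic proverbs are skipped,
    so every retained topic contributes the same number of examples.
    """
    by_topic = defaultdict(list)
    for proverb, topic in pairs:
        by_topic[topic].append(proverb)

    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = {}
    for topic in sorted(by_topic):
        proverbs = by_topic[topic]
        if len(proverbs) >= n_per_topic:
            sample[topic] = rng.sample(proverbs, n_per_topic)
    return sample


# Hypothetical toy collection: three topics of very different sizes,
# mimicking the unbalanced distribution of proverbs per topic.
toy_collection = (
    [(f"proverb {i} about water", "água") for i in range(5)]
    + [(f"proverb {i} about bread", "pão") for i in range(3)]
    + [("a lone proverb about wind", "vento")]
)

scenario_1 = sample_per_topic(toy_collection, n_per_topic=3)  # larger sample
scenario_2 = sample_per_topic(toy_collection, n_per_topic=2)  # smaller sample
```

Sampling a fixed number of proverbs per topic yields a balanced training set regardless of how unevenly the proverbs are distributed in the collection, at the cost of leaving out topics that are too small.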

Cite As

Jorge Baptista and Sónia Reis. Automatic Classification of Portuguese Proverbs. In 11th Symposium on Languages, Applications and Technologies (SLATE 2022). Open Access Series in Informatics (OASIcs), Volume 104, pp. 2:1-2:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/OASIcs.SLATE.2022.2

Author Details

Jorge Baptista

University of Algarve, Faro, Portugal
INESC-ID Lisbon, Portugal

Sónia Reis

University of Algarve, Faro, Portugal
INESC-ID Lisbon, Portugal

Funding

Baptista, Jorge: This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under project ref. UIDB/50021/2020.

Supplementary Materials

Dataset https://doi.org/10.13140/RG.2.2.22354.02242

