Reasoning with Portuguese Word Embeddings

Cunha, Luís Filipe; Almeida, J. João; Simões, Alberto

doi:10.4230/OASIcs.SLATE.2022.17

Abstract

Representing words as semantic distributions is a widely used technique for building machine learning models that perform Natural Language Processing tasks. In this paper, we trained word embedding models on different types of Portuguese corpora and analyzed the influence of the models' parameterization, the corpus size, and the corpus domain. We then validated each model with the classical evaluation methods available: four-word analogies and similarity measurement of word pairs. In addition to these methods, we proposed alternative techniques to validate word embedding models and presented new resources for this purpose. Finally, we discussed the results obtained and some limitations of the evaluation methods for word embedding models.
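The two classical evaluation methods named in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and the common vector-offset formulation of four-word analogies (a is to b as c is to ?); the abstract does not say which formulation the paper uses. The vectors, the names `toy_vectors`, `cosine`, and `solve_analogy`, and the word list are hypothetical and invented for illustration; they do not come from any trained model.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings are learned from a corpus and have far more dimensions.
toy_vectors = {
    "rei":    np.array([0.9, 0.8, 0.1]),  # king
    "rainha": np.array([0.9, 0.1, 0.8]),  # queen
    "homem":  np.array([0.1, 0.9, 0.1]),  # man
    "mulher": np.array([0.1, 0.1, 0.9]),  # woman
    "mesa":   np.array([0.5, 0.5, 0.5]),  # table
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The target point is b - a + c; the answer is the vocabulary word
    closest to it by cosine similarity, excluding the three query words.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Four-word analogy: homem is to rei as mulher is to ?
answer = solve_analogy("homem", "rei", "mulher", toy_vectors)

# Word-pair similarity: a score that a benchmark would compare
# against human similarity ratings for the same pair.
pair_score = cosine(toy_vectors["rei"], toy_vectors["rainha"])
```

In an analogy test set, a model is typically scored by the fraction of questions it answers correctly; in a word-pair test set, by how well its similarity scores agree with human ratings.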

Cite As

Luís Filipe Cunha, J. João Almeida, and Alberto Simões. Reasoning with Portuguese Word Embeddings. In 11th Symposium on Languages, Applications and Technologies (SLATE 2022). Open Access Series in Informatics (OASIcs), Volume 104, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/OASIcs.SLATE.2022.17

Author Details

Luís Filipe Cunha

Department of Informatics, University of Minho, Braga, Portugal

J. João Almeida

Centro ALGORITMI, Departamento de Informática, University of Minho, Braga, Portugal

Alberto Simões

2Ai – School of Technology, IPCA, Barcelos, Portugal

Funding

Almeida, J. João: This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.
Simões, Alberto: This project was funded by Portuguese national funds (PIDDAC), through the FCT – Fundação para a Ciência e Tecnologia and FCT/MCTES under the scope of the project UIDB/05549/2020.

