On the Utility of Word Embeddings for Enriching OpenWordNet-PT

Gonçalo Oliveira, Hugo; Aguiar, Fredson Silva de Souza; Rademaker, Alexandre

doi:10.4230/OASIcs.LDK.2021.21

Abstract

The maintenance of wordnets and lexical knowledge bases typically relies on time-consuming manual effort. To reduce this effort, we propose exploiting models of distributional semantics, namely word embeddings learned from corpora, for the automatic identification of relation instances missing in a wordnet. Analogy-solving methods are first used for learning a set of relations from analogy tests focused on each relation. Despite their low accuracy, we noted that a portion of the top-ranked answers are good suggestions of relation instances that could be included in the wordnet. This procedure is applied to the enrichment of OpenWordNet-PT, a public Portuguese wordnet. Relations are learned from data acquired from this resource, and illustrative examples are provided. Results are promising for accelerating the identification of missing relation instances, as we estimate that about 17% of the potential suggestions are good, a proportion that almost doubles when some suggestions are automatically invalidated.
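To make the idea in the abstract more concrete, the following is a minimal sketch of one common analogy-solving strategy: average the vector offset over relation instances already in the wordnet, apply it to a new source word, and rank the remaining vocabulary by cosine similarity. It should be read as an illustration under stated assumptions, not as the authors' implementation. The three-dimensional vectors, the word list, and the names `EMBEDDINGS`, `KNOWN_HYPERNYMS`, `average_offset`, and `suggest` are all invented here, and the choice of an average-offset method is itself an assumption about what "analogy-solving" covers.

```python
import math

# Toy 3-dimensional embeddings. The values are invented for illustration;
# real embeddings would be learned from a corpus.
EMBEDDINGS = {
    "cão": [0.9, 0.1, 0.0],
    "animal": [0.9, 0.1, 1.0],
    "rosa": [0.1, 0.9, 0.0],
    "planta": [0.1, 0.9, 1.0],
    "gato": [0.8, 0.2, 0.1],
}

# Hypothetical hypernymy instances assumed to be already in the wordnet.
KNOWN_HYPERNYMS = [("cão", "animal"), ("rosa", "planta")]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_offset(pairs, emb):
    """Mean of (target - source) over the known relation instances."""
    dim = len(next(iter(emb.values())))
    offset = [0.0] * dim
    for src, tgt in pairs:
        for i in range(dim):
            offset[i] += (emb[tgt][i] - emb[src][i]) / len(pairs)
    return offset


def suggest(source, pairs, emb, top_n=3):
    """Rank candidate targets for `source` under the learned relation."""
    offset = average_offset(pairs, emb)
    target = [x + d for x, d in zip(emb[source], offset)]
    known = set(pairs)
    scored = [
        (word, cosine(target, vec))
        for word, vec in emb.items()
        # Skip the source itself and instances the wordnet already has.
        if word != source and (source, word) not in known
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for word, score in suggest("gato", KNOWN_HYPERNYMS, EMBEDDINGS):
        print(f"gato -> {word}: {score:.3f}")
```

With these toy vectors the top-ranked candidate for "gato" is "animal". In a realistic setting the ranked list would be a set of suggestions for a human editor to accept or reject, which is in line with the low accuracy the abstract reports for the top answers.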

