On the Utility of Word Embeddings for Enriching OpenWordNet-PT

Authors Hugo Gonçalo Oliveira , Fredson Silva de Souza Aguiar , Alexandre Rademaker



Author Details

Hugo Gonçalo Oliveira
  • CISUC, Department of Informatics Engineering, University of Coimbra, Portugal
Fredson Silva de Souza Aguiar
  • FGV/EMAp, Rio de Janeiro, Brazil
Alexandre Rademaker
  • IBM Research, Rio de Janeiro, Brazil
  • FGV/EMAp, Rio de Janeiro, Brazil

Acknowledgements

The research described in this paper was partially conducted in the scope of the COST Action CA18209 Nexus Linguarum (European network for Web-centred linguistic data science).

Cite As

Hugo Gonçalo Oliveira, Fredson Silva de Souza Aguiar, and Alexandre Rademaker. On the Utility of Word Embeddings for Enriching OpenWordNet-PT. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 21:1-21:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.LDK.2021.21

Abstract

The maintenance of wordnets and lexical knowledge bases typically relies on time-consuming manual effort. To reduce this effort, we propose exploiting models of distributional semantics, namely word embeddings learned from corpora, for the automatic identification of relation instances missing from a wordnet. Analogy-solving methods are first used for learning a set of relations from analogy tests focused on each relation. Despite their low accuracy, we noted that a portion of the top-ranked answers are good suggestions of relation instances that could be included in the wordnet. This procedure is applied to the enrichment of OpenWordNet-PT, a public Portuguese wordnet. Relations are learned from data acquired from this resource, and illustrative examples are provided. Results are promising for accelerating the identification of missing relation instances, as we estimate that about 17% of the potential suggestions are good, a proportion that almost doubles if some are automatically invalidated.
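As a rough illustration of the idea in the abstract, the sketch below ranks candidate targets for a word by adding an averaged relation offset to its vector and comparing the result with every other vector by cosine similarity. This is only one common analogy-solving strategy and is not necessarily the method used in the paper. The three-dimensional vectors, the word list, and the function names (`cosine`, `average_offset`, `suggest`) are all invented for the example.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. Real embeddings
# would be loaded from a model trained on a Portuguese corpus.
EMBEDDINGS = {
    "cachorro": [0.9, 0.1, 0.0],
    "animal": [0.9, 0.1, 1.0],
    "rosa": [0.1, 0.9, 0.0],
    "planta": [0.1, 0.9, 1.0],
    "gato": [0.8, 0.2, 0.1],
    "mesa": [0.5, 0.5, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def average_offset(pairs, emb):
    """Mean of (target - source) over the known pairs of one relation."""
    dims = len(next(iter(emb.values())))
    total = [0.0] * dims
    for source, target in pairs:
        for i in range(dims):
            total[i] += emb[target][i] - emb[source][i]
    return [t / len(pairs) for t in total]


def suggest(word, pairs, emb, top_n=3):
    """Rank candidate targets for `word` under the relation shown by `pairs`."""
    offset = average_offset(pairs, emb)
    query = [x + o for x, o in zip(emb[word], offset)]
    scored = [
        (cosine(query, vec), candidate)
        for candidate, vec in emb.items()
        if candidate != word  # never suggest the query word itself
    ]
    scored.sort(reverse=True)
    return [candidate for _, candidate in scored[:top_n]]


# Hypothetical hypernymy pairs assumed to be already present in the wordnet.
known_pairs = [("cachorro", "animal"), ("rosa", "planta")]
suggestions = suggest("gato", known_pairs, EMBEDDINGS)
print(suggestions)
```

With these toy vectors, "animal" is ranked first for "gato". In a real setting, the top-ranked answers that do not already appear in the wordnet would be the candidate relation instances to review.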

Subject Classification

ACM Subject Classification
  • Computing methodologies → Lexical semantics
  • Computing methodologies → Language resources
Keywords
  • word embeddings
  • lexical resources
  • wordnet
  • analogy tests
