ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese

Authors Ana Alves , Hugo Gonçalo Oliveira , Ricardo Rodrigues , Rui Encarnação




File

OASIcs.SLATE.2018.12.pdf
  • Filesize: 0.5 MB
  • 17 pages

Author Details

Ana Alves
  • CISUC / ISEC, Polytechnic Institute of Coimbra, Portugal
Hugo Gonçalo Oliveira
  • CISUC / Department of Informatics Engineering, University of Coimbra, Portugal
Ricardo Rodrigues
  • CISUC / ESEC, Polytechnic Institute of Coimbra, Portugal
Rui Encarnação
  • CISUC, University of Coimbra, Portugal

Cite As

Ana Alves, Hugo Gonçalo Oliveira, Ricardo Rodrigues, and Rui Encarnação. ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese. In 7th Symposium on Languages, Applications and Technologies (SLATE 2018). Open Access Series in Informatics (OASIcs), Volume 62, pp. 12:1-12:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/OASIcs.SLATE.2018.12

Abstract

Semantic Textual Similarity (STS) aims at computing the proximity of meaning transmitted by two sentences. In 2016, the ASSIN shared task targeted STS in Portuguese and released training and test collections. This paper describes the development of ASAPP, a system that participated in ASSIN and has been improved since then, and that now achieves the best results in this task. ASAPP learns an STS function from a broad range of lexical, syntactic, semantic and distributional features. The paper presents the features used in the current version of ASAPP and shows how they are exploited in a regression algorithm to achieve the best published results for ASSIN to date, in both European and Brazilian Portuguese.
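The general idea of learning an STS function from features can be illustrated with a minimal sketch. This is not ASAPP's implementation: the two lexical features, the function names (`char_ngrams`, `jaccard`, `features`, `fit_line`) and the single-feature least-squares fit are assumptions chosen for brevity, and the sentence pairs and scores are made-up toy values, not ASSIN data.

```python
from typing import List, Sequence, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def features(s1: str, s2: str) -> List[float]:
    """Two illustrative lexical features: token overlap and character-trigram overlap."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    return [jaccard(tokens1, tokens2), jaccard(char_ngrams(s1), char_ngrams(s2))]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy training pairs with invented similarity scores on a 1-5 scale (hypothetical).
train = [
    ("o gato dorme no sofá", "o gato dorme no sofá", 5.0),
    ("o gato dorme no sofá", "o gato está a dormir no sofá", 4.0),
    ("o gato dorme no sofá", "a bolsa caiu dois por cento", 1.0),
]

# Use only the character-trigram feature as the regressor in this sketch.
xs = [features(a, b)[1] for a, b, _ in train]
ys = [score for _, _, score in train]
slope, intercept = fit_line(xs, ys)


def predict(s1: str, s2: str) -> float:
    """Predicted similarity for a new sentence pair under the fitted line."""
    return slope * features(s1, s2)[1] + intercept
```

A realistic system would combine many more features, including syntactic, semantic and distributional ones, and would use a multivariate regressor trained on the full shared-task collections.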

Subject Classification

ACM Subject Classification
  • Computing methodologies → Natural language processing
Keywords
  • natural language processing
  • semantic textual similarity
  • semantic relations
  • word embeddings
  • character n-grams
  • supervised machine learning

