ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese

Alves, Ana; Gonçalo Oliveira, Hugo; Rodrigues, Ricardo; Encarnação, Rui

doi:10.4230/OASIcs.SLATE.2018.12

Abstract

Semantic Textual Similarity (STS) aims at computing the proximity of meaning conveyed by two sentences. In 2016, the ASSIN shared task targeted STS in Portuguese and released training and test collections. This paper describes the development of ASAPP, a system that participated in ASSIN and has been improved since then, now achieving the best results in this task. ASAPP learns an STS function from a broad range of lexical, syntactic, semantic and distributional features. The paper presents the features used in the current version of ASAPP and how they are exploited by a regression algorithm to achieve the best published results for ASSIN to date, in both European and Brazilian Portuguese.
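To make the idea of learning an STS function from features more concrete, the following is a minimal sketch, not ASAPP's implementation. It assumes three simple lexical features (token overlap, character trigram overlap, and length ratio) and a plain linear regressor fitted by gradient descent. The abstract does not name the exact features or the regression algorithm, so the function names, the features, and the toy sentence pairs with their scores are all invented for illustration.

```python
import re

def tokens(sentence):
    """Lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", sentence.lower())

def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def char_ngrams(sentence, n=3):
    """Character n-grams over the normalized sentence."""
    text = " ".join(tokens(sentence))
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def features(s1, s2):
    """Hypothetical lexical feature vector for a sentence pair."""
    t1, t2 = tokens(s1), tokens(s2)
    len_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2), 1)
    return [jaccard(t1, t2),
            jaccard(char_ngrams(s1), char_ngrams(s2)),
            len_ratio]

def fit_linear(X, y, epochs=5000, lr=0.1):
    """Least-squares linear regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(model, s1, s2):
    """Predicted similarity score for a sentence pair."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, features(s1, s2))) + b

# Toy training pairs with invented similarity scores on a 1-5 scale.
TRAIN = [
    ("O gato dorme no sofá", "O gato está dormindo no sofá", 4.5),
    ("A menina lê um livro", "A menina lê um livro", 5.0),
    ("O homem toca violão", "A bolsa caiu ontem", 1.0),
    ("Choveu muito em Lisboa", "Em Lisboa choveu bastante", 4.0),
    ("O carro é vermelho", "O governo aprovou a lei", 1.0),
]

X = [features(s1, s2) for s1, s2, _ in TRAIN]
y = [score for _, _, score in TRAIN]
model = fit_linear(X, y)

if __name__ == "__main__":
    print(predict(model, "O gato dorme no sofá", "O gato dorme no sofá"))
    print(predict(model, "O homem toca violão", "A bolsa caiu ontem"))
```

A real system would replace the toy features with the richer lexical, syntactic, semantic and distributional features the abstract mentions, and would train on the ASSIN collections rather than a handful of hand-written pairs.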

