From Lexical to Semantic Features in Paraphrase Identification

Authors: Pedro Fialho, Luísa Coheur, Paulo Quaresma



Author Details

Pedro Fialho
  • INESC-ID, Lisboa, Portugal
  • Universidade de Évora, Portugal
Luísa Coheur
  • INESC-ID, Lisboa, Portugal
  • Instituto Superior Técnico, Universidade de Lisboa, Portugal
Paulo Quaresma
  • INESC-ID, Lisboa, Portugal
  • Universidade de Évora, Portugal

Acknowledgements

This work was supported by national funds through Fundação para a Ciência e Tecnologia (FCT) with reference UID/CEC/50021/2019, through the international project RAGE with reference H2020-ICT-2014-1/644187 and by FCT’s INCoDe 2030 initiative, in the scope of the demonstration project AIA, "Apoio Inteligente a empreendedores (chatbots)", which also supports the scholarship of Pedro Fialho.

Cite As

Pedro Fialho, Luísa Coheur, and Paulo Quaresma. From Lexical to Semantic Features in Paraphrase Identification. In 8th Symposium on Languages, Applications and Technologies (SLATE 2019). Open Access Series in Informatics (OASIcs), Volume 74, pp. 9:1-9:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/OASIcs.SLATE.2019.9

Abstract

Paraphrase identification has been applied to diverse scenarios in Natural Language Processing, such as Machine Translation, summarization, and plagiarism detection. In this paper, we present a comparative study of the performance of lexical, syntactic and semantic features for paraphrase identification on the Microsoft Research Paraphrase Corpus. In our experiments, semantic features do not improve results, and syntactic features lead to the best results, but only when combined with lexical features.

Subject Classification

ACM Subject Classification
  • Computing methodologies → Natural language processing
  • Theory of computation → Support vector machines
  • Information systems → Near-duplicate and plagiarism detection
Keywords
  • paraphrase identification
  • lexical features
  • syntactic features
  • semantic features
