Evaluation of Distributional Models with the Outlier Detection Task

Author: Pablo Gamallo





Author Details

Pablo Gamallo
  • Centro de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Galiza

Cite As

Pablo Gamallo. Evaluation of Distributional Models with the Outlier Detection Task. In 7th Symposium on Languages, Applications and Technologies (SLATE 2018). Open Access Series in Informatics (OASIcs), Volume 62, pp. 13:1-13:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/OASIcs.SLATE.2018.13

Abstract

In this article, we define the outlier detection task and use it to compare neural-based word embeddings with transparent count-based distributional representations. Using the English Wikipedia as the text source to train the models, we observed that embeddings outperform count-based representations when their contexts are bags of words. However, there are no sharp differences between the two kinds of model when word contexts are defined as syntactic dependencies. In general, syntax-based models tend to perform better than those based on bag-of-words contexts for this specific task. Similar experiments carried out for Portuguese yielded similar results. The test datasets we created for the outlier detection task in English and Portuguese are released.
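
The abstract does not spell out how an outlier is selected, so the following is only a minimal sketch of one common formulation, not necessarily the procedure used in the paper: each word in a group is scored by its average cosine similarity to the other words, and the lowest-scoring word is reported as the outlier. The function names (`cosine`, `compactness`, `detect_outlier`) and the toy vectors are illustrative assumptions. The vectors are sparse count-style dictionaries whose keys stand for contexts; in a real model those keys could be bag-of-words contexts or syntactic dependencies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(ctx, 0.0) for ctx, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compactness(word, group, vectors):
    """Average similarity of `word` to every other word in `group`."""
    others = [w for w in group if w != word]
    return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)


def detect_outlier(group, vectors):
    """Return the word that is least similar, on average, to the rest."""
    return min(group, key=lambda w: compactness(w, group, vectors))


# Toy vectors with made-up counts (assumption: not taken from the paper).
toy_vectors = {
    "cat": {"purr": 3.0, "pet": 4.0, "fur": 2.0},
    "dog": {"bark": 3.0, "pet": 5.0, "fur": 2.0},
    "hamster": {"pet": 3.0, "fur": 1.0, "cage": 2.0},
    "car": {"drive": 4.0, "engine": 3.0, "road": 2.0},
}
group = ["cat", "dog", "hamster", "car"]

print(detect_outlier(group, toy_vectors))  # prints "car"
```

In this toy group, "car" shares no context with the three animal words, so its average similarity is zero and it is picked as the outlier. A model would then be evaluated by how often its prediction matches the annotated outlier across many such groups.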

Subject Classification

ACM Subject Classification
  • Computing methodologies → Unsupervised learning
Keywords
  • distributional semantics
  • dependency analysis
  • outlier detection
  • similarity

