In this article, we define the outlier detection task and use it to compare neural-based word embeddings with transparent count-based distributional representations. Using the English Wikipedia as the text source to train the models, we observed that embeddings outperform count-based representations when their contexts consist of bags of words. However, there are no sharp differences between the two model types if the word contexts are defined as syntactic dependencies. In general, syntax-based models tend to perform better than bag-of-words models on this specific task. Similar experiments carried out for Portuguese produced comparable results. The test datasets we created for the outlier detection task in English and Portuguese are released.
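As a minimal illustration of the task (not the paper's evaluation code), a common formulation scores each word in a small set by its mean cosine similarity to the remaining words and flags the lowest-scoring word as the outlier. The toy vectors below are hypothetical stand-ins for vectors produced by a trained embedding or count-based model:

```python
# Sketch of the outlier detection task: given a set of in-group words
# plus one outlier, rank each word by its mean cosine similarity to the
# other words and flag the lowest-scoring one.
# The 3-d vectors are invented for illustration only; real experiments
# would use vectors from a model trained on a corpus such as Wikipedia.
import numpy as np

vectors = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "dog":    np.array([0.8, 0.2, 0.1]),
    "horse":  np.array([0.7, 0.3, 0.0]),
    "guitar": np.array([0.1, 0.1, 0.9]),  # intended outlier
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def detect_outlier(words):
    # Mean similarity of each word to all the others (a compactness score).
    scores = {
        w: np.mean([cosine(vectors[w], vectors[o]) for o in words if o != w])
        for w in words
    }
    return min(scores, key=scores.get)

print(detect_outlier(list(vectors)))  # -> "guitar"
```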