Comparing the Performance of Different NLP Toolkits in Formal and Social Media Text

Pinto, Alexandre; Gonçalo Oliveira, Hugo; Oliveira Alves, Ana

doi:10.4230/OASIcs.SLATE.2016.3

Keywords

Natural language processing
toolkits
formal text
social media
benchmark

Abstract

Many toolkits are now available for performing common natural language processing tasks, enabling the development of more powerful applications without having to start from scratch. For English in particular, there is no need to develop new tools such as tokenizers, part-of-speech (POS) taggers, chunkers or named entity recognizers (NER). The current challenge is to select which toolkit to use out of the range available. This choice may depend on several aspects, including the kind and source of text, where the register, formal or informal, may influence the performance of such tools. In this paper, we assess a range of natural language processing toolkits in their default configuration on a set of standard tasks (e.g. tokenization, POS tagging, chunking and NER), using popular datasets that cover newspaper and social network text. We analyze the results and, while we could not settle on a single toolkit, the exercise helped narrow our choice.

Cite As

Alexandre Pinto, Hugo Gonçalo Oliveira, and Ana Oliveira Alves. Comparing the Performance of Different NLP Toolkits in Formal and Social Media Text. In 5th Symposium on Languages, Applications and Technologies (SLATE'16). Open Access Series in Informatics (OASIcs), Volume 51, pp. 3:1-3:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/OASIcs.SLATE.2016.3

Author Details

Alexandre Pinto

Hugo Gonçalo Oliveira

Ana Oliveira Alves
