Comparing the Performance of Different NLP Toolkits in Formal and Social Media Text

Authors Alexandre Pinto, Hugo Gonçalo Oliveira, Ana Oliveira Alves




File

OASIcs.SLATE.2016.3.pdf
  • Filesize: 489 kB
  • 16 pages

Cite As

Alexandre Pinto, Hugo Gonçalo Oliveira, and Ana Oliveira Alves. Comparing the Performance of Different NLP Toolkits in Formal and Social Media Text. In 5th Symposium on Languages, Applications and Technologies (SLATE'16). Open Access Series in Informatics (OASIcs), Volume 51, pp. 3:1-3:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)
https://doi.org/10.4230/OASIcs.SLATE.2016.3

Abstract

Many toolkits are now available for common natural language processing tasks, enabling the development of more powerful applications without starting from scratch. For English in particular, there is no need to develop tools such as tokenizers, part-of-speech (POS) taggers, chunkers or named entity recognizers (NER). The current challenge is to select which one to use from the range of available tools. This choice may depend on several aspects, including the kind and source of text, since its register, formal or informal, may influence the performance of such tools. In this paper, we assess a range of natural language processing toolkits in their default configuration on a set of standard tasks (e.g. tokenization, POS tagging, chunking and NER), using popular datasets that cover newspaper and social network text. We analyze the results and, although we could not settle on a single toolkit, the exercise helped considerably to narrow our choice.
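
As an illustration of the kind of comparison described in the abstract, the following minimal sketch scores several taggers against gold annotations on a formal and an informal dataset. It is written under assumptions: the abstract does not name its evaluation measures, so token-level accuracy is used here only as a plausible stand-in, and the two taggers and the tiny datasets are hypothetical toys, not the toolkits or corpora evaluated in the paper.

```python
from typing import Callable, Dict, List, Tuple

# A tagger maps a list of tokens to a list of tags (hypothetical interface).
Tagger = Callable[[List[str]], List[str]]
# A gold sentence is a pair (tokens, gold tags).
Sentence = Tuple[List[str], List[str]]


def token_accuracy(tagger: Tagger, gold: List[Sentence]) -> float:
    """Fraction of gold tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for tokens, gold_tags in gold:
        predicted = tagger(tokens)
        # Tokens the tagger fails to cover count as errors.
        correct += sum(p == g for p, g in zip(predicted, gold_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0


def compare(taggers: Dict[str, Tagger],
            datasets: Dict[str, List[Sentence]]) -> Dict[str, Dict[str, float]]:
    """Score every tagger on every dataset."""
    return {
        name: {ds: token_accuracy(tagger, data) for ds, data in datasets.items()}
        for name, tagger in taggers.items()
    }


# Toy stand-ins for real toolkits (assumptions, for illustration only).
def always_noun(tokens: List[str]) -> List[str]:
    return ["NN"] * len(tokens)


LEXICON = {"the": "DT", "fell": "VBD"}


def lexicon_tagger(tokens: List[str]) -> List[str]:
    return [LEXICON.get(t.lower(), "NN") for t in tokens]


# Tiny invented gold data: one "formal" and one "informal" sentence.
datasets = {
    "news": [(["The", "market", "fell"], ["DT", "NN", "VBD"])],
    "tweets": [(["lol", "the", "game", "rocked"], ["UH", "DT", "NN", "VBD"])],
}

results = compare({"always_noun": always_noun, "lexicon": lexicon_tagger}, datasets)
for name, scores in results.items():
    print(name, scores)
```

Wrapping each toolkit behind the same tokens-to-tags function keeps the scoring code independent of any particular library, which is one simple way to run every tool in its default configuration on identical input.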

Keywords

  • Natural language processing
  • toolkits
  • formal text
  • social media
  • benchmark
