CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation

Authors: Christian Chiarcos, Niko Schenk


Author Details

Christian Chiarcos
  • Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany
Niko Schenk
  • Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany

Cite As

Christian Chiarcos and Niko Schenk. CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Open Access Series in Informatics (OASIcs), Volume 70, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


The proper detection of tokens in running text represents the initial processing step in modular NLP pipelines. However, strategies for defining these minimal units can differ, and conflicting analyses of the same text seriously limit the integration of subsequent linguistic annotations into a shared representation. As a solution, we introduce CoNLL-Merge, a practical tool for harmonizing TSV-related data models, as they occur, e.g., in multi-layer corpora with non-sequential, concurrent tokenizations, but also in ensemble combinations in Natural Language Processing. CoNLL-Merge works unsupervised, requires no manual intervention or external data sources, and comes with a flexible API for fully automated merging routines, validity and sanity checks. Users can choose from several merging strategies, and either preserve a reference tokenization (with possible losses of annotation granularity), create a common tokenization layer consisting of minimal shared subtokens (lossless in terms of annotation granularity, but destructive with respect to a reference tokenization), or present tokenization clashes (lossless and non-destructive, but introducing empty tokens as placeholders for unaligned elements). We demonstrate the applicability of the tool on two use cases from natural language processing and computational philology.
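To make the second strategy (a common layer of minimal shared subtokens) more concrete, the following is a minimal sketch in Python. It is not the CoNLL-Merge implementation or its API: the function names `boundaries` and `minimal_subtokens` are hypothetical, and the sketch assumes that both tokenizations cover exactly the same character sequence once whitespace is ignored, so it does not handle textual variation.

```python
from itertools import accumulate


def boundaries(tokens):
    """Return the set of character offsets at which the tokens end.

    Offsets are counted over the concatenated tokens, i.e. whitespace
    between tokens is ignored (an assumption of this sketch).
    """
    return set(accumulate(len(t) for t in tokens))


def minimal_subtokens(tokens_a, tokens_b):
    """Split the shared text at every boundary found in either tokenization.

    Hypothetical helper for illustration only: it requires both inputs to
    spell out the same character sequence and raises ValueError otherwise.
    """
    text_a, text_b = "".join(tokens_a), "".join(tokens_b)
    if text_a != text_b:
        raise ValueError("tokenizations do not cover the same character sequence")

    # The union of both boundary sets yields the finest segmentation that
    # is compatible with both tokenizations.
    cuts = sorted(boundaries(tokens_a) | boundaries(tokens_b))
    starts = [0] + cuts[:-1]

    # Skip zero-length spans that would arise from empty input tokens.
    return [text_a[s:e] for s, e in zip(starts, cuts) if s < e]


if __name__ == "__main__":
    print(minimal_subtokens(["don't", "stop"], ["do", "n't", "stop"]))
    print(minimal_subtokens(["ab", "cd"], ["a", "bc", "d"]))
```

In the second call the two tokenizations overlap without either one refining the other, so every character ends up as its own subtoken. Annotations from both sources can then be attached to these subtokens without loss of granularity, although neither original tokenization is preserved, which matches the trade-off described for this strategy.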

Subject Classification

ACM Subject Classification
  • Applied computing → Format and notation
  • Applied computing → Document management and text processing
  • Applied computing → Annotation
  • Software and its engineering → Interoperability
Keywords
  • data heterogeneity
  • tokenization
  • tab-separated values (TSV) format
  • linguistic annotation
  • merging



