CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation

Chiarcos, Christian; Schenk, Niko

doi:10.4230/OASIcs.LDK.2019.7

Author Details

Christian Chiarcos

Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany

Niko Schenk

Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany

Cite As

Christian Chiarcos and Niko Schenk. CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Open Access Series in Informatics (OASIcs), Volume 70, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/OASIcs.LDK.2019.7

Abstract

The proper detection of tokens in running text represents the initial processing step in modular NLP pipelines. But strategies for defining these minimal units can differ, and conflicting analyses of the same text seriously limit the integration of subsequent linguistic annotations into a shared representation. As a solution, we introduce CoNLL-Merge, a practical tool for harmonizing TSV-related data models, as they occur, e.g., in multi-layer corpora with non-sequential, concurrent tokenizations, but also in ensemble combinations in Natural Language Processing. CoNLL-Merge works unsupervised, requires no manual intervention or external data sources, and comes with a flexible API for fully automated merging routines, validity and sanity checks. Users can choose from several merging strategies, and either preserve a reference tokenization (with possible losses of annotation granularity), create a common tokenization layer consisting of minimal shared subtokens (lossless in terms of annotation granularity, destructive against a reference tokenization), or present tokenization clashes (lossless and non-destructive, but introducing empty tokens as placeholders for unaligned elements). We demonstrate the applicability of the tool on two use cases from natural language processing and computational philology.
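
The second merging strategy named in the abstract, a common layer of minimal shared subtokens, can be illustrated with a small sketch. This is not the CoNLL-Merge implementation or its API. It is a simplified illustration that assumes both tokenizations cover exactly the same character sequence (so no textual variation), and the function names `boundaries` and `minimal_subtokens` are hypothetical.

```python
from itertools import accumulate


def boundaries(tokens):
    """Return the set of character offsets at which a token ends.

    Offsets are counted over the concatenation of all tokens,
    i.e. whitespace between tokens is ignored.
    """
    return set(accumulate(len(t) for t in tokens))


def minimal_subtokens(tokens_a, tokens_b):
    """Split the shared text at every boundary found in either tokenization.

    Assumption (not from the paper): both inputs spell out the same
    character sequence once their tokens are concatenated.
    """
    text_a, text_b = "".join(tokens_a), "".join(tokens_b)
    if text_a != text_b:
        raise ValueError("tokenizations do not cover the same characters")

    # The union of both boundary sets yields the finest segmentation
    # that is compatible with both tokenizations.
    cuts = sorted(boundaries(tokens_a) | boundaries(tokens_b))
    starts = [0] + cuts[:-1]
    return [text_a[s:e] for s, e in zip(starts, cuts)]


if __name__ == "__main__":
    a = ["don't", "stop", "."]
    b = ["do", "n't", "stop."]
    print(minimal_subtokens(a, b))
```

Every token of either input is a concatenation of one or more of the returned subtokens, so annotations from both layers can be attached without losing granularity, but neither original tokenization is preserved as such.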

Subject Classification

ACM Subject Classification

Applied computing → Format and notation
Applied computing → Document management and text processing
Applied computing → Annotation
Software and its engineering → Interoperability

Keywords

data heterogeneity
tokenization
tab-separated values (TSV) format
linguistic annotation
merging

