An Ontology for CoNLL-RDF: Formal Data Structures for TSV Formats in Language Technology

Chiarcos, Christian; Ionov, Maxim; Glaser, Luis; Fäth, Christian

doi:10.4230/OASIcs.LDK.2021.20

Abstract

In language technology and the language sciences, tab-separated values (TSV) are a frequently used formalism for representing linguistically annotated natural language, and the resulting formats are often referred to as "CoNLL formats". A large number of such formats exist, and although they share a number of common features, they are not interoperable, because different pieces of information are encoded differently in these dialects. CoNLL-RDF refers to a programming library and the associated data model that have been introduced to facilitate processing and transforming such TSV formats in a serialization-independent way. CoNLL-RDF represents CoNLL data by means of RDF graphs and SPARQL update operations, but so far without machine-readable semantics: annotation properties are created dynamically on the basis of a user-defined mapping from columns to labels. Current applications of CoNLL-RDF include linking between corpora and dictionaries [Mambrini and Passarotti, 2019] and knowledge graphs [Tamper et al., 2018], syntactic parsing of historical languages [Chiarcos et al., 2018; Chiarcos et al., 2018], the consolidation of syntactic and semantic annotations [Chiarcos and Fäth, 2019], a bridge between RDF corpora and a traditional corpus query language [Ionov et al., 2020], and language contact studies [Chiarcos et al., 2018]. We describe a novel extension of CoNLL-RDF that introduces a formal data model, formalized as an ontology. The ontology is a basis for linking RDF corpora with other Semantic Web resources, but more importantly, its application to the transformation between different TSV formats is a major step towards interoperability between CoNLL formats.
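
The column-to-label mapping mentioned in the abstract can be illustrated with a minimal sketch. It is not the CoNLL-RDF library's API: the function name `tsv_to_triples`, the `example.org` namespaces, and the token URI scheme are assumptions made purely for illustration. It only shows the idea that each TSV column becomes a property named after its user-defined label.

```python
# Illustrative sketch only: this is not the CoNLL-RDF library's API.
# The namespace and the URI scheme below are assumptions for the example.
CONLL_NS = "http://example.org/conll#"  # hypothetical namespace


def tsv_to_triples(tsv_text, column_labels, base_uri="http://example.org/corpus#"):
    """Turn CoNLL-style TSV rows into (subject, predicate, object) triples.

    Each non-empty, non-comment line is one token; blank lines separate
    sentences. One property per column is derived from the user-defined
    label for that column.
    """
    triples = []
    sentence_no, token_no = 1, 0
    for line in tsv_text.splitlines():
        if not line.strip():
            # Blank line: the next token starts a new sentence.
            if token_no > 0:
                sentence_no += 1
                token_no = 0
            continue
        if line.startswith("#"):
            continue  # simplification: treat lines starting with "#" as comments
        token_no += 1
        subject = f"{base_uri}s{sentence_no}_{token_no}"
        for label, value in zip(column_labels, line.split("\t")):
            if value != "_":  # "_" conventionally marks an empty cell
                triples.append((subject, CONLL_NS + label, value))
    return triples


sample = "1\tThe\tDET\n2\tcat\tNOUN\n\n1\tSleep\tVERB\n"
triples = tsv_to_triples(sample, ["ID", "WORD", "POS"])
for s, p, o in triples:
    print(s, p, o)
```

Because the property names come directly from the labels passed in, two corpora that use different labels for the same kind of annotation yield unrelated properties. This is the lack of machine-readable semantics that the ontology is meant to address.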

References

Frank Abromeit and Christian Chiarcos. Automatic Detection of Language and Annotation Model Information in CoNLL Corpora. In Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek, and Milan Dojchinovski, editors, 2nd Conference on Language, Data and Knowledge (LDK 2019), volume 70 of OpenAccess Series in Informatics (OASIcs), pages 23:1-23:9, Dagstuhl, Germany, 2019. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan, and Huaiyu Zhu. Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 397-407, 2015.
Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 5(3):1-22, 2009.
Christian Chiarcos. A Generic Formalism to Represent Linguistic Corpora in RDF and OWL/DL. In Proc. of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 3205-3212. ELRA, 2012.
Christian Chiarcos. POWLA: Modeling Linguistic Corpora in OWL/DL. In Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho, and Valentina Presutti, editors, The Semantic Web: Research and Applications, pages 225-239, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
Christian Chiarcos, Kathrin Donandt, Hasmik Sargsian, M. Ionov, and J. Wichers Schreur. Towards LLOD-based Language Contact Studies. A Case Study in Interoperability. In Proc. of the 6th Workshop on Linked Data in Linguistics (LDL), 2018.
Christian Chiarcos and Christian Fäth. CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In Jorge Gracia, Francis Bond, John P. McCrae, Paul Buitelaar, Christian Chiarcos, and Sebastian Hellmann, editors, Language, Data, and Knowledge, pages 74-88, Cham, Switzerland, 2017. Springer.
Christian Chiarcos and Christian Fäth. Graph-Based Annotation Engineering: Towards a Gold Corpus for Role and Reference Grammar. In 2nd Conference on Language, Data and Knowledge (LDK-2019), pages 9:1-9:11. OpenAccess Series in Informatics, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2019.
Christian Chiarcos and Luis Glaser. A Tree Extension for CoNLL-RDF. In Proc. of the Twelfth International Conference on Language Resources and Evaluation (LREC-2020), pages 7161-7169, Marseille, France, 2020. ELRA.
Christian Chiarcos, Ilya Khait, Émilie Pagé-Perron, Niko Schenk, Christian Fäth, Julius Steuer, William Mcgrath, Jinyan Wang, et al. Annotating a Low-Resource Language with LLOD Technology: Sumerian Morphology and Syntax. Information, 9(11):290, 2018.
Christian Chiarcos, Benjamin Kosmehl, Christian Fäth, and Maria Sukhareva. Analyzing Middle High German syntax with RDF and SPARQL. In Proc. of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 4525-4534, Miyazaki, Japan, 2018.
O. Christ. A Modular and Flexible Architecture for an Integrated Corpus Query System. In Papers in Computational Lexicography (COMPLEX-1994), pages 22-32, Budapest, Hungary, 1994.
Philipp Cimiano, Christian Chiarcos, John P. McCrae, and Jorge Gracia. Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing, Cham, 2020.
Souripriya Das, Seema Sundara, and Richard Cyganiak. R2RML: RDB to RDF Mapping Language. W3C Recommendation. https://www.w3.org/TR/r2rml, 2012.
Stefan Evert and Andrew Hardie. Twenty-first Century Corpus Workbench: Updating a Query Architecture for the New Millennium. In Proc. of the Corpus Linguistics 2011 Conference, pages 1-21, Birmingham, UK, 2011.
Antske Fokkens, Aitor Soroa, Zuhaitz Beloki, Niels Ockeloen, German Rigau, Willem Robert van Hage, and Piek Vossen. NAF and GAF: Linking Linguistic Annotations. In Proc. of the Tenth Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, pages 9-16, 2014.
Christian Fäth, Christian Chiarcos, Björn Ebbrecht, and Maxim Ionov. Fintan - Flexible, Integrated Transformation and annotation eNgineering. In Proc. of the Twelfth International Conference on Language Resources and Evaluation (LREC-2020), pages 7212-7221, Marseille, France, 2020. ELRA.
A. Ghiran and R. A. Buchmann. Semantic Integration of Security Knowledge Sources. In Twelfth International Conference on Research Challenges in Information Science (RCIS-2018), pages 1-9, 2018.
Noori Haider and Fokhray Hossain. CSV2RDF: Generating RDF Data from CSV File Using Semantic Web Technologies. Journal of Theoretical and Applied Information Technology, 96(20):6889-6902, 2018.
Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. Integrating NLP Using Linked Data. In Camille Salinesi, Moira C. Norrie, and Óscar Pastor, editors, Advanced Information Systems Engineering, volume 7908, pages 98-113. Springer Berlin Heidelberg, 2013.
Eero Antero Hyvönen, Petri Leskinen, Minna Tamper, and Jouni Antero Tuominen. Semantic National Biography of Finland. In Eetu Mäkelä, Mikko Tolonen, and Jouni Tuominen, editors, Proc. of the DHN 2018, CEUR Workshop Proceedings, pages 372-385, International, 2018. CEUR Workshop Proceedings.
N. Ide and L. Romary. International Standard for a Linguistic Annotation Framework. Natural language engineering, 10(3-4):211-225, 2004.
Maxim Ionov, Florian Stein, Sagar Sehgal, and Christian Chiarcos. cqp4rdf: Towards a Suite for RDF-based Corpus Linguistics. In European Semantic Web Conference, pages 115-121. Springer, 2020.
ISO. Language Resource Management - Linguistic Annotation Framework (LAF). Standard, International Organization for Standardization, Geneva, 2012. Project leader: Nancy Ide.
Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. The Sketch Engine: Ten Years On. Lexicography, 1(1):7-36, 2014.
Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian J Mielke, Arya D McCarthy, Sandra Kübler, et al. UniMorph 2.0: Universal Morphology. In Proc. of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 1868-1873, 2018.
Francesco Mambrini and Marco Passarotti. Linked Open Treebanks. Interlinking Syntactically Annotated Corpora in the LiLa Knowledge Base of Linguistic Resources for Latin. In Proc. of TLT, SyntaxFest 2019, pages 74-81, Paris, France, 2019. Association for Computational Linguistics.
Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.
Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Universal Dependencies v1: A Multilingual Treebank Collection. In Proc. of the Tenth International Conference on Language Resources and Evaluation (LREC-2016), pages 1659-1666, 2016.
Emilio Rubiera, Luis Polo, Diego Berrueta, and Adil El Ghali. TELIX: An RDF-based Model for Linguistic Annotation. In Extended Semantic Web Conference, pages 195-209. Springer, 2012.
Robert Sanderson, Paolo Ciccarese, and Benjamin Young. Web Annotation Data Model. Technical report, W3C Recommendation, 2017. URL: https://www.w3.org/TR/annotation-model/.
Erik F Sang and Sabine Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. arXiv preprint, 2000. URL: http://arxiv.org/abs/cs/0009008.
Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proc. of International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK, 1994.
Minna Tamper, Petri Leskinen, Kasper Apajalahti, and Eero Hyvönen. Using biographical texts as linked data for prosopographical research and applications. In Euro-Mediterranean Conference, pages 125-137. Springer, 2018.
Marc Verhagen, Keith Suderman, Di Wang, Nancy Ide, Chunqi Shi, Jonathan Wright, and James Pustejovsky. The LAPPS Interchange Format. In International Workshop on Worldwide Language Service Infrastructure, pages 33-47. Springer, 2015.
Karin Verspoor and Kevin Livingston. Towards Adaptation of Linguistic Annotations to Scholarly Annotation Formalisms on the Semantic Web. In Proc. of the Sixth Linguistic Annotation Workshop, pages 75-84, 2012.
