An Ontology for CoNLL-RDF: Formal Data Structures for TSV Formats in Language Technology

Authors: Christian Chiarcos, Maxim Ionov, Luis Glaser, Christian Fäth



Author Details

Christian Chiarcos
  • Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany
Maxim Ionov
  • Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany
Luis Glaser
  • Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany
Christian Fäth
  • Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany

Cite As

Christian Chiarcos, Maxim Ionov, Luis Glaser, and Christian Fäth. An Ontology for CoNLL-RDF: Formal Data Structures for TSV Formats in Language Technology. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 20:1-20:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.LDK.2021.20

Abstract

In language technology and the language sciences, tab-separated values (TSV) are a frequently used formalism for representing linguistically annotated natural language, often referred to as "CoNLL formats". A large number of such formats exist, but although they share common features, they are not interoperable, because the same pieces of information are encoded differently in each dialect.
CoNLL-RDF refers to a programming library and its associated data model, introduced to facilitate processing and transforming such TSV formats in a serialization-independent way. CoNLL-RDF represents CoNLL data by means of RDF graphs and SPARQL update operations but, so far, without machine-readable semantics: annotation properties are created dynamically on the basis of a user-defined mapping from columns to labels. Current applications of CoNLL-RDF include linking between corpora and dictionaries [Mambrini and Passarotti, 2019] and knowledge graphs [Tamper et al., 2018], syntactic parsing of historical languages [Chiarcos et al., 2018; Chiarcos et al., 2018], the consolidation of syntactic and semantic annotations [Chiarcos and Fäth, 2019], a bridge between RDF corpora and a traditional corpus query language [Ionov et al., 2020], and language contact studies [Chiarcos et al., 2018].
We describe a novel extension of CoNLL-RDF that introduces a formal data model in the form of an ontology. The ontology provides a basis for linking RDF corpora with other Semantic Web resources. More importantly, applying it to transformations between different TSV formats is a major step towards interoperability between CoNLL formats.
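
The idea of a user-defined mapping from columns to labels can be illustrated with a minimal Python sketch. It is not the CoNLL-RDF library and does not reproduce its API or vocabulary: the function name tsv_to_triples, the prefixes "conll:" and "ex:", the property "ex:next", and the sample sentence are all assumptions made for illustration. Each TSV row becomes one resource, and each column value becomes a property whose name is taken from the label the user assigned to that column.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical prefixes, chosen for this illustration only.
PROP_PREFIX = "conll:"
BASE = "ex:"


def tsv_to_triples(tsv: str, columns: List[str]) -> List[Triple]:
    """Turn a single TSV sentence into (subject, predicate, object) triples.

    `columns` is the user-defined mapping: the i-th label names the
    property generated for the i-th column.
    """
    rows = [
        line.split("\t")
        for line in tsv.strip().splitlines()
        if line and not line.startswith("#")  # skip blank and comment lines
    ]
    triples: List[Triple] = []
    for i, cells in enumerate(rows, start=1):
        subject = f"{BASE}s1_{i}"
        for label, value in zip(columns, cells):
            if value == "_":  # "_" conventionally marks an empty cell
                continue
            # The property is created dynamically from the column label.
            triples.append((subject, PROP_PREFIX + label, value))
        if i < len(rows):
            # Link each row to the following one to preserve word order.
            triples.append((subject, "ex:next", f"{BASE}s1_{i + 1}"))
    return triples


SAMPLE = "1\tThe\tDET\n2\tcat\tNOUN\n3\tsleeps\tVERB"
triples = tsv_to_triples(SAMPLE, ["ID", "WORD", "POS"])
for t in triples:
    print(t)
```

Because the property names come only from the labels passed in, two users who label the same column "POS" and "UPOS" obtain different, unrelated properties. This is the kind of gap that a formal data model with machine-readable semantics is meant to close.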

Subject Classification

ACM Subject Classification
  • Information systems → Graph-based database models
  • Computing methodologies → Language resources
  • Computing methodologies → Knowledge representation and reasoning
Keywords
  • language technology
  • data models
  • CoNLL-RDF
  • ontology

References

  1. Frank Abromeit and Christian Chiarcos. Automatic Detection of Language and Annotation Model Information in CoNLL Corpora. In Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek, and Milan Dojchinovski, editors, 2nd Conference on Language, Data and Knowledge (LDK 2019), volume 70 of OpenAccess Series in Informatics (OASIcs), pages 23:1-23:9, Dagstuhl, Germany, 2019. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  2. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan, and Huaiyu Zhu. Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 397-407, 2015.
  3. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 5(3):1-22, 2009.
  4. Christian Chiarcos. A Generic Formalism to Represent Linguistic Corpora in RDF and OWL/DL. In Proc. of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 3205-3212. ELRA, 2012.
  5. Christian Chiarcos. POWLA: Modeling Linguistic Corpora in OWL/DL. In Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho, and Valentina Presutti, editors, The Semantic Web: Research and Applications, pages 225-239, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  6. Christian Chiarcos, Kathrin Donandt, Hasmik Sargsian, M. Ionov, and J. Wichers Schreur. Towards LLOD-based language contact studies. A case study in interoperability. In Proc. of the 6th Workshop on Linked Data in Linguistics (LDL), 2018.
  7. Christian Chiarcos and Christian Fäth. CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In Jorge Gracia, Francis Bond, John P. McCrae, Paul Buitelaar, Christian Chiarcos, and Sebastian Hellmann, editors, Language, Data, and Knowledge, pages 74-88, Cham, Switzerland, 2017. Springer.
  8. Christian Chiarcos and Christian Fäth. Graph-based annotation engineering: Towards a gold corpus for Role and Reference Grammar. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
  9. Christian Chiarcos and Christian Fäth. Graph-Based Annotation Engineering: Towards a Gold Corpus for Role and Reference Grammar. In 2nd Conference on Language, Data and Knowledge (LDK-2019), pages 9:1-9:11. OpenAccess Series in Informatics, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2019.
  10. Christian Chiarcos and Luis Glaser. A Tree Extension for CoNLL-RDF. In Proc. of the Twelfth International Conference on Language Resources and Evaluation (LREC-2020), pages 7161-7169, Marseille, France, 2020. ELRA.
  11. Christian Chiarcos, Ilya Khait, Émilie Pagé-Perron, Niko Schenk, Christian Fäth, Julius Steuer, William Mcgrath, Jinyan Wang, et al. Annotating a low-resource language with LLOD technology: Sumerian morphology and syntax. Information, 9(11):290, 2018.
  12. Christian Chiarcos, Benjamin Kosmehl, Christian Fäth, and Maria Sukhareva. Analyzing Middle High German syntax with RDF and SPARQL. In Proc. of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 4525-4534, Miyazaki, Japan, 2018.
  13. O. Christ. A modular and flexible architecture for an integrated corpus query system. In Papers in Computational Lexicography (COMPLEX-1994), pages 22-32, Budapest, Hungary, 1994.
  14. Philipp Cimiano, Christian Chiarcos, John P. McCrae, and Jorge Gracia. Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing, Cham, 2020.
  15. Souripriya Das, Seema Sundara, and Richard Cyganiak. R2RML: RDB to RDF Mapping Language. W3C Recommendation. https://www.w3.org/TR/r2rml, 2012.
  16. Stefan Evert and Andrew Hardie. Twenty-first Century Corpus Workbench: Updating a Query Architecture for the New Millennium. In Proc. of the Corpus Linguistics 2011 Conference, pages 1-21, Birmingham, UK, 2011.
  17. Antske Fokkens, Aitor Soroa, Zuhaitz Beloki, Niels Ockeloen, German Rigau, Willem Robert van Hage, and Piek Vossen. NAF and GAF: Linking Linguistic Annotations. In Proc. of the Tenth Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, pages 9-16, 2014.
  18. Christian Fäth, Christian Chiarcos, Björn Ebbrecht, and Maxim Ionov. Fintan - Flexible, Integrated Transformation and annotation eNgineering. In Proc. of the Twelfth International Conference on Language Resources and Evaluation (LREC-2020), pages 7212-7221, Marseille, France, 2020. ELRA.
  19. A. Ghiran and R. A. Buchmann. Semantic Integration of Security Knowledge Sources. In Twelfth International Conference on Research Challenges in Information Science (RCIS-2018), pages 1-9, 2018.
  20. Noori Haider and Fokhray Hossain. CSV2RDF: Generating RDF Data from CSV File Using Semantic Web Technologies. Journal of Theoretical and Applied Information Technology, 96(20):6889-6902, 2018.
  21. Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. Integrating NLP Using Linked Data. In Camille Salinesi, Moira C. Norrie, and Óscar Pastor, editors, Advanced Information Systems Engineering, volume 7908, pages 98-113. Springer Berlin Heidelberg, 2013.
  22. Eero Antero Hyvönen, Petri Leskinen, Minna Tamper, and Jouni Antero Tuominen. Semantic National Biography of Finland. In Eetu Mäkelä, Mikko Tolonen, and Jouni Tuominen, editors, Proc. of the DHN 2018, CEUR Workshop Proceedings, pages 372-385, International, 2018. CEUR Workshop Proceedings.
  23. N. Ide and L. Romary. International Standard for a Linguistic Annotation Framework. Natural Language Engineering, 10(3-4):211-225, 2004.
  24. Maxim Ionov, Florian Stein, Sagar Sehgal, and Christian Chiarcos. cqp4rdf: Towards a suite for RDF-based corpus linguistics. In European Semantic Web Conference, pages 115-121. Springer, 2020.
  25. ISO. Language Resource Management - Linguistic Annotation Framework (LAF). Standard, International Organization for Standardization, Geneva, 2012. Project leader: Nancy Ide.
  26. Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. The Sketch Engine: Ten Years On. Lexicography, 1(1):7-36, 2014.
  27. Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian J Mielke, Arya D McCarthy, Sandra Kübler, et al. UniMorph 2.0: Universal Morphology. In Proc. of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 1868-1873, 2018.
  28. Francesco Mambrini and Marco Passarotti. Linked Open Treebanks. Interlinking Syntactically Annotated Corpora in the LiLa Knowledge Base of Linguistic Resources for Latin. In Proc. of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), pages 74-81, 2019.
  29. Francesco Mambrini and Marco Passarotti. Linked Open Treebanks. Interlinking Syntactically Annotated Corpora in the LiLa Knowledge Base of Linguistic Resources for Latin. In Proc. of TLT, SyntaxFest 2019, pages 74-81, Paris, France, 2019. Association for Computational Linguistics.
  30. Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.
  31. Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Universal Dependencies v1: A Multilingual Treebank Collection. In Proc. of the Tenth International Conference on Language Resources and Evaluation (LREC-2016), pages 1659-1666, 2016.
  32. Emilio Rubiera, Luis Polo, Diego Berrueta, and Adil El Ghali. TELIX: An RDF-based Model for Linguistic Annotation. In Extended Semantic Web Conference, pages 195-209. Springer, 2012.
  33. Robert Sanderson, Paolo Ciccarese, and Benjamin Young. Web Annotation Data Model. Technical report, W3C Recommendation, 2017. URL: https://www.w3.org/TR/annotation-model/.
  34. Erik F Sang and Sabine Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. arXiv preprint, 2000. URL: http://arxiv.org/abs/cs/0009008.
  35. Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proc. of International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK, 1994.
  36. Minna Tamper, Petri Leskinen, Kasper Apajalahti, and Eero Hyvönen. Using biographical texts as linked data for prosopographical research and applications. In Euro-Mediterranean Conference, pages 125-137. Springer, 2018.
  37. Marc Verhagen, Keith Suderman, Di Wang, Nancy Ide, Chunqi Shi, Jonathan Wright, and James Pustejovsky. The LAPPS Interchange Format. In International Workshop on Worldwide Language Service Infrastructure, pages 33-47. Springer, 2015.
  38. Karin Verspoor and Kevin Livingston. Towards Adaptation of Linguistic Annotations to Scholarly Annotation Formalisms on the Semantic Web. In Proc. of the Sixth Linguistic Annotation Workshop, pages 75-84, 2012.