Automatic Detection of Language and Annotation Model Information in CoNLL Corpora

Abromeit, Frank; Chiarcos, Christian

doi:10.4230/OASIcs.LDK.2019.23

Abstract

We introduce AnnoHub, an ongoing effort to automatically complement existing language resources with metadata about the languages they cover and the annotation schemes (tagsets) that they apply, to provide a web interface for their curation and evaluation by domain experts, and to publish them as an RDF dataset and as part of the (Linguistic) Linked Open Data (LLOD) cloud. In this paper, we focus on tabular formats with tab-separated values (TSV), a de facto standard for annotated corpora popularized by the CoNLL Shared Tasks. By extension, other formats for which a converter to CoNLL and/or TSV formats exists can be processed analogously. We describe our implementation and its evaluation against a sample of 93 corpora from the Universal Dependencies, v.2.3.
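To make the idea of detecting a tagset in a TSV corpus concrete, the following is a minimal sketch and not AnnoHub's actual implementation. It assumes that a column can be matched to an annotation scheme by measuring how many of its cells belong to a known tag inventory. The function names (`column_values`, `tagset_coverage`, `best_column`) and the sample data are hypothetical; the inventory is the set of 17 universal part-of-speech tags used in Universal Dependencies v2.

```python
from collections import Counter

# The 17 universal POS tags of Universal Dependencies v2.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Hypothetical four-column sample: ID, FORM, LEMMA, POS.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tThe\tthe\tDET\n"
    "2\tdog\tdog\tNOUN\n"
    "3\tbarks\tbark\tVERB\n"
    "4\t.\t.\tPUNCT\n"
)


def column_values(conll_text):
    """Count the values seen in each tab-separated column."""
    columns = []
    for line in conll_text.splitlines():
        # Skip sentence separators (blank lines) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        for i, cell in enumerate(line.split("\t")):
            if i == len(columns):
                columns.append(Counter())
            columns[i][cell] += 1
    return columns


def tagset_coverage(counter, tagset):
    """Fraction of cells in a column whose value is in the tagset."""
    total = sum(counter.values())
    hits = sum(n for value, n in counter.items() if value in tagset)
    return hits / total if total else 0.0


def best_column(conll_text, tagset):
    """Return (index of best-matching column, coverage per column)."""
    scores = [tagset_coverage(c, tagset) for c in column_values(conll_text)]
    if not scores:
        return None, []
    return max(range(len(scores)), key=scores.__getitem__), scores


if __name__ == "__main__":
    index, scores = best_column(SAMPLE, UPOS_TAGS)
    print(index, scores)
```

On real corpora a simple coverage ratio like this would need thresholds and several competing inventories, and language detection would be a separate step on the word-form column; the sketch only illustrates the column-wise view of a CoNLL file.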

Cite As

Frank Abromeit and Christian Chiarcos. Automatic Detection of Language and Annotation Model Information in CoNLL Corpora. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Open Access Series in Informatics (OASIcs), Volume 70, pp. 23:1-23:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/OASIcs.LDK.2019.23

Author Details

Frank Abromeit

Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany

Christian Chiarcos

Applied Computational Linguistics Lab, Goethe University Frankfurt, Germany

Funding

The research described in this paper was conducted in the context of the Specialized Information Service Linguistics, funded by the German Research Foundation (DFG/LIS, 2017-2019). The contributions of the second author were conducted with additional support from the Horizon 2020 Research and Innovation Action "Pret-a-LLOD. Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors" (H2020-ICT-2018-2, 2019-2021).

Supplementary Materials

https://annohub.linguistik.de

