The Shortcomings of Language Tags for Linked Data When Modeling Lesser-Known Languages

Gillis-Webber, Frances; Tittel, Sabine

doi:10.4230/OASIcs.LDK.2019.4

Abstract

In recent years, modeling data from linguistic resources with the Resource Description Framework (RDF), following the Linked Data paradigm and using the OntoLex-Lemon vocabulary, has become a prevalent method for creating datasets for a multilingual web of data. An important aspect of this data modeling is the use of language tags to mark the lexicons, lexemes, word senses, etc. of a linguistic dataset. However, attempts to model data from lesser-known languages reveal significant shortcomings in the authoritative list of language codes, ISO 639: for many lesser-known languages spoken by minorities, and also for historical stages of languages, the language codes on which language tags are based are simply not available. This paper discusses these shortcomings using three such languages as examples, namely two varieties of click languages of Southern Africa and Old French, and suggests solutions for the issues identified.
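
To make the problem concrete, the following is a minimal Python sketch of what happens when a language tag is needed but no language code is available. It is an illustration only and not necessarily the solution the paper proposes: the function names, the tiny `KNOWN_CODES` set, and the label `variety1` are hypothetical, and the fallback shown is the generic private-use mechanism of BCP 47 (an `x-` prefix followed by subtags of one to eight ASCII letters or digits).

```python
import re

# Hypothetical, deliberately tiny sample; a real application would load
# the full ISO 639 code tables instead.
KNOWN_CODES = {"en", "fr", "de"}

# BCP 47 private-use subtags consist of 1 to 8 ASCII letters or digits.
_SUBTAG = re.compile(r"^[A-Za-z0-9]{1,8}$")


def language_tag(code, private_label):
    """Return `code` if it is a known language code; otherwise build a
    private-use tag ("x-...") from `private_label`."""
    if code is not None and code.lower() in KNOWN_CODES:
        return code.lower()
    parts = private_label.split("-")
    if not all(_SUBTAG.match(part) for part in parts):
        raise ValueError(f"not a valid private-use label: {private_label!r}")
    return "x-" + "-".join(part.lower() for part in parts)


def turtle_literal(text, tag):
    """Serialize a language-tagged string as a Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{tag}'


if __name__ == "__main__":
    # A language with an available code uses that code directly.
    print(turtle_literal("maison", language_tag("fr", "unused")))
    # A variety without a code falls back to a private-use tag.
    print(turtle_literal("example", language_tag(None, "variety1")))
```

A private-use tag of this kind is syntactically valid but carries no shared meaning outside the dataset that defines it, which is one reason the lack of registered codes matters for a multilingual web of data.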
