OASIcs, Volume 93

3rd Conference on Language, Data and Knowledge (LDK 2021)

LDK 2021, September 1-3, 2021, Zaragoza, Spain

Editors: Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo, and Barbara Heinisch


OASIcs, Volume 70

2nd Conference on Language, Data and Knowledge (LDK 2019)

LDK 2019, May 20-23, 2019, Leipzig, Germany

Editors: Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek, and Milan Dojchinovski

Complete Volume
OASIcs, Volume 93, LDK 2021, Complete Volume

Authors: Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo, and Barbara Heinisch

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 1-516, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo, and Barbara Heinisch

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Invited Talk
The JeuxDeMots Project (Invited Talk)

Authors: Mathieu Lafourcade

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

The JeuxDeMots project aims at building a very large knowledge base in French, both common-sense and specialized, using games, contributory approaches, and inference mechanisms. A dozen games have been designed as part of this project, each of which makes it possible to collect specific information or to consolidate the information acquired through the other games. In this presentation, the data collected and constructed since the launch of the project in the summer of 2007 will be analyzed both qualitatively and quantitatively. In particular, the following aspects will be detailed: the structure of the lexical and semantic network, some types of relations (semantic, ontological, subjective, semantic roles, associations of ideas), the annotation of relations (meta-information), semantic refinements (management of polysemy), and the creation of clusters allowing the representation of richer knowledge (n-argument relations) that make an implicit neural network. Finally, I will describe some complementary acquisition methods and applications, such as a bot for endogenous contributions, a chatbot making inferences, and semantic extraction from texts.

Mathieu Lafourcade. The JeuxDeMots Project (Invited Talk). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Invited Talk
A Smell is Worth a Thousand Words: Olfactory Information Extraction and Semantic Processing in a Multilingual Perspective (Invited Talk)

Authors: Sara Tonelli

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

More than any other sense, smell is linked directly to our emotions and our memories. However, smells are intangible and very difficult to preserve, making it hard to effectively identify, consolidate, and promote the wide-ranging role scents and smelling have in our cultural heritage. While some novel approaches have been recently proposed to monitor so-called urban smellscapes and analyse the olfactory dimension of our environments (Quercia et al., 2015), when it comes to smellscapes from the past, little research has been done to keep track of how places, events and people have been described from an olfactory perspective. Fortunately, some key prerequisites for addressing this problem are now in place. In recent years, European cultural heritage institutions have invested heavily in large-scale digitisation: we hold a wealth of object, text and image data which can now be analysed using artificial intelligence. What remains missing is a methodology for the extraction of scent-related information from large amounts of text, as well as a broader awareness of the wealth of historical olfactory descriptions, experiences and memories contained within the heritage datasets. In this talk, I will describe ongoing activities towards this goal, focused on text mining and semantic processing of olfactory information. I will present the general framework designed to annotate smell events in documents, and some preliminary results on information extraction approaches in a multilingual scenario. I will discuss the main findings and the challenges related to modelling textual descriptions of smells, including the metaphorical use of smell-related terms and the well-known limitations of smell vocabulary in European languages compared to other senses.

Sara Tonelli. A Smell is Worth a Thousand Words: Olfactory Information Extraction and Semantic Processing in a Multilingual Perspective (Invited Talk). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, p. 2:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Invited Talk
Free/Open-Source Machine Translation for the Low-Resource Languages of Spain (Invited Talk)

Authors: Mikel L. Forcada

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

While machine translation has historically been rule-based, that is, based on dictionaries and rules written by experts, most present-day machine translation is corpus-based. In the last few years, statistical machine translation, the dominant corpus-based approach, has been displaced by neural machine translation in most applications, in view of the better results reported, particularly for languages with very different syntax. But both statistical and neural machine translation need to be trained on large amounts of parallel data, that is, sentences in one language carefully paired with their translations in the other language, and this is a resource that may not be available for some low-resource languages. While some of the languages of Spain (Basque, Catalan, Galician) may be considered to be reasonably endowed with parallel corpora connecting them to Spanish or even to English, and are well served by machine translation systems, there are many other languages which cannot afford them, such as Aranese Occitan, Aragonese, or Asturian/Leonese. Fortunately, languages in this last group belong to the Romance language family, as Spanish does, and this makes translation from and into Spanish under a rule-based paradigm the only feasible approach. After briefly describing the main machine translation paradigms, I will describe the Apertium free/open-source rule-based machine translation platform, which has been used to build machine translation systems for these low-resource languages of Spain, indeed, sometimes the only ones available. The free/open-source setting has made linguistic data for these languages available for anyone in their linguistic communities to build other linguistic technologies for these low-resourced languages. For example, the Apertium family of bilingual and monolingual data has been converted into RDF and made accessible on the Web as linked data.
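
As a rough illustration of what "based on dictionaries and rules written by experts" means, the following minimal sketch combines a hand-written bilingual lexicon with a single reordering rule. It is a toy under stated assumptions, not Apertium's pipeline or data format: the `LEXICON` entries, the tag names, the `translate` function, and the choice of Spanish to English as the language pair are all hypothetical and chosen only for readability.

```python
# Toy rule-based translation, Spanish -> English (illustrative only, not Apertium).
# Hypothetical hand-written bilingual lexicon: source word -> (target word, part of speech).
LEXICON = {
    "la": ("the", "DET"),
    "casa": ("house", "NOUN"),
    "blanca": ("white", "ADJ"),
    "es": ("is", "VERB"),
    "grande": ("big", "ADJ"),
}


def translate(sentence: str) -> str:
    # Step 1, lexical transfer: look up each word in the dictionary.
    # Unknown words are passed through, marked with a leading asterisk.
    tagged = [LEXICON.get(w, ("*" + w, "UNK")) for w in sentence.lower().split()]

    # Step 2, structural transfer: one expert-written rule,
    # Spanish NOUN ADJ becomes English ADJ NOUN.
    out = []
    i = 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "NOUN" and tagged[i + 1][1] == "ADJ":
            out.extend([tagged[i + 1], tagged[i]])
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    return " ".join(word for word, _ in out)


print(translate("la casa blanca es grande"))  # the white house is big
```

No parallel corpus is involved: coverage depends entirely on the lexicon and rules that someone writes, which is why closely related languages, needing fewer structural rules, suit this paradigm.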

Mikel L. Forcada. Free/Open-Source Machine Translation for the Low-Resource Languages of Spain (Invited Talk). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, p. 3:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Crazy New Idea
A Computational Simulation of Children’s Language Acquisition (Crazy New Idea)

Authors: Ben Ambridge

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

Many modern NLP models are already close to simulating children’s language acquisition; the main thing they currently lack is a "real world" representation of semantics that allows them to map from form to meaning and vice-versa. The aim of this "Crazy Idea" is to spark a discussion about how we might get there.

Ben Ambridge. A Computational Simulation of Children’s Language Acquisition (Crazy New Idea). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 4:1-4:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Crazy New Idea
Get! Mimetypes! Right! (Crazy New Idea)

Authors: Christian Chiarcos

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

This paper identifies three technical requirements - availability of data, sustainable hosting and resolvable URIs for hosted data - as minimal pre-conditions for Linguistic Linked Open Data technology to develop towards a mature technological ecosystem that third party applications can build upon. While a critical amount of data is available (and it continues to grow), there does not seem to exist a hosting solution that combines the prospects of long-term availability with an unrestricted capability to support resolvable URIs. In particular, data hosting services currently do not allow data to be declared as RDF content by means of their media type (MIME type), so that the capability of clients to recognize formats and to resolve URIs on that basis is severely limited.
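
To make the media-type point concrete, here is a minimal sketch of how a client might decide, from an HTTP `Content-Type` header alone, whether a response is RDF and in which serialization. The media types listed are the registered ones for these RDF formats; the function name `rdf_format_from_content_type` and the short format labels are assumptions introduced for illustration and are not taken from the paper.

```python
from typing import Optional

# Registered media types for common RDF serializations,
# mapped to hypothetical short labels used only in this sketch.
RDF_MEDIA_TYPES = {
    "text/turtle": "turtle",
    "application/rdf+xml": "rdf-xml",
    "application/ld+json": "json-ld",
    "application/n-triples": "n-triples",
    "application/n-quads": "n-quads",
    "application/trig": "trig",
}


def rdf_format_from_content_type(content_type: str) -> Optional[str]:
    """Return a serialization label, or None if the header does not declare RDF."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return RDF_MEDIA_TYPES.get(media_type)


print(rdf_format_from_content_type("text/turtle; charset=utf-8"))  # turtle
print(rdf_format_from_content_type("text/plain"))                  # None
print(rdf_format_from_content_type("application/octet-stream"))    # None
```

When a hosting service serves a Turtle file as `text/plain` or `application/octet-stream`, a client relying on the header in this way cannot tell that the payload is RDF, which is the limitation the abstract describes.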

Christian Chiarcos. Get! Mimetypes! Right! (Crazy New Idea). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 5:1-5:4, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Crazy New Idea
Mind the Gap: Language Data, Their Producers, and the Scientific Process (Crazy New Idea)

Authors: Tobias Weber

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

This paper discusses the role of low-resource languages in NLP through the lens of different stakeholders. It argues that the current "consumerist approach" to language data reinforces a vicious circle which increases the technological exclusion of minority communities. Researchers' decisions directly affect these processes to the detriment of minorities and practitioners engaging in language work in these communities. In line with the conference topic, the paper concludes with strategies and prerequisites for creating a positive feedback loop in our research benefiting language work within the next decade.

Tobias Weber. Mind the Gap: Language Data, Their Producers, and the Scientific Process (Crazy New Idea). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 6:1-6:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Representing the Under-Represented: a Dataset of Post-Colonial, and Migrant Writers

Authors: Marco Antonio Stranisci, Viviana Patti, and Rossana Damiano

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

In today’s media and in the Web of Data, non-Western people still suffer from a lack of representation. In our work, we address this issue by presenting a pipeline for collecting and semantically encoding Wikipedia biographies of writers who are under-represented due to their non-Western origins or their legal status in a country. The two main components of the ontology will be described, together with a framework for mapping textual biographies to their corresponding semantic representations. A description of the data set and some examples of the conversion of biographical texts to the ontology classes will be provided.

Marco Antonio Stranisci, Viviana Patti, and Rossana Damiano. Representing the Under-Represented: a Dataset of Post-Colonial, and Migrant Writers. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup

Authors: Laura Sinikallio, Senka Drobac, Minna Tamper, Rafael Leal, Mikko Koho, Jouni Tuominen, Matti La Mela, and Eero Hyvönen

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

This paper presents a knowledge graph created by transforming the plenary debates of the Parliament of Finland (1907-) into Linked Open Data (LOD). The data, totaling over 900 000 speeches, with automatically created semantic annotations and rich ontology-based metadata, are published in a Linked Open Data Service and are used via a SPARQL API and as data dumps. The speech data is part of the larger LOD publication FinnParla, which also includes prosopographical data about the politicians. The data is being used for studying parliamentary language and culture in Digital Humanities in several universities. To serve a wider variety of users, the entirety of this data was also produced using Parla-CLARIN markup. We present the first publication of all Finnish parliamentary debates as data. Technical novelties in our approach include the use of both Parla-CLARIN and an RDF schema developed for representing the speeches, integration of the data into a new Parliament of Finland Ontology for deeper data analyses, and enriching the data with a variety of external national and international data sources.

Laura Sinikallio, Senka Drobac, Minna Tamper, Rafael Leal, Mikko Koho, Jouni Tuominen, Matti La Mela, and Eero Hyvönen. Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 8:1-8:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Towards a Corpus of Historical German Plays with Emotion Annotations

Authors: Thomas Schmidt, Katrin Dennerlein, and Christian Wolff

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

In this paper, we present first work-in-progress annotation results of a project investigating computational methods of emotion analysis for historical German plays around 1800. We report on the development of an annotation scheme focussing on the annotation of emotions that are important from a literary studies perspective for this time span, as well as on the annotation process we have developed. We annotate emotions expressed or attributed by characters of the plays in the written texts. The scheme consists of 13 hierarchically structured emotion concepts as well as the source (who experiences or attributes the emotion) and target (who or what the emotion is directed towards). We have conducted the annotation of five example plays of our corpus with two annotators per play and report on annotation distributions and agreement statistics. We were able to collect over 6,500 emotion annotations and identified a fair agreement for most concepts, around a κ-value of 0.4. We discuss how we plan to improve annotator consistency and continue our work. The results also have implications for similar projects in the context of Digital Humanities.
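
For readers unfamiliar with κ as an agreement statistic, the sketch below computes Cohen's kappa for two annotators. It rests on assumptions: the abstract does not say which kappa variant the authors used, and the labels `joy` and `fear` and the ten-item sample are hypothetical data, constructed only so that the result comes out near 0.4.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label at random,
    # given each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: the annotators agree on 7 of 10 items.
annotator_1 = ["joy"] * 5 + ["fear"] * 5
annotator_2 = ["joy"] * 4 + ["fear"] * 4 + ["joy"] * 2

print(round(cohen_kappa(annotator_1, annotator_2), 2))  # 0.4
```

Here the raw agreement is 0.7, but the agreement expected by chance is 0.5, so kappa is (0.7 - 0.5) / (1 - 0.5) = 0.4. This shows why a κ around 0.4 can sit alongside a considerably higher raw agreement rate.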

Thomas Schmidt, Katrin Dennerlein, and Christian Wolff. Towards a Corpus of Historical German Plays with Emotion Annotations. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 9:1-9:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Enriching a Lexical Resource for French Verbs with Aspectual Information

Authors: Anna Kupść, Pauline Haas, Rafael Marín, and Antonio Balvet

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

The paper presents a syntactico-semantic lexicon of over a thousand French verbs. It has been created by manually adding lexical aspect features to verb frames from TreeLex [Kupść and Abeillé, 2008]. We present how the original syntactic resource has been adapted to the current project, our aspect assignment procedure and an overview of the resulting lexical resource.

Anna Kupść, Pauline Haas, Rafael Marín, and Antonio Balvet. Enriching a Lexical Resource for French Verbs with Aspectual Information. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 10:1-10:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Annotation of Fine-Grained Geographical Entities in German Texts

Authors: Julián Moreno-Schneider, Melina Plakidis, and Georg Rehm

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)

We work on the creation of a corpus, crawled from the internet, on the Berlin district of Moabit, primarily meant for training NER systems in German and English. Typical NER corpora and corresponding systems distinguish persons, organisations and locations, but do not distinguish different types of location entities. For our tourism-inspired use case, we need fine-grained annotations for toponyms. In this paper, we outline the fine-grained classification of geographical entities and the resulting annotations, and we present preliminary results on automatically tagging toponyms in a small, bootstrapped gold corpus.

Julián Moreno-Schneider, Melina Plakidis, and Georg Rehm. Annotation of Fine-Grained Geographical Entities in German Texts. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 11:1-11:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

