License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/DagSemProc.06491.7
URN: urn:nbn:de:0030-drops-10467

Ensslin, Astrid ; Durrell, Martin ; Bennett, Paul

GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora



Our paper focuses on the challenges posed by the structural variability, flexibility and ambiguity found in historical corpora, and evaluates methods of dealing with them.
We are currently engaged in a project which aims to compile a representative corpus of German for the period 1650-1800. Drawing on exemplary data from the first stage of this project (1650-1700), which consists of newspaper texts from this period, we first aim, from the perspective of corpus linguistics, to identify the problems associated with the morphological, syntactic and graphemic peculiarities that are characteristic of that stage. Specific phenomena which significantly complicate automatic tagging, lemmatisation and parsing include, for instance, "abperlende" (Admoni 1980; Demske-Neumann 1990), i.e. complex and often asyndetic syntax; non-syntactic, prosodic, virgulated punctuation (Demske et al. 2004; cf. Stolt 1990); inflectional variability (e.g. Admoni 1990; Besch & Wegera 1987); and partly unsystematic, almost experimental allomorphic and allographic (Kettmann 1992) diversity.
Secondly, we outline a methodology intended to facilitate the construction and annotation of corpora that antedate linguistic standardisation. This methodology is informed by "conventional" and innovative tagging techniques and tools, which are evaluated in terms of utility and accuracy. Finally, we attempt to evaluate the degree to which annotation tools can be developed for specialist corpora of this kind that will substitute for manual or semi-automated annotation.

BibTeX - Entry

  author =	{Ensslin, Astrid and Durrell, Martin and Bennett, Paul},
  title =	{{GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora}},
  booktitle =	{Digital Historical Corpora- Architecture, Annotation, and Retrieval},
  pages =	{1--2},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2007},
  volume =	{6491},
  editor =	{Lou Burnard and Milena Dobreva and Norbert Fuhr and Anke L\"{u}deling},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-10467},
  doi =		{10.4230/DagSemProc.06491.7},
  annote =	{Keywords: Early Modern German; newspaper corpus; GerManC; variation; annotation; tagging}

Keywords: Early Modern German; newspaper corpus; GerManC; variation; annotation; tagging
Collection: 06491 - Digital Historical Corpora- Architecture, Annotation, and Retrieval
Issue Date: 2007
Date of publication: 13.06.2007
