GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora

Authors: Astrid Ensslin, Martin Durrell, Paul Bennett


Cite As

Astrid Ensslin, Martin Durrell, and Paul Bennett. GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora. In Digital Historical Corpora - Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)


Our paper examines the challenges posed by the structural variability, flexibility and ambiguity found in historical corpora, and evaluates methods of dealing with them. We are currently engaged in a project which aims to compile a representative corpus of German for the period 1650-1800. Drawing on exemplary data from the first stage of this project (1650-1700), which consists of newspaper texts, we first aim to identify, from the perspective of corpus linguistics, the problems associated with the morphological, syntactic and graphemic peculiarities characteristic of that stage. Specific phenomena which significantly complicate automatic tagging, lemmatisation and parsing include, for instance, "abperlende" (Admoni 1980; Demske-Neumann 1990), i.e. complex and often asyndetic syntax; non-syntactic, prosodic, virgulated punctuation (Demske et al. 2004; cf. Stolt 1990); inflectional variability (e.g. Admoni 1990; Besch & Wegera 1987); and partly unsystematic, almost experimental allomorphic and allographic (Kettmann 1992) diversity. Secondly, we outline a methodology intended to facilitate the construction and annotation of corpora that antedate linguistic standardisation. This methodology is informed by "conventional" and innovative tagging techniques and tools, which are evaluated in terms of utility and accuracy. Finally, we attempt to evaluate the degree to which annotation tools can be developed for specialist corpora of this kind that will substitute for manual or semi-automated annotation.
  • Keywords: Early Modern German; newspaper corpus; GerManC; variation; annotation; tagging

