When quoting this document, please refer to the following
URN: urn:nbn:de:0030-drops-10467
URL: http://drops.dagstuhl.de/opus/volltexte/2007/1046/


Ensslin, Astrid ; Durrell, Martin ; Bennett, Paul

GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora

pdf-format:
Document 1.pdf (40 KB)


Abstract

Our paper focuses on the challenges posed by the structural variability, flexibility and ambiguity found in historical corpora, and evaluates methods of dealing with them. We are currently engaged in a project which aims to compile a representative corpus of German for the period 1650-1800. Looking at exemplary data from the first stage of this project (1650-1700), which consists of newspaper texts from this period, we first aim to identify, from the perspective of corpus linguistics, the problems associated with the morphological, syntactic and graphemic peculiarities that are characteristic of that stage. Specific phenomena which significantly complicate automatic tagging, lemmatisation and parsing include, for instance, ‘abperlende’ (Admoni 1980; Demske-Neumann 1990), i.e. complex and often asyndetic syntax; non-syntactic, prosodic, virgulated punctuation (Demske et al. 2004; cf. Stolt 1990); inflectional variability (e.g. Admoni 1990; Besch & Wegera 1987); and partly unsystematic, almost experimental allomorphic and allographic (Kettmann 1992) diversity. Secondly, we outline a methodology intended to facilitate the construction and annotation of corpora that antedate linguistic standardisation. This methodology is informed by ‘conventional’ and innovative tagging techniques and tools, which are evaluated in terms of utility and accuracy. Finally, we attempt to evaluate the degree to which annotation tools can be developed for specialist corpora of this kind that will substitute for manual or semi-automated annotation.

BibTeX - Entry

@InProceedings{ensslin_et_al:DSP:2007:1046,
  author =	{Astrid Ensslin and Martin Durrell and Paul Bennett},
  title =	{GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora},
  booktitle =	{Digital Historical Corpora- Architecture, Annotation, and Retrieval},
  year =	{2007},
  editor =	{Lou Burnard and Milena Dobreva and Norbert Fuhr and Anke L{\"u}deling },
  number =	{06491},
  series =	{Dagstuhl Seminar Proceedings},
  ISSN =	{1862-4405},
  publisher =	{Internationales Begegnungs- und Forschungszentrum f{\"u}r Informatik (IBFI), Schloss Dagstuhl, Germany},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2007/1046},
  annote =	{Keywords: Early Modern German; newspaper corpus; GerManC; variation; annotation; tagging}
}

Keywords: Early Modern German; newspaper corpus; GerManC; variation; annotation; tagging
Seminar: 06491 - Digital Historical Corpora- Architecture, Annotation, and Retrieval
Issue Date: 2007
Date of publication: 13.06.2007


Published by LZI