Information Access to Historical Documents from the Early New High German Period

Authors: Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, and Christiane Wanzeck

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

With the new interest in historical documents, it has become clear that electronic access to these texts poses many specific problems. In the first part of the paper we survey the present role of digital historical documents. After collecting central facts and observations on historical language change, we comment on the difficulties that result for retrieval and data mining on historical texts. In the second part of the paper we report on our own work in the area, with a focus on special matching strategies that help to relate modern-language keywords to old variants. The basis of our studies is a collection of documents from the Early New High German period. These texts come with a very rich spectrum of word variants and spelling variations.

Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, and Christiane Wanzeck. Information Access to Historical Documents from the Early New High German Period. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

06491 Abstracts Collection – Digital Historical Corpora- Architecture, Annotation, and Retrieval

Authors: Lou Burnard, Milena Dobreva, Norbert Fuhr, and Anke Lüdeling

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

From 03.12.06 to 08.12.06, the Dagstuhl Seminar 06491 "Digital Historical Corpora - Architecture, Annotation, and Retrieval" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.

Lou Burnard, Milena Dobreva, Norbert Fuhr, and Anke Lüdeling. 06491 Abstracts Collection – Digital Historical Corpora- Architecture, Annotation, and Retrieval. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

06491 Summary – Digital Historical Corpora- Architecture, Annotation, and Retrieval

Authors: Lou Burnard, Milena Dobreva, Norbert Fuhr, and Anke Lüdeling

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

The seminar "Digital Historical Corpora" brought together scholars from (historical) linguistics, (historical) philology, computational linguistics and computer science who work with collections of historical texts. The issues that were discussed include digitization, corpus design, corpus architecture, annotation, search, and retrieval.

Lou Burnard, Milena Dobreva, Norbert Fuhr, and Anke Lüdeling. 06491 Summary – Digital Historical Corpora- Architecture, Annotation, and Retrieval. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-5, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

A Cross-Language Approach to Historic Document Retrieval

Authors: Jaap Kamps, Marijn Koolen, Frans Adriaans, and Maarten de Rijke

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives, like DigiCULT, make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience. Natural languages evolve over time, changing in pronunciation and spelling, and new words are introduced continuously, while older words may disappear out of everyday use. For these reasons, queries involving modern words may not be very effective for retrieving documents that contain many historic terms. Although reading a 300-year-old document might not be problematic because the words are still recognizable, the changes in vocabulary and spelling can make it difficult to use a search engine to find relevant documents. To illustrate this, consider the following example from our collection of 17th century Dutch law texts. Looking for information on the tasks of a lawyer (modern Dutch: "advocaat") in these texts, the modern spelling will not lead you to documents containing the 17th century Dutch spelling variant "advocaet". Since spelling rules were not introduced until the 19th century, 17th century Dutch spelling is inconsistent. Being based mainly on pronunciation, words were often spelled in several different variants, which poses a problem for standard retrieval engines. We therefore define Historic Document Retrieval (HDR) as the retrieval of relevant historic documents for a modern query. Our approach to this problem is to treat the historic and modern languages as different languages, and use cross-language information retrieval (CLIR) techniques to translate one language into the other.
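
The idea of bridging modern and historic spellings can be illustrated with a minimal sketch. It assumes a small hand-written table of modern-to-historic letter rewrite rules and expands a modern query term into candidate historic spellings. Only the aa/ae pair comes from the advocaat/advocaet example above; the rule table and the function name expand_query_term are hypothetical and are not the authors' method.

```python
from itertools import product

# Hypothetical modern -> historic rewrite rules. Only ("aa", "ae") is
# motivated by the advocaat/advocaet example; a real system would
# need a much larger, data-derived rule set.
REWRITE_RULES = [("aa", "ae")]


def expand_query_term(term, rules=REWRITE_RULES):
    """Return the term plus all variants obtained by optionally applying
    each rule at each position where its left-hand side occurs."""
    variants = {term}
    for modern, historic in rules:
        new_variants = set()
        for variant in variants:
            parts = variant.split(modern)
            # Each join point can keep the modern spelling or take the historic one.
            for choice in product((modern, historic), repeat=len(parts) - 1):
                rebuilt = parts[0]
                for separator, part in zip(choice, parts[1:]):
                    rebuilt += separator + part
                new_variants.add(rebuilt)
        variants = new_variants
    return sorted(variants)


print(expand_query_term("advocaat"))
```

The expanded list could then be passed to an ordinary search engine as a disjunctive query, so that the index itself does not have to change.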

Jaap Kamps, Marijn Koolen, Frans Adriaans, and Maarten de Rijke. A Cross-Language Approach to Historic Document Retrieval. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

A Multifunctional Historical Document Research System

Authors: Eva Dyllong

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

In this talk, the key components of a multifunctional historical document research system are discussed. An ongoing project which aims at creating a representative corpus of documents that reflect the impact of the German philosopher Friedrich Nietzsche in the period 1865-1945 forms the case study for the system. The realisation of the system includes several working fields: the collection of relevant historical documents, the digitization and choice of suitable library-oriented data standards for archival storage, the design and implementation of a database, the development of fuzzy techniques for searching documents with non-standard orthography, the preparation of communication, annotation and visualisation tools, and the design of a user interface adapted to a heterogeneous user group ranging from interested amateurs to experts.

Eva Dyllong. A Multifunctional Historical Document Research System. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Creation of a Digital Corpus of Bulgarian Dialects

Authors: Nikola Ikonomov and Milena Dobreva

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

The paper presents our considerations related to the creation of a digital corpus of Bulgarian dialects. The dialectological archive of the Bulgarian language consists of more than 250 audio tapes. All tapes were recorded between 1955 and 1965 in the course of regular dialectological expeditions throughout the country. The records typically contain interviews with inhabitants of small villages in Bulgaria. The topics covered are usually related to such issues as birth, everyday life, marriage, family relationships, death, etc. Only a few tapes contain folk songs from different regions of the country. Taking into account the progressive deterioration of the magnetic media and the realistic prospect of data loss, the Institute for Bulgarian Language at the Academy of Sciences launched a project in 1997 aiming at the restoration and digital preservation of the dialectological archive. Within the framework of this project more than half of the records were digitized, de-noised and stored on digital recording media. Since then, restoration and digitization activities have been carried out in the Institute on a regular basis. As a result, a large collection of sound files has been gathered. Our further efforts are aimed at the creation of a digital corpus of Bulgarian dialects, which will be made available for phonological and linguistic research. Besides the sound files, such corpora typically include two basic elements: a transcription, aligned with the sound file, and a set of standardized metadata that defines the corpus. In our work we present considerations on how these tasks could be realized in the case of the corpus of Bulgarian dialects. Our suggestions are based on a comparative analysis of existing methods and techniques for building such corpora, and on selecting those that best fit the particular needs. Our experience can be used in similar institutions storing folklore archives, history-related spoken records, etc.

Nikola Ikonomov and Milena Dobreva. CREATION OF A DIGITAL CORPUS OF BULGARIAN DIALECTS. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-4, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

DeutschDiachronDigital - A Diachronic Corpus of German

Authors: Anke Lüdeling

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

The talk describes the design and the architecture of a diachronic corpus of German.

Anke Lüdeling. DeutschDiachronDigital - A Diachronic Corpus of German. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora

Authors: Astrid Ensslin, Martin Durrell, and Paul Bennett

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

Our paper focuses on the challenges posed by the structural variability, flexibility and ambiguity found in historical corpora, and evaluates methods of dealing with them. We are currently engaged in a project which aims to compile a representative corpus of German for the period 1650-1800. Looking at exemplary data from the first stage of this project (1650-1700), which consists of newspaper texts from this period, we first aim from the perspective of corpus linguistics to identify the problems associated with the morphological, syntactical and graphemic peculiarities that are characteristic of that particular stage. Specific phenomena which significantly complicate automatic tagging, lemmatisation and parsing include, for instance, "abperlende" (Admoni 1980; Demske-Neumann 1990), i.e. complex and often asyndetic syntax; non-syntactic, prosodic, virgulated punctuation (Demske et al. 2004; cf. Stolt 1990); inflectional variability (e.g. Admoni 1990; Besch & Wegera 1987); as well as partly unsystematic and almost experimental allomorphic and allographic (Kettmann 1992) diversity. Secondly, we outline a methodology which is intended to facilitate the construction and annotation of such corpora, which antedate linguistic standardisation. This is informed by "conventional" and innovative tagging techniques and tools, which are evaluated in terms of utility and accuracy. Finally, we attempt to evaluate the degree to which annotation tools can be developed for specialist corpora of this kind that will substitute for manual or semi-automated annotation.

Astrid Ensslin, Martin Durrell, and Paul Bennett. GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Guideline: Multiple Hierarchies

Authors: Andreas Witt

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

As the title of the Dagstuhl Seminar "Digital Historical Corpora - Architecture, Annotation, and Retrieval" already suggests, corpus architecture and corpus annotation are important topics for representing (historical) texts. In particular, the limitation of SGML-based markup languages to tree-structured annotations raises a special problem when dealing with manuscripts: how is it possible to represent overlap? This problem was addressed by the Text Encoding Initiative (TEI) and by several scholars. This text gives an overview of several techniques for handling the overlap problem.
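
To make the overlap problem concrete, here is a minimal sketch of standoff annotation, one widely used way to record hierarchies that do not nest in a single tree. It is not necessarily one of the techniques the text surveys. The sample text, the span offsets, and the helper name overlaps are illustrative assumptions.

```python
# Two independent hierarchies over the same text, stored as character
# offsets (start inclusive, end exclusive) instead of nested tags.
text = "Sing, goddess, the wrath of Achilles"

# Hypothetical layout: physical lines of a manuscript page.
lines = [(0, 18), (19, 36)]
# Hypothetical analysis: syntactic phrases.
phrases = [(0, 14), (15, 36)]


def overlaps(a, b):
    """True if spans a and b partially overlap, i.e. they intersect
    but neither contains the other, so they cannot nest in one tree."""
    (a_start, a_end), (b_start, b_end) = a, b
    intersect = a_start < b_end and b_start < a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return intersect and not (a_in_b or b_in_a)


conflicts = [(line, phrase)
             for line in lines
             for phrase in phrases
             if overlaps(line, phrase)]

for line, phrase in conflicts:
    print(repr(text[slice(*line)]), "overlaps", repr(text[slice(*phrase)]))
```

Here the first line ends inside the second phrase, which is exactly the configuration a single tree of nested elements cannot express directly.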

Andreas Witt. Guideline: Multiple Hierarchies. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Information-Analytical System "Manuscript": technologies and tools of creation of electronic collections of ancient and medieval documents

Authors: Victor Baranov

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

The paper is devoted to the possibilities of the Manuscript system (http://manuscripts.ru/) designed for preparation of electronic scientific publications of ancient manuscripts on the Internet. The primary consideration is given to the specialized modules of the system ensuring 1) input, storage, editing and processing of materials in the database, 2) textologic, linguistic and paleographic analyses of manuscripts/texts and 3) preparation of dummy copies and publication of manuscripts and research apparatus. All modules interact with a common database allowing processing text/manuscript units organized into hierarchies and nets, their relationships and values that adequately reflect modeled objects and their relationships. The report also shows the possibilities of the system modules for a comprehensive study of texts and their units.

Victor Baranov. Information-Analytical System "Manuscript": technologies and tools of creation of electronic collections of ancient and medieval documents. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Joseph Wright's English Dialect Dictionary (1898-1905) Computerised: architecture and retrieval routine

Authors: Manfred Markus

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

The Innsbruck government-funded project SPEED (Spoken English in Early Dialects), scheduled for 2006 to 2009, has the aim of digitising and evaluating the famous English Dialect Dictionary by Joseph Wright (1898-1905). This paper discusses the value of the electronic version of the dictionary, the problems of its complex architecture, and the intended retrieval routine. The paper is an elaborated version of the PowerPoint presentation delivered at the conference. First of all, I try to demonstrate the great value of Wright's dictionary from the point of view of English studies. In addition, given the mixed nature of the participants of the Dagstuhl conference, the paper tackles interface problems that typically arise when printed texts are computerised, ranging from "normalisation" to aspects of parsing and of the design of the query mask.

Manfred Markus. Joseph Wright's English Dialect Dictionary (1898-1905) Computerised: architecture and retrieval routine. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

New tricks from an old dog: An overview of TEI P5

Authors: Lou Burnard

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

This paper presents an update on the current state of development of the Text Encoding Initiative’s Guidelines for Electronic Text Encoding and Interchange. Since the last major edition in 2002, which saw the conversion of the Guidelines into XML, there has been substantial activity on adding new content in areas of particular interest to historical corpus builders. The TEI has also reinvented itself as a membership initiative and set up mechanisms for the continued development and maintenance of the Guidelines. We contrast "old" and "new" TEI, and give a brief overview of some recent technical enhancements to the system intended to facilitate expansion and customization of the scheme.

Lou Burnard. New tricks from an old dog: An overview of TEI P5. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Rule-based search in historical text databases - Visualization techniques

Authors: Wolfram Luther

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

The talk describes several techniques used to visualize, among other aspects, the productivity of rule sets in deriving non-standard spellings. Treemaps and similar visualizations help to find typical replacement sequences depending on the region and the epoch of the spellings. The study conducted shows that treemaps ease the understanding of rule hierarchies, the detection of productive and non-productive rules, and the evaluation of a rule's importance. They also provide better search performance. An interactive visualization over a map shows isoglosses running between different regions of Germany and clusters text samples of different epochs and their writings. Furthermore, allograph variants are displayed using adequate data types.

Wolfram Luther. Rule-based search in historical text databases - Visualization techniques. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Searching in text databases with non-standard orthography

Authors: Thomas Pilz

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

In this paper we present research results of the recent project "Rule based search in text data bases with non-standard orthography". Numerous steps are involved in getting from a facsimile to a searchable text document. This paper focuses on techniques to ensure better retrieval results on historical texts with non-standard spellings. Historical documents, especially those in black letter fonts, are prone to recognition errors. Adequate preparation of the image sources prior to OCR can reduce the number of misinterpreted characters. Furthermore, the application of a search engine with categorized distance measures between user interface and text database can help to enhance retrieval results. Specific metrics cover problems in optical character recognition, transcription and historical spelling variation. With a synoptic view interface the users can be kept completely unaware of the methods applied after their queries.
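
A distance measure with category-specific costs might look like the following sketch. It is a plain weighted Levenshtein distance in which substitutions listed in a confusion table are cheaper than arbitrary substitutions, with one table per error category. The tables, the cost values, and the function names are assumptions made for illustration, not the project's actual metrics.

```python
# Hypothetical confusion tables, one per error category. Costs below 1.0
# make these substitutions cheaper than an arbitrary replacement.
OCR_CONFUSIONS = {("f", "s"): 0.3, ("u", "n"): 0.3}
SPELLING_VARIANTS = {("v", "u"): 0.2, ("y", "i"): 0.2}


def substitution_cost(a, b, table):
    """Cost of replacing character a by b; table entries are symmetric."""
    if a == b:
        return 0.0
    return table.get((a, b), table.get((b, a), 1.0))


def weighted_distance(source, target, table):
    """Levenshtein distance with table-driven substitution costs;
    insertions and deletions always cost 1.0."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,       # delete s_char
                current[j - 1] + 1.0,    # insert t_char
                previous[j - 1] + substitution_cost(s_char, t_char, table),
            ))
        previous = current
    return previous[-1]


# The same word pair scores differently depending on the category applied.
print(weighted_distance("vnd", "und", SPELLING_VARIANTS))
print(weighted_distance("vnd", "und", OCR_CONFUSIONS))
```

Keeping one table per category lets a search front end decide which kinds of deviation to tolerate for a given collection, for example OCR confusions only for automatically recognised texts.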

Thomas Pilz. Searching in text databases with non-standard orthography. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Tagging Historical Corpora - the problem of spelling variation

Authors: Paul Rayson, Dawn Archer, Alistair Baron, and Nicholas Smith

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics, information retrieval and natural language processing tasks that use "standard" or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) periods represent sentence boundaries or acronyms and (ii) apostrophes represent quote marks or contractions (Grefenstette and Tapanainen, 1994; Grefenstette, 1999). The issue of spelling variation becomes more problematic when utilising corpus linguistic techniques on non-standard varieties of English, not least because variation can be due to differences in spelling habits, transcription or compositing practices, and morpho-syntactic customs, as well as "misspelling". Examples of non-standard varieties include:
- Scottish English [1] (Anderson et al., forthcoming), and dialects such as Tyneside English [2] (Allen et al., forthcoming)
- Early Modern English (Archer and Rayson, 2004; Culpeper and Kytö, 2005)
- Emerging varieties such as SMS or CMC in weblogs (Ooi et al., 2006)
In the Dagstuhl workshop we focussed on historical corpora. Vast quantities of searchable historical material are being created in electronic form through large digitisation initiatives already underway, e.g. Open Content Alliance [3], Google Book Search [4], and Early English Books Online [5]. Annotation, typically at the part-of-speech (POS) level, is carried out on modern corpora for linguistic analysis, information retrieval and natural language processing tasks such as named entity extraction. Increasingly researchers wish to carry out similar tasks on historical data (Nissim et al, 2004). However, historical data is considered noisy for tasks such as this. The problems faced when applying corpus annotation tools trained on modern language data to historical texts are the motivation for the research described in this paper.
Previous research has adopted an approach of adding historical variants to the POS tagger lexicon, for example in TreeTagger annotation of GerManC (Durrell et al, 2006), or "back-dating" the lexicon in the Constraint Grammar Parser of English (ENGCG) when annotating the Helsinki corpus (Kytö and Voutilainen, 1995). Our aim was to develop an historical semantic tagger in order to facilitate similar studies on historical data to those that we had previously been performing on modern data using the USAS semantic analysis system (Rayson et al, 2004). The USAS tool relies on POS tagging as a prerequisite to carrying out semantic disambiguation. Hence we were faced with the task of retraining or back-dating two tools, a POS tagger and a semantic tagger. Our proposed solution incorporates a corpus pre-processor for detecting historical spelling variants and inserting modern equivalents alongside them. This enables retrieval as well as annotation tasks and to some extent avoids the need to retrain each annotation tool that is applied to the corpus. The modern tools can then be applied to the modern spelling equivalents rather than the historical variants, and thereby achieve higher levels of accuracy. The resulting variant detector tool (VARD) employs a number of techniques derived from spell-checking tools as we wished to evaluate their applicability to historical data. The current version of the tool uses known-variant lists, SoundEx, edit distance and letter replacement heuristics to match Early Modern English variants with modern forms. The techniques are combined using a scoring mechanism to enable preferred candidates to be selected using likelihood values. The current known-variant lists and letter replacement rules are manually created. In a cross-language study with English and German texts we found that similar techniques could be used to derive letter replacement heuristics from corpus examples (Pilz et al, forthcoming).
Our experiments show that VARD can successfully deal with:
- Apostrophes signalling missing letter(s) or sound(s): ’fore ("before"), hee’l ("he will")
- Irregular apostrophe usage: again’st ("against"), whil’st ("whilst")
- Contracted forms: ’tis ("it is"), thats ("that is"), youle ("you will"), t’anticipate ("to anticipate")
- Hyphenated forms: acquain-tance ("acquaintance")
- Variation due to different use of graphs: <v>, <u>, <i>, <y>: aboue ("above"), abyde ("abide")
- Doubling of vowels and consonants, e.g. <-oo->, <-ll>: triviall ("trivial")
By direct comparison, variants that are not in the modern lexicon are easy to identify; however, our studies show that a significant portion of variants cannot be discovered this way. Inconsistencies in the use of the genitive, and "then" appearing instead of "than" or vice versa, require contextual information to be used in their detection. We will outline our approach to resolving this problem, by the use of contextually-sensitive template rules that contain lexical, grammatical and semantic information.
Footnotes
[1] http://www.scottishcorpus.ac.uk/
[2] http://www.ncl.ac.uk/necte/
[3] http://www.opencontentalliance.org/
[4] http://books.google.com/
[5] http://eebo.chadwyck.com/home
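
How several matching techniques can be combined through scores is sketched below. This is not VARD itself: it is a toy pre-processor that tries a known-variant list, then letter replacement rules, then a string similarity fallback, and keeps the highest-scoring candidate. The lexicon, the rule list, and the score values are hypothetical; SoundEx is omitted for brevity, and difflib's similarity ratio stands in for a normalised edit distance. Plain ASCII apostrophes are used in place of the typographic ones in the examples.

```python
from difflib import SequenceMatcher

# Toy resources. The entries echo examples from the abstract, but the
# lexicon, rules, and scores are hypothetical.
MODERN_LEXICON = {"above", "abide", "trivial", "against", "whilst"}
KNOWN_VARIANTS = {"'tis": "it is", "youle": "you will"}
LETTER_RULES = [("u", "v"), ("y", "i"), ("ll", "l"), ("'", "")]


def candidates(word):
    """Yield (modern_form, score) pairs from each technique in turn."""
    if word in KNOWN_VARIANTS:
        yield KNOWN_VARIANTS[word], 1.0
    for old, new in LETTER_RULES:
        rewritten = word.replace(old, new)
        if rewritten != word and rewritten in MODERN_LEXICON:
            yield rewritten, 0.9
    for entry in MODERN_LEXICON:
        # Similarity ratio as a stand-in for a normalised edit distance.
        yield entry, 0.8 * SequenceMatcher(None, word, entry).ratio()


def normalise(word):
    """Return the word itself if it is modern, else the best candidate."""
    if word in MODERN_LEXICON:
        return word
    best, _ = max(candidates(word), key=lambda pair: pair[1])
    return best


for variant in ["aboue", "abyde", "triviall", "again'st", "'tis"]:
    print(variant, "->", normalise(variant))
```

A practical tool would also need a score threshold below which a word is left unchanged, since the fallback here always returns some lexicon entry.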

Paul Rayson, Dawn Archer, Alistair Baron, and Nicholas Smith. Tagging Historical Corpora - the problem of spelling variation. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

