License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/DagSemProc.06491.3
URN: urn:nbn:de:0030-drops-10497

Kamps, Jaap ; Koolen, Marijn ; Adriaans, Frans ; de Rijke, Maarten

A Cross-Language Approach to Historic Document Retrieval


Our cultural heritage, as preserved in libraries, archives and
museums, is made up of documents written many centuries ago.
Large-scale digitization initiatives, like DigiCULT, make these
documents available to non-expert users through digital
libraries and vertical search engines.

For a user, querying a historic document collection may be a
disappointing experience. Natural languages evolve over time, changing
in pronunciation and spelling, and new words are introduced
continuously, while older words may disappear out of everyday use. For
these reasons, queries involving modern words may not be very
effective for retrieving documents that contain many historic terms.
Although reading a 300-year-old document might not be problematic
because the words are still recognizable, the changes in vocabulary
and spelling can make it difficult to use a search engine to find
relevant documents. To illustrate this, consider the following example
from our collection of 17th century Dutch law texts. If you look for
information on the tasks of a lawyer (modern Dutch: "advocaat") in
these texts, the modern spelling will not lead you to documents
containing the 17th century Dutch spelling variant "advocaet".
Since spelling rules were not introduced until the 19th century, 17th
century Dutch spelling is inconsistent. Because spelling was based
mainly on pronunciation, the same word was often written in several
different variants, which poses a problem for standard retrieval engines.

We therefore define Historic Document Retrieval (HDR) as the retrieval
of relevant historic documents for a modern query. Our approach to
this problem is to treat the historic and modern languages as
different languages, and use cross-language information retrieval
(CLIR) techniques to translate one language into the other.
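
As a rough illustration of this idea, the sketch below modernizes
historic tokens with rewrite rules before indexing, so that a modern
query term can match a historic variant. It is a minimal sketch and not
the authors' method: the abstract does not say how the translation is
done. The names REWRITE_RULES, modernize, build_index and search are
hypothetical, the single rule "ae" -> "aa" is taken only from the
advocaet/advocaat example above, and the two one-word documents are toy
placeholders.

```python
# Hypothetical rewrite rules (historic substring -> modern substring).
# Only "ae" -> "aa" is motivated by the advocaet/advocaat example; a real
# system would need many more rules and some way to avoid over-applying them.
REWRITE_RULES = [("ae", "aa")]


def modernize(token):
    """Apply every rewrite rule to a lowercased historic token."""
    word = token.lower()
    for historic, modern in REWRITE_RULES:
        word = word.replace(historic, modern)
    return word


def build_index(documents):
    """Map each modernized token to the ids of the documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in text.split():
            index.setdefault(modernize(token), set()).add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that match every term of a modern query."""
    postings = [index.get(term.lower(), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


# Toy placeholder documents, not taken from the actual collection.
toy_documents = {"d1": "advocaet", "d2": "procureur"}
toy_index = build_index(toy_documents)
matches = search(toy_index, "advocaat")
```

Here the historic side is translated into the modern language at
indexing time. The opposite direction, expanding a modern query into
possible historic variants, fits the same framing and leaves the index
untouched.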

BibTeX - Entry

  author =	{Kamps, Jaap and Koolen, Marijn and Adriaans, Frans and de Rijke, Maarten},
  title =	{{A Cross-Language Approach to Historic Document Retrieval}},
  booktitle =	{Digital Historical Corpora- Architecture, Annotation, and Retrieval},
  pages =	{1--2},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2007},
  volume =	{6491},
  editor =	{Lou Burnard and Milena Dobreva and Norbert Fuhr and Anke L\"{u}deling},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-10497},
  doi =		{10.4230/DagSemProc.06491.3},
  annote =	{Keywords: Historic Documents, Information Retrieval, Spelling variation, Modernizing Spelling, 17th Century Dutch}

Keywords: Historic Documents, Information Retrieval, Spelling variation, Modernizing Spelling, 17th Century Dutch
Collection: 06491 - Digital Historical Corpora- Architecture, Annotation, and Retrieval
Issue Date: 2007
Date of publication: 13.06.2007
