When quoting this document, please refer to the following
URN: urn:nbn:de:0030-drops-10533
URL: http://drops.dagstuhl.de/opus/volltexte/2007/1053/

Pilz, Thomas

Searching in text databases with non-standard orthography


Abstract

This paper presents research results from the recent project “Rule based search in text data bases with non-standard orthography”. Numerous steps lead from a facsimile to a searchable text document. The paper focuses on techniques that improve retrieval results on historical texts with non-standard spellings. Historical documents, especially those set in blackletter fonts, are prone to recognition errors. Adequate preparation of the image sources prior to OCR can reduce the number of misrecognized characters. Furthermore, placing a search engine with categorized distance measures between the user interface and the text database can improve retrieval results. Specific metrics address problems in optical character recognition, transcription, and historical spelling variation. With a synoptic view interface, users can remain completely unaware of the methods applied to their queries.
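The idea of categorized distance measures can be illustrated with a weighted edit distance in which substitutions are cheaper when they match a known category of variation. The following Python code is only a minimal sketch under stated assumptions, not the method described in the paper: the character pairs, the cost values, and the names `OCR_CONFUSIONS`, `SPELLING_VARIANTS`, `substitution_cost`, `weighted_distance`, and `rank` are all hypothetical and chosen for illustration.

```python
# Hypothetical category tables: the pairs and costs below are illustrative
# assumptions, not values taken from the paper.
OCR_CONFUSIONS = {("u", "n"), ("c", "e"), ("l", "i")}
SPELLING_VARIANTS = {("y", "i"), ("v", "u"), ("c", "k")}

SPELLING_COST = 0.2   # assumed cost for a historical spelling variant
OCR_COST = 0.4        # assumed cost for a typical OCR confusion
DEFAULT_COST = 1.0    # cost of any other edit operation


def substitution_cost(a: str, b: str) -> float:
    """Return the cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if (a, b) in SPELLING_VARIANTS or (b, a) in SPELLING_VARIANTS:
        return SPELLING_COST
    if (a, b) in OCR_CONFUSIONS or (b, a) in OCR_CONFUSIONS:
        return OCR_COST
    return DEFAULT_COST


def weighted_distance(query: str, candidate: str) -> float:
    """Levenshtein distance with category-dependent substitution costs."""
    # prev[j] holds the distance between the processed query prefix
    # and the first j characters of the candidate.
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, qc in enumerate(query, start=1):
        curr = [float(i)]
        for j, cc in enumerate(candidate, start=1):
            curr.append(min(
                prev[j] + DEFAULT_COST,                  # deletion
                curr[j - 1] + DEFAULT_COST,              # insertion
                prev[j - 1] + substitution_cost(qc, cc)  # substitution
            ))
        prev = curr
    return prev[-1]


def rank(query: str, vocabulary: list[str]) -> list[str]:
    """Order vocabulary entries by increasing distance to the query."""
    return sorted(vocabulary, key=lambda word: weighted_distance(query, word))
```

Under these assumed costs, a call such as `rank("bei", ["bau", "bey", "ben"])` would place the spelling variant `bey` ahead of the unrelated words, because the `i`/`y` substitution is cheaper than an ordinary edit. A real system would derive the categories and weights from evidence in the corpus rather than from a hand-written table.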

BibTeX - Entry

@InProceedings{pilz:DSP:2007:1053,
  author =	{Thomas Pilz},
  title =	{Searching in text databases with non-standard orthography},
  booktitle =	{Digital Historical Corpora - Architecture, Annotation, and Retrieval},
  year =	{2007},
  editor =	{Lou Burnard and Milena Dobreva and Norbert Fuhr and Anke L{\"u}deling},
  number =	{06491},
  series =	{Dagstuhl Seminar Proceedings},
  ISSN =	{1862-4405},
  publisher =	{Internationales Begegnungs- und Forschungszentrum f{\"u}r Informatik (IBFI), Schloss Dagstuhl, Germany},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2007/1053},
  annote =	{Keywords: Rule based search, Optical character recognition, spelling variation, edit distance}
}

Keywords: Rule based search, Optical character recognition, spelling variation, edit distance
Seminar: 06491 - Digital Historical Corpora - Architecture, Annotation, and Retrieval
Issue date: 2007
Date of publication: 13.06.2007


Published by LZI