2 Search Results for "Rayson, Paul"

Document

DOI: 10.4230/LIPIcs.COSIT.2024.11

Semantic Perspectives on the Lake District Writing: Spatial Ontology Modeling and Relation Extraction for Deeper Insights

Authors: Erum Haris, Anthony G. Cohn, and John G. Stell

Published in: LIPIcs, Volume 315, 16th International Conference on Spatial Information Theory (COSIT 2024)

Abstract

Extracting spatial details from historical texts can be difficult, hindering our understanding of past landscapes. The study addresses this challenge by analyzing the Corpus of the Lake District Writing, focusing on the English Lake District region. We systematically link the theoretical notions from the core concepts of spatial information to provide basis for the problem domain. The conceptual foundation is further complemented with a spatial ontology and a custom gazetteer, allowing a formal and insightful semantic exploration of the massive unstructured corpus. The other contrasting side of the framework is the usage of LLMs for spatial relation extraction. We formulate prompts leveraging understanding of the LLMs of the intended task, curate a list of spatial relations representing the most recurring proximity or vicinity relations terms and extract semantic triples for the top five place names appearing in the corpus. We compare the extraction capabilities of three benchmark LLMs for a scholarly significant historical archive, representing their potential in a challenging and interdisciplinary research problem. Finally, the network comprising the semantic triples is enhanced by incorporating a gazetteer-based classification of the objects involved thus improving their spatial profiling.

Cite as

Erum Haris, Anthony G. Cohn, and John G. Stell. Semantic Perspectives on the Lake District Writing: Spatial Ontology Modeling and Relation Extraction for Deeper Insights. In 16th International Conference on Spatial Information Theory (COSIT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 315, pp. 11:1-11:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{haris_et_al:LIPIcs.COSIT.2024.11,
  author =	{Haris, Erum and Cohn, Anthony G. and Stell, John G.},
  title =	{{Semantic Perspectives on the Lake District Writing: Spatial Ontology Modeling and Relation Extraction for Deeper Insights}},
  booktitle =	{16th International Conference on Spatial Information Theory (COSIT 2024)},
  pages =	{11:1--11:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-330-0},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{315},
  editor =	{Adams, Benjamin and Griffin, Amy L. and Scheider, Simon and McKenzie, Grant},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.COSIT.2024.11},
  URN =		{urn:nbn:de:0030-drops-208268},
  doi =		{10.4230/LIPIcs.COSIT.2024.11},
  annote =	{Keywords: spatial humanities, spatial narratives, ontology, large language models}
}

Document

DOI: 10.4230/DagSemProc.06491.15

Tagging Historical Corpora - the problem of spelling variation

Authors: Paul Rayson, Dawn Archer, Alistair Baron, and Nicholas Smith

Published in: Dagstuhl Seminar Proceedings, Volume 6491, Digital Historical Corpora- Architecture, Annotation, and Retrieval (2007)

Abstract

Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics, information retrieval and natural language processing tasks that use "standard" or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) periods represent sentence boundaries or acronyms and (ii) apostrophes represent quote marks or contractions (Grefenstette and Tapanainen, 1994; Grefenstette, 1999). The issue of spelling variation becomes more problematic when utilising corpus linguistic techniques on non-standard varieties of English, not least because variation can be due to differences in spelling habits, transcription or compositing practices, and morpho-syntactic customs, as well as "misspelling". Examples of non-standard varieties include: - Scottish English1 (Anderson et al., forthcoming), and dialects such as Tyneside English2 (Allen et al., forthcoming) - Early Modern English (Archer and Rayson, 2004; Culpeper and Kytö, 2005) - Emerging varieties such as SMS or CMC in weblogs (Ooi et al., 2006) In the Dagstuhl workshop we focussed on historical corpora. Vast quantities of searchable historical material are being created in electronic form through large digitisation initiatives already underway e.g. Open Content Alliance3, Google Book Search4, and Early English Books Online5. Annotation, typically at the part-of-speech (POS) level, is carried out on modern corpora for linguistic analysis, information retrieval and natural language processing tasks such as named entity extraction. Increasingly researchers wish to carry out similar tasks on historical data (Nissim et al, 2004). However, historical data is considered noisy for tasks such as this. The problems faced when applying corpus annotation tools trained on modern language data to historical texts are the motivation for the research described in this paper. Previous research has adopted an approach of adding historical variants to the POS tagger lexicon, for example in TreeTagger annotation of GerManC (Durrell et al, 2006), or "back-dating" the lexicon in the Constraint Grammar Parser of English (ENGCG) when annotating the Helsinki corpus (Kytö and Voutilainen, 1995). Our aim was to develop an historical semantic tagger in order to facilitate similar studies on historical data to those that we had previously been performing on modern data using the USAS semantic analysis system (Rayson et al, 2004). The USAS tool relies on POS tagging as a prerequisite to carrying out semantic disambiguation. Hence we were faced with the task of retraining or back-dating two tools, a POS tagger and a semantic tagger. Our proposed solution incorporates a corpus pre-processor for detecting historical spelling variants and inserting modern equivalents alongside them. This enables retrieval as well as annotation tasks and to some extent avoids the need to retrain each annotation tool that is applied to the corpus. The modern tools can then be applied to the modern spelling equivalents rather than the historical variants, and thereby achieve higher levels of accuracy. The resulting variant detector tool (VARD) employs a number of techniques derived from spell-checking tools as we wished to evaluate their applicability to historical data. The current version of the tool uses known-variant lists, SoundEx, edit distance and letter replacement heuristics to match Early Modern English variants with modern forms. The techniques are combined using a scoring mechanism to enable preferred candidates to be selected using likelihood values. The current known-variant lists and letter replacement rules are manually created. In a cross-language study with English and German texts we found that similar techniques could be used to derive letter replacement heuristics from corpus examples (Pilz et al, forthcoming). Our experiments show that VARD can successfully deal with: - Apostrophes signalling missing letter(s) or sound(s): ’fore ("before"), hee’l ("he will"), - Irregular apostrophe usage: again’st ("against"), whil’st ("whilst") - Contracted forms: ’tis("it is"), thats ("that is"), youle ("you will"), t’anticipate ("to anticipate") - Hyphenated forms: acquain-tance ("acquaintance") - Variation due to different use of graphs: <v>, <u>, <i>, <y>: aboue ("above"), abyde ("abide") - Doubling of vowels and consonants -e.g. <-oo-><-ll>: triviall ("trivial") By direct comparison, variants that are not in the modern lexicon are easy to identify, however, our studies show that a significant portion of variants cannot be discovered this way. Inconsistencies in the use of the genitive, and "then" appearing instead of "than" or vice versa require contextual information to be used in their detection. We will outline our approach to resolving this problem, by the use of contextually-sensitive template rules that contain lexical, grammatical and semantic information. Footnotes 1 http://www.scottishcorpus.ac.uk/ 2 http://www.ncl.ac.uk/necte/ 3 http://www.opencontentalliance.org/ 4 http://books.google.com/ 5 http://eebo.chadwyck.com/home

Cite as

Paul Rayson, Dawn Archer, Alistair Baron, and Nicholas Smith. Tagging Historical Corpora - the problem of spelling variation. In Digital Historical Corpora- Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, Volume 6491, pp. 1-2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Copy BibTex To Clipboard

@InProceedings{rayson_et_al:DagSemProc.06491.15,
  author =	{Rayson, Paul and Archer, Dawn and Baron, Alistair and Smith, Nicholas},
  title =	{{Tagging Historical Corpora - the problem of spelling variation}},
  booktitle =	{Digital Historical Corpora- Architecture, Annotation, and Retrieval},
  pages =	{1--2},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2007},
  volume =	{6491},
  editor =	{Lou Burnard and Milena Dobreva and Norbert Fuhr and Anke L\"{u}deling},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.06491.15},
  URN =		{urn:nbn:de:0030-drops-10553},
  doi =		{10.4230/DagSemProc.06491.15},
  annote =	{Keywords: Corpus annotation, spelling variation, historical variants}
}

Refine by Author
1 Archer, Dawn
1 Baron, Alistair
1 Cohn, Anthony G.
1 Haris, Erum
1 Rayson, Paul
Show More...

Refine by Classification
1 Computing methodologies → Artificial intelligence

Refine by Keyword
1 Corpus annotation
1 historical variants
1 large language models
1 ontology
1 spatial humanities
Show More...

Refine by Type
2 document

Refine by Publication Year
1 2007
1 2024

2 Search Results for "Rayson, Paul"

Semantic Perspectives on the Lake District Writing: Spatial Ontology Modeling and Relation Extraction for Deeper Insights

Abstract

Cite as

Tagging Historical Corpora - the problem of spelling variation

Abstract

Cite as

Thanks for your feedback!

Could not send message