Automated Georeferencing of Antarctic Species

Scott, Jamie; Stock, Kristin; Morgan, Fraser; Whitehead, Brandon; Medyckyj-Scott, David

doi:10.4230/LIPIcs.GIScience.2021.II.13

Abstract

Many text documents in the biological domain contain references to the location of specific phenomena (e.g. species sightings) in natural language form, for example: "In <LOCATION> Garwood Valley summer activity was 0.2% for <SPECIES> Umbilicaria aprina and 1.7% for <SPECIES> Caloplaca sp. ..." Methods have been developed to extract place names from documents, and attention has been given to the interpretation of spatial prepositions. However, the ability to connect toponym mentions in text with the phenomena to which they refer (in this case species) has received limited attention, even though it would be of considerable benefit for the task of mapping specific phenomena mentioned in text documents. As part of work to create a pipeline to automate georeferencing of species within legacy documents, this paper proposes a method to: (1) recognise species and toponyms within text and (2) match each species mention to the relevant toponym mention. Our methods find significant promise in a bespoke rules- and dictionary-based approach to recognising species within text (F1 scores up to 0.87 including partial matches) but less success, as yet, in recognising toponyms using multiple gazetteers combined with an off-the-shelf natural language processing tool (F1 up to 0.62). Most importantly, we offer a contribution to the relatively nascent area of matching toponym references to the object they locate (in our case species), including cases in which the toponym and species are in different sentences. We use tree-based models to achieve precision as high as 0.88 or an F1 score up to 0.68, depending on the downsampling rate. Initial results outperform previous research on detecting entity relationships that may cross sentence boundaries within biomedical text, and differ from previous work in specifically addressing species mapping.
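The matching step in (2) can be pictured as a binary classification over candidate species-toponym pairs, with negative pairs downsampled before training. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Mention` type, the two features, the sentence window, the "Example Ridge" mention and the `keep_rate` parameter are all hypothetical, and the abstract does not specify which features or which tree-based model the paper uses.

```python
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    kind: str      # "SPECIES" or "LOCATION" (tags as in the abstract's example)
    text: str
    sentence: int  # index of the sentence containing the mention


def candidate_pairs(mentions, max_sentence_gap=2):
    """Pair each species mention with each toponym mention in a sentence window.

    The window size and the two features are assumptions for illustration.
    """
    species = [m for m in mentions if m.kind == "SPECIES"]
    places = [m for m in mentions if m.kind == "LOCATION"]
    pairs = []
    for s, p in product(species, places):
        gap = abs(s.sentence - p.sentence)
        if gap <= max_sentence_gap:
            pairs.append({
                "species": s.text,
                "toponym": p.text,
                "sentence_gap": gap,  # 0 means same sentence
                "toponym_first": p.sentence <= s.sentence,
            })
    return pairs


def downsample_negatives(pairs, labels, keep_rate, seed=0):
    """Keep every positive pair and a random fraction of the negative pairs."""
    rng = random.Random(seed)
    return [(x, y) for x, y in zip(pairs, labels)
            if y or rng.random() < keep_rate]


mentions = [
    Mention("LOCATION", "Garwood Valley", 0),
    Mention("SPECIES", "Umbilicaria aprina", 0),
    Mention("SPECIES", "Caloplaca sp.", 0),
    Mention("LOCATION", "Example Ridge", 3),  # hypothetical distant toponym
]
pairs = candidate_pairs(mentions)
```

In this sketch the retained pairs and their labels would then be passed to a classifier. Lowering `keep_rate` discards more negative pairs, which is one way a downsampling rate could shift the balance between precision and F1 of the kind the abstract reports.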

