NER in Archival Finding Aids

Costa Cunha, Luís Filipe; Ramalho, José Carlos

doi:10.4230/OASIcs.SLATE.2021.8

Subject Classification

ACM Subject Classification

Computing methodologies → Natural language processing
Computing methodologies → Machine learning
Computing methodologies → Maximum entropy modeling
Computing methodologies → Neural networks
Information systems → Digital libraries and archives

Keywords

Named Entity Recognition
Archival Descriptions
Machine Learning
Deep Learning

Abstract

At present, the vast majority of Portuguese archives with an online presence use a software solution, such as Digitarq or Archeevo, to manage their finding aids. Most of these finding aids are written in natural language, without any annotation that would enable a machine to identify named entities, geographical locations, or even some dates. Such annotation would allow the machine to build smart browsing tools on top of the record contents, such as entity linking and record linking. In this work, we created a set of datasets to train Machine Learning algorithms to find those named entities and geographical locations. After training several algorithms, we tested them on several datasets and recorded their precision and accuracy. These results allowed us to draw some conclusions about what kind of precision can be achieved with this approach in this context and what to do with the results: do we have enough precision and accuracy to create toponymic and anthroponymic indexes for archival finding aids? Is this approach suitable in this context? These are some of the questions we intend to answer in this paper.

Cite As

Luís Filipe Costa Cunha and José Carlos Ramalho. NER in Archival Finding Aids. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 8:1-8:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.SLATE.2021.8

Author Details

Luís Filipe Costa Cunha

University of Minho, Braga, Portugal

José Carlos Ramalho

Department of Informatics, University of Minho, Braga, Portugal
