NER in Archival Finding Aids

Authors Luís Filipe Costa Cunha, José Carlos Ramalho




File

OASIcs.SLATE.2021.8.pdf
  • Filesize: 1.34 MB
  • 16 pages


Author Details

Luís Filipe Costa Cunha
  • University of Minho, Braga, Portugal
José Carlos Ramalho
  • Department of Informatics, University of Minho, Braga, Portugal

Cite As

Luís Filipe Costa Cunha and José Carlos Ramalho. NER in Archival Finding Aids. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 8:1-8:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.SLATE.2021.8

Abstract

At present, the vast majority of Portuguese archives with an online presence use a software solution, such as Digitarq or Archeevo, to manage their finding aids.
Most of these finding aids are written in natural language without any annotation that would enable a machine to identify named entities, geographical locations, or even some dates. Such annotation would allow smart browsing tools, such as entity linking and record linking, to be built on top of the record contents.
In this work we created a set of datasets to train Machine Learning algorithms to find those named entities and geographical locations. After training several algorithms, we tested them on several datasets and recorded their precision and accuracy.
These results allowed us to draw some conclusions about the precision that can be achieved with this approach in this context and about what to do with the results: do we have enough precision and accuracy to create toponymic and anthroponymic indexes for archival finding aids? Is this approach suitable in this context? These are some of the questions we intend to answer in this paper.

Subject Classification

ACM Subject Classification
  • Computing methodologies → Natural language processing
  • Computing methodologies → Machine learning
  • Computing methodologies → Maximum entropy modeling
  • Computing methodologies → Neural networks
  • Information systems → Digital libraries and archives
Keywords
  • Named Entity Recognition
  • Archival Descriptions
  • Machine Learning
  • Deep Learning

