Towards Automatic Creation of Annotations to Foster Development of Named Entity Recognizers

Authors Emanuel Matos, Mário Rodrigues, Pedro Miguel, António Teixeira




File

OASIcs.SLATE.2021.11.pdf
  • Filesize: 1.27 MB
  • 14 pages


Author Details

Emanuel Matos
  • IEETA, DETI, University of Aveiro, Aveiro, Portugal
Mário Rodrigues
  • IEETA, ESTGA, University of Aveiro, Aveiro, Portugal
Pedro Miguel
  • IEETA, DETI, University of Aveiro, Aveiro, Portugal
António Teixeira
  • IEETA, DETI, University of Aveiro, Aveiro, Portugal

Cite As

Emanuel Matos, Mário Rodrigues, Pedro Miguel, and António Teixeira. Towards Automatic Creation of Annotations to Foster Development of Named Entity Recognizers. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 11:1-11:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.SLATE.2021.11

Abstract

Named Entity Recognition (NER) is an essential step for many natural language processing tasks, including Information Extraction. Despite recent advances, particularly using deep learning techniques, the creation of accurate named entity recognizers remains a complex task, highly dependent on the availability of annotated data. To foster the existence of NER systems for new domains, it is crucial to obtain the required large volumes of annotated data with little or no manual labor. This paper proposes a system that creates the annotated data automatically by resorting to a set of existing NERs and information sources (DBpedia). The approach was tested with documents from the Tourism domain. Distinct methods were applied for deciding the final named entities and their respective tags. The results show that this approach can increase the confidence in annotations and/or augment the number of categories that can be annotated. This paper also presents examples of new NERs that can be rapidly created with the obtained annotated data. The annotated data, combined with the possibility of applying both the ensemble of NER systems and the new Gazetteer-based NERs to large corpora, create the necessary conditions to explore recent state-of-the-art deep learning approaches to NER (e.g., BERT) in domains with scarce or nonexistent training data.
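
The abstract does not specify which methods decide the final entities and tags. As an illustration only, the sketch below assumes one plausible option, simple majority voting over the spans proposed by several annotators. The function name `vote_on_entities`, the `min_votes` threshold, and the sample outputs are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A span is a pair of character offsets (start, end), end exclusive.
Span = Tuple[int, int]


def vote_on_entities(
    system_outputs: List[Dict[Span, str]],
    min_votes: int = 2,
) -> Dict[Span, str]:
    """Keep a span only when at least `min_votes` systems agree on its tag.

    Assumption: every system reports entities as {span: tag}. Majority
    voting is one possible decision rule, not necessarily the paper's.
    """
    votes: Dict[Span, Counter] = {}
    for output in system_outputs:
        for span, tag in output.items():
            votes.setdefault(span, Counter())[tag] += 1

    final: Dict[Span, str] = {}
    for span, counter in votes.items():
        tag, count = counter.most_common(1)[0]
        if count >= min_votes:
            final[span] = tag
    return final


# Hypothetical outputs for the text "Aveiro fica em Portugal".
ner_a = {(0, 6): "LOC", (15, 23): "LOC"}
ner_b = {(0, 6): "LOC", (15, 23): "ORG"}
gazetteer = {(0, 6): "LOC"}  # e.g. a lookup built from DBpedia labels

annotations = vote_on_entities([ner_a, ner_b, gazetteer])
print(annotations)
```

In this example all three sources agree that "Aveiro" is a location, so the span is kept, while the conflicting tags for "Portugal" fall below the threshold and the span is left unannotated. Raising `min_votes` trades coverage for confidence in the resulting annotations.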

Subject Classification

ACM Subject Classification
  • Information systems → Specialized information retrieval
  • Applied computing → Computers in other domains
Keywords
  • Named Entity Recognition (NER)
  • Automatic Annotation
  • Gazetteers
  • Tourism
  • Portuguese
