Towards Automatic Creation of Annotations to Foster Development of Named Entity Recognizers

Matos, Emanuel; Rodrigues, Mário; Miguel, Pedro; Teixeira, António

doi:10.4230/OASIcs.SLATE.2021.11

Abstract

Named Entity Recognition (NER) is an essential step for many natural language processing tasks, including Information Extraction. Despite recent advances, particularly using deep learning techniques, creating accurate named entity recognizers remains a complex task that depends heavily on the availability of annotated data. To foster the existence of NER systems for new domains, it is crucial to obtain the required large volumes of annotated data with little or no manual labor. This paper proposes a system that creates the annotated data automatically by resorting to a set of existing NERs and information sources (DBpedia). The approach was tested with documents from the Tourism domain. Distinct methods were applied for deciding the final named entities and their respective tags. The results show that this approach can increase confidence in the annotations and/or augment the number of categories that can be annotated. This paper also presents examples of new NERs that can be rapidly created with the obtained annotated data. The annotated data, combined with the possibility of applying both the ensemble of NER systems and the new Gazetteer-based NERs to large corpora, create the necessary conditions to explore recent state-of-the-art deep learning approaches to NER (e.g., BERT) in domains with scarce or nonexistent training data.
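As a rough illustration of how the outputs of several existing NERs might be combined into final annotations, the sketch below applies simple majority voting over character spans. The abstract does not say which decision methods the paper uses, so the voting rule, the `vote` function, the `min_votes` threshold, the tool names, and the sample annotations are all assumptions made for this example, not the authors' method.

```python
from collections import Counter, defaultdict


def vote(annotations, min_votes=2):
    """Combine span annotations from several NER tools by majority vote.

    annotations: dict mapping a tool name to a list of (start, end, tag) spans.
    Returns a sorted list of (start, end, tag, votes) for every span whose
    winning tag was proposed by at least `min_votes` tools.
    """
    votes = defaultdict(Counter)
    for spans in annotations.values():
        # set() ensures each tool votes at most once per (span, tag) pair.
        for start, end, tag in set(spans):
            votes[(start, end)][tag] += 1

    accepted = []
    for (start, end), tags in sorted(votes.items()):
        # Highest vote count wins; ties are broken alphabetically by tag.
        tag, count = min(tags.items(), key=lambda kv: (-kv[1], kv[0]))
        if count >= min_votes:
            accepted.append((start, end, tag, count))
    return accepted


# Hypothetical outputs for one Tourism-domain sentence (all values invented).
text = "Visite a Torre de Belém em Lisboa."
annotations = {
    "tool_a": [(9, 23, "LOC"), (27, 33, "LOC")],
    "tool_b": [(9, 23, "LOC"), (27, 33, "LOC")],
    "tool_c": [(9, 23, "ORG"), (27, 33, "LOC")],
    # Stand-in for a lookup against an external source such as DBpedia.
    "gazetteer": [(27, 33, "LOC")],
}

for start, end, tag, count in vote(annotations):
    print(text[start:end], tag, count)
```

Raising `min_votes` trades coverage for confidence: with `min_votes=3`, only the span on which at least three sources agree would be kept.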

