A Framework for Fostering Easier Access to Enriched Textual Information

Silva, Gabriel; Rodrigues, Mário; Teixeira, António; Amorim, Marlene

doi:10.4230/OASIcs.SLATE.2023.2

Abstract

Given the amount of information held in unstructured data, suitable methods are needed to extract it. Most of these methods produce output in their own format, which makes merging and sharing the extracted information difficult and costly, as there is currently no unified way of representing it. While most of these methods rely on JSON or XML, there has been a push to serialize their output into RDF-compliant formats because of the flexibility of these formats and the existing ecosystem surrounding them.
In this paper we introduce a framework whose goal is to serialize enriched data into an RDF format, following the FAIR principles, making the data more interpretable, interoperable and shareable. We process a subset of the WikiNER dataset and showcase two uses of the framework: one using CoNLL annotations and the other performing entity linking on an already existing graph. The result is a graph in which every connection starts at the document and ends at its tokens, keeping the original text intact while embedding the enriched data into it, in this case the CoNLL annotations and entities.
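
The document-to-token graph described above can be illustrated with a minimal sketch. It is not the framework's actual implementation or vocabulary: the namespaces `EX` and `VOC`, the predicate names (`text`, `hasSentence`, `hasToken`, `word`, `nerTag`) and the functions `build_graph` and `to_ntriples` are all hypothetical, chosen only to show how the original text can be kept intact on the document node while annotations hang off token nodes. The output is written as N-Triples using only the Python standard library.

```python
from typing import List, Tuple

# Hypothetical namespaces and predicate names, for illustration only;
# the paper's framework may use a different vocabulary.
EX = "http://example.org/doc/"
VOC = "http://example.org/vocab#"

Triple = Tuple[str, str, str]


def iri(value: str) -> str:
    """Wrap a string as an N-Triples IRI term."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Wrap a string as an N-Triples plain literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_graph(
    doc_id: str, text: str, sentences: List[List[Tuple[str, str]]]
) -> List[Triple]:
    """Build document -> sentence -> token triples.

    `sentences` is a list of sentences, each a list of (token, tag) pairs,
    where the tag stands in for one CoNLL-style annotation column.
    """
    doc = iri(EX + doc_id)
    # The original text is stored unchanged on the document node.
    triples: List[Triple] = [(doc, iri(VOC + "text"), literal(text))]
    for s_idx, sentence in enumerate(sentences, start=1):
        sent = iri(f"{EX}{doc_id}/s{s_idx}")
        triples.append((doc, iri(VOC + "hasSentence"), sent))
        for t_idx, (token, tag) in enumerate(sentence, start=1):
            tok = iri(f"{EX}{doc_id}/s{s_idx}/t{t_idx}")
            triples.append((sent, iri(VOC + "hasToken"), tok))
            triples.append((tok, iri(VOC + "word"), literal(token)))
            triples.append((tok, iri(VOC + "nerTag"), literal(tag)))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    example = build_graph(
        "d1",
        "Lisbon is sunny.",
        [[("Lisbon", "I-LOC"), ("is", "O"), ("sunny", "O"), (".", "O")]],
    )
    print(to_ntriples(example))
```

In this sketch every path begins at the document node and ends at a token node, so an entity-linking step could, under the same assumptions, attach one further triple from a token node to an entity in an existing graph.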

