LeMe-PT: A Medical Package Leaflet Corpus for Portuguese

Simões, Alberto; Gamallo, Pablo

doi:10.4230/OASIcs.SLATE.2021.10

Subject Classification

ACM Subject Classification

Computing methodologies → Information extraction
Computing methodologies → Language resources

Keywords

drug corpora
information extraction
word embeddings

Abstract

The current trend in natural language processing is the use of machine learning. It is being applied in every field, from summarization to machine translation. For these techniques to be applied, resources are needed, namely quality corpora. While there are large quantities of corpora for the Portuguese language, there is a lack of technical and focused corpora. Therefore, in this article we present a new corpus, built from drug package leaflets. We describe its structure and contents, and discuss possible exploration directions.

Cite As

Alberto Simões and Pablo Gamallo. LeMe-PT: A Medical Package Leaflet Corpus for Portuguese. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 10:1-10:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.SLATE.2021.10

Author Details

Alberto Simões

2Ai, School of Technology, IPCA, Barcelos, Portugal

Pablo Gamallo

Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), University of Santiago de Compostela, A Coruña, Spain

