LeMe-PT: A Medical Package Leaflet Corpus for Portuguese

Authors: Alberto Simões, Pablo Gamallo

Author Details

Alberto Simões
  • 2Ai, School of Technology, IPCA, Barcelos, Portugal
Pablo Gamallo
  • Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), University of Santiago de Compostela, A Coruña, Spain

Cite As

Alberto Simões and Pablo Gamallo. LeMe-PT: A Medical Package Leaflet Corpus for Portuguese. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 10:1-10:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


The current trend in natural language processing is the use of machine learning, which is being applied in every field, from summarization to machine translation. These techniques require resources, namely quality corpora. While large quantities of corpora exist for the Portuguese language, technical and focused corpora are lacking. Therefore, in this article we present a new corpus, built from drug package leaflets. We describe its structure and contents, and discuss possible directions for exploring it.

Subject Classification

ACM Subject Classification
  • Computing methodologies → Information extraction
  • Computing methodologies → Language resources

Keywords
  • drug corpora
  • information extraction
  • word embeddings



