A Pseudonymization Prototype for Hungarian

Authors Attila Novák, Borbála Novák

Author Details

Attila Novák
  • Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
Borbála Novák
  • Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary

Cite As

Attila Novák and Borbála Novák. A Pseudonymization Prototype for Hungarian. In 12th Symposium on Languages, Applications and Technologies (SLATE 2023). Open Access Series in Informatics (OASIcs), Volume 113, pp. 3:1-3:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/OASIcs.SLATE.2023.3

Abstract

In this paper, we present a pseudonymization prototype for Hungarian, an agglutinating language with complex morphology, implemented as a web service. The service provides the following functions: entity identification and extraction; automatic generation and selection of replacement candidates; and automatic, consistent replacement and reinflection of entities in the final pseudonymized document. The named entity recognition model handles names of persons well and performs decently on other entity types. However, ID-like entities need to be handled separately to achieve proper performance, and the current prototype version does not yet handle them. For automatic replacement candidate generation, a simple entity embedding model is used. We discuss the performance and limitations of the prototype in detail.
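
To make the candidate-generation and consistent-replacement steps concrete, here is a minimal sketch in Python. It is not the prototype's implementation: the abstract only says that a simple entity embedding model is used, so ranking candidates by cosine similarity is an assumption, and the toy vectors, the `candidates` function, and the `Pseudonymizer` class are hypothetical names introduced for illustration. Reinflection of the replacement (for example, adapting case suffixes to vowel harmony) is deliberately left out.

```python
import math

# Toy 3-dimensional vectors (hypothetical); a real system would load
# trained entity embeddings instead.
EMBEDDINGS = {
    "Budapest": (0.9, 0.1, 0.0),
    "Debrecen": (0.8, 0.2, 0.1),
    "Szeged": (0.7, 0.3, 0.0),
    "Kovács": (0.0, 0.9, 0.2),
    "Szabó": (0.1, 0.8, 0.3),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidates(entity, k=2):
    """Return the k entities most similar to `entity`, excluding itself."""
    query = EMBEDDINGS[entity]
    others = [name for name in EMBEDDINGS if name != entity]
    others.sort(key=lambda name: cosine(query, EMBEDDINGS[name]), reverse=True)
    return others[:k]


class Pseudonymizer:
    """Keeps one replacement per entity so that a document stays consistent."""

    def __init__(self):
        self.mapping = {}

    def replacement_for(self, entity):
        if entity not in self.mapping:
            used = set(self.mapping.values())
            # Take the most similar candidate not already used as a pseudonym.
            ranked = candidates(entity, k=len(EMBEDDINGS))
            self.mapping[entity] = next(c for c in ranked if c not in used)
        return self.mapping[entity]


pseudonymizer = Pseudonymizer()
first = pseudonymizer.replacement_for("Budapest")
second = pseudonymizer.replacement_for("Budapest")
surname = pseudonymizer.replacement_for("Kovács")
```

Caching the choice in `mapping` is what keeps replacement consistent: every mention of the same entity receives the same pseudonym, and no pseudonym is reused for two different entities.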

Subject Classification

ACM Subject Classification
  • Computing methodologies → Natural language processing
Keywords
  • named entity recognition
  • morphological reinflection
  • pseudonymization
  • entity embedding model
