A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian

Authors: Danka Jokić, Ranka Stanković, Cvetana Krstev, Branislava Šandrih




File

OASIcs.LDK.2021.13.pdf
  • Filesize: 1.38 MB
  • 17 pages

Author Details

Danka Jokić
  • University of Belgrade, Serbia
Ranka Stanković
  • Faculty of Mining and Geology, University of Belgrade, Serbia
Cvetana Krstev
  • Faculty of Philology, University of Belgrade, Serbia
Branislava Šandrih
  • Faculty of Philology, University of Belgrade, Serbia

Acknowledgements

We would like to acknowledge the team of annotators who gave their time and effort to help us build the AbCoSER v1.0 corpus of abusive speech in Serbian.

Cite As

Danka Jokić, Ranka Stanković, Cvetana Krstev, and Branislava Šandrih. A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.LDK.2021.13

Abstract

Abusive speech in social media, including profanities, derogatory language, and hate speech, has reached the level of a pandemic. A system able to detect such texts could help make the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on the English language. This paper presents the work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated tweets, of which 1,416 were labelled as containing some kind of abusive speech. Those 1,416 tweets were further sub-classified, for instance into those using vulgar language, hate speech, derogatory language, etc. In this paper, we explain the process of data acquisition, annotation, and corpus construction. We also discuss the results of an initial analysis of the annotation quality. Finally, we present the structure of an abusive speech lexicon and its enrichment with abusive triggers extracted from the AbCoSER dataset.

Subject Classification

ACM Subject Classification
  • Computing methodologies → Natural language processing
Keywords
  • abusive language
  • hate speech
  • Serbian
  • Twitter
  • lexicon
  • corpus

