A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian

Jokić, Danka; Stanković, Ranka; Krstev, Cvetana; Šandrih, Branislava

doi:10.4230/OASIcs.LDK.2021.13

File

OASIcs.LDK.2021.13.pdf

Filesize: 1.38 MB
17 pages

Document Identifiers

DOI: 10.4230/OASIcs.LDK.2021.13
URN: urn:nbn:de:0030-drops-145493

Author Details

Danka Jokić

University of Belgrade, Serbia

Ranka Stanković

Faculty of Mining and Geology, University of Belgrade, Serbia

Cvetana Krstev

Faculty of Philology, University of Belgrade, Serbia

Branislava Šandrih

Faculty of Philology, University of Belgrade, Serbia

Acknowledgements

We would like to thank the team of annotators who gave their time and effort to help us build the AbCoSER v1.0 corpus of abusive speech in Serbian.

Cite As

Danka Jokić, Ranka Stanković, Cvetana Krstev, and Branislava Šandrih. A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.LDK.2021.13

Abstract

Abusive speech in social media, including profanities, derogatory language, and hate speech, has reached the level of a pandemic. A system able to detect such texts could help make the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on the English language. This paper presents the work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated tweets, of which 1,416 were labelled as containing some kind of abusive speech. Those 1,416 tweets were further sub-classified, for instance into those using vulgar language, hate speech, or derogatory language. In this paper, we explain the process of data acquisition, annotation, and corpus construction. We also discuss the results of an initial analysis of the annotation quality. Finally, we present the structure of an abusive speech lexicon and its enrichment with abusive triggers extracted from the AbCoSER dataset.

Subject Classification

ACM Subject Classification

Computing methodologies → Natural language processing

Keywords

abusive language
hate speech
Serbian
Twitter
lexicon
corpus

