Documents authored by Krstev, Cvetana


Document
A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian

Authors: Danka Jokić, Ranka Stanković, Cvetana Krstev, and Branislava Šandrih

Published in: OASIcs, Volume 93, 3rd Conference on Language, Data and Knowledge (LDK 2021)


Abstract
Abusive speech in social media, including profanities, derogatory language, and hate speech, has reached the level of a pandemic. A system able to detect such texts could help make the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on the English language. This paper presents the work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated tweets, of which 1,416 were labelled as using some kind of abusive speech. Those 1,416 tweets were further sub-classified, for instance into those using vulgar language, hate speech, derogatory language, etc. In this paper, we explain the process of data acquisition, annotation, and corpus construction. We also discuss the results of an initial analysis of the annotation quality. Finally, we present an abusive speech lexicon structure and its enrichment with abusive triggers extracted from the AbCoSER dataset.

Cite as

Danka Jokić, Ranka Stanković, Cvetana Krstev, and Branislava Šandrih. A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)



@InProceedings{jokic_et_al:OASIcs.LDK.2021.13,
  author =	{Joki\'{c}, Danka and Stankovi\'{c}, Ranka and Krstev, Cvetana and \v{S}andrih, Branislava},
  title =	{{A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{13:1--13:17},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.13},
  URN =		{urn:nbn:de:0030-drops-145493},
  doi =		{10.4230/OASIcs.LDK.2021.13},
  annote =	{Keywords: abusive language, hate speech, Serbian, Twitter, lexicon, corpus}
}