Validation Methodology for Expert-Annotated Datasets: Event Annotation Case Study

Authors: Oana Inel and Lora Aroyo



Author Details

Oana Inel
  • Delft University of Technology, The Netherlands
  • Vrije Universiteit Amsterdam, The Netherlands
Lora Aroyo
  • Google Research, New York, US

Cite As

Oana Inel and Lora Aroyo. Validation Methodology for Expert-Annotated Datasets: Event Annotation Case Study. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Open Access Series in Informatics (OASIcs), Volume 70, pp. 12:1-12:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


Event detection remains a difficult task due to the complexity and ambiguity of such entities. On the one hand, we observe low inter-annotator agreement among experts when annotating events, despite the multitude of existing annotation guidelines and their numerous revisions. On the other hand, event extraction systems achieve lower F1-scores than systems extracting other entity types, such as people or locations. In this paper we study the consistency and completeness of expert-annotated datasets for events and time expressions, and we propose a data-agnostic methodology for validating such datasets along both dimensions. Furthermore, we combine the power of crowds and machines to correct and extend expert-annotated event datasets, and we show the benefit of using crowd-annotated events to train and evaluate a state-of-the-art event extraction system. Our results show that crowd-annotated events increase the system's performance by at least 5.3%.
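The comparisons the abstract describes (expert vs. crowd annotations, and system performance in F1-score) both reduce to matching annotation spans between two sets. As a minimal illustration, the sketch below computes exact-match precision, recall and F1 over event spans; the span representation as character offsets and the toy annotations are assumptions for illustration, not the paper's actual data or method.

```python
# Hypothetical sketch: exact-match precision/recall/F1 between two sets
# of event annotations, each represented as (start, end) character offsets.
def span_f1(gold, predicted):
    """Return (precision, recall, F1) for exact span matches."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                     # spans found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the expert set misses one event the crowd identified,
# so measured against the crowd, expert recall drops below 1.0.
expert = [(0, 4), (10, 18)]
crowd = [(0, 4), (10, 18), (25, 31)]
p, r, f = span_f1(crowd, expert)  # crowd annotations as reference
```

Here the expert annotations yield perfect precision but only 2/3 recall against the crowd reference, giving F1 = 0.8; the same machinery, applied with an extraction system's output as `predicted`, produces the F1-scores the paper reports.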

Subject Classification

ACM Subject Classification
  • Information systems → Crowdsourcing
  • Human-centered computing → Empirical studies in HCI
  • Computing methodologies → Machine learning

Keywords
  • Crowdsourcing
  • Human-in-the-Loop
  • Event Extraction
  • Time Extraction



