KG2Tables: A Domain-Specific Tabular Data Generator to Evaluate Semantic Table Interpretation Systems (Resource Paper)

Authors: Nora Abdelmageed, Ernesto Jiménez-Ruiz, Oktie Hassanzadeh, Birgitta König-Ries



File

TGDK.3.1.1.pdf
  • Filesize: 1.5 MB
  • 28 pages

Author Details

Nora Abdelmageed
  • Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Germany
Ernesto Jiménez-Ruiz
  • City St George’s, University of London, UK
  • University of Oslo, Norway
Oktie Hassanzadeh
  • IBM Research, Yorktown Heights, NY, USA
Birgitta König-Ries
  • Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Germany

Acknowledgements

The authors would like to thank Sirko Schindler for his help in setting up JenTab. In addition, we thank Marco Cremaschi and Roberto Avogadro from the s-elbat team, Fidel Jiomekong from the TSOTSA team, Yoan Chabot from the DAGOBAH team, and Ioannis Dasoulas from the TorchicTab team for their early and quick work on solving tFood.

Cite As

Nora Abdelmageed, Ernesto Jiménez-Ruiz, Oktie Hassanzadeh, and Birgitta König-Ries. KG2Tables: A Domain-Specific Tabular Data Generator to Evaluate Semantic Table Interpretation Systems (Resource Paper). In Transactions on Graph Data and Knowledge (TGDK), Volume 3, Issue 1, pp. 1:1-1:28, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025) https://doi.org/10.4230/TGDK.3.1.1

Abstract

Tabular data, often in the form of CSV files, plays a pivotal role in data analytics pipelines. Understanding this data semantically, known as Semantic Table Interpretation (STI), is crucial but poses challenges due to several factors, such as the ambiguity of labels. As a result, STI has gained increasing attention from the community in the past few years. Evaluating STI systems requires well-established benchmarks. Most of the existing large-scale benchmarks are derived from general-domain sources and focus on ambiguity, while domain-specific benchmarks are relatively small. This paper introduces KG2Tables, a framework that can construct domain-specific large-scale benchmarks from a Knowledge Graph (KG). KG2Tables leverages the internal hierarchy of the relevant KG concepts and their properties. As a proof of concept, we have built large datasets in the food, biodiversity, and biomedical domains. The resulting datasets, tFood, tBiomed, and tBiodiv, have been made publicly available in the ISWC SemTab challenge (2023 and 2024 editions). We include the evaluation results of top-performing STI systems on tFood. These results underscore its potential as a robust evaluation benchmark for challenging STI systems. We demonstrate the data quality of the generated benchmarks using a sample-based approach that includes, for example, an assessment of how realistic the tables are. In addition, we provide an extensive discussion of KG2Tables, explaining how it could be used to create benchmarks for any domain of interest and covering its key features and limitations, with suggestions to overcome them.
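
The idea of deriving tables from a concept hierarchy and its properties can be illustrated with a minimal sketch. This is not the KG2Tables implementation: the toy in-memory graph, the dictionaries SUBCLASS_OF, INSTANCE_OF, and PROPERTIES, and the functions descendants, build_table, and generate_benchmark are all hypothetical names and data chosen for illustration. The sketch assumes one table per concept that has direct instances, with one row per instance and one column per property.

```python
from collections import deque

# Toy knowledge graph (hypothetical data, not taken from the paper or any real KG).
SUBCLASS_OF = {
    "cheese": "food",
    "bread": "food",
    "blue cheese": "cheese",
}
INSTANCE_OF = {
    "Gorgonzola": "blue cheese",
    "Roquefort": "blue cheese",
    "Baguette": "bread",
}
PROPERTIES = {
    "Gorgonzola": {"country of origin": "Italy", "made from": "cow's milk"},
    "Roquefort": {"country of origin": "France", "made from": "sheep milk"},
    "Baguette": {"country of origin": "France"},
}


def descendants(root):
    """Return root plus every concept below it, in breadth-first order."""
    children = {}
    for child, parent in SUBCLASS_OF.items():
        children.setdefault(parent, []).append(child)
    seen, queue = [root], deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def build_table(concept):
    """Build one table: rows are the concept's direct instances, columns their properties."""
    entities = sorted(e for e, c in INSTANCE_OF.items() if c == concept)
    columns = sorted({p for e in entities for p in PROPERTIES.get(e, {})})
    header = ["label"] + columns
    # Missing property values become empty cells.
    rows = [[e] + [PROPERTIES.get(e, {}).get(p, "") for p in columns] for e in entities]
    return header, rows


def generate_benchmark(root):
    """Generate one table per concept under root that has at least one instance."""
    tables = {}
    for concept in descendants(root):
        header, rows = build_table(concept)
        if rows:
            tables[concept] = (header, rows)
    return tables


if __name__ == "__main__":
    for concept, (header, rows) in generate_benchmark("food").items():
        print(concept, header, rows)
```

In this sketch, the concept a table was built from, the entity behind each row, and the property behind each column are all known at generation time, so they could serve as ground-truth annotations when an STI system is scored against the table.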

Subject Classification

ACM Subject Classification
  • Information systems → Information integration
Keywords
  • Semantic Table Interpretation (STI)
  • Knowledge Graph (KG)
  • STI Benchmark
  • Food
  • Biodiversity
  • Biomedical
