Synthetic Data Generation from JSON Schemas

Authors Hugo André Coelho Cardoso, José Carlos Ramalho




File

OASIcs.SLATE.2022.5.pdf
  • Filesize: 1.1 MB
  • 16 pages

Author Details

Hugo André Coelho Cardoso
  • University of Minho, Braga, Portugal
José Carlos Ramalho
  • Department of Informatics, University of Minho, Braga, Portugal

Cite As

Hugo André Coelho Cardoso and José Carlos Ramalho. Synthetic Data Generation from JSON Schemas. In 11th Symposium on Languages, Applications and Technologies (SLATE 2022). Open Access Series in Informatics (OASIcs), Volume 104, pp. 5:1-5:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/OASIcs.SLATE.2022.5

Abstract

This paper describes the development of DataGen From Schemas, a new version of DataGen that automatically generates representative synthetic datasets from JSON and XML schemas. It is intended to facilitate tasks such as the thorough testing of software applications and scientific work in relevant areas, namely Data Science. The paper focuses solely on the JSON Schema component of the application.
The prior version of DataGen is an online open-source application for quickly prototyping datasets through its own Domain-Specific Language (DSL) for specifying data models. DataGen parses these models and generates synthetic datasets that respect the stipulated structural and semantic restrictions, automating the whole data generation process with values created at runtime and/or drawn from a library of support datasets.
The objective of this new product, DataGen From Schemas, is to expand DataGen's use cases and raise the abstraction level of dataset specification, making it possible to generate synthetic datasets directly from schemas. The new platform builds upon the prior version and complements it: the two operate jointly and share the same data layer, which ensures their compatibility and the portability of the created DSL models between them. It parses schema files and generates the corresponding DSL models, effectively translating the JSON specification into a DataGen model, and then uses the original application as middleware to generate the final datasets.
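
The two-stage idea described in the abstract (translate a schema into a model, then generate data from that model) can be illustrated with a minimal Python sketch. It is not DataGen's implementation and does not use DataGen's DSL: the tuple-based intermediate model, the function names `schema_to_model` and `generate`, and the default bounds are all hypothetical, and only a small subset of JSON Schema keywords is handled (`type`, `properties`, `items`, `enum`, and a few numeric and length bounds). The `required` keyword is ignored, so every declared property is always generated.

```python
import random
import string


def schema_to_model(schema):
    """Translate a small JSON Schema subset into a hypothetical intermediate model."""
    if "enum" in schema:
        return ("choice", list(schema["enum"]))
    kind = schema.get("type")
    if kind == "object":
        props = schema.get("properties", {})
        return ("object", {name: schema_to_model(sub) for name, sub in props.items()})
    if kind == "array":
        lo = schema.get("minItems", 0)
        hi = schema.get("maxItems", lo + 3)  # assumed default upper bound
        return ("array", schema_to_model(schema.get("items", {"type": "null"})), lo, hi)
    if kind == "integer":
        lo = schema.get("minimum", 0)
        return ("integer", lo, schema.get("maximum", lo + 100))
    if kind == "number":
        lo = schema.get("minimum", 0.0)
        return ("number", lo, schema.get("maximum", lo + 100.0))
    if kind == "string":
        lo = schema.get("minLength", 1)
        return ("string", lo, schema.get("maxLength", lo + 8))
    if kind == "boolean":
        return ("boolean",)
    if kind == "null":
        return ("null",)
    raise ValueError(f"unsupported schema: {schema!r}")


def generate(model, rng):
    """Produce one random value that follows the intermediate model."""
    kind = model[0]
    if kind == "choice":
        return rng.choice(model[1])
    if kind == "object":
        return {name: generate(sub, rng) for name, sub in model[1].items()}
    if kind == "array":
        _, item, lo, hi = model
        return [generate(item, rng) for _ in range(rng.randint(lo, hi))]
    if kind == "integer":
        return rng.randint(model[1], model[2])
    if kind == "number":
        return rng.uniform(model[1], model[2])
    if kind == "string":
        length = rng.randint(model[1], model[2])
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    if kind == "boolean":
        return rng.random() < 0.5
    if kind == "null":
        return None
    raise ValueError(f"unknown model node: {model!r}")


# Illustrative schema; the field names are made up for this example.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer", "minimum": 1, "maximum": 999},
        "name": {"type": "string", "minLength": 3, "maxLength": 10},
        "role": {"enum": ["admin", "user"]},
        "tags": {"type": "array", "items": {"type": "string"}, "maxItems": 2},
    },
}

model = schema_to_model(schema)              # stage 1: schema -> model
record = generate(model, random.Random(42))  # stage 2: model -> synthetic record
print(record)
```

Keeping translation and generation separate mirrors the division of labour the abstract describes: the intermediate model can be stored, inspected, or edited before any data is produced, and a seeded generator makes the output reproducible.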

Subject Classification

ACM Subject Classification
  • Software and its engineering → Domain specific languages
  • Theory of computation → Grammars and context-free languages
  • Information systems → Open source software
Keywords
  • Schemas
  • JSON
  • Data Generation
  • Synthetic Data
  • DataGen
  • DSL
  • Dataset
  • Grammar
  • Randomization
  • Open Source
  • Data Science
  • REST API
  • PEG.js
