DataGen: JSON/XML Dataset Generator

Authors: Filipa Alves dos Santos, Hugo André Coelho Cardoso, João da Cunha e Costa, Válter Ferreira Picas Carvalho, José Carlos Ramalho




File

OASIcs.SLATE.2021.6.pdf
  • Filesize: 0.55 MB
  • 14 pages


Author Details

Filipa Alves dos Santos
  • University of Minho, Braga, Portugal
Hugo André Coelho Cardoso
  • University of Minho, Braga, Portugal
João da Cunha e Costa
  • University of Minho, Braga, Portugal
Válter Ferreira Picas Carvalho
  • University of Minho, Braga, Portugal
José Carlos Ramalho
  • Department of Informatics, University of Minho, Braga, Portugal

Cite As

Filipa Alves dos Santos, Hugo André Coelho Cardoso, João da Cunha e Costa, Válter Ferreira Picas Carvalho, and José Carlos Ramalho. DataGen: JSON/XML Dataset Generator. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 6:1-6:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.SLATE.2021.6

Abstract

In this document we describe the steps taken towards the implementation of DataGen. DataGen is a versatile tool for quick prototyping and testing of software applications. Currently, too few solutions offer both the complexity and the scalability needed to generate adequate datasets to feed a data API or a more complex application, so that those applications can be tested with appropriate data volume and data complexity. The core of DataGen is a Domain Specific Language (DSL) created to specify datasets. This language has undergone several extensions: repeating fields (with no limit), fuzzy fields (statistically generated), lists, higher-order functions over lists, and custom-made transformation functions. The final result is a complex algebra that allows the generation of very complex datasets that cope with very complex requirements. Throughout the paper we give several examples of the possibilities. After generating a dataset, DataGen gives the user the possibility of generating a RESTful data API with that dataset, creating a running prototype. This solution has already been used in real-life cases, described in more detail throughout the paper, in which it successfully created the intended datasets. These datasets allowed the applications' performance to be tested and the right adjustments to be made. The tool is currently being deployed for general use.
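
The kinds of fields listed in the abstract (repeating fields, fuzzy fields, lists, and higher-order functions over lists) can be illustrated with a minimal sketch in plain JavaScript. This is not DataGen's DSL and does not reproduce its syntax; the helper names `makeRng`, `repeat`, `maybe`, `integer`, and `generateDataset`, as well as the record fields, are hypothetical and chosen only for illustration. A seeded generator is assumed so that the same seed always yields the same dataset.

```javascript
// Illustrative sketch only: these helpers are hypothetical and are NOT part of DataGen.

// Small seeded PRNG (mulberry32) so that a given seed always yields the same data.
function makeRng(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Repeating field: build `count` values by calling `makeValue(index)`.
function repeat(count, makeValue) {
  return Array.from({ length: count }, (_, index) => makeValue(index));
}

// Fuzzy field: produce a value with the given probability, otherwise undefined.
function maybe(rng, probability, makeValue) {
  return rng() < probability ? makeValue() : undefined;
}

// Random integer in the inclusive range [min, max].
function integer(rng, min, max) {
  return min + Math.floor(rng() * (max - min + 1));
}

function generateDataset(seed, count) {
  const rng = makeRng(seed);
  return repeat(count, (index) => {
    const record = {
      id: index + 1,
      // List whose length is itself randomly chosen (1 to 4 grades, each 0 to 20).
      grades: repeat(integer(rng, 1, 4), () => integer(rng, 0, 20)),
    };
    // Fuzzy field: present in roughly 30% of the records.
    const nickname = maybe(rng, 0.3, () => `user${index + 1}`);
    if (nickname !== undefined) record.nickname = nickname;
    // Higher-order function over a list: derive one field from another.
    record.best = record.grades.reduce((a, b) => Math.max(a, b));
    return record;
  });
}

const dataset = generateDataset(42, 5);
console.log(JSON.stringify(dataset, null, 2));
```

Running the script with Node.js prints a five-record JSON dataset. Calling `generateDataset` again with the same seed reproduces it, which is convenient when a generated dataset is used to test an application repeatedly.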

Subject Classification

ACM Subject Classification
  • Software and its engineering → Domain specific languages
  • Theory of computation → Grammars and context-free languages
  • Information systems → Open source software
Keywords
  • JSON
  • XML
  • Data Generation
  • Open Source
  • REST API
  • Strapi
  • JavaScript
  • Node.js
  • Vue.js
  • Scalability
  • Fault Tolerance
  • Dataset
  • DSL
  • PEG.js
  • MongoDB
