Data Exploration through Dot-driven Development

Author Tomas Petricek




File

LIPIcs.ECOOP.2017.21.pdf
  • Filesize: 0.74 MB
  • 27 pages


Cite As

Tomas Petricek. Data Exploration through Dot-driven Development. In 31st European Conference on Object-Oriented Programming (ECOOP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 74, pp. 21:1-21:27, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)
https://doi.org/10.4230/LIPIcs.ECOOP.2017.21

Abstract

Data literacy is becoming increasingly important in the modern world. While spreadsheets make simple data analytics accessible to a large number of people, creating transparent scripts that can be checked, modified, reproduced and formally analyzed requires expert programming skills. In this paper, we describe the design of a data exploration language that makes the task more accessible by embedding advanced programming concepts into a simple core language. The core language uses type providers, but we employ them in a novel way: rather than providing types with members for accessing data, we provide types with members that also allow the user to compose rich and correct queries using just member access ('dot'). This way, we recreate functionality that usually requires complex type systems (row polymorphism, type state and dependent typing) in an extremely simple object-based language. We formalize our approach using an object-based calculus and prove that programs constructed using the provided types represent valid data transformations. We discuss a case study developed using the language, together with additional editor tooling that bridges some of the gaps between programming and spreadsheets. We believe that this work provides a pathway towards democratizing data science: our use of type providers significantly reduces the complexity of languages that one needs to understand in order to write scripts for exploring data.
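The idea of composing queries through member access alone can be illustrated with a small sketch. The Python below is not the paper's language or its type provider mechanism; it is a dynamically checked imitation in which a hypothetical `Query` class (with invented member names such as `drop_Team` and `sort_by_Gold_descending`) offers only those members that make sense for the columns currently in the table. After a column is dropped, the members that refer to it are no longer offered, which loosely mirrors what row polymorphism would guarantee statically.

```python
class Query:
    """Hypothetical sketch: members are derived from the current columns."""

    def __init__(self, rows, columns):
        self._rows = rows
        self._columns = columns

    def _members(self):
        # Build the members "provided" for the current schema.
        members = {"get_data": lambda: list(self._rows)}
        for col in self._columns:
            members["drop_" + col] = lambda c=col: self._drop(c)
            members["sort_by_" + col + "_descending"] = lambda c=col: self._sort(c)
        return members

    def _drop(self, col):
        # Remove a column; the resulting query no longer offers members for it.
        rows = [{k: v for k, v in r.items() if k != col} for r in self._rows]
        return Query(rows, [c for c in self._columns if c != col])

    def _sort(self, col):
        # Sort rows by the given column, largest value first.
        rows = sorted(self._rows, key=lambda r: r[col], reverse=True)
        return Query(rows, self._columns)

    def __dir__(self):
        # What an editor's completion list would show after typing a dot.
        return sorted(self._members())

    def __getattr__(self, name):
        # Called only for names not found by normal attribute lookup.
        if name.startswith("_"):
            raise AttributeError(name)
        members = self._members()
        if name not in members:
            raise AttributeError("no member %r for columns %r" % (name, self._columns))
        return members[name]()


# Made-up sample data, used only for illustration.
rows = [
    {"Athlete": "A", "Team": "X", "Gold": 2},
    {"Athlete": "B", "Team": "Y", "Gold": 5},
    {"Athlete": "C", "Team": "X", "Gold": 1},
]
medals = Query(rows, ["Athlete", "Team", "Gold"])

# The whole query is written using member access ('dot') only.
result = medals.drop_Team.sort_by_Gold_descending.get_data
```

Here `dir(medals)` lists the available members, standing in for editor auto-completion, and `medals.drop_Gold.sort_by_Gold_descending` raises `AttributeError` because the dropped column is no longer part of the schema. In this sketch the error appears at run time; the abstract describes provided types, where such a mistake would be ruled out by the type checker.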
Keywords
  • Data science
  • type providers
  • pivot tables
  • aggregation

