Is a Dataframe Just a Table?

Author Yifan Wu



Author Details

Yifan Wu
  • UC Berkeley, Berkeley, CA, USA

Acknowledgements

Thanks to my advisor Joe Hellerstein for the inspirations and to Devin Petersohn, Michael Whittaker, Remco Chang, Wenting Zheng, and Eric Liang for their valuable and kind feedback.

Cite As

Yifan Wu. Is a Dataframe Just a Table?. In 10th Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU 2019). Open Access Series in Informatics (OASIcs), Volume 76, pp. 6:1-6:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/OASIcs.PLATEAU.2019.6

Abstract

Querying data is core to databases and data science. However, the two communities have seemingly different concepts and use cases. As a result, both designers and users of the query languages disagree on whether the core abstractions - dataframes (data science) and tables (databases) - and the operations are the same. To investigate the difference from a PL-HCI perspective, we identify the basic affordances provided by tables and dataframes and how programming experiences over tables and dataframes differ. We show that the data structures nudge programmers to query and store their data in different ways. We hope the case study could clarify confusions, dispel misinformation, increase cross-pollination between the two communities, and identify open PL-HCI questions.

Subject Classification

ACM Subject Classification
  • Information systems → Relational database query languages
  • Software and its engineering → Software usability
  • Software and its engineering → API languages

Keywords
  • Usability of Programming Languages
