The Fault in Our Stars: Designing Reproducible Large-scale Code Analysis Experiments

Authors: Petr Maj, Stefanie Muroya, Konrad Siek, Luca Di Grazia, Jan Vitek




File

LIPIcs.ECOOP.2024.27.pdf
  • Filesize: 1.68 MB
  • 23 pages

Document Identifiers
  • DOI: 10.4230/LIPIcs.ECOOP.2024.27

Author Details

Petr Maj
  • Czech Technical University, Prague, Czech Republic
Stefanie Muroya
  • Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria
Konrad Siek
  • Czech Technical University, Prague, Czech Republic
Luca Di Grazia
  • Università della Svizzera italiana (USI), Lugano, Switzerland
Jan Vitek
  • Charles University, Prague, Czech Republic
  • Northeastern University, Boston, MA, USA

Acknowledgements

We would like to thank Digital Ocean for their involuntary contribution of computational resources during the early data gathering phase of our research. We acknowledge the reviewers of ICSE'22 and thank the reviewers of ECOOP'23 for their encouragement and for sticking around until 2024.

Cite As

Petr Maj, Stefanie Muroya, Konrad Siek, Luca Di Grazia, and Jan Vitek. The Fault in Our Stars: Designing Reproducible Large-scale Code Analysis Experiments. In 38th European Conference on Object-Oriented Programming (ECOOP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 313, pp. 27:1-27:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ECOOP.2024.27

Abstract

Large-scale software repositories are a source of insights for software engineering. They offer an unmatched window into the software development process at scale. Their sheer number and size hold the promise of broadly applicable results. At the same time, that very size presents practical challenges for scaling tools and algorithms to millions of projects. A reasonable approach is to limit studies to representative samples of the population of interest; broadly applicable conclusions can then be obtained by generalizing from the sample to the entire population. The contribution of this paper is a standardized experimental design methodology for choosing the inputs of studies that work with large-scale repositories. We advocate for a methodology that clearly lays out what the population of interest is and how to sample it, and that fosters reproducibility. Along the way, we discourage researchers from selecting projects by extrinsic attributes such as stars, which measure some unclear notion of popularity.
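
To make the sampling step concrete, the following is a minimal illustrative sketch, not the authors' pipeline: it defines a population explicitly, draws a seeded random sample instead of taking the most-starred projects, and records the sample so the experiment can be re-run on exactly the same inputs. The projects.csv file and its columns (name, language, stars) are hypothetical placeholders for whatever project metadata a study has collected.

    # Illustrative sketch: reproducible random sampling of projects,
    # as opposed to selecting the top-N projects by stars.
    # Assumes a hypothetical projects.csv with columns: name, language, stars.
    import csv
    import random

    SEED = 20240313          # fixed, reported seed so the sample can be re-drawn exactly
    SAMPLE_SIZE = 1000

    with open("projects.csv", newline="") as f:
        projects = list(csv.DictReader(f))

    # 1. State the population of interest explicitly (here: Python projects).
    population = [p for p in projects if p["language"] == "Python"]

    # 2. Sample uniformly at random with a recorded seed, rather than
    #    sorting by an extrinsic attribute such as stars.
    rng = random.Random(SEED)
    sample = rng.sample(population, min(SAMPLE_SIZE, len(population)))

    # 3. Persist the exact sample so the study's inputs can be reproduced.
    with open("sample.txt", "w") as out:
        out.writelines(p["name"] + "\n" for p in sample)

Publishing the population definition, the seed, and the resulting sample list alongside the study is what makes the input selection reproducible; the particular file formats above are only for illustration.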

Subject Classification

ACM Subject Classification
  • Software and its engineering
Keywords
  • software
  • mining code repositories
  • source code analysis

