Programming with "Big Code": Lessons, Techniques and Applications

Authors Pavol Bielik, Veselin Raychev, Martin Vechev



File

LIPIcs.SNAPL.2015.41.pdf
  • Filesize: 469 kB
  • 10 pages

Cite As

Pavol Bielik, Veselin Raychev, and Martin Vechev. Programming with "Big Code": Lessons, Techniques and Applications. In 1st Summit on Advances in Programming Languages (SNAPL 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 32, pp. 41-50, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)
https://doi.org/10.4230/LIPIcs.SNAPL.2015.41

Abstract

Programming tools based on probabilistic models of massive codebases (aka "Big Code") promise to solve important programming tasks that were difficult or practically infeasible to address before. However, building such tools requires solving a number of hard problems at the intersection of programming languages, program analysis and machine learning. In this paper, we summarize some of our experiences and insights obtained by developing several such probabilistic systems over the last few years (some of these systems are regularly used by thousands of developers worldwide). We hope these observations can provide a guideline for others attempting to create such systems. We also present a prediction approach we find suitable as a starting point for building probabilistic tools, and discuss a practical framework implementing this approach, called Nice2Predict. We release the Nice2Predict framework publicly; it can be immediately used as a basis for developing new probabilistic tools. Finally, we present programming applications that we believe will benefit from probabilistic models and should be investigated further.

Keywords

  • probabilistic tools
  • probabilistic inference and learning
  • program analysis
  • open-source software
