Synthesizing Conjunctive Queries for Code Search

Authors: Chengpeng Wang, Peisen Yao, Wensheng Tang, Gang Fan, Charles Zhang



Author Details

Chengpeng Wang
  • The Hong Kong University of Science and Technology, China
Peisen Yao
  • Zhejiang University, Hangzhou, China
Wensheng Tang
  • The Hong Kong University of Science and Technology, China
Gang Fan
  • Ant Group, Shenzhen, China
Charles Zhang
  • The Hong Kong University of Science and Technology, China

Acknowledgements

We thank the anonymous reviewers, Xiao Xiao, and Xiaoheng Xie for their helpful comments. Peisen Yao is the corresponding author.

Cite As

Chengpeng Wang, Peisen Yao, Wensheng Tang, Gang Fan, and Charles Zhang. Synthesizing Conjunctive Queries for Code Search. In 37th European Conference on Object-Oriented Programming (ECOOP 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 263, pp. 36:1-36:30, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ECOOP.2023.36

Abstract

This paper presents Squid, a new conjunctive query synthesis algorithm for searching code with target patterns. Given positive and negative examples along with a natural language description, Squid analyzes the relations that a Datalog-based program analyzer derives from the examples and synthesizes a conjunctive query expressing the search intent. The synthesized query can then be used to search for the desired grammatical constructs in the editor. To achieve high efficiency, we prune the huge search space by removing unnecessary relations and enumerating query candidates via refinement. We also introduce two quantitative metrics for query prioritization, which select the desired queries from multiple candidates. We have evaluated Squid on over thirty code search tasks. Squid successfully synthesizes the conjunctive queries for all the tasks, taking only 2.56 seconds on average.
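
To make the idea of a conjunctive query that separates positive from negative examples concrete, here is a minimal sketch in Python. It is not Squid's algorithm: the relation names, facts, and candidate atoms are hypothetical, and the search is a naive smallest-first enumeration rather than the pruning, refinement, and prioritization the abstract describes.

```python
from itertools import combinations

# Hypothetical facts; a real program analyzer would derive these from code.
FACTS = {
    "method": {("m1", "foo"), ("m2", "bar"), ("m3", "baz")},      # (id, name)
    "returns": {("m1", "int"), ("m2", "void"), ("m3", "int")},    # (id, type)
    "modifier": {("m1", "public"), ("m2", "public"), ("m3", "private")},
}

# Hypothetical candidate atoms; terms starting with '?' are variables.
CANDIDATE_ATOMS = [
    ("method", ("?m", "?name")),
    ("returns", ("?m", "int")),
    ("modifier", ("?m", "public")),
]


def evaluate(query, head_var, facts):
    """Return the values of head_var that satisfy every atom in the query."""
    bindings = [{}]
    for rel, terms in query:
        extended = []
        for env in bindings:
            for row in facts[rel]:
                new_env = dict(env)
                ok = True
                for term, value in zip(terms, row):
                    if term.startswith("?"):
                        # Bind the variable, or check an existing binding.
                        if new_env.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        # Constant term must match the fact exactly.
                        ok = False
                        break
                if ok:
                    extended.append(new_env)
        bindings = extended
    return {env[head_var] for env in bindings if head_var in env}


def synthesize(atoms, head_var, facts, positives, negatives):
    """Return the first smallest conjunction covering all positives and no negatives."""
    for size in range(1, len(atoms) + 1):
        for combo in combinations(atoms, size):
            result = evaluate(list(combo), head_var, facts)
            if positives <= result and not (negatives & result):
                return list(combo)
    return None


query = synthesize(CANDIDATE_ATOMS, "?m", FACTS, {"m1"}, {"m2", "m3"})
print(query)
```

In this toy setting no single atom excludes both negative examples, so the search settles on the conjunction of the `returns` and `modifier` atoms, which reads as "public methods returning int".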

Subject Classification

ACM Subject Classification
  • Software and its engineering → Automatic programming
  • Human-centered computing → User interface programming
Keywords
  • Query Synthesis
  • Multi-modal Program Synthesis
  • Code Search
