Proof Repair Infrastructure for Supervised Models: Building a Large Proof Repair Dataset

Authors Tom Reichel, R. Wesley Henderson, Andrew Touchet, Andrew Gardner, Talia Ringer



PDF
Thumbnail PDF

File

LIPIcs.ITP.2023.26.pdf
  • Filesize: 0.68 MB
  • 20 pages

Document Identifiers

Author Details

Tom Reichel
  • University of Illinois Urbana-Champaign, IL, USA
R. Wesley Henderson
  • Radiance Technologies, Inc., Ruston, LA, USA
Andrew Touchet
  • Radiance Technologies, Inc., Ruston, LA, USA
Andrew Gardner
  • Radiance Technologies, Inc., Ruston, LA, USA
Talia Ringer
  • University of Illinois Urbana-Champaign, IL, USA

Cite As Get BibTex

Tom Reichel, R. Wesley Henderson, Andrew Touchet, Andrew Gardner, and Talia Ringer. Proof Repair Infrastructure for Supervised Models: Building a Large Proof Repair Dataset. In 14th International Conference on Interactive Theorem Proving (ITP 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 268, pp. 26:1-26:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.ITP.2023.26

Abstract

We report on our efforts building a new, large proof-repair dataset and benchmark suite for the Coq proof assistant. The dataset is made up of Git commits from open-source projects with old and new versions of definitions and proofs aligned across commits. Building this dataset has been a significant undertaking, highlighting a number of challenges and gaps in existing infrastructure. We discuss these challenges and gaps, and we provide recommendations for how the proof assistant community can address them. Our hope is to make it easier to build datasets and benchmark suites so that machine-learning tools for proofs will move to target the tasks that matter most and do so equitably across proof assistants.

Subject Classification

ACM Subject Classification
  • Computing methodologies → Machine learning
  • Software and its engineering → Software maintenance tools
  • Security and privacy → Logic and verification
Keywords
  • proof repair
  • datasets
  • benchmarks
  • machine learning
  • formal proof

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Statistics - archive of formal proofs. URL: https://www.isa-afp.org/statistics/.
  2. Arpan Agrawal, Emily First, Zhanna Kaufman, Tom Reichel, Shizhuo Zhang, Timothy Zhou, Alex Sanchez-Stern, and Talia Ringer. Proofster. URL: https://www.alexsanchezstern.com/papers/proofster.pdf.
  3. Europroofnet. URL: https://europroofnet.github.io/.
  4. 2nd MATH-AI Workshop at NeurIPS'22, 2021-2022. URL: https://mathai2022.github.io/.
  5. Openai. URL: https://openai.com/.
  6. Proof engineering, adaptation, repair, and learning for software (pearls). URL: https://sam.gov/opp/da84366306554cc981f37f703a78c698/view.
  7. AI for Theorem Proving, 2016-2022. URL: http://aitp-conference.org/.
  8. Emilio Jesús Gallego Arias. SerAPI: Machine-Friendly, Data-Centric Serialization for COQ. Technical Report hal-01384408, HAL, 2016. URL: http://dml.mathdoc.fr/item/hal-01384408/.
  9. Zhangir Azerbayev, Bartosz Piotrowski, and Jeremy Avigad. ProofNet: A benchmark for autoformalizing and formally proving undergraduate-level mathematics problems. In Second MATH-AI Workshop, 2022. URL: https://mathai2022.github.io/.
  10. Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. Getafix: Learning to fix bugs automatically. Proc. ACM Program. Lang., 3(OOPSLA), October 2019. URL: https://doi.org/10.1145/3360585.
  11. Kshitij Bansal, Sarah M Loos, Markus N Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An environment for machine learning of higher order logic theorem proving. In ICML, 2019. Google Scholar
  12. Beyond Bayes: Paths Towards Universal Reasoning Systems, 2022. URL: https://beyond-bayes.github.io/.
  13. Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. The tactician: A seamless, interactive tactic learner and prover for coq. In Intelligent Computer Mathematics: 13th International Conference, CICM 2020, Bertinoro, Italy, July 26–31, 2020, Proceedings, pages 271-277, Berlin, Heidelberg, 2020. Springer-Verlag. URL: https://doi.org/10.1007/978-3-030-53518-6_17.
  14. Olivier Boite. Proof reuse with extended inductive types. In Theorem Proving in Higher Order Logics: 17th International Conference, TPHOLs 2004, Park City, Utah, USA, September 14-17, 2004. Proceedings, pages 50-65, Berlin, Heidelberg, 2004. Springer. URL: https://doi.org/10.1007/978-3-540-30142-4_4.
  15. Anthony Bordg, Yiannos A Stathopoulos, and Lawrence C Paulson. A parallel corpus of natural language and isabelle artefacts. In 7th Conference on Artificial Intelligence and Theorem Proving (AITP), 2022. URL: http://aitp-conference.org/2022/abstract/AITP_2022_paper_8.pdf.
  16. Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar-Lezama, Yisong Yue, et al. Neurosymbolic programming. Foundations and Trendsregistered in Programming Languages, 7(3):158-243, 2021. Google Scholar
  17. Is there a full documentation of coq’s grammar? URL: https://coq.discourse.group/t/is-there-a-full-documentation-of-coqs-grammar/647/10.
  18. coq_lsp. URL: https://github.com/ejgallego/coq-lsp.
  19. Proposal: a custom build tool for coq projects. URL: https://coq.discourse.group/t/proposal-a-custom-build-tool-for-coq-projects/239/2.
  20. pycoq. URL: https://github.com/IBM/pycoq.
  21. Garett Cunningham, Razvan C. Bunescu, and David Juedes. Towards autoformalization of mathematics and code correctness: Experiments with elementary proofs, 2023. URL: https://doi.org/10.48550/ARXIV.2301.02195.
  22. Michel Marie Deza and Elena Deza. Encyclopedia of Distances. Springer Berlin Heidelberg, Berlin, 3 edition, October 2014. URL: https://link.springer.com/book/10.1007/978-3-642-30958-8.
  23. Thomas Durieux, Matias Martinez, Martin Monperrus, Romain Sommerard, and Jifeng Xuan. Automatic repair of real bugs: An experience report on the defects4j dataset. CoRR, abs/1505.07002, 2015. URL: https://arxiv.org/abs/1505.07002.
  24. Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. Out of the bleu: how should we assess quality of the code generation models? arXiv preprint arXiv:2208.03133, 2022. Google Scholar
  25. Amy Felty and Douglas Howe. Generalization and reuse of tactic proofs. In Logic Programming and Automated Reasoning: 5th International Conference, LPAR '94, pages 1-15, Berlin, Heidelberg, 1994. Springer. URL: https://doi.org/10.1007/3-540-58216-9_25.
  26. Emily First and Yuriy Brun. Diversity-driven automated formal verification. In Proceedings of the 44th International Conference on Software Engineering (ICSE)(22-27). Pittsburgh, PA, USA. https://doi. org/10.1145/3510003.3510138, 2022. Google Scholar
  27. General roadmap. URL: https://github.com/ejgallego/coq-serapi/issues/252.
  28. Query ast returns empty result. URL: https://github.com/ejgallego/coq-serapi/issues/117.
  29. Serapi 'classic mode' final release notice. URL: https://github.com/ejgallego/coq-serapi/issues/252#issuecomment-1365510329.
  30. Luca Gazzola, Daniela Micucci, and Leonardo Mariani. Automatic software repair: A survey. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, page 1219, New York, NY, USA, 2018. Association for Computing Machinery. URL: https://doi.org/10.1145/3180155.3182526.
  31. Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W. Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. CoRR, abs/2102.06203, 2021. URL: https://arxiv.org/abs/2102.06203.
  32. Albert Jiang, Wenda Li, Jesse Han, and Wu Yuhuai. Lisa: Language models of isabelle proofs. In 6th Conference on Artificial Intelligence and Theorem Proving (AITP), 2021. Google Scholar
  33. Portal-to-isabelle. URL: https://github.com/albertqjiang/Portal-to-ISAbelle.
  34. René Just, Darioush Jalali, and Michael D. Ernst. Defects4J: A Database of existing faults to enable controlled testing studies for Java programs. In ISSTA 2014, Proceedings of the 2014 International Symposium on Software Testing and Analysis, pages 437-440, San Jose, CA, USA, July 2014. Tool demo. Google Scholar
  35. Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. Mash: Machine learning for sledgehammer. In Sandrine Blazy, Christine Paulin-Mohring, and David Pichardie, editors, Interactive Theorem Proving, pages 35-50, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. Google Scholar
  36. Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving. arXiv preprint arXiv:2205.11491, 2022. Google Scholar
  37. Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving, 2022. URL: https://doi.org/10.48550/ARXIV.2205.11491.
  38. Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. Coconut: Combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2020, pages 101-114, New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3395363.3397369.
  39. Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Mathematical Proofs. Lulu Press, Morrisville, North Carolina, 2019. http://us.metamath.org/downloads/metamath.pdf. Google Scholar
  40. Martin Monperrus. Automatic software repair: A bibliography. ACM Comput. Surv., 51(1), January 2018. URL: https://doi.org/10.1145/3105906.
  41. James Munkres. Algorithms for the Assignment and Transportation Problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32-38, 1957. Publisher: Society for Industrial and Applied Mathematics. URL: https://www.jstor.org/stable/2098689.
  42. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311-318, USA, 2002. Association for Computational Linguistics. URL: https://doi.org/10.3115/1073083.1073135.
  43. Ofir Pele and Michael Werman. Fast and robust earth mover’s distances. In 2009 IEEE 12th International Conference on Computer Vision, pages 460-467, 2009. URL: https://doi.org/10.1109/ICCV.2009.5459199.
  44. Frank Pfenning. Proof Transformations in Higher-Order Logic. PhD thesis, Carnegie Mellon University Pittsburgh, 1987. Google Scholar
  45. Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. CoRR, abs/2009.03393, 2020. URL: https://arxiv.org/abs/2009.03393.
  46. Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015, pages 24-36, New York, NY, USA, 2015. Association for Computing Machinery. URL: https://doi.org/10.1145/2771783.2771791.
  47. Markus N. Rabe and Christian Szegedy. Towards the automatic mathematician. In André Platzer and Geoff Sutcliffe, editors, Automated Deduction - CADE 28, pages 25-37, Cham, 2021. Springer International Publishing. Google Scholar
  48. Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. CoRR, abs/2009.10297, 2020. URL: https://arxiv.org/abs/2009.10297.
  49. Talia Ringer. Proof Repair. PhD thesis, University of Washington, 2021. Google Scholar
  50. Talia Ringer, Karl Palmskog, Ilya Sergey, Milos Gligoric, and Zachary Tatlock. QED at large: A survey of engineering of formally verified software. CoRR, abs/2003.06458, 2020. URL: https://arxiv.org/abs/2003.06458.
  51. Talia Ringer, RanDair Porter, Nathaniel Yazdani, John Leo, and Dan Grossman. Proof repair across type equivalences. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, pages 112-127, New York, NY, USA, 2021. Association for Computing Machinery. URL: https://doi.org/10.1145/3453483.3454033.
  52. Talia Ringer, Alex Sanchez-Stern, Dan Grossman, and Sorin Lerner. REPLica: REPL instrumentation for Coq analysis. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, pages 99-113, New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3372885.3373823.
  53. Talia Ringer, Nathaniel Yazdani, John Leo, and Dan Grossman. Adapting proof automation to adapt proofs. In Proceedings of the 7th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2018, pages 115-129, New York, NY, USA, 2018. Association for Computing Machinery. URL: https://doi.org/10.1145/3167094.
  54. Talia Ringer, Nathaniel Yazdani, John Leo, and Dan Grossman. Ornaments for proof reuse in coq. In Interactive Theorem Proving, 2019. Google Scholar
  55. Valentin Robert. Front-end tooling for building and maintaining dependently-typed functional programs. PhD thesis, UC San Diego, 2018. Google Scholar
  56. Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. Generating correctness proofs with neural networks, 2019. URL: https://doi.org/10.48550/ARXIV.1907.07794.
  57. scala-isabelle. URL: https://dominique-unruh.github.io/scala-isabelle/.
  58. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981. URL: https://doi.org/10.1016/0022-2836(81)90087-5.
  59. Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. Context-aware patch generation for better automated program repair. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, pages 1-11, New York, NY, USA, 2018. Association for Computing Machinery. URL: https://doi.org/10.1145/3180155.3180233.
  60. Makarius Wenzel. Isabelle/jedit - a prover IDE within the PIDE framework. CoRR, abs/1207.3441, 2012. URL: https://arxiv.org/abs/1207.3441.
  61. mmverify.py. URL: https://github.com/david-a-wheeler/mmverify.py.
  62. Iain Johnston Whiteside. Refactoring proofs. PhD thesis, University of Edinburgh, November 2013. URL: http://hdl.handle.net/1842/7970.
  63. Karin Wibergh. Automatic refactoring for agda. Master’s thesis, Chalmers University of Technology and University of Gothenburg, 2019. Google Scholar
  64. Yuhuai Wu, Albert Q Jiang, Wenda Li, Markus N Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. arXiv preprint arXiv:2205.12615, 2022. Google Scholar
  65. Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants. In International Conference on Machine Learning (ICML), Long Beach, CA, USA, 2019. URL: http://proceedings.mlr.press/v97/yang19a/yang19a.pdf.
  66. Michihiro Yasunaga and Percy Liang. Break-it-fix-it: Unsupervised learning for program repair. In International Conference on Machine Learning, pages 11941-11952. PMLR, 2021. Google Scholar
  67. John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 15(11):e1002683, November 2018. URL: https://doi.org/10.1371/journal.pmed.1002683.
  68. Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. In International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=9ZPegFuFTFv.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail