GDBMiner: Mining Precise Input Grammars on (Almost) Any System

Eisele, Max; Hägele, Johannes; Huth, Christopher; Zeller, Andreas

doi:10.4230/LITES.10.1.1

Abstract

If one knows the input language of the system to be tested, one can generate inputs in a very efficient manner. Grammar-based fuzzers, for instance, produce inputs that are syntactically valid by construction. They are thus much more likely to be accepted by the program under test and to reach code beyond the input parser.
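To illustrate "syntactically valid by construction", here is a minimal sketch of a grammar-based generator. The toy arithmetic grammar, the `generate` function, and the depth limit are all assumptions made for illustration; they are not taken from GDBMiner or from any particular fuzzer.

```python
import random

# Hypothetical toy grammar: nonterminals map to lists of alternatives,
# each alternative being a sequence of terminals and nonterminals.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["<digit>", "<term>"], ["<digit>"]],
    "<digit>": [[d] for d in "0123456789"],
}


def count_nonterminals(alternative):
    """Number of symbols in an alternative that still need expansion."""
    return sum(symbol in GRAMMAR for symbol in alternative)


def generate(symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` into a string by randomly choosing alternatives."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, keep only the alternatives with the fewest
        # nonterminals so that the expansion is guaranteed to terminate.
        fewest = min(count_nonterminals(alt) for alt in alternatives)
        alternatives = [alt for alt in alternatives
                        if count_nonterminals(alt) == fewest]
    expansion = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(generate("<expr>", rng))
```

Every string produced this way is derivable from the grammar, so a parser for that language should accept it; no generated input is spent on trivial syntax errors.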
Grammar-based fuzzers, however, need a grammar in the first place. Grammar miners aim to extract such grammars from programs. However, current grammar mining tools either place heavy demands on the source code they are applied to or are too imprecise; both shortcomings prevent adoption in industrial practice.
We present GDBMiner, a tool to mine input grammars for binaries and executables in any (compiled) programming language, on any operating system, using any processor architecture, even without source code. GDBMiner leverages the GNU debugger (GDB) to step through the program and determine which code locations access which input bytes, generalizing bytes accessed by the same location into grammar elements.
GDBMiner is slow but versatile, and precise: in our evaluation, GDBMiner produces grammars as precise as the (more demanding) Cmimid tool, while producing more precise grammars than the (less demanding) Arvada black-box approach. GDBMiner can be applied to any recursive descent parser that can be debugged via GDB and is available as open source.
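The generalization step described above (bytes accessed by the same code location become one grammar element) can be sketched as follows. This is not GDBMiner's implementation and does not talk to GDB. It assumes a hypothetical, already-recorded trace in which each input byte is attributed to the call stack that accessed it, with every frame given as a `(function, call_id)` pair. The trace format and the name `mine_grammar` are assumptions made for illustration.

```python
from collections import defaultdict


def mine_grammar(text, trace):
    """Turn a byte-access trace into grammar rules.

    `trace` is a list of (frames, index) records ordered by input index.
    `frames` is a tuple of (function, call_id) pairs, outermost call first.
    This format is hypothetical, not GDB output.
    """
    root = {"name": "<start>", "children": []}
    path = [(None, root)]  # currently open (frame, node) pairs
    for frames, index in trace:
        # Keep the open nodes whose frames are a prefix of this call stack.
        keep = 1
        while (keep < len(path) and keep - 1 < len(frames)
               and path[keep][0] == frames[keep - 1]):
            keep += 1
        del path[keep:]
        # Open a new tree node for every frame not yet on the path.
        for frame in frames[keep - 1:]:
            node = {"name": f"<{frame[0]}>", "children": []}
            path[-1][1]["children"].append(node)
            path.append((frame, node))
        # Attribute the byte to the innermost open call.
        path[-1][1]["children"].append(text[index])

    grammar = defaultdict(set)

    def collect(node):
        # One rule per tree node: the function name becomes the nonterminal.
        expansion = tuple(child if isinstance(child, str) else child["name"]
                          for child in node["children"])
        grammar[node["name"]].add(expansion)
        for child in node["children"]:
            if not isinstance(child, str):
                collect(child)

    collect(root)
    return {name: sorted(rules) for name, rules in grammar.items()}


if __name__ == "__main__":
    # Hypothetical trace for the input "1+2" and a two-function parser.
    expr = ("parse_expr", 1)
    trace = [
        ((expr, ("parse_num", 2)), 0),
        ((expr,), 1),
        ((expr, ("parse_num", 3)), 2),
    ]
    for name, rules in mine_grammar("1+2", trace).items():
        print(name, "::=", " | ".join(" ".join(rule) for rule in rules))
```

In this sketch, both calls to the hypothetical `parse_num` contribute alternatives to the same nonterminal. That is the sense in which accesses from the same location are generalized into one grammar element.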

