Java Bytecode Normalization for Code Similarity Analysis

Schott, Stefan; Ponta, Serena Elisa; Fischer, Wolfram; Klauke, Jonas; Bodden, Eric

doi:10.4230/LIPIcs.ECOOP.2024.37

Abstract

Analyzing the similarity of two code fragments has many applications, including code clone, vulnerability, and plagiarism detection. Most existing approaches to similarity analysis work on source code. However, in scenarios such as plagiarism detection, copyright violation detection, or Software Bill of Materials creation, source code is often not available, and similarity analysis has to be performed on binary formats. Java bytecode is a binary format that is obtained by compiling Java source code and is executable by the Java Virtual Machine. Performing similarity detection on bytecode is challenging because different compilers can compile the same source code to syntactically very different bytecode. In this work we assess to what extent one can nonetheless enable similarity detection through bytecode normalization, a procedure that transforms Java bytecode into a representation that is identical for the same original source code, irrespective of the Java compiler and Java version used during compilation. Our manual study revealed 16 classes of compilation differences that various compilation environments may induce. Based on these findings, we implemented bytecode normalization in a tool called jNorm. It uses Jimple as its intermediate representation, applies common code optimizations, and transforms all classes of compilation differences to a normalized form, thus achieving a representation of the bytecode that is identical despite different compilation environments. Our evaluation, performed on more than 300 popular Java projects, shows that merely incrementing the compiler version may cause differences in 46% of all resulting bytecode files. Applying bytecode normalization removes more than 99% of these differences, making it a crucial enabler for subsequent applications of bytecode similarity analysis.
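To make the notion of a compilation difference concrete, the following minimal sketch shows one well-known case: string concatenation, whose compilation strategy changed with JEP 280 (Indify String Concatenation). The class name `ConcatDemo` and the method `greet` are hypothetical names chosen for illustration. The abstract does not state whether this case is among the 16 classes, or how jNorm treats it.

```java
// Hypothetical illustration class; not taken from the paper or from jNorm.
class ConcatDemo {
    // The same source line compiles to different bytecode depending on
    // the javac target release:
    //  - targeting Java 8, javac emits a chain of
    //    new StringBuilder / append / toString calls;
    //  - targeting Java 9 or later, javac emits a single invokedynamic
    //    instruction bootstrapped by
    //    java.lang.invoke.StringConcatFactory.makeConcatWithConstants.
    static String greet(String name) {
        return "Hello, " + name + "!";
    }

    public static void main(String[] args) {
        // Runtime behavior is identical under both compilation strategies.
        System.out.println(greet("jNorm"));
    }
}
```

Compiling this file once with `javac --release 8` and once with a later release (on a JDK that still supports both), then disassembling each class file with `javap -c`, shows two syntactically different method bodies for the same source. This is the kind of difference a normalization step has to reconcile before bytecode can be compared.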

