Java Bytecode Normalization for Code Similarity Analysis

Authors: Stefan Schott, Serena Elisa Ponta, Wolfram Fischer, Jonas Klauke, Eric Bodden




Author Details

Stefan Schott
  • Paderborn University, Germany
Serena Elisa Ponta
  • SAP Security Research, Mougins, France
Wolfram Fischer
  • SAP Security Research, Mougins, France
Jonas Klauke
  • Paderborn University, Germany
Eric Bodden
  • Paderborn University, Germany
  • Fraunhofer IEM, Paderborn, Germany

Cite As

Stefan Schott, Serena Elisa Ponta, Wolfram Fischer, Jonas Klauke, and Eric Bodden. Java Bytecode Normalization for Code Similarity Analysis. In 38th European Conference on Object-Oriented Programming (ECOOP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 313, pp. 37:1-37:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ECOOP.2024.37

Abstract

Analyzing the similarity of two code fragments has many applications, including code clone, vulnerability, and plagiarism detection. Most existing approaches to similarity analysis work on source code. However, in scenarios such as plagiarism detection, copyright violation detection, or Software Bill of Materials creation, source code is often unavailable, so similarity analysis has to be performed on binary formats. Java bytecode is a binary format that is executable by the Java Virtual Machine and obtained by compiling Java source code. Performing similarity detection on bytecode is challenging because different compilers can compile the same source code to syntactically very different bytecode. In this work, we assess to what extent one can nonetheless enable similarity detection through bytecode normalization, a procedure that transforms Java bytecode into a representation that is identical for the same original source code, irrespective of the Java compiler and Java version used during compilation. Our manual study revealed 16 classes of compilation differences that various compilation environments may induce. Based on these findings, we implemented bytecode normalization in a tool called jNorm. It uses Jimple as its intermediate representation, applies common code optimizations, and transforms all classes of compilation differences into a normalized form, thus achieving a representation of the bytecode that is identical across different compilation environments. Our evaluation, performed on more than 300 popular Java projects, shows that merely incrementing the compiler version may cause differences in 46% of all resulting bytecode files. Applying bytecode normalization removes more than 99% of these differences, which makes it a crucial enabler for subsequent applications of bytecode similarity analysis.
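
As an illustration of the kind of compilation difference the abstract refers to, consider string concatenation: since JEP 280 (JDK 9), javac compiles `a + b` on strings to an `invokedynamic` call, whereas older javac versions emit a `StringBuilder` chain. The following is a minimal sketch of the normalization idea only. It is not jNorm's implementation and does not use Jimple; the class name `ToyNormalizer`, the simplified textual instruction format, and the `CONCAT` token are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ToyNormalizer {
    // Hypothetical canonical token standing for "concatenate the loaded operands".
    static final String CONCAT = "CONCAT";

    // Instructions that only exist to set up the pre-JDK 9 StringBuilder chain.
    // Assumption: in this toy listing they never occur for any other purpose
    // (real bytecode uses "dup" in many contexts, so a real tool must be far more careful).
    private static final Set<String> BUILDER_SCAFFOLDING = Set.of(
            "new StringBuilder",
            "dup",
            "invokespecial StringBuilder.<init>",
            "invokevirtual StringBuilder.append");

    // Simplified listing for `a + b` as emitted by javac targeting Java 8 or older.
    static final List<String> OLD_STYLE = List.of(
            "new StringBuilder",
            "dup",
            "invokespecial StringBuilder.<init>",
            "aload a",
            "invokevirtual StringBuilder.append",
            "aload b",
            "invokevirtual StringBuilder.append",
            "invokevirtual StringBuilder.toString");

    // Simplified listing for the same expression compiled for Java 9 or newer.
    static final List<String> NEW_STYLE = List.of(
            "aload a",
            "aload b",
            "invokedynamic makeConcatWithConstants");

    // Maps both concatenation styles to the same canonical instruction sequence.
    static List<String> normalize(List<String> instructions) {
        List<String> normalized = new ArrayList<>();
        for (String ins : instructions) {
            if (BUILDER_SCAFFOLDING.contains(ins)) {
                continue; // drop StringBuilder set-up and append calls
            }
            if (ins.equals("invokevirtual StringBuilder.toString")
                    || ins.equals("invokedynamic makeConcatWithConstants")) {
                normalized.add(CONCAT); // both forms end the concatenation
            } else {
                normalized.add(ins); // operand loads are kept unchanged
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        // The raw listings differ, the normalized ones do not.
        System.out.println(OLD_STYLE.equals(NEW_STYLE));                       // prints false
        System.out.println(normalize(OLD_STYLE).equals(normalize(NEW_STYLE))); // prints true
    }
}
```

The sketch handles only the case in which both operands are variables. Real bytecode is harder: for example, the `invokedynamic` form can fold constant operands into its recipe string, so a practical normalizer has to work on a structured intermediate representation rather than on a flat token list.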

Subject Classification

ACM Subject Classification
  • Software and its engineering → Compilers
Keywords
  • Bytecode
  • Java Compiler
  • Code Similarity Analysis

