Command Similarity Measurement Using NLP

Hussain, Zafar; Nurminen, Jukka K.; Mikkonen, Tommi; Kowiel, Marcin

doi:10.4230/OASIcs.SLATE.2021.13

Abstract

Process invocations happen with almost every activity on a computer. To distinguish user input from potentially malicious activities, we need to better understand program invocations caused by commands. To achieve this, one must understand commands’ objectives, possible parameters, and valid syntax. In this work, we collected command data by scraping commands’ manual pages, including each command’s description, syntax, and parameters. We then measured command similarity using two of these (description and parameters), based on the commands’ natural language documentation. We used Term Frequency-Inverse Document Frequency (TF-IDF) weights of words to represent the commands, then measured cosine similarity to score the similarity of commands’ descriptions. For parameters, after computing TF-IDF and cosine similarity, the Hungarian method is applied to solve the assignment of different parameter combinations. Finally, commands are clustered based on their similarity scores. The results show that these methods cluster the commands into smaller groups (commands with aliases or close counterparts) and into larger groups (commands belonging to a larger set of related commands, e.g., bitsadmin for Windows and systemd for Linux). To validate the clustering results, we applied topic modeling to the command data, which confirms that 84% of the Windows commands and 98% of the Linux commands are clustered correctly.
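The parameter-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter descriptions are hypothetical, the TF-IDF weighting is one common smoothed variant, and an exhaustive search over permutations stands in for the Hungarian method (it finds the same optimal assignment, but is practical only for small inputs).

```python
import math
from collections import Counter
from itertools import permutations


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (TF times a smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_assignment(sim):
    """One-to-one row-to-column matching that maximizes total similarity.

    Brute-force stand-in for the Hungarian method; expects rows <= columns.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    if n_rows > n_cols:
        raise ValueError("expects no more rows than columns")
    best_score, best_cols = -1.0, None
    for cols in permutations(range(n_cols), n_rows):
        score = sum(sim[r][c] for r, c in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    return list(enumerate(best_cols)), best_score


# Hypothetical parameter descriptions for two commands (not from the paper).
params_a = ["recursive copy of directories", "verbose output of copied files"]
params_b = ["print verbose output", "copy directories recursively",
            "overwrite without prompt"]

# Build TF-IDF over the combined corpus so both commands share IDF weights.
vecs = tfidf_vectors(params_a + params_b)
vecs_a, vecs_b = vecs[:len(params_a)], vecs[len(params_a):]
sim = [[cosine(a, b) for b in vecs_b] for a in vecs_a]

pairs, score = best_assignment(sim)
for i, j in pairs:
    print(f"{params_a[i]!r} <-> {params_b[j]!r} (similarity {sim[i][j]:.2f})")
```

Averaging the matched similarities would give one possible command-level parameter score; how the paper aggregates the assignment into a final score is not stated in the abstract.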

