Command Similarity Measurement Using NLP

Authors: Zafar Hussain, Jukka K. Nurminen, Tommi Mikkonen, Marcin Kowiel


  • Filesize: 0.89 MB
  • 14 pages

Author Details

Zafar Hussain
  • Department of Computer Science, University of Helsinki, Finland
Jukka K. Nurminen
  • Department of Computer Science, University of Helsinki, Finland
Tommi Mikkonen
  • Department of Computer Science, University of Helsinki, Finland
Marcin Kowiel
  • F-Secure Corporation, Poland

Cite As

Zafar Hussain, Jukka K. Nurminen, Tommi Mikkonen, and Marcin Kowiel. Command Similarity Measurement Using NLP. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Process invocations happen with almost every activity on a computer. To distinguish user input from potentially malicious activities, we need to better understand program invocations caused by commands. To achieve this, one must understand commands’ objectives, possible parameters, and valid syntax. In this work, we collected command data by scraping commands’ manual pages, including command description, syntax, and parameters. Then, we measured command similarity using two of these, description and parameters, based on the commands’ natural language documentation. We used the Term Frequency-Inverse Document Frequency (TFIDF) of words to compare the commands, followed by cosine similarity to measure how similar the commands’ descriptions are. For parameters, after computing TFIDF and cosine similarity, the Hungarian method is applied to solve the assignment of different parameter combinations. Finally, commands are clustered based on their similarity scores. The results show that these methods have efficiently clustered the commands into smaller groups (commands with aliases or close counterparts) and into a bigger group (commands belonging to a larger set of related commands, e.g., bitsadmin for Windows and systemd for Linux). To validate the clustering results, we applied topic modeling to the commands’ data, which confirms that 84% of the Windows commands and 98% of the Linux commands are clustered correctly.
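
The pipeline described in the abstract (TFIDF weighting, cosine similarity of descriptions, then an assignment step for parameters) can be illustrated with a minimal sketch. This is not the authors' implementation: the command descriptions and parameter texts below are invented for illustration, the TFIDF variant (smoothed IDF, whitespace tokenization) is an assumption, and a brute-force search over permutations stands in for the Hungarian method. Both solve the same assignment problem, but the brute-force version is only practical for a handful of parameters; `scipy.optimize.linear_sum_assignment` would be the usual polynomial-time choice.

```python
import math
from collections import Counter
from itertools import permutations


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_assignment(sim):
    """Pair each row with a distinct column so total similarity is maximal.

    Brute-force stand-in for the Hungarian method; assumes the matrix has
    at least as many columns as rows.
    """
    rows, cols = len(sim), len(sim[0])

    def score(perm):
        return sum(sim[i][j] for i, j in enumerate(perm))

    best = max(permutations(range(cols), rows), key=score)
    return list(enumerate(best)), score(best)


# Hypothetical description texts, not taken from any manual page.
descriptions = {
    "dir": "displays a list of files and subdirectories in a directory",
    "ls": "list information about the files in the current directory",
    "ping": "sends echo request packets to network hosts",
}
names = list(descriptions)
desc_vecs = dict(zip(names, tfidf_vectors(list(descriptions.values()))))
print("dir vs ls:  ", cosine(desc_vecs["dir"], desc_vecs["ls"]))
print("dir vs ping:", cosine(desc_vecs["dir"], desc_vecs["ping"]))

# Hypothetical parameter descriptions for two commands.
params_a = ["lists files in subdirectories recursively",
            "sorts the output by name"]
params_b = ["list subdirectories recursively",
            "sort output by file size",
            "show hidden entries"]
p_vecs = tfidf_vectors(params_a + params_b)
sim_matrix = [
    [cosine(p_vecs[i], p_vecs[len(params_a) + j]) for j in range(len(params_b))]
    for i in range(len(params_a))
]
pairs, total = best_assignment(sim_matrix)
print("parameter pairs:", pairs, "mean similarity:", total / len(params_a))
```

In this sketch, the mean similarity over the assigned parameter pairs serves as a single parameter-level score for two commands; how the paper combines description and parameter scores before clustering is not stated in the abstract, so that step is left out.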

Subject Classification

ACM Subject Classification
  • Computing methodologies → Natural language processing

Keywords
  • Natural Language Processing
  • NLP
  • Windows Commands
  • Linux Commands
  • Textual Similarity
  • Command Term Frequency
  • Inverse Document Frequency
  • Cosine Similarity
  • Linear Sum Assignment
  • Command Clustering



