Minha: Large-Scale Distributed Systems Testing Made Practical

Authors: Nuno Machado, Francisco Maia, Francisco Neves, Fábio Coelho, José Pereira

Author Details

Nuno Machado
  • Teradata, Madrid, Spain
  • INESC TEC, Porto, Portugal
Francisco Maia
  • INESC TEC, Porto, Portugal
Francisco Neves
  • INESC TEC, Porto, Portugal
  • U.Minho, Braga, Portugal
Fábio Coelho
  • INESC TEC, Porto, Portugal
  • U.Minho, Braga, Portugal
José Pereira
  • INESC TEC, Porto, Portugal
  • U.Minho, Braga, Portugal

Cite As

Nuno Machado, Francisco Maia, Francisco Neves, Fábio Coelho, and José Pereira. Minha: Large-Scale Distributed Systems Testing Made Practical. In 23rd International Conference on Principles of Distributed Systems (OPODIS 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 153, pp. 11:1-11:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Abstract


Testing large-scale distributed system software is still far from practical: the sheer scale needed and the inherent non-determinism make it very expensive to deploy and use realistically large environments, even with cloud computing and state-of-the-art automation. Moreover, observing global states without disturbing the system under test is itself difficult. This is particularly troubling because the gap between distributed algorithms and their implementations can easily introduce subtle bugs that are disclosed only by suitably large-scale tests. We address this challenge with Minha, a framework that virtualizes multiple JVM instances in a single JVM, thus simulating a distributed environment in which each host runs on a separate machine and accesses dedicated network and CPU resources. The key contributions are the ability to run off-the-shelf concurrent and distributed JVM bytecode programs while scaling up to thousands of virtual nodes, and the enabling of global observation within standard software testing frameworks. Our experiments with two distributed systems show the usefulness of Minha in disclosing errors, evaluating global properties, and scaling tests by orders of magnitude with the same hardware resources.
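
To make the general idea of running many simulated hosts inside one JVM more concrete, the following is a minimal sketch of a discrete-event simulation with a virtual clock and a global check over all nodes. It is not Minha's API or implementation: the class `VirtualCluster`, the binary-tree broadcast, and the fixed per-hop latency are all assumptions chosen for illustration, and Minha itself runs unmodified bytecode rather than hand-written node logic like this.

```java
import java.util.PriorityQueue;

/** Illustrative only: many virtual nodes in one JVM, driven by a virtual clock. */
class VirtualCluster {
    /** A message delivery scheduled at a virtual timestamp (hypothetical type). */
    static final class Event {
        final long time;
        final int target;
        Event(long time, int target) { this.time = time; this.target = target; }
    }

    final int size;            // number of virtual nodes
    final long latency;        // assumed fixed per-hop latency, in virtual time units
    final boolean[] received;  // per-node state, observable globally by a test
    long now = 0;              // virtual clock, advanced only by event processing

    private final PriorityQueue<Event> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.time, b.time));

    VirtualCluster(int size, long latency) {
        this.size = size;
        this.latency = latency;
        this.received = new boolean[size];
    }

    /** Node logic: on first receipt, forward to the node's two children in a binary tree. */
    private void onReceive(int node) {
        if (received[node]) {
            return; // ignore duplicate deliveries
        }
        received[node] = true;
        for (int child = 2 * node + 1; child <= 2 * node + 2; child++) {
            if (child < size) {
                queue.add(new Event(now + latency, child));
            }
        }
    }

    /** Broadcasts from node 0, processing events in timestamp order; returns nodes reached. */
    int broadcast() {
        queue.add(new Event(now, 0));
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            now = e.time; // jump the virtual clock to the event's timestamp
            onReceive(e.target);
        }
        int reached = 0;
        for (boolean r : received) {
            if (r) {
                reached++;
            }
        }
        return reached;
    }
}
```

Because all virtual nodes share one heap, a unit test can call `new VirtualCluster(1000, 5).broadcast()` and then inspect `received` and `now` directly, which is the kind of global observation that is hard to obtain from a real deployment. Virtual time also decouples the simulated latency from the wall-clock time the test takes to run.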

Subject Classification

ACM Subject Classification
  • Computing methodologies → Distributed computing methodologies
  • Distributed software testing
  • Large scale distributed systems
  • Simulation


