Minha: Large-Scale Distributed Systems Testing Made Practical

Machado, Nuno; Maia, Francisco; Neves, Francisco; Coelho, Fábio; Pereira, José

doi:10.4230/LIPIcs.OPODIS.2019.11

Abstract

Testing large-scale distributed system software is still far from practical as the sheer scale needed and the inherent non-determinism make it very expensive to deploy and use realistically large environments, even with cloud computing and state-of-the-art automation. Moreover, observing global states without disturbing the system under test is itself difficult. This is particularly troubling as the gap between distributed algorithms and their implementations can easily introduce subtle bugs that are disclosed only with suitably large scale tests. We address this challenge with Minha, a framework that virtualizes multiple JVM instances in a single JVM, thus simulating a distributed environment where each host runs on a separate machine, accessing dedicated network and CPU resources. The key contributions are the ability to run off-the-shelf concurrent and distributed JVM bytecode programs while at the same time scaling up to thousands of virtual nodes; and enabling global observation within standard software testing frameworks. Our experiments with two distributed systems show the usefulness of Minha in disclosing errors, evaluating global properties, and in scaling tests orders of magnitude with the same hardware resources.
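The core idea in the abstract, many virtual hosts sharing one process while a test inspects their combined state, can be illustrated with a toy discrete-event simulation. The sketch below is not Minha's API and does not run JVM bytecode: `Simulator`, `Node`, and `run_gossip` are hypothetical names, and the push-gossip workload, fanout, and latency values are assumptions chosen only to keep the example small.

```python
import heapq
import random


class Simulator:
    """Toy discrete-event scheduler with one virtual clock (hypothetical, not Minha's API)."""

    def __init__(self, seed=0):
        self.now = 0.0                  # virtual time, in seconds
        self._queue = []                # entries are (time, sequence number, callback)
        self._seq = 0                   # tie-breaker so equal-time events keep insertion order
        self.rng = random.Random(seed)  # seeded, so a run can be reproduced exactly

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        # Process events in virtual-time order until none are left.
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()


class Node:
    """One virtual host: it learns a rumor once and forwards it to random peers."""

    def __init__(self, node_id, sim, fanout=3, latency=0.01):
        self.node_id = node_id
        self.sim = sim
        self.fanout = fanout            # assumed number of peers contacted per node
        self.latency = latency          # assumed one-way message delay, in virtual seconds
        self.peers = []
        self.informed = False

    def receive(self):
        if self.informed:
            return                      # duplicate deliveries are ignored
        self.informed = True
        for peer in self.sim.rng.sample(self.peers, self.fanout):
            self.sim.schedule(self.latency, peer.receive)


def run_gossip(num_nodes=1000, seed=42):
    """Run one dissemination and return (informed node count, final virtual time)."""
    sim = Simulator(seed)
    nodes = [Node(i, sim) for i in range(num_nodes)]
    for node in nodes:
        # All nodes share one peer list; a node may pick itself, which is a harmless no-op.
        node.peers = nodes
    sim.schedule(0.0, nodes[0].receive)  # inject the rumor at node 0
    sim.run()
    # Global observation: every node's state is readable from the test process.
    informed = sum(1 for n in nodes if n.informed)
    return informed, sim.now


if __name__ == "__main__":
    informed, elapsed = run_gossip()
    print(f"{informed} of 1000 virtual nodes informed after {elapsed:.2f}s of virtual time")
```

Because all virtual nodes live in one process and draw from one seeded random source, a test can read every node's state directly and repeat a run with the same outcome, which is the kind of global observation the abstract describes. A real system such as Minha additionally has to virtualize threads, sockets, and CPU time for unmodified programs, none of which this sketch attempts.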

