Statistical Comparison of Algorithm Performance Through Instance Selection

Authors Théo Matricon , Marie Anastacio , Nathanaël Fijalkow , Laurent Simon , Holger H. Hoos

Thumbnail PDF


  • Filesize: 1.02 MB
  • 21 pages

Document Identifiers

Author Details

Théo Matricon
  • Univ. Bordeaux, CNRS, LaBRI, UMR 5800, F-33400, Talence, France
Marie Anastacio
  • Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
Nathanaël Fijalkow
  • CNRS, LaBRI, Bordeaux, France,
  • The Alan Turing Institute of data science, London, UK
Laurent Simon
  • Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400, Talence, France
Holger H. Hoos
  • Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
  • University of British Columbia, Vancouver, Canada

Cite AsGet BibTex

Théo Matricon, Marie Anastacio, Nathanaël Fijalkow, Laurent Simon, and Holger H. Hoos. Statistical Comparison of Algorithm Performance Through Instance Selection. In 27th International Conference on Principles and Practice of Constraint Programming (CP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 210, pp. 43:1-43:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Empirical performance evaluations, in competitions and scientific publications, play a major role in improving the state of the art in solving many automated reasoning problems, including SAT, CSP and Bayesian network structure learning (BNSL). To empirically demonstrate the merit of a new solver usually requires extensive experiments, with computational costs of CPU years. This not only makes it difficult for researchers with limited access to computational resources to test their ideas and publish their work, but also consumes large amounts of energy. We propose an approach for comparing the performance of two algorithms: by performing runs on carefully chosen instances, we obtain a probabilistic statement on which algorithm performs best, trading off between the computational cost of running algorithms and the confidence in the result. We describe a set of methods for this purpose and evaluate their efficacy on diverse datasets from SAT, CSP and BNSL. On all these datasets, most of our approaches were able to choose the correct algorithm with about 95% accuracy, while using less than a third of the CPU time required for a full comparison; the best methods reach this level of accuracy within less than 15% of the CPU time for a full comparison.

Subject Classification

ACM Subject Classification
  • General and reference → Evaluation
  • Theory of computation → Automated reasoning
  • Theory of computation → Constraint and logic programming
  • Performance assessment
  • early stopping
  • automated reasoning solvers


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. Tomáš Balyo, Nils Froleyks, Marijn J.H. Heule, Markus Iser, Matti Järvisalo, and Martin Suda, editors. Proceedings of SAT Competition 2020: Solver and Benchmark Descriptions. Department of Computer Science Report Series B. Department of Computer Science, University of Helsinki, Finland, 2020. URL:
  2. Mauro Birattari. Tuning Metaheuristics: A Machine Learning Perspective. Springer Publishing Company, Incorporated, 1st ed. 2005. 2nd printing edition, 2009. URL:
  3. Mauro Birattari, Thomas Stützle, Luis Paquete, and Klaus Varrentrapp. A racing algorithm for configuring metaheuristics. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, GECCO'02, pages 11-18, San Francisco, CA, USA, January 2002. Morgan Kaufmann Publishers Inc. URL:
  4. Bernd Bischl, Pascal Kerschke, Lars Kotthoff, Marius Lindauer, Yuri Malitsky, Alexandre Fréchette, Holger Hoos, Frank Hutter, Kevin Leyton-Brown, Kevin Tierney, and Joaquin Vanschoren. Aslib: A benchmark library for algorithm selection. Artificial Intelligence, 237:41-58, 2016. URL:
  5. Jakob Bossek and Markus Wagner. Generating instances with performance differences for more than just two algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '21, page 1423–1432, New York, NY, USA, 2021. Association for Computing Machinery. URL:
  6. William Jay Conover. Practical nonparametric statistics, volume 350. John Wiley & Sons, 1998. URL:
  7. Martin Davis and Hilary Putnam. A computing procedure for quantification theory. Journal of the ACM, 7(3):201–215, 1960. URL:
  8. Ian P. Gent, Bilal Syed Hussain, Christopher Jefferson, Lars Kotthoff, Ian Miguel, Glenna F. Nightingale, and Peter Nightingale. Discriminating instance generation for automated constraint model selection. In Barry O'Sullivan, editor, Principles and Practice of Constraint Programming, pages 356-365, Cham, 2014. Springer International Publishing. URL:
  9. Carla Gomes, Bart Selman, Nuno Crato, and Henry Kautz. Heavy-tailed phenomena in satisfiability and constraint satisfaction problems. Journal of Automated Reasoning, 24:67–100, January 2000. URL:
  10. Marijn Heule, Matti Järvisalo, and Martin Suda. Sat competition 2018. Journal on Satisfiability, Boolean Modeling and Computation, 11:133-154, September 2019. URL:
  11. Holger H. Hoos. Automated algorithm configuration and parameter tuning. In Youssef Hamadi, Eric Monfroy, and Frédéric Saubion, editors, Autonomous Search, pages 37-71. Springer, 2012. URL:
  12. Barry Hurley and Barry O'Sullivan. Statistical regimes and runtime prediction. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, page 318–324. AAAI Press, 2015. URL:
  13. Frank Hutter, Holger Hoos, and Kevin Leyton-brown. Bayesian optimization with censored response data. In In NIPS workshop on Bayesian Optimization, Sequential Experimental Design, and Bandits, 2011. URL:
  14. Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Carlos A. Coello Coello, editor, Learning and Intelligent Optimization, pages 507-523, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. URL:
  15. Frank Hutter, Thomas Stützle, Kevin Leyton-Brown, and Holger H. Hoos. Paramils: An automatic algorithm configuration framework. CoRR, abs/1401.3492, 2014. URL:
  16. Frank Hutter, Lin Xu, Holger H. Hoos, and Kevin Leyton-Brown. Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence, 206:79-111, 2014. URL:
  17. Pascal Kerschke, Holger H. Hoos, Frank Neumann, and Heike Trautmann. Automated algorithm selection: Survey and perspectives. Evolutionary Computation, 27(1):3-45, 2019. URL:
  18. Charles L. Lawson and Richard J. Hanson. Solving Least Squares Problems. Society for Industrial and Applied Mathematics, 1995. URL:
  19. Marius Lindauer, Jan N. van Rijn, and Lars Kotthoff. Open algorithm selection challenge 2017: Setup and scenarios. In Marius Lindauer, Jan N. van Rijn, and Lars Kotthoff, editors, Proceedings of the Open Algorithm Selection Challenge, volume 79 of Proceedings of Machine Learning Research, Brussels, Belgium, September 11-12, 2017. PMLR. URL:
  20. Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43-58, 2016. URL:
  21. Brandon Malone, Kustaa Kangas, Matti Järvisalo, Mikko Koivisto, and Petri Myllymäki. Empirical hardness of finding optimal bayesian network structures: algorithm selection and runtime prediction. Machine Learning, 107:247–283, January 2018. URL:
  22. Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy learners. Artif. Intell. Rev., 11(1-5):193–225, 1997. URL:
  23. John W. Pratt. Remarks on zeros and ties in the wilcoxon signed rank procedures. Journal of the American Statistical Association, 54:655-667, 1959. URL:
  24. Luca Pulina and Martina Seidl. The 2016 and 2017 qbf solvers evaluations (qbfeval'16 and qbfeval'17). Artificial Intelligence, 274:224-248, 2019. URL:
  25. Josef Schmee and Gerald J. Hahn. A simple method for regression analysis with censored data. Technometrics, 21(4):417-432, 1979. URL:
  26. Sidney Siegel and N John Castellan Jr. Nonparametric statistics for the behavioral sciences. Mcgraw-Hill Book Company, 1988. URL:
  27. L. Simon, Daniel Leberre, and E. Hirsch. The SAT 2002 Competition. Annals of Mathematics and Artificial Intelligence, vol.43, Issue 1-4:307-342, 2005. URL:
  28. Peter James Stuckey, Thibaut Feydy, Andreas Schutt, Guido Tack, and Julien Fischer. The minizinc challenge 2008-2013. AI Magazine, 35(2):55-60, 2014. URL:
  29. Li-Li Sun and Xi-Zhao Wang. A survey on active learning strategy. In 2010 International Conference on Machine Learning and Cybernetics, volume 1, pages 161-166, 2010. URL:
  30. G. Sutcliffe. Proceedings of the 10th ijcar atp system competition ( casc-j 10 ). IJCAR, 2020. URL:
  31. Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. SATzilla: Portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research, 32:565-606, 2008. URL:
Questions / Remarks / Feedback

Feedback for Dagstuhl Publishing

Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail