Top-Down Induction of Decision Trees: Rigorous Guarantees and Inherent Limitations

Authors: Guy Blanc, Jane Lange, and Li-Yang Tan



File

LIPIcs.ITCS.2020.44.pdf
  • Filesize: 0.83 MB
  • 44 pages

Document Identifiers

  • DOI: 10.4230/LIPIcs.ITCS.2020.44

Author Details

Guy Blanc
  • Stanford University, CA, USA
Jane Lange
  • Stanford University, CA, USA
Li-Yang Tan
  • Stanford University, CA, USA

Acknowledgements

We thank Clément Canonne, Adam Klivans, Charlotte Peale, Toniann Pitassi, Omer Reingold, and Rocco Servedio for helpful conversations and suggestions. We also thank the anonymous reviewers of ITCS 2020 for their feedback.

Cite As

Guy Blanc, Jane Lange, and Li-Yang Tan. Top-Down Induction of Decision Trees: Rigorous Guarantees and Inherent Limitations. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 151, pp. 44:1-44:44, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ITCS.2020.44

Abstract

Consider the following heuristic for building a decision tree for a function f : {0,1}^n → {±1}: place the most influential variable x_i of f at the root, recurse on the subfunctions f_{x_i=0} and f_{x_i=1} on the left and right subtrees respectively, and terminate once the tree is an ε-approximation of f. We analyze the quality of this heuristic, obtaining near-matching upper and lower bounds:

- Upper bound: For every f with decision tree size s and every ε ∈ (0,1/2), this heuristic builds a decision tree of size at most s^{O(log(s/ε) log(1/ε))}.
- Lower bound: For every ε ∈ (0,1/2) and s ≤ 2^{Õ(√n)}, there is an f with decision tree size s such that this heuristic builds a decision tree of size s^{Ω̃(log s)}.

We also obtain upper and lower bounds for monotone functions: s^{O(√(log s)/ε)} and s^{Ω(∜(log s))} respectively. The lower bound disproves conjectures of Fiat and Pechyony (2004) and Lee (2009).

Our upper bounds yield new algorithms for properly learning decision trees under the uniform distribution. We show that these algorithms - which are motivated by widely employed and empirically successful top-down decision tree learning heuristics such as ID3, C4.5, and CART - achieve provable guarantees that compare favorably with those of the current fastest algorithm (Ehrenfeucht and Haussler, 1989), and even have certain qualitative advantages. Our lower bounds shed new light on the limitations of these heuristics.

Finally, we revisit the classic work of Ehrenfeucht and Haussler. We extend it to give the first uniform-distribution proper learning algorithm that achieves polynomial sample and memory complexity, while matching its state-of-the-art quasipolynomial runtime.
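To make the heuristic concrete, here is a minimal Python sketch of one natural instantiation (our illustration, not code from the paper). It builds the tree by exact enumeration over {0,1}^n, so it is only meant for small n, and it uses a simple local stopping rule, splitting a leaf only while the restricted function is more than ε-far from a constant; the paper's exact termination condition may differ, but this rule does guarantee that the resulting tree is an ε-approximation of f.

from itertools import product

def bias(f, n, restriction):
    # Average of f over the subcube fixed by `restriction` (dict: coordinate -> bit).
    free = [j for j in range(n) if j not in restriction]
    total = 0
    for bits in product((0, 1), repeat=len(free)):
        x = dict(zip(free, bits)); x.update(restriction)
        total += f(tuple(x[j] for j in range(n)))
    return total / 2 ** len(free)

def influence(f, n, i, restriction):
    # Probability, over the subcube, that flipping coordinate i changes f.
    free = [j for j in range(n) if j not in restriction and j != i]
    flips = 0
    for bits in product((0, 1), repeat=len(free)):
        x = dict(zip(free, bits)); x.update(restriction)
        x0, x1 = dict(x), dict(x)
        x0[i], x1[i] = 0, 1
        if f(tuple(x0[j] for j in range(n))) != f(tuple(x1[j] for j in range(n))):
            flips += 1
    return flips / 2 ** len(free)

def build_tree(f, n, eps, restriction=None):
    # Top-down heuristic: split on the most influential free variable,
    # stop once the restricted function is eps-close to a constant.
    restriction = restriction or {}
    b = bias(f, n, restriction)            # E[f] on this subcube, a value in [-1, 1]
    if (1 - abs(b)) / 2 <= eps:            # error of the best constant label
        return {'leaf': 1 if b >= 0 else -1}
    free = [j for j in range(n) if j not in restriction]
    i = max(free, key=lambda j: influence(f, n, j, restriction))
    return {'var': i,
            0: build_tree(f, n, eps, {**restriction, i: 0}),
            1: build_tree(f, n, eps, {**restriction, i: 1})}

# Example: f(x) = x0 XOR (x1 AND x2), encoded as a {-1, +1}-valued function.
f = lambda x: 1 if (x[0] ^ (x[1] & x[2])) else -1
print(build_tree(f, n=3, eps=0.01))

Because every leaf's restricted function is ε-close to the constant it is labeled with, the tree's overall error under the uniform distribution is at most ε (it is a convex combination of the per-leaf errors).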

Subject Classification

ACM Subject Classification
  • Theory of computation → Oracles and decision trees
  • Theory of computation → Boolean function learning
Keywords
  • Decision trees
  • Influence of variables
  • Analysis of Boolean functions
  • Learning theory
  • Top-down decision tree heuristics


References

  1. Scott Aaronson and Andris Ambainis. The Need for Structure in Quantum Speedups. Theory of Computing, 10(6):133-166, 2014.
  2. Misha Alekhnovich, Mark Braverman, Vitaly Feldman, Adam Klivans, and Toniann Pitassi. The complexity of properly learning simple concept classes. Journal of Computer and System Sciences, 74(1):16-34, 2009.
  3. Pranjal Awasthi, Vitaly Feldman, and Varun Kanade. Learning Using Local Membership Queries. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), pages 398-431, 2013.
  4. Paul Beame, Shayan Oveis Gharan, and Xin Yang. Time-Space Tradeoffs for Learning Finite Functions from Random Evaluations, with Applications to Polynomials. In Proceedings of the 31st Conference On Learning Theory (COLT), volume 75, pages 843-856, 2018.
  5. Eric Blais, Clément Canonne, Igor Oliveira, Rocco Servedio, and Li-Yang Tan. Learning circuits with few negations. In Proceedings of the 19th International Workshop on Randomization and Computation (RANDOM), pages 512-527, 2015.
  6. Eric Blais and Li-Yang Tan. Approximating Boolean functions with depth-2 circuits. SIAM Journal on Computing, 44(6):1583-1600, 2015.
  7. Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (STOC), pages 253-262, 1994.
  8. Avrim Blum. Rank-r decision trees are a subclass of r-decision lists. Information Processing Letters, 42(4):183-185, 1992. URL: https://doi.org/10.1016/0020-0190(92)90237-P.
  9. Avrim Blum, Carl Burch, and John Langford. On Learning Monotone Boolean Functions. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS), pages 408-415, 1998.
  10. Avrim Blum and Pat Langley. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 97(1-2):245-271, 1997.
  11. Leo Breiman. Classification and regression trees. Routledge, 2017.
  12. Alon Brutzkus, Amit Daniely, and Eran Malach. ID3 Learns Juntas for Smoothed Product Distributions. ArXiv, abs/1906.08654, 2019. URL: http://arxiv.org/abs/1906.08654.
  13. Alon Brutzkus, Amit Daniely, and Eran Malach. On the Optimality of Trees Generated by ID3. ArXiv, abs/1907.05444, 2019. URL: http://arxiv.org/abs/1907.05444.
  14. Nader Bshouty. Exact learning via the monotone theory. Information and Computation, 123(1):146-153, 1995.
  15. Nader Bshouty and Christino Tamon. On the Fourier spectrum of monotone functions. Journal of the ACM, 43(4):747-770, 1996.
  16. Nader H. Bshouty, Elchanan Mossel, Ryan O'Donnell, and Rocco A. Servedio. Learning DNF from random walks. Journal of Computer and System Sciences, 71(3):250-265, 2005.
  17. Sitan Chen and Ankur Moitra. Beyond the low-degree algorithm: mixtures of subcubes and their applications. In Proceedings of the 51st Annual ACM Symposium on Theory of Computing (STOC), pages 869-880, 2019.
  18. Dana Dachman-Soled, Homin K. Lee, Tal Malkin, Rocco A. Servedio, Andrew Wan, and Hoeteck Wee. Optimal Cryptographic Hardness of Learning Monotone Functions. Theory of Computing, 5(13):257-282, 2009. URL: https://doi.org/10.4086/toc.2009.v005a013.
  19. Ilias Diakonikolas, Prahladh Harsha, Adam Klivans, Raghu Meka, Prasad Raghavendra, Rocco Servedio, and Li-Yang Tan. Bounding the average sensitivity and noise sensitivity of polynomial threshold functions. In Proceedings of the 42nd Annual Symposium on Theory of Computing (STOC), pages 533-542, 2010.
  20. Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the weak learning framework to understand and improve C4.5. In Proceedings of the 13th International Conference on Machine Learning (ICML), pages 96-104, 1996.
  21. Andrzej Ehrenfeucht and David Haussler. Learning decision trees from random examples. Information and Computation, 82(3):231-246, 1989.
  22. Vitaly Feldman. Hardness of Proper Learning. In Encyclopedia of Algorithms, 2016.
  23. Amos Fiat and Dmitry Pechyony. Decision trees: More theoretical justification for practical algorithms. In Proceedings of the 15th International Conference on Algorithmic Learning Theory (ALT), pages 156-170, 2004.
  24. Ehud Friedgut. Boolean functions with low average sensitivity depend on few coordinates. Combinatorica, 18(1):474-483, 1998.
  25. Ehud Friedgut and Gil Kalai. Every Monotone Graph Property has a Sharp Threshold. Proceedings of the American Mathematical Society, 124:2993-3002, 1996.
  26. Sumegha Garg, Ran Raz, and Avishay Tal. Extractor-based Time-space Lower Bounds for Learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 990-1002, 2018.
  27. Sumegha Garg, Ran Raz, and Avishay Tal. Time-Space Lower Bounds for Two-Pass Learning. In Proceedings of the 34th Computational Complexity Conference (CCC), pages 22:1-22:39, 2019.
  28. Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45:653-750, 1998.
  29. Oded Goldreich and Leonid Levin. A hard-core predicate for all one-way functions. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing (STOC), pages 25-32, 1989.
  30. Parikshit Gopalan, Adam Kalai, and Adam Klivans. Agnostically learning decision trees. In Proceedings of the 40th ACM Symposium on Theory of Computing (STOC), pages 527-536, 2008.
  31. Parikshit Gopalan and Rocco Servedio. Learning and lower bounds for AC⁰ with threshold gates. In Proceedings of the 14th International Workshop on Randomization and Computation (RANDOM), pages 588-601, 2010.
  32. Thomas Hancock and Yishay Mansour. Learning monotone k-μ DNF formulas on product distributions. In Proceedings of the 4th Annual Conference on Computational Learning Theory (COLT), pages 179-193, 1991.
  33. Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter Optimization: A Spectral Approach. Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
  34. Jeffrey Jackson, Homin Lee, Rocco Servedio, and Andrew Wan. Learning Random Monotone DNF. Discrete Applied Mathematics, 159(5):259-271, 2011.
  35. Jeff Kahn, Gil Kalai, and Nathan Linial. The influence of variables on boolean functions. In Proceedings of the 29th Annual Symposium on Foundations of Computer Science (FOCS), pages 68-80, 1988.
  36. Daniel Kane. The average sensitivity of an intersection of halfspaces. In Proceedings of the 46th ACM Symposium on Theory of Computing (STOC), pages 437-440, 2014.
  37. Daniel Kane. The correct exponent for the Gotsman-Linial conjecture. Computational Complexity, 23(2):151-175, 2014.
  38. Michael Kearns. Boosting theory towards practice: recent developments in decision tree induction and the weak learning framework (invited talk). In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI), pages 1337-1339, 1996.
  39. Michael Kearns, Ming Li, and Leslie Valiant. Learning Boolean formulas. Journal of the ACM, 41(6):1298-1328, 1994.
  40. Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and System Sciences, 58(1):109-128, 1999.
  41. Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67-95, 1994.
  42. Gillat Kol, Ran Raz, and Avishay Tal. Time-space Hardness of Learning Sparse Parities. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1067-1080, 2017.
  43. Eyal Kushilevitz and Yishay Mansour. Learning Decision Trees Using the Fourier Spectrum. SIAM Journal on Computing, 22(6):1331-1348, December 1993.
  44. Homin Lee. On the learnability of monotone functions. PhD thesis, Columbia University, 2009.
  45. Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform and learnability. Journal of the ACM, 40(3):607-620, 1993.
  46. Dinesh Mehta and Vijay Raghavan. Decision tree approximations of Boolean functions. Theoretical Computer Science, 270(1-2):609-623, 2002.
  47. Dana Moshkovitz and Michal Moshkovitz. Mixing Implies Lower Bounds for Space Bounded Learning. In Proceedings of the 30th Conference on Learning Theory (COLT), pages 1516-1566, 2017.
  48. Dana Moshkovitz and Michal Moshkovitz. Entropy Samplers and Strong Generic Lower Bounds For Space Bounded Learning. In Proceedings of the 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 28:1-28:20, 2018.
  49. Elchanan Mossel, Ryan O'Donnell, and Rocco A. Servedio. Learning functions of k relevant variables. Journal of Computer and System Sciences, 69(3):421-434, 2004.
  50. Elchanan Mossel, Ryan O'Donnell, and Krzysztof Oleszkiewicz. Noise stability of functions with low influences: Invariance and optimality. Annals of Mathematics, 171:295-341, 2010.
  51. Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014. Available at URL: http://analysisofbooleanfunctions.net/.
  52. Ryan O'Donnell, Michael Saks, Oded Schramm, and Rocco Servedio. Every Decision Tree Has an Influential Variable. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 31-39, 2005.
  53. Ryan O'Donnell and Rocco Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing, 37(3):827-844, 2007.
  54. Ryan O'Donnell and Karl Wimmer. KKL, Kruskal-Katona, and Monotone Nets. SIAM Journal on Computing, 42(6):2375-2399, 2013.
  55. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
  56. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
  57. Ran Raz. A Time-Space Lower Bound for a Large Class of Learning Problems. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 732-742, 2017.
  58. Ran Raz. Fast Learning Requires Good Memory: A Time-Space Lower Bound for Parity Learning. Journal of the ACM, 66(1):3:1-3:18, December 2018.
  59. Ronald Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.
  60. Yoshifumi Sakai and Akira Maruoka. Learning monotone log-term DNF formulas under the uniform distribution. Theory of Computing Systems, 33:17-33, 2000.
  61. Dominik Scheder and Li-Yang Tan. On the Average Sensitivity and Density of k-CNF Formulas. In Proceedings of the 17th International Workshop on Randomization and Computation (RANDOM), pages 683-698, 2013.
  62. Linda Sellie. Learning random monotone DNF under the uniform distribution. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 181-192, 2008.
  63. Rocco Servedio. On learning monotone DNF under product distributions. Information and Computation, 193(1):57-74, 2004.
  64. Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Proceedings of the 28th Conference on Neural Information Processing Systems, pages 163-171, 2014.
  65. Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Proceedings of the 29th Annual Conference on Learning Theory (COLT), pages 1490-1516, 2016.
  66. Karsten Verbeurgt. Learning sub-classes of monotone DNF on the uniform distribution. In Proceedings of the 9th Conference on Algorithmic Learning Theory (ALT), pages 385-399, 1998.
  67. Ian Witten, Eibe Frank, Mark Hall, and Christopher Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.