
# Top-Down Induction of Decision Trees: Rigorous Guarantees and Inherent Limitations

## File

LIPIcs.ITCS.2020.44.pdf
• Filesize: 0.83 MB
• 44 pages

## Acknowledgements

We thank Clément Canonne, Adam Klivans, Charlotte Peale, Toniann Pitassi, Omer Reingold, and Rocco Servedio for helpful conversations and suggestions. We also thank the anonymous reviewers of ITCS 2020 for their feedback.

## Cite As

Guy Blanc, Jane Lange, and Li-Yang Tan. Top-Down Induction of Decision Trees: Rigorous Guarantees and Inherent Limitations. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 151, pp. 44:1-44:44, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ITCS.2020.44

## Abstract

Consider the following heuristic for building a decision tree for a function f : {0,1}^n → {±1}. Place the most influential variable x_i of f at the root, and recurse on the subfunctions f_{x_i=0} and f_{x_i=1} in the left and right subtrees respectively; terminate once the tree is an ε-approximation of f. We analyze the quality of this heuristic, obtaining near-matching upper and lower bounds:

- Upper bound: For every f with decision tree size s and every ε ∈ (0, 1/2), this heuristic builds a decision tree of size at most s^{O(log(s/ε) log(1/ε))}.
- Lower bound: For every ε ∈ (0, 1/2) and s ≤ 2^{Õ(√n)}, there is an f with decision tree size s for which this heuristic builds a decision tree of size s^{Ω̃(log s)}.

We also obtain upper and lower bounds for monotone functions: s^{O(√(log s)/ε)} and s^{Ω((log s)^{1/4})} respectively. The lower bound disproves conjectures of Fiat and Pechyony (2004) and Lee (2009).

Our upper bounds yield new algorithms for properly learning decision trees under the uniform distribution. We show that these algorithms, which are motivated by widely employed and empirically successful top-down decision tree learning heuristics such as ID3, C4.5, and CART, achieve provable guarantees that compare favorably with those of the current fastest algorithm (Ehrenfeucht and Haussler, 1989), and even have certain qualitative advantages. Our lower bounds shed new light on the limitations of these heuristics.

Finally, we revisit the classic work of Ehrenfeucht and Haussler. We extend it to give the first uniform-distribution proper learning algorithm that achieves polynomial sample and memory complexity, while matching its state-of-the-art quasipolynomial runtime.
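The heuristic analyzed in the abstract can be made concrete with a minimal brute-force sketch: for small n, given f as a Python callable on {0,1}^n with ±1 outputs, compute each free variable's influence (the probability that flipping it changes f) exhaustively, split on the most influential one, and stop once the subfunction is ε-close to a constant under the uniform distribution. This is only an illustration of the splitting rule, not the paper's algorithm or analysis; all names here are invented for the sketch.

```python
from itertools import product

def build_tree(f, n, restriction, eps):
    """Top-down heuristic: split on the most influential free variable;
    terminate with a majority-label leaf once the subfunction is
    eps-close to constant under the uniform distribution."""
    free = [i for i in range(n) if i not in restriction]

    # Enumerate the subfunction's truth table over the free variables.
    values = []
    for bits in product([0, 1], repeat=len(free)):
        x = dict(restriction)
        x.update(zip(free, bits))
        values.append(f(tuple(x[j] for j in range(n))))

    # Leaf case: majority label, once its error is at most eps.
    maj = 1 if sum(v == 1 for v in values) >= len(values) / 2 else -1
    err = sum(v != maj for v in values) / len(values)
    if err <= eps or not free:
        return maj

    # Influence of free variable x_i: Pr[f(x) != f(x with bit i flipped)].
    def influence(i):
        flips = 0
        for bits in product([0, 1], repeat=len(free)):
            x = dict(restriction)
            x.update(zip(free, bits))
            y = dict(x)
            y[i] = 1 - y[i]
            if f(tuple(x[j] for j in range(n))) != f(tuple(y[j] for j in range(n))):
                flips += 1
        return flips / 2 ** len(free)

    i = max(free, key=influence)
    # Internal node: (split variable, 0-subtree, 1-subtree).
    return (i,
            build_tree(f, n, {**restriction, i: 0}, eps),
            build_tree(f, n, {**restriction, i: 1}, eps))
```

For example, with f the majority of three bits (f(x) = 1 iff at least two bits are set, else -1) and eps = 0, the heuristic recovers an exact tree. The exhaustive enumeration costs 2^n per node, so the sketch is only feasible for toy n.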

## Subject Classification

##### ACM Subject Classification
• Theory of computation → Oracles and decision trees
• Theory of computation → Boolean function learning
##### Keywords
• Decision trees
• Influence of variables
• Analysis of boolean functions
• Learning theory
• Top-down decision tree heuristics

## References

1. Scott Aaronson and Andris Ambainis. The Need for Structure in Quantum Speedups. Theory of Computing, 10(6):133-166, 2014.
2. Misha Alekhnovich, Mark Braverman, Vitaly Feldman, Adam Klivans, and Toniann Pitassi. The complexity of properly learning simple concept classes. Journal of Computer and System Sciences, 74(1):16-34, 2009.
3. Pranjal Awasthi, Vitaly Feldman, and Varun Kanade. Learning Using Local Membership Queries. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), pages 398-431, 2013.
4. Paul Beame, Shayan Oveis Gharan, and Xin Yang. Time-Space Tradeoffs for Learning Finite Functions from Random Evaluations, with Applications to Polynomials. In Proceedings of the 31st Conference On Learning Theory (COLT), volume 75, pages 843-856, 2018.
5. Eric Blais, Clément Canonne, Igor Oliveira, Rocco Servedio, and Li-Yang Tan. Learning circuits with few negations. In Proceedings of the 18th International Workshop on Randomization and Computation (RANDOM), pages 512-527, 2015.
6. Eric Blais and Li-Yang Tan. Approximating Boolean functions with depth-2 circuits. SIAM J. Comput., 44(6):1583-1600, 2015.
7. Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (STOC), pages 253-262, 1994.
8. Avrim Blum. Rank-r decision trees are a subclass of r-decision lists. Inform. Process. Lett., 42(4):183-185, 1992. URL: https://doi.org/10.1016/0020-0190(92)90237-P.
9. Avrim Blum, Carl Burch, and John Langford. On Learning Monotone Boolean Functions. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS), pages 408-415, 1998.
10. Avrim Blum and Pat Langley. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 97(1-2):245-271, 1997.
11. Leo Breiman. Classification and regression trees. Routledge, 2017.
12. Alon Brutzkus, Amit Daniely, and Eran Malach. ID3 Learns Juntas for Smoothed Product Distributions. ArXiv, abs/1906.08654, 2019. URL: http://arxiv.org/abs/1906.08654.
13. Alon Brutzkus, Amit Daniely, and Eran Malach. On the Optimality of Trees Generated by ID3. ArXiv, abs/1907.05444, 2019. URL: http://arxiv.org/abs/1907.05444.
14. Nader Bshouty. Exact learning via the monotone theory. Information and Computation, 123(1):146-153, 1995.
15. Nader Bshouty and Christino Tamon. On the Fourier spectrum of monotone functions. Journal of the ACM, 43(4):747-770, 1996.
16. Nader H. Bshouty, Elchanan Mossel, Ryan O'Donnell, and Rocco A. Servedio. Learning DNF from random walks. J. Comput. System Sci., 71(3):250-265, 2005.
17. Sitan Chen and Ankur Moitra. Beyond the low-degree algorithm: mixtures of subcubes and their applications. In Proceedings of the 51st Annual ACM Symposium on Theory of Computing (STOC), pages 869-880, 2019.
18. Dana Dachman-Soled, Homin K. Lee, Tal Malkin, Rocco A. Servedio, Andrew Wan, and Hoeteck Wee. Optimal Cryptographic Hardness of Learning Monotone Functions. Theory of Computing, 5(13):257-282, 2009. URL: https://doi.org/10.4086/toc.2009.v005a013.
19. Ilias Diakonikolas, Prahladh Harsha, Adam Klivans, Raghu Meka, Prasad Raghavendra, Rocco Servedio, and Li-Yang Tan. Bounding the average sensitivity and noise sensitivity of polynomial threshold functions. In Proceedings of the 42nd Annual Symposium on Theory of Computing (STOC), pages 533-542, 2010.
20. Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the weak learning framework to understand and improve C4.5. In Proceedings of the 13th International Conference on Machine Learning (ICML), pages 96-104, 1996.
21. Andrzej Ehrenfeucht and David Haussler. Learning decision trees from random examples. Information and Computation, 82(3):231-246, 1989.
22. Vitaly Feldman. Hardness of Proper Learning. In Encyclopedia of Algorithms, 2016.
23. Amos Fiat and Dmitry Pechyony. Decision trees: More theoretical justification for practical algorithms. In Proceedings of the 15th International Conference on Algorithmic Learning Theory (ALT), pages 156-170, 2004.
24. Ehud Friedgut. Boolean functions with low average sensitivity depend on few coordinates. Combinatorica, 18(1):474-483, 1998.
25. Ehud Friedgut and Gil Kalai. Every Monotone Graph Property has a Sharp Threshold. Proceedings of the American Mathematical Society, 124:2993-3002, 1996.
26. Sumegha Garg, Ran Raz, and Avishay Tal. Extractor-based Time-space Lower Bounds for Learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 990-1002, 2018.
27. Sumegha Garg, Ran Raz, and Avishay Tal. Time-Space Lower Bounds for Two-Pass Learning. In Proceedings of the 34th Computational Complexity Conference (CCC), pages 22:1-22:39, 2019.
28. Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45:653-750, 1998.
29. Oded Goldreich and Leonid Levin. A hard-core predicate for all one-way functions. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing (STOC), pages 25-32, 1989.
30. Parikshit Gopalan, Adam Kalai, and Adam Klivans. Agnostically learning decision trees. In Proceedings of the 40th ACM Symposium on Theory of Computing (STOC), pages 527-536, 2008.
31. Parikshit Gopalan and Rocco Servedio. Learning and lower bounds for AC⁰ with threshold gates. In Proceedings of the 14th International Workshop on Randomization and Computation (RANDOM), pages 588-601, 2010.
32. Thomas Hancock and Yishay Mansour. Learning monotone k-μ DNF formulas on product distributions. In Proceedings of the 4th Annual Conference on Computational Learning Theory (COLT), pages 179-193, 1991.
33. Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter Optimization: A Spectral Approach. Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
34. Jeffrey Jackson, Homin Lee, Rocco Servedio, and Andrew Wan. Learning Random Monotone DNF. Discrete Applied Mathematics, 159(5):259-271, 2011.
35. Jeff Kahn, Gil Kalai, and Nathan Linial. The influence of variables on boolean functions. In Proceedings of the 29th Annual Symposium on Foundations of Computer Science (FOCS), pages 68-80, 1988.
36. Daniel Kane. The average sensitivity of an intersection of halfspaces. In Proceedings of the 46th ACM Symposium on Theory of Computing (STOC), pages 437-440, 2014.
37. Daniel Kane. The correct exponent for the Gotsman-Linial conjecture. Computational Complexity, 23(2):151-175, 2014.
38. Michael Kearns. Boosting theory towards practice: recent developments in decision tree induction and the weak learning framework (invited talk). In Proceedings of the 13th National Conference on Artificial intelligence (AAAI), pages 1337-1339, 1996.
39. Michael Kearns, Ming Li, and Leslie Valiant. Learning Boolean formulas. Journal of the ACM, 41(6):1298-1328, 1994.
40. Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and System Sciences, 58(1):109-128, 1999.
41. Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67-95, 1994.
42. Gillat Kol, Ran Raz, and Avishay Tal. Time-space Hardness of Learning Sparse Parities. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1067-1080, 2017.
43. Eyal Kushilevitz and Yishay Mansour. Learning Decision Trees Using the Fourier Spectrum. SIAM Journal on Computing, 22(6):1331-1348, December 1993.
44. Homin Lee. On the learnability of monotone functions. PhD thesis, Columbia University, 2009.
45. Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform and learnability. Journal of the ACM, 40(3):607-620, 1993.
46. Dinesh Mehta and Vijay Raghavan. Decision tree approximations of Boolean functions. Theoretical Computer Science, 270(1-2):609-623, 2002.
47. Dana Moshkovitz and Michal Moshkovitz. Mixing Implies Lower Bounds for Space Bounded Learning. In Proceedings of the 30th Conference on Learning Theory (COLT), pages 1516-1566, 2017.
48. Dana Moshkovitz and Michal Moshkovitz. Entropy Samplers and Strong Generic Lower Bounds For Space Bounded Learning. In Proceedings of the 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 28:1-28:20, 2018.
49. Elchanan Mossel, Ryan O'Donnell, and Rocco A. Servedio. Learning functions of k relevant variables. Journal of Computer and System Sciences, 69(3):421-434, 2004.
50. Elchanan Mossel, Ryan O'Donnell, and Krzysztof Oleszkiewicz. Noise stability of functions with low influences: Invariance and optimality. Annals of Mathematics, 171:295-341, 2010.
51. Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014. Available at http://analysisofbooleanfunctions.net/.
52. Ryan O'Donnell, Michael Saks, Oded Schramm, and Rocco Servedio. Every Decision Tree Has an Influential Variable. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 31-39, 2005.
53. Ryan O'Donnell and Rocco Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing, 37(3):827-844, 2007.
54. Ryan O'Donnell and Karl Wimmer. KKL, Kruskal-Katona, and Monotone Nets. SIAM Journal on Computing, 42(6):2375-2399, 2013.
55. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81-106, 1986.
56. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
57. Ran Raz. A Time-Space Lower Bound for a Large Class of Learning Problems. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 732-742, 2017.
58. Ran Raz. Fast Learning Requires Good Memory: A Time-Space Lower Bound for Parity Learning. Journal of the ACM, 66(1):3:1-3:18, December 2018.
59. Ronald Rivest. Learning decision lists. Machine learning, 2(3):229-246, 1987.
60. Yoshifumi Sakai and Akira Maruoka. Learning monotone log-term DNF formulas under the uniform distribution. Theory of Computing Systems, 33:17-33, 2000.
61. Dominik Scheder and Li-Yang Tan. On the Average Sensitivity and Density of k-CNF Formulas. In Proceedings of the 17th International Workshop on Randomization and Computation (RANDOM), pages 683-698, 2013.
62. Linda Sellie. Learning random monotone DNF under the uniform distribution. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 181-192, 2008.
63. Rocco Servedio. On learning monotone DNF under product distributions. Information and Computation, 193(1):57-74, 2004.
64. Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Proceedings of the 28th Conference on Neural Information Processing Systems, pages 163-171, 2014.
65. Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Proceedings of the 29th Annual Conference on Learning Theory (COLT), pages 1490-1516, 2016.
66. Karsten Verbeurgt. Learning sub-classes of monotone DNF on the uniform distribution. In Proceedings of the 9th Conference on Algorithmic Learning Theory (ALT), pages 385-399, 1998.
67. Ian Witten, Eibe Frank, Mark Hall, and Christopher Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.