Feature Cross Search via Submodular Optimization

Authors: Lin Chen, Hossein Esfandiari, Gang Fu, Vahab S. Mirrokni, Qian Yu


Author Details

Lin Chen
  • Simons Institute for the Theory of Computing, University of California, Berkeley, CA, USA
Hossein Esfandiari
  • Google Research, New York, NY, USA
Gang Fu
  • Google Research, New York, NY, USA
Vahab S. Mirrokni
  • Google Research, New York, NY, USA
Qian Yu
  • Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, USA

Cite As

Lin Chen, Hossein Esfandiari, Gang Fu, Vahab S. Mirrokni, and Qian Yu. Feature Cross Search via Submodular Optimization. In 29th Annual European Symposium on Algorithms (ESA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 204, pp. 31:1-31:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


In this paper, we study feature cross search as a fundamental primitive in feature engineering. The importance of feature crosses, especially for linear models, has long been known, with well-known textbook examples. In this problem, the goal is to select a small subset of features and combine them into a new feature (called the crossed feature) by taking their Cartesian product, so that a model trained on the crossed feature is accurate. In particular, we study the problem of maximizing a normalized Area Under the Curve (AUC) of the linear model trained on the crossed feature column. First, we show that it is not possible to provide an n^{1/log log n}-approximation algorithm for this problem unless the exponential time hypothesis fails. This result also rules out the possibility of solving this problem in polynomial time unless P = NP. On the positive side, under the naïve Bayes assumption, we show that there exists a simple greedy (1-1/e)-approximation algorithm for this problem. This result is established by relating the AUC to the total variation of the commutator of two probability measures and showing that the total variation of the commutator is monotone and submodular. To show this, we relate the submodularity of this function to the positive semi-definiteness of a corresponding kernel matrix. Then, we use Bochner's theorem to prove the positive semi-definiteness by showing that its inverse Fourier transform is non-negative everywhere. Our techniques and structural results might be of independent interest.
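As an illustration of the greedy selection mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes categorical features stored as tuples, uses the plain empirical training AUC as a stand-in for the paper's normalized AUC, and assumes the model trained on the crossed column ranks each crossed value by its empirical positive rate. The names `cross`, `auc_of_cross`, and `greedy_cross` are hypothetical. The (1-1/e) guarantee stated in the abstract relies on the naïve Bayes assumption, which this toy data is not claimed to satisfy.

```python
from collections import defaultdict
from itertools import product


def cross(rows, cols):
    """Crossed feature: for each row, the tuple of values in the selected columns."""
    return [tuple(row[c] for c in cols) for row in rows]


def auc_of_cross(rows, labels, cols):
    """Empirical AUC when each crossed value is scored by its positive rate.

    Assumption: this stands in for the paper's normalized AUC objective.
    """
    pos, neg = defaultdict(int), defaultdict(int)
    for key, y in zip(cross(rows, cols), labels):
        if y:
            pos[key] += 1
        else:
            neg[key] += 1
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    if n_pos == 0 or n_neg == 0:
        return 0.5  # AUC is undefined with a single class; use the chance level
    # Sort crossed values by empirical positive rate (lowest score first).
    keys = sorted(set(pos) | set(neg), key=lambda k: pos[k] / (pos[k] + neg[k]))
    wins, neg_below = 0.0, 0
    for k in keys:
        # Positives in bucket k beat all negatives scored lower, and tie
        # (counted as one half) with negatives in the same bucket.
        wins += pos[k] * (neg_below + 0.5 * neg[k])
        neg_below += neg[k]
    return wins / (n_pos * n_neg)


def greedy_cross(rows, labels, k):
    """Greedily add the feature whose inclusion gives the highest AUC."""
    n_features = len(rows[0])
    chosen = []
    for _ in range(min(k, n_features)):
        candidates = [c for c in range(n_features) if c not in chosen]
        best = max(candidates, key=lambda c: auc_of_cross(rows, labels, chosen + [c]))
        chosen.append(best)
    return chosen


# Toy data: three binary features; the label is the AND of features 0 and 1,
# and feature 2 is irrelevant.
rows = list(product([0, 1], repeat=3))
labels = [a & b for a, b, _ in rows]
chosen = greedy_cross(rows, labels, k=2)
print(chosen, auc_of_cross(rows, labels, chosen))
```

On this toy data the irrelevant third feature alone gives chance-level AUC, while crossing the first two features separates the classes completely.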

Subject Classification

ACM Subject Classification
  • Computing methodologies → Feature selection

Keywords
  • Feature engineering
  • feature cross
  • submodularity



