Low Rank Approximation of Binary Matrices: Column Subset Selection and Generalizations

Authors: Chen Dan, Kristoffer Arnsfelt Hansen, He Jiang, Liwei Wang, Yuchen Zhou

  • Filesize: 0.59 MB
  • 16 pages

Author Details

Chen Dan
  • Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Kristoffer Arnsfelt Hansen
  • Department of Computer Science, Aarhus University, Aarhus, Denmark
He Jiang
  • University of Southern California, Los Angeles, California, United States
Liwei Wang
  • Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, Beijing, China
  • Center for Data Science, Peking University, Beijing Institute of Big Data Research, Beijing, China
Yuchen Zhou
  • Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States

Cite As

Chen Dan, Kristoffer Arnsfelt Hansen, He Jiang, Liwei Wang, and Yuchen Zhou. Low Rank Approximation of Binary Matrices: Column Subset Selection and Generalizations. In 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 117, pp. 41:1-41:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Abstract

Low rank approximation of matrices is an important tool in machine learning. Given a data matrix, low rank approximation helps to find factors and patterns, and provides concise representations for the data. Research on low rank approximation usually focuses on real matrices. However, in many applications data are binary (categorical) rather than continuous. This leads to the problem of low rank approximation of binary matrices. Here we are given a d x n binary matrix A and a small integer k < d. The goal is to find two binary matrices U and V of sizes d x k and k x n respectively, so that the Frobenius norm of A - U V is minimized. There are two models of this problem, depending on the definition of the dot product of binary vectors: the GF(2) model and the Boolean semiring model. Unlike low rank approximation of a real matrix, which can be efficiently solved by Singular Value Decomposition, we show that approximation of a binary matrix is NP-hard, even for k=1. In this paper, our main concern is the problem of Column Subset Selection (CSS), in which the low rank matrix U must be formed by k columns of the data matrix, and we are interested in the approximation ratio achievable by CSS for binary matrices. For the GF(2) model, we show that CSS has approximation ratio bounded by k/2+1+k/(2(2^k-1)), and that this is asymptotically tight. For the Boolean model, it turns out that CSS is no longer sufficient to obtain a bound. We then develop a Generalized CSS (GCSS) procedure in which the columns of U are generated from Boolean formulas operating bitwise on selected columns of the data matrix. We show that the approximation ratio achieved by GCSS is bounded by 2^(k-1)+1, and argue that an exponential dependency on k seems inherent.
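For small instances, the CSS problem in the GF(2) model can be solved by brute force: try every set of k columns of A as U, and fit each column of A by the best XOR-combination of the chosen columns. The sketch below is not from the paper; `gf2_css` is a hypothetical helper, and the enumeration over all column subsets and all 2^k coefficient vectors makes it practical only for tiny n and k.

```python
import itertools
import numpy as np

def gf2_css(A, k):
    """Brute-force Column Subset Selection under the GF(2) model.

    Tries every k-subset of columns of A as the factor U, then fits each
    column of A by its nearest element of the GF(2) span of U (exhaustive
    over all 2^k coefficient vectors). Returns the minimum total Hamming
    error and the selected column indices. Illustrative sketch only.
    """
    d, n = A.shape
    # All 2^k binary coefficient vectors v, one per row.
    coeffs = np.array(list(itertools.product([0, 1], repeat=k)), dtype=np.uint8)
    best_err, best_cols = None, None
    for cols in itertools.combinations(range(n), k):
        U = A[:, cols]                   # d x k candidate factor
        span = (U @ coeffs.T) % 2        # d x 2^k: every element U v of the span
        # For each column a_j, Hamming distance to the nearest span element.
        err = sum(int(((A[:, [j]] ^ span).sum(axis=0)).min()) for j in range(n))
        if best_err is None or err < best_err:
            best_err, best_cols = err, cols
    return best_err, best_cols

# Example: for the 3x3 identity and k=1, any single column fits itself exactly,
# and each remaining column is best approximated by the zero vector (error 1).
err, cols = gf2_css(np.eye(3, dtype=np.uint8), 1)
```

In the Boolean semiring model the same skeleton applies, but the span is computed with OR instead of XOR, and (as the bound above indicates) selecting raw columns alone no longer suffices.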

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Unsupervised learning and clustering
  • Computing methodologies → Factorization methods
Keywords and phrases
  • Approximation Algorithms
  • Low Rank Approximation
  • Binary Matrices



