Low Rank Approximation of Binary Matrices: Column Subset Selection and Generalizations

Authors: Chen Dan, Kristoffer Arnsfelt Hansen, He Jiang, Liwei Wang, Yuchen Zhou

  • Filesize: 0.59 MB
  • 16 pages

Author Details

Chen Dan
  • Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Kristoffer Arnsfelt Hansen
  • Department of Computer Science, Aarhus University, Aarhus, Denmark
He Jiang
  • University of Southern California, Los Angeles, California, United States
Liwei Wang
  • Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, Beijing, China
  • Center for Data Science, Peking University, Beijing Institute of Big Data Research, Beijing, China
Yuchen Zhou
  • Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States

Cite As

Chen Dan, Kristoffer Arnsfelt Hansen, He Jiang, Liwei Wang, and Yuchen Zhou. Low Rank Approximation of Binary Matrices: Column Subset Selection and Generalizations. In 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 117, pp. 41:1-41:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Abstract

Low rank approximation of matrices is an important tool in machine learning. Given a data matrix, low rank approximation helps to find factors and patterns, and provides concise representations for the data. Research on low rank approximation usually focuses on real matrices. However, in many applications data are binary (categorical) rather than continuous. This leads to the problem of low rank approximation of binary matrices. Here we are given a d x n binary matrix A and a small integer k < d. The goal is to find two binary matrices U and V of sizes d x k and k x n respectively, so that the Frobenius norm of A - U V is minimized. There are two models of this problem, depending on the definition of the dot product of binary vectors: the GF(2) model and the Boolean semiring model. Unlike low rank approximation of a real matrix, which can be efficiently solved by Singular Value Decomposition, we show that approximation of a binary matrix is NP-hard, even for k=1. In this paper, our main concern is the problem of Column Subset Selection (CSS), in which the low rank matrix U must be formed by k columns of the data matrix, and we are interested in the approximation ratio achievable by CSS for binary matrices. For the GF(2) model, we show that CSS has approximation ratio bounded by k/2+1+k/(2(2^k-1)), and that this is asymptotically tight. For the Boolean model, it turns out that CSS is no longer sufficient to obtain a bound. We then develop a Generalized CSS (GCSS) procedure in which the columns of U are generated from Boolean formulas operating bitwise on selected columns of the data matrix. We show that the approximation ratio achieved by GCSS is bounded by 2^(k-1)+1, and argue that an exponential dependency on k seems inherent.
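For small instances, the CSS problem in the GF(2) model can be solved by brute force: try every set of k columns of A as U, and fit each column of A by the best XOR-combination of the chosen columns. The sketch below is not from the paper; `gf2_css` is a hypothetical helper, and the enumeration over all column subsets and all 2^k coefficient vectors makes it practical only for tiny n and k.

```python
import itertools
import numpy as np

def gf2_css(A, k):
    """Brute-force Column Subset Selection under the GF(2) model.

    Tries every k-subset of columns of A as the factor U, then fits each
    column of A by its nearest element of the GF(2) span of U (exhaustive
    over all 2^k coefficient vectors). Returns the minimum total Hamming
    error and the selected column indices. Illustrative sketch only.
    """
    d, n = A.shape
    # All 2^k binary coefficient vectors v, one per row.
    coeffs = np.array(list(itertools.product([0, 1], repeat=k)), dtype=np.uint8)
    best_err, best_cols = None, None
    for cols in itertools.combinations(range(n), k):
        U = A[:, cols]                   # d x k candidate factor
        span = (U @ coeffs.T) % 2        # d x 2^k: every element U v of the span
        # For each column a_j, Hamming distance to the nearest span element.
        err = sum(int(((A[:, [j]] ^ span).sum(axis=0)).min()) for j in range(n))
        if best_err is None or err < best_err:
            best_err, best_cols = err, cols
    return best_err, best_cols

# Example: for the 3x3 identity and k=1, any single column fits itself exactly,
# and each remaining column is best approximated by the zero vector (error 1).
err, cols = gf2_css(np.eye(3, dtype=np.uint8), 1)
```

In the Boolean semiring model the same skeleton applies, but the span is computed with OR instead of XOR, and (as the bound above indicates) selecting raw columns alone no longer suffices.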

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Unsupervised learning and clustering
  • Computing methodologies → Factorization methods
Keywords and phrases
  • Approximation Algorithms
  • Low Rank Approximation
  • Binary Matrices



