Parameterized Complexity of Feature Selection for Categorical Data Clustering

Bandyapadhyay, Sayan; Fomin, Fedor V.; Golovach, Petr A.; Simonov, Kirill

doi:10.4230/LIPIcs.MFCS.2021.14

Abstract

We develop new algorithmic methods with provable guarantees for feature selection for categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m−ℓ relevant features such that the cost of an optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ₀-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters.
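The cost model above can be illustrated with a minimal Python sketch. It assumes each data point is a string over Σ and that the center of each cluster is chosen optimally, which for Hamming distance means taking a most frequent value in every selected coordinate. The function names `cluster_cost` and `clustering_cost` are hypothetical and do not come from the paper.

```python
from collections import Counter


def cluster_cost(points, features):
    """Hamming cost of one non-empty cluster restricted to `features`,
    measured against a best possible center (illustrative helper)."""
    cost = 0
    for j in features:
        counts = Counter(p[j] for p in points)
        # A most frequent value is an optimal center entry in coordinate j;
        # every point that disagrees with it contributes 1 to the cost.
        cost += len(points) - max(counts.values())
    return cost


def clustering_cost(clusters, features):
    """Total cost of a clustering: the sum of the costs of its clusters."""
    return sum(cluster_cost(c, features) for c in clusters if c)


# Toy data: feature 2 is noise that raises the cost of the natural clustering.
points = ["aab", "aac", "bba", "bbc"]
clusters = [points[:2], points[2:]]
print(clustering_cost(clusters, [0, 1, 2]))  # all three features kept
print(clustering_cost(clusters, [0, 1]))     # feature 2 removed
```

On this toy input, dropping the noisy third feature lowers the cost of the same clustering from 2 to 0.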
We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k,B,|Σ|)⋅m^{g(k,|Σ|)}⋅n² for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that besides Feature Selection, it encompasses many other fundamental problems regarding categorical data such as Robust Clustering, Binary and Boolean Low-rank Matrix Approximation with Outliers, and Binary Robust Projective Clustering. Thus as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
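For contrast with the running time stated above, the following is a naive exhaustive baseline for the decision version of Feature Selection, not the algorithm of the paper. It tries every set of m−ℓ features and every assignment of the n points to k clusters, so it takes roughly C(m,ℓ)⋅kⁿ cost evaluations and is only usable on tiny instances. The names `cost` and `select_features` and the string encoding of points are assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations, product


def cost(points, labels, features, k):
    """Clustering cost on `features` for the assignment `labels`,
    using an optimal (most frequent value) center per cluster."""
    total = 0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        for j in features:
            counts = Counter(p[j] for p in members)
            total += len(members) - max(counts.values())
    return total


def select_features(points, k, ell, budget):
    """Return (features, labels) with cost at most `budget` after
    discarding `ell` features, or None if no such choice exists."""
    m = len(points[0])
    for features in combinations(range(m), m - ell):
        for labels in product(range(k), repeat=len(points)):
            if cost(points, labels, features, k) <= budget:
                return features, labels
    return None


points = ["aab", "aac", "bba", "bbc"]
print(select_features(points, k=2, ell=1, budget=0))
print(select_features(points, k=2, ell=0, budget=1))
```

With one feature removed and budget 0, the search finds a solution that keeps features 0 and 1; with no feature removed and budget 1, it reports that no clustering fits the budget.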

