Matrix Completion: Approximating the Minimum Diameter

In this paper, we focus on the matrix completion problem and aim to minimize the diameter over an arbitrary alphabet. Given a matrix M with missing entries, our objective is to complete the matrix by filling in the missing entries in a way that minimizes the maximum (Hamming) distance between any pair of rows in the completed matrix (also known as the diameter of the matrix). It is worth noting that this problem is already known to be NP-hard. Currently, the best-known upper bound is a 4-approximation algorithm derived by applying the triangle inequality together with a well-known 2-approximation algorithm for the radius minimization variant. In this work, we make the following contributions:

- We present a novel 3-approximation algorithm for the diameter minimization variant of the matrix completion problem. To the best of our knowledge, this is the first approximation result that breaks below the straightforward 4-factor bound.
- Furthermore, we establish that the diameter minimization variant of the matrix completion problem is (2-ε)-inapproximable, for any ε > 0, even when considering a binary alphabet, under the assumption that P ≠ NP. This is the first result that demonstrates a hardness of approximation for this problem.

Keywords: Incomplete Data, Matrix Completion, Hamming Distance, Diameter Minimization, Approximation Algorithms, Hardness of Approximation
Subject classification: Theory of computation → Approximation algorithms analysis
Pages: 17:1-17:19
Category: Regular Paper
Funding: This work was supported by an MoE AcRF Tier 2 grant (MOE-T2EP20221-0009).
Authors: Diptarka Chakraborty (National University of Singapore, Singapore), Sanjana Dey (National University of Singapore, Singapore)
DOI: 10.4230/LIPIcs.ISAAC.2023.17
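The diameter objective defined in the abstract can be made concrete with a small sketch. The code below is not the paper's 3-approximation algorithm; it is only an exhaustive search over all completions, feasible for tiny inputs, that shows what "minimizing the maximum Hamming distance between any pair of rows" means. The names `MISSING`, `hamming`, `diameter`, and `min_diameter_bruteforce`, and the use of `None` to mark a missing entry, are assumptions made for this illustration.

```python
from itertools import combinations, product

MISSING = None  # assumed marker for a missing entry (illustrative choice)


def hamming(u, v):
    """Number of positions where two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))


def diameter(rows):
    """Maximum pairwise Hamming distance among fully specified rows."""
    return max((hamming(u, v) for u, v in combinations(rows, 2)), default=0)


def min_diameter_bruteforce(matrix, alphabet):
    """Try every completion; exponential time, for tiny examples only."""
    holes = [(i, j) for i, row in enumerate(matrix)
             for j, x in enumerate(row) if x is MISSING]
    best_value, best_completion = None, None
    for fill in product(alphabet, repeat=len(holes)):
        completed = [list(row) for row in matrix]
        for (i, j), symbol in zip(holes, fill):
            completed[i][j] = symbol
        value = diameter(completed)
        if best_value is None or value < best_value:
            best_value, best_completion = value, completed
    return best_value, best_completion


# A 3x3 binary instance with two missing entries.
M = [
    [0, 0, MISSING],
    [0, MISSING, 1],
    [1, 1, 1],
]
best, completion = min_diameter_bruteforce(M, alphabet=(0, 1))
print(best)  # prints 2
```

In this instance the first and third rows already differ in two known positions, so no completion can achieve a diameter below 2, and filling the first row's missing entry with 1 attains it. The search enumerates |alphabet|^(number of missing entries) completions, which is consistent with the problem being NP-hard and motivates approximation algorithms for larger inputs.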
Diptarka Chakraborty and Sanjana Dey Creative Commons Attribution 4.0 International license https://creativecommons.org/licenses/by/4.0/legalcode