Clustering Permutations: New Techniques with Streaming Applications

Chakraborty, Diptarka; Das, Debarati; Krauthgamer, Robert

doi:10.4230/LIPIcs.ITCS.2023.31

Abstract

We study the classical metric k-median clustering problem over a set of input rankings (i.e., permutations), which has myriad applications, from social-choice theory to web search and databases. A folklore algorithm provides a 2-approximate solution in polynomial time for all k = O(1), and works irrespective of the underlying distance measure, as long as it is a metric; however, going below the factor of 2 is a notorious challenge. We consider the Ulam distance, a variant of the well-known edit-distance metric in which strings are restricted to be permutations. For this metric, Chakraborty, Das, and Krauthgamer [SODA, 2021] provided a (2-δ)-approximation algorithm for k = 1, where δ ≈ 2^{-40}. Our primary contribution is a new algorithmic framework for clustering a set of permutations. Our first result is a 1.999-approximation algorithm for the metric k-median problem under the Ulam metric that runs in time (k log (nd))^{O(k)} nd³ for an input consisting of n permutations over [d]. In fact, our framework is powerful enough to extend this result to the streaming model (where the n input permutations arrive one by one) using only polylogarithmic (in n) space. Additionally, we show that similar results can be obtained even in the presence of outliers, which is presumably a more difficult problem.
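
To make the objects in the abstract concrete, the following is a minimal Python sketch, not the algorithm of the paper. It assumes the move-based definition of the Ulam distance, under which the distance between two permutations of length d equals d minus the length of their longest common subsequence, and it illustrates the folklore 2-approximation by restricting the k centers to the input permutations and trying every choice. The function names `ulam_distance`, `kmedian_cost`, and `best_input_centers` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from itertools import combinations


def ulam_distance(x, y):
    """Ulam distance, assuming the definition d - LCS(x, y) for permutations."""
    if sorted(x) != sorted(y):
        raise ValueError("inputs must be permutations of the same symbols")
    # Relabel y by the positions of its symbols in x; a common subsequence
    # of x and y then corresponds to an increasing subsequence of `seq`.
    pos = {symbol: i for i, symbol in enumerate(x)}
    seq = [pos[symbol] for symbol in y]
    # Patience sorting: tails[i] is the smallest possible last value of an
    # increasing subsequence of length i + 1 seen so far.
    tails = []
    for value in seq:
        i = bisect_left(tails, value)
        if i == len(tails):
            tails.append(value)
        else:
            tails[i] = value
    return len(x) - len(tails)


def kmedian_cost(points, centers):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(ulam_distance(p, c) for c in centers) for p in points)


def best_input_centers(points, k):
    """Folklore baseline: the best k centers chosen among the inputs.

    By the triangle inequality this costs at most twice the optimum in any
    metric. The brute-force search tries all k-subsets of the n inputs, so
    it is only practical for small constant k.
    """
    return min(combinations(points, k), key=lambda cs: kmedian_cost(points, cs))


if __name__ == "__main__":
    rankings = [(1, 2, 3, 4), (2, 1, 3, 4), (1, 2, 4, 3), (4, 3, 2, 1)]
    centers = best_input_centers(rankings, 2)
    print(centers, kmedian_cost(rankings, centers))
```

The sketch only covers the baseline that the abstract sets out to beat; the 1.999-approximation and the streaming variant require the framework developed in the paper itself.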

References

J. Abreu and Juan Ramón Rico-Juan. A new iterative algorithm for computing a quality approximate median of strings based on edit operations. Pattern Recognition Letters, 36:74-80, 2014.
Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1-23:27, 2008. URL: https://doi.org/10.1145/1411509.1411513.
David Aldous and Persi Diaconis. Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem. Bulletin of the American Mathematical Society, 36(4):413-432, 1999. URL: https://doi.org/10.1090/S0273-0979-99-00796-X.
Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance. SIAM J. Comput., 39(6):2398-2429, 2010. URL: https://doi.org/10.1137/080716530.
Alexandr Andoni and Huy L. Nguyen. Near-optimal sublinear time algorithms for Ulam distance. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, pages 76-86, 2010. URL: https://doi.org/10.1137/1.9781611973075.8.
Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 21-29, 2001.
Olivier Bachem, Mario Lucic, and Silvio Lattanzi. One-shot coresets: The case of k-clustering. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pages 784-792. PMLR, 2018.
Mahdi Boroujeni and Saeed Seddighin. Improved MPC algorithms for edit distance and Ulam distance. In The 31st ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA 2019, pages 31-40, 2019.
Felix Brandt, Vincent Conitzer, Ulle Endriss, Jérôme Lang, and Ariel D. Procaccia. Handbook of computational social choice. Cambridge University Press, 2016.
Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus. Streaming coreset constructions for M-estimators. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In ACM-SIAM Symposium on Discrete Algorithms (SODA 2021), pages 2679-2696. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976465.159.
Vladimir Braverman, Harry Lang, Keith Levin, and Yevgeniy Rudoy. Metric k-median clustering in insertion-only streams. Discrete Applied Mathematics, 304:164-180, 2021.
Hervé Cardot, Peggy Cénac, Antoine Godichon-Baggioni, et al. Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. Annals of Statistics, 45(2):591-614, 2017. URL: https://doi.org/10.1214/16-AOS1460.
Francisco Casacuberta and M.D. Antonio. A greedy algorithm for computing approximate median strings. In Proc. of National Symposium on Pattern Recognition and Image Analysis, pages 193-198, 1997.
Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximating the median under the Ulam metric. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 761-775. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976465.48.
Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 1-10, 1999.
Moses Charikar and Robert Krauthgamer. Embedding the Ulam metric into l₁. Theory of Computing, 2(11):207-224, 2006. URL: https://doi.org/10.4086/toc.2006.v002a011.
Ke Chen. On coresets for k-Median and k-Means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923-947, 2009. URL: https://doi.org/10.1137/070699007.
Flavio Chierichetti, Ravi Kumar, Sandeep Pandey, and Sergei Vassilvitskii. Finding the Jaccard median. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 293-311. SIAM, 2010. URL: https://doi.org/10.1137/1.9781611973075.25.
Michael B. Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM Symposium on Theory of Computing, pages 9-21, 2016.
Graham Cormode, Shan Muthukrishnan, and Süleyman Cenk Sahinalp. Permutation editing and matching via embeddings. In International Colloquium on Automata, Languages, and Programming, pages 481-492. Springer, 2001. URL: https://doi.org/10.1007/3-540-48224-5_40.
Matan Danos. Coresets for clustering by uniform sampling and generalized rank aggregation. Master’s thesis, Weizmann Institute of Science, 2021. URL: https://www.wisdom.weizmann.ac.il/~robi/files/MatanDanos-MScThesis-2021_11.pdf.
Colin de la Higuera and Francisco Casacuberta. Topology of strings: Median string is NP-complete. Theor. Comput. Sci., 230(1-2):39-48, 2000. URL: https://doi.org/10.1016/S0304-3975(97)00240-5.
Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, pages 613-622, 2001. URL: https://doi.org/10.1145/371920.372165.
Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569-578, 2011.
P. Thomas Fletcher, Suresh Venkatasubramanian, and Sarang Joshi. Robust statistics on Riemannian manifolds via the geometric median. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. IEEE, 2008.
Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77-80, 2013.
Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
Donna Harman. Ranking algorithms. In William B. Frakes and Ricardo A. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 363-392. Prentice-Hall, 1992.
Morihiro Hayashida and Hitoshi Koyano. Integer linear programming approach to median and center strings for a probability distribution on a set of strings. In 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016), pages 35-41. SciTePress, 2016. URL: https://doi.org/10.5220/0005666400350041.
Piotr Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 428-434, 1999.
John G. Kemeny. Mathematics without numbers. Daedalus, 88(4):577-591, 1959.
Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 95-103, 2007.
Teuvo Kohonen. Median strings. Pattern Recognition Letters, 3(5):309-313, 1985. URL: https://doi.org/10.1016/0167-8655(85)90061-3.
Joseph B. Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM Review, 25(2):201-237, 1983. URL: https://doi.org/10.1137/1025045.
Ferenc Kruzslicz. Improved greedy algorithm for computing approximate median strings. Acta Cybernetica, 14(2):331-339, 1999.
Carlos D. Martínez-Hinarejos, Alfons Juan, and Francisco Casacuberta. Use of median string for classification. In Proceedings 15th International Conference on Pattern Recognition, ICPR 2000, volume 2, pages 903-906. IEEE, 2000. URL: https://doi.org/10.1109/ICPR.2000.906220.
Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1):35-60, 2004.
Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.
P. Mirabal, J. Abreu, and D. Seco. Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string. Pattern Recognition Letters, 120:104-111, April 2019.
Timothy Naumovitz, Michael E. Saks, and C. Seshadhri. Accurate and nearly optimal sublinear approximations to Ulam distance. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2012-2031, 2017. URL: https://doi.org/10.1137/1.9781611974782.131.
François Nicolas and Eric Rivals. Complexities of the centre and median string problems. In 14th Annual Symposium on Combinatorial Pattern Matching, CPM 2003, pages 315-327, 2003.
Rafail Ostrovsky and Yuval Rabani. Polynomial time approximation schemes for geometric k-clustering. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, pages 349-358, 2000.
Oscar Pedreira and Nieves R. Brisaboa. Spatial selection of sparse pivots for similarity search in metric spaces. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 434-445. Springer, 2007.
Pavel Pevzner. Computational molecular biology: an algorithmic approach. MIT Press, 2000.
Cyrus Rashtchian, Konstantin Makarychev, Miklós Z. Rácz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, and Karin Strauss. Clustering billions of reads for DNA data storage. In Advances in Neural Information Processing Systems 30, pages 3360-3371. Curran Associates, Inc., 2017.
David Sankoff. Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics, 28(1):35-42, 1975. URL: https://doi.org/10.1137/0128004.
Warren Schudy. Approximation Schemes for Inferring Rankings and Clusterings from Pairwise Data. PhD thesis, Brown University, 2012. URL: https://cs.brown.edu/research/pubs/theses/phd/2012/schudy.pdf.
Mikkel Thorup. Quick k-median, k-center, and facility location for sparse graphs. SIAM Journal on Computing, 34(2):405-432, 2005.
H. Peyton Young. Condorcet’s theory of voting. American Political Science Review, 82(4):1231-1244, 1988.
H. Peyton Young and Arthur Levenglick. A consistent extension of Condorcet’s election principle. SIAM Journal on Applied Mathematics, 35(2):285-300, 1978.
