Clustering Permutations: New Techniques with Streaming Applications

Authors Diptarka Chakraborty, Debarati Das, Robert Krauthgamer





Author Details

Diptarka Chakraborty
  • National University of Singapore, Singapore
Debarati Das
  • Pennsylvania State University, University Park, PA, USA
Robert Krauthgamer
  • Weizmann Institute of Science, Rehovot, Israel

Cite As

Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Clustering Permutations: New Techniques with Streaming Applications. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 31:1-31:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ITCS.2023.31

Abstract

We study the classical metric k-median clustering problem over a set of input rankings (i.e., permutations), which has myriad applications, from social-choice theory to web search and databases. A folklore algorithm provides a 2-approximate solution in polynomial time for all k = O(1), and works irrespective of the underlying distance measure, so long as it is a metric; however, going below the 2-factor is a notorious challenge. We consider the Ulam distance, a variant of the well-known edit-distance metric, where strings are restricted to be permutations. For this metric, Chakraborty, Das, and Krauthgamer [SODA, 2021] provided a (2-δ)-approximation algorithm for k = 1, where δ ≈ 2^{-40}. Our primary contribution is a new algorithmic framework for clustering a set of permutations. Our first result is a 1.999-approximation algorithm for the metric k-median problem under the Ulam metric that runs in time (k log (nd))^{O(k)} nd³ for an input consisting of n permutations over [d]. In fact, our framework is powerful enough to extend this result to the streaming model (where the n input permutations arrive one by one) using only polylogarithmic (in n) space. Additionally, we show that similar results can be obtained even in the presence of outliers, which is presumably a more difficult problem.
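To make the baseline concrete, the following is a minimal sketch of the folklore 2-approximation mentioned above, not the paper's 1.999-approximation framework. It assumes the Ulam distance between two permutations of length d is d minus the length of their longest common subsequence (the number of symbol moves needed), and it assumes the folklore algorithm is realized by restricting the k centers to input permutations and trying every k-subset. The function names `ulam_distance` and `folklore_k_median` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from itertools import combinations


def ulam_distance(x, y):
    """Ulam distance, assumed here to be d - LCS(x, y) for two
    permutations x and y over the same d symbols."""
    # Relabel x by each symbol's position in y; the LCS of x and y then
    # equals the longest increasing subsequence (LIS) of the relabeled x.
    pos = {symbol: i for i, symbol in enumerate(y)}
    seq = [pos[symbol] for symbol in x]
    # Patience-sorting LIS: tails[j] is the smallest possible last value
    # of an increasing subsequence of length j + 1 seen so far.
    tails = []
    for v in seq:
        j = bisect_left(tails, v)
        if j == len(tails):
            tails.append(v)
        else:
            tails[j] = v
    return len(x) - len(tails)


def folklore_k_median(perms, k):
    """Try every k-subset of the inputs as centers and keep the cheapest.
    Restricting centers to input points loses at most a factor 2 in any
    metric; the enumeration takes n^k subsets, so this is only for small k."""
    best_centers, best_cost = None, None
    for centers in combinations(perms, k):
        cost = sum(min(ulam_distance(p, c) for c in centers) for p in perms)
        if best_cost is None or cost < best_cost:
            best_centers, best_cost = centers, cost
    return best_centers, best_cost


perms = [(1, 2, 3, 4), (2, 1, 3, 4), (1, 2, 4, 3), (4, 3, 2, 1), (3, 4, 2, 1)]
centers, cost = folklore_k_median(perms, k=2)
```

The sketch works for any metric if `ulam_distance` is swapped for another distance function, which mirrors the abstract's remark that the folklore bound does not depend on the underlying metric.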

Subject Classification

ACM Subject Classification
  • Theory of computation → Facility location and clustering
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • Clustering
  • Approximation Algorithms
  • Ulam Distance
  • Rank Aggregation
  • Streaming


References

  1. J. Abreu and Juan Ramón Rico-Juan. A new iterative algorithm for computing a quality approximate median of strings based on edit operations. Pattern Recognition Letters, 36:74-80, 2014.
  2. Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1-23:27, 2008. URL: https://doi.org/10.1145/1411509.1411513.
  3. David Aldous and Persi Diaconis. Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem. Bulletin of the American Mathematical Society, 36(4):413-432, 1999. URL: https://doi.org/10.1090/S0273-0979-99-00796-X.
  4. Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance. SIAM J. Comput., 39(6):2398-2429, 2010. URL: https://doi.org/10.1137/080716530.
  5. Alexandr Andoni and Huy L. Nguyen. Near-optimal sublinear time algorithms for Ulam distance. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, pages 76-86, 2010. URL: https://doi.org/10.1137/1.9781611973075.8.
  6. Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 21-29, 2001.
  7. Olivier Bachem, Mario Lucic, and Silvio Lattanzi. One-shot coresets: The case of k-clustering. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pages 784-792. PMLR, 2018.
  8. Mahdi Boroujeni and Saeed Seddighin. Improved MPC algorithms for edit distance and Ulam distance. In The 31st ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA 2019, pages 31-40, 2019.
  9. Felix Brandt, Vincent Conitzer, Ulle Endriss, Jérôme Lang, and Ariel D Procaccia. Handbook of computational social choice. Cambridge University Press, 2016.
  10. Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus. Streaming coreset constructions for M-estimators. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
  11. Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In ACM-SIAM Symposium on Discrete Algorithms (SODA 2021), pages 2679-2696. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976465.159.
  12. Vladimir Braverman, Harry Lang, Keith Levin, and Yevgeniy Rudoy. Metric k-median clustering in insertion-only streams. Discrete Applied Mathematics, 304:164-180, 2021.
  13. Hervé Cardot, Peggy Cénac, Antoine Godichon-Baggioni, et al. Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. Annals of Statistics, 45(2):591-614, 2017. URL: https://doi.org/10.1214/16-AOS1460.
  14. Francisco Casacuberta and M.D. Antonio. A greedy algorithm for computing approximate median strings. In Proc. of National Symposium on Pattern Recognition and Image Analysis, pages 193-198, 1997.
  15. Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximating the median under the Ulam metric. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 761-775. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976465.48.
  16. Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 1-10, 1999.
  17. Moses Charikar and Robert Krauthgamer. Embedding the Ulam metric into l₁. Theory of Computing, 2(11):207-224, 2006. URL: https://doi.org/10.4086/toc.2006.v002a011.
  18. Ke Chen. On coresets for k-Median and k-Means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923-947, 2009. URL: https://doi.org/10.1137/070699007.
  19. Flavio Chierichetti, Ravi Kumar, Sandeep Pandey, and Sergei Vassilvitskii. Finding the Jaccard median. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 293-311. SIAM, 2010. URL: https://doi.org/10.1137/1.9781611973075.25.
  20. Michael B. Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM Symposium on Theory of Computing, pages 9-21, 2016.
  21. Graham Cormode, Shan Muthukrishnan, and Süleyman Cenk Sahinalp. Permutation editing and matching via embeddings. In International Colloquium on Automata, Languages, and Programming, pages 481-492. Springer, 2001. URL: https://doi.org/10.1007/3-540-48224-5_40.
  22. Matan Danos. Coresets for clustering by uniform sampling and generalized rank aggregation. Master’s thesis, Weizmann Institute of Science, 2021. URL: https://www.wisdom.weizmann.ac.il/~robi/files/MatanDanos-MScThesis-2021_11.pdf.
  23. Colin de la Higuera and Francisco Casacuberta. Topology of strings: Median string is NP-complete. Theor. Comput. Sci., 230(1-2):39-48, 2000. URL: https://doi.org/10.1016/S0304-3975(97)00240-5.
  24. Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, pages 613-622, 2001. URL: https://doi.org/10.1145/371920.372165.
  25. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569-578, 2011.
  26. P. Thomas Fletcher, Suresh Venkatasubramanian, and Sarang Joshi. Robust statistics on Riemannian manifolds via the geometric median. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. IEEE, 2008.
  27. Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77-80, 2013.
  28. Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE transactions on knowledge and data engineering, 15(3):515-528, 2003.
  29. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
  30. Donna Harman. Ranking algorithms. In William B. Frakes and Ricardo A. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 363-392. Prentice-Hall, 1992.
  31. Morihiro Hayashida and Hitoshi Koyano. Integer linear programming approach to median and center strings for a probability distribution on a set of strings. In 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016), pages 35-41. SciTePress, 2016. URL: https://doi.org/10.5220/0005666400350041.
  32. Piotr Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 428-434, 1999.
  33. John G. Kemeny. Mathematics without numbers. Daedalus, 88(4):577-591, 1959.
  34. Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 95-103, 2007.
  35. Teuvo Kohonen. Median strings. Pattern Recognition Letters, 3(5):309-313, 1985. URL: https://doi.org/10.1016/0167-8655(85)90061-3.
  36. Joseph B Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM review, 25(2):201-237, 1983. URL: https://doi.org/10.1137/1025045.
  37. Ferenc Kruzslicz. Improved greedy algorithm for computing approximate median strings. Acta Cybernetica, 14(2):331-339, 1999.
  38. Carlos D. Martínez-Hinarejos, Alfons Juan, and Francisco Casacuberta. Use of median string for classification. In Proceedings 15th International Conference on Pattern Recognition, ICPR 2000, volume 2, pages 903-906. IEEE, 2000. URL: https://doi.org/10.1109/ICPR.2000.906220.
  39. Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1):35-60, 2004.
  40. Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.
  41. P. Mirabal, J. Abreu, and D. Seco. Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string. Pattern Recognition Letters, 120:104-111, April 2019.
  42. Timothy Naumovitz, Michael E. Saks, and C. Seshadhri. Accurate and nearly optimal sublinear approximations to Ulam distance. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2012-2031, 2017. URL: https://doi.org/10.1137/1.9781611974782.131.
  43. François Nicolas and Eric Rivals. Complexities of the centre and median string problems. In 14th Annual Symposium on Combinatorial Pattern Matching, CPM 2003, pages 315-327, 2003.
  44. Rafail Ostrovsky and Yuval Rabani. Polynomial time approximation schemes for geometric k-clustering. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, pages 349-358, 2000.
  45. Oscar Pedreira and Nieves R. Brisaboa. Spatial selection of sparse pivots for similarity search in metric spaces. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 434-445. Springer, 2007.
  46. Pavel Pevzner. Computational molecular biology: an algorithmic approach. MIT press, 2000.
  47. Cyrus Rashtchian, Konstantin Makarychev, Miklós Z. Rácz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, and Karin Strauss. Clustering billions of reads for DNA data storage. In Advances in Neural Information Processing Systems 30, pages 3360-3371. Curran Associates, Inc., 2017.
  48. David Sankoff. Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics, 28(1):35-42, 1975. URL: https://doi.org/10.1137/0128004.
  49. Warren Schudy. Approximation Schemes for Inferring Rankings and Clusterings from Pairwise Data. PhD thesis, Brown University, 2012. URL: https://cs.brown.edu/research/pubs/theses/phd/2012/schudy.pdf.
  50. Mikkel Thorup. Quick k-median, k-center, and facility location for sparse graphs. SIAM Journal on Computing, 34(2):405-432, 2005.
  51. H. Peyton Young. Condorcet’s theory of voting. American Political science review, 82(4):1231-1244, 1988.
  52. H. Peyton Young and Arthur Levenglick. A consistent extension of Condorcet’s election principle. SIAM Journal on applied Mathematics, 35(2):285-300, 1978.