Document Open Access Logo

On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications

Authors Sayan Bandyapadhyay , Fedor V. Fomin , Kirill Simonov



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2021.23.pdf
  • Filesize: 0.78 MB
  • 15 pages

Document Identifiers

Author Details

Sayan Bandyapadhyay
  • Department of Informatics, University of Bergen, Norway
Fedor V. Fomin
  • Department of Informatics, University of Bergen, Norway
Kirill Simonov
  • Department of Informatics, University of Bergen, Norway

Acknowledgements

The authors are thankful to Vincent Cohen-Addad for sharing the full version of [Vincent Cohen{-}Addad and Jason Li, 2019].

Cite AsGet BibTex

Sayan Bandyapadhyay, Fedor V. Fomin, and Kirill Simonov. On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 23:1-23:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICALP.2021.23

Abstract

Fair clustering is a variant of constrained clustering where the goal is to partition a set of colored points. The fraction of points of each color in every cluster should be more or less equal to the fraction of points of this color in the dataset. This variant was recently introduced by Chierichetti et al. [NeurIPS 2017] and became widely popular. This paper proposes a new construction of coresets for fair k-means and k-median clustering for Euclidean and general metrics based on random sampling. For the Euclidean space ℝ^d, we provide the first coresets whose size does not depend exponentially on the dimension d. The question of whether such constructions exist was asked by Schmidt, Schwiegelshohn, and Sohler [WAOA 2019] and Huang, Jiang, and Vishnoi [NeurIPS 2019]. For general metric, our construction provides the first coreset for fair k-means and k-median. New coresets appear to be a handy tool for designing better approximation and streaming algorithms for fair and other constrained clustering variants. In particular, we obtain - the first fixed-parameter tractable (FPT) PTAS for fair k-means and k-median clustering in ℝ^d. The near-linear time of our PTAS improves over the previous scheme of Böhm, Fazzone, Leonardi, and Schwiegelshohn [ArXiv 2020] with running time n^{poly(k/ε)}; - FPT "true" constant-approximation for metric fair clustering. All previous algorithms for fair k-means and k-median in general metric are bicriteria and violate the fairness constraints; - FPT 3-approximation for lower-bounded k-median improving the best-known 3.736 factor of Bera, Chakrabarty, and Negahbani [ArXiv 2019]; - the first FPT constant-approximations for metric chromatic clustering and 𝓁-Diversity clustering; - near linear-time (in n) PTAS for capacitated and lower-bounded clustering improving over PTAS of Bhattacharya, Jaiswal, and Kumar [TOCS 2018] with super-quadratic running time; - a streaming (1+ε)-approximation for fair k-means and k-median of space complexity polynomial in k, d, ε and log{n} (the previous algorithms have exponential space complexity on either d or k).

Subject Classification

ACM Subject Classification
  • Theory of computation → Facility location and clustering
Keywords
  • fair clustering
  • coresets
  • approximation algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Clustering without over-representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 267-275, 2019. Google Scholar
  2. Sara Ahmadian and Chaitanya Swamy. Improved approximation guarantees for lower-bounded facility location. In International Workshop on Approximation and Online Algorithms, pages 257-271. Springer, 2012. Google Scholar
  3. Ravindra K. Ahuja, James B. Orlin, Clifford Stein, and Robert E. Tarjan. Improved algorithms for bipartite network flow, 1994. Google Scholar
  4. Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In International Conference on Machine Learning, pages 405-413, 2019. Google Scholar
  5. Daniel Baker, Vladimir Braverman, Lingxiao Huang, Shaofeng H-C Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in graphs of bounded treewidth. arXiv preprint, 2019. URL: http://arxiv.org/abs/1907.04733.
  6. Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the johnson-lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1039-1050, 2019. Google Scholar
  7. Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, pages 4954-4965, 2019. Google Scholar
  8. Suman K. Bera, Deeparnab Chakrabarty, and Maryam Negahbani. Fair algorithms for clustering. CoRR, abs/1901.02393, 2019. URL: http://arxiv.org/abs/1901.02393.
  9. Ioana O Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R Schmidt, and Melanie Schmidt. On the cost of essentially fair clusterings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019. Google Scholar
  10. Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Faster Algorithms for the Constrained k-means Problem. Theory of Computing Systems, 62(1):93-115, 2018. URL: https://doi.org/10.1007/s00224-017-9820-7.
  11. Matteo Böhm, Adriano Fazzone, Stefano Leonardi, and Chris Schwiegelshohn. Fair clustering with multiple colors. arXiv preprint, 2020. URL: http://arxiv.org/abs/2002.07892.
  12. Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint, 2016. URL: http://arxiv.org/abs/1612.00889.
  13. Vladimir Braverman, Shaofeng H-C Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2679-2696. SIAM, 2021. Google Scholar
  14. Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923-947, 2009. Google Scholar
  15. Xingyu Chen, Brandon Fain, Liang Lyu, and Kamesh Munagala. Proportionally fair clustering. In International Conference on Machine Learning, pages 1032-1041, 2019. Google Scholar
  16. Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, pages 5029-5037, 2017. Google Scholar
  17. Vincent Cohen-Addad. Approximation schemes for capacitated clustering in doubling metrics. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 2241-2259. SIAM, 2020. Google Scholar
  18. Vincent Cohen-Addad, Anupam Gupta, Amit Kumar, Euiwoong Lee, and Jason Li. Tight FPT approximations for k-median and k-means. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 42:1-42:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. Google Scholar
  19. Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 41:1-41:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. Google Scholar
  20. Hu Ding and Jinhui Xu. A unified framework for clustering constrained data without locality property. Algorithmica, 82(4):808-852, 2020. Google Scholar
  21. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569-578, 2011. Google Scholar
  22. Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM Journal on Computing, 49(3):601-657, 2020. Google Scholar
  23. Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 209-217, 2005. Google Scholar
  24. András Frank and Éva Tardos. An application of simultaneous diophantine approximation in combinatorial optimization. Combinatorica, 7(1):49-65, 1987. URL: https://doi.org/10.1007/bf02579200.
  25. Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3-19, 2007. Google Scholar
  26. Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discret. Comput. Geom., 37(1):3-19, 2007. Google Scholar
  27. Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 291-300. ACM, 2004. Google Scholar
  28. Lingxiao Huang, Shaofeng Jiang, and Nisheeth Vishnoi. Coresets for clustering with fairness constraints. In Advances in Neural Information Processing Systems, pages 7589-7600, 2019. Google Scholar
  29. Lingxiao Huang and Nisheeth K Vishnoi. Coresets for clustering in euclidean spaces: Importance sampling is nearly optimal. arXiv preprint, 2020. URL: http://arxiv.org/abs/2004.06263.
  30. Piotr Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 428-434, 1999. Google Scholar
  31. Ravi Kannan. Minkowski’s convex body theorem and integer programming. Mathematics of Operations Research, 12(3):415-440, 1987. URL: https://doi.org/10.1287/moor.12.3.415.
  32. Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. Fair k-center clustering for data summarization. In 36th International Conference on Machine Learning, ICML 2019, pages 5984-6003. International Machine Learning Society (IMLS), 2019. Google Scholar
  33. Matthäus Kleindessner, Samira Samadi, Pranjal Awasthi, and Jamie Morgenstern. Guarantees for spectral clustering with fairness constraints. arXiv preprint, 2019. URL: http://arxiv.org/abs/1901.08668.
  34. Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1-5:32, 2010. Google Scholar
  35. Michael Langberg and Leonard J Schulman. Universal ε-approximators for integrals. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 598-607. SIAM, 2010. Google Scholar
  36. H. W. Lenstra. Integer programming with a fixed number of variables. Mathematics of Operations Research, 8(4):538-548, 1983. URL: https://doi.org/10.1287/moor.8.4.538.
  37. Clemens Rösner and Melanie Schmidt. Privacy preserving clustering with constraints. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018. Google Scholar
  38. Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In International Workshop on Approximation and Online Algorithms, pages 232-251. Springer, 2019. Google Scholar
  39. Christian Sohler and David P Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 802-813. IEEE, 2018. Google Scholar
  40. Zoya Svitkina. Lower-bounded facility location. ACM Transactions on Algorithms (TALG), 6(4):1-16, 2010. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail