License: Creative Commons Attribution 3.0 Unported license (CC-BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ESA.2020.18
URN: urn:nbn:de:0030-drops-128848
URL: https://drops.dagstuhl.de/opus/volltexte/2020/12884/
Go to the corresponding LIPIcs Volume Portal


Bhattacharya, Anup ; Eube, Jan ; Röglin, Heiko ; Schmidt, Melanie

Noisy, Greedy and Not so Greedy k-Means++

pdf-format:
LIPIcs-ESA-2020-18.pdf (0.5 MB)


Abstract

The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of 𝒪(log k) in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called greedy k-means++ algorithm in which in each iteration multiple possible centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an 𝒪(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(𝓁⋅log k)-approximation in expectation where 𝓁 is the number of possible centers that are sampled in each iteration. Inspired by the negative results, we study a variation of greedy k-means++ which we call noisy k-means++ algorithm. In this variation only one center is sampled in every iteration but not exactly by D²-sampling. Instead in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂ for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an 𝒪(log² k)-approximation in expectation. We use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.

BibTeX - Entry

@InProceedings{bhattacharya_et_al:LIPIcs:2020:12884,
  author =	{Anup Bhattacharya and Jan Eube and Heiko R{\"o}glin and Melanie Schmidt},
  title =	{{Noisy, Greedy and Not so Greedy k-Means++}},
  booktitle =	{28th Annual European Symposium on Algorithms (ESA 2020)},
  pages =	{18:1--18:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-162-7},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{173},
  editor =	{Fabrizio Grandoni and Grzegorz Herman and Peter Sanders},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2020/12884},
  URN =		{urn:nbn:de:0030-drops-128848},
  doi =		{10.4230/LIPIcs.ESA.2020.18},
  annote =	{Keywords: k-means++, greedy, adaptive sampling}
}

Keywords: k-means++, greedy, adaptive sampling
Collection: 28th Annual European Symposium on Algorithms (ESA 2020)
Issue Date: 2020
Date of publication: 26.08.2020


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI