Abstract
The kmeans++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the other k1 centers iteratively according to D²sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. kmeans++ is known to achieve an approximation factor of 𝒪(log k) in expectation.
Already in the original paper on kmeans++, Arthur and Vassilvitskii suggested a variation called greedy kmeans++ algorithm in which in each iteration multiple possible centers are sampled according to D²sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an 𝒪(log k)approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy kmeans++ yields only an Ω(𝓁⋅log k)approximation in expectation where 𝓁 is the number of possible centers that are sampled in each iteration.
Inspired by the negative results, we study a variation of greedy kmeans++ which we call noisy kmeans++ algorithm. In this variation only one center is sampled in every iteration but not exactly by D²sampling. Instead in each iteration an adversary is allowed to change the probabilities arising from D²sampling individually for each point by a factor between 1ε₁ and 1+ε₂ for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy kmeans++ computes an 𝒪(log² k)approximation in expectation. We use the analysis of noisy kmeans++ to design a moderately greedy kmeans++ algorithm.
BibTeX  Entry
@InProceedings{bhattacharya_et_al:LIPIcs:2020:12884,
author = {Anup Bhattacharya and Jan Eube and Heiko R{\"o}glin and Melanie Schmidt},
title = {{Noisy, Greedy and Not so Greedy kMeans++}},
booktitle = {28th Annual European Symposium on Algorithms (ESA 2020)},
pages = {18:118:21},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771627},
ISSN = {18688969},
year = {2020},
volume = {173},
editor = {Fabrizio Grandoni and Grzegorz Herman and Peter Sanders},
publisher = {Schloss DagstuhlLeibnizZentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2020/12884},
URN = {urn:nbn:de:0030drops128848},
doi = {10.4230/LIPIcs.ESA.2020.18},
annote = {Keywords: kmeans++, greedy, adaptive sampling}
}
Keywords: 

kmeans++, greedy, adaptive sampling 
Collection: 

28th Annual European Symposium on Algorithms (ESA 2020) 
Issue Date: 

2020 
Date of publication: 

26.08.2020 