Abstract
In this work, we study the kmedian and kmeans clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worstcase instances, we focus on gaining a finer understanding through the lens of beyond worstcase analysis. Our main motivation is the following: for many applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a kmedian or kmeans algorithm will produce clusterings which are close to matching the target clustering. Worstcase results can guarantee constant factor approximations to the optimal kmedian or kmeans objective value, but not closeness to the target clustering.
Our first result is a distributed algorithm which returns a nearoptimal clustering assuming a natural notion of stability, namely, approximation stability [Awasthi and Balcan, 2014], even when a constant fraction of the data are outliers. The communication complexity is O~(sk+z) where s is the number of machines, k is the number of clusters, and z is the number of outliers. Next, we show this amount of communication cannot be improved even in the setting when the input satisfies various nonworstcase assumptions. We give a matching Omega(sk+z) lower bound on the communication required both for approximating the optimal kmeans or kmedian cost up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, Omega(sk+z) is a communication bottleneck, even for realworld instances.
BibTeX  Entry
@InProceedings{awasthi_et_al:LIPIcs:2019:10594,
author = {Pranjal Awasthi and Ainesh Bakshi and MariaFlorina Balcan and Colin White and David P. Woodruff},
title = {{Robust CommunicationOptimal Distributed Clustering Algorithms}},
booktitle = {46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)},
pages = {18:118:16},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771092},
ISSN = {18688969},
year = {2019},
volume = {132},
editor = {Christel Baier and Ioannis Chatzigiannakis and Paola Flocchini and Stefano Leonardi},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2019/10594},
URN = {urn:nbn:de:0030drops105942},
doi = {10.4230/LIPIcs.ICALP.2019.18},
annote = {Keywords: robust distributed clustering, communication complexity}
}
Keywords: 

robust distributed clustering, communication complexity 
Seminar: 

46th International Colloquium on Automata, Languages, and Programming (ICALP 2019) 
Issue Date: 

2019 
Date of publication: 

08.07.2019 