Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering

ISSN 1868-8969. Published 2018-07-04. Pages 57:1-57:14. Author affiliations: Gamlath, Buddhima (1); Huang, Sangxia (2); Svensson, Ola (1).

<records>

<record>

<publisher>Schloss Dagstuhl – Leibniz-Zentrum für Informatik</publisher>

<journalTitle>Leibniz International Proceedings in Informatics</journalTitle>

<doi>10.4230/LIPIcs.ICALP.2018.57</doi>

<documentType>article</documentType>

<title language="eng">Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering</title>

<authors>

<author>

<name>Gamlath, Buddhima</name>

</author>

<author>

<name>Huang, Sangxia</name>

</author>

<author>

<name>Svensson, Ola</name>

</author>

</authors>

<affiliationsList>

<affiliationName affiliationId="1">École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland</affiliationName>

<affiliationName affiliationId="2">Sony Mobile Communications, Lund, Sweden</affiliationName>

</affiliationsList>

<abstract language="eng">We study k-means clustering in a semi-supervised setting. Given an oracle that returns whether two given points belong to the same cluster in a fixed optimal clustering, we investigate the following question: how many oracle queries are sufficient to efficiently recover a clustering that, with probability at least (1 - delta), simultaneously has a cost of at most (1 + epsilon) times the optimal cost and an accuracy of at least (1 - epsilon)? We show how to achieve such a clustering on n points with O((k^2 log n) * m(Q, epsilon^4, delta / (k log n))) oracle queries, when the k clusters can be learned with an epsilon' error and a failure probability delta' using m(Q, epsilon', delta') labeled samples in the supervised setting, where Q is the set of candidate cluster centers. We show that m(Q, epsilon', delta') is small both for k-means instances in Euclidean space and for those in finite metric spaces. We further show that, for Euclidean k-means instances, we can avoid the dependency on n in the query complexity at the expense of an increased dependency on k: specifically, we give a slightly more involved algorithm that uses O(k^4/(epsilon^2 delta) + (k^9/epsilon^4) log(1/delta) + k * m(R^r, epsilon^4/k, delta)) oracle queries. We also show that the number of queries needed for (1 - epsilon)-accuracy in Euclidean k-means must depend linearly on the dimension of the underlying Euclidean space, and that for finite metric space k-means it must be at least logarithmic in the number of candidate centers. These lower bounds show that our query complexities capture the right dependencies on the respective parameters.</abstract>

<fullTextUrl format="pdf">https://drops.dagstuhl.de/storage/00lipics/lipics-vol107-icalp2018/LIPIcs.ICALP.2018.57/LIPIcs.ICALP.2018.57.pdf</fullTextUrl>

<keywords>

<keyword>Clustering</keyword>

<keyword>Semi-supervised Learning</keyword>

<keyword>Approximation Algorithms</keyword>

<keyword>k-Means</keyword>

<keyword>k-Median</keyword>

</keywords>

</record>

</records>