Document

**Published in:** LIPIcs, Volume 219, 39th International Symposium on Theoretical Aspects of Computer Science (STACS 2022)

In this work, we consider d-Hyperedge Estimation and d-Hyperedge Sample problem in a hypergraph H(U(H),F(H)) in the query complexity framework, where U(H) denotes the set of vertices and F(H) denotes the set of hyperedges. The oracle access to the hypergraph is called Colorful Independence Oracle (CID), which takes d (non-empty) pairwise disjoint subsets of vertices A₁,…, A_d ⊆ U(ℋ) as input, and answers whether there exists a hyperedge in H having (exactly) one vertex in each A_i, i ∈ {1,2,…,d}. The problem of d-Hyperedge Estimation and d-Hyperedge Sample with CID oracle access is important in its own right as a combinatorial problem. Also, Dell et al. [SODA '20] established that decision vs counting complexities of a number of combinatorial optimization problems can be abstracted out as d-Hyperedge Estimation problems with a CID oracle access.
The main technical contribution of the paper is an algorithm that estimates m = |F(H)| with m̂ such that
1/(C_{d)log^{d-1} n) ≤ m̂/m ≤ C_{d} log ^{d-1} n.
by using at most C_{d}log ^{d+2} n many CID queries, where n denotes the number of vertices in the hypergraph H and C_d is a constant that depends only on d}. Our result coupled with the framework of Dell et al. [SODA '21] implies improved bounds for the following fundamental problems:
Edge Estimation using the Bipartite Independent Set (BIS). We improve the bound obtained by Beame et al. [ITCS '18, TALG '20].
Triangle Estimation using the Tripartite Independent Set (TIS). The previous best bound for the case of graphs with low co-degree (Co-degree for an edge in the graph is the number of triangles incident to that edge in the graph) was due to Bhattacharya et al. [ISAAC '19, TOCS '21], and Dell {et al.}’s result gives the best bound for the case of general graphs [SODA '21]. We improve both of these bounds.
Hyperedge Estimation & Sampling using Colorful Independence Oracle (CID). We give an improvement over the bounds obtained by Dell et al. [SODA '21].

Anup Bhattacharya, Arijit Bishnu, Arijit Ghosh, and Gopinath Mishra. Faster Counting and Sampling Algorithms Using Colorful Decision Oracle. In 39th International Symposium on Theoretical Aspects of Computer Science (STACS 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 219, pp. 10:1-10:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.STACS.2022.10, author = {Bhattacharya, Anup and Bishnu, Arijit and Ghosh, Arijit and Mishra, Gopinath}, title = {{Faster Counting and Sampling Algorithms Using Colorful Decision Oracle}}, booktitle = {39th International Symposium on Theoretical Aspects of Computer Science (STACS 2022)}, pages = {10:1--10:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-222-8}, ISSN = {1868-8969}, year = {2022}, volume = {219}, editor = {Berenbrink, Petra and Monmege, Benjamin}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2022.10}, URN = {urn:nbn:de:0030-drops-158205}, doi = {10.4230/LIPIcs.STACS.2022.10}, annote = {Keywords: Query Complexity, Subset Query, Hyperedge Estimation, and Colorful Independent Set oracle} }

Document

APPROX

**Published in:** LIPIcs, Volume 207, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021)

The Euclidean k-median problem is defined in the following manner: given a set 𝒳 of n points in d-dimensional Euclidean space ℝ^d, and an integer k, find a set C ⊂ ℝ^d of k points (called centers) such that the cost function Φ(C,𝒳) ≡ ∑_{x ∈ 𝒳} min_{c ∈ C} ‖x-c‖₂ is minimized. The Euclidean k-means problem is defined similarly by replacing the distance with squared Euclidean distance in the cost function. Various hardness of approximation results are known for the Euclidean k-means problem [Pranjal Awasthi et al., 2015; Euiwoong Lee et al., 2017; Vincent Cohen{-}Addad and {Karthik {C. S.}}, 2019]. However, no hardness of approximation result was known for the Euclidean k-median problem. In this work, assuming the unique games conjecture (UGC), we provide the hardness of approximation result for the Euclidean k-median problem in O(log k) dimensional space. This solves an open question posed explicitly in the work of Awasthi et al. [Pranjal Awasthi et al., 2015].
Furthermore, we study the hardness of approximation for the Euclidean k-means/k-median problems in the bi-criteria setting where an algorithm is allowed to choose more than k centers. That is, bi-criteria approximation algorithms are allowed to output β k centers (for constant β > 1) and the approximation ratio is computed with respect to the optimal k-means/k-median cost. We show the hardness of bi-criteria approximation result for the Euclidean k-median problem for any β < 1.015, assuming UGC. We also show a similar hardness of bi-criteria approximation result for the Euclidean k-means problem with a stronger bound of β < 1.28, again assuming UGC.

Anup Bhattacharya, Dishant Goyal, and Ragesh Jaiswal. Hardness of Approximation for Euclidean k-Median. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 207, pp. 4:1-4:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.APPROX/RANDOM.2021.4, author = {Bhattacharya, Anup and Goyal, Dishant and Jaiswal, Ragesh}, title = {{Hardness of Approximation for Euclidean k-Median}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021)}, pages = {4:1--4:23}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-207-5}, ISSN = {1868-8969}, year = {2021}, volume = {207}, editor = {Wootters, Mary and Sanit\`{a}, Laura}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2021.4}, URN = {urn:nbn:de:0030-drops-146979}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2021.4}, annote = {Keywords: Hardness of approximation, bicriteria approximation, approximation algorithms, k-median, k-means} }

Document

**Published in:** LIPIcs, Volume 185, 12th Innovations in Theoretical Computer Science Conference (ITCS 2021)

We study a graph coloring problem that is otherwise easy in the RAM model but becomes quite non-trivial in the one-pass streaming model. In contrast to previous graph coloring problems in streaming that try to find an assignment of colors to vertices, our main work is on estimating the number of conflicting or monochromatic edges given a coloring function that is streaming along with the graph; we call the problem Conflict-Est. The coloring function on a vertex can be read or accessed only when the vertex is revealed in the stream. If we need the color on a vertex that has streamed past, then that color, along with its vertex, has to be stored explicitly. We provide algorithms for a graph that is streaming in different variants of the vertex arrival in one-pass streaming model, viz. the Vertex Arrival (VA), Vertex Arrival With Degree Oracle (VAdeg), Vertex Arrival in Random Order (VArand) models, with special focus on the random order model. We also provide matching lower bounds for most of the cases. The mainstay of our work is in showing that the properties of a random order stream can be exploited to design efficient streaming algorithms for estimating the number of monochromatic edges. We have also obtained a lower bound, though not matching the upper bound, for the random order model. Among all the three models vis-a-vis this problem, we can show a clear separation of power in favor of the VArand model.

Anup Bhattacharya, Arijit Bishnu, Gopinath Mishra, and Anannya Upasana. Even the Easiest(?) Graph Coloring Problem Is Not Easy in Streaming!. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 185, pp. 15:1-15:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.ITCS.2021.15, author = {Bhattacharya, Anup and Bishnu, Arijit and Mishra, Gopinath and Upasana, Anannya}, title = {{Even the Easiest(?) Graph Coloring Problem Is Not Easy in Streaming!}}, booktitle = {12th Innovations in Theoretical Computer Science Conference (ITCS 2021)}, pages = {15:1--15:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-177-1}, ISSN = {1868-8969}, year = {2021}, volume = {185}, editor = {Lee, James R.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2021.15}, URN = {urn:nbn:de:0030-drops-135544}, doi = {10.4230/LIPIcs.ITCS.2021.15}, annote = {Keywords: Streaming, random ordering, graph coloring, estimation, lower bounds} }

Document

**Published in:** LIPIcs, Volume 182, 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)

We generalise the results of Bhattacharya et al. [Bhattacharya et al., 2018] for the list-k-means problem defined as - for a (unknown) partition X₁, ..., X_k of the dataset X ⊆ ℝ^d, find a list of k-center-sets (each element in the list is a set of k centers) such that at least one of k-center-sets {c₁, ..., c_k} in the list gives an (1+ε)-approximation with respect to the cost function min_{permutation π} [∑_{i = 1}^{k} ∑_{x ∈ X_i} ||x - c_{π(i)}||²]. The list-k-means problem is important for the constrained k-means problem since algorithms for the former can be converted to {PTAS} for various versions of the latter. The algorithm for the list-k-means problem by Bhattacharya et al. is a D²-sampling based algorithm that runs in k iterations. Making use of a constant factor solution for the (classical or unconstrained) k-means problem, we generalise the algorithm of Bhattacharya et al. in two ways - (i) for any fixed set X_{j₁}, ..., X_{j_t} of t ≤ k clusters, the algorithm produces a list of (k/(ε))^{O(t/(ε))} t-center sets such that (w.h.p.) at least one of them is good for X_{j₁}, ..., X_{j_t}, and (ii) the algorithm runs in a single iteration. Following are the consequences of our generalisations:
1) Faster PTAS under stability and a parameterised reduction: Property (i) of our generalisation is useful in scenarios where finding good centers becomes easier once good centers for a few "bad" clusters have been chosen. One such case is clustering under stability of Awasthi et al. [Awasthi et al., 2010] where the number of such bad clusters is a constant. Using property (i), we significantly improve the running time of their algorithm from O(dn³) (k log{n})^{poly(1/(β), 1/(ε))} to O (dn³ (k/(ε)) ^{O(1/βε²)}). Another application is a parameterised reduction from the outlier version of k-means to the classical one where the bad clusters are the outliers.
2) Streaming algorithms: The sampling algorithm running in a single iteration (i.e., property (ii)) allows us to design a constant-pass, logspace streaming algorithm for the list-k-means problem. This can be converted to a constant-pass, logspace streaming PTAS for various constrained versions of the k-means problem. In particular, this gives a 3-pass, polylog-space streaming PTAS for the constrained binary k-means problem which in turn gives a 4-pass, polylog-space streaming PTAS for the generalised binary 𝓁₀-rank-r approximation problem. This is the first constant pass, polylog-space streaming algorithm for either of the two problems. Coreset based techniques, which is another approach for designing streaming algorithms in general, is not known to work for the constrained binary k-means problem to the best of our knowledge.

Anup Bhattacharya, Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. On Sampling Based Algorithms for k-Means. In 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 182, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.FSTTCS.2020.13, author = {Bhattacharya, Anup and Goyal, Dishant and Jaiswal, Ragesh and Kumar, Amit}, title = {{On Sampling Based Algorithms for k-Means}}, booktitle = {40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)}, pages = {13:1--13:17}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-174-0}, ISSN = {1868-8969}, year = {2020}, volume = {182}, editor = {Saxena, Nitin and Simon, Sunil}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.FSTTCS.2020.13}, URN = {urn:nbn:de:0030-drops-132549}, doi = {10.4230/LIPIcs.FSTTCS.2020.13}, annote = {Keywords: k-means, low rank approximation} }

Document

**Published in:** LIPIcs, Volume 173, 28th Annual European Symposium on Algorithms (ESA 2020)

The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of 𝒪(log k) in expectation.
Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called greedy k-means++ algorithm in which in each iteration multiple possible centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an 𝒪(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(𝓁⋅log k)-approximation in expectation where 𝓁 is the number of possible centers that are sampled in each iteration.
Inspired by the negative results, we study a variation of greedy k-means++ which we call noisy k-means++ algorithm. In this variation only one center is sampled in every iteration but not exactly by D²-sampling. Instead in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂ for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an 𝒪(log² k)-approximation in expectation. We use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.

Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, Greedy and Not so Greedy k-Means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 18:1-18:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.ESA.2020.18, author = {Bhattacharya, Anup and Eube, Jan and R\"{o}glin, Heiko and Schmidt, Melanie}, title = {{Noisy, Greedy and Not so Greedy k-Means++}}, booktitle = {28th Annual European Symposium on Algorithms (ESA 2020)}, pages = {18:1--18:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-162-7}, ISSN = {1868-8969}, year = {2020}, volume = {173}, editor = {Grandoni, Fabrizio and Herman, Grzegorz and Sanders, Peter}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2020.18}, URN = {urn:nbn:de:0030-drops-128848}, doi = {10.4230/LIPIcs.ESA.2020.18}, annote = {Keywords: k-means++, greedy, adaptive sampling} }

Document

RANDOM

**Published in:** LIPIcs, Volume 176, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)

The disjointness problem - where Alice and Bob are given two subsets of {1, … , n} and they have to check if their sets intersect - is a central problem in the world of communication complexity. While both deterministic and randomized communication complexities for this problem are known to be Θ(n), it is also known that if the sets are assumed to be drawn from some restricted set systems then the communication complexity can be much lower. In this work, we explore how communication complexity measures change with respect to the complexity of the underlying set system. The complexity measure for the set system that we use in this work is the Vapnik–Chervonenkis (VC) dimension. More precisely, on any set system with VC dimension bounded by d, we analyze how large can the deterministic and randomized communication complexities be, as a function of d and n. The d-sparse set disjointness problem, where the sets have size at most d, is one such set system with VC dimension d. The deterministic and the randomized communication complexities of the d-sparse set disjointness problem have been well studied and is known to be Θ (d log ({n}/{d})) and Θ(d), respectively, in the multi-round communication setting. In this paper, we address the question of whether the randomized communication complexity is always upper bounded by a function of the VC dimension of the set system, and does there always exist a gap between the deterministic and randomized communication complexity for set systems with small VC dimension.
In this paper, we construct two natural set systems of VC dimension d, motivated from geometry. Using these set systems we show that the deterministic and randomized communication complexity can be Θ̃(dlog (n/d)) for set systems of VC dimension d and this matches the deterministic upper bound for all set systems of VC dimension d. We also study the deterministic and randomized communication complexities of the set intersection problem when sets belong to a set system of bounded VC dimension. We show that there exists set systems of VC dimension d such that both deterministic and randomized (one-way and multi-round) complexities for the set intersection problem can be as high as Θ(dlog (n/d)), and this is tight among all set systems of VC dimension d.

Anup Bhattacharya, Sourav Chakraborty, Arijit Ghosh, Gopinath Mishra, and Manaswi Paraashar. Disjointness Through the Lens of Vapnik–Chervonenkis Dimension: Sparsity and Beyond. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 23:1-23:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.APPROX/RANDOM.2020.23, author = {Bhattacharya, Anup and Chakraborty, Sourav and Ghosh, Arijit and Mishra, Gopinath and Paraashar, Manaswi}, title = {{Disjointness Through the Lens of Vapnik–Chervonenkis Dimension: Sparsity and Beyond}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)}, pages = {23:1--23:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-164-1}, ISSN = {1868-8969}, year = {2020}, volume = {176}, editor = {Byrka, Jaros{\l}aw and Meka, Raghu}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2020.23}, URN = {urn:nbn:de:0030-drops-126261}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2020.23}, annote = {Keywords: Communication complexity, VC dimension, Sparsity, and Geometric Set System} }

Document

**Published in:** LIPIcs, Volume 149, 30th International Symposium on Algorithms and Computation (ISAAC 2019)

Estimating the number of triangles in a graph is one of the most fundamental problems in sublinear algorithms. In this work, we provide an approximate triangle counting algorithm using only polylogarithmic queries when the number of triangles on any edge in the graph is polylogarithmically bounded. Our query oracle Tripartite Independent Set (TIS) takes three disjoint sets of vertices A, B and C as input, and answers whether there exists a triangle having one endpoint in each of these three sets. Our query model generally belongs to the class of group queries (Ron and Tsur, ACM ToCT, 2016; Dell and Lapinskas, STOC 2018) and in particular is inspired by the Bipartite Independent Set (BIS) query oracle of Beame et al. (ITCS 2018). We extend the algorithmic framework of Beame et al., with TIS replacing BIS, for triangle counting using ideas from color coding due to Alon et al. (J. ACM, 1995) and a concentration inequality for sums of random variables with bounded dependency (Janson, Rand. Struct. Alg., 2004).

Anup Bhattacharya, Arijit Bishnu, Arijit Ghosh, and Gopinath Mishra. Triangle Estimation Using Tripartite Independent Set Queries. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 149, pp. 19:1-19:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.ISAAC.2019.19, author = {Bhattacharya, Anup and Bishnu, Arijit and Ghosh, Arijit and Mishra, Gopinath}, title = {{Triangle Estimation Using Tripartite Independent Set Queries}}, booktitle = {30th International Symposium on Algorithms and Computation (ISAAC 2019)}, pages = {19:1--19:17}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-130-6}, ISSN = {1868-8969}, year = {2019}, volume = {149}, editor = {Lu, Pinyan and Zhang, Guochuan}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2019.19}, URN = {urn:nbn:de:0030-drops-115156}, doi = {10.4230/LIPIcs.ISAAC.2019.19}, annote = {Keywords: Triangle estimation, query complexity, sublinear algorithm} }

Document

**Published in:** LIPIcs, Volume 94, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018)

Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log{k} + k log{n}) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard.
In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few of such same-cluster queries enables one to get a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps) which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega (k/poly log k) same-cluster queries. Our algorithm can be extended for the case where the query answers are wrong with some bounded probability. Another result we show for the k-means++ seeding is that a small modification of the k-means++ seeding within the SSAC framework converts it to a constant factor approximation algorithm instead of the well known O(log k)-approximation algorithm.

Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate Clustering with Same-Cluster Queries. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 94, pp. 40:1-40:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{ailon_et_al:LIPIcs.ITCS.2018.40, author = {Ailon, Nir and Bhattacharya, Anup and Jaiswal, Ragesh and Kumar, Amit}, title = {{Approximate Clustering with Same-Cluster Queries}}, booktitle = {9th Innovations in Theoretical Computer Science Conference (ITCS 2018)}, pages = {40:1--40:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-060-6}, ISSN = {1868-8969}, year = {2018}, volume = {94}, editor = {Karlin, Anna R.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2018.40}, URN = {urn:nbn:de:0030-drops-83358}, doi = {10.4230/LIPIcs.ITCS.2018.40}, annote = {Keywords: k-means, semi-supervised learning, query bounds} }

Document

**Published in:** LIPIcs, Volume 47, 33rd Symposium on Theoretical Aspects of Computer Science (STACS 2016)

The classical center based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property that the points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the r-gather clustering problem where there is an additional constraint that each of the clusters should have at least r points or the capacitated clustering problem where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O_1, ..., O_k are an arbitrary partition of the dataset and the goal is to output k-centers c_1, ..., c_k such that the objective function sum_{i=1}^{k} sum_{x in O_{i}} ||x - c_{i}||^2 is minimized. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of k centers, will not behave well as far as optimizing the above objective function is concerned. However, this does not rule out the existence of algorithms that output a list of such k centers such that at least one of these k centers behaves well. Given an error parameter epsilon > 0, let l denote the size of the smallest list of k-centers such that at least one of the k-centers gives a (1+epsilon) approximation w.r.t. the objective function above. In this paper, we show an upper bound on l by giving a randomized algorithm that outputs a list of 2^{~O(k/epsilon)} k-centers. We also give a closely matching lower bound of 2^{~Omega(k/sqrt{epsilon})}. Moreover, our algorithm runs in time O(n * d * 2^{~O(k/epsilon)}). This is a significant improvement over the previous result of Ding and Xu who gave an algorithm with running time O(n * d * (log{n})^{k} * 2^{poly(k/epsilon)}) and output a list of size O((log{n})^k * 2^{poly(k/epsilon)}). Our techniques generalize for the k-median problem and for many other settings where non-Euclidean distance measures are involved.

Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Faster Algorithms for the Constrained k-Means Problem. In 33rd Symposium on Theoretical Aspects of Computer Science (STACS 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 47, pp. 16:1-16:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.STACS.2016.16, author = {Bhattacharya, Anup and Jaiswal, Ragesh and Kumar, Amit}, title = {{Faster Algorithms for the Constrained k-Means Problem}}, booktitle = {33rd Symposium on Theoretical Aspects of Computer Science (STACS 2016)}, pages = {16:1--16:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-001-9}, ISSN = {1868-8969}, year = {2016}, volume = {47}, editor = {Ollinger, Nicolas and Vollmer, Heribert}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2016.16}, URN = {urn:nbn:de:0030-drops-57179}, doi = {10.4230/LIPIcs.STACS.2016.16}, annote = {Keywords: k-means, k-median, approximation algorithm, sampling} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail