DROPS

Document

DOI: 10.4230/LIPIcs.FORC.2023.10

Distributionally Robust Data Join

Authors: Pranjal Awasthi, Christopher Jung, and Jamie Morgenstern

Published in: LIPIcs, Volume 256, 4th Symposium on Foundations of Responsible Computing (FORC 2023)

Abstract

Suppose we are given two datasets: a labeled dataset and unlabeled dataset which also has additional auxiliary features not present in the first dataset. What is the most principled way to use these datasets together to construct a predictor? The answer should depend upon whether these datasets are generated by the same or different distributions over their mutual feature sets, and how similar the test distribution will be to either of those distributions. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor which minimizes the maximum loss over all probability distributions over the original features, auxiliary features, and binary labels, whose Wasserstein distance is r₁ away from the empirical distribution over the labeled dataset and r₂ away from that of the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO), which allows for two data sources, one of which is unlabeled and may contain auxiliary features.

Cite as

Pranjal Awasthi, Christopher Jung, and Jamie Morgenstern. Distributionally Robust Data Join. In 4th Symposium on Foundations of Responsible Computing (FORC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 256, pp. 10:1-10:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{awasthi_et_al:LIPIcs.FORC.2023.10,
  author =	{Awasthi, Pranjal and Jung, Christopher and Morgenstern, Jamie},
  title =	{{Distributionally Robust Data Join}},
  booktitle =	{4th Symposium on Foundations of Responsible Computing (FORC 2023)},
  pages =	{10:1--10:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-272-3},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{256},
  editor =	{Talwar, Kunal},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.FORC.2023.10},
  URN =		{urn:nbn:de:0030-drops-179311},
  doi =		{10.4230/LIPIcs.FORC.2023.10},
  annote =	{Keywords: Distributionally Robust Optimization, Semi-Supervised Learning, Learning Theory}
}

Document

DOI: 10.4230/LIPIcs.ESA.2019.7

Bilu-Linial Stability, Certified Algorithms and the Independent Set Problem

Authors: Haris Angelidakis, Pranjal Awasthi, Avrim Blum, Vaggos Chatziafratis, and Chen Dan

Published in: LIPIcs, Volume 144, 27th Annual European Symposium on Algorithms (ESA 2019)

Abstract

We study the classic Maximum Independent Set problem under the notion of stability introduced by Bilu and Linial (2010): a weighted instance of Independent Set is gamma-stable if it has a unique optimal solution that remains the unique optimal solution under multiplicative perturbations of the weights by a factor of at most gamma >= 1. The goal then is to efficiently recover this "pronounced" optimal solution exactly. In this work, we solve stable instances of Independent Set on several classes of graphs: we improve upon previous results by solving O~(Delta/sqrt(log Delta))-stable instances on graphs of maximum degree Delta, (k - 1)-stable instances on k-colorable graphs and (1 + epsilon)-stable instances on planar graphs (for any fixed epsilon > 0), using both combinatorial techniques as well as LPs and the Sherali-Adams hierarchy. For general graphs, we present a strong lower bound showing that there are no efficient algorithms for O(n^(1/2 - epsilon))-stable instances of Independent Set, assuming the planted clique conjecture. To complement our negative result, we give an algorithm for (epsilon n)-stable instances, for any fixed epsilon > 0. As a by-product of our techniques, we give algorithms as well as lower bounds for stable instances of Node Multiway Cut (a generalization of Edge Multiway Cut), by exploiting its connections to Vertex Cover. Furthermore, we prove a general structural result showing that the integrality gap of convex relaxations of several maximization problems reduces dramatically on stable instances. Moreover, we initiate the study of certified algorithms for Independent Set. The notion of a gamma-certified algorithm was introduced very recently by Makarychev and Makarychev (2018) and it is a class of gamma-approximation algorithms that satisfy one crucial property: the solution returned is optimal for a perturbation of the original instance, where perturbations are again multiplicative up to a factor of gamma >= 1 (hence, such algorithms not only solve gamma-stable instances optimally, but also have guarantees even on unstable instances). Here, we obtain Delta-certified algorithms for Independent Set on graphs of maximum degree Delta, and (1+epsilon)-certified algorithms on planar graphs. Finally, we analyze the algorithm of Berman and Fürer (1994) and prove that it is a ((Delta + 1)/3 + epsilon)-certified algorithm for Independent Set on graphs of maximum degree Delta where all weights are equal to 1.

Cite as

Haris Angelidakis, Pranjal Awasthi, Avrim Blum, Vaggos Chatziafratis, and Chen Dan. Bilu-Linial Stability, Certified Algorithms and the Independent Set Problem. In 27th Annual European Symposium on Algorithms (ESA 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 144, pp. 7:1-7:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{angelidakis_et_al:LIPIcs.ESA.2019.7,
  author =	{Angelidakis, Haris and Awasthi, Pranjal and Blum, Avrim and Chatziafratis, Vaggos and Dan, Chen},
  title =	{{Bilu-Linial Stability, Certified Algorithms and the Independent Set Problem}},
  booktitle =	{27th Annual European Symposium on Algorithms (ESA 2019)},
  pages =	{7:1--7:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-124-5},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{144},
  editor =	{Bender, Michael A. and Svensson, Ola and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2019.7},
  URN =		{urn:nbn:de:0030-drops-111288},
  doi =		{10.4230/LIPIcs.ESA.2019.7},
  annote =	{Keywords: Bilu-Linial stability, perturbation resilience, beyond worst-case analysis, Independent Set, Vertex Cover, Multiway Cut}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2019.18

Robust Communication-Optimal Distributed Clustering Algorithms

Authors: Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, and David P. Woodruff

Published in: LIPIcs, Volume 132, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)

Abstract

In this work, we study the k-median and k-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer understanding through the lens of beyond worst-case analysis. Our main motivation is the following: for many applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a k-median or k-means algorithm will produce clusterings which are close to matching the target clustering. Worst-case results can guarantee constant factor approximations to the optimal k-median or k-means objective value, but not closeness to the target clustering. Our first result is a distributed algorithm which returns a near-optimal clustering assuming a natural notion of stability, namely, approximation stability [Awasthi and Balcan, 2014], even when a constant fraction of the data are outliers. The communication complexity is O~(sk+z) where s is the number of machines, k is the number of clusters, and z is the number of outliers. Next, we show this amount of communication cannot be improved even in the setting when the input satisfies various non-worst-case assumptions. We give a matching Omega(sk+z) lower bound on the communication required both for approximating the optimal k-means or k-median cost up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, Omega(sk+z) is a communication bottleneck, even for real-world instances.

Cite as

Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, and David P. Woodruff. Robust Communication-Optimal Distributed Clustering Algorithms. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 132, pp. 18:1-18:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{awasthi_et_al:LIPIcs.ICALP.2019.18,
  author =	{Awasthi, Pranjal and Bakshi, Ainesh and Balcan, Maria-Florina and White, Colin and Woodruff, David P.},
  title =	{{Robust Communication-Optimal Distributed Clustering Algorithms}},
  booktitle =	{46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)},
  pages =	{18:1--18:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-109-2},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{132},
  editor =	{Baier, Christel and Chatzigiannakis, Ioannis and Flocchini, Paola and Leonardi, Stefano},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2019.18},
  URN =		{urn:nbn:de:0030-drops-105942},
  doi =		{10.4230/LIPIcs.ICALP.2019.18},
  annote =	{Keywords: robust distributed clustering, communication complexity}
}

Document

DOI: 10.4230/LIPIcs.SOCG.2015.754

The Hardness of Approximation of Euclidean k-Means

Authors: Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop

Published in: LIPIcs, Volume 34, 31st International Symposium on Computational Geometry (SoCG 2015)

Abstract

The Euclidean k-means problem is a classical problem that has been extensively studied in the theoretical computer science, machine learning and the computational geometry communities. In this problem, we are given a set of n points in Euclidean space R^d, and the goal is to choose k center points in R^d so that the sum of squared distances of each point to its nearest center is minimized. The best approximation algorithms for this problem include a polynomial time constant factor approximation for general k and a (1+c)-approximation which runs in time poly(n) exp(k/c). At the other extreme, the only known computational complexity result for this problem is NP-hardness [Aloise et al.'09]. The main difficulty in obtaining hardness results stems from the Euclidean nature of the problem, and the fact that any point in R^d can be a potential center. This gap in understanding left open the intriguing possibility that the problem might admit a PTAS for all k, d. In this paper we provide the first hardness of approximation for the Euclidean k-means problem. Concretely, we show that there exists a constant c > 0 such that it is NP-hard to approximate the k-means objective to within a factor of (1+c). We show this via an efficient reduction from the vertex cover problem on triangle-free graphs: given a triangle-free graph, the goal is to choose the fewest number of vertices which are incident on all the edges. Additionally, we give a proof that the current best hardness results for vertex cover can be carried over to triangle-free graphs. To show this we transform G, a known hard vertex cover instance, by taking a graph product with a suitably chosen graph H, and showing that the size of the (normalized) maximum independent set is almost exactly preserved in the product graph using a spectral analysis, which might be of independent interest.

Cite as

Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The Hardness of Approximation of Euclidean k-Means. In 31st International Symposium on Computational Geometry (SoCG 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 34, pp. 754-767, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)

Copy BibTex To Clipboard

@InProceedings{awasthi_et_al:LIPIcs.SOCG.2015.754,
  author =	{Awasthi, Pranjal and Charikar, Moses and Krishnaswamy, Ravishankar and Sinop, Ali Kemal},
  title =	{{The Hardness of Approximation of Euclidean k-Means}},
  booktitle =	{31st International Symposium on Computational Geometry (SoCG 2015)},
  pages =	{754--767},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-939897-83-5},
  ISSN =	{1868-8969},
  year =	{2015},
  volume =	{34},
  editor =	{Arge, Lars and Pach, J\'{a}nos},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SOCG.2015.754},
  URN =		{urn:nbn:de:0030-drops-51178},
  doi =		{10.4230/LIPIcs.SOCG.2015.754},
  annote =	{Keywords: Euclidean k-means, Hardness of Approximation, Vertex Cover}
}

Search Results

Documents authored by Awasthi, Pranjal

Distributionally Robust Data Join

Abstract

Cite as

Bilu-Linial Stability, Certified Algorithms and the Independent Set Problem

Abstract

Cite as

Robust Communication-Optimal Distributed Clustering Algorithms

Abstract

Cite as

The Hardness of Approximation of Euclidean k-Means

Abstract

Cite as