Document

**Published in:** LIPIcs, Volume 251, 14th Innovations in Theoretical Computer Science Conference (ITCS 2023)

A line of work has looked at the problem of recovering an input from distance queries. In this setting, there is an unknown sequence s ∈ {0,1}^{≤ n}, and one chooses a set of queries y ∈ {0,1}^𝒪(n) and receives d(s,y) for a distance function d. The goal is to make as few queries as possible to recover s. Although this problem is well-studied for decomposable distances, i.e., distances of the form d(s,y) = ∑_{i=1}^n f(s_i, y_i) for some function f, which includes the important cases of Hamming distance, 𝓁_p-norms, and M-estimators, to the best of our knowledge this problem has not been studied for non-decomposable distances, for which there are important special cases such as edit distance, dynamic time warping (DTW), Fréchet distance, earth mover’s distance, and so on. We initiate the study and develop a general framework for such distances. Interestingly, for some distances such as DTW or Fréchet, exact recovery of the sequence s is provably impossible, and so we show by allowing the characters in y to be drawn from a slightly larger alphabet this then becomes possible. In a number of cases we obtain optimal or near-optimal query complexity. We also study the role of adaptivity for a number of different distance functions. One motivation for understanding non-adaptivity is that the query sequence can be fixed and the distances of the input to the queries provide a non-linear embedding of the input, which can be used in downstream applications involving, e.g., neural networks for natural language processing.

Zhuangfei Hu, Xinda Li, David P. Woodruff, Hongyang Zhang, and Shufan Zhang. Recovery from Non-Decomposable Distance Oracles. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 73:1-73:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{hu_et_al:LIPIcs.ITCS.2023.73, author = {Hu, Zhuangfei and Li, Xinda and Woodruff, David P. and Zhang, Hongyang and Zhang, Shufan}, title = {{Recovery from Non-Decomposable Distance Oracles}}, booktitle = {14th Innovations in Theoretical Computer Science Conference (ITCS 2023)}, pages = {73:1--73:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-263-1}, ISSN = {1868-8969}, year = {2023}, volume = {251}, editor = {Tauman Kalai, Yael}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2023.73}, URN = {urn:nbn:de:0030-drops-175767}, doi = {10.4230/LIPIcs.ITCS.2023.73}, annote = {Keywords: Sequence Recovery, Edit Distance, DTW Distance, Fr\'{e}chet Distance} }

Document

RANDOM

**Published in:** LIPIcs, Volume 245, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022)

We initiate a broad study of classical problems in the streaming model with insertions and deletions in the setting where we allow the approximation factor α to be much larger than 1. Such algorithms can use significantly less memory than the usual setting for which α = 1+ε for an ε ∈ (0,1). We study large approximations for a number of problems in sketching and streaming, assuming that the underlying n-dimensional vector has all coordinates bounded by M throughout the data stream:
1) For the 𝓁_p norm/quasi-norm, 0 < p ≤ 2, we show that obtaining a poly(n)-approximation requires the same amount of memory as obtaining an O(1)-approximation for any M = n^Θ(1), which holds even for randomly ordered streams or for streams in the bounded deletion model.
2) For estimating the 𝓁_p norm, p > 2, we show an upper bound of O(n^{1-2/p} (log n log M)/α²) bits for an α-approximation, and give a matching lower bound for linear sketches.
3) For the 𝓁₂-heavy hitters problem, we show that the known lower bound of Ω(k log nlog M) bits for identifying (1/k)-heavy hitters holds even if we are allowed to output items that are 1/(α k)-heavy, provided the algorithm succeeds with probability 1-O(1/n). We also obtain a lower bound for linear sketches that is tight even for constant failure probability algorithms.
4) For estimating the number 𝓁₀ of distinct elements, we give an n^{1/t}-approximation algorithm using O(tlog log M) bits of space, as well as a lower bound of Ω(t) bits, both excluding the storage of random bits, where n is the dimension of the underlying frequency vector and M is an upper bound on the magnitude of its coordinates.
5) For α-approximation to the Schatten-p norm, we give near-optimal Õ(n^{2-4/p}/α⁴) sketching dimension for every even integer p and every α ≥ 1, while for p not an even integer we obtain near-optimal sketching dimension once α = Ω(n^{1/q-1/p}), where q is the largest even integer less than p. The latter is surprising as it is unknown what the complexity of Schatten-p norm estimation is for constant approximation; we show once the approximation factor is at least n^{1/q-1/p}, we can obtain near-optimal sketching bounds.

Yi Li, Honghao Lin, David P. Woodruff, and Yuheng Zhang. Streaming Algorithms with Large Approximation Factors. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 245, pp. 13:1-13:23, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{li_et_al:LIPIcs.APPROX/RANDOM.2022.13, author = {Li, Yi and Lin, Honghao and Woodruff, David P. and Zhang, Yuheng}, title = {{Streaming Algorithms with Large Approximation Factors}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022)}, pages = {13:1--13:23}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-249-5}, ISSN = {1868-8969}, year = {2022}, volume = {245}, editor = {Chakrabarti, Amit and Swamy, Chaitanya}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2022.13}, URN = {urn:nbn:de:0030-drops-171354}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2022.13}, annote = {Keywords: streaming algorithms, 𝓁\underlinep norm, heavy hitters, distinct elements} }

Document

RANDOM

**Published in:** LIPIcs, Volume 245, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022)

We introduce data structures for solving robust regression through stochastic gradient descent (SGD) by sampling gradients with probability proportional to their norm, i.e., importance sampling. Although SGD is widely used for large scale machine learning, it is well-known for possibly experiencing slow convergence rates due to the high variance from uniform sampling. On the other hand, importance sampling can significantly decrease the variance but is usually difficult to implement because computing the sampling probabilities requires additional passes over the data, in which case standard gradient descent (GD) could be used instead. In this paper, we introduce an algorithm that approximately samples T gradients of dimension d from nearly the optimal importance sampling distribution for a robust regression problem over n rows. Thus our algorithm effectively runs T steps of SGD with importance sampling while using sublinear space and just making a single pass over the data. Our techniques also extend to performing importance sampling for second-order optimization.

Sepideh Mahabadi, David P. Woodruff, and Samson Zhou. Adaptive Sketches for Robust Regression with Importance Sampling. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 245, pp. 31:1-31:21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{mahabadi_et_al:LIPIcs.APPROX/RANDOM.2022.31, author = {Mahabadi, Sepideh and Woodruff, David P. and Zhou, Samson}, title = {{Adaptive Sketches for Robust Regression with Importance Sampling}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022)}, pages = {31:1--31:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-249-5}, ISSN = {1868-8969}, year = {2022}, volume = {245}, editor = {Chakrabarti, Amit and Swamy, Chaitanya}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2022.31}, URN = {urn:nbn:de:0030-drops-171531}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2022.31}, annote = {Keywords: Streaming algorithms, stochastic optimization, importance sampling} }

Document

Complete Volume

**Published in:** LIPIcs, Volume 229, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)

LIPIcs, Volume 229, ICALP 2022, Complete Volume

Mikołaj Bojańczyk, Emanuela Merelli, and David P. Woodruff. LIPIcs, Volume 229, ICALP 2022, Complete Volume. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 1-2516, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@Proceedings{bojanczyk_et_al:LIPIcs.ICALP.2022, title = {{LIPIcs, Volume 229, ICALP 2022, Complete Volume}}, booktitle = {49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)}, pages = {1--2516}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-235-8}, ISSN = {1868-8969}, year = {2022}, volume = {229}, editor = {Boja\'{n}czyk, Miko{\l}aj and Merelli, Emanuela and Woodruff, David P.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2022}, URN = {urn:nbn:de:0030-drops-163400}, doi = {10.4230/LIPIcs.ICALP.2022}, annote = {Keywords: LIPIcs, Volume 229, ICALP 2022, Complete Volume} }

Document

Front Matter

**Published in:** LIPIcs, Volume 229, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)

Front Matter, Table of Contents, Preface, Conference Organization

Mikołaj Bojańczyk, Emanuela Merelli, and David P. Woodruff. Front Matter, Table of Contents, Preface, Conference Organization. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 0:i-0:xxxvi, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{bojanczyk_et_al:LIPIcs.ICALP.2022.0, author = {Boja\'{n}czyk, Miko{\l}aj and Merelli, Emanuela and Woodruff, David P.}, title = {{Front Matter, Table of Contents, Preface, Conference Organization}}, booktitle = {49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)}, pages = {0:i--0:xxxvi}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-235-8}, ISSN = {1868-8969}, year = {2022}, volume = {229}, editor = {Boja\'{n}czyk, Miko{\l}aj and Merelli, Emanuela and Woodruff, David P.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2022.0}, URN = {urn:nbn:de:0030-drops-163417}, doi = {10.4230/LIPIcs.ICALP.2022.0}, annote = {Keywords: Front Matter, Table of Contents, Preface, Conference Organization} }

Document

**Published in:** LIPIcs, Volume 215, 13th Innovations in Theoretical Computer Science Conference (ITCS 2022)

We present the first semi-streaming polynomial-time approximation scheme (PTAS) for the minimum feedback arc set problem on directed tournaments in a small number of passes. Namely, we obtain a (1 + ε)-approximation in time O (poly(n) 2^{poly(1/ε)}), with p passes, in n^{1+1/p} ⋅ poly((log n)/ε) space. The only previous algorithm with this pass/space trade-off gave a 3-approximation (SODA, 2020), and other polynomial-time algorithms which achieved a (1+ε)-approximation did so with quadratic memory or with a linear number of passes. We also present a new time/space trade-off for 1-pass algorithms that solve the tournament feedback arc set problem. This problem has several applications in machine learning such as creating linear classifiers and doing Bayesian inference. We also provide several additional algorithms and lower bounds for related streaming problems on directed graphs, which is a largely unexplored territory.

Anubhav Baweja, Justin Jia, and David P. Woodruff. An Efficient Semi-Streaming PTAS for Tournament Feedback Arc Set with Few Passes. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 215, pp. 16:1-16:23, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{baweja_et_al:LIPIcs.ITCS.2022.16, author = {Baweja, Anubhav and Jia, Justin and Woodruff, David P.}, title = {{An Efficient Semi-Streaming PTAS for Tournament Feedback Arc Set with Few Passes}}, booktitle = {13th Innovations in Theoretical Computer Science Conference (ITCS 2022)}, pages = {16:1--16:23}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-217-4}, ISSN = {1868-8969}, year = {2022}, volume = {215}, editor = {Braverman, Mark}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2022.16}, URN = {urn:nbn:de:0030-drops-156128}, doi = {10.4230/LIPIcs.ITCS.2022.16}, annote = {Keywords: Minimum Feedback Arc Set, Tournament Graphs, Approximation Algorithms, Semi-streaming Algorithms} }

Document

**Published in:** LIPIcs, Volume 215, 13th Innovations in Theoretical Computer Science Conference (ITCS 2022)

The Boolean Hidden Matching (BHM) problem, introduced in a seminal paper of Gavinsky et al. [STOC'07], has played an important role in lower bounds for graph problems in the streaming model (e.g., subgraph counting, maximum matching, MAX-CUT, Schatten p-norm approximation). The BHM problem typically leads to Ω(√n) space lower bounds for constant factor approximations, with the reductions generating graphs that consist of connected components of constant size. The related Boolean Hidden Hypermatching (BHH) problem provides Ω(n^{1-1/t}) lower bounds for 1+O(1/t) approximation, for integers t ≥ 2. The corresponding reductions produce graphs with connected components of diameter about t, and essentially show that long range exploration is hard in the streaming model with an adversarial order of updates.
In this paper we introduce a natural variant of the BHM problem, called noisy BHM (and its natural noisy BHH variant), that we use to obtain stronger than Ω(√n) lower bounds for approximating a number of the aforementioned problems in graph streams when the input graphs consist only of components of diameter bounded by a fixed constant.
We next introduce and study the graph classification problem, where the task is to test whether the input graph is isomorphic to a given graph. As a first step, we use the noisy BHM problem to show that the problem of classifying whether an underlying graph is isomorphic to a complete binary tree in insertion-only streams requires Ω(n) space, which seems challenging to show using either BHM or BHH.

Michael Kapralov, Amulya Musipatla, Jakab Tardos, David P. Woodruff, and Samson Zhou. Noisy Boolean Hidden Matching with Applications. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 215, pp. 91:1-91:19, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{kapralov_et_al:LIPIcs.ITCS.2022.91, author = {Kapralov, Michael and Musipatla, Amulya and Tardos, Jakab and Woodruff, David P. and Zhou, Samson}, title = {{Noisy Boolean Hidden Matching with Applications}}, booktitle = {13th Innovations in Theoretical Computer Science Conference (ITCS 2022)}, pages = {91:1--91:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-217-4}, ISSN = {1868-8969}, year = {2022}, volume = {215}, editor = {Braverman, Mark}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2022.91}, URN = {urn:nbn:de:0030-drops-156876}, doi = {10.4230/LIPIcs.ITCS.2022.91}, annote = {Keywords: Boolean Hidden Matching, Lower Bounds, Communication Complexity, Streaming Algorithms} }

Document

RANDOM

**Published in:** LIPIcs, Volume 207, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021)

We study the distribution of the matrix product G₁ G₂ ⋯ G_r of r independent Gaussian matrices of various sizes, where G_i is d_{i-1} × d_i, and we denote p = d₀, q = d_r, and require d₁ = d_{r-1}. Here the entries in each G_i are standard normal random variables with mean 0 and variance 1. Such products arise in the study of wireless communication, dynamical systems, and quantum transport, among other places. We show that, provided each d_i, i = 1, …, r, satisfies d_i ≥ C p ⋅ q, where C ≥ C₀ for a constant C₀ > 0 depending on r, then the matrix product G₁ G₂ ⋯ G_r has variation distance at most δ to a p × q matrix G of i.i.d. standard normal random variables with mean 0 and variance ∏_{i = 1}^{r-1} d_i. Here δ → 0 as C → ∞. Moreover, we show a converse for constant r that if d_i < C' max{p,q}^{1/2}min{p,q}^{3/2} for some i, then this total variation distance is at least δ', for an absolute constant δ' > 0 depending on C' and r. This converse is best possible when p = Θ(q).

Yi Li and David P. Woodruff. The Product of Gaussian Matrices Is Close to Gaussian. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 207, pp. 35:1-35:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{li_et_al:LIPIcs.APPROX/RANDOM.2021.35, author = {Li, Yi and Woodruff, David P.}, title = {{The Product of Gaussian Matrices Is Close to Gaussian}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021)}, pages = {35:1--35:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-207-5}, ISSN = {1868-8969}, year = {2021}, volume = {207}, editor = {Wootters, Mary and Sanit\`{a}, Laura}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2021.35}, URN = {urn:nbn:de:0030-drops-147281}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2021.35}, annote = {Keywords: random matrix theory, total variation distance, matrix product} }

Document

**Published in:** LIPIcs, Volume 200, 36th Computational Complexity Conference (CCC 2021)

The multiplayer promise set disjointness is one of the most widely used problems from communication complexity in applications. In this problem there are k players with subsets S¹, …, S^k, each drawn from {1, 2, …, n}, and we are promised that either the sets are (1) pairwise disjoint, or (2) there is a unique element j occurring in all the sets, which are otherwise pairwise disjoint. The total communication of solving this problem with constant probability in the blackboard model is Ω(n/k).
We observe for most applications, it instead suffices to look at what we call the "mostly" set disjointness problem, which changes case (2) to say there is a unique element j occurring in at least half of the sets, and the sets are otherwise disjoint. This change gives us a much simpler proof of an Ω(n/k) randomized total communication lower bound, avoiding Hellinger distance and Poincare inequalities. Our proof also gives strong lower bounds for high probability protocols, which are much larger than what is possible for the set disjointness problem. Using this we show several new results for data streams:
1) for 𝓁₂-Heavy Hitters, any O(1)-pass streaming algorithm in the insertion-only model for detecting if an ε-𝓁₂-heavy hitter exists requires min(1/(ε²)log((ε²n)/δ), 1/(ε)n^{1/2}) bits of memory, which is optimal up to a log n factor. For deterministic algorithms and constant ε, this gives an Ω(n^{1/2}) lower bound, improving the prior Ω(log n) lower bound. We also obtain lower bounds for Zipfian distributions.
2) for 𝓁_p-Estimation, p > 2, we show an O(1)-pass Ω(n^{1-2/p} log(1/δ)) bit lower bound for outputting an O(1)- approximation with probability 1-δ, in the insertion-only model. This is optimal, and the best previous lower bound was Ω(n^{1-2/p} + log(1/δ)).
3) for low rank approximation of a sparse matrix in ℝ^{d× n}, if we see the rows of a matrix one at a time in the row-order model, each row having O(1) non-zero entries, any deterministic algorithm requires Ω(√d) memory to output an O(1)-approximate rank-1 approximation. Finally, we consider strict and general turnstile streaming models, and show separations between sketching lower bounds and non-sketching upper bounds for the heavy hitters problem.

Akshay Kamath, Eric Price, and David P. Woodruff. A Simple Proof of a New Set Disjointness with Applications to Data Streams. In 36th Computational Complexity Conference (CCC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 200, pp. 37:1-37:24, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{kamath_et_al:LIPIcs.CCC.2021.37, author = {Kamath, Akshay and Price, Eric and Woodruff, David P.}, title = {{A Simple Proof of a New Set Disjointness with Applications to Data Streams}}, booktitle = {36th Computational Complexity Conference (CCC 2021)}, pages = {37:1--37:24}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-193-1}, ISSN = {1868-8969}, year = {2021}, volume = {200}, editor = {Kabanets, Valentine}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CCC.2021.37}, URN = {urn:nbn:de:0030-drops-143119}, doi = {10.4230/LIPIcs.CCC.2021.37}, annote = {Keywords: Streaming algorithms, heavy hitters, communication complexity, information complexity} }

Document

Invited Talk

**Published in:** LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

We give an overview of dimensionality reduction methods, or sketching, for a number of problems in optimization, first surveying work using these methods for classical problems, which gives near optimal algorithms for regression, low rank approximation, and natural variants. We then survey recent work applying sketching to column subset selection, kernel methods, sublinear algorithms for structured matrices, tensors, trace estimation, and so on. The focus is on fast algorithms. This is a short survey accompanying an invited talk at ICALP, 2021.

David P. Woodruff. A Very Sketchy Talk (Invited Talk). In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 6:1-6:8, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{woodruff:LIPIcs.ICALP.2021.6, author = {Woodruff, David P.}, title = {{A Very Sketchy Talk}}, booktitle = {48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)}, pages = {6:1--6:8}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-195-5}, ISSN = {1868-8969}, year = {2021}, volume = {198}, editor = {Bansal, Nikhil and Merelli, Emanuela and Worrell, James}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.6}, URN = {urn:nbn:de:0030-drops-140755}, doi = {10.4230/LIPIcs.ICALP.2021.6}, annote = {Keywords: dimensionality reduction, optimization, randomized numerical linear algebra, sketching} }

Document

Track A: Algorithms, Complexity and Games

**Published in:** LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

We study the classical problem of moment estimation of an underlying vector whose n coordinates are implicitly defined through a series of updates in a data stream. We show that if the updates to the vector arrive in the random-order insertion-only model, then there exist space efficient algorithms with improved dependencies on the approximation parameter ε. In particular, for any real p > 2, we first obtain an algorithm for F_p moment estimation using 𝒪̃(1/(ε^{4/p})⋅ n^{1-2/p}) bits of memory. Our techniques also give algorithms for F_p moment estimation with p > 2 on arbitrary order insertion-only and turnstile streams, using 𝒪̃(1/(ε^{4/p})⋅ n^{1-2/p}) bits of space and two passes, which is the first optimal multi-pass F_p estimation algorithm up to log n factors. Finally, we give an improved lower bound of Ω(1/(ε²)⋅ n^{1-2/p}) for one-pass insertion-only streams. Our results separate the complexity of this problem both between random and non-random orders, as well as one-pass and multi-pass streams.

David P. Woodruff and Samson Zhou. Separations for Estimating Large Frequency Moments on Data Streams. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 112:1-112:21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{woodruff_et_al:LIPIcs.ICALP.2021.112, author = {Woodruff, David P. and Zhou, Samson}, title = {{Separations for Estimating Large Frequency Moments on Data Streams}}, booktitle = {48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)}, pages = {112:1--112:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-195-5}, ISSN = {1868-8969}, year = {2021}, volume = {198}, editor = {Bansal, Nikhil and Merelli, Emanuela and Worrell, James}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.112}, URN = {urn:nbn:de:0030-drops-141810}, doi = {10.4230/LIPIcs.ICALP.2021.112}, annote = {Keywords: streaming algorithms, frequency moments, random order, lower bounds} }

Document

**Published in:** LIPIcs, Volume 185, 12th Innovations in Theoretical Computer Science Conference (ITCS 2021)

In the masked low-rank approximation problem, one is given data matrix A ∈ ℝ^{n × n} and binary mask matrix W ∈ {0,1}^{n × n}. The goal is to find a rank-k matrix L for which:
cost(L) := ∑_{i=1}^n ∑_{j=1}^n W_{i,j} ⋅ (A_{i,j} - L_{i,j})² ≤ OPT + ε ‖A‖_F²,
where OPT = min_{rank-k L̂} cost(L̂) and ε is a given error parameter. Depending on the choice of W, the above problem captures factor analysis, low-rank plus diagonal decomposition, robust PCA, low-rank matrix completion, low-rank plus block matrix approximation, low-rank recovery from monotone missing data, and a number of other important problems. Many of these problems are NP-hard, and while algorithms with provable guarantees are known in some cases, they either 1) run in time n^Ω(k²/ε) or 2) make strong assumptions, for example, that A is incoherent or that the entries in W are chosen independently and uniformly at random.
In this work, we show that a common polynomial time heuristic, which simply sets A to 0 where W is 0, and then finds a standard low-rank approximation, yields bicriteria approximation guarantees for this problem. In particular, for rank k' > k depending on the public coin partition number of W, the heuristic outputs rank-k' L with cost(L) ≤ OPT + ε ‖A‖_F². This partition number is in turn bounded by the randomized communication complexity of W, when interpreted as a two-player communication matrix. For many important cases, including all those listed above, this yields bicriteria approximation guarantees with rank k' = k ⋅ poly(log n/ε).
Beyond this result, we show that different notions of communication complexity yield bicriteria algorithms for natural variants of masked low-rank approximation. For example, multi-player number-in-hand communication complexity connects to masked tensor decomposition and non-deterministic communication complexity to masked Boolean low-rank factorization.

Cameron Musco, Christopher Musco, and David P. Woodruff. Simple Heuristics Yield Provable Algorithms for Masked Low-Rank Approximation. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 185, pp. 6:1-6:20, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{musco_et_al:LIPIcs.ITCS.2021.6, author = {Musco, Cameron and Musco, Christopher and Woodruff, David P.}, title = {{Simple Heuristics Yield Provable Algorithms for Masked Low-Rank Approximation}}, booktitle = {12th Innovations in Theoretical Computer Science Conference (ITCS 2021)}, pages = {6:1--6:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-177-1}, ISSN = {1868-8969}, year = {2021}, volume = {185}, editor = {Lee, James R.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2021.6}, URN = {urn:nbn:de:0030-drops-135452}, doi = {10.4230/LIPIcs.ITCS.2021.6}, annote = {Keywords: low-rank approximation, communication complexity, weighted low-rank approximation, bicriteria approximation algorithms} }

Document

RANDOM

**Published in:** LIPIcs, Volume 176, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)

We consider the general problem of learning about a matrix through vector-matrix-vector queries. These queries provide the value of u^{T}Mv over a fixed field 𝔽 for a specified pair of vectors u,v ∈ 𝔽ⁿ. To motivate these queries, we observe that they generalize many previously studied models, such as independent set queries, cut queries, and standard graph queries. They also specialize the recently studied matrix-vector query model. Our work is exploratory and broad, and we provide new upper and lower bounds for a wide variety of problems, spanning linear algebra, statistics, and graphs. Many of our results are nearly tight, and we use diverse techniques from linear algebra, randomized algorithms, and communication complexity.

Cyrus Rashtchian, David P. Woodruff, and Hanlin Zhu. Vector-Matrix-Vector Queries for Solving Linear Algebra, Statistics, and Graph Problems. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 26:1-26:20, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{rashtchian_et_al:LIPIcs.APPROX/RANDOM.2020.26, author = {Rashtchian, Cyrus and Woodruff, David P. and Zhu, Hanlin}, title = {{Vector-Matrix-Vector Queries for Solving Linear Algebra, Statistics, and Graph Problems}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)}, pages = {26:1--26:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-164-1}, ISSN = {1868-8969}, year = {2020}, volume = {176}, editor = {Byrka, Jaros{\l}aw and Meka, Raghu}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2020.26}, URN = {urn:nbn:de:0030-drops-126294}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2020.26}, annote = {Keywords: Query complexity, property testing, vector-matrix-vector, linear algebra, statistics, graph parameter estimation} }

Document

APPROX

**Published in:** LIPIcs, Volume 176, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)

We study the space complexity of solving the bias-regularized SVM problem in the streaming model. In particular, given a data set (x_i,y_i) ∈ ℝ^d× {-1,+1}, the objective function is F_λ(θ,b) = λ/2‖(θ,b)‖₂² + 1/n∑_{i=1}ⁿ max{0,1-y_i(θ^Tx_i+b)} and the goal is to find the parameters that (approximately) minimize this objective. This is a classic supervised learning problem that has drawn lots of attention, including for developing fast algorithms for solving the problem approximately: i.e., for finding (θ,b) such that F_λ(θ,b) ≤ min_{(θ',b')} F_λ(θ',b')+ε.
One of the most widely used algorithms for approximately optimizing the SVM objective is Stochastic Gradient Descent (SGD), which requires only O(1/λε) random samples, and which immediately yields a streaming algorithm that uses O(d/λε) space. For related problems, better streaming algorithms are only known for smooth functions, unlike the SVM objective that we focus on in this work.
We initiate an investigation of the space complexity for both finding an approximate optimum of this objective, and for the related "point estimation" problem of sketching the data set to evaluate the function value F_λ on any query (θ, b). We show that, for both problems, for dimensions d = 1,2, one can obtain streaming algorithms with space polynomially smaller than 1/λε, which is the complexity of SGD for strongly convex functions like the bias-regularized SVM [Shalev-Shwartz et al., 2007], and which is known to be tight in general, even for d = 1 [Agarwal et al., 2009]. We also prove polynomial lower bounds for both point estimation and optimization. In particular, for point estimation we obtain a tight bound of Θ(1/√{ε}) for d = 1 and a nearly tight lower bound of Ω̃(d/{ε}²) for d = Ω(log(1/ε)). Finally, for optimization, we prove a Ω(1/√{ε}) lower bound for d = Ω(log(1/ε)), and show similar bounds when d is constant.

Alexandr Andoni, Collin Burns, Yi Li, Sepideh Mahabadi, and David P. Woodruff. Streaming Complexity of SVMs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 50:1-50:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{andoni_et_al:LIPIcs.APPROX/RANDOM.2020.50, author = {Andoni, Alexandr and Burns, Collin and Li, Yi and Mahabadi, Sepideh and Woodruff, David P.}, title = {{Streaming Complexity of SVMs}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)}, pages = {50:1--50:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-164-1}, ISSN = {1868-8969}, year = {2020}, volume = {176}, editor = {Byrka, Jaros{\l}aw and Meka, Raghu}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2020.50}, URN = {urn:nbn:de:0030-drops-126532}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2020.50}, annote = {Keywords: support vector machine, streaming algorithm, space lower bound, sketching algorithm, point estimation} }

Document

APPROX

**Published in:** LIPIcs, Volume 176, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)

We study the Maximum Independent Set problem for geometric objects given in the data stream model. A set of geometric objects is said to be independent if the objects are pairwise disjoint. We consider geometric objects in one and two dimensions, i.e., intervals and disks. Let α be the cardinality of the largest independent set. Our goal is to estimate α in a small amount of space, given that the input is received as a one-pass stream. We also consider a generalization of this problem by assigning weights to each object and estimating β, the largest value of a weighted independent set. We initialize the study of this problem in the turnstile streaming model (insertions and deletions) and provide the first algorithms for estimating α and β.
For unit-length intervals, we obtain a (2+ε)-approximation to α and β in poly(log(n)/ε) space. We also show a matching lower bound. Combined with the 3/2-approximation for insertion-only streams by Cabello and Perez-Lanterno [Cabello and Pérez-Lantero, 2017], our result implies a separation between the insertion-only and turnstile model. For unit-radius disks, we obtain a (8√3/π)-approximation to α and β in poly(log(n)/ε) space, which is closely related to the hexagonal circle packing constant.
Finally, we provide algorithms for estimating α for arbitrary-length intervals under a bounded intersection assumption and study the parameterized space complexity of estimating α and β, where the parameter is the ratio of maximum to minimum interval length.

Ainesh Bakshi, Nadiia Chepurko, and David P. Woodruff. Weighted Maximum Independent Set of Geometric Objects in Turnstile Streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 64:1-64:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bakshi_et_al:LIPIcs.APPROX/RANDOM.2020.64, author = {Bakshi, Ainesh and Chepurko, Nadiia and Woodruff, David P.}, title = {{Weighted Maximum Independent Set of Geometric Objects in Turnstile Streams}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)}, pages = {64:1--64:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-164-1}, ISSN = {1868-8969}, year = {2020}, volume = {176}, editor = {Byrka, Jaros{\l}aw and Meka, Raghu}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2020.64}, URN = {urn:nbn:de:0030-drops-126679}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2020.64}, annote = {Keywords: Weighted Maximum Independent Set, Geometric Graphs, Turnstile Streams} }

Document

**Published in:** LIPIcs, Volume 151, 11th Innovations in Theoretical Computer Science Conference (ITCS 2020)

Graph spanners are sparse subgraphs which approximately preserve all pairwise shortest-path distances in an input graph. The notion of approximation can be additive, multiplicative, or both, and many variants of this problem have been extensively studied. We study the problem of computing a graph spanner when the edges of the input graph are distributed across two or more sites in an arbitrary, possibly worst-case partition, and the goal is for the sites to minimize the communication used to output a spanner. We assume the message-passing model of communication, for which there is a point-to-point link between all pairs of sites as well as a coordinator who is responsible for producing the output. We stress that the subset of edges that each site has is not related to the network topology, which is fixed to be point-to-point. While this model has been extensively studied for related problems such as graph connectivity, it has not been systematically studied for graph spanners. We present the first tradeoffs for total communication versus the quality of the spanners computed, for two or more sites, as well as for additive and multiplicative notions of distortion. We show separations in the communication complexity when edges are allowed to occur on multiple sites, versus when each edge occurs on at most one site. We obtain nearly tight bounds (up to polylog factors) for the communication of additive 2-spanners in both the with and without duplication models, multiplicative (2k-1)-spanners in the with duplication model, and multiplicative 3 and 5-spanners in the without duplication model. Our lower bound for multiplicative 3-spanners employs biregular bipartite graphs rather than the usual Erdős girth conjecture graphs and may be of wider interest.

Manuel Fernández V, David P. Woodruff, and Taisuke Yasuda. Graph Spanners in the Message-Passing Model. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 151, pp. 77:1-77:18, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{fernandezv_et_al:LIPIcs.ITCS.2020.77, author = {Fern\'{a}ndez V, Manuel and Woodruff, David P. and Yasuda, Taisuke}, title = {{Graph Spanners in the Message-Passing Model}}, booktitle = {11th Innovations in Theoretical Computer Science Conference (ITCS 2020)}, pages = {77:1--77:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-134-4}, ISSN = {1868-8969}, year = {2020}, volume = {151}, editor = {Vidick, Thomas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2020.77}, URN = {urn:nbn:de:0030-drops-117620}, doi = {10.4230/LIPIcs.ITCS.2020.77}, annote = {Keywords: Graph spanners, Message-passing model, Communication complexity} }

Document

**Published in:** LIPIcs, Volume 151, 11th Innovations in Theoretical Computer Science Conference (ITCS 2020)

A pseudo-deterministic algorithm is a (randomized) algorithm which, when run multiple times on the same input, with high probability outputs the same result on all executions. Classic streaming algorithms, such as those for finding heavy hitters, approximate counting, ?_2 approximation, finding a nonzero entry in a vector (for turnstile algorithms) are not pseudo-deterministic. For example, in the instance of finding a nonzero entry in a vector, for any known low-space algorithm A, there exists a stream x so that running A twice on x (using different randomness) would with high probability result in two different entries as the output.
In this work, we study whether it is inherent that these algorithms output different values on different executions. That is, we ask whether these problems have low-memory pseudo-deterministic algorithms. For instance, we show that there is no low-memory pseudo-deterministic algorithm for finding a nonzero entry in a vector (given in a turnstile fashion), and also that there is no low-dimensional pseudo-deterministic sketching algorithm for ?_2 norm estimation. We also exhibit problems which do have low memory pseudo-deterministic algorithms but no low memory deterministic algorithm, such as outputting a nonzero row of a matrix, or outputting a basis for the row-span of a matrix.
We also investigate multi-pseudo-deterministic algorithms: algorithms which with high probability output one of a few options. We show the first lower bounds for such algorithms. This implies that there are streaming problems such that every low space algorithm for the problem must have inputs where there are many valid outputs, all with a significant probability of being outputted.

Shafi Goldwasser, Ofer Grossman, Sidhanth Mohanty, and David P. Woodruff. Pseudo-Deterministic Streaming. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 151, pp. 79:1-79:25, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{goldwasser_et_al:LIPIcs.ITCS.2020.79, author = {Goldwasser, Shafi and Grossman, Ofer and Mohanty, Sidhanth and Woodruff, David P.}, title = {{Pseudo-Deterministic Streaming}}, booktitle = {11th Innovations in Theoretical Computer Science Conference (ITCS 2020)}, pages = {79:1--79:25}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-134-4}, ISSN = {1868-8969}, year = {2020}, volume = {151}, editor = {Vidick, Thomas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2020.79}, URN = {urn:nbn:de:0030-drops-117644}, doi = {10.4230/LIPIcs.ITCS.2020.79}, annote = {Keywords: streaming, pseudo-deterministic} }

Document

APPROX

**Published in:** LIPIcs, Volume 145, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)

Consider a variant of the Mastermind game in which queries are l_p distances, rather than the usual Hamming distance. That is, a codemaker chooses a hidden vector y in {-k,-k+1,...,k-1,k}^n and answers to queries of the form ||y-x||_p where x in {-k,-k+1,...,k-1,k}^n. The goal is to minimize the number of queries made in order to correctly guess y.
In this work, we show an upper bound of O(min{n,(n log k)/(log n)}) queries for any real 1<=p<infty and O(n) queries for p=infty. To prove this result, we in fact develop a nonadaptive polynomial time algorithm that works for a natural class of separable distance measures, i.e., coordinate-wise sums of functions of the absolute value. We also show matching lower bounds up to constant factors, even for adaptive algorithms for the approximation version of the problem, in which the problem is to output y' such that ||y'-y||_p <= R for any R <= k^{1-epsilon}n^{1/p} for constant epsilon>0. Thus, essentially any approximation of this problem is as hard as finding the hidden vector exactly, up to constant factors. Finally, we show that for the noisy version of the problem, i.e., the setting when the codemaker answers queries with any q = (1 +/- epsilon)||y-x||_p, there is no query efficient algorithm.

Manuel Fernández V, David P. Woodruff, and Taisuke Yasuda. The Query Complexity of Mastermind with l_p Distances. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 1:1-1:11, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{fernandezv_et_al:LIPIcs.APPROX-RANDOM.2019.1, author = {Fern\'{a}ndez V, Manuel and Woodruff, David P. and Yasuda, Taisuke}, title = {{The Query Complexity of Mastermind with l\underlinep Distances}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)}, pages = {1:1--1:11}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-125-2}, ISSN = {1868-8969}, year = {2019}, volume = {145}, editor = {Achlioptas, Dimitris and V\'{e}gh, L\'{a}szl\'{o} A.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2019.1}, URN = {urn:nbn:de:0030-drops-112165}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2019.1}, annote = {Keywords: Mastermind, Query Complexity, l\underlinep Distance} }

Document

APPROX

**Published in:** LIPIcs, Volume 145, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)

One of the oldest problems in the data stream model is to approximate the p-th moment ||X||_p^p = sum_{i=1}^n X_i^p of an underlying non-negative vector X in R^n, which is presented as a sequence of poly(n) updates to its coordinates. Of particular interest is when p in (0,2]. Although a tight space bound of Theta(epsilon^-2 log n) bits is known for this problem when both positive and negative updates are allowed, surprisingly there is still a gap in the space complexity of this problem when all updates are positive. Specifically, the upper bound is O(epsilon^-2 log n) bits, while the lower bound is only Omega(epsilon^-2 + log n) bits. Recently, an upper bound of O~(epsilon^-2 + log n) bits was obtained under the assumption that the updates arrive in a random order.
We show that for p in (0, 1], the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of O~(epsilon^-2 + log n) bits for estimating |X |_p^p. Our techniques also give new upper bounds for estimating the empirical entropy in a stream. On the other hand, we show that for p in (1,2], in the natural coordinator and blackboard distributed communication topologies, there is an O~(epsilon^-2) bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies G, obtaining an O~(epsilon^2 log d) max-communication upper bound, where d is the diameter of G. Interestingly, our upper bound rules out natural communication complexity-based approaches for proving an Omega(epsilon^-2 log n) bit lower bound for p in (1,2] for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter.

Rajesh Jayaram and David P. Woodruff. Towards Optimal Moment Estimation in Streaming and Distributed Models. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 29:1-29:21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{jayaram_et_al:LIPIcs.APPROX-RANDOM.2019.29, author = {Jayaram, Rajesh and Woodruff, David P.}, title = {{Towards Optimal Moment Estimation in Streaming and Distributed Models}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)}, pages = {29:1--29:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-125-2}, ISSN = {1868-8969}, year = {2019}, volume = {145}, editor = {Achlioptas, Dimitris and V\'{e}gh, L\'{a}szl\'{o} A.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2019.29}, URN = {urn:nbn:de:0030-drops-112443}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2019.29}, annote = {Keywords: Streaming, Sketching, Message Passing, Moment Estimation} }

Document

Track A: Algorithms, Complexity and Games

**Published in:** LIPIcs, Volume 132, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)

In this work, we study the k-median and k-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer understanding through the lens of beyond worst-case analysis. Our main motivation is the following: for many applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a k-median or k-means algorithm will produce clusterings which are close to matching the target clustering. Worst-case results can guarantee constant factor approximations to the optimal k-median or k-means objective value, but not closeness to the target clustering.
Our first result is a distributed algorithm which returns a near-optimal clustering assuming a natural notion of stability, namely, approximation stability [Awasthi and Balcan, 2014], even when a constant fraction of the data are outliers. The communication complexity is O~(sk+z) where s is the number of machines, k is the number of clusters, and z is the number of outliers. Next, we show this amount of communication cannot be improved even in the setting when the input satisfies various non-worst-case assumptions. We give a matching Omega(sk+z) lower bound on the communication required both for approximating the optimal k-means or k-median cost up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, Omega(sk+z) is a communication bottleneck, even for real-world instances.

Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, and David P. Woodruff. Robust Communication-Optimal Distributed Clustering Algorithms. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 132, pp. 18:1-18:16, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{awasthi_et_al:LIPIcs.ICALP.2019.18, author = {Awasthi, Pranjal and Bakshi, Ainesh and Balcan, Maria-Florina and White, Colin and Woodruff, David P.}, title = {{Robust Communication-Optimal Distributed Clustering Algorithms}}, booktitle = {46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)}, pages = {18:1--18:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-109-2}, ISSN = {1868-8969}, year = {2019}, volume = {132}, editor = {Baier, Christel and Chatzigiannakis, Ioannis and Flocchini, Paola and Leonardi, Stefano}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2019.18}, URN = {urn:nbn:de:0030-drops-105942}, doi = {10.4230/LIPIcs.ICALP.2019.18}, annote = {Keywords: robust distributed clustering, communication complexity} }

Document

Track A: Algorithms, Complexity and Games

**Published in:** LIPIcs, Volume 132, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)

We consider algorithms with access to an unknown matrix M in F^{n x d} via matrix-vector products, namely, the algorithm chooses vectors v^1, ..., v^q, and observes Mv^1, ..., Mv^q. Here the v^i can be randomized as well as chosen adaptively as a function of Mv^1, ..., Mv^{i-1}. Motivated by applications of sketching in distributed computation, linear algebra, and streaming models, as well as connections to areas such as communication complexity and property testing, we initiate the study of the number q of queries needed to solve various fundamental problems. We study problems in three broad categories, including linear algebra, statistics problems, and graph problems. For example, we consider the number of queries required to approximate the rank, trace, maximum eigenvalue, and norms of a matrix M; to compute the AND/OR/Parity of each column or row of M, to decide whether there are identical columns or rows in M or whether M is symmetric, diagonal, or unitary; or to compute whether a graph defined by M is connected or triangle-free. We also show separations for algorithms that are allowed to obtain matrix-vector products only by querying vectors on the right, versus algorithms that can query vectors on both the left and the right. We also show separations depending on the underlying field the matrix-vector product occurs in. For graph problems, we show separations depending on the form of the matrix (bipartite adjacency versus signed edge-vertex incidence matrix) to represent the graph.
Surprisingly, this fundamental model does not appear to have been studied on its own, and we believe a thorough investigation of problems in this model would be beneficial to a number of different application areas.

Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang. Querying a Matrix Through Matrix-Vector Products. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 132, pp. 94:1-94:16, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{sun_et_al:LIPIcs.ICALP.2019.94, author = {Sun, Xiaoming and Woodruff, David P. and Yang, Guang and Zhang, Jialin}, title = {{Querying a Matrix Through Matrix-Vector Products}}, booktitle = {46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)}, pages = {94:1--94:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-109-2}, ISSN = {1868-8969}, year = {2019}, volume = {132}, editor = {Baier, Christel and Chatzigiannakis, Ioannis and Flocchini, Paola and Leonardi, Stefano}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2019.94}, URN = {urn:nbn:de:0030-drops-106709}, doi = {10.4230/LIPIcs.ICALP.2019.94}, annote = {Keywords: Communication complexity, linear algebra, sketching} }

Document

Track A: Algorithms, Complexity and Games

**Published in:** LIPIcs, Volume 132, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)

In a k-party communication problem, the k players with inputs x_1, x_2, ..., x_k, respectively, want to evaluate a function f(x_1, x_2, ..., x_k) using as little communication as possible. We consider the message-passing model, in which the inputs are partitioned in an arbitrary, possibly worst-case manner, among a smaller number t of players (t<k). The t-player communication cost of computing f can only be smaller than the k-player communication cost, since the t players can trivially simulate the k-player protocol. But how much smaller can it be? We study deterministic and randomized protocols in the one-way model, and provide separations for product input distributions, which are optimal for low error probability protocols. We also provide much stronger separations when the input distribution is non-product.
A key application of our results is in proving lower bounds for data stream algorithms. In particular, we give an optimal Omega(epsilon^{-2}log(N) log log(mM)) bits of space lower bound for the fundamental problem of (1 +/-{epsilon})-approximating the number |x |_0 of non-zero entries of an n-dimensional vector x after m updates each of magnitude M, and with success probability >= 2/3, in a strict turnstile stream. Our result matches the best known upper bound when epsilon >= 1/polylog(mM). It also improves on the prior Omega({epsilon}^{-2}log(mM)) lower bound and separates the complexity of approximating L_0 from approximating the p-norm L_p for p bounded away from 0, since the latter has an O(epsilon^{-2}log(mM)) bit upper bound.

David P. Woodruff and Guang Yang. Separating k-Player from t-Player One-Way Communication, with Applications to Data Streams. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 132, pp. 97:1-97:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{woodruff_et_al:LIPIcs.ICALP.2019.97, author = {Woodruff, David P. and Yang, Guang}, title = {{Separating k-Player from t-Player One-Way Communication, with Applications to Data Streams}}, booktitle = {46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)}, pages = {97:1--97:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-109-2}, ISSN = {1868-8969}, year = {2019}, volume = {132}, editor = {Baier, Christel and Chatzigiannakis, Ioannis and Flocchini, Paola and Leonardi, Stefano}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2019.97}, URN = {urn:nbn:de:0030-drops-106733}, doi = {10.4230/LIPIcs.ICALP.2019.97}, annote = {Keywords: Communication complexity, multi-player communication, one-way communication, streaming complexity} }

Document

**Published in:** LIPIcs, Volume 129, 35th International Symposium on Computational Geometry (SoCG 2019)

We resolve the randomized one-way communication complexity of Dynamic Time Warping (DTW) distance. We show that there is an efficient one-way communication protocol using O~(n/alpha) bits for the problem of computing an alpha-approximation for DTW between strings x and y of length n, and we prove a lower bound of Omega(n / alpha) bits for the same problem. Our communication protocol works for strings over an arbitrary metric of polynomial size and aspect ratio, and we optimize the logarithmic factors depending on properties of the underlying metric, such as when the points are low-dimensional integer vectors equipped with various metrics or have bounded doubling dimension. We also consider linear sketches of DTW, showing that such sketches must have size Omega(n).

Vladimir Braverman, Moses Charikar, William Kuszmaul, David P. Woodruff, and Lin F. Yang. The One-Way Communication Complexity of Dynamic Time Warping Distance. In 35th International Symposium on Computational Geometry (SoCG 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 129, pp. 16:1-16:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{braverman_et_al:LIPIcs.SoCG.2019.16, author = {Braverman, Vladimir and Charikar, Moses and Kuszmaul, William and Woodruff, David P. and Yang, Lin F.}, title = {{The One-Way Communication Complexity of Dynamic Time Warping Distance}}, booktitle = {35th International Symposium on Computational Geometry (SoCG 2019)}, pages = {16:1--16:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-104-7}, ISSN = {1868-8969}, year = {2019}, volume = {129}, editor = {Barequet, Gill and Wang, Yusu}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SoCG.2019.16}, URN = {urn:nbn:de:0030-drops-104203}, doi = {10.4230/LIPIcs.SoCG.2019.16}, annote = {Keywords: dynamic time warping, one-way communication complexity, tree metrics} }

Document

**Published in:** LIPIcs, Volume 116, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)

We study the distinct elements and l_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram{} along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and l_p-heavy hitters that are nearly optimal in both n and epsilon.
Applying our new composable histogram framework, we provide an algorithm that outputs a (1+epsilon)-approximation to the number of distinct elements in the sliding window model and uses O{1/(epsilon^2) log n log (1/epsilon)log log n+ (1/epsilon) log^2 n} bits of space. For l_p-heavy hitters, we provide an algorithm using space O{(1/epsilon^p) log^2 n (log^2 log n+log 1/epsilon)} for 0<p <=2, improving upon the best-known algorithm for l_2-heavy hitters (Braverman et al., COCOON 2014), which has space complexity O{1/epsilon^4 log^3 n}. We also show complementing nearly optimal lower bounds of Omega ((1/epsilon) log^2 n+(1/epsilon^2) log n) for distinct elements and Omega ((1/epsilon^p) log^2 n) for l_p-heavy hitters, both tight up to O{log log n} and O{log 1/epsilon} factors.

Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, and Samson Zhou. Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 116, pp. 7:1-7:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{braverman_et_al:LIPIcs.APPROX-RANDOM.2018.7, author = {Braverman, Vladimir and Grigorescu, Elena and Lang, Harry and Woodruff, David P. and Zhou, Samson}, title = {{Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)}, pages = {7:1--7:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-085-9}, ISSN = {1868-8969}, year = {2018}, volume = {116}, editor = {Blais, Eric and Jansen, Klaus and D. P. Rolim, Jos\'{e} and Steurer, David}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2018.7}, URN = {urn:nbn:de:0030-drops-94118}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2018.7}, annote = {Keywords: Streaming algorithms, sliding windows, heavy hitters, distinct elements} }

Document

**Published in:** LIPIcs, Volume 116, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)

We initiate the study of data dimensionality reduction, or sketching, for the q -> p norms. Given an n x d matrix A, the q -> p norm, denoted |A |_{q -> p} = sup_{x in R^d \ 0} |Ax |_p / |x |_q, is a natural generalization of several matrix and vector norms studied in the data stream and sketching models, with applications to datamining, hardness of approximation, and oblivious routing. We say a distribution S on random matrices L in R^{nd} - > R^k is a (k,alpha)-sketching family if from L(A), one can approximate |A |_{q -> p} up to a factor alpha with constant probability. We provide upper and lower bounds on the sketching dimension k for every p, q in [1, infty], and in a number of cases our bounds are tight. While we mostly focus on constant alpha, we also consider large approximation factors alpha, as well as other variants of the problem such as when A has low rank.

Aditya Krishnan, Sidhanth Mohanty, and David P. Woodruff. On Sketching the q to p Norms. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 116, pp. 15:1-15:20, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{krishnan_et_al:LIPIcs.APPROX-RANDOM.2018.15, author = {Krishnan, Aditya and Mohanty, Sidhanth and Woodruff, David P.}, title = {{On Sketching the q to p Norms}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)}, pages = {15:1--15:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-085-9}, ISSN = {1868-8969}, year = {2018}, volume = {116}, editor = {Blais, Eric and Jansen, Klaus and D. P. Rolim, Jos\'{e} and Steurer, David}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2018.15}, URN = {urn:nbn:de:0030-drops-94192}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2018.15}, annote = {Keywords: Dimensionality Reduction, Norms, Sketching, Streaming} }

Document

**Published in:** LIPIcs, Volume 116, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)

We study the heavy hitters and related sparse recovery problems in the low failure probability regime. This regime is not well-understood, and the main previous work on this is by Gilbert et al. (ICALP'13). We recognize an error in their analysis, improve their results, and contribute new sparse recovery algorithms, as well as provide upper and lower bounds for the heavy hitters problem with low failure probability. Our results are summarized as follows:
1) (Heavy Hitters) We study three natural variants for finding heavy hitters in the strict turnstile model, where the variant depends on the quality of the desired output. For the weakest variant, we give a randomized algorithm improving the failure probability analysis of the ubiquitous Count-Min data structure. We also give a new lower bound for deterministic schemes, resolving a question about this variant posed in Question 4 in the IITK Workshop on Algorithms for Data Streams (2006). Under the strongest and well-studied l_{infty}/ l_2 variant, we show that the classical Count-Sketch data structure is optimal for very low failure probabilities, which was previously unknown.
2) (Sparse Recovery Algorithms) For non-adaptive sparse-recovery, we give sublinear-time algorithms with low-failure probability, which improve upon Gilbert et al. (ICALP'13). In the adaptive case, we improve the failure probability from a constant by Indyk et al. (FOCS '11) to e^{-k^{0.99}}, where k is the sparsity parameter.
3) (Optimal Average-Case Sparse Recovery Bounds) We give matching upper and lower bounds in all parameters, including the failure probability, for the measurement complexity of the l_2/l_2 sparse recovery problem in the spiked-covariance model, completely settling its complexity in this model.

Yi Li, Vasileios Nakos, and David P. Woodruff. On Low-Risk Heavy Hitters and Sparse Recovery Schemes. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 116, pp. 19:1-19:13, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{li_et_al:LIPIcs.APPROX-RANDOM.2018.19, author = {Li, Yi and Nakos, Vasileios and Woodruff, David P.}, title = {{On Low-Risk Heavy Hitters and Sparse Recovery Schemes}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)}, pages = {19:1--19:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-085-9}, ISSN = {1868-8969}, year = {2018}, volume = {116}, editor = {Blais, Eric and Jansen, Klaus and D. P. Rolim, Jos\'{e} and Steurer, David}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2018.19}, URN = {urn:nbn:de:0030-drops-94237}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2018.19}, annote = {Keywords: heavy hitters, sparse recovery, turnstile model, spike covariance model, lower bounds} }

Document

**Published in:** LIPIcs, Volume 107, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)

We revisit one of the classic problems in the data stream literature, namely, that of estimating the frequency moments F_p for 0 < p < 2 of an underlying n-dimensional vector presented as a sequence of additive updates in a stream. It is well-known that using p-stable distributions one can approximate any of these moments up to a multiplicative (1+epsilon)-factor using O(epsilon^{-2} log n) bits of space, and this space bound is optimal up to a constant factor in the turnstile streaming model. We show that surprisingly, if one instead considers the popular random-order model of insertion-only streams, in which the updates to the underlying vector arrive in a random order, then one can beat this space bound and achieve O~(epsilon^{-2} + log n) bits of space, where the O~ hides poly(log(1/epsilon) + log log n) factors. If epsilon^{-2} ~~ log n, this represents a roughly quadratic improvement in the space achievable in turnstile streams. Our algorithm is in fact deterministic, and we show our space bound is optimal up to poly(log(1/epsilon) + log log n) factors for deterministic algorithms in the random order model. We also obtain a similar improvement in space for p = 2 whenever F_2 >~ log n * F_1.

Vladimir Braverman, Emanuele Viola, David P. Woodruff, and Lin F. Yang. Revisiting Frequency Moment Estimation in Random Order Streams. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 25:1-25:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{braverman_et_al:LIPIcs.ICALP.2018.25, author = {Braverman, Vladimir and Viola, Emanuele and Woodruff, David P. and Yang, Lin F.}, title = {{Revisiting Frequency Moment Estimation in Random Order Streams}}, booktitle = {45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)}, pages = {25:1--25:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-076-7}, ISSN = {1868-8969}, year = {2018}, volume = {107}, editor = {Chatzigiannakis, Ioannis and Kaklamanis, Christos and Marx, D\'{a}niel and Sannella, Donald}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2018.25}, URN = {urn:nbn:de:0030-drops-90294}, doi = {10.4230/LIPIcs.ICALP.2018.25}, annote = {Keywords: Data Stream, Frequency Moments, Random Order, Space Complexity, Insertion Only Stream} }

Document

**Published in:** LIPIcs, Volume 107, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)

We consider the problem of sketching the p-th frequency moment of a vector, p>2, with multiplicative error at most 1 +/- epsilon and with high confidence 1-delta. Despite the long sequence of work on this problem, tight bounds on this quantity are only known for constant delta. While one can obtain an upper bound with error probability delta by repeating a sketching algorithm with constant error probability O(log(1/delta)) times in parallel, and taking the median of the outputs, we show this is a suboptimal algorithm! Namely, we show optimal upper and lower bounds of Theta(n^{1-2/p} log(1/delta) + n^{1-2/p} log^{2/p} (1/delta) log n) on the sketching dimension, for any constant approximation. Our result should be contrasted with results for estimating frequency moments for 1 <= p <= 2, for which we show the optimal algorithm for general delta is obtained by repeating the optimal algorithm for constant error probability O(log(1/delta)) times and taking the median output. We also obtain a matching lower bound for this problem, up to constant factors.

Sumit Ganguly and David P. Woodruff. High Probability Frequency Moment Sketches. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 58:1-58:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.ICALP.2018.58, author = {Ganguly, Sumit and Woodruff, David P.}, title = {{High Probability Frequency Moment Sketches}}, booktitle = {45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)}, pages = {58:1--58:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-076-7}, ISSN = {1868-8969}, year = {2018}, volume = {107}, editor = {Chatzigiannakis, Ioannis and Kaklamanis, Christos and Marx, D\'{a}niel and Sannella, Donald}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2018.58}, URN = {urn:nbn:de:0030-drops-90623}, doi = {10.4230/LIPIcs.ICALP.2018.58}, annote = {Keywords: Data Streams, Frequency Moments, High Confidence} }

Document

**Published in:** LIPIcs, Volume 107, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)

In the problem of adaptive compressed sensing, one wants to estimate an approximately k-sparse vector x in R^n from m linear measurements A_1 x, A_2 x,..., A_m x, where A_i can be chosen based on the outcomes A_1 x,..., A_{i-1} x of previous measurements. The goal is to output a vector x^ for which |x-x^|_p <=C * min_{k-sparse x'} |x-x'|_q, with probability at least 2/3, where C > 0 is an approximation factor. Indyk, Price and Woodruff (FOCS'11) gave an algorithm for p=q=2 for C = 1+epsilon with O((k/epsilon) loglog (n/k)) measurements and O(log^*(k) loglog (n)) rounds of adaptivity. We first improve their bounds, obtaining a scheme with O(k * loglog (n/k) + (k/epsilon) * loglog(1/epsilon)) measurements and O(log^*(k) loglog (n)) rounds, as well as a scheme with O((k/epsilon) * loglog (n log (n/k))) measurements and an optimal O(loglog (n)) rounds. We then provide novel adaptive compressed sensing schemes with improved bounds for (p,p) for every 0 < p < 2. We show that the improvement from O(k log(n/k)) measurements to O(k log log (n/k)) measurements in the adaptive setting can persist with a better epsilon-dependence for other values of p and q. For example, when (p,q) = (1,1), we obtain O(k/sqrt{epsilon} * log log n log^3 (1/epsilon)) measurements. We obtain nearly matching lower bounds, showing our algorithms are close to optimal. Along the way, we also obtain the first nearly-optimal bounds for (p,p) schemes for every 0 < p < 2 even in the non-adaptive setting.

Vasileios Nakos, Xiaofei Shi, David P. Woodruff, and Hongyang Zhang. Improved Algorithms for Adaptive Compressed Sensing. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 90:1-90:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{nakos_et_al:LIPIcs.ICALP.2018.90, author = {Nakos, Vasileios and Shi, Xiaofei and Woodruff, David P. and Zhang, Hongyang}, title = {{Improved Algorithms for Adaptive Compressed Sensing}}, booktitle = {45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)}, pages = {90:1--90:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-076-7}, ISSN = {1868-8969}, year = {2018}, volume = {107}, editor = {Chatzigiannakis, Ioannis and Kaklamanis, Christos and Marx, D\'{a}niel and Sannella, Donald}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2018.90}, URN = {urn:nbn:de:0030-drops-90945}, doi = {10.4230/LIPIcs.ICALP.2018.90}, annote = {Keywords: Compressed Sensing, Adaptivity, High-Dimensional Vectors} }

Document

**Published in:** LIPIcs, Volume 94, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018)

This work studies the strong duality of non-convex matrix factorization problems: we show that under certain dual conditions, these problems and its dual have the same optimum. This has been well understood for convex optimization, but little was known for non-convex problems. We propose a novel analytical framework and show that under certain dual conditions, the optimal solution of the matrix factorization program is the same as its bi-dual and thus the global optimality of the non-convex program can be achieved by solving its bi-dual which is convex. These dual conditions are satisfied by a wide class of matrix factorization problems, although matrix factorization problems are hard to solve in full generality. This analytical framework may be of independent interest to non-convex optimization more broadly.
We apply our framework to two prototypical matrix factorization problems: matrix completion and robust Principal Component Analysis (PCA). These are examples of efficiently recovering a hidden matrix given limited reliable observations of it. Our framework shows that exact recoverability and strong duality hold with nearly-optimal sample complexity guarantees for matrix completion and robust PCA.

Maria-Florina Balcan, Yingyu Liang, David P. Woodruff, and Hongyang Zhang. Matrix Completion and Related Problems via Strong Duality. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 94, pp. 5:1-5:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{balcan_et_al:LIPIcs.ITCS.2018.5, author = {Balcan, Maria-Florina and Liang, Yingyu and Woodruff, David P. and Zhang, Hongyang}, title = {{Matrix Completion and Related Problems via Strong Duality}}, booktitle = {9th Innovations in Theoretical Computer Science Conference (ITCS 2018)}, pages = {5:1--5:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-060-6}, ISSN = {1868-8969}, year = {2018}, volume = {94}, editor = {Karlin, Anna R.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2018.5}, URN = {urn:nbn:de:0030-drops-83583}, doi = {10.4230/LIPIcs.ITCS.2018.5}, annote = {Keywords: Non-Convex Optimization, Strong Duality, Matrix Completion, Robust PCA, Sample Complexity} }

Document

**Published in:** LIPIcs, Volume 94, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018)

Understanding the singular value spectrum of an n x n matrix A is a fundamental task in countless numerical computation and data analysis applications. In matrix multiplication time, it is possible to perform a full SVD of A and directly compute the singular values \sigma_1,...,\sigma_n. However, little is known about algorithms that break this runtime barrier.
Using tools from stochastic trace estimation, polynomial approximation, and fast linear system solvers, we show how to efficiently isolate different ranges of A's spectrum and approximate the number of singular values in these ranges. We thus effectively compute an approximate histogram of the spectrum, which can stand in for the true singular values in many applications.
We use our histogram primitive to give the first algorithms for approximating a wide class of symmetric matrix norms and spectral sums faster than the best known runtime for matrix multiplication. For example, we show how to obtain a (1 + \epsilon) approximation to the Schatten 1-norm (i.e. the nuclear or trace norm) in just ~ O((nnz(A)n^{1/3} + n^2)\epsilon^{-3}) time for A with uniform row sparsity or \tilde O(n^{2.18} \epsilon^{-3}) time for dense matrices. The runtime scales smoothly for general Schatten-p norms, notably becoming \tilde O (p nnz(A) \epsilon^{-3}) for any real p >= 2.
At the same time, we show that the complexity of spectrum approximation is inherently tied to fast matrix multiplication in the small \epsilon regime. We use fine-grained complexity to give conditional lower bounds for spectrum approximation, showing that achieving milder \epsilon dependencies in our algorithms would imply triangle detection algorithms for general graphs running in faster than state of the art matrix multiplication time. This further implies, through a reduction of (Williams & William, 2010), that highly accurate spectrum approximation algorithms running in subcubic time can be used to give subcubic time matrix multiplication. As an application of our bounds, we show that precisely computing all effective resistances in a graph in less than matrix multiplication time is likely difficult, barring a major algorithmic breakthrough.

Cameron Musco, Praneeth Netrapalli, Aaron Sidford, Shashanka Ubaru, and David P. Woodruff. Spectrum Approximation Beyond Fast Matrix Multiplication: Algorithms and Hardness. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 94, pp. 8:1-8:21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{musco_et_al:LIPIcs.ITCS.2018.8, author = {Musco, Cameron and Netrapalli, Praneeth and Sidford, Aaron and Ubaru, Shashanka and Woodruff, David P.}, title = {{Spectrum Approximation Beyond Fast Matrix Multiplication: Algorithms and Hardness}}, booktitle = {9th Innovations in Theoretical Computer Science Conference (ITCS 2018)}, pages = {8:1--8:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-060-6}, ISSN = {1868-8969}, year = {2018}, volume = {94}, editor = {Karlin, Anna R.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2018.8}, URN = {urn:nbn:de:0030-drops-83397}, doi = {10.4230/LIPIcs.ITCS.2018.8}, annote = {Keywords: spectrum approximation, matrix norm computation, fine-grained complexity, linear algebra} }

Document

**Published in:** Dagstuhl Reports, Volume 7, Issue 5 (2018)

This report documents the program and the topics discussed of the 4-day
Dagstuhl Seminar 17181 "Theory and Applications of Hashing",
which took place May 1-5, 2017. Four long and eighteen short talks
covered a wide and diverse range of topics within the theme of the workshop.
The program left sufficient space for informal discussions among the 40 participants.

Martin Dietzfelbinger, Michael Mitzenmacher, Rasmus Pagh, David P. Woodruff, and Martin Aumüller. Theory and Applications of Hashing (Dagstuhl Seminar 17181). In Dagstuhl Reports, Volume 7, Issue 5, pp. 1-21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@Article{dietzfelbinger_et_al:DagRep.7.5.1, author = {Dietzfelbinger, Martin and Mitzenmacher, Michael and Pagh, Rasmus and Woodruff, David P. and Aum\"{u}ller, Martin}, title = {{Theory and Applications of Hashing (Dagstuhl Seminar 17181)}}, pages = {1--21}, journal = {Dagstuhl Reports}, ISSN = {2192-5283}, year = {2017}, volume = {7}, number = {5}, editor = {Dietzfelbinger, Martin and Mitzenmacher, Michael and Pagh, Rasmus and Woodruff, David P. and Aum\"{u}ller, Martin}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/DagRep.7.5.1}, URN = {urn:nbn:de:0030-drops-82788}, doi = {10.4230/DagRep.7.5.1}, annote = {Keywords: connections to complexity theory, data streaming applications, hash function construction and analysis, hashing primitives, information retrieval applications, locality-sensitive hashing, machine learning applications} }

Document

Invited Talk

**Published in:** LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

In this invited talk at the European Symposium on Algorithms (ESA), 2017, I will discuss a tool called sketching, which is a form of data dimensionality reduction, and its applications to several problems in high dimensional geometry. In particular, I will show how to obtain the fastest possible algorithms for fundamental problems such as projection onto a flat, and also study generalizations of projection onto more complicated objects such as the union of flats or subspaces. Some of these problems are just least squares regression problems, with many applications in machine learning, numerical linear algebra, and optimization. I will also discuss low rank approximation, with applications to clustering. Finally I will mention a number of other applications of sketching in machine learning, numerical linear algebra, and optimization.

David P. Woodruff. Sketching for Geometric Problems (Invited Talk). In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 1:1-1:5, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{woodruff:LIPIcs.ESA.2017.1, author = {Woodruff, David P.}, title = {{Sketching for Geometric Problems}}, booktitle = {25th Annual European Symposium on Algorithms (ESA 2017)}, pages = {1:1--1:5}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-049-1}, ISSN = {1868-8969}, year = {2017}, volume = {87}, editor = {Pruhs, Kirk and Sohler, Christian}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.1}, URN = {urn:nbn:de:0030-drops-78848}, doi = {10.4230/LIPIcs.ESA.2017.1}, annote = {Keywords: dimensionality reduction, low rank approximation, projection, regression, sketching} }

Document

**Published in:** LIPIcs, Volume 81, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017)

We study matrix sketching methods for regularized variants of linear regression, low rank approximation, and canonical correlation analysis. Our main focus is on sketching techniques which preserve the objective function value for regularized problems, which is an area that has remained largely unexplored. We study regularization both in a fairly broad setting, and in the specific context of the popular and widely used technique of ridge regularization; for the latter, as applied to each of these problems, we show algorithmic resource bounds in which the statistical dimension appears in places where in previous bounds the rank would appear. The statistical dimension is always smaller than the rank, and decreases as the amount of regularization increases. In particular we show this for the ridge low-rank approximation problem as well as regularized low-rank approximation problems in a much more general setting, where the regularizing function satisfies some very general conditions (chiefly, invariance under orthogonal transformations).

Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper Bounds for Regularized Data Fitting. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 81, pp. 27:1-27:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{avron_et_al:LIPIcs.APPROX-RANDOM.2017.27, author = {Avron, Haim and Clarkson, Kenneth L. and Woodruff, David P.}, title = {{Sharper Bounds for Regularized Data Fitting}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017)}, pages = {27:1--27:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-044-6}, ISSN = {1868-8969}, year = {2017}, volume = {81}, editor = {Jansen, Klaus and Rolim, Jos\'{e} D. P. and Williamson, David P. and Vempala, Santosh S.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2017.27}, URN = {urn:nbn:de:0030-drops-75761}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2017.27}, annote = {Keywords: Matrices, Regression, Low-rank approximation, Regularization, Canonical Correlation Analysis} }

Document

**Published in:** LIPIcs, Volume 80, 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017)

Sketching has emerged as a powerful technique for speeding up problems in numerical linear algebra, such as regression. In the overconstrained regression problem, one is given an n x d matrix A, with n >> d, as well as an n x 1 vector b, and one wants to find a vector \hat{x} so as to minimize the residual error ||Ax-b||_2. Using the sketch and solve paradigm, one first computes S \cdot A and S \cdot b for a randomly chosen matrix S, then outputs x' = (SA)^{\dagger} Sb so as to minimize || SAx' - Sb||_2.
The sketch-and-solve paradigm gives a bound on ||x'-x^*||_2 when A is well-conditioned. Our main result is that, when S is the subsampled randomized Fourier/Hadamard transform, the error x' - x^* behaves as if it lies in a "random" direction within this bound: for any fixed direction a in R^d, we have with 1 - d^{-c} probability that
(1) \langle a, x'-x^* \rangle \lesssim \frac{ \|a\|_2\|x'-x^*\|_2}{d^{\frac{1}{2}-\gamma}},
where c, \gamma > 0 are arbitrary constants. This implies ||x'-x^*||_{\infty} is a factor d^{\frac{1}{2}-\gamma} smaller than ||x'-x^*||_2. It also gives a better bound on the generalization of x' to new examples: if rows of A correspond to examples and columns to features, then our result gives a better bound for the error introduced by sketch-and-solve when classifying fresh examples. We show that not all oblivious subspace embeddings S satisfy these properties. In particular, we give counterexamples showing that matrices based on Count-Sketch or leverage score sampling do not satisfy these properties.
We also provide lower bounds, both on how small ||x'-x^*||_2 can be, and for our new guarantee (1), showing that the subsampled randomized Fourier/Hadamard transform is nearly optimal. Our lower bound on ||x'-x^*||_2 shows that there is an O(1/epsilon) separation in the dimension of the optimal oblivious subspace embedding required for outputting an x' for which ||x'-x^*||_2 <= epsilon ||Ax^*-b||_2 \cdot ||A^{\dagger}||_2$, compared to the dimension of the optimal oblivious subspace embedding required for outputting an x' for which ||Ax'-b||_2 <= (1+epsilon)||Ax^*-b||_2, that is, the former problem requires dimension Omega(d/epsilon^2) while the latter problem can be solved with dimension O(d/epsilon). This explains the reason known upper bounds on the dimensions of these two variants of regression have differed in prior work.

Eric Price, Zhao Song, and David P. Woodruff. Fast Regression with an $ell_infty$ Guarantee. In 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 80, pp. 59:1-59:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{price_et_al:LIPIcs.ICALP.2017.59, author = {Price, Eric and Song, Zhao and Woodruff, David P.}, title = {{Fast Regression with an \$ell\underlineinfty\$ Guarantee}}, booktitle = {44th International Colloquium on Automata, Languages, and Programming (ICALP 2017)}, pages = {59:1--59:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-041-5}, ISSN = {1868-8969}, year = {2017}, volume = {80}, editor = {Chatzigiannakis, Ioannis and Indyk, Piotr and Kuhn, Fabian and Muscholl, Anca}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2017.59}, URN = {urn:nbn:de:0030-drops-74488}, doi = {10.4230/LIPIcs.ICALP.2017.59}, annote = {Keywords: Linear regression, Count-Sketch, Gaussians, Leverage scores, ell\underlineinfty-guarantee} }

Document

**Published in:** LIPIcs, Volume 80, 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017)

Given an n×d matrix A, its Schatten-p norm, p >= 1, is defined as |A|_p = (sum_{i=1}^rank(A) sigma(i)^p）^{1/p} where sigma_i(A) is the i-th largest singular value of A. These norms have been studied in functional analysis in the context of non-commutative L_p-spaces, and recently in data stream and linear sketching models of computation. Basic questions on the relations between these norms, such as their embeddability, are still open. Specifically, given a set of matrices A_1, ... , A_poly(nd) in R^{n x d}, suppose we want to construct a linear map L such that L(A_i) in R^{n' x d'} for each i, where n' < n and d' < d, and further, |A_i|p <= |L(A_i)|_q <= D_{p,q}|A_i|_p for a given approximation factor D_{p,q} and real number q >= 1. Then how large do n' and d' need to be as a function of D_{p,q}?
We nearly resolve this question for every p, q >= 1, for the case where L(A_i) can be expressed as R*A_i*S, where R and S are arbitrary matrices that are allowed to depend on A_1, ... ,A_t, that is, L(A_i) can be implemented by left and right matrix multiplication. Namely, for every p, q >= 1, we provide nearly matching upper and lower bounds on the size of n' and d' as a function of D_{p,q}. Importantly, our upper bounds are oblivious, meaning that R and S do not depend on the A_i, while our lower bounds hold even if R and S depend on the A_i. As an application of our upper bounds, we answer a recent open question of Blasiok et al. about space-approximation trade-offs for the Schatten 1-norm, showing in a data stream it is possible to estimate the Schatten-1 norm up to a factor of D >= 1 using O~(min(n, d)^2/D^4) space.

Yi Li and David P. Woodruff. Embeddings of Schatten Norms with Applications to Data Streams. In 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 80, pp. 60:1-60:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{li_et_al:LIPIcs.ICALP.2017.60, author = {Li, Yi and Woodruff, David P.}, title = {{Embeddings of Schatten Norms with Applications to Data Streams}}, booktitle = {44th International Colloquium on Automata, Languages, and Programming (ICALP 2017)}, pages = {60:1--60:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-041-5}, ISSN = {1868-8969}, year = {2017}, volume = {80}, editor = {Chatzigiannakis, Ioannis and Indyk, Piotr and Kuhn, Fabian and Muscholl, Anca}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2017.60}, URN = {urn:nbn:de:0030-drops-73726}, doi = {10.4230/LIPIcs.ICALP.2017.60}, annote = {Keywords: data stream algorithms, embeddings, matrix norms, sketching} }

Document

**Published in:** LIPIcs, Volume 60, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016)

We consider the following oblivious sketching problem: given epsilon in (0,1/3) and n >= d/epsilon^2, design a distribution D over R^{k * nd} and a function f: R^k * R^{nd} -> R}, so that for any n * d matrix A, Pr_{S sim D} [(1-epsilon) |A|_{op} <= f(S(A),S) <= (1+epsilon)|A|_{op}] >= 2/3, where |A|_{op} = sup_{x:|x|_2 = 1} |Ax|_2 is the operator norm of A and S(A) denotes S * A, interpreting A as a vector in R^{nd}. We show a tight lower bound of k = Omega(d^2/epsilon^2) for this problem. Previously, Nelson and Nguyen (ICALP, 2014) considered the problem of finding a distribution D over R^{k * n} such that for any n * d matrix A, Pr_{S sim D}[forall x, (1-epsilon)|Ax|_2 <= |SAx|_2 <= (1+epsilon)|Ax|_2] >= 2/3, which is called an oblivious subspace embedding (OSE). Our result considerably strengthens theirs, as it (1) applies only to estimating the operator norm, which can be estimated given any OSE, and (2) applies to distributions over general linear operators S which treat A as a vector and compute S(A), rather than the restricted class of linear operators corresponding to matrix multiplication. Our technique also implies the first tight bounds for approximating the Schatten p-norm for even integers p via general linear sketches, improving the previous lower bound from k = Omega(n^{2-6/p}) [Regev, 2014] to k = Omega(n^{2-4/p}). Importantly, for sketching the operator norm up to a factor of alpha, where alpha - 1 = \Omega(1), we obtain a tight k = Omega(n^2/alpha^4) bound, matching the upper bound of Andoni and Nguyen (SODA, 2013), and improving the previous k = Omega(n^2/\alpha^6) lower bound. Finally, we also obtain the first lower bounds for approximating Ky Fan norms.

Yi Li and David P. Woodruff. Tight Bounds for Sketching the Operator Norm, Schatten Norms, and Subspace Embeddings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 60, pp. 39:1-39:11, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{li_et_al:LIPIcs.APPROX-RANDOM.2016.39, author = {Li, Yi and Woodruff, David P.}, title = {{Tight Bounds for Sketching the Operator Norm, Schatten Norms, and Subspace Embeddings}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016)}, pages = {39:1--39:11}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-018-7}, ISSN = {1868-8969}, year = {2016}, volume = {60}, editor = {Jansen, Klaus and Mathieu, Claire and Rolim, Jos\'{e} D. P. and Umans, Chris}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2016.39}, URN = {urn:nbn:de:0030-drops-66620}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2016.39}, annote = {Keywords: data streams, sketching, matrix norms, subspace embeddings} }

Document

**Published in:** LIPIcs, Volume 55, 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016)

We prove, using the subspace embedding guarantee in a black box way, that one can achieve the spectral norm guarantee for approximate matrix multiplication with a dimensionality-reducing map having m = O(˜r/epsilon^2) rows. Here r˜ is the maximum stable rank, i.e., the squared ratio of Frobenius and operator norms, of the two matrices being multiplied. This is a quantitative improvement over previous work of [Magen and Zouzias, SODA, 2011] and [Kyrillidis et al., arXiv, 2014] and is also optimal for any oblivious dimensionality-reducing map. Furthermore, due to the black box reliance on the subspace embedding property in our proofs, our theorem can be applied to a much more general class of sketching matrices than what was known before, in addition to achieving better bounds. For example, one can apply our theorem to efficient subspace embeddings such as the Subsampled Randomized Hadamard Transform or sparse subspace embeddings, or even with subspace embedding constructions that may be developed in the future.
Our main theorem, via connections with spectral error matrix multiplication proven in previous work, implies quantitative improvements for approximate least squares regression and low rank approximation, and implies faster low rank approximation for popular kernels in machine learning such as the gaussian and Sobolev kernels. Our main result has also already been applied to improve dimensionality reduction guarantees for k-means clustering, and also implies new results for nonparametric regression.
Lastly, we point out that the proof of the "BSS" deterministic row-sampling result of [Batson et al., SICOMP, 2012] can be modified to obtain deterministic row-sampling for approximate matrix product in terms of the stable rank of the matrices. The original "BSS" proof was in terms of the rank rather than the stable rank.

Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal Approximate Matrix Product in Terms of Stable Rank. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 55, pp. 11:1-11:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{cohen_et_al:LIPIcs.ICALP.2016.11, author = {Cohen, Michael B. and Nelson, Jelani and Woodruff, David P.}, title = {{Optimal Approximate Matrix Product in Terms of Stable Rank}}, booktitle = {43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016)}, pages = {11:1--11:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-013-2}, ISSN = {1868-8969}, year = {2016}, volume = {55}, editor = {Chatzigiannakis, Ioannis and Mitzenmacher, Michael and Rabani, Yuval and Sangiorgi, Davide}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2016.11}, URN = {urn:nbn:de:0030-drops-62788}, doi = {10.4230/LIPIcs.ICALP.2016.11}, annote = {Keywords: subspace embeddings, approximate matrix multiplication, stable rank, regression, low rank approximation} }

Document

**Published in:** LIPIcs, Volume 57, 24th Annual European Symposium on Algorithms (ESA 2016)

We address the trade-off between the computational resources needed to process a large data set and the number of samples available from the data set. Specifically, we consider the following abstraction: we receive a potentially infinite stream of IID samples from some unknown distribution D, and are tasked with computing some function f(D). If the stream is observed for time t, how much memory, s, is required to estimate f(D)? We refer to t as the sample complexity and s as the space complexity. The main focus of this paper is investigating the trade-offs between the space and sample complexity. We study these trade-offs for several canonical problems studied in the data stream model: estimating the collision probability, i.e., the second moment of a distribution, deciding if a graph is connected, and approximating the dimension of an unknown subspace. Our results are based on techniques for simulating different classical sampling procedures in this model, emulating random walks given a sequence of IID samples, as well as leveraging a characterization between communication bounded protocols and statistical query algorithms.

Michael Crouch, Andrew McGregor, Gregory Valiant, and David P. Woodruff. Stochastic Streams: Sample Complexity vs. Space Complexity. In 24th Annual European Symposium on Algorithms (ESA 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 57, pp. 32:1-32:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{crouch_et_al:LIPIcs.ESA.2016.32, author = {Crouch, Michael and McGregor, Andrew and Valiant, Gregory and Woodruff, David P.}, title = {{Stochastic Streams: Sample Complexity vs. Space Complexity}}, booktitle = {24th Annual European Symposium on Algorithms (ESA 2016)}, pages = {32:1--32:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-015-6}, ISSN = {1868-8969}, year = {2016}, volume = {57}, editor = {Sankowski, Piotr and Zaroliagis, Christos}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2016.32}, URN = {urn:nbn:de:0030-drops-63838}, doi = {10.4230/LIPIcs.ESA.2016.32}, annote = {Keywords: data streams, sample complexity, frequency moments, graph connectivity, subspace approximation} }

Document

**Published in:** LIPIcs, Volume 50, 31st Conference on Computational Complexity (CCC 2016)

Recently, [Li, Nguyen, Woodruff, STOC 2014] showed any 1-pass constant probability streaming algorithm for computing a relation f on a vector x in {-m, -(m-1), ..., m}^n presented in the turnstile data stream model can be implemented by maintaining a linear sketch Ax mod q, where A is an r times n integer matrix and q = (q_1, ..., q_r) is a vector of positive integers. The space complexity of maintaining Ax mod q, not including the random bits used for sampling A and q, matches the space of the optimal algorithm.
We give multiple strengthenings of this reduction, together with new applications. In particular, we show how to remove the following shortcomings of their reduction:
1. The Box Constraint. Their reduction applies only to algorithms that must be correct even if x_{infinity} = max_{i in [n]} |x_i| is allowed to be much larger than m at intermediate points in the stream, provided that x is in {-m, -(m-1), ..., m}^n at the end of the stream. We give a condition under which the optimal algorithm is a linear sketch even if it works only when promised that x is in {-m, -(m-1), ..., m}^n at all points in the stream. Using this, we show the first super-constant Omega(log m) bits lower bound for the problem of maintaining a counter up to an additive epsilon*m error in a turnstile stream, where epsilon is any constant in (0, 1/2). Previous lower bounds are based on communication complexity and are only for relative error approximation; interestingly, we do not know how to prove our result using communication complexity. More generally, we show the first super-constant Omega(log(m)) lower bound for additive approximation of l_p-norms; this bound is tight for p in [1, 2].
2. Negative Coordinates. Their reduction allows x_i to be negative while processing the stream. We show an equivalence between 1-pass algorithms and linear sketches Ax mod q in dynamic graph streams, or more generally, the strict turnstile model, in which for all i in [n], x_i is nonnegative at all points in the stream. Combined with [Assadi, Khanna, Li, Yaroslavtsev, SODA 2016], this resolves the 1-pass space complexity of approximating the maximum matching in a dynamic graph stream, answering a question in that work.
3. 1-Pass Restriction. Their reduction only applies to 1-pass data stream algorithms in the turnstile model, while there exist algorithms for heavy hitters and for low rank approximation which provably do better with multiple passes. We extend the reduction to algorithms which make any number of passes, showing the optimal algorithm is to choose a new linear sketch at the beginning of each pass, based on the output of previous passes.

Yuqing Ai, Wei Hu, Yi Li, and David P. Woodruff. New Characterizations in Turnstile Streams with Applications. In 31st Conference on Computational Complexity (CCC 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 50, pp. 20:1-20:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{ai_et_al:LIPIcs.CCC.2016.20, author = {Ai, Yuqing and Hu, Wei and Li, Yi and Woodruff, David P.}, title = {{New Characterizations in Turnstile Streams with Applications}}, booktitle = {31st Conference on Computational Complexity (CCC 2016)}, pages = {20:1--20:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-008-8}, ISSN = {1868-8969}, year = {2016}, volume = {50}, editor = {Raz, Ran}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CCC.2016.20}, URN = {urn:nbn:de:0030-drops-58337}, doi = {10.4230/LIPIcs.CCC.2016.20}, annote = {Keywords: communication complexity, data streams, dynamic graph streams, norm estimation} }

Document

Invited Talk

**Published in:** LIPIcs, Volume 48, 19th International Conference on Database Theory (ICDT 2016)

An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-k, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the l_1-heavy hitters and l_2-heavy hitters. There are a number of algorithmic solutions for these problems, starting with the work of Misra and Gries, as well as the CountMin and CountSketch data structures, among others.
In this paper (accompanying an invited talk) we cover several recent results developed in this area, which improve upon the classical solutions to these problems. In particular, we develop new algorithms for finding l_1-heavy hitters and l_2-heavy hitters, with significantly less memory required than what was known, and which are optimal in a number of parameter regimes.

David P. Woodruff. New Algorithms for Heavy Hitters in Data Streams (Invited Talk). In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 4:1-4:12, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{woodruff:LIPIcs.ICDT.2016.4, author = {Woodruff, David P.}, title = {{New Algorithms for Heavy Hitters in Data Streams}}, booktitle = {19th International Conference on Database Theory (ICDT 2016)}, pages = {4:1--4:12}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-002-6}, ISSN = {1868-8969}, year = {2016}, volume = {48}, editor = {Martens, Wim and Zeume, Thomas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2016.4}, URN = {urn:nbn:de:0030-drops-57739}, doi = {10.4230/LIPIcs.ICDT.2016.4}, annote = {Keywords: data streams, heavy hitters} }

Document

**Published in:** LIPIcs, Volume 40, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015)

Despite the large amount of work on solving graph problems in the data stream model, there do not exist tight space bounds for almost any of them, even in a stream with only edge insertions. For example, for testing connectivity, the upper bound is O(n * log(n)) bits, while the lower bound is only Omega(n) bits. We remedy this situation by providing the first tight Omega(n * log(n)) space lower bounds for randomized algorithms which succeed with constant probability in a stream of edge insertions for a number of graph problems. Our lower bounds apply to testing bipartiteness, connectivity, cycle-freeness, whether a graph is Eulerian, planarity, H-minor freeness, finding a minimum spanning tree of a connected graph, and testing if the diameter of a sparse graph is constant. We also give the first Omega(n * k * log(n)) space lower bounds for deterministic algorithms for k-edge connectivity and k-vertex connectivity; these are optimal in light of known deterministic upper bounds (for k-vertex connectivity we also need to allow edge duplications, which known upper bounds allow). Finally, we give an Omega(n * log^2(n)) lower bound for randomized algorithms approximating the minimum cut up to a constant factor with constant probability in a graph with integer weights between 1 and n, presented as a stream of insertions and deletions to its edges. This lower bound also holds for cut sparsifiers, and gives the first separation of maintaining a sparsifier in the data stream model versus the offline model.

Xiaoming Sun and David P. Woodruff. Tight Bounds for Graph Problems in Insertion Streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 40, pp. 435-448, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2015)

Copy BibTex To Clipboard

@InProceedings{sun_et_al:LIPIcs.APPROX-RANDOM.2015.435, author = {Sun, Xiaoming and Woodruff, David P.}, title = {{Tight Bounds for Graph Problems in Insertion Streams}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015)}, pages = {435--448}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-939897-89-7}, ISSN = {1868-8969}, year = {2015}, volume = {40}, editor = {Garg, Naveen and Jansen, Klaus and Rao, Anup and Rolim, Jos\'{e} D. P.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2015.435}, URN = {urn:nbn:de:0030-drops-53160}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2015.435}, annote = {Keywords: communication complexity, data streams, graphs, space complexity} }

Document

**Published in:** LIPIcs, Volume 28, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014)

The EQUALITY problem is usually one’s first encounter with communication complexity and is one of the most fundamental problems in the field. Although its deterministic and randomized communication complexity were settled decades ago, we find several new things to say about the problem by focusing on three subtle aspects. The first is to consider the expected communication cost (at a worst-case input) for a protocol that uses limited interaction—i.e., a bounded number of rounds of communication—and whose error probability is zero or close to it. The second is to treat the false negative error rate separately from the false positive error rate. The third is to consider the information cost of such protocols. We obtain asymptotically optimal rounds-versus-cost tradeoffs for EQUALITY: both expected communication cost and information cost scale as Theta(log log ... log n), with r-1 logs, where r is the number of rounds. These bounds hold even when the false negative rate approaches 1. For the case of zero-error communication cost, we obtain essentially matching bounds, up to a tiny additive constant. We also provide some applications.

Joshua Brody, Amit Chakrabarti, Ranganath Kondapally, David P. Woodruff, and Grigory Yaroslavtsev. Certifying Equality With Limited Interaction. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014). Leibniz International Proceedings in Informatics (LIPIcs), Volume 28, pp. 545-581, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2014)

Copy BibTex To Clipboard

@InProceedings{brody_et_al:LIPIcs.APPROX-RANDOM.2014.545, author = {Brody, Joshua and Chakrabarti, Amit and Kondapally, Ranganath and Woodruff, David P. and Yaroslavtsev, Grigory}, title = {{Certifying Equality With Limited Interaction}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014)}, pages = {545--581}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-939897-74-3}, ISSN = {1868-8969}, year = {2014}, volume = {28}, editor = {Jansen, Klaus and Rolim, Jos\'{e} and Devanur, Nikhil R. and Moore, Cristopher}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2014.545}, URN = {urn:nbn:de:0030-drops-47229}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2014.545}, annote = {Keywords: equality, communication complexity, information complexity} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail