Search Results

Documents authored by Wang, Dingyu


Document
Track A: Algorithms, Complexity and Games
Non-Mergeable Sketching for Cardinality Estimation

Authors: Seth Pettie, Dingyu Wang, and Longhui Yin

Published in: LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)


Abstract
Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance. We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies 5m bits and whose relative variance is 2/m (standard error √{2/m}) has an MVP of 10. Our contributions are as follows. - Cohen [Edith Cohen, 2015] and Ting [Daniel Ting, 2014] independently discovered what we call the {Martingale transform} for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches. - Pettie and Wang proved that the Fishmonger sketch [Seth Pettie and Dingyu Wang, 2021] has the best MVP, H₀/I₀ ≈ 1.98, among a class of mergeable sketches called "linearizable" sketches. (H₀ and I₀ are precisely defined constants.) We prove that the Martingale transform is optimal in the non-mergeable world, and that Martingale Fishmonger in particular is optimal among linearizable sketches, with an MVP of H₀/2 ≈ 1.63. E.g., this is circumstantial evidence that to achieve 1% standard error, we cannot do better than a 2 kilobyte sketch. - Martingale Fishmonger is neither simple nor practical. We develop a new mergeable sketch called Curtain that strikes a nice balance between simplicity and efficiency, and prove that Martingale Curtain has limiting MVP≈ 2.31. It can be updated with O(1) memory accesses and it has lower empirical variance than Martingale LogLog, a practical non-mergeable version of HyperLogLog.

Cite as

Seth Pettie, Dingyu Wang, and Longhui Yin. Non-Mergeable Sketching for Cardinality Estimation. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 104:1-104:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{pettie_et_al:LIPIcs.ICALP.2021.104,
  author =	{Pettie, Seth and Wang, Dingyu and Yin, Longhui},
  title =	{{Non-Mergeable Sketching for Cardinality Estimation}},
  booktitle =	{48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)},
  pages =	{104:1--104:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-195-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{198},
  editor =	{Bansal, Nikhil and Merelli, Emanuela and Worrell, James},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.104},
  URN =		{urn:nbn:de:0030-drops-141731},
  doi =		{10.4230/LIPIcs.ICALP.2021.104},
  annote =	{Keywords: Cardinality Estimation, Sketching}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail