License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ICALP.2021.104
URN: urn:nbn:de:0030-drops-141731
Go to the corresponding LIPIcs Volume Portal

Pettie, Seth ; Wang, Dingyu ; Yin, Longhui

Non-Mergeable Sketching for Cardinality Estimation

LIPIcs-ICALP-2021-104.pdf (1 MB)


Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance.
We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies 5m bits and whose relative variance is 2/m (standard error √{2/m}) has an MVP of 10. Our contributions are as follows.
- Cohen [Edith Cohen, 2015] and Ting [Daniel Ting, 2014] independently discovered what we call the {Martingale transform} for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches.
- Pettie and Wang proved that the Fishmonger sketch [Seth Pettie and Dingyu Wang, 2021] has the best MVP, H₀/I₀ ≈ 1.98, among a class of mergeable sketches called "linearizable" sketches. (H₀ and I₀ are precisely defined constants.) We prove that the Martingale transform is optimal in the non-mergeable world, and that Martingale Fishmonger in particular is optimal among linearizable sketches, with an MVP of H₀/2 ≈ 1.63. E.g., this is circumstantial evidence that to achieve 1% standard error, we cannot do better than a 2 kilobyte sketch.
- Martingale Fishmonger is neither simple nor practical. We develop a new mergeable sketch called Curtain that strikes a nice balance between simplicity and efficiency, and prove that Martingale Curtain has limiting MVP≈ 2.31. It can be updated with O(1) memory accesses and it has lower empirical variance than Martingale LogLog, a practical non-mergeable version of HyperLogLog.

BibTeX - Entry

  author =	{Pettie, Seth and Wang, Dingyu and Yin, Longhui},
  title =	{{Non-Mergeable Sketching for Cardinality Estimation}},
  booktitle =	{48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)},
  pages =	{104:1--104:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-195-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{198},
  editor =	{Bansal, Nikhil and Merelli, Emanuela and Worrell, James},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-141731},
  doi =		{10.4230/LIPIcs.ICALP.2021.104},
  annote =	{Keywords: Cardinality Estimation, Sketching}

Keywords: Cardinality Estimation, Sketching
Collection: 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)
Issue Date: 2021
Date of publication: 02.07.2021

DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI