Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance. We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies 5m bits and whose relative variance is 2/m (standard error √{2/m}) has an MVP of 10. Our contributions are as follows. - Cohen [Edith Cohen, 2015] and Ting [Daniel Ting, 2014] independently discovered what we call the {Martingale transform} for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches. - Pettie and Wang proved that the Fishmonger sketch [Seth Pettie and Dingyu Wang, 2021] has the best MVP, H₀/I₀ ≈ 1.98, among a class of mergeable sketches called "linearizable" sketches. (H₀ and I₀ are precisely defined constants.) We prove that the Martingale transform is optimal in the non-mergeable world, and that Martingale Fishmonger in particular is optimal among linearizable sketches, with an MVP of H₀/2 ≈ 1.63. E.g., this is circumstantial evidence that to achieve 1% standard error, we cannot do better than a 2 kilobyte sketch. - Martingale Fishmonger is neither simple nor practical. We develop a new mergeable sketch called Curtain that strikes a nice balance between simplicity and efficiency, and prove that Martingale Curtain has limiting MVP≈ 2.31. It can be updated with O(1) memory accesses and it has lower empirical variance than Martingale LogLog, a practical non-mergeable version of HyperLogLog.
@InProceedings{pettie_et_al:LIPIcs.ICALP.2021.104, author = {Pettie, Seth and Wang, Dingyu and Yin, Longhui}, title = {{Non-Mergeable Sketching for Cardinality Estimation}}, booktitle = {48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)}, pages = {104:1--104:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-195-5}, ISSN = {1868-8969}, year = {2021}, volume = {198}, editor = {Bansal, Nikhil and Merelli, Emanuela and Worrell, James}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.104}, URN = {urn:nbn:de:0030-drops-141731}, doi = {10.4230/LIPIcs.ICALP.2021.104}, annote = {Keywords: Cardinality Estimation, Sketching} }
Feedback for Dagstuhl Publishing