Non-Mergeable Sketching for Cardinality Estimation

Pettie, Seth; Wang, Dingyu; Yin, Longhui

doi:10.4230/LIPIcs.ICALP.2021.104

Abstract

Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance.
We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies 5m bits and whose relative variance is 2/m (standard error √{2/m}) has an MVP of 10. Our contributions are as follows.
- Cohen [Edith Cohen, 2015] and Ting [Daniel Ting, 2014] independently discovered what we call the {Martingale transform} for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches.
- Pettie and Wang proved that the Fishmonger sketch [Seth Pettie and Dingyu Wang, 2021] has the best MVP, H₀/I₀ ≈ 1.98, among a class of mergeable sketches called "linearizable" sketches. (H₀ and I₀ are precisely defined constants.) We prove that the Martingale transform is optimal in the non-mergeable world, and that Martingale Fishmonger in particular is optimal among linearizable sketches, with an MVP of H₀/2 ≈ 1.63. E.g., this is circumstantial evidence that to achieve 1% standard error, we cannot do better than a 2 kilobyte sketch.
- Martingale Fishmonger is neither simple nor practical. We develop a new mergeable sketch called Curtain that strikes a nice balance between simplicity and efficiency, and prove that Martingale Curtain has limiting MVP≈ 2.31. It can be updated with O(1) memory accesses and it has lower empirical variance than Martingale LogLog, a practical non-mergeable version of HyperLogLog.

Noga Alon, Phillip B. Gibbons, Yossi Matias, and Mario Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings 18th ACM Symposium on Principles of Database Systems (PODS), pages 10-20, 1999. URL: https://doi.org/10.1145/303976.303978.
Daniel N Baker and Ben Langmead. Dashing: Fast and accurate genomic distances with hyperloglog. bioRxiv, 2019. URL: https://doi.org/10.1101/501726.
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Proceedings 6th International Workshop on Randomization and Approximation Techniques (RANDOM), volume 2483 of Lecture Notes in Computer Science, pages 1-10, 2002. URL: https://doi.org/10.1007/3-540-45726-7_1.
Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 623-632, 2002.
Ran Ben-Basat, Gil Einziger, Shir Landau Feibish, Jalil Moraney, and Danny Raz. Network-wide routing-oblivious heavy hitters. In Proceedings of the 2018 Symposium on Architectures for Networking and Communications Systems (ANCS), pages 66-73, 2018. URL: https://doi.org/10.1145/3230718.3230729.
Jarosław Błasiok. Optimal streaming and tracking distinct elements with high probability. ACM Trans. Algorithms, 16(1):3:1-3:28, 2020. URL: https://doi.org/10.1145/3309193.
Andrei Z. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES, pages 21-29, 1997. URL: https://doi.org/10.1109/SEQUEN.1997.666900.
Thilina Buddhika, Matthew Malensek, Sangmi Lee Pallickara, and Shrideep Pallickara. Synopsis: A distributed sketch over voluminous spatiotemporal observational streams. IEEE Trans. Knowl. Data Eng., 29(11):2552-2566, 2017. URL: https://doi.org/10.1109/TKDE.2017.2734661.
Philippe Chassaing and Lucas Gerin. Efficient estimation of the cardinality of large data sets. In Proceedings of the 4th Colloquium on Mathematics and Computer Science Algorithms, Trees, Combinatorics and Probabilities, 2006.
Aiyou Chen, Jin Cao, Larry Shepp, and Tuan Nguyen. Distinct counting with a self-learning bitmap. Journal of the American Statistical Association, 106(495):879-890, 2011. URL: https://doi.org/10.1198/jasa.2011.ap10217.
Min Chen, Shigang Chen, and Zhiping Cai. Counter tree: A scalable counter architecture for per-flow traffic measurement. IEEE/ACM Trans. Netw., 25(2):1249-1262, 2017. URL: https://doi.org/10.1109/TNET.2016.2621159.
Pern Hui Chia, Damien Desfontaines, Irippuge Milinda Perera, Daniel Simmons-Marengo, Chao Li, Wei-Yen Day, Qiushi Wang, and Miguel Guevara. KHyperLogLog: Estimating reidentifiability and joinability of large data at scale. In Proceedings of the 2019 IEEE Symposium on Security and Privacy, pages 350-364, 2019. URL: https://doi.org/10.1109/SP.2019.00046.
Edith Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441-453, 1997. URL: https://doi.org/10.1006/jcss.1997.1534.
Edith Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. IEEE Trans. Knowl. Data Eng., 27(9):2320-2334, 2015. URL: https://doi.org/10.1109/TKDE.2015.2411606.
Edith Cohen and Haim Kaplan. Tighter estimation using bottom k sketches. Proc. VLDB Endow., 1(1):213-224, 2008. URL: https://doi.org/10.14778/1453856.1453884.
Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities. In Proceedings 11th Annual European Symposium on Algorithms (ESA), volume 2832 of Lecture Notes in Computer Science, pages 605-617. Springer, 2003. URL: https://doi.org/10.1007/978-3-540-39658-1_55.
R. A. Leo Elworth, Qi Wang, Pavan K. Kota, C. J. Barberan, Benjamin Coleman, Advait Balaji, Gaurav Gupta, Richard G. Baraniuk, Anshumali Shrivastava, and Todd J. Treangen. To petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Research, 48(10):5217-5234, 2020. URL: https://doi.org/10.1093/nar/gkaa265.
Cristian Estan, George Varghese, and Michael E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925-937, 2006. URL: https://doi.org/10.1145/1217709.
Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the 18th International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Analysis of Algorithms (AofA), 2007.
Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182-209, 1985. URL: https://doi.org/10.1016/0022-0000(85)90041-8.
Michael J. Freitag and Thomas Neumann. Every row counts: Combining sketches and sampling for accurate group-by result estimates. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR), 2019.
Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In Proceedings 13th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 281-291, 2001. URL: https://doi.org/10.1145/378580.378687.
Frédéric Giroire. Order statistics and estimating cardinalities of massive data sets. Discret. Appl. Math., 157(2):406-427, 2009. URL: https://doi.org/10.1016/j.dam.2008.06.020.
Ahmed Helmi, Jérémie Lumbroso, Conrado Martínez, and Alfredo Viola. Data Streams as Random Permutations: the Distinct Element Problem. In Proceedings of the 23rd International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Analysis of Algorithms (AofA), pages 323-338, 2012.
Piotr Indyk and David P. Woodruff. Tight lower bounds for the distinct elements problem. In Proceedings 44th IEEE Symposium on Foundations of Computer Science (FOCS), October 2003, Cambridge, MA, USA, Proceedings, pages 283-288, 2003. URL: https://doi.org/10.1109/SFCS.2003.1238202.
T. S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algorithms, 9(3):26:1-26:17, 2013. URL: https://doi.org/10.1145/2483699.2483706.
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings 29th ACM Symposium on Principles of Database Systems (PODS), pages 41-52, 2010. URL: https://doi.org/10.1145/1807085.1807094.
Kevin J. Lang. Back to the future: an even more nearly optimal cardinality estimation algorithm. CoRR, abs/1708.06839, 2017. URL: http://arxiv.org/abs/1708.06839.
Jérémie Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. In Proceedings of the 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA), pages 489-504, 2010.
Guillaume Marçais, Brad Solomon, Rob Patro, and Carl Kingsford. Sketching and sublinear data structures in genomics. Annual Review of Biomedical Data Science, 2(1):93-118, 2019. URL: https://doi.org/10.1146/annurev-biodatasci-072018-021156.
Seth Pettie and Dingyu Wang. Information theoretic limits of cardinality estimation: Fisher meets Shannon. In Proceedings 53rd ACM Symposium on Theory of Computing (STOC), 2021.
Seth Pettie, Dingyu Wang, and Longhui Yin. Non-mergeable sketching for cardinality estimation, 2021. URL: http://arxiv.org/abs/2008.08739.
Ninh Pham. Hybrid LSH: faster near neighbors reporting in high-dimensional space. In Proceedings of the 20th International Conference on Extending Database Technology (EDBT), pages 454-457, 2017. URL: https://doi.org/10.5441/002/edbt.2017.43.
Björn Scheuermann and Martin Mauve. Near-optimal compression of probabilistic counting sketches for networking applications. In Proceedings of the 4th International Workshop on Foundations of Mobile Computing (DIALM-POMC), 2007.
Daniel Ting. Streamed approximate counting of distinct elements: beating optimal batch methods. In Proceedings 20th ACM Conference on Knowledge Discovery and Data Mining (KDD), pages 442-451, 2014. URL: https://doi.org/10.1145/2623330.2623669.
Daniel Ting. Towards optimal cardinality estimation of unions and intersections with sketches. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1195-1204. ACM, 2016. URL: https://doi.org/10.1145/2939672.2939772.
Jake Wires, Stephen Ingram, Zachary Drudi, Nicholas J. A. Harvey, and Andrew Warfield. Characterizing storage workloads with counter stacks. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 335-349, 2014.
Derrick E. Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with Kraken 2. bioRxiv, 2019. URL: https://doi.org/10.1101/762302.
Qingjun Xiao, Shigang Chen, You Zhou, Min Chen, Junzhou Luo, Tengli Li, and Yibei Ling. Cardinality estimation for elephant flows: A compact solution based on virtual register sharing. IEEE/ACM Trans. Netw., 25(6):3738-3752, 2017. URL: https://doi.org/10.1109/TNET.2017.2753842.

Non-Mergeable Sketching for Cardinality Estimation

Authors Seth Pettie, Dingyu Wang, Longhui Yin

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Non-Mergeable Sketching for Cardinality Estimation

Authors Seth Pettie, Dingyu Wang, Longhui Yin

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message