Formalization of Randomized Approximation Algorithms for Frequency Moments

Karayel, Emin

doi:10.4230/LIPIcs.ITP.2022.21

Abstract

In 1999 Alon et al. introduced the still active research topic of approximating the frequency moments of a data stream using randomized algorithms with minimal space usage. This includes the problem of estimating the cardinality of the stream elements - the zeroth frequency moment. Higher-order frequency moments provide information about the skew of the data stream which is, for example, critical information for parallel processing. (The k-th frequency moment of a data stream is the sum of the k-th powers of the occurrence counts of each element in the stream.) They introduce both lower bounds and upper bounds on the space complexity of the problems, which were later improved by newer publications. The algorithms have guaranteed success probabilities and accuracies without making any assumptions on the input distribution. They are an interesting use case for formal verification because their correctness proofs require a large body of deep results from algebra, analysis and probability theory. This work reports on the formal verification of three algorithms for the approximation of F₀, F₂ and F_k for k ≥ 3. The results include the identification of significantly simpler algorithms with the same runtime and space complexities as the previously known ones as well as the development of several reusable components, such as a formalization of universal hash families, amplification methods for randomized algorithms, a model for one-pass data stream algorithms or a generic flexible encoding library for the verification of space complexities.

Reynald Affeldt, Jacques Garrigue, Xuanrui Qi, and Kazunari Tanaka. Proving Tree Algorithms for Succinct Data Structures. In 10th International Conference on Interactive Theorem Proving (ITP 2019), volume 141 of Leibniz International Proceedings in Informatics (LIPIcs), pages 5:1-5:19, Germany, 2019. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.ITP.2019.5.
Archive of Formal Proofs. https://isa-afp.org. Accessed: 2021-11-13.
Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999. URL: https://doi.org/10.1006/jcss.1997.1545.
Jialu Bao, Marco Gaboardi, Justin Hsu, and Joseph Tassarotti. A separation logic for negative dependence. Proceedings of the ACM on Programming Languages, 6:57:1-57:29, 2022. URL: https://doi.org/10.1145/3498719.
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, pages 1-10. Springer Berlin Heidelberg, 2002. URL: https://doi.org/10.1007/3-540-45726-7_1.
Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On synopses for distinct-value estimation under multiset operations. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 199-210, New York, 2007. URL: https://doi.org/10.1145/1247480.1247504.
Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh, and Chandan Saha. Simpler algorithm for estimating frequency moments of data streams. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, SODA '06, pages 708-713, USA, 2006. Society for Industrial and Applied Mathematics. URL: https://doi.org/10.5555/1109557.1109634.
Julian Biendarra and Manuel Eberl. Bertrand’s postulate. Archive of Formal Proofs, January 2017. , Formal proof development. URL: https://isa-afp.org/entries/Bertrands_Postulate.html.
Jarosław Błasiok. Optimal streaming and tracking distinct elements with high probability. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 2432-2448, 2018. URL: https://doi.org/10.1137/1.9781611975031.156.
Carsten Bormann and Paul E. Hoffman. Concise Binary Object Representation (CBOR). RFC 8949, December 2020. URL: https://doi.org/10.17487/RFC8949.
R. C. Bose and K. A. Bush. Orthogonal arrays of strength two and three. The Annals of Mathematical Statistics, 23(4):508-524, 1952. URL: https://doi.org/10.1214/AOMS/1177729331.
Vladimir Braverman, Jonathan Katzman, Charles Seidell, and Gregory Vorsanger. An Optimal Algorithm for Large Frequency Moments Using O(n^1-2/k) Bits. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014), volume 28 of Leibniz International Proceedings in Informatics (LIPIcs), pages 531-544, Germany, 2014. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2014.531.
Vladimir Braverman, Emanuele Viola, David P. Woodruff, and Lin F. Yang. Revisiting Frequency Moment Estimation in Random Order Streams. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), volume 107 of Leibniz International Proceedings in Informatics (LIPIcs), pages 25:1-25:14, Germany, 2018. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.25.
David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and Sridhar Seshadri. Practical skew handling in parallel joins. In Proceedings of the 18th VLDB conference, 1992.
Manuel Eberl, Max W. Haslbeck, and Tobias Nipkow. Verified analysis of random binary tree structures. J. Autom. Reason., 64(5):879-910, 2020. URL: https://doi.org/10.1007/s10817-020-09545-0.
Manuel Eberl, Johannes Hölzl, and Tobias Nipkow. A verified compiler for probability density functions. In Programming Languages and Systems, pages 80-104. Springer Berlin Heidelberg, 2015. URL: https://doi.org/10.1007/978-3-662-46669-8_4.
P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194-203, 1975. URL: https://doi.org/10.1109/TIT.1975.1055349.
Robin Eßmann, Tobias Nipkow, and Simon Robillard. Verified approximation algorithms. Automated Reasoning, 12167:291-306, 2020. URL: https://doi.org/10.46298/lmcs-18(1:36)2022.
Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science, pages 137-156, 2007.
Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985. URL: https://doi.org/10.1016/0022-0000(85)90041-8.
Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '01, pages 281-291, 2001. URL: https://doi.org/10.1145/378580.378687.
Frédéric Giroire. Order statistics and estimating cardinalities of massive data sets. Discrete Applied Mathematics, 157(2):406-427, 2009. URL: https://doi.org/10.1016/j.dam.2008.06.020.
Kiran Gopinathan and Ilya Sergey. Certifying certainty and uncertainty in approximate membership query structures. Computer Aided Verification, 12225:279-303, 2020. URL: https://doi.org/10.1007/978-3-030-53291-8_16.
Sebastien Gouezel. Lp spaces. Archive of Formal Proofs, October 2016. , Formal proof development. URL: https://isa-afp.org/entries/Lp.html.
Benjamin Gufler., Nikolaus Augsten., Angelika Reiser., and Alfons Kemper. Handling data skew in MapReduce. In Proceedings of the 1st International Conference on Cloud Computing and Services Science - CLOSER, pages 574-583. INSTICC, SciTePress, 2011. URL: https://doi.org/10.5220/0003391105740583.
Stefan Heule, Marc Nunkesser, and Alexander Hall. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 683-692, New York, 2013. ACM. URL: https://doi.org/10.1145/2452376.2452456.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963. URL: https://doi.org/10.1007/978-1-4612-0865-5_26.
Piotr Indyk and David Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC '05, pages 202-208, New York, 2005. URL: https://doi.org/10.1145/1060590.1060621.
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '10, pages 1161-1178, USA, 2010. Society for Industrial and Applied Mathematics. URL: https://doi.org/10.1137/1.9781611973075.93.
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '10, pages 41-52, New York, 2010. URL: https://doi.org/10.1145/1807085.1807094.
Emin Karayel. A combinator library for prefix-free codes. Archive of Formal Proofs, April 2022. , Formal proof development. URL: https://isa-afp.org/entries/Prefix_Free_Code_Combinators.html.
Emin Karayel. Enumeration of equivalence relations. Archive of Formal Proofs, February 2022. , Formal proof development. URL: https://isa-afp.org/entries/Equivalence_Relation_Enumeration.html.
Emin Karayel. Formalization of randomized approximation algorithms for frequency moments. Archive of Formal Proofs, April 2022. , Formal proof development. URL: https://isa-afp.org/entries/Frequency_Moments.html.
Emin Karayel. Interpolation polynomials (in hol-algebra). Archive of Formal Proofs, January 2022. , Formal proof development. URL: https://isa-afp.org/entries/Interpolation_Polynomials_HOL_Algebra.html.
Emin Karayel. Median method. Archive of Formal Proofs, January 2022. , Formal proof development. URL: https://isa-afp.org/entries/Median_Method.html.
Emin Karayel. Universal hash families. Archive of Formal Proofs, February 2022. , Formal proof development. URL: https://isa-afp.org/entries/Universal_Hash_Families.html.
Robert Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840-842, 1978. URL: https://doi.org/10.1145/359619.359627.
R.C. Mullin, I.M. Onyszchuk, S.A. Vanstone, and R.M. Wilson. Optimal normal bases in GF(pⁿ). Discrete Applied Mathematics, 22(2):149-161, 1988. URL: https://doi.org/10.1016/0166-218X(88)90090-X.
Magnus O. Myreen. The CakeML Project’s Quest for Ever Stronger Correctness Theorems. In 12th International Conference on Interactive Theorem Proving (ITP 2021), volume 193 of Leibniz International Proceedings in Informatics (LIPIcs), pages 1:1-1:10, Germany, 2021. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.ITP.2021.1.
Tobias Nipkow, Lawrence C Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, 1 edition, 2002.
Vaughan R. Pratt. Shellsort and sorting networks. In Outstanding Dissertations in the Computer Sciences, 1972.
Pedro Reviriego, Alfonso Sanchez-Macian, Shanshan Liu, and Fabrizio Lombardi. On the security of the k minimum values (KMV) sketch. IEEE Transactions on Dependable and Secure Computing, 2021. URL: https://doi.org/10.1109/TDSC.2021.3101280.
Joseph H. Silverman. Fast multiplication in finite fields GF(2ⁿ). In Cryptographic Hardware and Embedded Systems, pages 122-134. Springer Berlin Heidelberg, 1999. URL: https://doi.org/10.1007/3-540-48059-5_12.
Hagen Sparka, Florian Tschorsch, and Björn Scheuermann. P2KMV: A privacy-preserving counting sketch for efficient and accurate set intersection cardinality estimations. Cryptology ePrint Archive, Report 2018/234, 2018. URL: https://doi.org/10.14279/DEPOSITONCE-8374.
Daniel Stüwe and Manuel Eberl. Probabilistic primality testing. Archive of Formal Proofs, February 2019. , Formal proof development. URL: https://isa-afp.org/entries/Probabilistic_Prime_Tests.html.
Joseph Tassarotti, Koundinya Vajjha, Anindya Banerjee, and Jean-Baptiste Tristan. A formal proof of PAC learnability for decision stumps. In CPP '21: 10th ACM SIGPLAN International Conference on Certified Programs and Proofs, Virtual Event, Denmark, January 17-19, 2021, pages 5-17, 2021. URL: https://doi.org/10.1145/3437992.3439917.
Mikkel Thorup and Yin Zhang. Tabulation based 5-universal hashing and linear probing. In Proceedings of the Meeting on Algorithm Engineering & Expermiments, ALENEX '10, pages 62-76, USA, 2010. Society for Industrial and Applied Mathematics. URL: https://doi.org/10.1137/1.9781611972900.7.
Salil P. Vadhan. Pseudorandomness. Foundations and Trends® in Theoretical Computer Science, 7(1-3):1-336, 2012. URL: https://doi.org/10.1561/0400000010.
Mark N. Wegman and J. Lawrence Carter. New hash functions and their use in authentication and set equality. Journal of Computer and System Sciences, 22(3):265-279, 1981. URL: https://doi.org/10.1016/0022-0000(81)90033-7.
Yang Yang, Ying Zhang, Wenjie Zhang, and Zengfeng Huang. GB-KMV: An augmented KMV sketch for approximate containment similarity search. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 458-469, 2019. URL: https://doi.org/10.1109/ICDE.2019.00048.

Formalization of Randomized Approximation Algorithms for Frequency Moments

Author Emin Karayel

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message

Formalization of Randomized Approximation Algorithms for Frequency Moments

Author Emin Karayel

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

Supplementary Materials

References

Thanks for your feedback!

Could not send message