Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs

Authors Sepehr Assadi, Nirmit Joshi, Milind Prabhu, Vihan Shah




File

LIPIcs.ICDT.2023.19.pdf
  • Filesize: 0.87 MB
  • 19 pages


Author Details

Sepehr Assadi
  • Department of Computer Science, Rutgers University, Piscataway, NJ, USA
Nirmit Joshi
  • Department of Computer Science, Northwestern University, Evanston, IL, USA
Milind Prabhu
  • Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA
Vihan Shah
  • Department of Computer Science, Rutgers University, Piscataway, NJ, USA

Acknowledgements

We would like to thank Rajiv Gandhi for making the collaboration between the authors possible and for his support throughout this project.

Cite As

Sepehr Assadi, Nirmit Joshi, Milind Prabhu, and Vihan Shah. Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs. In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 19:1-19:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ICDT.2023.19

Abstract

Estimating quantiles, like the median or percentiles, is a fundamental task in data mining and data science. A (streaming) quantile summary is a data structure that can process a set S of n elements in a streaming fashion and at the end, for any ϕ ∈ (0,1], return a ϕ-quantile of S up to an ε error, i.e., return a ϕ'-quantile with ϕ' = ϕ ± ε. We are particularly interested in comparison-based summaries that only compare elements of the universe under a total ordering and are otherwise completely oblivious of the universe. The best known deterministic quantile summary is the 20-year-old Greenwald-Khanna (GK) summary that uses O((1/ε) log(εn)) space [SIGMOD'01]. This bound was recently proved to be optimal for all deterministic comparison-based summaries by Cormode and Veselý [PODS'20]. In this paper, we study weighted quantiles, a generalization of the quantiles problem, where each element arrives with a positive integer weight which denotes the number of copies of that element being inserted. The only known method of handling weighted inputs via GK summaries is the naive approach of breaking each weighted element into multiple unweighted items and feeding them one by one to the summary, which results in a prohibitively large update time (proportional to the maximum weight of input elements). We give the first non-trivial extension of GK summaries for weighted inputs and show that it takes O((1/ε) log(εn)) space and O(log(1/ε) + log log(εn)) update time per element to process a stream of length n (under some quite mild assumptions on the range of weights and ε). En route to this, we also simplify the original GK summaries for unweighted quantiles.
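To make the definitions in the abstract concrete, the following Python sketch computes an exact weighted ϕ-quantile offline and checks whether a candidate answer is within ε error. It is only an illustration of the problem statement, not the GK summary or the weighted extension from the paper: it stores and sorts the whole input. The function names (`weighted_quantile`, `naive_expand`, `is_eps_approx`) and the convention that the ϕ-quantile is the element of rank ⌈ϕW⌉, where W is the total weight, are assumptions made for this example.

```python
from bisect import bisect_left
from itertools import accumulate
import math


def weighted_quantile(items, phi):
    """Exact phi-quantile of (value, weight) pairs; weights are positive ints.

    Assumed convention: the phi-quantile is the element of rank
    ceil(phi * W) in sorted order, where W is the total weight.
    """
    items = sorted(items)
    cum = list(accumulate(w for _, w in items))  # cumulative weights
    target = max(1, math.ceil(phi * cum[-1]))    # 1-indexed target rank
    return items[bisect_left(cum, target)][0]


def naive_expand(items):
    """The naive reduction: replace (value, w) by w unweighted copies."""
    return sorted(v for v, w in items for _ in range(w))


def is_eps_approx(items, phi, answer, eps):
    """True if `answer` occupies some rank in [(phi - eps) W, (phi + eps) W]."""
    total = sum(w for _, w in items)
    lowest_rank = sum(w for v, w in items if v < answer) + 1
    highest_rank = sum(w for v, w in items if v <= answer)
    return lowest_rank <= (phi + eps) * total and highest_rank >= (phi - eps) * total


items = [(10, 3), (20, 1), (30, 6)]  # total weight W = 10
print(weighted_quantile(items, 0.5))       # exact weighted median
print(naive_expand(items))                 # same multiset, unweighted
print(is_eps_approx(items, 0.5, 20, 0.2))  # is 20 an acceptable median?
```

Expanding every weighted element, as `naive_expand` does, costs time proportional to the weight of each element. That is the update-time overhead the abstract describes for the naive approach, while the cumulative-weight view used in `weighted_quantile` never materializes the copies.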

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Data structures design and analysis
Keywords
  • Streaming algorithms
  • Quantile summaries
  • Rank estimation


References

  1. Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, 2012.
  2. Noga Alon, Omri Ben-Eliezer, Yuval Dagan, Shay Moran, Moni Naor, and Eylon Yogev. Adversarial laws of large numbers and optimal regret in online classification. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 447-455, 2021.
  3. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, Philadelphia, Pennsylvania, USA, May 22-24, 1996, pages 20-29, 1996.
  4. Tianqi Chen and Carlos Guestrin. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, August 2016. URL: https://doi.org/10.1145/2939672.2939785.
  5. Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794, 2016.
  6. Graham Cormode and Pavel Veselý. A tight lower bound for comparison-based quantile summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2020, Portland, OR, USA, June 14-19, 2020, pages 81-93, 2020.
  7. Anna C Gilbert, Brett Hemenway, Atri Rudra, Martin J Strauss, and Mary Wootters. Recovering simple signals. In 2012 Information Theory and Applications Workshop, pages 382-391. IEEE, 2012.
  8. Anna C Gilbert, Brett Hemenway, Martin J Strauss, David P Woodruff, and Mary Wootters. Reusable low-error compressive sampling schemes through privacy. In 2012 IEEE Statistical Signal Processing Workshop (SSP), pages 536-539. IEEE, 2012.
  9. Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pages 58-66, 2001.
  10. Moritz Hardt and David P Woodruff. How robust are linear sketches to adaptive inputs? In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 121-130, 2013.
  11. Regant Y. S. Hung and Hing-Fung Ting. An Ω((1/ε) log(1/ε)) space lower bound for finding ε-approximate quantiles in a data stream. In Frontiers in Algorithmics, 4th International Workshop, FAW 2010, Wuhan, China, August 11-13, 2010. Proceedings, pages 89-100, 2010.
  12. Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 71-78, 2016.
  13. Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. VLDB J., 25(4):449-472, 2016.
  14. Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 426-435, 1998.
  15. Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA, pages 251-262, 1999.
  16. Ilya Mironov, Moni Naor, and Gil Segev. Sketching in adversarial environments. SIAM Journal on Computing, 40(6):1845-1870, 2011.
  17. J. Ian Munro and Mike Paterson. Selection and sorting with limited storage. In 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16-18 October 1978, pages 253-258, 1978.
  18. Moni Naor and Eylon Yogev. Bloom filters in adversarial environments. In Annual Cryptology Conference, pages 565-584. Springer, 2015.
  19. List of open problems in sublinear algorithms - problem 2: Quantiles. URL: https://sublinear.info/2.