Range Entropy Queries and Partitioning

Krishnan, Sanjay; Sintos, Stavros

doi:10.4230/LIPIcs.ICDT.2024.6

Abstract

Data partitioning that maximizes or minimizes Shannon entropy is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partitioning algorithms can be accelerated by a data structure that reports the entropy of different subsets of the data whenever the algorithm must decide which block to construct. While it is well known how to compute the entropy of a discrete distribution efficiently, we want to efficiently derive the entropy among the data items that lie in a specific area. We solve this problem in a setting typical of real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set P of n weighted and colored points in ℝ^d. The goal is to construct a low-space data structure such that, given a query (hyper)rectangle R, it computes the entropy based on the colors of the points in P ∩ R in sublinear time. We show a conditional lower bound for this problem, proving that we cannot hope for data structures with near-linear space and near-constant query time. Then, we propose exact data structures for d = 1 and d > 1 with o(n^{2d}) space and o(n) query time. We also provide a tuning parameter t that the user can choose to bound the asymptotic space and query time of the new data structures. Next, we propose near-linear-space data structures for returning either an additive or a multiplicative approximation of the entropy. Finally, we show how the new data structures can be used to efficiently partition time series and histograms with respect to entropy.
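To make the query concrete, the following is a minimal brute-force sketch of a range entropy query, not one of the data structures proposed in the paper. It scans all n points, so it takes linear time per query, which is exactly the cost the paper's structures aim to beat. The function name `range_entropy`, the `(coords, color, weight)` tuple layout, the use of each color's share of the total weight in P ∩ R as its probability, and the base-2 logarithm are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def range_entropy(points, rect):
    """Brute-force entropy of the color distribution inside a query rectangle.

    points: iterable of (coords, color, weight), coords being a tuple of d numbers.
    rect:   list of d closed intervals (lo, hi), one per dimension.
    Assumption: a color's probability is its share of the total weight in the range.
    """
    mass = defaultdict(float)  # total weight per color inside the rectangle
    for coords, color, weight in points:
        inside = all(lo <= x <= hi for x, (lo, hi) in zip(coords, rect))
        if inside:
            mass[color] += weight

    total = sum(mass.values())
    if total == 0:
        return 0.0  # convention chosen here: an empty range has zero entropy

    # H = sum over colors of p * log2(1 / p), with p = m / total
    return sum((m / total) * math.log2(total / m) for m in mass.values() if m > 0)


# Hypothetical example data: three unit-weight points in the plane.
example_points = [((0, 0), "a", 1.0), ((1, 1), "b", 1.0), ((5, 5), "a", 1.0)]
small_rect = [(0, 2), (0, 2)]   # contains one "a" and one "b"
full_rect = [(0, 10), (0, 10)]  # contains all three points
```

With this example data, `range_entropy(example_points, small_rect)` sees two colors of equal weight and returns 1.0 bit, while the query over `full_rect` sees weights 2 and 1 and returns a smaller value.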

