Daisy Bloom Filters

Authors: Ioana O. Bercea, Jakob Bæk Tejs Houen, Rasmus Pagh




File

LIPIcs.SWAT.2024.9.pdf
  • Filesize: 0.83 MB
  • 19 pages

Author Details

Ioana O. Bercea
  • KTH Royal Institute of Technology, Stockholm, Sweden
Jakob Bæk Tejs Houen
  • BARC, University of Copenhagen, Denmark
Rasmus Pagh
  • BARC, University of Copenhagen, Denmark

Cite As

Ioana O. Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh. Daisy Bloom Filters. In 19th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 294, pp. 9:1-9:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SWAT.2024.9

Abstract

A filter is a widely used data structure for storing an approximation of a given set S of elements from some universe 𝒰 (a countable set). It represents a superset S' ⊇ S that is "close to S" in the sense that for x ∉ S, the probability that x ∈ S' is bounded by some ε > 0. The advantage of using a Bloom filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store S exactly. Though filters are well-understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in S with probability close to 1. Then it would make sense to always include them in S', saving space by not having to represent these elements in the filter. Questions like this have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Kraska, ICLR 2021). In this paper, we present a lower bound for the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate at most ε with high probability over input sets drawn from a product distribution. Our construction, which we call the Daisy Bloom filter, executes operations faster and uses significantly less space than the standard Bloom filter.
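The abstract describes both the classical Bloom filter guarantee and the distribution-aware setting the paper studies. The following Python sketch illustrates the two ideas side by side; the class names, the SHA-256-based hashing, and the rule for choosing the per-element number of probes k_x are illustrative assumptions only, not the paper's actual Daisy Bloom filter construction.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter (Bloom, 1970): an m-slot bit array and k
    hash functions. Queries for inserted elements always return True;
    queries for other elements return True (a false positive) with
    probability roughly (1 - e^(-kn/m))^k after n insertions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _indexes(self, item: str, k: int):
        # Illustrative hashing: salt SHA-256 with the probe number.
        # The paper's analysis assumes fully random hash functions.
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for idx in self._indexes(item, self.k):
            self.bits[idx] = 1

    def query(self, item: str) -> bool:
        # False is always correct; True may be a false positive.
        return all(self.bits[idx] for idx in self._indexes(item, self.k))


class WeightedBloomFilter(BloomFilter):
    """Hypothetical distribution-aware variant in the spirit of Weighted
    Bloom filters (Bruck, Gao and Jiang, 2006): an element x whose
    membership probability p_x is close to 1 gets k_x = 0 probes, i.e.
    it is always reported present and costs no space. The rule below
    for choosing k_x is a made-up illustration, not the paper's
    optimized choice."""

    def k_for(self, p_x: float, eps: float) -> int:
        if p_x >= 1 - eps:
            return 0  # always include x in S'; spend no bits on it
        # Fewer probes for more likely elements (illustrative rule).
        return max(1, round(math.log2(1 / max(p_x, eps))))

    def add_weighted(self, item: str, p_x: float, eps: float) -> None:
        for idx in self._indexes(item, self.k_for(p_x, eps)):
            self.bits[idx] = 1

    def query_weighted(self, item: str, p_x: float, eps: float) -> bool:
        # With k_x = 0 probes, all() over an empty generator is True.
        return all(self.bits[idx]
                   for idx in self._indexes(item, self.k_for(p_x, eps)))
```

Note how the weighted variant makes the trade-off in the abstract concrete: elements with p_x ≥ 1 - ε are folded into S' for free, at the cost of answering True for them unconditionally.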

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
Keywords
  • Bloom filters
  • input distribution
  • learned data structures

References

  1. Yuriy Arbitman, Moni Naor, and Gil Segev. Backyard cuckoo hashing: Constant worst-case operations with a succinct representation. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 787-796. IEEE, 2010. See also URL: https://arxiv.org/abs/0912.5424v3.
  2. Michael A. Bender, Martin Farach-Colton, Mayank Goswami, Rob Johnson, Samuel McCauley, and Shikha Singh. Bloom filters, adaptivity, and the dictionary problem. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pages 182-193, 2018. URL: https://doi.org/10.1109/FOCS.2018.00026.
  3. Michael A Bender, Martin Farach-Colton, Rob Johnson, Bradley C Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P Spillane, and Erez Zadok. Don't thrash: How to cache your hash on flash. In 3rd Workshop on Hot Topics in Storage and File Systems (HotStorage 11), 2011.
  4. Michael A. Bender, Martin Farach-Colton, John Kuszmaul, William Kuszmaul, and Mingmou Liu. On the optimal time/space tradeoff for hash tables. In Stefano Leonardi and Anupam Gupta, editors, STOC '22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20-24, 2022, pages 1284-1297. ACM, 2022. URL: https://doi.org/10.1145/3519935.3519969.
  5. Ioana O. Bercea and Guy Even. A dynamic space-efficient filter with constant time operations. In Susanne Albers, editor, 17th Scandinavian Symposium and Workshops on Algorithm Theory, SWAT 2020, June 22-24, 2020, Tórshavn, Faroe Islands, volume 162 of LIPIcs, pages 11:1-11:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.SWAT.2020.11.
  6. Ioana O. Bercea and Guy Even. Dynamic dictionaries for multisets and counting filters with constant time operations. In Anna Lubiw and Mohammad R. Salavatipour, editors, Algorithms and Data Structures - 17th International Symposium, WADS 2021, Virtual Event, August 9-11, 2021, Proceedings, volume 12808 of Lecture Notes in Computer Science, pages 144-157. Springer, 2021. URL: https://doi.org/10.1007/978-3-030-83508-8_11.
  7. Ioana O. Bercea, Jakob Bæk Tejs Houen, and Rasmus Pagh. Daisy bloom filters. CoRR, abs/2205.14894, 2022. URL: https://doi.org/10.48550/arXiv.2205.14894.
  8. Ioana Oriana Bercea and Guy Even. An extendable data structure for incremental stable perfect hashing. In Stefano Leonardi and Anupam Gupta, editors, STOC '22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20-24, 2022, pages 1298-1310. ACM, 2022. URL: https://doi.org/10.1145/3519935.3520070.
  9. Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422-426, 1970. URL: https://doi.org/10.1145/362686.362692.
  10. Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, and George Varghese. An improved construction for counting bloom filters. In Algorithms-ESA 2006: 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006. Proceedings 14, pages 684-695. Springer, 2006.
  11. Andrei Z. Broder and Michael Mitzenmacher. Network applications of bloom filters: A survey. Internet Math., 1(4):485-509, 2003. URL: https://doi.org/10.1080/15427951.2004.10129096.
  12. Jehoshua Bruck, Jie Gao, and Anxiao Jiang. Weighted bloom filter. In International Symposium on Information Theory (ISIT), pages 2304-2308. IEEE, 2006. URL: https://doi.org/10.1109/ISIT.2006.261978.
  13. Clément Canonne and Ronitt Rubinfeld. Testing probability distributions underlying aggregated data. In International Colloquium on Automata, Languages, and Programming, pages 283-295. Springer, 2014.
  14. Xinyuan Cao, Jingbang Chen, Li Chen, Chris Lambert, Richard Peng, and Daniel Sleator. Learning-augmented b-trees, 2023. URL: https://arxiv.org/abs/2211.09251.
  15. Larry Carter, Robert W. Floyd, John Gill, George Markowsky, and Mark N. Wegman. Exact and approximate membership testers. In Richard J. Lipton, Walter A. Burkhard, Walter J. Savitch, Emily P. Friedman, and Alfred V. Aho, editors, Proceedings of the 10th Annual ACM Symposium on Theory of Computing, May 1-3, 1978, San Diego, California, USA, pages 59-65. ACM, 1978.
  16. Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693-703. Springer, 2002.
  17. Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58-75, 2005.
  18. Zhenwei Dai and Anshumali Shrivastava. Adaptive learned bloom filter (ada-bf): Efficient utilization of the classifier with application to real-time information filtering on the web. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  19. Niv Dayan, Ioana O. Bercea, Pedro Reviriego, and Rasmus Pagh. Infinifilter: Expanding filters to infinity and beyond. Proc. ACM Manag. Data, 1(2):140:1-140:27, 2023. URL: https://doi.org/10.1145/3589285.
  20. Martin Dietzfelbinger and Rasmus Pagh. Succinct data structures for retrieval and approximate membership. In International Colloquium on Automata, Languages, and Programming, pages 385-396. Springer, 2008.
  21. Yihe Dong, Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Learning space partitions for nearest neighbor search. In International Conference on Learning Representations (ICLR), 2020.
  22. Devdatt P. Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, USA, 2012.
  23. Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, and Tal Wagner. Learning-based support estimation in sublinear time. In International Conference on Learning Representations, 2020.
  24. Tomer Even, Guy Even, and Adam Morrison. Prefix filter: Practically and theoretically better than bloom. Proc. VLDB Endow., 15(7):1311-1323, 2022. URL: https://doi.org/10.14778/3523210.3523211.
  25. Bin Fan, Dave G Andersen, Michael Kaminsky, and Michael D Mitzenmacher. Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies, pages 75-88, 2014.
  26. Li Fan, Pei Cao, Jussara Almeida, and Andrei Z Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM transactions on networking, 8(3):281-293, 2000.
  27. Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra. Learned monotone minimal perfect hashing. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms, ESA 2023, September 4-6, 2023, Amsterdam, The Netherlands, volume 274 of LIPIcs, pages 46:1-46:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPICS.ESA.2023.46.
  28. Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. Why are learned indexes so effective? In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research. PMLR, 2020.
  29. Paolo Ferragina and Giorgio Vinciguerra. Learned data structures. In Luca Oneto, Nicolò Navarin, Alessandro Sperduti, and Davide Anguita, editors, Recent Trends in Learning From Data - Tutorials from the INNS Big Data and Deep Learning Conference (INNSBDDL 2019), volume 896 of Studies in Computational Intelligence, pages 5-41. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-43883-8_2.
  30. Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proceedings of the VLDB Endowment, 13(8):1162-1175, 2020.
  31. Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. FITing-tree: A data-aware index structure. In Proceedings International Conference on Management of Data (SIGMOD), pages 1189-1206, 2019.
  32. Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In International Conference on Learning Representations, 2019.
  33. Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 489-504. ACM, 2018.
  34. Mingmou Liu, Yitong Yin, and Huacheng Yu. Succinct filters for sets of unknown sizes. arXiv preprint, 2020. URL: https://arxiv.org/abs/2004.12465.
  35. Lailong Luo, Deke Guo, Richard T. B. Ma, Ori Rottenstreich, and Xueshan Luo. Optimizing bloom filter: Challenges, solutions, and comparisons. IEEE Commun. Surv. Tutorials, 21(2):1912-1949, 2019. URL: https://doi.org/10.1109/COMST.2018.2889329.
  36. Samuel McCauley, Benjamin Moseley, Aidin Niaparast, and Shikha Singh. Online list labeling with predictions, 2023. URL: https://arxiv.org/abs/2305.10536.
  37. Michael Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  38. Michael Mitzenmacher, Salvatore Pontarelli, and Pedro Reviriego. Adaptive cuckoo filters. In 2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 36-47. SIAM, 2018.
  39. Moni Naor and Noa Oved. Bet-or-pass: Adversarially robust bloom filters. In Theory of Cryptography Conference, pages 777-808. Springer, 2022.
  40. Moni Naor and Eylon Yogev. Bloom filters in adversarial environments. In Annual Cryptology Conference, pages 565-584. Springer, 2015.
  41. Anna Pagh, Rasmus Pagh, and S. Srinivasa Rao. An optimal Bloom filter replacement. In SODA, pages 823-829. SIAM, 2005.
  42. Rasmus Pagh, Gil Segev, and Udi Wieder. How to approximate a set without knowing its size in advance. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 80-89. IEEE, 2013.
  43. Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM international conference on Management of Data, pages 775-787, 2017.
  44. Prashant Pandey, Alex Conway, Joe Durie, Michael A. Bender, Martin Farach-Colton, and Rob Johnson. Vector quotient filters: Overcoming the time/space trade-off in filter design. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, pages 1386-1399. ACM, 2021. URL: https://doi.org/10.1145/3448016.3452841.
  45. Ely Porat. An optimal Bloom filter replacement based on matrix solving. In International Computer Science Symposium in Russia, pages 263-273. Springer, 2009.
  46. Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via ml predictions. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
  47. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.
  48. Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, and Tim Kraska. Partitioned learned bloom filters. In 9th International Conference on Learning Representations (ICLR). OpenReview.net, 2021.
  49. Xiujun Wang, Yusheng Ji, Zhe Dang, Xiao Zheng, and Baohua Zhao. Improved weighted bloom filter and space lower bound analysis of algorithms for approximated membership querying. In Database Systems for Advanced Applications (DASFAA), volume 9050 of Lecture Notes in Computer Science, pages 346-362. Springer, 2015. URL: https://doi.org/10.1007/978-3-319-18123-3_21.