Buffered Count-Min Sketch on SSD: Theory and Experiments

Goswami, Mayank; Medjedovic, Dzejla; Mekic, Emina; Pandey, Prashant

doi:10.4230/LIPIcs.ESA.2018.41

File

Author Details

Mayank Goswami

Queens College, City University of New York

Dzejla Medjedovic

International University of Sarajevo

Emina Mekic

Sarajevo School of Science and Technology

Prashant Pandey

Stony Brook University, New York

Cite AsGet BibTex

Mayank Goswami, Dzejla Medjedovic, Emina Mekic, and Prashant Pandey. Buffered Count-Min Sketch on SSD: Theory and Experiments. In 26th Annual European Symposium on Algorithms (ESA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 112, pp. 41:1-41:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.ESA.2018.41

Abstract

Frequency estimation data structures such as the count-min sketch (CMS) have found numerous applications in databases, networking, computational biology and other domains. Many applications that use the count-min sketch process massive and rapidly evolving data sets. For data-intensive applications that aim to keep the overestimate error low, the count-min sketch becomes too large to store in available RAM and may have to migrate to external storage (e.g., SSD.) Due to the random-read/write nature of hash operations of the count-min sketch, simply placing it on SSD stifles the performance of time-critical applications, requiring about 4-6 random reads/writes to SSD per estimate (lookup) and update (insert) operation. In this paper, we expand on the preliminary idea of the buffered count-min sketch (BCMS) {[Eydi et al., 2017]}, an SSD variant of the count-min sketch, that uses hash localization to scale efficiently out of RAM while keeping the total error bounded. We describe the design and implementation of the buffered count-min sketch, and empirically show that our implementation achieves 3.7 x-4.7 x speedup on update and 4.3 x speedup on estimate operations compared to the traditional count-min sketch on SSD. Our design also offers an asymptotic improvement in the external-memory model over the original data structure: r random I/Os are reduced to 1 I/O for the estimate operation. For a data structure that uses k blocks on SSD, w as the word/counter size, r as the number of rows, M as the number of bits in the main memory, our data structure uses kwr/M amortized I/Os for updates, or, if kwr/M > 1, 1 I/O in the worst case. In typical scenarios, kwr/M is much smaller than 1. This is in contrast to O(r) I/Os incurred for each update in the original data structure. Lastly, we mathematically show that for the buffered count-min sketch, the error rate does not substantially degrade over the traditional count-min sketch. Specifically, we prove that for any query q, our data structure provides the guarantee: Pr[Error(q) >= n epsilon (1+o(1))] <= delta + o(1), which, up to o(1) terms, is the same guarantee as that of a traditional count-min sketch.

Subject Classification

ACM Subject Classification

Theory of computation → Data structures and algorithms for data management
Theory of computation → Streaming models
Theory of computation → Database query processing and optimization (theory)

Keywords

Streaming model
Count-min sketch
Counting
Frequency
External memory
I/O efficiency
Bloom filter
Counting filter
Quotient filter

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Alok Aggarwal and S. Vitter, Jeffrey. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116-1127, 1988. URL: http://dx.doi.org/10.1145/48529.48535.
Austin Appleby. 32-bit variant of murmurhash3, 2011. URL: https://sites.google.com/site/murmurhash/.
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, pages 1-16, New York, NY, USA, 2002. ACM. URL: http://dx.doi.org/10.1145/543613.543615.
Michael A. Bender, Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, and Erez Zadok. Don't thrash: How to cache your hash on flash. Proc. VLDB Endow., 5(11):1627-1637, jul 2012. URL: http://dx.doi.org/10.14778/2350229.2350275.
Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422-426, 1970. URL: http://dx.doi.org/10.1145/362686.362692.
Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, and George Varghese. An improved construction for counting bloom filters. In Proceedings of the 14th Conference on Annual European Symposium - Volume 14, ESA'06, pages 684-695, London, UK, UK, 2006. Springer-Verlag. URL: http://dx.doi.org/10.1007/11841036_61.
Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and zipf-like distributions: Evidence and implications. In INFOCOM '99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, pages 126-134, 1999.
Mustafa Canim, George A. Mihaila, Bishwaranjan Bhattacharjee, Christian A. Lang, and Kenneth A. Ross. Buffered bloom filters on solid state storage. In Rajesh Bordawekar and Christian A. Lang, editors, ADMS@VLDB, pages 1-8, 2010. URL: http://dblp.uni-trier.de/db/conf/vldb/adms2010.html#CanimMBLR10.
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, ICALP '02, pages 693-703, Berlin, Heidelberg, 2002. Springer-Verlag. URL: http://dl.acm.org/citation.cfm?id=646255.684566.
Aiyou Chen, Yu Jin, Jin Cao, and Li Erran Li. Tracking long duration flows in network traffic. In Proceedings of the 29th Conference on Information Communications, INFOCOM'10, pages 206-210, Piscataway, NJ, USA, 2010. IEEE Press. URL: http://dl.acm.org/citation.cfm?id=1833515.1833557.
Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1):58-75, 2005. URL: http://dx.doi.org/10.1016/j.jalgor.2003.12.001.
Biplob Debnath, Sudipta Sengupta, Jin Li, David J. Lilja, and David H. C. Du. Bloomflash: Bloom filter on flash-based storage. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems, ICDCS '11, pages 635-644, Washington, DC, USA, 2011. IEEE Computer Society. URL: http://dx.doi.org/10.1109/ICDCS.2011.44.
Ehsan Eydi, Dzejla Medjedovic, Emina Mekic, and Elmedin Selmanovic. Buffered count-min sketch. In Mirsad Hadžikadić and Samir Avdaković, editors, Advanced Technologies, Systems, and Applications II, pages 249-255, Cham, 2018. Springer International Publishing.
Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy. Mining data streams: A review. SIGMOD Rec., 34(2):18-26, 2005. URL: http://dx.doi.org/10.1145/1083784.1083789.
Sumit Ganguly. Lower bounds on frequency estimation of data streams. In International Computer Science Symposium in Russia, pages 204-215. Springer, 2008.
Michael Gertz, Quinn Hart, Carlos Rueda, Shefali Singhal, and Jie Zhang. A data and query model for streaming geospatial image data. In Torsten Grust, Hagen Höpfner, Arantza Illarramendi, Stefan Jablonski, Marco Mesiti, Sascha Müller, Paula-Lavinia Patranjan, Kai-Uwe Sattler, Myra Spiliopoulou, and Jef Wijsen, editors, Current Trends in Database Technology - EDBT 2006, pages 687-699, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
Amit Goyal, Jagadeesh Jagarlamudi, Hal Daumé, III, and Suresh Venkatasubramanian. Sketch techniques for scaling distributional similarity to the web. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, GEMS '10, pages 51-56, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL: http://dl.acm.org/citation.cfm?id=1870516.1870524.
Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 346-357. VLDB Endowment, 2002. URL: http://dl.acm.org/citation.cfm?id=1287369.1287400.
Suman Nath, Phillip B. Gibbons, Srinivasan Seshan, and Zachary R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In Proceedings of the 2Nd International Conference on Embedded Networked Sensor Systems, SenSys '04, pages 250-262, New York, NY, USA, 2004. ACM. URL: http://dx.doi.org/10.1145/1031495.1031525.
Prashant Pandey, Michael A. Bender, Rob Johnson, and Robert Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 775-787, 2017. URL: http://dx.doi.org/10.1145/3035918.3035963.
Martin Raab and Angelika Steger. “balls into bins”—a simple and tight analysis. Randomization and Approximation Techniques in Computer Science, pages 159-170, 1998.
Tim Roughgarden and Gregory Valiant. Cs168: The modern algorithmic toolbox lecture #2: Approximate heavy hitters and the count-min sketch, 2018.
Tamás Sarlós, Adrás A. Benczúr, Károly Csalogány, Dániel Fogaras, and Balázs Rácz. To randomize or not to randomize: Space optimal summaries for hyperlink analysis. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, pages 297-306, New York, NY, USA, 2006. ACM. URL: http://dx.doi.org/10.1145/1135777.1135823.
Stuart Schechter, Cormac Herley, and Michael Mitzenmacher. Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks. In Proceedings of the 5th USENIX Conference on Hot Topics in Security, HotSec'10, pages 1-8, Berkeley, CA, USA, 2010. USENIX Association. URL: http://dl.acm.org/citation.cfm?id=1924931.1924935.
David P. Woodruff. New algorithms for heavy hitters in data streams. CoRR, abs/1603.01733, 2016. URL: http://arxiv.org/abs/1603.01733,
Yin Zhang, Sumeet Singh, Subhabrata Sen, Nick Duffield, and Carsten Lund. Online identification of hierarchical heavy hitters: Algorithms, evaluation, and applications. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, IMC '04, pages 101-114, New York, NY, USA, 2004. ACM. URL: http://dx.doi.org/10.1145/1028788.1028802.
Qi (George) Zhao, Mitsunori Ogihara, Haixun Wang, and Jun (Jim) Xu. Finding global icebergs over distributed data sets. In Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '06, pages 298-307, New York, NY, USA, 2006. ACM. URL: http://dx.doi.org/10.1145/1142351.1142394.

Buffered Count-Min Sketch on SSD: Theory and Experiments

Authors Mayank Goswami, Dzejla Medjedovic, Emina Mekic, Prashant Pandey

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message