Affirmative Sampling: Theory and Applications

Authors Jérémie Lumbroso, Conrado Martínez




File

LIPIcs.AofA.2022.12.pdf
  • Filesize: 0.8 MB
  • 17 pages


Author Details

Jérémie Lumbroso
  • Department of Computer Science, Princeton University, NJ, USA
Conrado Martínez
  • Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona, Spain

Acknowledgements

We thank the anonymous reviewers, who carefully read the submitted article. Their remarks and suggested changes allowed us to improve the paper and correct a couple of flawed arguments.

Cite As

Jérémie Lumbroso and Conrado Martínez. Affirmative Sampling: Theory and Applications. In 33rd International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 225, pp. 12:1-12:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.AofA.2022.12

Abstract

Affirmative Sampling is a practical, efficient, and novel algorithm for obtaining random samples of distinct elements from a data stream. Its most salient feature is that the size S of the sample grows, in expectation, with the (unknown) number n of distinct elements in the data stream. Because every distinct element has the same probability of being sampled, and the sample size is greater when the "diversity" (the number of distinct elements) is greater, the samples that Affirmative Sampling delivers are more representative than those produced by any scheme in which the sample size is fixed a priori; hence its name. Our algorithm is straightforward to implement, and several implementations already exist.
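The abstract does not spell out the update rule, so the following is only a minimal sketch under stated assumptions, not the authors' reference implementation. It assumes that each distinct element is hashed to a pseudo-random value, that an element whose hash exceeds the k-th largest hash in the sample is added (so the sample grows), that an element whose hash exceeds only the smallest hash in the sample replaces that smallest element, and that anything else is discarded. The names `hash01` and `affirmative_sample`, the parameter `k`, and the choice of SHA-256 are all hypothetical and serve illustration only.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-random value in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def affirmative_sample(stream, k):
    """Return a set of distinct items sampled from `stream` (assumed rule, see text)."""
    top = []      # min-heap holding the k largest hash values currently in the sample
    rest = []     # min-heap holding every other hash value in the sample
    sample = {}   # item -> hash value, used for membership tests
    for item in stream:
        if item in sample:
            continue  # repetitions of a sampled element change nothing
        y = hash01(item)
        if len(top) < k:
            # The first k distinct elements are always kept.
            heapq.heappush(top, (y, item))
            sample[item] = y
        elif y > top[0][0]:
            # Beats the k-th largest hash: the sample grows by one element.
            moved = heapq.heappushpop(top, (y, item))
            heapq.heappush(rest, moved)
            sample[item] = y
        elif rest and y > rest[0][0]:
            # Beats only the smallest hash: replace it, sample size unchanged.
            _, evicted = heapq.heapreplace(rest, (y, item))
            del sample[evicted]
            sample[item] = y
        # Otherwise the element is discarded.
    return set(sample)
```

Under these assumptions the outcome depends only on the order in which distinct elements first appear, so repeating an element any number of times leaves the sample unchanged. For example, `affirmative_sample(words, 8)` on a list of strings `words` returns at least `min(8, n)` distinct words, where `n` is the number of distinct words.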

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
  • Theory of computation → Design and analysis of algorithms
  • Theory of computation → Sketching and sampling
Keywords
  • Data streams
  • Distinct sampling
  • Random sampling
  • Cardinality estimation
  • Analysis of algorithms

