Maximum Coverage in Sublinear Space, Faster

Authors: Stephen Jaud, Anthony Wirth, Farhana Choudhury





Author Details

Stephen Jaud
  • School of Computing and Information Systems, The University of Melbourne, Australia
Anthony Wirth
  • School of Computing and Information Systems, The University of Melbourne, Australia
Farhana Choudhury
  • School of Computing and Information Systems, The University of Melbourne, Australia

Acknowledgements

Rowan Warneke, for reading and advising on an earlier version.

Cite As

Stephen Jaud, Anthony Wirth, and Farhana Choudhury. Maximum Coverage in Sublinear Space, Faster. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 21:1-21:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.SEA.2023.21

Abstract

Given a collection of m sets from a universe 𝒰, the Maximum Set Coverage problem consists of finding k sets whose union has the largest cardinality. This problem is NP-hard, but the solution can be approximated by a polynomial-time algorithm up to a factor 1-1/e. However, this algorithm does not scale well with the input size. In a streaming context, practical high-quality solutions are found, but with space complexity that scales linearly with respect to the size of the universe n = |𝒰|. However, one randomized streaming algorithm has been shown to produce a (1-1/e-ε)-approximation of the optimal solution with a space complexity that scales only poly-logarithmically with respect to m and n. In order to achieve such a low space complexity, the authors used two techniques in their multi-pass approach:
  • F₀-sketching, which estimates with great accuracy the number of distinct elements in a set using less space than the set itself.
  • Subsampling, which consists of solving the problem only on a subspace of the universe. It is implemented using γ-independent hash functions.
This article focuses on the sublinear-space algorithm and highlights the time cost of these two techniques, especially subsampling. We present optimizations that significantly reduce the time complexity of the algorithm. Firstly, we give some optimizations that do not alter the space complexity, number of passes, or approximation quality of the original algorithm. In particular, we reanalyze the error bounds to show that the original independence factor of Ω(ε^{-2} k log m) can be fine-tuned to Ω(k log m); we also show how F₀-sketching can be removed. Secondly, we derive a new lower bound for the probability of producing a (1-1/e-ε)-approximation using only pairwise independence: 1 - 4/(c k log m), compared to 1 - 2e/m^{ck/6} with Ω(k log m)-independence. Although the theoretical guarantees are weaker, suggesting that the approximation quality would suffer, for large streams our algorithms perform well in practice. Finally, our experimental results show that even a pairwise-independent hash-function sampler does not produce worse solutions than the original algorithm, while running faster by several orders of magnitude.
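
As an illustration of the subsampling idea described above, the following minimal Python sketch draws a pairwise-independent hash function, keeps only the universe elements whose hash falls below a threshold, and runs the classical offline greedy heuristic on the reduced sets. It is not the authors' streaming implementation. The names `make_pairwise_hash`, `subsample`, and `greedy_max_coverage`, the Mersenne-prime modulus, and the sampling rate are assumptions chosen for this example, and elements are assumed to be non-negative integers smaller than the modulus.

```python
import random

# Field size for the hash family; a Mersenne prime is an assumption of this sketch.
PRIME = (1 << 61) - 1


def make_pairwise_hash(rng):
    """Draw h(x) = (a*x + b) mod PRIME with a, b uniform in [0, PRIME).

    For distinct inputs below PRIME this family is pairwise independent.
    """
    a = rng.randrange(PRIME)
    b = rng.randrange(PRIME)
    return lambda x: (a * x + b) % PRIME


def subsample(sets, rate, rng):
    """Keep an element iff its hash is below rate * PRIME.

    One hash function is shared by all sets, so an element is either
    kept everywhere or dropped everywhere.
    """
    h = make_pairwise_hash(rng)
    threshold = int(rate * PRIME)
    return [{x for x in s if h(x) < threshold} for s in sets]


def greedy_max_coverage(sets, k):
    """Classical greedy: repeatedly pick the set with the largest marginal gain."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds a new element
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


if __name__ == "__main__":
    rng = random.Random(0)
    sets = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6}, {1, 6, 7, 8, 9}]
    sampled = subsample(sets, 0.5, rng)      # solve on a sampled sub-universe
    chosen, _ = greedy_max_coverage(sampled, 2)
    full_cover = set().union(*(sets[i] for i in chosen))
    print(chosen, len(full_cover))           # coverage measured on the full universe
```

The indices chosen on the sampled instance are evaluated against the original sets, which mirrors the general idea of selecting sets on a smaller sub-universe and then judging the solution on the full one.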

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • streaming algorithms
  • subsampling
  • maximum set cover
  • k-wise independent hash functions
