Trie-Compressed Adaptive Set Intersection

Authors: Diego Arroyuelo, Juan Pablo Castillo




Author Details

Diego Arroyuelo
  • Departamento de Informática, Universidad Técnica Federico Santa María, Santiago, Chile
  • Millennium Institute for Foundational Research on Data, Santiago, Chile
Juan Pablo Castillo
  • Departamento de Informática, Universidad Técnica Federico Santa María, Santiago, Chile
  • Millennium Institute for Foundational Research on Data, Santiago, Chile

Acknowledgements

We thank Gonzalo Navarro, Cristian Riveros, Adrián Gómez-Brandón, and Francesco Tosoni for enlightening comments, suggestions, and discussions about this work. We also thank the anonymous reviewers whose thorough reviews helped us to improve this paper.

Cite As

Diego Arroyuelo and Juan Pablo Castillo. Trie-Compressed Adaptive Set Intersection. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 1:1-1:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.1

Abstract

We introduce space- and time-efficient algorithms and data structures for the offline set intersection problem. We show that a sorted integer set S ⊆ [0..u) of n elements can be represented using compressed space while supporting k-way intersections in adaptive O(kδ lg(u/δ)) time, where δ is the alternation measure introduced by Barbay and Kenyon. Our experimental results suggest that our approaches are competitive in practice, outperforming the most efficient alternatives (Partitioned Elias-Fano indexes, Roaring Bitmaps, and Recursive Universe Partitioning (RUP)) in several scenarios and offering, in general, relevant space-time trade-offs.
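
To make the idea of trie-based k-way intersection concrete, here is a minimal sketch that is not taken from the paper. It assumes u = 2^bits, stores each set as a plain pointer-based binary trie (no compression), and intersects k tries by descending only into branches present in all of them. The names Node, build_trie, and intersect are hypothetical and chosen for illustration; the authors' compressed representation and their exact algorithm may differ.

```python
from typing import Iterable, List, Optional


class Node:
    """Binary trie node; child[b] is the subtree for next bit b (assumed layout)."""
    __slots__ = ("child",)

    def __init__(self) -> None:
        self.child: List[Optional["Node"]] = [None, None]


def build_trie(values: Iterable[int], bits: int) -> Node:
    """Insert each value in [0, 2**bits) as a root-to-leaf path of its bits."""
    root = Node()
    for v in values:
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{v} is outside the universe [0, 2**{bits})")
        node = root
        for shift in range(bits - 1, -1, -1):  # most significant bit first
            b = (v >> shift) & 1
            if node.child[b] is None:
                node.child[b] = Node()
            node = node.child[b]
    return root


def intersect(tries: List[Node], bits: int) -> List[int]:
    """Return the sorted intersection of the sets stored in the given tries."""
    out: List[int] = []

    def walk(nodes: List[Node], depth: int, prefix: int) -> None:
        if depth == bits:  # a full-length path shared by every trie
            out.append(prefix)
            return
        for b in (0, 1):
            kids = [n.child[b] for n in nodes]
            # Prune: if any trie lacks this branch, no common element lies below.
            if all(k is not None for k in kids):
                walk(kids, depth + 1, (prefix << 1) | b)

    if tries:
        walk(tries, 0, 0)
    return out


a = build_trie([1, 3, 4, 7, 12], bits=4)
b = build_trie([3, 4, 5, 12, 15], bits=4)
c = build_trie([0, 3, 12, 13], bits=4)
print(intersect([a, b, c], bits=4))  # [3, 12]
```

In this sketch the traversal abandons a subtree as soon as one of the k tries has no elements in the corresponding interval of the universe, which gives some intuition for why the running time can depend on how the input sets interleave rather than only on their sizes.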

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Design and analysis of algorithms
  • Theory of computation → Data structures and algorithms for data management
  • Information systems → Information retrieval query processing
Keywords
  • Set intersection problem
  • Adaptive Algorithms
  • Compressed and compact data structures


References

  1. The Lemur Project. https://lemurproject.org/. Accessed March 14, 2023.
  2. Roaring bitmaps. https://github.com/RoaringBitmap/CRoaring. Accessed March 14, 2023.
  3. Roaring bitmaps: A better compressed bitset. https://roaringbitmap.org/. Accessed March 14, 2023.
  4. A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
  5. V. Ngoc Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1):151-166, 2005.
  6. D. Arroyuelo, P. Davoodi, and S. Rao Satti. Succinct dynamic cardinal trees. Algorithmica, 74(2):742-777, 2016.
  7. D. Arroyuelo, J. Fuentes-Sepúlveda, and D. Seco. Three success stories about compact data structures. Communications of the ACM, 63(11):64-65, 2020.
  8. D. Arroyuelo and R. Raman. Adaptive succinctness. Algorithmica, 84(3):694-718, 2022.
  9. R. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 3109, pages 400-408. Springer, 2004.
  10. R. Baeza-Yates and A. Salinger. Experimental analysis of a fast intersection algorithm for sorted sequences. In Proc. 12th International Conference on String Processing and Information Retrieval (SPIRE), LNCS 3772, pages 13-24. Springer, 2005.
  11. J. Barbay and C. Kenyon. Adaptive intersection and t-threshold problems. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 390-399. ACM/SIAM, 2002.
  12. J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Transactions on Algorithms, 4(1):4:1-4:18, 2008. URL: https://doi.org/10.1145/1328911.1328915.
  13. David Benoit, Erik D. Demaine, J. Ian Munro, Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Representing trees of higher degree. Algorithmica, 43(4):275-292, 2005. URL: https://doi.org/10.1007/s00453-004-1146-6.
  14. P. Bille, A. Pagh, and R. Pagh. Fast evaluation of union-intersection expressions. In Proc. 18th International Symposium on Algorithms and Computation (ISAAC), LNCS 4835, pages 739-750. Springer, 2007.
  15. S. Büttcher, C. Clarke, and G. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.
  16. D. Clark. Compact PAT trees. PhD thesis, University of Waterloo, 1997.
  17. C. Clarke, F. Scholer, and I. Soboroff. TREC terabyte track. https://www-nlpir.nist.gov/projects/terabyte/. Accessed March 14, 2023.
  18. H. Cohen and E. Porat. Fast set intersection and two-patterns matching. Theoretical Computer Science, 411(40-42):3795-3800, 2010. URL: https://doi.org/10.1016/j.tcs.2010.06.002.
  19. J. Dean. Challenges in building large-scale information retrieval systems: invited talk. In Proc. 2nd ACM International Conference on Web Search and Data Mining (WSDM'09), pages 1-1, 2009.
  20. E. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 743-752. ACM/SIAM, 2000.
  21. B. Ding and A. König. Fast set intersection in memory. Proc. VLDB Endowment, 4(4):255-266, 2011. URL: https://doi.org/10.14778/1938545.1938550.
  22. P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194-203, 1975.
  23. R. Elmasri and S. B. Navathe. Fundamentals of Database Systems, 6th Edition. Pearson, 2011.
  24. L. Foschini, R. Grossi, A. Gupta, and J. S. Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms, 2(4):611-639, 2006.
  25. A. S. Fraenkel and S. T. Klein. Robust universal complete codes for transmission and compression. Discrete Applied Mathematics, 64(1):31-55, 1996. URL: https://doi.org/10.1016/0166-218X(93)00116-H.
  26. T. Gagie, G. Navarro, and S. J. Puglisi. New algorithms on wavelet trees and applications to information retrieval. Theoretical Computer Science, 426:25-41, 2012.
  27. S. Gog and M. Petri. Optimized succinct data structures for massive data. Software: Practice and Experience, 44(11):1287-1314, 2014.
  28. R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841-850. ACM/SIAM, 2003.
  29. A. Gupta, W.-K. Hon, R. Shah, and J. S. Vitter. Compressed data structures: Dictionaries and data-aware measures. Theoretical Computer Science, 387(3):313-331, 2007.
  30. G. Jacobson. Space-efficient static trees and graphs. In Proc. 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 549-554. IEEE Computer Society, 1989. URL: https://doi.org/10.1109/SFCS.1989.63533.
  31. S. T. Klein and D. Shapira. Searching in compressed dictionaries. In Proc. Data Compression Conference (DCC), page 142. IEEE Computer Society, 2002.
  32. F. Kurpicz. Engineering compact data structures for rank and select queries on bit vectors. In Proc. 29th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 13617, pages 257-272. Springer, 2022.
  33. R. M. Layer and A. R. Quinlan. A parallel algorithm for n-way interval set intersection. Proc. IEEE, 105(3):542-551, 2017. URL: https://doi.org/10.1109/JPROC.2015.2461494.
  34. D. Lemire. Document identifier data set. https://lemire.me/data/integercompression2014.html. Accessed March 14, 2023.
  35. D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1-29, 2015.
  36. D. Lemire, O. Kaser, N. Kurz, L. Deri, C. O'Hara, F. Saint-Jacques, and G. Ssi Yan Kai. Roaring bitmaps: Implementation of an optimized software library. Software: Practice & Experience, 48(4):867-895, 2018. URL: https://doi.org/10.1002/spe.2560.
  37. J. Lin, J. Mackenzie, C. Kamphuis, C. Macdonald, A. Mallia, M. Siedlaczek, A. Trotman, and A. de Vries. Supporting interoperability between open-source search engines with the common index file format, 2020. URL: https://doi.org/10.48550/ARXIV.2003.08276.
  38. J. M. Mackenzie, R. Benham, M. Petri, J. R. Trippas, J. S. Culpepper, and A. Moffat. CC-News-En: A large English news corpus. In Proc. 29th ACM International Conference on Information and Knowledge Management (CIKM), pages 3077-3084. ACM, 2020.
  39. A. Mallia, M. Siedlaczek, J. Mackenzie, and T. Suel. PISA: performant indexes and search for academia. In Proc. of the Open-Source IR Replicability Challenge co-located with 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-56, 2019.
  40. A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25-47, 2000.
  41. G. Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016.
  42. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. of 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-70, 2007.
  43. G. Ottaviano and R. Venturini. Partitioned Elias-Fano indexes. In Proc. of 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 273-282, 2014.
  44. G. E. Pibiri. Sliced indices. https://github.com/jermp/s_indexes. Accessed March 14, 2023.
  45. G. E. Pibiri. Fast and compact set intersection through recursive universe partitioning. In Proc. Data Compression Conference (DCC), pages 293-302. IEEE, 2021.
  46. G. E. Pibiri and R. Venturini. Techniques for inverted index compression. ACM Computing Surveys, 53(6):125:1-125:36, 2021. URL: https://doi.org/10.1145/3415148.
  47. R. Raman, V. Raman, and S. Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4):43, 2007.
  48. F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conference on IR Research (ECIR), LNCS 4425, pages 101-112. Springer, 2007.
  49. A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. SIMD-based decoding of posting lists. In Proc. 20th ACM International Conference on Information and Knowledge Management (CIKM'11), pages 317-326, 2011.
  50. L. Trabb-Pardo. Set Representation and Set Intersection. PhD thesis, STAN-CS-78-681, Department of Computer Science, Stanford University, 1978. D. E. Knuth, advisor.
  51. T. L. Veldhuizen. Triejoin: A simple, worst-case optimal join algorithm. In Nicole Schweikardt, Vassilis Christophides, and Vincent Leroy, editors, Proc. 17th International Conference on Database Theory (ICDT), pages 96-106. OpenProceedings.org, 2014.
  52. I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd Edition. Morgan Kaufmann, 1999.
  53. H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. 18th International Conference on World Wide Web (WWW), pages 401-410, 2009.
  54. J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. 17th International Conference on World Wide Web (WWW), pages 387-396, 2008.
  55. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2):6, 2006. URL: https://doi.org/10.1145/1132956.1132959.