Faster Block Tree Construction

Authors Dominik Köppl , Florian Kurpicz , Daniel Meyer



PDF
Thumbnail PDF

File

LIPIcs.ESA.2023.74.pdf
  • Filesize: 1.04 MB
  • 20 pages

Document Identifiers

Author Details

Dominik Köppl
  • Department of Computer Science, University of Muenster, Germany
Florian Kurpicz
  • Karlsruhe Institute of Technology, Germany
Daniel Meyer
  • Karlsruhe Institute of Technology, Germany

Cite AsGet BibTex

Dominik Köppl, Florian Kurpicz, and Daniel Meyer. Faster Block Tree Construction. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 74:1-74:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ESA.2023.74

Abstract

The block tree [Belazzougui et al. J. Comput. Syst. Sci. '21] is a compressed text index that can answer access (extract a character at a position), rank (number of occurrences of a specified character in a prefix of the text), and select (size of smallest prefix such that a specified character has a specified rank) queries. It requires O(zlog(n/z)) words of space, where z is the number of Lempel-Ziv factors of the text. For some highly repetitive inputs, a block tree can require as little as 0.015 bits per character of the text. Small values of z make the block tree a space-efficient alternative to the wavelet tree, which is another index for these three types of queries. While wavelet trees can be constructed fast in practice, up so far compressed versions of the wavelet tree only leverage statistical compression, meaning that they are blind to spaced repetitions. To make block trees usable in practice, a first step is to find ways in constructing them efficiently. We address this problem by presenting a practically efficient construction algorithm for block trees, which is up to an order of magnitude faster than previous implementations. Additionally, we parallelize our implementation, making it the first block tree construction implementation that works in parallel in shared memory.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Pattern matching
Keywords
  • compressed data structure
  • block tree
  • Lempel-Ziv compression
  • longest previous factor array
  • rank and select

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975. URL: https://doi.org/10.1145/360825.360855.
  2. Anisa Al-Hafeedh, Maxime Crochemore, Lucian Ilie, Evguenia Kopylova, William F. Smyth, German Tischler, and Munina Yusufu. A comparison of index-based Lempel-Ziv LZ77 factorization algorithms. ACM Comput. Surv., 45(1):5:1-5:17, 2012. URL: https://doi.org/10.1145/2379776.2379781.
  3. Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing all distinct squares in linear time for integer alphabets. In CPM, volume 78 of LIPIcs, pages 22:1-22:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPICS.CPM.2017.22.
  4. Jérémy Barbay, Francisco Claude, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich. Efficient fully-compressed sequence representations. Algorithmica, 69(1):232-268, 2014. URL: https://doi.org/10.1007/S00453-012-9726-3.
  5. Jérémy Barbay, Meng He, J. Ian Munro, and Srinivasa Rao Satti. Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms, 7(4):52:1-52:27, 2011. URL: https://doi.org/10.1145/2000807.2000820.
  6. Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees. J. Comput. Syst. Sci., 117:1-22, 2021. URL: https://doi.org/10.1016/j.jcss.2020.11.002.
  7. Djamal Belazzougui, Patrick Hagge Cording, Simon J. Puglisi, and Yasuo Tabei. Access, rank, and select in grammar-compressed strings. In ESA, volume 9294 of Lecture Notes in Computer Science, pages 142-154. Springer, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_13.
  8. Djamal Belazzougui, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Queries on lz-bounded encodings. In DCC, pages 83-92. IEEE, 2015. URL: https://doi.org/10.1109/DCC.2015.69.
  9. Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms, 11(4):31:1-31:21, 2015. URL: https://doi.org/10.1145/2629339.
  10. Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. A learned approach to design compressed rank/select data structures. ACM Trans. Algorithms, 18(3):24:1-24:28, 2022. URL: https://doi.org/10.1145/3524060.
  11. Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, and Gonzalo Navarro. Two-dimensional block trees. In DCC, pages 227-236. IEEE, 2018. URL: https://doi.org/10.1109/DCC.2018.00031.
  12. Manuel Cáceres and Gonzalo Navarro. Faster repetition-aware compressed suffix trees based on block trees. Inf. Comput., 285(Part):104749, 2022. URL: https://doi.org/10.1016/J.IC.2021.104749.
  13. Matteo Ceregini, Florian Kurpicz, and Rossano Venturini. Faster wavelet trees with quad vectors. CoRR, abs/2302.09239, 2023. URL: https://doi.org/10.48550/arXiv.2302.09239.
  14. Yu-Feng Chien, Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Geometric BWT: compressed text indexing via sparse suffixes and range searching. Algorithmica, 71(2):258-278, 2015. URL: https://doi.org/10.1007/S00453-013-9792-1.
  15. Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, and Gonzalo Navarro. Universal indexes for highly repetitive document collections. Inf. Syst., 61:1-23, 2016. URL: https://doi.org/10.1016/J.IS.2016.04.002.
  16. Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In SPIRE, volume 7608 of Lecture Notes in Computer Science, pages 180-192. Springer, 2012. URL: https://doi.org/10.1007/978-3-642-34109-0_19.
  17. Maxime Crochemore and Lucian Ilie. Computing longest previous factor in linear time and applications. Inf. Process. Lett., 106(2):75-80, 2008. URL: https://doi.org/10.1016/J.IPL.2007.10.006.
  18. Patrick Dinklage, Jonas Ellert, Johannes Fischer, Florian Kurpicz, and Marvin Löbel. Practical wavelet tree construction. ACM J. Exp. Algorithmics, 26:1.8:1-1.8:67, 2021. URL: https://doi.org/10.1145/3457197.
  19. Patrick Dinklage, Johannes Fischer, and Florian Kurpicz. Constructing the wavelet tree and wavelet matrix in distributed memory. In ALENEX, pages 214-228. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611976007.17.
  20. Patrick Dinklage, Johannes Fischer, Florian Kurpicz, and Jan-Philipp Tarnowski. Bit-parallel (compressed) wavelet tree construction. In DCC, pages 81-90. IEEE, 2023. URL: https://doi.org/10.1109/DCC55655.2023.00016.
  21. Jonas Ellert and Florian Kurpicz. Parallel external memory wavelet tree and wavelet matrix construction. In SPIRE, volume 11811 of Lecture Notes in Computer Science, pages 392-406. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-32686-9_28.
  22. Paolo Ferragina, Raffaele Giancarlo, and Giovanni Manzini. The myriad virtues of wavelet trees. Inf. Comput., 207(8):849-866, 2009. URL: https://doi.org/10.1016/J.IC.2008.12.010.
  23. Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. An alphabet-friendly fm-index. In SPIRE, volume 3246 of Lecture Notes in Computer Science, pages 150-160. Springer, 2004. URL: https://doi.org/10.1007/978-3-540-30213-1_23.
  24. Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. Compressing and querying integer dictionaries under linearities and repetitions. IEEE Access, 10:118831-118848, 2022. URL: https://doi.org/10.1109/ACCESS.2022.3221520.
  25. Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci., 372(1):115-121, 2007. URL: https://doi.org/10.1016/J.TCS.2006.12.012.
  26. Johannes Fischer, Florian Kurpicz, and Marvin Löbel. Simple, fast and lightweight parallel wavelet tree construction. In ALENEX, pages 9-20. SIAM, 2018. URL: https://doi.org/10.1137/1.9781611975055.2.
  27. Frantisek Franek, Jan Holub, William F. Smyth, and Xiangdong Xiao. Computing quasi suffix arrays. J. Autom. Lang. Comb., 8(4):593-606, 2003. URL: https://doi.org/10.25596/JALC-2003-593.
  28. José Fuentes-Sepúlveda, Erick Elejalde, Leo Ferres, and Diego Seco. Parallel construction of wavelet trees on multicore architectures. Knowl. Inf. Syst., 51(3):1043-1066, 2017. URL: https://doi.org/10.1007/s10115-016-1000-6.
  29. Moses Ganardi, Artur Jez, and Markus Lohrey. Balancing straight-line programs. In FOCS, pages 1169-1183. IEEE Computer Society, 2019. URL: https://doi.org/10.1109/FOCS.2019.00073.
  30. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In SEA, volume 8504 of Lecture Notes in Computer Science, pages 326-337. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
  31. Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Softw. Pract. Exp., 44(11):1287-1314, 2014. URL: https://doi.org/10.1002/SPE.2198.
  32. Alexander Golynski, J. Ian Munro, and S. Srinivasa Rao. Rank/select operations on large alphabets: a tool for text indexing. In SODA, pages 368-373. ACM Press, 2006. Google Scholar
  33. Alexander Golynski, Rajeev Raman, and S. Srinivasa Rao. On the redundancy of succinct data structures. In SWAT, volume 5124 of Lecture Notes in Computer Science, pages 148-159. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-69903-3_15.
  34. Rodrigo González, Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro. Practical implementation of rank and select queries. In WEA, pages 27-38, 2005. Google Scholar
  35. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In SODA, pages 841-850. ACM/SIAM, 2003. Google Scholar
  36. Roberto Grossi, Alessio Orlandi, and Rajeev Raman. Optimal trade-offs for succinct string indexes. In ICALP (1), volume 6198 of Lecture Notes in Computer Science, pages 678-689. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-14165-2_57.
  37. Roberto Grossi, Jeffrey Scott Vitter, and Bojian Xu. Wavelet trees: From theory to practice. In CCP, pages 210-221. IEEE Computer Society, 2011. URL: https://doi.org/10.1109/CCP.2011.16.
  38. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987. URL: https://doi.org/10.1147/RD.312.0249.
  39. Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In STOC, pages 827-840. ACM, 2018. URL: https://doi.org/10.1145/3188745.3188814.
  40. Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In LATIN, volume 12118 of Lecture Notes in Computer Science, pages 207-219. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-61792-9_17.
  41. Dominik Köppl, Gonzalo Navarro, and Nicola Prezza. HOLZ: high-order entropy encoding of lempel-ziv factor distances. In DCC, pages 83-92. IEEE, 2022. URL: https://doi.org/10.1109/DCC52660.2022.00016.
  42. Florian Kurpicz. pasta::block_tree_experiments. URL: https://doi.org/10.5281/zenodo.8114299.
  43. Florian Kurpicz. Engineering compact data structures for rank and select queries on bit vectors. In SPIRE, volume 13617 of Lecture Notes in Computer Science, pages 257-272. Springer, 2022. URL: https://doi.org/10.1007/978-3-031-20643-6_19.
  44. Florian Kurpicz and Daniel Meyer. pasta::block_tree. URL: https://doi.org/10.5281/zenodo.8114255.
  45. Julian Labeit, Julian Shun, and Guy E. Blelloch. Parallel lightweight wavelet tree, suffix array and fm-index construction. J. Discrete Algorithms, 43:2-17, 2017. URL: https://doi.org/10.1016/j.jda.2017.04.001.
  46. Tobias Maier, Peter Sanders, and Roman Dementiev. Concurrent hash tables: Fast and general(?)! ACM Trans. Parallel Comput., 5(4):16:1-16:32, 2019. URL: https://doi.org/10.1145/3309206.
  47. Stefano Marchini and Sebastiano Vigna. Compact fenwick trees for dynamic ranking and selection. Softw. Pract. Exp., 50(7):1184-1202, 2020. URL: https://doi.org/10.1002/spe.2791.
  48. Daniel Meyer. Engineering block trees. Master’s thesis, Karlsruhe Institute of Technology, 2022. Google Scholar
  49. Gonzalo Navarro. A self-index on block trees. In SPIRE, volume 10508 of Lecture Notes in Computer Science, pages 278-289. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-67428-5_24.
  50. Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2022. URL: https://doi.org/10.1145/3434399.
  51. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2-es, 2007. URL: https://doi.org/10.1145/1216370.1216372.
  52. Gonzalo Navarro and Nicola Prezza. Universal compressed text indexing. Theor. Comput. Sci., 762:41-50, 2019. URL: https://doi.org/10.1016/J.TCS.2018.09.007.
  53. Gonzalo Navarro and Eliana Providel. Fast, small, simple rank/select on bitmaps. In SEA, volume 7276 of Lecture Notes in Computer Science, pages 295-306. Springer, 2012. URL: https://doi.org/10.1007/978-3-642-30850-5_26.
  54. Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In ALENEX. SIAM, 2007. URL: https://doi.org/10.1137/1.9781611972870.6.
  55. Mihai Pătraşcu. Succincter. In FOCS, pages 305-313. IEEE Computer Society, 2008. URL: https://doi.org/10.1109/FOCS.2008.83.
  56. Alberto Ordóñez Pereira, Gonzalo Navarro, and Nieves R. Brisaboa. Grammar compressed sequences with rank/select support. J. Discrete Algorithms, 43:54-71, 2017. URL: https://doi.org/10.1016/J.JDA.2016.10.001.
  57. Giulio Ermanno Pibiri and Shunsuke Kanda. Rank/select queries over mutable bitmaps. Inf. Syst., 99:101756, 2021. URL: https://doi.org/10.1016/j.is.2021.101756.
  58. Nicola Prezza. Optimal rank and select queries on dictionary-compressed text. In CPM, volume 128 of LIPIcs, pages 4:1-4:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPICS.CPM.2019.4.
  59. Nicola Prezza and Giovanna Rosone. Faster online computation of the succinct longest previous factor array. In CiE, volume 12098 of Lecture Notes in Computer Science, pages 339-352. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-51466-2_31.
  60. Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4):43, 2007. URL: https://doi.org/10.1145/1290672.1290680.
  61. Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685-709, 2013. URL: https://doi.org/10.1007/s00453-012-9618-6.
  62. Julian Shun and Fuyao Zhao. Practical parallel Lempel-Ziv factorization. In DCC, pages 123-132. IEEE, 2013. URL: https://doi.org/10.1109/DCC.2013.20.
  63. Zachary Stephens, Skylar Lee, Faraz Faghri, Roy Campbell, Chengxiang Zhai, Miles Efron, Ravishankar Iyer, Michael Schatz, Saurabh Sinha, and Gene Robinson. Big data: Astronomical or genomical? PLoS biology, 13(7):1-11, 2015. Google Scholar
  64. Sebastiano Vigna. Broadword implementation of rank/select queries. In WEA, volume 5038 of Lecture Notes in Computer Science, pages 154-168. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-68552-4_12.
  65. Dong Zhou, David G. Andersen, and Michael Kaminsky. Space-efficient, high-performance rank and select structures on uncompressed bit sequences. In SEA, volume 7933 of Lecture Notes in Computer Science, pages 151-163. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-38527-8_15.
  66. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977. URL: https://doi.org/10.1109/TIT.1977.1055714.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail