Engineering Zuffix Arrays

Authors: Paolo Boldi, Stefano Marchini, Sebastiano Vigna




File

LIPIcs.SEA.2024.2.pdf
  • Filesize: 0.92 MB
  • 18 pages

Author Details

Paolo Boldi
  • Università degli Studi di Milano, Italy
Stefano Marchini
  • Università degli Studi di Milano, Italy
Sebastiano Vigna
  • Università degli Studi di Milano, Italy

Cite As

Paolo Boldi, Stefano Marchini, and Sebastiano Vigna. Engineering Zuffix Arrays. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 2:1-2:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SEA.2024.2

Abstract

Searching for patterns in long strings is a classical algorithmic problem with countless practical applications. Suffix trees and suffix arrays (and their variants) are a long-established solution that yields linear-time search (in the size of the pattern). In [Paolo Boldi and Sebastiano Vigna, 2018] it is shown that a z-map gadget can be attached to (enhanced) suffix arrays to improve their theoretical query time, obtaining a data structure called a zuffix array. The main contribution of this paper is to show that a carefully engineered implementation of the z-map gadget does provide significant speedups over enhanced suffix arrays on real-world datasets, at the cost of doubling the required space. In particular, for large alphabets we observe a sevenfold improvement in query time with respect to enhanced suffix arrays; even in the worst case (small alphabets), the query time is almost halved. Thus, zuffix arrays provide a very interesting new point in the space-time tradeoff spectrum.
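
For readers unfamiliar with the baseline, the following is a minimal sketch of pattern search on a plain suffix array. It is not the zuffix array or the z-map gadget described in the paper, and it is not an enhanced suffix array: it uses naive construction by sorting and a binary search costing O(m log n) character comparisons for a pattern of length m, rather than the linear-time search the abstract refers to. The function names (build_suffix_array, find_interval, occurrences) are hypothetical and chosen only for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the starting positions of all suffixes in lexicographic order.

    Naive construction by sorting the suffixes themselves; this is for
    illustration only and is far slower than linear-time constructions.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_interval(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return the half-open interval [start, end) of suffix-array entries
    whose suffixes begin with `pattern`, using two binary searches."""
    m = len(pattern)

    # Lower bound: first entry whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first entry whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def occurrences(text: str, pattern: str) -> list[int]:
    """Return the sorted starting positions of `pattern` in `text`."""
    sa = build_suffix_array(text)
    start, end = find_interval(text, sa, pattern)
    return sorted(sa[start:end])


if __name__ == "__main__":
    print(occurrences("banana", "ana"))
```

The search works because truncating lexicographically sorted suffixes to their first m characters keeps them in non-decreasing order, so all suffixes that start with the pattern occupy one contiguous interval of the array.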

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
Keywords
  • Suffix trees
  • suffix arrays
  • z-fast tries


References

  1. Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2(1):53-86, 2004.
  2. Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 785-794, New York, 2009. ACM Press.
  3. Djamal Belazzougui, Paolo Boldi, and Sebastiano Vigna. Dynamic z-fast tries. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings, volume 6393 of Lecture Notes in Computer Science, pages 159-172. Springer, 2010.
  4. Jean Berstel. Fibonacci words - A survey. In G. Rozenberg and A. Salomaa, editors, The Book of L, pages 13-27. Springer-Verlag, 1986.
  5. Paolo Boldi and Sebastiano Vigna. Kings, name days, lazy servants and magic. In Hiro Ito, Stefano Leonardi, Linda Pagli, and Giuseppe Prencipe, editors, 9th International Conference on Fun with Algorithms (FUN 2018), volume 100 of Leibniz International Proceedings in Informatics (LIPIcs), pages 10:1-10:13, Dagstuhl, Germany, 2018. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  6. Manuel Cáceres and Gonzalo Navarro. Faster repetition-aware compressed suffix trees based on block trees. In Nieves R. Brisaboa and Simon J. Puglisi, editors, String Processing and Information Retrieval - 26th International Symposium, SPIRE 2019, Segovia, Spain, October 7-9, 2019, Proceedings, volume 11811 of Lecture Notes in Computer Science, pages 434-451. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-32686-9_31.
  7. Guy Castagnoli, Stefan Brauer, and Martin Herrmann. Optimization of cyclic redundancy-check codes with 24 and 32 parity bits. IEEE Trans. Commun., 41(6):883-892, 1993. URL: https://doi.org/10.1109/26.231911.
  8. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, July 2005.
  9. Paolo Ferragina and Gonzalo Navarro. The pizza & chili corpus, 2007. URL: http://pizzachili.dcc.uchile.cl/texts.html.
  10. Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci., 410(51):5354-5364, 2009. URL: https://doi.org/10.1016/J.TCS.2009.09.012.
  11. Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. ACM Trans. Algorithms, 8(1):4:1-4:22, January 2012.
  12. Jean-Loup Gailly and Mark Adler. Zlib compression library. Technical report, Apollo - University of Cambridge Repository, 2004. URL: http://www.dspace.cam.ac.uk/handle/1810/3486.
  13. Ilya Grebnov. libsais. https://github.com/IlyaGrebnov/libsais, 2021.
  14. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2005.
  15. Shay Gueron and Michael E. Kounavis. Efficient implementation of the Galois counter mode using a carry-less multiplier and a fast reduction algorithm. Inf. Process. Lett., 110(14-15):549-553, 2010. URL: https://doi.org/10.1016/J.IPL.2010.04.011.
  16. Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918-936, 2006.
  17. Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, second edition, 1997.
  18. Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. J. Discrete Algorithms, 3(2-4):143-156, 2005.
  19. Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
  20. Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers, 60(10):1471-1484, 2011.
  21. Enno Ohlebusch, Johannes Fischer, and Simon Gog. CST++. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings, volume 6393 of Lecture Notes in Computer Science, pages 322-333. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-16321-0_34.
  22. Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. J. Algorithms, 48(2):294-313, 2003. URL: https://doi.org/10.1016/S0196-6774(03)00087-7.
  23. Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589-607, 2007. URL: https://doi.org/10.1007/S00224-006-1198-X.
  24. Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT'08. IEEE Conference Record of 14th Annual Symposium on, pages 1-11. IEEE, 1973.