Compressing Suffix Trees by Path Decompositions

Becker, Ruben; Cenzato, Davide; Gagie, Travis; Groot Koerkamp, Ragnar; Kim, Sung-Hwan; Manzini, Giovanni; Prezza, Nicola

doi:10.4230/LIPIcs.ICALP.2026.24

Abstract

The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the r-index (Gagie et al., JACM 2020), fit in space proportional to r, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came at a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower than their uncompressed counterparts in practice due to catastrophically poor cache locality. This reality gap highlights that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, leaving a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation?
We answer this in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths (each corresponding to a string suffix) is bounded by r. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most r, strictly improving upon the (tight) 2r bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique.
Experiments confirm that this theoretical I/O efficiency translates to practice in pangenomic applications: our index locates pattern occurrences using less space and orders of magnitude less time than the r-index when performing pattern matching on repetitive DNA collections. Beyond this, our contributions are twofold: (i) unlike Suffixient Arrays, our technique supports most standard suffix tree operations in O(r) space on top of the text while matching the I/O complexity of uncompressed suffix trees; and (ii) we establish a general framework where any valid path decomposition induces a Suffix Array sampling whose size is a new strong repetitiveness measure; we provide a universal mechanism for locating all pattern occurrences for each such path decomposition.
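To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the STPD-index and not the sampling technique of the paper. It builds a full (unsampled) Suffix Array naively, derives the Burrows-Wheeler transform from it, counts the number of runs r, and locates a pattern by binary search on the Suffix Array. The function names (`suffix_array`, `bwt`, `count_runs`, `locate`) and the `$` sentinel convention are assumptions made for illustration only.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: sort suffix start positions lexicographically.

    Quadratic-time or worse; fine for small illustrative inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text: str) -> str:
    """Burrows-Wheeler transform of `text`.

    Assumption: `text` ends with a unique sentinel '$' that is smaller
    than every other character.
    """
    sa = suffix_array(text)
    # The BWT character of a suffix is the character preceding it
    # (text[-1] wraps around to the sentinel for the suffix at position 0).
    return "".join(text[i - 1] for i in sa)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in `s` (r, when s is a BWT)."""
    return sum(1 for k in range(len(s)) if k == 0 or s[k] != s[k - 1])


def locate(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of `pattern` in `text`, via two binary searches on `sa`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    transformed = bwt(text)
    print(sa, transformed, count_runs(transformed))
    print(locate(text, sa, "ana"))
```

The sketch stores one Suffix Array entry per text position, which is exactly the cost that compressed samplings aim to avoid; it is meant only to show what r measures and what a locate query returns.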

References

Omar Ahmed, Massimiliano Rossi, Sam Kovaka, Michael C Schatz, Travis Gagie, Christina Boucher, and Ben Langmead. Pan-genomic matching statistics for targeted nanopore sequencing. iScience, 24(6), 2021.
Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. Spumoni 2: Improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023.
Ruben Becker, Davide Cenzato, Travis Gagie, Ragnar Groot Koerkamp, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. STPD-index. Software, funded by the European Union (ERC, REGINDEX, 101039208) (visited on 2026-06-19). URL: https://github.com/regindex/STPD-index. Full metadata available at: https://doi.org/10.4230/artifacts.26761.
Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Ragnar Groot Koerkamp, Giovanni Manzini, and Nicola Prezza. Compressing suffix trees by path decompositions. arXiv preprint arXiv:2506.14734, 2025. URL: https://doi.org/10.48550/arXiv.2506.14734.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses. In Claire Mathieu, editor, Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009, pages 785-794. SIAM, 2009. URL: https://doi.org/10.1137/1.9781611973068.86.
Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees. J. Comput. Syst. Sci., 117:1-22, 2021. URL: https://doi.org/10.1016/j.jcss.2020.11.002.
Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, Vienna, Austria, July 23-26, 2024, volume 301 of LIPIcs, pages 1:1-1:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2024. URL: https://doi.org/10.4230/LIPIcs.SEA.2024.1.
Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM J. Comput., 44(3):513-539, 2015. URL: https://doi.org/10.1137/130936889.
Paolo Boldi and Sebastiano Vigna. Kings, name days, lazy servants and magic. In Hiro Ito, Stefano Leonardi, Linda Pagli, and Giuseppe Prencipe, editors, 9th International Conference on Fun with Algorithms, FUN 2018, La Maddalena, Italy, June 13-15, 2018, volume 100 of LIPIcs, pages 10:1-10:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.FUN.2018.10.
Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, and Nicola Prezza. Suffixient arrays: A new efficient suffix array compression technique. arXiv preprint, 2024. URL: https://arxiv.org/abs/2407.18753.
Bernard Chazelle. Filtering search: A new approach to query-answering. SIAM J. Comput., 15(3):703-724, 1986. URL: https://doi.org/10.1137/0215051.
Yu-Feng Chien, Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In Data Compression Conference (DCC 2008), pages 252-261, 2008. URL: https://doi.org/10.1109/DCC.2008.67.
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1-8:39, 2021. URL: https://doi.org/10.1145/3426473.
Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In Liliana Calderón-Benavides, Cristina N. González-Caro, Edgar Chávez, and Nivio Ziviani, editors, String Processing and Information Retrieval - 19th International Symposium, SPIRE 2012, Cartagena de Indias, Colombia, October 21-25, 2012. Proceedings, volume 7608 of Lecture Notes in Computer Science, pages 180-192. Springer, 2012. URL: https://doi.org/10.1007/978-3-642-34109-0_19.
John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Comput. J., 40(2/3):67-75, 1997.
Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik Köppl, Christina Boucher, and Paola Bonizzoni. μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinform., 39(9), 2023. URL: https://doi.org/10.1093/bioinformatics/btad552.
Vinicius T. V. Date and Leandro M. Zatesko. On the near-tightness of χ ≤ 2r: A general σ-ary construction and a binary case via LFSRs. arXiv preprint, 2025. URL: https://arxiv.org/abs/2512.20598.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, Redondo Beach, California, USA, November 12-14, 2000, pages 390-398. IEEE Computer Society, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
Paolo Ferragina and Rossano Venturini. Compressed cache-oblivious String B-tree. ACM Trans. Algorithms, 12(4), August 2016. URL: https://doi.org/10.1145/2903141.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 1459-1477. SIAM, 2018. URL: https://doi.org/10.1137/1.9781611975031.96.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020. URL: https://doi.org/10.1145/3375890.
Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. New indices for text: Pat trees and pat arrays. In William B. Frakes and Ricardo A. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 66-82. Prentice-Hall, 1992.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In F. Frances Yao and Eugene M. Luks, editors, Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, May 21-23, 2000, Portland, OR, USA, pages 397-406. ACM, 2000. URL: https://doi.org/10.1145/335305.335351.
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Gregory Kucherov and Esko Ukkonen, editors, Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22-24, 2009, Proceedings, volume 5577 of Lecture Notes in Computer Science, pages 181-192. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-02441-2_17.
Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 1002-1013. IEEE, 2020. URL: https://doi.org/10.1109/FOCS46700.2020.00097.
Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1877-1886. IEEE, 2023. URL: https://doi.org/10.1109/FOCS57990.2023.00114.
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827-840. ACM, 2018. URL: https://doi.org/10.1145/3188745.3188814.
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074-2092, 2023. URL: https://doi.org/10.1109/TIT.2022.3224382.
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci., 483:115-133, 2013. URL: https://doi.org/10.1016/j.tcs.2012.02.006.
Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings, volume 6393 of Lecture Notes in Computer Science, pages 201-206. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-16321-0_20.
Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory, 22(1):75-81, 1976. URL: https://doi.org/10.1109/TIT.1976.1055501.
Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nord. J. Comput., 12(1):40-66, 2005.
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010. URL: https://doi.org/10.1089/cmb.2009.0169.
Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In David S. Johnson, editor, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1990, San Francisco, California, USA, pages 319-327. SIAM, 1990. URL: http://dl.acm.org/citation.cfm?id=320176.320218.
Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262-272, 1976. URL: https://doi.org/10.1145/321941.321946.
Gonzalo Navarro. Wavelet trees for all. J. Discrete Algorithms, 25:2-20, 2014. URL: https://doi.org/10.1016/j.jda.2013.07.004.
Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2022. URL: https://doi.org/10.1145/3434399.
Gonzalo Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Comput. Surv., 54(2):26:1-26:32, 2022. URL: https://doi.org/10.1145/3432999.
Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure. arXiv preprint arXiv:2506.05638, 2025. URL: https://doi.org/10.48550/arXiv.2506.05638.
Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, Glasgow, Scotland (Virtual Conference), July 12-16, 2021, volume 198 of LIPIcs, pages 101:1-101:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ICALP.2021.101.
Nicola Prezza. Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms, 17(1):7:1-7:23, 2021. URL: https://doi.org/10.1145/3426870.
Simon J. Puglisi and Bella Zhukova. Relative Lempel-Ziv compression of suffix arrays. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 89-96. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-59212-7_7.
Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol., 29(2):188-194, 2022. URL: https://doi.org/10.1089/cmb.2021.0445.
Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169-187, 2022. URL: https://doi.org/10.1089/cmb.2021.0290.
Vikram Shivakumar, Omar Y. Ahmed, Sam Kovaka, Mohsen Zakeri, and Ben Langmead. Sigmoni: Classification of nanopore signal with a compressed pangenome index. Bioinform., 40(Supplement_1):i287-i296, 2024. URL: https://doi.org/10.1093/bioinformatics/btae213.
Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11. IEEE Computer Society, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
Derrick E Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with Kraken 2. Genome Biology, 20:1-13, 2019.
Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15:1-12, 2014.
Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27(12), 2024.
