,
Davide Cenzato, Travis Gagie, Ragnar Groot Koerkamp, Sung-Hwan Kim, Giovanni Manzini, Nicola Prezza
Creative Commons Attribution 4.0 International license
The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the r-index (Gagie et al., JACM 2020), fit in space proportional to r, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came at a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower than uncompressed counterparts in practice due to catastrophic cache locality. This reality gap highlights that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, leaving a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation? We answer this in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths (each corresponding to a string suffix) is bounded by r. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most r, strictly improving upon the (tight) 2r bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique. Experiments confirm that this theoretical I/O efficiency translates to practice in pangenomic applications: our index locates pattern occurrences using less space and orders of magnitude less time than the r-index when performing pattern matching on repetitive DNA collections.
Beyond this, our contributions are twofold: (i) unlike Suffixient Arrays, our technique supports most standard suffix tree operations in O(r) space on top of the text while matching the I/O complexity of uncompressed suffix trees; and (ii) we establish a general framework where any valid path decomposition induces a Suffix Array sampling whose size is a new strong repetitiveness measure; we provide a universal mechanism for locating all pattern occurrences for each such path decomposition.
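To make the measure r from the abstract concrete, the following is a minimal Python sketch that builds a Suffix Array naively, derives the Burrows-Wheeler transform from it, and counts its equal-letter runs. It is an illustration only and not the construction used in the paper: the function names `suffix_array`, `bwt`, and `count_runs` are hypothetical, the sort is quadratic or worse, and the input is assumed to end with a unique sentinel `$` that is smaller than every other character.

```python
def suffix_array(text: str) -> list[int]:
    """Naive Suffix Array: start positions of suffixes in lexicographic order.

    Assumption: simple sort by suffix slices, suitable only for tiny inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text: str) -> str:
    """Burrows-Wheeler transform derived from the Suffix Array.

    Assumption: text ends with a unique sentinel '$' smaller than all
    other characters. For the suffix starting at 0, text[-1] wraps
    around to the sentinel, which is the intended behaviour.
    """
    return "".join(text[i - 1] for i in suffix_array(text))


def count_runs(s: str) -> int:
    """Number of maximal equal-letter runs in s (the measure r when s is a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    text = "banana$"
    transformed = bwt(text)
    print(transformed, count_runs(transformed))
```

On highly repetitive inputs, such as collections of similar DNA sequences, r tends to be much smaller than the text length, which is why an index of size proportional to r is attractive.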
@InProceedings{becker_et_al:LIPIcs.ICALP.2026.24,
author = {Becker, Ruben and Cenzato, Davide and Gagie, Travis and Groot Koerkamp, Ragnar and Kim, Sung-Hwan and Manzini, Giovanni and Prezza, Nicola},
title = {{Compressing Suffix Trees by Path Decompositions}},
booktitle = {53rd International Colloquium on Automata, Languages, and Programming (ICALP 2026)},
pages = {24:1--24:25},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-428-4},
ISSN = {1868-8969},
year = {2026},
volume = {374},
editor = {Bhattacharya, Sayan and Nanongkai, Danupon and Benedikt, Michael and Puppis, Gabriele},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2026.24},
URN = {urn:nbn:de:0030-drops-264139},
doi = {10.4230/LIPIcs.ICALP.2026.24},
annote = {Keywords: Text indexing, suffix tree, I/O-efficient, Compressed Data Structures}
}