Computing MEMs on Repetitive Text Collections

Navarro, Gonzalo

doi:10.4230/LIPIcs.CPM.2023.24

Abstract

We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern P[1..m] on a large repetitive text collection T[1..n], which is represented as a (hopefully much smaller) run-length context-free grammar of size g_{rl}. We show that the problem can be solved in time O(m² log^ε n), for any constant ε > 0, on a data structure of size O(g_{rl}). Further, on a locally consistent grammar of size O(δ log n/δ), the time decreases to O(m log m(log m + log^ε n)). The value δ is a function of the substring complexity of T and Ω(δ log n/δ) is a tight lower bound on the compressibility of repetitive texts T, so our structure has optimal size in terms of n and δ.

H. Bannai, T. Gagie, and T. I. Refining the r-index. Theoretical Computer Science, 812:96-108, 2020.
D. Belazzougui and S. J. Puglisi. Range predecessor and Lempel-Ziv parsing. In Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2053-2071, 2016.
M. A. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin. Lowest common ancestors in trees and directed acyclic graphs. Journal of Algorithms, 57(2):75-94, 2005.
P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. S. Rao, and O. Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing, 44(3):513-539, 2015.
C. Boucher, O. Cvacho, T. Gagie, J. Holub G. Manzini, G. Navarro, and M. Rossi. PFP compressed suffix trees. In Proc. 23rd Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-72, 2021.
C. Boucher, T. Gagie, T. I, D. Köppl, B. Langmead, G. Manzini, G. Navarro, A. Pacheco, and M. Rossi. PHONI: Streamed matching statistics with multi-genome references. In Proc. 31th Data Compression Conference (DCC), pages 193-202, 2021.
M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
M. Cáceres and G. Navarro. Faster repetition-aware compressed suffix trees based on block trees. Information and Computation, 285B:article 104749, 2022.
T. M. Chan, K. G. Larsen, and M. Patrascu. Orthogonal range searching on the RAM, revisited. In Proc. 27th ACM Symposium on Computational Geometry (SoCG), pages 1-10, 2011.
M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554-2576, 2005.
A. R. Christiansen, M. B. Ettienne, T. Kociumaka, G. Navarro, and N. Prezza. Optimal-time dictionary-compressed indexes. ACM Transactions on Algorithms, 17(1):article 8, 2020.
F. Claude, G. Navarro, and A. Pacheco. Grammar-compressed indexes with logarithmic search time. Journal of Computer and System Sciences, 118:53-74, 2021.
M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific, 2002.
D. Díaz-Domínguez, G. Navarro, and A. Pacheco. An LMS-based grammar self-index with local consistency properties. In Proc. 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 100-113, 2021.
A. Farruggia, T. Gagie, G. Navarro, S. J. Puglisi, and J. Sirén. Relative suffix trees. The Computer Journal, 61(5):773-788, 2018.
H. Ferrada, D. Kempa, and S. J. Puglisi. Hybrid indexing revisited. In Proc. 20th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 1-8, 2018.
T. Gagie, G. Navarro, and N. Prezza. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):article 2, 2020.
Y. Gao. Computing matching statistics on repetitive texts. In Proc. 32nd Data Compression Conference (DCC), pages 73-82, 2022.
D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 2:249-260, 1987.
D. Kempa and T. Kociumaka. Resolution of the Burrows-Wheeler Transform conjecture. In Proc. 61st IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 1002-1013, 2020.
D. Kempa and N. Prezza. At the roots of dictionary compression: String attractors. In Proc. 50th Annual ACM Symposium on the Theory of Computing (STOC), pages 827-840, 2018.
J. C. Kieffer and E.-H. Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000.
T. Kociumaka, G. Navarro, and F. Olivares. Near-optimal search time in δ-optimal space. In Proc. 15th Latin American Symposium on Theoretical Informatics (LATIN), pages 88-103, 2022.
T. Kociumaka, G. Navarro, and N. Prezza. Towards a definitive measure of repetitiveness. IEEE Transactions on Information Theory, 69(4):2074-2092, 2023.
S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013.
J. Larsson and A. Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722-1732, 2000.
V. Mäkinen, D. Belazzougui, F. Cunial, and A. I. Tomescu. Genome-Scale Algorithm Design. Cambridge University Press, 2015.
V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281-308, 2010.
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976.
D. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514-534, 1968.
G. Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys, 54(2):article 29, 2021.
G. Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Computing Surveys, 54(2):article 26, 2021.
Y. Nekrich and G. Navarro. Sorted range reporting. In Proc. 13th Scandinavian Symposium on Algorithmic Theory (SWAT), pages 271-282, 2012.
D. Nunes, F. Louza, S. Gog, M. Ayala-Rincón, and G. Navarro. Grammar compression by induced suffix sorting. ACM Journal of Experimental Algorithmics, 27:article 1.1, 2022.
E. Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013.
M. Rossi, M. Oliva, B. Langmead, T. Gagie, and C. Boucher. MONI: A pangenomic index for finding maximal exact matches. Journal of Computational Biology, 29(2):169-187, 2022.
L. M. S. Russo, G. Navarro, and A. Oliveira. Fully-compressed suffix trees. ACM Transactions on Algorithms, 7(4):article 53, 2011.
I. Tatarnikov, A. S. Farahani, S. Kashgouli, and T. Gagie. MONI can find k-MEMs. CoRR, 2202.05085, 2022. URL: https://arxiv.org/abs/2202.05085.
P. Weiner. Linear Pattern Matching Algorithms. In Proc. 14th IEEE Symp. on Switching and Automata Theory (FOCS), pages 1-11, 1973.

Computing MEMs on Repetitive Text Collections

Author Gonzalo Navarro

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Computing MEMs on Repetitive Text Collections

Author Gonzalo Navarro

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References