Computing MEMs on Repetitive Text Collections

Author Gonzalo Navarro



PDF
Thumbnail PDF

File

LIPIcs.CPM.2023.24.pdf
  • Filesize: 0.71 MB
  • 17 pages

Document Identifiers

Author Details

Gonzalo Navarro
  • Center for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
  • Department of Computer Science, University of Chile, Santiago, Chile

Cite As Get BibTex

Gonzalo Navarro. Computing MEMs on Repetitive Text Collections. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 24:1-24:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.24

Abstract

We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern P[1..m] on a large repetitive text collection T[1..n], which is represented as a (hopefully much smaller) run-length context-free grammar of size g_{rl}. We show that the problem can be solved in time O(m² log^ε n), for any constant ε > 0, on a data structure of size O(g_{rl}). Further, on a locally consistent grammar of size O(δ log n/δ), the time decreases to O(m log m(log m + log^ε n)). The value δ is a function of the substring complexity of T and Ω(δ log n/δ) is a tight lower bound on the compressibility of repetitive texts T, so our structure has optimal size in terms of n and δ.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
Keywords
  • grammar-based indices
  • maximal exact matches
  • locally consistent grammars
  • substring complexity

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. H. Bannai, T. Gagie, and T. I. Refining the r-index. Theoretical Computer Science, 812:96-108, 2020. Google Scholar
  2. D. Belazzougui and S. J. Puglisi. Range predecessor and Lempel-Ziv parsing. In Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2053-2071, 2016. Google Scholar
  3. M. A. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin. Lowest common ancestors in trees and directed acyclic graphs. Journal of Algorithms, 57(2):75-94, 2005. Google Scholar
  4. P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. S. Rao, and O. Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing, 44(3):513-539, 2015. Google Scholar
  5. C. Boucher, O. Cvacho, T. Gagie, J. Holub G. Manzini, G. Navarro, and M. Rossi. PFP compressed suffix trees. In Proc. 23rd Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-72, 2021. Google Scholar
  6. C. Boucher, T. Gagie, T. I, D. Köppl, B. Langmead, G. Manzini, G. Navarro, A. Pacheco, and M. Rossi. PHONI: Streamed matching statistics with multi-genome references. In Proc. 31th Data Compression Conference (DCC), pages 193-202, 2021. Google Scholar
  7. M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  8. M. Cáceres and G. Navarro. Faster repetition-aware compressed suffix trees based on block trees. Information and Computation, 285B:article 104749, 2022. Google Scholar
  9. T. M. Chan, K. G. Larsen, and M. Patrascu. Orthogonal range searching on the RAM, revisited. In Proc. 27th ACM Symposium on Computational Geometry (SoCG), pages 1-10, 2011. Google Scholar
  10. M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554-2576, 2005. Google Scholar
  11. A. R. Christiansen, M. B. Ettienne, T. Kociumaka, G. Navarro, and N. Prezza. Optimal-time dictionary-compressed indexes. ACM Transactions on Algorithms, 17(1):article 8, 2020. Google Scholar
  12. F. Claude, G. Navarro, and A. Pacheco. Grammar-compressed indexes with logarithmic search time. Journal of Computer and System Sciences, 118:53-74, 2021. Google Scholar
  13. M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific, 2002. Google Scholar
  14. D. Díaz-Domínguez, G. Navarro, and A. Pacheco. An LMS-based grammar self-index with local consistency properties. In Proc. 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 100-113, 2021. Google Scholar
  15. A. Farruggia, T. Gagie, G. Navarro, S. J. Puglisi, and J. Sirén. Relative suffix trees. The Computer Journal, 61(5):773-788, 2018. Google Scholar
  16. H. Ferrada, D. Kempa, and S. J. Puglisi. Hybrid indexing revisited. In Proc. 20th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 1-8, 2018. Google Scholar
  17. T. Gagie, G. Navarro, and N. Prezza. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):article 2, 2020. Google Scholar
  18. Y. Gao. Computing matching statistics on repetitive texts. In Proc. 32nd Data Compression Conference (DCC), pages 73-82, 2022. Google Scholar
  19. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  20. R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 2:249-260, 1987. Google Scholar
  21. D. Kempa and T. Kociumaka. Resolution of the Burrows-Wheeler Transform conjecture. In Proc. 61st IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 1002-1013, 2020. Google Scholar
  22. D. Kempa and N. Prezza. At the roots of dictionary compression: String attractors. In Proc. 50th Annual ACM Symposium on the Theory of Computing (STOC), pages 827-840, 2018. Google Scholar
  23. J. C. Kieffer and E.-H. Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000. Google Scholar
  24. T. Kociumaka, G. Navarro, and F. Olivares. Near-optimal search time in δ-optimal space. In Proc. 15th Latin American Symposium on Theoretical Informatics (LATIN), pages 88-103, 2022. Google Scholar
  25. T. Kociumaka, G. Navarro, and N. Prezza. Towards a definitive measure of repetitiveness. IEEE Transactions on Information Theory, 69(4):2074-2092, 2023. Google Scholar
  26. S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013. Google Scholar
  27. J. Larsson and A. Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722-1732, 2000. Google Scholar
  28. V. Mäkinen, D. Belazzougui, F. Cunial, and A. I. Tomescu. Genome-Scale Algorithm Design. Cambridge University Press, 2015. Google Scholar
  29. V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281-308, 2010. Google Scholar
  30. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. Google Scholar
  31. E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976. Google Scholar
  32. D. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514-534, 1968. Google Scholar
  33. G. Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys, 54(2):article 29, 2021. Google Scholar
  34. G. Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Computing Surveys, 54(2):article 26, 2021. Google Scholar
  35. Y. Nekrich and G. Navarro. Sorted range reporting. In Proc. 13th Scandinavian Symposium on Algorithmic Theory (SWAT), pages 271-282, 2012. Google Scholar
  36. D. Nunes, F. Louza, S. Gog, M. Ayala-Rincón, and G. Navarro. Grammar compression by induced suffix sorting. ACM Journal of Experimental Algorithmics, 27:article 1.1, 2022. Google Scholar
  37. E. Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013. Google Scholar
  38. M. Rossi, M. Oliva, B. Langmead, T. Gagie, and C. Boucher. MONI: A pangenomic index for finding maximal exact matches. Journal of Computational Biology, 29(2):169-187, 2022. Google Scholar
  39. L. M. S. Russo, G. Navarro, and A. Oliveira. Fully-compressed suffix trees. ACM Transactions on Algorithms, 7(4):article 53, 2011. Google Scholar
  40. I. Tatarnikov, A. S. Farahani, S. Kashgouli, and T. Gagie. MONI can find k-MEMs. CoRR, 2202.05085, 2022. URL: https://arxiv.org/abs/2202.05085.
  41. P. Weiner. Linear Pattern Matching Algorithms. In Proc. 14th IEEE Symp. on Switching and Automata Theory (FOCS), pages 1-11, 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail