Minimizing the Minimizers via Alphabet Reordering

Authors: Hilde Verbeek, Lorraine A.K. Ayad, Grigorios Loukides, Solon P. Pissis

Author Details

Hilde Verbeek
  • CWI, Amsterdam, The Netherlands
Lorraine A.K. Ayad
  • Brunel University London, London, UK
Grigorios Loukides
  • King’s College London, London, UK
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands

Cite As

Hilde Verbeek, Lorraine A.K. Ayad, Grigorios Loukides, and Solon P. Pissis. Minimizing the Minimizers via Alphabet Reordering. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 28:1-28:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.28

Abstract

Minimizers sampling is one of the most widely used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let S = S[1]… S[n] be a string over a totally ordered alphabet Σ. Further, let w ≥ 2 and k ≥ 1 be two integers. The minimizer of S[i..i+w+k-2] is the smallest position in [i,i+w-1] where the lexicographically smallest length-k substring of S[i..i+w+k-2] starts. The set of minimizers over all i ∈ [1,n-w-k+2] is the set ℳ_{w,k}(S) of the minimizers of S. We consider the following basic problem: Given S, w, and k, can we efficiently compute a total order on Σ that minimizes |ℳ_{w,k}(S)|? We show that this is unlikely by proving that the problem is NP-hard for any w ≥ 3 and k ≥ 1. Our result provides theoretical justification as to why there exist no exact algorithms for minimizing the minimizers samples, while there exists a plethora of heuristics for the same purpose.
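To make the definition concrete, the following is a minimal Python sketch, not taken from the paper. It computes ℳ_{w,k}(S) for a given total order on Σ, and then searches all |Σ|! orders exhaustively. The function names `minimizers` and `best_order_bruteforce` are hypothetical, and the exhaustive search is only an illustration of the problem statement for tiny alphabets; it is not the authors' method, and the NP-hardness result suggests that no efficient exact method should be expected.

```python
from itertools import permutations


def minimizers(s, w, k, order):
    """Return the minimizer positions of s (1-indexed) for parameters w and k.

    `order` lists the alphabet from smallest to largest letter (an assumed
    input format chosen for this sketch).
    """
    rank = {c: r for r, c in enumerate(order)}
    n = len(s)
    # Rank-encode every length-k substring so that tuple comparison
    # matches lexicographic comparison under `order`.
    kmers = [tuple(rank[c] for c in s[j:j + k]) for j in range(n - k + 1)]
    selected = set()
    # One window S[i..i+w+k-2] per i in [1, n-w-k+2]; i0 is the 0-indexed start.
    for i0 in range(n - w - k + 2):
        # Smallest k-mer wins; ties are broken by the leftmost position.
        best = min(range(i0, i0 + w), key=lambda j: (kmers[j], j))
        selected.add(best + 1)
    return selected


def best_order_bruteforce(s, w, k):
    """Try every total order on the letters of s; exponential in |alphabet|."""
    alphabet = sorted(set(s))
    best = min(permutations(alphabet),
               key=lambda p: len(minimizers(s, w, k, p)))
    return "".join(best), len(minimizers(s, w, k, best))


if __name__ == "__main__":
    print(minimizers("banana", 3, 2, "abn"))   # positions under a < b < n
    print(minimizers("banana", 3, 2, "nba"))   # positions under n < b < a
    print(best_order_bruteforce("banana", 3, 2))
```

For S = banana, w = 3, and k = 2, the order a < b < n selects positions {2, 4}, whereas n < b < a selects only position {3}, so the choice of order changes the sample size even on this small input.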

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • sequence analysis
  • minimizers
  • alphabet reordering
  • feedback arc set

References

  1. Lorraine A. K. Ayad, Grigorios Loukides, and Solon P. Pissis. Text indexing for long patterns: Anchors are all you need. Proc. VLDB Endow., 16(9):2117-2131, 2023. URL: https://doi.org/10.14778/3598581.3598586.
  2. Jason W. Bentley, Daniel Gibney, and Sharma V. Thankachan. On the complexity of BWT-runs minimization via alphabet reordering. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms, ESA 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 173 of LIPIcs, pages 15:1-15:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPICS.ESA.2020.15.
  3. Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T. Simpson, and Paul Medvedev. On the representation of de Bruijn graphs. J. Comput. Biol., 22(5):336-352, 2015. URL: https://doi.org/10.1089/CMB.2014.0160.
  4. Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinform., 31(10):1569-1576, 2015. URL: https://doi.org/10.1093/BIOINFORMATICS/BTV022.
  5. Daniel Gibney and Sharma V. Thankachan. Finding an optimal alphabet ordering for Lyndon factorization is hard. In Markus Bläser and Benjamin Monmege, editors, 38th International Symposium on Theoretical Aspects of Computer Science, STACS 2021, March 16-19, 2021, Saarbrücken, Germany (Virtual Conference), volume 187 of LIPIcs, pages 35:1-35:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPICS.STACS.2021.35.
  6. Szymon Grabowski and Marcin Raniszewski. Sampled suffix array with minimizers. Softw. Pract. Exp., 47(11):1755-1771, 2017. URL: https://doi.org/10.1002/SPE.2481.
  7. Minh Hoang, Hongyu Zheng, and Carl Kingsford. Differentiable learning of sequence-specific minimizer schemes with DeepMinimizer. J. Comput. Biol., 29(12):1288-1304, 2022. URL: https://doi.org/10.1089/CMB.2022.0275.
  8. Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Brian Walenz, Sergey Koren, and Adam M. Phillippy. Weighted minimizer sampling improves long read mapping. Bioinform., 36(Supplement-1):i111-i118, 2020. URL: https://doi.org/10.1093/BIOINFORMATICS/BTAA435.
  9. Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Proceedings of a symposium on the Complexity of Computer Computations, held March 20-22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA, The IBM Research Symposia Series, pages 85-103. Plenum Press, New York, 1972. URL: https://doi.org/10.1007/978-1-4684-2001-2_9.
  10. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987. URL: https://doi.org/10.1147/RD.312.0249.
  11. Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinform., 32(14):2103-2110, 2016. URL: https://doi.org/10.1093/BIOINFORMATICS/BTW152.
  12. Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinform., 34(18):3094-3100, 2018. URL: https://doi.org/10.1093/BIOINFORMATICS/BTY191.
  13. Grigorios Loukides and Solon P. Pissis. Bidirectional string anchors: A new string sampling mechanism. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 64:1-64:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPICS.ESA.2021.64.
  14. Grigorios Loukides, Solon P. Pissis, and Michelle Sweering. Bidirectional string anchors for improved text indexing and top-K similarity search. IEEE Trans. Knowl. Data Eng., 35(11):11093-11111, 2023. URL: https://doi.org/10.1109/TKDE.2022.3231780.
  15. Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford. Compact universal k-mer hitting sets. In Martin C. Frith and Christian Nørgaard Storm Pedersen, editors, Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings, volume 9838 of Lecture Notes in Computer Science, pages 257-268. Springer, 2016. URL: https://doi.org/10.1007/978-3-319-43681-4_21.
  16. Michael Roberts, Wayne B. Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinform., 20(18):3363-3369, 2004. URL: https://doi.org/10.1093/bioinformatics/bth408.
  17. Saul Schleimer, Daniel Shawcross Wilkerson, and Alexander Aiken. Winnowing: Local algorithms for document fingerprinting. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 76-85. ACM, 2003. URL: https://doi.org/10.1145/872757.872770.
  18. Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Space-efficient representation of genomic k-mer count tables. Algorithms Mol. Biol., 17(1):5, 2022. URL: https://doi.org/10.1186/S13015-022-00212-0.
  19. Derrick E. Wood and Steven L. Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):R46, 2014.
  20. Daniel H. Younger. Minimum feedback arc sets for a directed graph. IEEE Transactions on Circuit Theory, 10(2):238-245, 1963. URL: https://doi.org/10.1109/TCT.1963.1082116.
  21. Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Improved design and analysis of practical minimizers. Bioinform., 36(Supplement-1):i119-i127, 2020. URL: https://doi.org/10.1093/BIOINFORMATICS/BTAA472.
  22. Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Sequence-specific minimizers via polar sets. Bioinform., 37(Supplement-1):i187-i195, 2021. URL: https://doi.org/10.1093/BIOINFORMATICS/BTAB313.
  23. Hongyu Zheng, Guillaume Marçais, and Carl Kingsford. Creating and using minimizer sketches in computational genomics. J. Comput. Biol., 30(12):1251-1276, 2023. URL: https://doi.org/10.1089/CMB.2023.0094.