Lorraine A.K. Ayad
Grigorios Loukides
Solon P. Pissis
Creative Commons Attribution 4.0 International license
Minimizers sampling is one of the most widely used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let S = S[1]…S[n] be a string over a totally ordered alphabet Σ. Further let w ≥ 2 and k ≥ 1 be two integers. The minimizer of S[i..i+w+k-2] is the smallest position in [i,i+w-1] where the lexicographically smallest length-k substring of S[i..i+w+k-2] starts. The set of minimizers over all i ∈ [1,n-w-k+2] is the set ℳ_{w,k}(S) of the minimizers of S.
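As an illustration of the definition above, the following is a minimal Python sketch that computes ℳ_{w,k}(S) by scanning every window directly. The function name `minimizers` and the optional `order` parameter (the alphabet listed from smallest to largest letter) are assumptions made for this example and do not come from the paper; the sketch runs in O(n·w·k) time and is not meant as an efficient implementation.

```python
def minimizers(s, w, k, order=None):
    """Return the set of minimizer positions of s, using 1-based positions.

    order (hypothetical parameter): the alphabet listed from smallest to
    largest letter. If omitted, the usual sorted order of the letters is used.
    """
    if order is None:
        order = sorted(set(s))
    rank = {c: r for r, c in enumerate(order)}

    def key(j):
        # Rank sequence of the length-k substring starting at 0-based position j.
        return [rank[c] for c in s[j:j + k]]

    result = set()
    # 0-based window starts 0 .. n-w-k+1 correspond to i in [1, n-w-k+2].
    for i in range(len(s) - w - k + 2):
        # min() returns the first minimal element, so ties go to the
        # smallest position, as the definition requires.
        best = min(range(i, i + w), key=key)
        result.add(best + 1)  # convert to a 1-based position
    return result


print(minimizers("abaab", 2, 2))        # order a < b
print(minimizers("abaab", 2, 2, "ba"))  # order b < a
```

For S = abaab with w = 2 and k = 2, the order a < b selects positions {1, 3}, while the order b < a selects positions {2, 4}, which shows that the sample depends on the chosen total order.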
We consider the following basic problem:
Given S, w, and k, can we efficiently compute a total order on Σ that minimizes |ℳ_{w,k}(S)|?
We show that this is unlikely by proving that the problem is NP-hard for any w ≥ 3 and k ≥ 1. Our result provides theoretical justification as to why there exist no exact algorithms for minimizing the minimizers samples, while there exists a plethora of heuristics for the same purpose.
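To make the optimization problem concrete, here is a hedged sketch of the naive exhaustive approach: it tries every total order on the letters occurring in S and keeps one that yields the fewest minimizers. This is not the paper's method; the names `count_minimizers` and `best_order_bruteforce` are hypothetical, and the running time grows with |Σ|!, so it is only usable for very small alphabets.

```python
from itertools import permutations


def count_minimizers(s, w, k, order):
    """Number of distinct minimizer positions of s under the given letter order."""
    rank = {c: r for r, c in enumerate(order)}
    positions = set()
    for i in range(len(s) - w - k + 2):
        # Leftmost position in the window whose length-k substring is smallest.
        positions.add(
            min(range(i, i + w), key=lambda j: [rank[c] for c in s[j:j + k]])
        )
    return len(positions)


def best_order_bruteforce(s, w, k):
    """Try all |alphabet|! orders; return (fewest minimizers, an order achieving it)."""
    alphabet = sorted(set(s))
    return min(
        (count_minimizers(s, w, k, p), "".join(p)) for p in permutations(alphabet)
    )


count, order = best_order_bruteforce("abracadabra", 3, 2)
print(count, order)
```

Ties between orders are broken by taking the lexicographically smallest order string, which is an arbitrary choice made for this example.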
@InProceedings{verbeek_et_al:LIPIcs.CPM.2024.28,
author = {Verbeek, Hilde and Ayad, Lorraine A.K. and Loukides, Grigorios and Pissis, Solon P.},
title = {{Minimizing the Minimizers via Alphabet Reordering}},
booktitle = {35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
pages = {28:1--28:13},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-326-3},
ISSN = {1868-8969},
year = {2024},
volume = {296},
editor = {Inenaga, Shunsuke and Puglisi, Simon J.},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.28},
URN = {urn:nbn:de:0030-drops-201383},
doi = {10.4230/LIPIcs.CPM.2024.28},
annote = {Keywords: sequence analysis, minimizers, alphabet reordering, feedback arc set}
}