, Grigorios Loukides, Solon P. Pissis, Sharma V. Thankachan
Creative Commons Attribution 4.0 International license
In the all-pairs suffix-prefix (APSP) problem [Gusfield et al., Inf. Process. Lett. 1992], we are given a dictionary R of r strings, S₁,…,S_r, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1..r]. APSP is a classic problem in string algorithms with applications in bioinformatics, especially in sequence assembly. Since r = |R| is typically very large in real-world applications, considering all r² pairs of strings explicitly is prohibitive. This is where the data structure variant of APSP makes sense, in the same spirit as distance oracles, which compute shortest paths between any two vertices given online.
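To make the definition of SPL_{i,j} concrete, the following is a minimal brute-force sketch in Python. It is only an illustration of the problem statement, not the method of the paper; the function names `spl` and `all_pairs_spl` and the example dictionary are assumptions introduced here. It follows the definition literally, so SPL_{i,i} equals |S_i|.

```python
from typing import List


def spl(s_i: str, s_j: str) -> int:
    """Length of the longest string that is both a suffix of s_i and a prefix of s_j."""
    # Try candidate lengths from longest to shortest; the first hit is the answer.
    for length in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i[len(s_i) - length:] == s_j[:length]:
            return length
    return 0


def all_pairs_spl(strings: List[str]) -> List[List[int]]:
    """Explicit r x r table of SPL values (0-indexed); naive and far from optimal."""
    return [[spl(a, b) for b in strings] for a in strings]


if __name__ == "__main__":
    R = ["abcab", "cabd", "bdab"]  # hypothetical toy dictionary
    for row in all_pairs_spl(R):
        print(row)
```

The explicit table has r² entries, which is exactly the cost that the data structure variant is meant to avoid.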
We show how to quickly locate k-approximate matches (under the Hamming or the edit distance) in R using a version of the k-errata tree [Cole et al., STOC 2004] that we introduce. Let SPL^k_{i,j} be the length of the longest suffix of S_i that is at distance at most k from a prefix of S_j. In particular, for any k = 𝒪(1), we present an 𝒪(n log^k n)-sized data structure that supports the following queries:
- One-to-One^k(i,j): output SPL^k_{i,j} in 𝒪(log^k n log log n) time.
- Report^k(i,d): output all j ∈ [1..r] such that SPL^k_{i,j} ≥ d, in 𝒪(log^k n (log n / log log n + output)) time, where output denotes the size of the output.
In fact, our algorithms work for any value of k, not just for k = 𝒪(1), but the formulas bounding the complexities become much more complicated for larger values of k.
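The semantics of the two queries can be pinned down with a brute-force reference, sketched below in Python for the Hamming distance only. This is not the k-errata tree construction and does not achieve the stated bounds; it simply scans candidate lengths. The names `hamming_within`, `spl_k`, `one_to_one_k`, and `report_k`, and the toy dictionary, are assumptions made for illustration. Indices i and j are 1-based, as in the text, and j = i is not excluded.

```python
from typing import List


def hamming_within(a: str, b: str, k: int) -> bool:
    """True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def spl_k(s_i: str, s_j: str, k: int) -> int:
    """Longest suffix of s_i within Hamming distance k of the same-length prefix of s_j."""
    for length in range(min(len(s_i), len(s_j)), 0, -1):
        if hamming_within(s_i[len(s_i) - length:], s_j[:length], k):
            return length
    return 0


def one_to_one_k(R: List[str], i: int, j: int, k: int) -> int:
    """Reference answer for One-to-One^k(i, j), with 1-based i and j."""
    return spl_k(R[i - 1], R[j - 1], k)


def report_k(R: List[str], i: int, d: int, k: int) -> List[int]:
    """Reference answer for Report^k(i, d): all 1-based j with SPL^k_{i,j} >= d."""
    return [j for j in range(1, len(R) + 1) if spl_k(R[i - 1], R[j - 1], k) >= d]


if __name__ == "__main__":
    R = ["abcab", "cabd", "bdab"]  # hypothetical toy dictionary
    print(one_to_one_k(R, 1, 3, 0))
    print(one_to_one_k(R, 1, 3, 1))
    print(report_k(R, 1, 4, 1))
```

Note that under the Hamming distance any suffix of length at most k trivially matches, so SPL^k_{i,j} is at least min(k, |S_i|, |S_j|). A reference of this kind is mainly useful for checking a faster implementation on small inputs.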
@InProceedings{zuba_et_al:LIPIcs.MFCS.2024.85,
author = {Zuba, Wiktor and Loukides, Grigorios and Pissis, Solon P. and Thankachan, Sharma V.},
title = {{Approximate Suffix-Prefix Dictionary Queries}},
booktitle = {49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)},
pages = {85:1--85:18},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-335-5},
ISSN = {1868-8969},
year = {2024},
volume = {306},
editor = {Kr\'{a}lovi\v{c}, Rastislav and Ku\v{c}era, Anton{\'\i}n},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2024.85},
URN = {urn:nbn:de:0030-drops-206416},
doi = {10.4230/LIPIcs.MFCS.2024.85},
annote = {Keywords: all-pairs suffix-prefix, suffix-prefix queries, suffix tree, k-errata tree}
}