eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2024-08-23
85:1
85:18
10.4230/LIPIcs.MFCS.2024.85
article
Approximate Suffix-Prefix Dictionary Queries
Zuba, Wiktor
1
https://orcid.org/0000-0002-1988-3507
Loukides, Grigorios
2
https://orcid.org/0000-0003-0888-5061
Pissis, Solon P.
1
3
https://orcid.org/0000-0002-1445-1932
Thankachan, Sharma V.
4
https://orcid.org/0000-0002-6852-1035
CWI, Amsterdam, The Netherlands
King’s College London, UK
Vrije Universiteit, Amsterdam, The Netherlands
North Carolina State University, Raleigh, NC, USA
In the all-pairs suffix-prefix (APSP) problem [Gusfield et al., Inf. Process. Lett. 1992], we are given a dictionary R of r strings, S₁,…,S_r, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1..r]. APSP is a classic problem in string algorithms with applications in bioinformatics, especially in sequence assembly. Since r = |R| is typically very large in real-world applications, considering all r² pairs of strings explicitly is prohibitive. This motivates the data structure variant of APSP, in the same spirit as distance oracles, which answer shortest-path queries between any two vertices given online.
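To make the definition of SPL_{i,j} concrete, the following is a minimal brute-force sketch, not the method of the paper. The function names `spl` and `all_pairs_spl` are hypothetical, and the code follows the definition literally (so SPL_{i,i} equals |S_i|). It also illustrates why materialising all r² values is costly.

```python
from typing import List


def spl(s_i: str, s_j: str) -> int:
    """Length of the longest string that is both a suffix of s_i and a prefix of s_j.

    Brute force: try candidate lengths from longest to shortest.
    """
    for length in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i[-length:] == s_j[:length]:
            return length
    return 0


def all_pairs_spl(strings: List[str]) -> List[List[int]]:
    """Explicit r x r table of SPL values (0-indexed); quadratic in r by construction."""
    return [[spl(a, b) for b in strings] for a in strings]


if __name__ == "__main__":
    # Toy dictionary, chosen only for illustration.
    print(all_pairs_spl(["abcab", "cabd", "zzz"]))
```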
We show how to quickly locate k-approximate matches (under the Hamming or the edit distance) in R using a version of the k-errata tree [Cole et al., STOC 2004] that we introduce. Let SPL^k_{i,j} be the length of the longest suffix of S_i that is at distance at most k from a prefix of S_j. In particular, for any k = 𝒪(1), we show an 𝒪(n log^k n)-sized data structure to support the following queries:
- One-to-One^k(i,j): output SPL^k_{i,j} in 𝒪(log^k n log log n) time.
- Report^k(i,d): output all j ∈ [1..r] such that SPL^k_{i,j} ≥ d, in 𝒪(log^k n (log n / log log n + output)) time, where output denotes the size of the output.
In fact, our algorithms work for any value of k, not just for k = 𝒪(1), but the formulas bounding the complexities become considerably more complicated for larger values of k.
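The semantics of the two queries can be illustrated with a naive reference sketch under the Hamming distance only. It is an assumption-laden baseline that scans the dictionary directly and uses no k-errata tree, so it does not achieve the bounds stated above. The names `hamming`, `one_to_one_k`, and `report_k` are hypothetical, and indices are 1-based to match the notation i,j ∈ [1..r].

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def one_to_one_k(strings: List[str], i: int, j: int, k: int) -> int:
    """Naive SPL^k_{i,j} under Hamming distance (1-based i, j).

    Under Hamming distance the suffix and the prefix have the same length,
    so it suffices to try each common length from longest to shortest.
    """
    s_i, s_j = strings[i - 1], strings[j - 1]
    for length in range(min(len(s_i), len(s_j)), 0, -1):
        if hamming(s_i[-length:], s_j[:length]) <= k:
            return length
    return 0


def report_k(strings: List[str], i: int, d: int, k: int) -> List[int]:
    """All 1-based j with SPL^k_{i,j} >= d, found by scanning every string."""
    return [j for j in range(1, len(strings) + 1)
            if one_to_one_k(strings, i, j, k) >= d]


if __name__ == "__main__":
    # Toy dictionary, chosen only for illustration.
    R = ["abcab", "cxbd", "zzz"]
    print(one_to_one_k(R, 1, 2, 1))
    print(report_k(R, 1, 3, 1))
```

Handling the edit distance would require comparing suffixes and prefixes of different lengths, which this sketch deliberately leaves out.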
https://drops.dagstuhl.de/storage/00lipics/lipics-vol306-mfcs2024/LIPIcs.MFCS.2024.85/LIPIcs.MFCS.2024.85.pdf
all-pairs suffix-prefix
suffix-prefix queries
suffix tree
k-errata tree