LIPIcs.CPM.2023.21.pdf
- Filesize: 1.12 MB
- 20 pages
In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S_1,…,S_k, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1,k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ n^𝒪(1), APSP can be solved in the optimal 𝒪(n+k²) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992]. In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k² usually dominates n, and thus the k² factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries: - One-to-One(i,j): output SPL_{i,j}. - One-to-All(i): output SPL_{i,j} for every j ∈ [1,k]. - Report(i,𝓁): output all distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer. - Count(i,𝓁): output the number of distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer. - Top(i,K): output K distinct j ∈ [1,k] with the highest values of SPL_{i,j} breaking ties arbitrarily. We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ n^𝒪(1). We show the following upper bounds: Query | Space (words) | Query time | Note One-to-One(i,j) | 𝒪(n) | 𝒪(log log k) | Theorem 11 One-to-All(i) | 𝒪(n) | 𝒪(k) | Theorem 14 Report(i,𝓁) | 𝒪(n) | 𝒪(log n/log log n+output) | Theorem 19(i) Count(i,𝓁) | 𝒪(n) | 𝒪(log n/log log n) | Theorem 19(ii) Top(i,K) | 𝒪(n) | 𝒪(log² n/log log n+K) | Theorem 22 We also present efficient algorithms for constructing these data structures.
Feedback for Dagstuhl Publishing