Document

**Published in:** LIPIcs, Volume 308, 32nd Annual European Symposium on Algorithms (ESA 2024)

The longest common substring (also known as longest common factor) and longest common subsequence problems are two well-studied classical string problems. The former is solvable in optimal 𝒪(n) time for two strings of length m and n with m ≤ n, and the latter is solvable in 𝒪(nm) time, which is conditionally optimal under the Strong Exponential Time Hypothesis. In this work, we study the problem of longest common factor with gaps, that is, finding a set of at most k matching substrings obeying precedence conditions with maximum total length. For k = 1, this is equivalent to the longest common factor problem, and for k = m, this is equivalent to the longest common subsequence problem. Our work demonstrates that, for constant k, this problem can be solved in strongly subquadratic time, i.e., nm^{1 - Θ(1)}. Motivated by co-linear chaining applications in Computational Biology, we further demonstrate that the longest common factor with gaps results can be extended to the case where the matches are restricted to maximal exact matches (MEMs). To further demonstrate the applicability of our techniques, we show that a similar approach can be used for a restricted version of the episode matching problem where one seeks an ordered set of at most k matches whose concatenation equals a query pattern P and the length of the substring of T containing the matches is minimized. These solutions all run in strongly subquadratic time for constant k.

Aranya Banerjee, Daniel Gibney, and Sharma V. Thankachan. Longest Common Substring with Gaps and Related Problems. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 16:1-16:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{banerjee_et_al:LIPIcs.ESA.2024.16, author = {Banerjee, Aranya and Gibney, Daniel and Thankachan, Sharma V.}, title = {{Longest Common Substring with Gaps and Related Problems}}, booktitle = {32nd Annual European Symposium on Algorithms (ESA 2024)}, pages = {16:1--16:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-338-6}, ISSN = {1868-8969}, year = {2024}, volume = {308}, editor = {Chan, Timothy and Fischer, Johannes and Iacono, John and Herman, Grzegorz}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2024.16}, URN = {urn:nbn:de:0030-drops-210877}, doi = {10.4230/LIPIcs.ESA.2024.16}, annote = {Keywords: Pattern Matching, Longest Common Subsequence, Episode Matching} }

Document

**Published in:** LIPIcs, Volume 306, 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)

In the all-pairs suffix-prefix (APSP) problem [Gusfield et al., Inf. Process. Lett. 1992], we are given a dictionary R of r strings, S₁,…,S_r, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1..r]. APSP is a classic problem in string algorithms with applications in bioinformatics, especially in sequence assembly. Since r = |R| is typically very large in real-world applications, considering all r² pairs of strings explicitly is prohibitive. This is when the data structure variant of APSP makes sense; in the same spirit as distance oracles computing shortest paths between any two vertices given online.
We show how to quickly locate k-approximate matches (under the Hamming or the edit distance) in R using a version of the k-errata tree [Cole et al., STOC 2004] that we introduce. Let SPL^k_{i,j} be the length of the longest suffix of S_i that is at distance at most k from a prefix of S_j. In particular, for any k = 𝒪(1), we show an 𝒪(nlog^k n)-sized data structure to support the following queries:
- One-to-One^k(i,j): output SPL^k_{i,j} in 𝒪(log^k nlog log n) time.
- Report^k(i,d): output all j ∈ [1..r], such that SPL^k_{i,j} ≥ d, in 𝒪(log^{k}n(log n/log log n+output)) time, where output denotes the size of the output.
In fact, our algorithms work for any value of k not just for k = 𝒪(1), but the formulas bounding the complexities get much more complicated for larger values of k.

Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, and Sharma V. Thankachan. Approximate Suffix-Prefix Dictionary Queries. In 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 306, pp. 85:1-85:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{zuba_et_al:LIPIcs.MFCS.2024.85, author = {Zuba, Wiktor and Loukides, Grigorios and Pissis, Solon P. and Thankachan, Sharma V.}, title = {{Approximate Suffix-Prefix Dictionary Queries}}, booktitle = {49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)}, pages = {85:1--85:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-335-5}, ISSN = {1868-8969}, year = {2024}, volume = {306}, editor = {Kr\'{a}lovi\v{c}, Rastislav and Ku\v{c}era, Anton{\'\i}n}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2024.85}, URN = {urn:nbn:de:0030-drops-206416}, doi = {10.4230/LIPIcs.MFCS.2024.85}, annote = {Keywords: all-pairs suffix-prefix, suffix-prefix queries, suffix tree, k-errata tree} }

Document

**Published in:** LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S_1,…,S_k, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1,k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ n^𝒪(1), APSP can be solved in the optimal 𝒪(n+k²) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992].
In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k² usually dominates n, and thus the k² factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries:
- One-to-One(i,j): output SPL_{i,j}.
- One-to-All(i): output SPL_{i,j} for every j ∈ [1,k].
- Report(i,𝓁): output all distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer.
- Count(i,𝓁): output the number of distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer.
- Top(i,K): output K distinct j ∈ [1,k] with the highest values of SPL_{i,j} breaking ties arbitrarily.
We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ n^𝒪(1). We show the following upper bounds:
Query | Space (words) | Query time | Note
One-to-One(i,j) | 𝒪(n) | 𝒪(log log k) | Theorem 11
One-to-All(i) | 𝒪(n) | 𝒪(k) | Theorem 14
Report(i,𝓁) | 𝒪(n) | 𝒪(log n/log log n+output) | Theorem 19(i)
Count(i,𝓁) | 𝒪(n) | 𝒪(log n/log log n) | Theorem 19(ii)
Top(i,K) | 𝒪(n) | 𝒪(log² n/log log n+K) | Theorem 22
We also present efficient algorithms for constructing these data structures.

Grigorios Loukides, Solon P. Pissis, Sharma V. Thankachan, and Wiktor Zuba. Suffix-Prefix Queries on a Dictionary. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 21:1-21:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{loukides_et_al:LIPIcs.CPM.2023.21, author = {Loukides, Grigorios and Pissis, Solon P. and Thankachan, Sharma V. and Zuba, Wiktor}, title = {{Suffix-Prefix Queries on a Dictionary}}, booktitle = {34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)}, pages = {21:1--21:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-276-1}, ISSN = {1868-8969}, year = {2023}, volume = {259}, editor = {Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.21}, URN = {urn:nbn:de:0030-drops-179757}, doi = {10.4230/LIPIcs.CPM.2023.21}, annote = {Keywords: all-pairs suffix-prefix, suffix-prefix queries, internal pattern matching} }

Document

**Published in:** LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

The decomposition of flow-networks is an essential part of many transcriptome assembly algorithms used in Computational Biology. The addition of subpath constraints to this decomposition appeared recently as an effective way to incorporate longer, already known, portions of the transcript. The problem is defined as follows: given a weakly connected directed acyclic flow network G = (V, E, f) and a set ℛ of subpaths in G, find a flow decomposition so that every subpath in ℛ is included in some flow in the decomposition [Williams et al., WABI 2021]. The authors of that work presented an exponential time algorithm for determining the feasibility of such a flow decomposition, and more recently presented an O(|E| + L+|ℛ|³) time algorithm, where L is the sum of the path lengths in ℛ [Williams et al., TCBB 2022]. Our work provides an improved, linear O(|E| + L) time algorithm for determining the feasibility of such a flow decomposition. We also introduce two natural optimization variants of the feasibility problem: (i) determining the minimum sized subset of ℛ that must be removed to make a flow decomposition feasible, and (ii) determining the maximum sized subset of ℛ that can be maintained while making a flow decomposition feasible. We show that, under the assumption P ≠ NP, (i) does not admit a polynomial-time o(log |V|)-approximation algorithm and (ii) does not admit a polynomial-time O(|V|^{1/2-ε} + |ℛ|^{1-ε})-approximation algorithm for any constant ε > 0.

Daniel Gibney, Sharma V. Thankachan, and Srinivas Aluru. Feasibility of Flow Decomposition with Subpath Constraints in Linear Time. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 17:1-17:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{gibney_et_al:LIPIcs.WABI.2022.17, author = {Gibney, Daniel and Thankachan, Sharma V. and Aluru, Srinivas}, title = {{Feasibility of Flow Decomposition with Subpath Constraints in Linear Time}}, booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)}, pages = {17:1--17:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-243-3}, ISSN = {1868-8969}, year = {2022}, volume = {242}, editor = {Boucher, Christina and Rahmann, Sven}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.17}, URN = {urn:nbn:de:0030-drops-170516}, doi = {10.4230/LIPIcs.WABI.2022.17}, annote = {Keywords: Flow networks, flow decomposition, subpath constraints} }

Document

Track A: Algorithms, Complexity and Games

**Published in:** LIPIcs, Volume 229, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)

Two equal length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC' 93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(nlog n) bits of space. It can report any entry in (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i,i+|P|-1] and P are a p-match. The PST can count the number of occurrences of P in T in time O(|P|log σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is, can we obtain a compressed version of PST that takes space close to the text’s size of nlogσ bits and still support all three functionalities mentioned earlier? In SODA' 17, Ganguly et al. answered this question partially by presenting an O(nlogσ) bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(nlog n) bits (when logσ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant.
- Space O(n logσ) bits and query time O(log_σ^ε n);
- Space O(n logσlog log_σ n) bits and query time O(log log_σ n); and
- Space O(n logσ log^ε_σ n) bits and query time O(1).
The first trade-off is an improvement over Ganguly et al.’s result, whereas our third trade-off matches the optimal time performance of Baker’s PST while squeezing the space by a factor roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching and further improvement is highly unlikely.

Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Fully Functional Parameterized Suffix Trees in Compact Space. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 65:1-65:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.ICALP.2022.65, author = {Ganguly, Arnab and Shah, Rahul and Thankachan, Sharma V.}, title = {{Fully Functional Parameterized Suffix Trees in Compact Space}}, booktitle = {49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)}, pages = {65:1--65:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-235-8}, ISSN = {1868-8969}, year = {2022}, volume = {229}, editor = {Boja\'{n}czyk, Miko{\l}aj and Merelli, Emanuela and Woodruff, David P.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2022.65}, URN = {urn:nbn:de:0030-drops-164061}, doi = {10.4230/LIPIcs.ICALP.2022.65}, annote = {Keywords: Data Structures, Suffix Trees, String Algorithms, Compression} }

Document

Invited Talk

**Published in:** LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

In the past two decades, we have witnessed the design of various compact data structures for pattern matching over an indexed text [Navarro, 2016]. Popular indexes like the FM-index [Paolo Ferragina and Giovanni Manzini, 2005], compressed suffix arrays/trees [Roberto Grossi and Jeffrey Scott Vitter, 2005; Kunihiko Sadakane, 2007], the recent r-index [Travis Gagie et al., 2020; Takaaki Nishimoto and Yasuo Tabei, 2021], etc., capture the key functionalities of classic suffix arrays/trees [Udi Manber and Eugene W. Myers, 1993; Peter Weiner, 1973] in compact space. Mostly, they rely on the Burrows-Wheeler Transform (BWT) and its associated operations [Burrows and Wheeler, 1994]. However, compactly encoding some advanced suffix tree (ST) variants, like parameterized ST [Brenda S. Baker, 1993; S. Rao Kosaraju, 1995; Juan Mendivelso et al., 2020], order-isomorphic/preserving ST [Maxime Crochemore et al., 2016], two-dimensional ST [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998], etc. [Sung Gwan Park et al., 2019; Tetsuo Shibuya, 2000]- collectively known as suffix trees with missing suffix links [Richard Cole and Ramesh Hariharan, 2003], has been challenging. The previous techniques are not easily extendable because these variants do not hold some structural properties of the standard ST that enable compression. However, some limited progress has been made in these directions recently [Arnab Ganguly et al., 2017; Travis Gagie et al., 2017; Gianni Decaroli et al., 2017; Dhrumil Patel and Rahul Shah, 2021; Arnab Ganguly et al., 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Arnab Ganguly et al., 2017; Arnab Ganguly et al., 2022; Arnab Ganguly et al., 2021]. This talk will briefly survey them and highlight some interesting open problems.

Sharma V. Thankachan. Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk). In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 3:1-3:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{thankachan:LIPIcs.CPM.2022.3, author = {Thankachan, Sharma V.}, title = {{Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc.}}, booktitle = {33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)}, pages = {3:1--3:3}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-234-1}, ISSN = {1868-8969}, year = {2022}, volume = {223}, editor = {Bannai, Hideo and Holub, Jan}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.3}, URN = {urn:nbn:de:0030-drops-161300}, doi = {10.4230/LIPIcs.CPM.2022.3}, annote = {Keywords: Text Indexing, Suffix Trees, String Matching} }

Document

Track A: Algorithms, Complexity and Games

**Published in:** LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

Two strings are order isomorphic iff the relative ordering of their characters is the same at all positions. For a given text T[1,n] over an ordered alphabet of size σ, we can maintain an order-isomorphic suffix tree/array in O(nlog n) bits and support (order-isomorphic) pattern/substring matching queries efficiently. It is interesting to know if we can encode these structures in space close to the text’s size of nlogσ bits. We answer this question positively by presenting an O(nlog σ)-bit index that allows access to any entry in order-isomorphic suffix array (and its inverse array) in t_{SA} = {O}(log²n/logσ) time. For any pattern P given as a query, this index can count the number of substrings of T that are order-isomorphic to P (denoted by occ) in {O}((|P|logσ+t_{SA})log n) time using standard techniques. Also, it can report the locations of those substrings in additional O(occ ⋅ t_{SA}) time.

Arnab Ganguly, Dhrumil Patel, Rahul Shah, and Sharma V. Thankachan. LF Successor: Compact Space Indexing for Order-Isomorphic Pattern Matching. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 71:1-71:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.ICALP.2021.71, author = {Ganguly, Arnab and Patel, Dhrumil and Shah, Rahul and Thankachan, Sharma V.}, title = {{LF Successor: Compact Space Indexing for Order-Isomorphic Pattern Matching}}, booktitle = {48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)}, pages = {71:1--71:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-195-5}, ISSN = {1868-8969}, year = {2021}, volume = {198}, editor = {Bansal, Nikhil and Merelli, Emanuela and Worrell, James}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.71}, URN = {urn:nbn:de:0030-drops-141400}, doi = {10.4230/LIPIcs.ICALP.2021.71}, annote = {Keywords: Succinct data structures, Pattern Matching} }

Document

**Published in:** LIPIcs, Volume 187, 38th International Symposium on Theoretical Aspects of Computer Science (STACS 2021)

This work establishes several strong hardness results on the problem of finding an ordering on a string’s alphabet that either minimizes or maximizes the number of factors in that string’s Lyndon factorization. In doing so, we demonstrate that these ordering problems are sufficiently complex to model a wide variety of ordering constraint satisfaction problems (OCSPs). Based on this, we prove that (i) the decision versions of both the minimization and maximization problems are NP-complete, (ii) for both the minimization and maximization problems there does not exist a constant approximation algorithm running in polynomial time under the Unique Game Conjecture and (iii) there does not exist an algorithm to solve the minimization problem in time poly(|T|) ⋅ 2^o(σlog σ) for a string T over an alphabet of size σ under the Exponential Time Hypothesis (essentially the brute force approach of trying every alphabet order is hard to improve significantly).

Daniel Gibney and Sharma V. Thankachan. Finding an Optimal Alphabet Ordering for Lyndon Factorization Is Hard. In 38th International Symposium on Theoretical Aspects of Computer Science (STACS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 187, pp. 35:1-35:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{gibney_et_al:LIPIcs.STACS.2021.35, author = {Gibney, Daniel and Thankachan, Sharma V.}, title = {{Finding an Optimal Alphabet Ordering for Lyndon Factorization Is Hard}}, booktitle = {38th International Symposium on Theoretical Aspects of Computer Science (STACS 2021)}, pages = {35:1--35:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-180-1}, ISSN = {1868-8969}, year = {2021}, volume = {187}, editor = {Bl\"{a}ser, Markus and Monmege, Benjamin}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2021.35}, URN = {urn:nbn:de:0030-drops-136809}, doi = {10.4230/LIPIcs.STACS.2021.35}, annote = {Keywords: Lyndon Factorization, String Algorithms, Burrows-Wheeler Transform} }

Document

**Published in:** LIPIcs, Volume 173, 28th Annual European Symposium on Algorithms (ESA 2020)

The Burrows-Wheeler Transform (BWT) has been an essential tool in text compression and indexing. First introduced in 1994, it went on to provide the backbone for the first encoding of the classic suffix tree data structure in space close to entropy-based lower bound. Within the last decade, it has seen its role further enhanced with the development of compact suffix trees in space proportional to "r", the number of runs in the BWT. While r would superficially appear to be only a measure of space complexity, it is actually appearing increasingly often in the time complexity of new algorithms as well. This makes having the smallest value of r of growing importance. Interestingly, unlike other popular measures of compression, the parameter r is sensitive to the lexicographic ordering given to the text’s alphabet. Despite several past attempts to exploit this fact, a provably efficient algorithm for finding, or approximating, an alphabet ordering which minimizes r has been open for years.
We help to explain this lack of progress by presenting the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time poly(n)⋅ 2^o(σ) unless the Exponential Time Hypothesis fails, where σ is the size of the alphabet and n is the length of the text. Moreover, we show that the optimization variant is APX-hard. In doing so, we relate two previously disparate topics: the optimal traveling salesperson path of a graph and the number of runs in the BWT of a text. In addition, by relating recent results in the field of dictionary compression, we illustrate that an arbitrary alphabet ordering provides an O(log² n)-approximation. Lastly, we provide an optimal linear-time algorithm for a more restricted problem of finding an optimal ordering on a subset of symbols (occurring only once) under ordering constraints.

Jason W. Bentley, Daniel Gibney, and Sharma V. Thankachan. On the Complexity of BWT-Runs Minimization via Alphabet Reordering. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 15:1-15:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bentley_et_al:LIPIcs.ESA.2020.15, author = {Bentley, Jason W. and Gibney, Daniel and Thankachan, Sharma V.}, title = {{On the Complexity of BWT-Runs Minimization via Alphabet Reordering}}, booktitle = {28th Annual European Symposium on Algorithms (ESA 2020)}, pages = {15:1--15:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-162-7}, ISSN = {1868-8969}, year = {2020}, volume = {173}, editor = {Grandoni, Fabrizio and Herman, Grzegorz and Sanders, Peter}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2020.15}, URN = {urn:nbn:de:0030-drops-128819}, doi = {10.4230/LIPIcs.ESA.2020.15}, annote = {Keywords: BWT, NP-hardness, APX-hardness} }

Document

**Published in:** LIPIcs, Volume 173, 28th Annual European Symposium on Algorithms (ESA 2020)

We present the first fine-grained complexity results on two classic problems on strings. The first one is the k-Median-Edit-Distance problem, where the input is a collection of k strings, each of length at most n, and the task is to find a new string that minimizes the sum of the edit distances from itself to all other strings in the input. Arising frequently in computational biology, this problem provides an important generalization of edit distance to multiple strings and is similar to the multiple sequence alignment problem in bioinformatics. We demonstrate that for any ε > 0 and k ≥ 2, an O(n^{k-ε}) time solution for the k-Median-Edit-Distance problem over an alphabet of size O(k) refutes the Strong Exponential Time Hypothesis (SETH). This provides the first matching conditional lower bound for the O(n^k) time algorithm established in 1975 by Sankoff.
The second problem we study is the k-Center-Edit-Distance problem. Here also, the input is a collection of k strings, each of length at most n. The task is to find a new string that minimizes the maximum edit distance from itself to any other string in the input. We prove that the same conditional lower bound as before holds. Our results also imply new conditional lower bounds for the k-Tree-Alignment and the k-Bottleneck-Tree-Alignment problems studied in phylogenetics.

Gary Hoppenworth, Jason W. Bentley, Daniel Gibney, and Sharma V. Thankachan. The Fine-Grained Complexity of Median and Center String Problems Under Edit Distance. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 61:1-61:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{hoppenworth_et_al:LIPIcs.ESA.2020.61, author = {Hoppenworth, Gary and Bentley, Jason W. and Gibney, Daniel and Thankachan, Sharma V.}, title = {{The Fine-Grained Complexity of Median and Center String Problems Under Edit Distance}}, booktitle = {28th Annual European Symposium on Algorithms (ESA 2020)}, pages = {61:1--61:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-162-7}, ISSN = {1868-8969}, year = {2020}, volume = {173}, editor = {Grandoni, Fabrizio and Herman, Grzegorz and Sanders, Peter}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2020.61}, URN = {urn:nbn:de:0030-drops-129278}, doi = {10.4230/LIPIcs.ESA.2020.61}, annote = {Keywords: Edit Distance, Median String, Center String, SETH} }

Document

**Published in:** LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Given a text T[1,n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T. The suffix array needs Θ(nlog n) bits of space compared to the n log σ bits needed to store T itself. A major breakthrough [FM - Index, FOCS'00] in the last two decades has been encoding the suffix array in near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in log^{O(1)} n time.
We study an extension of the problem in which we have to also decode the suffix array values of the reverse text. This problem has numerous applications such as in approximate pattern matching [Lam et al., BIBM' 09]. Known approaches maintain the FM - Index of both the forward and the reverse text which drives up the space occupancy to 2nlog σ bits (plus lower order terms). This brings in the natural question of whether we can decode the suffix array values of both the forward and the reverse text, but by using nlog σ bits (plus lower order terms). We answer this question positively, and show that given the FM - Index of the forward text, we can decode the suffix array value of the reverse text in near logarithmic average time. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM - Index for both the forward and the reverse text. We believe that applications that require both the forward and reverse text will benefit from our approach.

Arnab Ganguly, Daniel Gibney, Sahar Hooshmand, M. Oğuzhan Külekci, and Sharma V. Thankachan. FM-Index Reveals the Reverse Suffix Array. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.CPM.2020.13, author = {Ganguly, Arnab and Gibney, Daniel and Hooshmand, Sahar and K\"{u}lekci, M. O\u{g}uzhan and Thankachan, Sharma V.}, title = {{FM-Index Reveals the Reverse Suffix Array}}, booktitle = {31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)}, pages = {13:1--13:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-149-8}, ISSN = {1868-8969}, year = {2020}, volume = {161}, editor = {G{\o}rtz, Inge Li and Weimann, Oren}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.13}, URN = {urn:nbn:de:0030-drops-121388}, doi = {10.4230/LIPIcs.CPM.2020.13}, annote = {Keywords: Data Structures, Suffix Trees, String Algorithms, Compression, Burrows - Wheeler transform, FM-Index} }

Document

**Published in:** LIPIcs, Volume 144, 27th Annual European Symposium on Algorithms (ESA 2019)

In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs, and that Wheeler graphs can be indexed in a way which is space efficient. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We present:
- The problem of recognizing whether a given graph G=(V,E) is a Wheeler graph is NP-complete for any edge label alphabet of size sigma >= 2, even when G is a DAG. This holds even on a restricted, subset of graphs called d-NFA’s for d >= 5. This is in contrast to recent results demonstrating the problem can be solved in polynomial time for d-NFA’s where d <= 2. We also show the recognition problem can be solved in linear time for sigma =1;
- There exists an 2^{e log sigma + O(n + e)} time exact algorithm where n = |V| and e = |E|. This algorithm relies on graph isomorphism being computable in strictly sub-exponential time;
- We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to remove the minimum number of edges in order to obtain a Wheeler graph. We show WGV is APX-hard, even when G is a DAG, implying there exists a constant C >= 1 for which there is no C-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C >= 1, it is NP-hard to find a C-approximation;
- We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler Graph (the dual of the WGV). In contrast to WGV, we prove that the WS problem is in APX for sigma=O(1);
The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial time solvable, raising the open question of which parameters determine this problem’s difficulty.

Daniel Gibney and Sharma V. Thankachan. On the Hardness and Inapproximability of Recognizing Wheeler Graphs. In 27th Annual European Symposium on Algorithms (ESA 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 144, pp. 51:1-51:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{gibney_et_al:LIPIcs.ESA.2019.51, author = {Gibney, Daniel and Thankachan, Sharma V.}, title = {{On the Hardness and Inapproximability of Recognizing Wheeler Graphs}}, booktitle = {27th Annual European Symposium on Algorithms (ESA 2019)}, pages = {51:1--51:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-124-5}, ISSN = {1868-8969}, year = {2019}, volume = {144}, editor = {Bender, Michael A. and Svensson, Ola and Herman, Grzegorz}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2019.51}, URN = {urn:nbn:de:0030-drops-111728}, doi = {10.4230/LIPIcs.ESA.2019.51}, annote = {Keywords: Burrows–Wheeler transform, string algorithms, suffix trees, NP-completeness} }

Document

**Published in:** LIPIcs, Volume 127, 22nd International Conference on Database Theory (ICDT 2019)

In this paper, we consider a variant of the color range reporting problem called color reporting with frequencies. Our goal is to pre-process a set of colored points into a data structure, so that given a query range Q, we can report all colors that appear in Q, along with their respective frequencies. In other words, for each reported color, we also output the number of times it occurs in Q. We describe an external-memory data structure that uses O(N(1+log^2D/log N)) words and answers one-dimensional queries in O(1 +K/B) I/Os, where N is the total number of points in the data structure, D is the total number of colors in the data structure, K is the number of reported colors, and B is the block size.
Next we turn to an approximate version of this problem: report all colors sigma that appear in the query range; for every reported color, we provide a constant-factor approximation on its frequency. We consider color reporting with approximate frequencies in two dimensions. Our data structure uses O(N) space and answers two-dimensional queries in O(log_B N +log^*B + K/B) I/Os in the special case when the query range is bounded on two sides. As a corollary, we can also answer one-dimensional approximate queries within the same time and space bounds.

Arnab Ganguly, J. Ian Munro, Yakov Nekrich, Rahul Shah, and Sharma V. Thankachan. Categorical Range Reporting with Frequencies. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 9:1-9:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.ICDT.2019.9, author = {Ganguly, Arnab and Munro, J. Ian and Nekrich, Yakov and Shah, Rahul and Thankachan, Sharma V.}, title = {{Categorical Range Reporting with Frequencies}}, booktitle = {22nd International Conference on Database Theory (ICDT 2019)}, pages = {9:1--9:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-101-6}, ISSN = {1868-8969}, year = {2019}, volume = {127}, editor = {Barcelo, Pablo and Calautti, Marco}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2019.9}, URN = {urn:nbn:de:0030-drops-103115}, doi = {10.4230/LIPIcs.ICDT.2019.9}, annote = {Keywords: Data Structures, Range Reporting, Range Counting, Categorical Range Reporting, Orthogonal Range Query} }

Document

**Published in:** LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

The non-overlapping indexing problem is defined as follows: pre-process a given text T[1,n] of length n into a data structure such that whenever a pattern P[1,p] comes as an input, we can efficiently report the largest set of non-overlapping occurrences of P in T. The best known solution is by Cohen and Porat [ISAAC, 2009]. Their index size is O(n) words and query time is optimal O(p+nocc), where nocc is the output size. We study this problem in the cache-oblivious model and present a new data structure of size O(n log n) words. It can answer queries in optimal O(p/(B)+log_B n+nocc/B) I/Os, where B is the block size.

Sahar Hooshmand, Paniz Abedin, M. Oguzhan Külekci, and Sharma V. Thankachan. Non-Overlapping Indexing - Cache Obliviously. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 8:1-8:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{hooshmand_et_al:LIPIcs.CPM.2018.8, author = {Hooshmand, Sahar and Abedin, Paniz and K\"{u}lekci, M. Oguzhan and Thankachan, Sharma V.}, title = {{Non-Overlapping Indexing - Cache Obliviously}}, booktitle = {29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)}, pages = {8:1--8:9}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-074-3}, ISSN = {1868-8969}, year = {2018}, volume = {105}, editor = {Navarro, Gonzalo and Sankoff, David and Zhu, Binhai}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.8}, URN = {urn:nbn:de:0030-drops-87009}, doi = {10.4230/LIPIcs.CPM.2018.8}, annote = {Keywords: Suffix Trees, Cache Oblivious, Data Structure, String Algorithms} }

Document

**Published in:** LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

We revisit the heaviest induced ancestors problem, which has several interesting applications in string matching. Let T_1 and T_2 be two weighted trees, where the weight W(u) of a node u in either of the two trees is more than the weight of u's parent. Additionally, the leaves in both trees are labeled and the labeling of the leaves in T_2 is a permutation of those in T_1. A node x in T_1 and a node y in T_2 are induced, iff their subtree have at least one common leaf label. A heaviest induced ancestor query HIA(u_1,u_2) is: given a node u_1 in T_1 and a node u_2 in T_2, output the pair (u_1^*,u_2^*) of induced nodes with the highest combined weight W(u^*_1) + W(u^*_2), such that u_1^* is an ancestor of u_1 and u^*_2 is an ancestor of u_2. Let n be the number of nodes in both trees combined and epsilon >0 be an arbitrarily small constant. Gagie et al. [CCCG' 13] introduced this problem and proposed three solutions with the following space-time trade-offs:
- an O(n log^2n)-word data structure with O(log n log log n) query time
- an O(n log n)-word data structure with O(log^2 n) query time
- an O(n)-word data structure with O(log^{3+epsilon}n) query time.
In this paper, we revisit this problem and present new data structures, with improved bounds. Our results are as follows.
- an O(n log n)-word data structure with O(log n log log n) query time
- an O(n)-word data structure with O(log^2 n/log log n) query time.
As a corollary, we also improve the LZ compressed index of Gagie et al. [CCCG' 13] for answering longest common substring (LCS) queries. Additionally, we show that the LCS after one edit problem of size n [Amir et al., SPIRE' 17] can also be reduced to the heaviest induced ancestors problem over two trees of n nodes in total. This yields a straightforward improvement over its current solution of O(n log^3 n) space and O(log^3 n) query time.

Paniz Abedin, Sahar Hooshmand, Arnab Ganguly, and Sharma V. Thankachan. The Heaviest Induced Ancestors Problem Revisited. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 20:1-20:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{abedin_et_al:LIPIcs.CPM.2018.20, author = {Abedin, Paniz and Hooshmand, Sahar and Ganguly, Arnab and Thankachan, Sharma V.}, title = {{The Heaviest Induced Ancestors Problem Revisited}}, booktitle = {29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)}, pages = {20:1--20:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-074-3}, ISSN = {1868-8969}, year = {2018}, volume = {105}, editor = {Navarro, Gonzalo and Sankoff, David and Zhu, Binhai}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.20}, URN = {urn:nbn:de:0030-drops-86898}, doi = {10.4230/LIPIcs.CPM.2018.20}, annote = {Keywords: Data Structure, String Algorithms, Orthogonal Range Queries} }

Document

**Published in:** LIPIcs, Volume 92, 28th International Symposium on Algorithms and Computation (ISAAC 2017)

Let T be a text of length n containing characters from an alphabet \Sigma, which is the union of two disjoint sets: \Sigma_s containing static characters (s-characters) and \Sigma_p containing parameterized characters (p-characters).
Each character in \Sigma_p has an associated complementary character from \Sigma_p.
A pattern P (also over \Sigma) matches an equal-length substring $S$ of T iff the s-characters match exactly, there exists a one-to-one function that renames the p-characters in S to the p-characters in P, and if a p-character x is renamed to another p-character y then the complement of x is renamed to the complement of y.
The task is to find the starting positions (occurrences) of all such substrings S.
Previous indexing solution [Shibuya, SWAT 2000], known as Structural Suffix Tree, requires \Theta(n\log n) bits of space, and can find all occ occurrences in time O(|P|\log \sigma+ occ), where \sigma = |\Sigma|.
In this paper, we present the first succinct index for this problem, which occupies n \log \sigma + O(n) bits and offers O(|P|\log\sigma+ occ\cdot \log n \log\sigma) query time.

Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Structural Pattern Matching - Succinctly. In 28th International Symposium on Algorithms and Computation (ISAAC 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 92, pp. 35:1-35:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.ISAAC.2017.35, author = {Ganguly, Arnab and Shah, Rahul and Thankachan, Sharma V.}, title = {{Structural Pattern Matching - Succinctly}}, booktitle = {28th International Symposium on Algorithms and Computation (ISAAC 2017)}, pages = {35:1--35:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-054-5}, ISSN = {1868-8969}, year = {2017}, volume = {92}, editor = {Okamoto, Yoshio and Tokuyama, Takeshi}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2017.35}, URN = {urn:nbn:de:0030-drops-82566}, doi = {10.4230/LIPIcs.ISAAC.2017.35}, annote = {Keywords: Parameterized Pattern Matching, Suffix tree, Burrows-Wheeler Transform, Wavelet Tree, Fully-functional succinct tree} }

Document

**Published in:** LIPIcs, Volume 64, 27th International Symposium on Algorithms and Computation (ISAAC 2016)

Given a string X[1, n] and a position k in [1, n], the Shortest Unique Substring of X covering k, denoted by S_k, is a substring X[i, j] of X which satisfies the following conditions: (i) i leq k leq j, (ii) i is the only position where there is an occurrence of X[i, j], and (iii) j - i is minimized. The best-known algorithm [Hon et al., ISAAC 2015] can find S k for all k in [1, n] in time O(n) using the string X and additional 2n words of working space. Let tau be a given parameter. We present the following new results. For any given k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log n tau) time using X and additional O(n/tau) words of working space. For every k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log n/tau) time using X and additional O(n/tau) words and 4n + o(n) bits of working space. For both problems above, we present an O(n tau log^{c+1} n)-time randomized algorithm that uses n/ log c n words in addition to that mentioned above, where c geq 0 is an arbitrary constant. In this case, the reported string is unique and covers k, but with probability at most n^{-O(1)} , may not be the shortest. As a consequence of our techniques, we also obtain similar space-and-time tradeoffs for a related problem of finding Maximal Unique Matches of two strings [Delcher et al., Nucleic Acids Res. 1999].

Arnab Ganguly, Wing-Kai Hon, Rahul Shah, and Sharma V. Thankachan. Space-Time Trade-Offs for the Shortest Unique Substring Problem. In 27th International Symposium on Algorithms and Computation (ISAAC 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 64, pp. 34:1-34:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.ISAAC.2016.34, author = {Ganguly, Arnab and Hon, Wing-Kai and Shah, Rahul and Thankachan, Sharma V.}, title = {{Space-Time Trade-Offs for the Shortest Unique Substring Problem}}, booktitle = {27th International Symposium on Algorithms and Computation (ISAAC 2016)}, pages = {34:1--34:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-026-2}, ISSN = {1868-8969}, year = {2016}, volume = {64}, editor = {Hong, Seok-Hee}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2016.34}, URN = {urn:nbn:de:0030-drops-68041}, doi = {10.4230/LIPIcs.ISAAC.2016.34}, annote = {Keywords: Suffix Tree, Sparsification, Rabin-Karp Fingerprint, Probabilistic z-Fast Trie, Succinct Data-Structures} }

Document

**Published in:** LIPIcs, Volume 54, 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)

Let S and S' be two strings of the same length.We consider the following two variants of string matching.
* Parameterized Matching: The characters of S and S' are partitioned into static characters and parameterized characters.
The strings are parameterized match iff the static characters match exactly and there exists a one-to-one function which renames the parameterized characters in S to those in S'.
* Order-Preserving Matching: The strings are order-preserving match iff for any two integers i,j in [1,|S|], S[i] <= S[j] iff S'[i] <= S'[j].
Let P be a collection of d patterns {P_1, P_2, ..., P_d} of total length n characters, which are chosen from an alphabet Sigma.
Given a text T, also over Sigma, we consider the dictionary indexing problem under the above definitions of string matching.
Specifically, the task is to index P, such that we can report all positions j where at least one of the patterns P_i in P is a parameterized-match (resp. order-preserving match) with the same-length substring of $T$ starting at j. Previous best-known indexes occupy O(n * log(n)) bits and can report all occ positions in O(|T| * log(|Sigma|) + occ) time. We present space-efficient indexes that occupy O(n * log(|Sigma|+d) * log(n)) bits and reports all occ positions in O(|T| * (log(|Sigma|) + log_{|Sigma|}(n)) + occ) time for parameterized matching and in O(|T| * log(n) + occ) time for order-preserving matching.

Arnab Ganguly, Wing-Kai Hon, Kunihiko Sadakane, Rahul Shah, Sharma V. Thankachan, and Yilin Yang. Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 2:1-2:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{ganguly_et_al:LIPIcs.CPM.2016.2, author = {Ganguly, Arnab and Hon, Wing-Kai and Sadakane, Kunihiko and Shah, Rahul and Thankachan, Sharma V. and Yang, Yilin}, title = {{Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching}}, booktitle = {27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)}, pages = {2:1--2:12}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-012-5}, ISSN = {1868-8969}, year = {2016}, volume = {54}, editor = {Grossi, Roberto and Lewenstein, Moshe}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2016.2}, URN = {urn:nbn:de:0030-drops-60736}, doi = {10.4230/LIPIcs.CPM.2016.2}, annote = {Keywords: Parameterized Matching, Order-preserving Matching, Dictionary Indexing, Aho-Corasick Automaton, Sparsification} }

Document

**Published in:** LIPIcs, Volume 45, 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015)

Document retrieval is one of the most fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern.
Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extensions.
Let D={T_1,T_2,...,T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- as the forbidden extension of the included pattern P^+. A forbidden extension query < P^+,P^- > asks to report all occ documents in D that contains P^+ as a substring, but does not contain P^- as one. A top-k forbidden extension query < P^+,P^-,k > asks to report those k documents among the occ documents that are most relevant to P^+. We present a linear index (in words) with an O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following results, when the relevance of a document is based on PageRank:
- an O(n) space (in words) index with O(|P^-|log sigma+ k) query time, where sigma is the size of the alphabet from which characters in D are chosen. For constant alphabets, this yields an optimal query time of O(|P^-|+ k).
- for any constant epsilon > 0, a |CSA| + |CSA^*| + Dlog frac{n}{D} + O(n) bits index with O(search(P)+ k cdot tsa cdot log ^{2+epsilon} n) query time, where search(P) is the time to find the suffix range of a pattern P, tsa is the time to find suffix (or inverse suffix) array value, and |CSA^*| denotes the maximum of the space needed to store the compressed suffix array CSA of the concatenated text of all documents, or the total space needed to store the individual CSA of each document.

Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Forbidden Extension Queries. In 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 45, pp. 320-335, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)

Copy BibTex To Clipboard

@InProceedings{biswas_et_al:LIPIcs.FSTTCS.2015.320, author = {Biswas, Sudip and Ganguly, Arnab and Shah, Rahul and Thankachan, Sharma V.}, title = {{Forbidden Extension Queries}}, booktitle = {35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015)}, pages = {320--335}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-939897-97-2}, ISSN = {1868-8969}, year = {2015}, volume = {45}, editor = {Harsha, Prahladh and Ramalingam, G.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.FSTTCS.2015.320}, URN = {urn:nbn:de:0030-drops-56522}, doi = {10.4230/LIPIcs.FSTTCS.2015.320}, annote = {Keywords: document retrieval, suffix trees, range queries, succinct data structure} }

Document

**Published in:** LIPIcs, Volume 31, 18th International Conference on Database Theory (ICDT 2015)

Orthogonal range reporting is one of the classic and most fundamental data structure problems. (2,1,1) query is a 3 dimensional query with two-sided constraint on the first dimension and one sided constraint on each of the 2nd and 3rd dimension. Given a set of N points in three dimension, a particular formulation of such a (2,1,1) query (known as four-sided range reporting in three-dimension) asks to report all those K points within a query region [a, b]X(-infinity, c]X[d, infinity). These queries have overall 4 constraints. In Word-RAM model, the best known structure capable of answering such queries with optimal query time takes O(N log^{epsilon} N) space, where epsilon>0 is any positive constant. It has been shown that any external memory structure in optimal I/Os must use Omega(N log N/ log log_B N) space (in words), where B is the block size [Arge et al., PODS 1999]. In this paper, we study a special type of (2,1,1) queries, where the query parameters a and c are the same i.e., a=c. Even though the query is still four-sided, the number of independent constraints is only three. In other words, one constraint is shared. We call this as a Shared-Constraint Range Reporting (SCRR) problem. We study this problem in both internal as well as external memory models. In RAM model where coordinates can only be compared, we achieve linear-space and O(log N+K) query time solution, matching the best-known three dimensional dominance query bound. Whereas in external memory, we present a linear space structure with O(log_B N + log log N + K/B) query I/Os. We also present an I/O-optimal (i.e., O(log_B N+K/B) I/Os) data structure which occupies O(N log log N)-word space. We achieve these results by employing a novel divide and conquer approach. SCRR finds application in database queries containing sharing among the constraints. We also show that SCRR queries naturally arise in many well known problems such as top-k color reporting, range skyline reporting and ranked document retrieval.

Sudip Biswas, Manish Patil, Rahul Shah, and Sharma V. Thankachan. Shared-Constraint Range Reporting. In 18th International Conference on Database Theory (ICDT 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 31, pp. 277-290, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)

Copy BibTex To Clipboard

@InProceedings{biswas_et_al:LIPIcs.ICDT.2015.277, author = {Biswas, Sudip and Patil, Manish and Shah, Rahul and Thankachan, Sharma V.}, title = {{Shared-Constraint Range Reporting}}, booktitle = {18th International Conference on Database Theory (ICDT 2015)}, pages = {277--290}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-939897-79-8}, ISSN = {1868-8969}, year = {2015}, volume = {31}, editor = {Arenas, Marcelo and Ugarte, Mart{\'\i}n}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2015.277}, URN = {urn:nbn:de:0030-drops-49900}, doi = {10.4230/LIPIcs.ICDT.2015.277}, annote = {Keywords: data structure, shared constraint, multi-slab, point partitioning} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail