Track A: Algorithms, Complexity and Games
Fully Functional Parameterized Suffix Trees in Compact Space

Authors: Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 229, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)

Two equal length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC'93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(n log n) bits of space. It can report any entry in the (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i,i+|P|-1] and P are a p-match. The PST can count the number of occurrences of P in T in time O(|P| log σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is: can we obtain a compressed version of the PST that takes space close to the text's size of n log σ bits and still supports all three functionalities mentioned earlier? In SODA'17, Ganguly et al. answered this question partially by presenting an O(n log σ)-bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(n log n) bits (when log σ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant.
- Space O(n log σ) bits and query time O(log_σ^ε n);
- Space O(n log σ log log_σ n) bits and query time O(log log_σ n); and
- Space O(n log σ log^ε_σ n) bits and query time O(1).
The first trade-off is an improvement over Ganguly et al.'s result, whereas our third trade-off matches the optimal time performance of Baker's PST while squeezing the space by a factor of roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching and further improvement is highly unlikely.
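
To make the p-match definition above concrete, here is a minimal Python sketch. It is not the paper's index: it uses the classical previous-occurrence encoding (each symbol is replaced by the distance to its previous occurrence, or 0 if there is none) and a brute-force scan over the text. The function names are illustrative assumptions.

```python
def prev_encode(s):
    """Replace each symbol by the distance to its previous occurrence (0 if none)."""
    last = {}
    out = []
    for i, ch in enumerate(s):
        out.append(i - last[ch] if ch in last else 0)
        last[ch] = i
    return out


def is_p_match(s, t):
    """Two equal-length strings p-match iff their encodings are identical."""
    return len(s) == len(t) and prev_encode(s) == prev_encode(t)


def p_occurrences(text, pattern):
    """Brute force: 1-based positions i where T[i, i+|P|-1] p-matches P."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1)
            if is_p_match(text[i:i + m], pattern)]
```

For instance, `p_occurrences("aabb", "xx")` reports positions 1 and 3, since both "aa" and "bb" can be renamed to "xx". The scan takes time proportional to |T|·|P|, which is exactly the cost that a suffix-tree-style index avoids.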

Cite as

Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Fully Functional Parameterized Suffix Trees in Compact Space. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 65:1-65:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


LF Successor: Compact Space Indexing for Order-Isomorphic Pattern Matching

Authors: Arnab Ganguly, Dhrumil Patel, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

Two strings are order isomorphic iff the relative ordering of their characters is the same at all positions. For a given text T[1,n] over an ordered alphabet of size σ, we can maintain an order-isomorphic suffix tree/array in O(n log n) bits and support (order-isomorphic) pattern/substring matching queries efficiently. It is interesting to know if we can encode these structures in space close to the text's size of n log σ bits. We answer this question positively by presenting an O(n log σ)-bit index that allows access to any entry in the order-isomorphic suffix array (and its inverse array) in t_SA = O(log² n / log σ) time. For any pattern P given as a query, this index can count the number of substrings of T that are order-isomorphic to P (denoted by occ) in O((|P| log σ + t_SA) log n) time using standard techniques. Also, it can report the locations of those substrings in additional O(occ ⋅ t_SA) time.
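
As an illustration of the order-isomorphism definition above (not of the compact index itself), the following Python sketch compares dense-rank encodings: two equal-length sequences are order isomorphic exactly when each position holds the same rank among the distinct values. The helper names are assumptions made for this sketch.

```python
def rank_encode(s):
    """Map each character to its rank among the distinct characters of s."""
    ranks = {v: r for r, v in enumerate(sorted(set(s)))}
    return [ranks[v] for v in s]


def is_order_isomorphic(s, t):
    """True iff the relative order of characters agrees at all positions."""
    return len(s) == len(t) and rank_encode(s) == rank_encode(t)


def count_oi_occurrences(text, pattern):
    """Brute-force count of substrings of text order-isomorphic to pattern."""
    m = len(pattern)
    return sum(is_order_isomorphic(text[i:i + m], pattern)
               for i in range(len(text) - m + 1))
```

With the text `[5, 1, 4, 2, 6, 3]` and the pattern `[1, 3, 2]`, the windows `[1, 4, 2]` and `[2, 6, 3]` both have the shape low-high-middle, so the count is 2.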

Cite as

Arnab Ganguly, Dhrumil Patel, Rahul Shah, and Sharma V. Thankachan. LF Successor: Compact Space Indexing for Order-Isomorphic Pattern Matching. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 71:1-71:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


FM-Index Reveals the Reverse Suffix Array

Authors: Arnab Ganguly, Daniel Gibney, Sahar Hooshmand, M. Oğuzhan Külekci, and Sharma V. Thankachan

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Given a text T[1,n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T. The suffix array needs Θ(n log n) bits of space compared to the n log σ bits needed to store T itself. A major breakthrough [FM-Index, FOCS'00] in the last two decades has been encoding the suffix array in a near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in log^{O(1)} n time. We study an extension of the problem in which we have to also decode the suffix array values of the reverse text. This problem has numerous applications such as in approximate pattern matching [Lam et al., BIBM'09]. Known approaches maintain the FM-Index of both the forward and the reverse text, which drives up the space occupancy to 2n log σ bits (plus lower order terms). This brings in the natural question of whether we can decode the suffix array values of both the forward and the reverse text, but by using n log σ bits (plus lower order terms). We answer this question positively, and show that given the FM-Index of the forward text, we can decode the suffix array value of the reverse text in near logarithmic average time. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM-Index for both the forward and the reverse text. We believe that applications that require both the forward and reverse text will benefit from our approach.
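
For readers unfamiliar with how a suffix array value is decoded from an FM-Index, here is a toy Python sketch of the standard forward-text technique: keep only sampled suffix array entries and walk the LF mapping until a sampled row is reached. It does not implement the paper's reverse-text decoding. The naive construction, the `"\0"` sentinel, the plain-list LF table, and all names are assumptions chosen for brevity; a real FM-Index would use compressed rank structures instead.

```python
def suffix_array(t):
    """Naive suffix array (0-based start positions in sorted suffix order)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def build_index(text, sample_rate=4):
    """Return (LF table, sampled SA entries, length) for text plus a sentinel."""
    t = text + "\0"  # assumed sentinel, smaller than every other character
    sa = suffix_array(t)
    bwt = [t[i - 1] for i in sa]  # character preceding each sorted suffix
    counts = {}
    for c in t:
        counts[c] = counts.get(c, 0) + 1
    first_row, total = {}, 0  # first_row[c]: number of characters smaller than c
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    seen, lf = {}, []
    for c in bwt:  # LF(i) = first_row[c] + rank of this c within bwt[:i]
        lf.append(first_row[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return lf, samples, len(t)


def locate(row, lf, samples):
    """Decode SA[row]: each LF step moves to the suffix starting one position earlier."""
    steps = 0
    while row not in samples:
        row = lf[row]
        steps += 1
    return samples[row] + steps
```

Because text position 0 is always sampled, the walk stops after fewer than `sample_rate` steps, which is the usual space-versus-time knob for this kind of decoding.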

Cite as

Arnab Ganguly, Daniel Gibney, Sahar Hooshmand, M. Oğuzhan Külekci, and Sharma V. Thankachan. FM-Index Reveals the Reverse Suffix Array. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Categorical Range Reporting with Frequencies

Authors: Arnab Ganguly, J. Ian Munro, Yakov Nekrich, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 127, 22nd International Conference on Database Theory (ICDT 2019)

In this paper, we consider a variant of the color range reporting problem called color reporting with frequencies. Our goal is to pre-process a set of colored points into a data structure, so that given a query range Q, we can report all colors that appear in Q, along with their respective frequencies. In other words, for each reported color, we also output the number of times it occurs in Q. We describe an external-memory data structure that uses O(N(1 + log² D / log N)) words and answers one-dimensional queries in O(1 + K/B) I/Os, where N is the total number of points in the data structure, D is the total number of colors in the data structure, K is the number of reported colors, and B is the block size. Next we turn to an approximate version of this problem: report all colors σ that appear in the query range; for every reported color, we provide a constant-factor approximation on its frequency. We consider color reporting with approximate frequencies in two dimensions. Our data structure uses O(N) space and answers two-dimensional queries in O(log_B N + log* B + K/B) I/Os in the special case when the query range is bounded on two sides. As a corollary, we can also answer one-dimensional approximate queries within the same time and space bounds.
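
The one-dimensional query described above can be stated in a few lines of Python. This is only an in-memory baseline that fixes the query semantics; it scans every point in the range and so has none of the I/O guarantees of the external-memory structure. The point representation and function name are assumptions.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def color_frequencies(points, lo, hi):
    """points: list of (coordinate, color) pairs sorted by coordinate.

    Returns a Counter mapping each color appearing in [lo, hi] to the
    number of points of that color inside the range.
    """
    coords = [x for x, _ in points]
    left = bisect_left(coords, lo)    # first point with coordinate >= lo
    right = bisect_right(coords, hi)  # one past the last point with coordinate <= hi
    return Counter(color for _, color in points[left:right])
```

The cost here grows with the number of points in the range, whereas the bound quoted above depends only on K, the number of distinct colors reported.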

Cite as

Arnab Ganguly, J. Ian Munro, Yakov Nekrich, Rahul Shah, and Sharma V. Thankachan. Categorical Range Reporting with Frequencies. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 9:1-9:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


The Heaviest Induced Ancestors Problem Revisited

Authors: Paniz Abedin, Sahar Hooshmand, Arnab Ganguly, and Sharma V. Thankachan

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

We revisit the heaviest induced ancestors problem, which has several interesting applications in string matching. Let T_1 and T_2 be two weighted trees, where the weight W(u) of a node u in either of the two trees is more than the weight of u's parent. Additionally, the leaves in both trees are labeled and the labeling of the leaves in T_2 is a permutation of those in T_1. A node x in T_1 and a node y in T_2 are induced iff their subtrees have at least one common leaf label. A heaviest induced ancestor query HIA(u_1,u_2) is: given a node u_1 in T_1 and a node u_2 in T_2, output the pair (u_1^*,u_2^*) of induced nodes with the highest combined weight W(u_1^*) + W(u_2^*), such that u_1^* is an ancestor of u_1 and u_2^* is an ancestor of u_2. Let n be the number of nodes in both trees combined and ε > 0 be an arbitrarily small constant. Gagie et al. [CCCG'13] introduced this problem and proposed three solutions with the following space-time trade-offs:
- an O(n log^2 n)-word data structure with O(log n log log n) query time
- an O(n log n)-word data structure with O(log^2 n) query time
- an O(n)-word data structure with O(log^{3+ε} n) query time.
In this paper, we revisit this problem and present new data structures with improved bounds. Our results are as follows:
- an O(n log n)-word data structure with O(log n log log n) query time
- an O(n)-word data structure with O(log^2 n / log log n) query time.
As a corollary, we also improve the LZ compressed index of Gagie et al. [CCCG'13] for answering longest common substring (LCS) queries. Additionally, we show that the LCS after one edit problem of size n [Amir et al., SPIRE'17] can also be reduced to the heaviest induced ancestors problem over two trees of n nodes in total. This yields a straightforward improvement over its current solution of O(n log^3 n) space and O(log^3 n) query time.
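
The HIA query defined above can be spelled out as a brute-force Python sketch that tries every pair of ancestors. It is meant only to pin down the definition, not to reflect any of the data structures listed. The dictionary-based tree layout (`parent`, `weight`, `label`) is an assumption, as is the convention that a node counts as its own ancestor.

```python
def ancestors(parent, u):
    """Yield u, then its parent, and so on up to the root."""
    while u is not None:
        yield u
        u = parent.get(u)


def subtree_labels(tree, u):
    """Set of leaf labels found in the subtree rooted at u."""
    return {lab for leaf, lab in tree["label"].items()
            if u in ancestors(tree["parent"], leaf)}


def hia(t1, t2, u1, u2):
    """Return the induced ancestor pair (a, b) maximizing W(a) + W(b)."""
    best, best_weight = None, float("-inf")
    for a in ancestors(t1["parent"], u1):
        labels_a = subtree_labels(t1, a)
        for b in ancestors(t2["parent"], u2):
            if labels_a & subtree_labels(t2, b):  # a and b share a leaf label
                w = t1["weight"][a] + t2["weight"][b]
                if w > best_weight:
                    best, best_weight = (a, b), w
    return best


# Two small example trees (hypothetical data); weights grow from parent to child.
T1 = {"parent": {"r1": None, "a1": "r1", "b1": "r1", "x1": "a1", "x2": "a1", "x3": "b1"},
      "weight": {"r1": 0, "a1": 1, "b1": 2, "x1": 3, "x2": 3, "x3": 4},
      "label": {"x1": 1, "x2": 2, "x3": 3}}
T2 = {"parent": {"r2": None, "a2": "r2", "b2": "r2", "y1": "a2", "y3": "a2", "y2": "b2"},
      "weight": {"r2": 0, "a2": 5, "b2": 1, "y1": 6, "y3": 6, "y2": 2},
      "label": {"y1": 1, "y3": 3, "y2": 2}}
```

On these trees, `hia(T1, T2, "x2", "y1")` returns `("a1", "y1")`: the subtree of `a1` and the leaf `y1` share label 1, and their combined weight of 7 beats every other induced ancestor pair.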

Cite as

Paniz Abedin, Sahar Hooshmand, Arnab Ganguly, and Sharma V. Thankachan. The Heaviest Induced Ancestors Problem Revisited. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 20:1-20:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Structural Pattern Matching - Succinctly

Authors: Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 92, 28th International Symposium on Algorithms and Computation (ISAAC 2017)

Let T be a text of length n containing characters from an alphabet Σ, which is the union of two disjoint sets: Σ_s containing static characters (s-characters) and Σ_p containing parameterized characters (p-characters). Each character in Σ_p has an associated complementary character from Σ_p. A pattern P (also over Σ) matches an equal-length substring S of T iff the s-characters match exactly, there exists a one-to-one function that renames the p-characters in S to the p-characters in P, and if a p-character x is renamed to another p-character y then the complement of x is renamed to the complement of y. The task is to find the starting positions (occurrences) of all such substrings S. The previous indexing solution [Shibuya, SWAT 2000], known as the Structural Suffix Tree, requires Θ(n log n) bits of space, and can find all occ occurrences in time O(|P| log σ + occ), where σ = |Σ|. In this paper, we present the first succinct index for this problem, which occupies n log σ + O(n) bits and offers O(|P| log σ + occ · log n log σ) query time.
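
The matching rule above can be checked directly for a single pair of strings. The Python sketch below is one reading of the definition, under the assumption that the renaming must stay consistent with complements even for characters that do not occur in the substring; the parameter names `static` and `complement` are illustrative.

```python
def is_structural_match(s, p, static, complement):
    """Check whether s structurally matches p.

    static:     set of s-characters, which must match exactly
    complement: dict mapping every p-character to its complementary p-character
    """
    if len(s) != len(p):
        return False
    fwd, bwd = {}, {}  # renaming s -> p and its inverse (enforces one-to-one)
    for a, b in zip(s, p):
        if a in static or b in static:
            if a != b:
                return False
            continue
        # Renaming a -> b also forces complement(a) -> complement(b).
        for x, y in ((a, b), (complement[a], complement[b])):
            if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
                return False
    return True
```

With an RNA-like alphabet where A/U and C/G are complements and "-" is static, "AU-C" matches "CG-A" (A→C forces U→G, and C→A forces G→U), while "AU" does not match "CA" because U would have to map to both G and A.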

Cite as

Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Structural Pattern Matching - Succinctly. In 28th International Symposium on Algorithms and Computation (ISAAC 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 92, pp. 35:1-35:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


Space-Time Trade-Offs for the Shortest Unique Substring Problem

Authors: Arnab Ganguly, Wing-Kai Hon, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 64, 27th International Symposium on Algorithms and Computation (ISAAC 2016)

Given a string X[1, n] and a position k ∈ [1, n], the Shortest Unique Substring of X covering k, denoted by S_k, is a substring X[i, j] of X which satisfies the following conditions: (i) i ≤ k ≤ j, (ii) i is the only position where there is an occurrence of X[i, j], and (iii) j - i is minimized. The best-known algorithm [Hon et al., ISAAC 2015] can find S_k for all k ∈ [1, n] in time O(n) using the string X and additional 2n words of working space. Let τ be a given parameter. We present the following new results. For any given k ∈ [1, n], we can compute S_k via a deterministic algorithm in O(n τ^2 log(n/τ)) time using X and additional O(n/τ) words of working space. For every k ∈ [1, n], we can compute S_k via a deterministic algorithm in O(n τ^2 log(n/τ)) time using X and additional O(n/τ) words and 4n + o(n) bits of working space. For both problems above, we present an O(n τ log^{c+1} n)-time randomized algorithm that uses n / log^c n words in addition to that mentioned above, where c ≥ 0 is an arbitrary constant. In this case, the reported string is unique and covers k, but with probability at most n^{-O(1)}, may not be the shortest. As a consequence of our techniques, we also obtain similar space-and-time trade-offs for a related problem of finding Maximal Unique Matches of two strings [Delcher et al., Nucleic Acids Res. 1999].
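
Conditions (i) to (iii) above translate into a short brute-force search, sketched here in Python purely to illustrate the definition. It has nothing to do with the space-efficient algorithms of the paper and can take cubic time. The function name and the leftmost tie-breaking rule are assumptions.

```python
def shortest_unique_substring(x, k):
    """Return (i, j), 1-based and inclusive, with X[i, j] the shortest
    substring that covers position k and occurs exactly once in x."""
    n = len(x)
    for length in range(1, n + 1):
        # 0-based starts i with i <= k-1 <= i+length-1 and i+length <= n
        for i in range(max(0, k - length), min(k - 1, n - length) + 1):
            sub = x[i:i + length]
            # unique: first occurrence is at i and there is no later one
            if x.find(sub) == i and x.find(sub, i + 1) == -1:
                return (i + 1, i + length)
    return None  # not reached for 1 <= k <= n, since x itself is unique
```

For X = "abaab" and k = 3, the single character "a" is repeated, but "ba" at positions [2, 3] occurs only once, so the answer is `(2, 3)`.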

Cite as

Arnab Ganguly, Wing-Kai Hon, Rahul Shah, and Sharma V. Thankachan. Space-Time Trade-Offs for the Shortest Unique Substring Problem. In 27th International Symposium on Algorithms and Computation (ISAAC 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 64, pp. 34:1-34:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)


Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching

Authors: Arnab Ganguly, Wing-Kai Hon, Kunihiko Sadakane, Rahul Shah, Sharma V. Thankachan, and Yilin Yang

Published in: LIPIcs, Volume 54, 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)

Let S and S' be two strings of the same length. We consider the following two variants of string matching.
* Parameterized Matching: The characters of S and S' are partitioned into static characters and parameterized characters. The strings are a parameterized match iff the static characters match exactly and there exists a one-to-one function which renames the parameterized characters in S to those in S'.
* Order-Preserving Matching: The strings are an order-preserving match iff for any two integers i,j in [1,|S|], S[i] <= S[j] iff S'[i] <= S'[j].
Let P be a collection of d patterns {P_1, P_2, ..., P_d} of total length n characters, which are chosen from an alphabet Sigma. Given a text T, also over Sigma, we consider the dictionary indexing problem under the above definitions of string matching. Specifically, the task is to index P, such that we can report all positions j where at least one of the patterns P_i in P is a parameterized match (resp. order-preserving match) with the same-length substring of T starting at j. Previous best-known indexes occupy O(n * log(n)) bits and can report all occ positions in O(|T| * log(|Sigma|) + occ) time. We present space-efficient indexes that occupy O(n * log(|Sigma|+d) * log(n)) bits and report all occ positions in O(|T| * (log(|Sigma|) + log_{|Sigma|}(n)) + occ) time for parameterized matching and in O(|T| * log(n) + occ) time for order-preserving matching.
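
The dictionary problem for parameterized matching with static characters can be illustrated by the following Python sketch. It is a brute-force baseline, not the space-efficient index: patterns are stored under an encoding that keeps static characters as they are and replaces each parameterized character by the distance to its previous occurrence, and every text window is encoded and looked up. All names are assumptions.

```python
def prev_encode(s, static):
    """Static characters stay as-is; each parameterized character becomes
    the distance to its previous occurrence (0 if it has none)."""
    last, out = {}, []
    for i, ch in enumerate(s):
        if ch in static:
            out.append(("s", ch))
        else:
            out.append(("p", i - last[ch] if ch in last else 0))
            last[ch] = i
    return tuple(out)


def dictionary_p_match(text, patterns, static):
    """1-based positions j where some pattern parameterized-matches the
    same-length substring of text starting at j."""
    codes = {}  # pattern length -> set of encoded patterns of that length
    for p in patterns:
        codes.setdefault(len(p), set()).add(prev_encode(p, static))
    hits = []
    for j in range(len(text)):
        for m, code_set in codes.items():
            if j + m <= len(text) and prev_encode(text[j:j + m], static) in code_set:
                hits.append(j + 1)
                break  # report each position once
    return hits
```

With "+" as the only static character, the patterns "a+b" and "xx" match the text "u+vww+w" at positions 1 ("u+v") and 4 ("ww").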

Cite as

Arnab Ganguly, Wing-Kai Hon, Kunihiko Sadakane, Rahul Shah, Sharma V. Thankachan, and Yilin Yang. Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 2:1-2:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)


A Framework for Dynamic Parameterized Dictionary Matching

Authors: Arnab Ganguly, Wing-Kai Hon, and Rahul Shah

Published in: LIPIcs, Volume 53, 15th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2016)

Two equal-length strings S and S' are a parameterized-match (p-match) iff there exists a one-to-one function that renames the characters in S to those in S'. Let P be a collection of d patterns of total length n characters that are chosen from an alphabet Sigma of cardinality sigma. The task is to index P such that we can support the following operations.
* search(T): given a text T, report all occurrences <j,P_i> such that there exists a pattern P_i in P that is a p-match with the substring T[j,j+|P_i|-1].
* ins(P_i)/del(P_i): modify the index when a pattern P_i is inserted/deleted.
We present a linear-space index that occupies O(n*log n) bits and supports (i) search(T) in worst-case O(|T|*log^2 n + occ) time, where occ is the number of occurrences reported, and (ii) ins(P_i) and del(P_i) in amortized O(|P_i|*polylog(n)) time. Then, we present a succinct index that occupies (1+o(1))n*log sigma + O(d*log n) bits and supports (i) search(T) in worst-case O(|T|*log^2 n + occ) time, and (ii) ins(P_i) and del(P_i) in amortized O(|P_i|*polylog(n)) time. We also present results related to the semi-dynamic variant of the problem, where deletion is not allowed.
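
The three operations above can be mocked up with a hash table, as in the Python sketch below. This only shows the interface and its semantics; search time here grows with the number of distinct pattern lengths, unlike the bounds stated for the paper's indexes. The class name, the `name` argument used to identify a pattern, and the method name `delete` (since `del` is a Python keyword) are assumptions.

```python
class PDictionary:
    """Toy dynamic dictionary for parameterized matching."""

    def __init__(self):
        self.by_code = {}  # encoded pattern -> set of pattern names

    @staticmethod
    def _encode(s):
        """Distance to previous occurrence of each character (0 if none)."""
        last, out = {}, []
        for i, ch in enumerate(s):
            out.append(i - last[ch] if ch in last else 0)
            last[ch] = i
        return tuple(out)

    def ins(self, name, pattern):
        self.by_code.setdefault(self._encode(pattern), set()).add(name)

    def delete(self, name, pattern):
        code = self._encode(pattern)
        names = self.by_code.get(code, set())
        names.discard(name)
        if not names:
            self.by_code.pop(code, None)

    def search(self, text):
        """Report all (j, name) with pattern `name` p-matching at 1-based j."""
        lengths = sorted({len(code) for code in self.by_code})
        hits = []
        for j in range(len(text)):
            for m in lengths:
                if j + m > len(text):
                    break
                code = self._encode(text[j:j + m])
                hits.extend((j + 1, name) for name in sorted(self.by_code.get(code, ())))
        return hits
```

After `ins("P1", "ab")` and `ins("P2", "aa")`, searching "xyy" reports `(1, "P1")` and `(2, "P2")`; deleting P1 leaves only the second occurrence.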

Cite as

Arnab Ganguly, Wing-Kai Hon, and Rahul Shah. A Framework for Dynamic Parameterized Dictionary Matching. In 15th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 53, pp. 10:1-10:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)


Forbidden Extension Queries

Authors: Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 45, 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015)

Document retrieval is one of the most fundamental problems in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extensions. Let D={T_1,T_2,...,T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- the forbidden extension of the included pattern P^+. A forbidden extension query < P^+,P^- > asks to report all occ documents in D that contain P^+ as a substring, but do not contain P^- as one. A top-k forbidden extension query < P^+,P^-,k > asks to report those k documents among the occ documents that are most relevant to P^+. We present a linear index (in words) with an O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following results, when the relevance of a document is based on PageRank:
- an O(n) space (in words) index with O(|P^-| log σ + k) query time, where σ is the size of the alphabet from which characters in D are chosen. For constant alphabets, this yields an optimal query time of O(|P^-| + k).
- for any constant ε > 0, a (|CSA| + |CSA^*| + D log(n/D) + O(n))-bit index with O(search(P) + k · t_SA · log^{2+ε} n) query time, where search(P) is the time to find the suffix range of a pattern P, t_SA is the time to find a suffix (or inverse suffix) array value, and |CSA^*| denotes the maximum of the space needed to store the compressed suffix array CSA of the concatenated text of all documents, or the total space needed to store the individual CSA of each document.
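
The two query types above have a direct brute-force reading, sketched below in Python to fix their semantics. It scans every document and therefore bears no relation to the indexes described. The function names and the `scores` list, which stands in for a per-document, pattern-independent relevance value such as PageRank, are assumptions.

```python
def forbidden_extension_query(docs, p_plus, p_minus):
    """1-based ids of documents that contain p_plus but not p_minus."""
    if not (p_minus.startswith(p_plus) and len(p_plus) < len(p_minus)):
        raise ValueError("p_plus must be a proper prefix of p_minus")
    return [i for i, d in enumerate(docs, 1)
            if p_plus in d and p_minus not in d]


def top_k_forbidden_extension(docs, scores, p_plus, p_minus, k):
    """The k qualifying documents with the highest static scores."""
    hits = forbidden_extension_query(docs, p_plus, p_minus)
    return sorted(hits, key=lambda i: -scores[i - 1])[:k]
```

For the documents "banana", "bandana", "cabana" and "band" with P^+ = "ban" and P^- = "band", documents 1 and 3 qualify; with scores 0.1, 0.9, 0.5, 0.3 the top-1 answer is document 3.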

Cite as

Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Forbidden Extension Queries. In 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 45, pp. 320-335, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)


