DROPS

Document

Research

Specific Patterns Against Reference Sequences

Authors: Marie-Pierre Béal and Maxime Crochemore

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

We design alignment-free techniques for comparing a set of sequences or just a word, called a target, against another set of words, called a reference. This is done with the detection of factor patterns that distinguish the target from the reference. A target-specific factor of a target T against a reference R is then a factor w of a word in T that is not a factor of a word in R but whose proper factors of w are factors of a word in R. The strategy is based on the notion of minimal absent/forbidden words. We first address the computation of the set of target-specific factors of a target T against a reference R, where T and R are finite sets of sequences. The result is the construction of an automaton accepting the set of all considered target-specific factors. The construction algorithm runs in linear time according to the size of T ∪ R. The second result is the design of an algorithm to compute all the occurrences in a single sequence T of its target-specific factors against a reference R. The algorithm runs in real-time on the target sequence, independently of the number of occurrences of target-specific factors.

Cite as

Marie-Pierre Béal and Maxime Crochemore. Specific Patterns Against Reference Sequences. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 14:1-14:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{beal_et_al:OASIcs.Grossi.14,
  author =	{B\'{e}al, Marie-Pierre and Crochemore, Maxime},
  title =	{{Specific Patterns Against Reference Sequences}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{14:1--14:12},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.14},
  URN =		{urn:nbn:de:0030-drops-238130},
  doi =		{10.4230/OASIcs.Grossi.14},
  annote =	{Keywords: Specific pattern, Minimal absent word, Minimal forbidden word, Directed Acyclic Word Graph (DAWG), Suffix automaton}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.17

On the Compressiveness of the Burrows-Wheeler Transform

Authors: Hideo Bannai, Tomohiro I, and Yuto Nakashima

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

The Burrows-Wheeler transform (BWT) is a reversible transform that converts a string w into another string BWT(w). The size of the run-length encoded BWT (RLBWT) can be interpreted as a measure of repetitiveness in the class of representations called dictionary compression which are essentially representations based on copy and paste operations. In this paper, we shed new light on the compressiveness of BWT and the bijective BWT (BBWT). We first extend previous results on the relations of their run-length compressed sizes r and r_B. We also show that the so-called "clustering effect" of BWT and BBWT can be captured by measures other than empirical entropy or run-length encoding. In particular, we show that BWT and BBWT do not increase the repetitiveness of the string with respect to various measures based on dictionary compression by more than a polylogarithmic factor. Furthermore, we show that there exists an infinite family of strings that are maximally incompressible by any dictionary compression measure, but become very compressible after applying BBWT. An interesting implication of this result is that it is possible to transcend dictionary compression in some cases by simply applying BBWT before applying dictionary compression.

Cite as

Hideo Bannai, Tomohiro I, and Yuto Nakashima. On the Compressiveness of the Burrows-Wheeler Transform. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 17:1-17:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{bannai_et_al:LIPIcs.CPM.2025.17,
  author =	{Bannai, Hideo and I, Tomohiro and Nakashima, Yuto},
  title =	{{On the Compressiveness of the Burrows-Wheeler Transform}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{17:1--17:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.17},
  URN =		{urn:nbn:de:0030-drops-231116},
  doi =		{10.4230/LIPIcs.CPM.2025.17},
  annote =	{Keywords: Data Compression, Bijective Burrows-Wheeler Transform, Fibonacci words}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.23

Space-Efficient Online Computation of String Net Occurrences

Authors: Takuya Mieno and Shunsuke Inenaga

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

A substring u of a string T is said to be a repeat if u occurs at least twice in T. An occurrence [i..j] of a repeat u in T is said to be a net occurrence if each of the substrings aub = T[i-1..j+1], au = T[i-1..j], and ub = T[i..j+1] occurs exactly once in T. The occurrence [i-1..j+1] of aub is said to be an extended net occurrence of u. Let T be an input string of length n over an alphabet of size σ, and let ENO(T) denote the set of extended net occurrences of repeats in T. Guo et al. [SPIRE 2024] presented an online algorithm which can report ENO(T[1..i]) in T[1..i] in O(nσ²) time, for each prefix T[1..i] of T. Very recently, Inenaga [arXiv 2024] gave a faster online algorithm that can report ENO(T[1..i]) in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T, where #S denotes the cardinality of a set S. Both of the aforementioned data structures can be maintained in O(n log σ) time and occupy O(n) space, where the O(n)-space requirement comes from the suffix tree data structure. In particular, Inenaga’s recent algorithm is based on Weiner’s right-to-left online suffix tree construction. In this paper, we show that one can modify Ukkonen’s left-to-right online suffix tree construction algorithm in O(n) space, so that ENO(T[1..i]) can be reported in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T. This is an improvement over Guo et al.’s method that is also based on Ukkonen’s algorithm. Further, this leads us to the two following space-efficient alternatives: - A sliding-window algorithm of O(d) working space that can report ENO(T[i-d+1..i]) in optimal O(#ENO(T[i-d+1..i])) time for each sliding window T[i-d+1..i] of size d in T. - A CDAWG-based online algorithm of O(𝖾) working space that can report ENO(T[1..i]) in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T, where 𝖾 < 2n is the number of edges in the CDAWG for T. All of our proposed data structures can be maintained in O(n log σ) time for the input online string T. We also discuss that the extended net occurrences of repeats in T can be fully characterized in terms of the minimal unique substrings (MUSs) in T.

Cite as

Takuya Mieno and Shunsuke Inenaga. Space-Efficient Online Computation of String Net Occurrences. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 23:1-23:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{mieno_et_al:LIPIcs.CPM.2025.23,
  author =	{Mieno, Takuya and Inenaga, Shunsuke},
  title =	{{Space-Efficient Online Computation of String Net Occurrences}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{23:1--23:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.23},
  URN =		{urn:nbn:de:0030-drops-231175},
  doi =		{10.4230/LIPIcs.CPM.2025.23},
  annote =	{Keywords: string net occurrences, suffix trees, CDAWGs, maximal repeats, minimal unique substrings (MUSs)}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.CPM.2021.1

Repetitions in Strings: A "Constant" Problem (Invited Talk)

Authors: Hideo Bannai

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

Repeating structures in strings is one of the most fundamental characteristics of strings, and has been an important topic in the field of combinatorics on words and combinatorial pattern matching since their beginnings. In this talk, I will focus on squares and maximal repetitions and review the "runs" theorem [Hideo Bannai et al., 2017] as well as related results (e.g. [Aviezri S. Fraenkel and Jamie Simpson, 1998; Yuta Fujishige et al., 2017; Ryo Sugahara et al., 2019; Philip Bille et al., 2020; Hideo Bannai et al., 2020; Jonas Ellert and Johannes Fischer, 2021]) which address the two main questions: how many of them can be contained in a string of given length, and algorithms for computing them.

Cite as

Hideo Bannai. Repetitions in Strings: A "Constant" Problem (Invited Talk). In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bannai:LIPIcs.CPM.2021.1,
  author =	{Bannai, Hideo},
  title =	{{Repetitions in Strings: A "Constant" Problem}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{1:1--1:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.1},
  URN =		{urn:nbn:de:0030-drops-139523},
  doi =		{10.4230/LIPIcs.CPM.2021.1},
  annote =	{Keywords: Maximal repetitions, Squares, Lyndon words}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2019.40

An Improved Data Structure for Left-Right Maximal Generic Words Problem

Authors: Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda

Published in: LIPIcs, Volume 149, 30th International Symposium on Algorithms and Computation (ISAAC 2019)

Abstract

For a set D of documents and a positive integer d, a string w is said to be d-left-right maximal, if (1) w occurs in at least d documents in D, and (2) any proper superstring of w occurs in less than d documents. The left-right-maximal generic words problem is, given a set D of documents, to preprocess D so that for any string p and for any positive integer d, all the superstrings of p that are d-left-right maximal can be answered quickly. In this paper, we present an O(n log m) space data structure (in words) which answers queries in O(|p| + o log log m) time, where n is the total length of documents in D, m is the number of documents in D and o is the number of outputs. Our solution improves the previous one by Nishimoto et al. (PSC 2015), which uses an O(n log n) space data structure answering queries in O(|p|+ r * log n + o * log^2 n) time, where r is the number of right-extensions q of p occurring in at least d documents such that any proper right extension of q occurs in less than d documents.

Cite as

Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. An Improved Data Structure for Left-Right Maximal Generic Words Problem. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 149, pp. 40:1-40:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{fujishige_et_al:LIPIcs.ISAAC.2019.40,
  author =	{Fujishige, Yuta and Nakashima, Yuto and Inenaga, Shunsuke and Bannai, Hideo and Takeda, Masayuki},
  title =	{{An Improved Data Structure for Left-Right Maximal Generic Words Problem}},
  booktitle =	{30th International Symposium on Algorithms and Computation (ISAAC 2019)},
  pages =	{40:1--40:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-130-6},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{149},
  editor =	{Lu, Pinyan and Zhang, Guochuan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2019.40},
  URN =		{urn:nbn:de:0030-drops-115366},
  doi =		{10.4230/LIPIcs.ISAAC.2019.40},
  annote =	{Keywords: generic words, suffix trees, string processing algorithms}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2017.33

Almost Linear Time Computation of Maximal Repetitions in Run Length Encoded Strings

Authors: Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda

Published in: LIPIcs, Volume 92, 28th International Symposium on Algorithms and Computation (ISAAC 2017)

Abstract

We consider the problem of computing all maximal repetitions contained in a string that is given in run-length encoding. Given a run-length encoding of a string, we show that the maximum number of maximal repetitions contained in the string is at most m+k-1, where m is the size of the run-length encoding, and k is the number of run-length factors whose exponent is at least 2. We also show an algorithm for computing all maximal repetitions in O(m \alpha(m)) time and O(m) space, where \alpha denotes the inverse Ackermann function.

Cite as

Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Almost Linear Time Computation of Maximal Repetitions in Run Length Encoded Strings. In 28th International Symposium on Algorithms and Computation (ISAAC 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 92, pp. 33:1-33:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{fujishige_et_al:LIPIcs.ISAAC.2017.33,
  author =	{Fujishige, Yuta and Nakashima, Yuto and Inenaga, Shunsuke and Bannai, Hideo and Takeda, Masayuki},
  title =	{{Almost Linear Time Computation of Maximal Repetitions in Run Length Encoded Strings}},
  booktitle =	{28th International Symposium on Algorithms and Computation (ISAAC 2017)},
  pages =	{33:1--33:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-054-5},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{92},
  editor =	{Okamoto, Yoshio and Tokuyama, Takeshi},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2017.33},
  URN =		{urn:nbn:de:0030-drops-82610},
  doi =		{10.4230/LIPIcs.ISAAC.2017.33},
  annote =	{Keywords: maximal repetitions,run length encoding}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.20

Faster STR-IC-LCS Computation via RLE

Authors: Keita Kuboi, Yuta Fujishige, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

The constrained LCS problem asks one to find a longest common subsequence of two input strings A and B with some constraints. The STR-IC-LCS problem is a variant of the constrained LCS problem, where the solution must include a given constraint string C as a substring. Given two strings A and B of respective lengths M and N, and a constraint string C of length at most min{M, N}, the best known algorithm for the STR-IC-LCS problem, proposed by Deorowicz (Inf. Process. Lett., 11:423-426, 2012), runs in O(MN) time. In this work, we present an O(mN + nM)-time solution to the STR-IC-LCS problem, where m and n denote the sizes of the run-length encodings of A and B, respectively. Since m <= M and n <= N always hold, our algorithm is always as fast as Deorowicz's algorithm, and is faster when input strings are compressible via RLE.

Cite as

Keita Kuboi, Yuta Fujishige, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster STR-IC-LCS Computation via RLE. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 20:1-20:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{kuboi_et_al:LIPIcs.CPM.2017.20,
  author =	{Kuboi, Keita and Fujishige, Yuta and Inenaga, Shunsuke and Bannai, Hideo and Takeda, Masayuki},
  title =	{{Faster STR-IC-LCS Computation via RLE}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{20:1--20:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.20},
  URN =		{urn:nbn:de:0030-drops-73335},
  doi =		{10.4230/LIPIcs.CPM.2017.20},
  annote =	{Keywords: longest common subsequence, STR-IC-LCS, run-length encoding}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2016.38

Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets

Authors: Yuta Fujishige, Yuki Tsujimaru, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda

Published in: LIPIcs, Volume 58, 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016)

Abstract

The directed acyclic word graph (DAWG) of a string y is the smallest (partial) DFA which recognizes all suffixes of y and has only O(n) nodes and edges. We present the first O(n)-time algorithm for computing the DAWG of a given string y of length n over an integer alphabet of polynomial size in n. We also show that a straightforward modification to our DAWG construction algorithm leads to the first O(n)-time algorithm for constructing the affix tree of a given string y over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. As an application to our O(n)-time DAWG construction algorithm, we show that the set MAW(y) of all minimal absent words of y can be computed in optimal O(n + |MAW(y)|) time and O(n) working space for integer alphabets.

Cite as

Yuta Fujishige, Yuki Tsujimaru, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets. In 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 58, pp. 38:1-38:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{fujishige_et_al:LIPIcs.MFCS.2016.38,
  author =	{Fujishige, Yuta and Tsujimaru, Yuki and Inenaga, Shunsuke and Bannai, Hideo and Takeda, Masayuki},
  title =	{{Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets}},
  booktitle =	{41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016)},
  pages =	{38:1--38:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-016-3},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{58},
  editor =	{Faliszewski, Piotr and Muscholl, Anca and Niedermeier, Rolf},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2016.38},
  URN =		{urn:nbn:de:0030-drops-64528},
  doi =		{10.4230/LIPIcs.MFCS.2016.38},
  annote =	{Keywords: string algorithms, DAWGs, suffix trees, minimal absent words}
}

8 Search Results for "Fujishige, Yuta"

Specific Patterns Against Reference Sequences

Abstract

Cite as

On the Compressiveness of the Burrows-Wheeler Transform

Abstract

Cite as

Space-Efficient Online Computation of String Net Occurrences

Abstract

Cite as

Repetitions in Strings: A "Constant" Problem (Invited Talk)

Abstract

Cite as

An Improved Data Structure for Left-Right Maximal Generic Words Problem

Abstract

Cite as

Almost Linear Time Computation of Maximal Repetitions in Run Length Encoded Strings

Abstract

Cite as

Faster STR-IC-LCS Computation via RLE

Abstract

Cite as

Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets

Abstract

Cite as

Thanks for your feedback!

Could not send message