DROPS

Document

DOI: 10.4230/LIPIcs.CPM.2020.8

Counting Distinct Patterns in Internal Dictionary Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

We consider the problem of preprocessing a text T of length n and a dictionary 𝒟 in order to be able to efficiently answer queries CountDistinct(i,j), that is, given i and j return the number of patterns from 𝒟 that occur in the fragment T[i..j]. The dictionary is internal in the sense that each pattern in 𝒟 is given as a fragment of T. This way, the dictionary takes space proportional to the number of patterns d=|𝒟| rather than their total length, which could be Θ(n⋅ d). An 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 𝒪(log n)-approximately in 𝒪̃(1) time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 2-approximately in 𝒪̃(1) time. Using range queries, for any m, we give an 𝒪̃(min(nd/m,n²/m²)+d)-size data structure that answers CountDistinct(i,j) queries exactly in 𝒪̃(m) time. We also consider the special case when the dictionary consists of all square factors of the string. We design an 𝒪(n log² n)-size data structure that allows us to count distinct squares in a text fragment T[i..j] in 𝒪(log n) time.

Cite as

Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. Counting Distinct Patterns in Internal Dictionary Matching. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 8:1-8:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{charalampopoulos_et_al:LIPIcs.CPM.2020.8,
  author =	{Charalampopoulos, Panagiotis and Kociumaka, Tomasz and Mohamed, Manal and Radoszewski, Jakub and Rytter, Wojciech and Straszy\'{n}ski, Juliusz and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Counting Distinct Patterns in Internal Dictionary Matching}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{8:1--8:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.8},
  URN =		{urn:nbn:de:0030-drops-121336},
  doi =		{10.4230/LIPIcs.CPM.2020.8},
  annote =	{Keywords: dictionary matching, internal pattern matching, squares}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2019.22

Internal Dictionary Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń

Published in: LIPIcs, Volume 149, 30th International Symposium on Algorithms and Computation (ISAAC 2019)

Abstract

We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary D in fragments of a given string T of length n. The dictionary is internal in the sense that each pattern in D is given as a fragment of T. This way, D takes space proportional to the number of patterns d=|D| rather than their total length, which could be Theta(n * d). In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from D in a fragment T[i..j] (operations Report(i,j) and Count(i,j) below, as well as operation Exists(i,j) that returns true iff Count(i,j)>0) and reporting distinct patterns from D that occur in T[i..j] (operation ReportDistinct(i,j)). We show how to construct, in O((n+d) log^{O(1)} n) time, a data structure that answers each of these queries in time O(log^{O(1)} n+|output|) - see the table below for specific time and space complexities. Query | Preprocessing time | Space | Query time Exists(i,j) | O(n+d) | O(n) | O(1) Report(i,j) | O(n+d) | O(n+d) | O(1+|output|) ReportDistinct(i,j) | O(n log n+d) | O(n+d) | O(log n+|output|) Count(i,j) | O({n log n}/{log log n} + d log^{3/2} n) | O(n+d log n) | O({log^2n}/{log log n}) The case of counting patterns is much more involved and needs a combination of a locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide tight - up to subpolynomial factors - upper and lower bounds for the case of a dynamic dictionary.

Cite as

Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal Dictionary Matching. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 149, pp. 22:1-22:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{charalampopoulos_et_al:LIPIcs.ISAAC.2019.22,
  author =	{Charalampopoulos, Panagiotis and Kociumaka, Tomasz and Mohamed, Manal and Radoszewski, Jakub and Rytter, Wojciech and Wale\'{n}, Tomasz},
  title =	{{Internal Dictionary Matching}},
  booktitle =	{30th International Symposium on Algorithms and Computation (ISAAC 2019)},
  pages =	{22:1--22:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-130-6},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{149},
  editor =	{Lu, Pinyan and Zhang, Guochuan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2019.22},
  URN =		{urn:nbn:de:0030-drops-115182},
  doi =		{10.4230/LIPIcs.ISAAC.2019.22},
  annote =	{Keywords: string algorithms, dictionary matching, internal pattern matching}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2018.70

Longest Unbordered Factor in Quasilinear Time

Authors: Tomasz Kociumaka, Ritu Kundu, Manal Mohamed, and Solon P. Pissis

Published in: LIPIcs, Volume 123, 29th International Symposium on Algorithms and Computation (ISAAC 2018)

Abstract

A border u of a word w is a proper factor of w occurring both as a prefix and as a suffix. The maximal unbordered factor of w is the longest factor of w which does not have a border. Here an O(n log n)-time with high probability (or O(n log n log^2 log n)-time deterministic) algorithm to compute the Longest Unbordered Factor Array of w for general alphabets is presented, where n is the length of w. This array specifies the length of the maximal unbordered factor starting at each position of w. This is a major improvement on the running time of the currently best worst-case algorithm working in O(n^{1.5}) time for integer alphabets [Gawrychowski et al., 2015].

Cite as

Tomasz Kociumaka, Ritu Kundu, Manal Mohamed, and Solon P. Pissis. Longest Unbordered Factor in Quasilinear Time. In 29th International Symposium on Algorithms and Computation (ISAAC 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 123, pp. 70:1-70:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{kociumaka_et_al:LIPIcs.ISAAC.2018.70,
  author =	{Kociumaka, Tomasz and Kundu, Ritu and Mohamed, Manal and Pissis, Solon P.},
  title =	{{Longest Unbordered Factor in Quasilinear Time}},
  booktitle =	{29th International Symposium on Algorithms and Computation (ISAAC 2018)},
  pages =	{70:1--70:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-094-1},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{123},
  editor =	{Hsu, Wen-Lian and Lee, Der-Tsai and Liao, Chung-Shou},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2018.70},
  URN =		{urn:nbn:de:0030-drops-100184},
  doi =		{10.4230/LIPIcs.ISAAC.2018.70},
  annote =	{Keywords: longest unbordered factor, factorisation, period, border, strings}
}

Document

DOI: 10.4230/LIPIcs.WABI.2017.4

Optimal Computation of Overabundant Words

Authors: Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos

Published in: LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)

Abstract

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.

Cite as

Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. Optimal Computation of Overabundant Words. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 4:1-4:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{almirantis_et_al:LIPIcs.WABI.2017.4,
  author =	{Almirantis, Yannis and Charalampopoulos, Panagiotis and Gao, Jia and Iliopoulos, Costas S. and Mohamed, Manal and Pissis, Solon P. and Polychronopoulos, Dimitris},
  title =	{{Optimal Computation of Overabundant Words}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{4:1--4:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.4},
  URN =		{urn:nbn:de:0030-drops-76468},
  doi =		{10.4230/LIPIcs.WABI.2017.4},
  annote =	{Keywords: overabundant words, avoided words, suffix tree, DNA sequence analysis}
}

4 Search Results for "Mohamed, Manal"

Counting Distinct Patterns in Internal Dictionary Matching

Abstract

Cite as

Internal Dictionary Matching

Abstract

Cite as

Longest Unbordered Factor in Quasilinear Time

Abstract

Cite as

Optimal Computation of Overabundant Words

Abstract

Cite as

Thanks for your feedback!

Could not send message