Search Results

Document
Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications

Authors: Esteban Gabory, Moses Njagi Mwaniki, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski, Michelle Sweering, and Wiktor Zuba

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract
An elastic-degenerate (ED) string T is a sequence of n sets T[1],…,T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as ℒ(T) = {S_1 ⋯ S_n : S_i ∈ T[i] for all i ∈ [1,n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T₁ and T₂ of lengths n₁ and n₂, cardinalities m₁ and m₂, and sizes N₁ and N₂, respectively, we show the following:
- There is no 𝒪((N₁N₂)^{1-ε})-time algorithm, thus no 𝒪((N₁m₂+N₂m₁)^{1-ε})-time algorithm and no 𝒪((N₁n₂+N₂n₁)^{1-ε})-time algorithm, for any constant ε > 0, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial 𝒪((N₁+N₂)^{1.2-ε}f(n₁,n₂))-time algorithm, for any constant ε > 0 and any function f, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An 𝒪(N₁log N₁log n₁+N₂log N₂log n₂)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T₁ and T₂ are given in a compact representation, we show that the problem is NP-complete.
- An 𝒪(N₁m₂+N₂m₁)-time algorithm for EDSI.
- An Õ(N₁^{ω-1}n₂+N₂^{ω-1}n₁)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size.

We also show that the techniques we develop have applications outside of ED string comparison.
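
To make the definition of ℒ(T) and the EDSI question concrete, here is a minimal brute-force sketch in Python. It is not one of the algorithms from the paper: it simply enumerates both languages, which can take exponential time. The list-of-sets representation and the names `language` and `languages_intersect` are assumptions chosen for illustration.

```python
from itertools import product


def language(ed_string):
    """Return the language of an ED string given as a list of sets of strings.

    Every word is formed by picking one string from each set, in order,
    and concatenating the picks.
    """
    return {"".join(choice) for choice in product(*ed_string)}


def languages_intersect(t1, t2):
    """Brute-force EDSI check (illustration only, exponential in general)."""
    return not language(t1).isdisjoint(language(t2))


# Hypothetical toy inputs: "ACGT" can be derived from t1 in two ways.
t1 = [{"A", "AC"}, {"G", "CG"}, {"T"}]
t2 = [{"AC"}, {"GT", "T"}]

print(sorted(language(t1)))         # ['ACCGT', 'ACGT', 'AGT']
print(languages_intersect(t1, t2))  # True, both languages contain 'ACGT'
```

The example shows why the length n alone does not determine the words: t1 has three sets and t2 has two, yet both spell "ACGT".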

Cite as

Esteban Gabory, Moses Njagi Mwaniki, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski, Michelle Sweering, and Wiktor Zuba. Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 11:1-11:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

```bibtex
@InProceedings{gabory_et_al:LIPIcs.CPM.2023.11,
author =	{Gabory, Esteban and Mwaniki, Moses Njagi and Pisanti, Nadia and Pissis, Solon P. and Radoszewski, Jakub and Sweering, Michelle and Zuba, Wiktor},
title =	{{Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications}},
booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
pages =	{11:1--11:20},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-276-1},
ISSN =	{1868-8969},
year =	{2023},
volume =	{259},
editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.11},
URN =		{urn:nbn:de:0030-drops-179650},
doi =		{10.4230/LIPIcs.CPM.2023.11},
annote =	{Keywords: elastic-degenerate string, sequence comparison, languages intersection, pangenome, acronym identification}
}
```
Document
Pattern Masking for Dictionary Matching

Authors: Panagiotis Charalampopoulos, Huiping Chen, Peter Christen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Jakub Radoszewski

Published in: LIPIcs, Volume 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021)

Abstract
Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is also becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary 𝒟 of d strings, each of length 𝓁, a query string q of length 𝓁, and a positive integer z, and we are asked to compute a smallest set K ⊆ {1,…,𝓁}, so that if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from 𝒟. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We thus approach the problem from a more practical perspective. We show a combinatorial 𝒪((d𝓁)^{|K|/3}+d𝓁)-time and 𝒪(d𝓁)-space algorithm for PMDM for |K| = 𝒪(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails [Abboud et al., SIAM J. Comput. 2018; Lincoln et al., SODA 2018]. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in 𝒟. Note that PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from 𝒟. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time 𝒪(2^𝓁 d). 
We present a data structure for PMDM that answers queries over 𝒟 in time 𝒪(2^{𝓁/2}(2^{𝓁/2}+τ)𝓁) and requires space 𝒪(2^𝓁 d²/τ²+2^{𝓁/2}d), for any parameter τ ∈ [1,d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time 𝒪(d^{1/4+ε})-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture.
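
The abstract mentions that a simple exact algorithm for PMDM exists. The sketch below is one possible exhaustive search in that spirit, assuming 0-based positions (the abstract uses 1,…,𝓁) and equal-length strings. The function names are hypothetical, and the code makes no attempt to match the stated 𝒪(2^𝓁 d) bound or any of the paper's algorithms.

```python
from itertools import combinations


def matches(q, s, masked):
    """True if q matches s once the positions in `masked` act as wildcards."""
    return all(i in masked or a == b for i, (a, b) in enumerate(zip(q, s)))


def pmdm_bruteforce(dictionary, q, z):
    """Return a smallest set K of 0-based positions to mask in q so that
    q matches at least z dictionary strings, or None if that is impossible.

    Subsets are tried in order of increasing size, so the first hit is minimal.
    """
    ell = len(q)
    for size in range(ell + 1):
        for positions in combinations(range(ell), size):
            masked = set(positions)
            if sum(matches(q, s, masked) for s in dictionary) >= z:
                return masked
    return None


# Hypothetical toy dictionary with d = 3 strings of length 3.
dictionary = ["abc", "abd", "xbd"]
print(pmdm_bruteforce(dictionary, "abc", 2))  # {2}
print(pmdm_bruteforce(dictionary, "abc", 3))  # {0, 2}
```

With z = 1 and q in the dictionary, the empty set is returned, which agrees with the abstract's remark that z = 1 yields the minimal number of mismatches.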

Cite as

Panagiotis Charalampopoulos, Huiping Chen, Peter Christen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Jakub Radoszewski. Pattern Masking for Dictionary Matching. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 65:1-65:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

```bibtex
@InProceedings{charalampopoulos_et_al:LIPIcs.ISAAC.2021.65,
author =	{Charalampopoulos, Panagiotis and Chen, Huiping and Christen, Peter and Loukides, Grigorios and Pisanti, Nadia and Pissis, Solon P. and Radoszewski, Jakub},
title =	{{Pattern Masking for Dictionary Matching}},
booktitle =	{32nd International Symposium on Algorithms and Computation (ISAAC 2021)},
pages =	{65:1--65:19},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-214-3},
ISSN =	{1868-8969},
year =	{2021},
volume =	{212},
editor =	{Ahn, Hee-Kap and Sadakane, Kunihiko},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2021.65},
URN =		{urn:nbn:de:0030-drops-154982},
doi =		{10.4230/LIPIcs.ISAAC.2021.65},
annote =	{Keywords: string algorithms, dictionary matching, wildcards, record linkage, query term dropping}
}
```
Document
Invited Talk
On-Line Pattern Matching on D-Texts (Invited Talk)

Authors: Nadia Pisanti

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract
The Elastic Degenerate String Matching (EDSM) problem is defined as that of finding an occurrence of a pattern P of length m in an ED-text T. A D-text (Degenerate text) is a string that actually represents a set of similar and aligned strings (e.g. a pan-genome [The Computational Pan-Genomics Consortium, 2018]) by collapsing common fragments into a standard string, and representing variants with sets of alternative substrings. When such substrings are not bound to have the same size, we talk about elastic D-strings (ED-strings). In [R.Grossi et al., 2017] we gave an O(nm²+N)-time on-line algorithm for EDSM, where n is the length of T and N is its size, defined as the total number of letters. A fundamental tool of our algorithm is the O(m²+N)-time solution of what was later called the Active Prefixes (AP) problem. In [K.Aoyama et al., 2018], an O(m^{1.5} √{log m}+N) solution for AP was shown, leading to an O(nm^{1.5} √{log m}+N)-time solution for EDSM. The natural open problem was thus whether the 1.5 exponent could be decreased further. In [G.Bernardini et al., 2019], we prove several properties that answer this and other questions: we give a conditional O(nm^{1.5}+N) lower bound for EDSM, proving that a combinatorial algorithm solving EDSM in O(nm^{1.5-ε} +N) time would break the Boolean Matrix Multiplication (BMM) conjecture; we use this result as a hint to devise a non-combinatorial algorithm that solves EDSM in O(nm^{1.381}+N) time; we do so by successfully combining Fast Fourier Transform and properties of string periodicity. In my talk I will overview the results above, as well as some interesting side results: the extension to a dictionary rather than a single pattern [S.P.Pissis and A.Retha, 2018], the introduction of errors [G.Bernardini et al., 2020], and a notion of matching among D-strings with its linear time solution [M.Alzamel et al., 2020].

Cite as

Nadia Pisanti. On-Line Pattern Matching on D-Texts (Invited Talk). In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 3:1-3:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

```bibtex
@InProceedings{pisanti:LIPIcs.CPM.2021.3,
author =	{Pisanti, Nadia},
title =	{{On-Line Pattern Matching on D-Texts}},
booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
pages =	{3:1--3:2},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-186-3},
ISSN =	{1868-8969},
year =	{2021},
volume =	{191},
editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.3},
URN =		{urn:nbn:de:0030-drops-139548},
doi =		{10.4230/LIPIcs.CPM.2021.3},
annote =	{Keywords: pattern matching, elastic-degenerate string, matrix multiplication}
}
```
Document
Complete Volume
LIPIcs, Volume 172, WABI 2020, Complete Volume

Authors: Carl Kingsford and Nadia Pisanti

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract
LIPIcs, Volume 172, WABI 2020, Complete Volume

Cite as

20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 1-360, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

```bibtex
@Proceedings{kingsford_et_al:LIPIcs.WABI.2020,
title =	{{LIPIcs, Volume 172, WABI 2020, Complete Volume}},
booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
pages =	{1--360},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-161-0},
ISSN =	{1868-8969},
year =	{2020},
volume =	{172},
editor =	{Kingsford, Carl and Pisanti, Nadia},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020},
URN =		{urn:nbn:de:0030-drops-127881},
doi =		{10.4230/LIPIcs.WABI.2020},
annote =	{Keywords: LIPIcs, Volume 172, WABI 2020, Complete Volume}
}
```
Document
Front Matter

Authors: Carl Kingsford and Nadia Pisanti

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Cite as

20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 0:i-0:x, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

```bibtex
@InProceedings{kingsford_et_al:LIPIcs.WABI.2020.0,
author =	{Kingsford, Carl and Pisanti, Nadia},
booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
pages =	{0:i--0:x},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-161-0},
ISSN =	{1868-8969},
year =	{2020},
volume =	{172},
editor =	{Kingsford, Carl and Pisanti, Nadia},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.0},
URN =		{urn:nbn:de:0030-drops-127891},
doi =		{10.4230/LIPIcs.WABI.2020.0},
}
```
Document
String Sanitization Under Edit Distance

Authors: Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract
Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
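
The abstract builds on the classic dynamic programming algorithm for edit distance. For reference, a minimal sketch of that classic algorithm is given below. It computes plain edit distance only and is not the modified ETFS algorithm from the paper. The function name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Classic DP for edit distance with unit-cost insert, delete, substitute.

    Keeps only two rows of the table, so it uses O(len(b)) space
    and O(len(a) * len(b)) time.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string needs i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if letters agree
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The quadratic table is also the object of the conditional lower bound cited in the abstract for the edit distance problem.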

Cite as

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

```bibtex
@InProceedings{bernardini_et_al:LIPIcs.CPM.2020.7,
author =	{Bernardini, Giulia and Chen, Huiping and Loukides, Grigorios and Pisanti, Nadia and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle},
title =	{{String Sanitization Under Edit Distance}},
booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
pages =	{7:1--7:14},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-149-8},
ISSN =	{1868-8969},
year =	{2020},
volume =	{161},
editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.7},
URN =		{urn:nbn:de:0030-drops-121324},
doi =		{10.4230/LIPIcs.CPM.2020.7},
annote =	{Keywords: String algorithms, data sanitization, edit distance, dynamic programming, conditional lower bound}
}
```
Document
Track A: Algorithms, Complexity and Games
Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication

Authors: Giulia Bernardini, Paweł Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone

Published in: LIPIcs, Volume 132, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)

Abstract
An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N, which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an O(nm^{1.5}√{log m} + N)-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that N is substantially larger than both n and m, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction we show that a combinatorial algorithm solving the EDSM problem in O(nm^{1.5-ε} + N) time, for any ε > 0, refutes this conjecture. Of course, the notion of combinatorial algorithms is not clearly defined, so our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. Two standard tools used in algorithms on strings are string periodicity and fast Fourier transform. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a non-combinatorial O(nm^{1.381} + N)-time algorithm for EDSM. To the best of our knowledge, we are the first to do so.

Cite as

Giulia Bernardini, Paweł Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 132, pp. 21:1-21:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

```bibtex
@InProceedings{bernardini_et_al:LIPIcs.ICALP.2019.21,
author =	{Bernardini, Giulia and Gawrychowski, Pawe{\l} and Pisanti, Nadia and Pissis, Solon P. and Rosone, Giovanna},
title =	{{Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication}},
booktitle =	{46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)},
pages =	{21:1--21:15},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-109-2},
ISSN =	{1868-8969},
year =	{2019},
volume =	{132},
editor =	{Baier, Christel and Chatzigiannakis, Ioannis and Flocchini, Paola and Leonardi, Stefano},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2019.21},
URN =		{urn:nbn:de:0030-drops-105973},
doi =		{10.4230/LIPIcs.ICALP.2019.21},
annote =	{Keywords: string algorithms, pattern matching, elastic-degenerate string, matrix multiplication, fast Fourier transform}
}
```
Document
Complete Volume
LIPIcs, Volume 128, CPM'19, Complete Volume

Authors: Nadia Pisanti and Solon P. Pissis

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract
LIPIcs, Volume 128, CPM'19, Complete Volume

Cite as

30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

```bibtex
@Proceedings{pisanti_et_al:LIPIcs.CPM.2019,
title =	{{LIPIcs, Volume 128, CPM'19, Complete Volume}},
booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-103-0},
ISSN =	{1868-8969},
year =	{2019},
volume =	{128},
editor =	{Pisanti, Nadia and Pissis, Solon P.},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019},
URN =		{urn:nbn:de:0030-drops-105577},
doi =		{10.4230/LIPIcs.CPM.2019},
annote =	{Keywords: Mathematics of computing, Discrete mathematics, Applied computing, Computational biology, Information theory, Information systems}
}
```
Document
Front Matter

Authors: Nadia Pisanti and Solon P. Pissis

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Cite as

30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 0:i-0:xviii, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

```bibtex
@InProceedings{pisanti_et_al:LIPIcs.CPM.2019.0,
author =	{Pisanti, Nadia and Pissis, Solon P.},
booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
pages =	{0:i--0:xviii},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-103-0},
ISSN =	{1868-8969},
year =	{2019},
volume =	{128},
editor =	{Pisanti, Nadia and Pissis, Solon P.},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.0},
URN =		{urn:nbn:de:0030-drops-104719},
doi =		{10.4230/LIPIcs.CPM.2019.0},
}
```
Document
Detecting Mutations by eBWT

Authors: Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract
In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the reads collection, and we develop a tool finding SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.

Cite as

Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone. Detecting Mutations by eBWT. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 3:1-3:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

```bibtex
@InProceedings{prezza_et_al:LIPIcs.WABI.2018.3,
author =	{Prezza, Nicola and Pisanti, Nadia and Sciortino, Marinella and Rosone, Giovanna},
title =	{{Detecting Mutations by eBWT}},
booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
pages =	{3:1--3:15},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-082-8},
ISSN =	{1868-8969},
year =	{2018},
volume =	{113},
editor =	{Parida, Laxmi and Ukkonen, Esko},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.3},
URN =		{urn:nbn:de:0030-drops-93051},
doi =		{10.4230/LIPIcs.WABI.2018.3},
annote =	{Keywords: BWT, LCP Array, SNPs, Reference-free, Assembly-free}
}
```
Document
Degenerate String Comparison and Applications

Authors: Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract
A generalised degenerate string (GD string) Ŝ is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote the sum of these lengths k_0, k_1,...,k_{n-1} by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an O(min{W,n^2}N)-time algorithm for computing all palindromes in Ŝ. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in Ŝ. Finally, proof-of-concept experimental results are presented using real protein datasets.

Cite as

Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Degenerate String Comparison and Applications. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 21:1-21:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

```bibtex
@InProceedings{alzamel_et_al:LIPIcs.WABI.2018.21,
author =	{Alzamel, Mai and Ayad, Lorraine A. K. and Bernardini, Giulia and Grossi, Roberto and Iliopoulos, Costas S. and Pisanti, Nadia and Pissis, Solon P. and Rosone, Giovanna},
title =	{{Degenerate String Comparison and Applications}},
booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
pages =	{21:1--21:14},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-082-8},
ISSN =	{1868-8969},
year =	{2018},
volume =	{113},
editor =	{Parida, Laxmi and Ukkonen, Esko},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.21},
URN =		{urn:nbn:de:0030-drops-93236},
doi =		{10.4230/LIPIcs.WABI.2018.21},
annote =	{Keywords: degenerate strings, generalised degenerate strings, elastic-degenerate strings, string comparison, palindromes}
}
```
Document
On-Line Pattern Matching on Similar Texts

Authors: Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract
Pattern matching on a set of similar texts has received much attention, especially recently, mainly due to its application in cataloguing human genetic variation. In particular, many different algorithms have been proposed for the off-line version of this problem; that is, constructing a compressed index for a set of similar texts in order to answer pattern matching queries efficiently. However, the on-line, more fundamental, version of this problem is a rather undeveloped topic. Solutions to the on-line version can be beneficial for a number of reasons; for instance, efficient on-line solutions can be used in combination with partial indexes as practical trade-offs. We make here an attempt to close this gap via proposing two efficient algorithms for this problem. Notably, one of the algorithms requires time linear in the size of the texts' representation, for short patterns. Furthermore, experimental results confirm our theoretical findings in practical terms.

Cite as

Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-Line Pattern Matching on Similar Texts. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 9:1-9:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

```bibtex
@InProceedings{grossi_et_al:LIPIcs.CPM.2017.9,
author =	{Grossi, Roberto and Iliopoulos, Costas S. and Liu, Chang and Pisanti, Nadia and Pissis, Solon P. and Retha, Ahmad and Rosone, Giovanna and Vayani, Fatima and Versari, Luca},
title =	{{On-Line Pattern Matching on Similar Texts}},
booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
pages =	{9:1--9:14},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-95977-039-2},
ISSN =	{1868-8969},
year =	{2017},
volume =	{78},
editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.9},
URN =		{urn:nbn:de:0030-drops-73379},
doi =		{10.4230/LIPIcs.CPM.2017.9},
annote =	{Keywords: string algorithms, pattern matching, degenerate strings, elastic-degenerate strings, on-line algorithms}
}
```
Document
Output-Sensitive Pattern Extraction in Sequences

Authors: Roberto Grossi, Giulia Menconi, Nadia Pisanti, Roberto Trani, and Soren Vind

Published in: LIPIcs, Volume 29, 34th International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2014)

Abstract
Genomic Analysis, Plagiarism Detection, Data Mining, Intrusion Detection, Spam Fighting and Time Series Analysis are just some examples of applications where extraction of recurring patterns in sequences of objects is one of the main computational challenges. Several notions of patterns exist, and many share the common idea of strictly specifying some parts of the pattern and leaving the remaining parts as don't cares. Since the number of patterns can be exponential in the length of the sequences, pattern extraction focuses on statistically relevant patterns, where any attempt to further refine or extend them causes a loss of significant information (where the number of occurrences changes). Output-sensitive algorithms have been proposed to enumerate and list these patterns, taking polynomial time O(n^c) per pattern for constant c > 1, which is impractical for massive sequences of very large length n. We address the problem of extracting maximal patterns with at most k don't care symbols and at least q occurrences. Our contribution is to give the first algorithm that attains a stronger notion of output-sensitivity, borrowed from the analysis of data structures: the cost is proportional to the actual number of occurrences of each pattern, which is at most n and practically much smaller than n in real applications, thus avoiding the aforementioned cost of O(n^c) per pattern.

Cite as

Roberto Grossi, Giulia Menconi, Nadia Pisanti, Roberto Trani, and Soren Vind. Output-Sensitive Pattern Extraction in Sequences. In 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014). Leibniz International Proceedings in Informatics (LIPIcs), Volume 29, pp. 303-314, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2014)

```bibtex
@InProceedings{grossi_et_al:LIPIcs.FSTTCS.2014.303,
author =	{Grossi, Roberto and Menconi, Giulia and Pisanti, Nadia and Trani, Roberto and Vind, Soren},
title =	{{Output-Sensitive Pattern Extraction in Sequences}},
booktitle =	{34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014)},
pages =	{303--314},
series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN =	{978-3-939897-77-4},
ISSN =	{1868-8969},
year =	{2014},
volume =	{29},
editor =	{Raman, Venkatesh and Suresh, S. P.},
publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.FSTTCS.2014.303},
URN =		{urn:nbn:de:0030-drops-48513},
doi =		{10.4230/LIPIcs.FSTTCS.2014.303},
annote =	{Keywords: Pattern Extraction, Motif Detection, Pattern Discovery, Motif Trie}
}
```