DROPS

Document

Invited Talk

DOI: 10.4230/LIPIcs.CPM.2022.3

Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)

Authors: Sharma V. Thankachan

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

In the past two decades, we have witnessed the design of various compact data structures for pattern matching over an indexed text [Navarro, 2016]. Popular indexes like the FM-index [Paolo Ferragina and Giovanni Manzini, 2005], compressed suffix arrays/trees [Roberto Grossi and Jeffrey Scott Vitter, 2005; Kunihiko Sadakane, 2007], the recent r-index [Travis Gagie et al., 2020; Takaaki Nishimoto and Yasuo Tabei, 2021], etc., capture the key functionalities of classic suffix arrays/trees [Udi Manber and Eugene W. Myers, 1993; Peter Weiner, 1973] in compact space. Mostly, they rely on the Burrows-Wheeler Transform (BWT) and its associated operations [Burrows and Wheeler, 1994]. However, compactly encoding some advanced suffix tree (ST) variants, like parameterized ST [Brenda S. Baker, 1993; S. Rao Kosaraju, 1995; Juan Mendivelso et al., 2020], order-isomorphic/preserving ST [Maxime Crochemore et al., 2016], two-dimensional ST [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998], etc. [Sung Gwan Park et al., 2019; Tetsuo Shibuya, 2000]- collectively known as suffix trees with missing suffix links [Richard Cole and Ramesh Hariharan, 2003], has been challenging. The previous techniques are not easily extendable because these variants do not hold some structural properties of the standard ST that enable compression. However, some limited progress has been made in these directions recently [Arnab Ganguly et al., 2017; Travis Gagie et al., 2017; Gianni Decaroli et al., 2017; Dhrumil Patel and Rahul Shah, 2021; Arnab Ganguly et al., 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Arnab Ganguly et al., 2017; Arnab Ganguly et al., 2022; Arnab Ganguly et al., 2021]. This talk will briefly survey them and highlight some interesting open problems.

Cite as

Sharma V. Thankachan. Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk). In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 3:1-3:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{thankachan:LIPIcs.CPM.2022.3,
  author =	{Thankachan, Sharma V.},
  title =	{{Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc.}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{3:1--3:3},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.3},
  URN =		{urn:nbn:de:0030-drops-161300},
  doi =		{10.4230/LIPIcs.CPM.2022.3},
  annote =	{Keywords: Text Indexing, Suffix Trees, String Matching}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2021.60

Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space

Authors: Dhrumil Patel and Rahul Shah

Published in: LIPIcs, Volume 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021)

Abstract

In a 2-dimensional (2D) pattern matching problem, the text is arranged as a matrix 𝖬[1..n, 1..n] and consists of N = n × n symbols drawn from alphabet set Σ of size σ. The query consists of a m × m square matrix 𝖯[1..m, 1..m] drawn from the same alphabet set Σ and the task is to find all the locations in 𝖬 where 𝖯 appears as a (contiguous) submatrix. The patterns can be of any size, but as long as they are square in shape data structures like suffix trees and suffix array exist [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998] for the task of efficient pattern matching. These are essentially 2D counterparts of classic suffix trees and arrays known for traditional 1-dimensional (1D) pattern matching. They work based on linearization of 2D suffixes which would preserve the prefix match property (i.e., every pattern match is a prefix of some suffix). The main limitation of the suffix trees and the suffix arrays (in 1D) was their space utilization of O(N log N) bits, where N is the size of the text. This was suboptimal compared to Nlog σ bits of space, which is information theoretic optimal for the text. With the advent of the field of succinct/compressed data structures, it was possible to develop compressed variants of suffix trees and array based on Burrows-Wheeler Tansform and LF-mapping (or Φ function) [Roberto Grossi and Jeffrey Scott Vitter, 2005; Paolo Ferragina and Giovanni Manzini, 2005; Kunihiko Sadakane, 2007]. These data structures indeed achieve O(N log σ) bits of space or better. This gives rise to the question: analogous to 1D case, can we design a succinct or compressed index for 2D pattern matching? Can there be a 2D compressed suffix tree? Are there analogues of Burrows-Wheeler Transform or LF-mapping? The problem has been acknowledged for over a decade now and there have been a few attempts at applying Φ function [Ankur Gupta, 2004] and achieving entropy based compression [Veli Mäkinen and Gonzalo Navarro, 2008]. However, achieving the complexity breakthrough akin to 1D case has yet to be found. In this paper, we still do not know how to answer suffix array queries in O(N log σ) bits of space - which would have led to efficient pattern matching. However, for the first time, we show an interesting result that it is indeed possible to compute inverse suffix array (ISA) queries in near compact space in O(polylog n) time. Our 2D succinct text index design is based on two 1D compressed suffix trees and it takes O(N log log N + N logσ) bits of space which is much smaller than its naive design that takes O(N log N) bits. Although the main problem is still evasive, this index gives a hope on the existence of a full 2D succinct index with all functionalities similar to that of 1D case.

Cite as

Dhrumil Patel and Rahul Shah. Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 60:1-60:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{patel_et_al:LIPIcs.ISAAC.2021.60,
  author =	{Patel, Dhrumil and Shah, Rahul},
  title =	{{Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space}},
  booktitle =	{32nd International Symposium on Algorithms and Computation (ISAAC 2021)},
  pages =	{60:1--60:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-214-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{212},
  editor =	{Ahn, Hee-Kap and Sadakane, Kunihiko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2021.60},
  URN =		{urn:nbn:de:0030-drops-154932},
  doi =		{10.4230/LIPIcs.ISAAC.2021.60},
  annote =	{Keywords: Pattern Matching, Succinct Data Structures}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2021.64

Repetition- and Linearity-Aware Rank/Select Dictionaries

Authors: Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra

Published in: LIPIcs, Volume 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021)

Abstract

We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting two kinds of regularities arising in real data: repetitiveness and approximate linearity. Our first contribution is a Lempel-Ziv parsing properly enriched to also capture approximate linearity in the data and still be compressed to the kth order entropy. Our second contribution is a variant of the block tree structure whose space complexity takes advantage of both repetitiveness and approximate linearity, and results highly competitive in time too. Our third and final contribution is an implementation and experimentation of this last data structure, which achieves new space-time trade-offs compared to known data structures that exploit only one of the two regularities.

Cite as

Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. Repetition- and Linearity-Aware Rank/Select Dictionaries. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 64:1-64:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{ferragina_et_al:LIPIcs.ISAAC.2021.64,
  author =	{Ferragina, Paolo and Manzini, Giovanni and Vinciguerra, Giorgio},
  title =	{{Repetition- and Linearity-Aware Rank/Select Dictionaries}},
  booktitle =	{32nd International Symposium on Algorithms and Computation (ISAAC 2021)},
  pages =	{64:1--64:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-214-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{212},
  editor =	{Ahn, Hee-Kap and Sadakane, Kunihiko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2021.64},
  URN =		{urn:nbn:de:0030-drops-154974},
  doi =		{10.4230/LIPIcs.ISAAC.2021.64},
  annote =	{Keywords: Data compression, Compressed data structures, Entropy}
}

Document

DOI: 10.4230/LIPIcs.WABI.2021.13

Compressing and Indexing Aligned Readsets

Authors: Travis Gagie, Garance Gourdel, and Giovanni Manzini

Published in: LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Abstract

Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separators characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.

Cite as

Travis Gagie, Garance Gourdel, and Giovanni Manzini. Compressing and Indexing Aligned Readsets. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 13:1-13:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.WABI.2021.13,
  author =	{Gagie, Travis and Gourdel, Garance and Manzini, Giovanni},
  title =	{{Compressing and Indexing Aligned Readsets}},
  booktitle =	{21st International Workshop on Algorithms in Bioinformatics (WABI 2021)},
  pages =	{13:1--13:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-200-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{201},
  editor =	{Carbone, Alessandra and El-Kebir, Mohammed},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.13},
  URN =		{urn:nbn:de:0030-drops-143660},
  doi =		{10.4230/LIPIcs.WABI.2021.13},
  annote =	{Keywords: data compression, compact data structures, FM-index, Burrows-Wheeler Transform, EBWT, XBWT, DNA reads}
}

Document

DOI: 10.4230/DagRep.9.6.55

25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241)

Authors: Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Jens Stoye

Published in: Dagstuhl Reports, Volume 9, Issue 6 (2020)

Abstract

Dagstuhl Seminar 19241 ("25 Years of the Burrows-Wheeler Transform") took place from June 10th to 14th, 2019, and was attended by 45 people from 13 countries and the three fields of Algorithms and Data Structures, Bioinformatics, and Combinatorics on Words. There were four talks and a panel session for each field. Feedback was generally positive and we are confident the seminar fostered interdisciplinary connections and will eventually result in noteworthy joint publications.

Cite as

Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Jens Stoye. 25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241). In Dagstuhl Reports, Volume 9, Issue 6, pp. 55-68, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@Article{gagie_et_al:DagRep.9.6.55,
  author =	{Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Stoye, Jens},
  title =	{{25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241)}},
  pages =	{55--68},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2019},
  volume =	{9},
  number =	{6},
  editor =	{Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Stoye, Jens},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagRep.9.6.55},
  URN =		{urn:nbn:de:0030-drops-114874},
  doi =		{10.4230/DagRep.9.6.55},
  annote =	{Keywords: Bioinformatics, Burrows-Wheeler Transform, Combinatorics on Words, Data Compression, Data Structures, Indexing, Sequence Alignment}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2019.68

Additive Cellular Automata Over Finite Abelian Groups: Topological and Measure Theoretic Properties

Authors: Alberto Dennunzio, Enrico Formenti, Darij Grinberg, and Luciano Margara

Published in: LIPIcs, Volume 138, 44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019)

Abstract

We study the dynamical behavior of D-dimensional (D >= 1) additive cellular automata where the alphabet is any finite abelian group. This class of discrete time dynamical systems is a generalization of the systems extensively studied by many authors among which one may list [Masanobu Ito et al., 1983; Giovanni Manzini and Luciano Margara, 1999; Giovanni Manzini and Luciano Margara, 1999; Jarkko Kari, 2000; Gianpiero Cattaneo et al., 2000; Gianpiero Cattaneo et al., 2004]. Our main contribution is the proof that topologically transitive additive cellular automata are ergodic. This result represents a solid bridge between the world of measure theory and that of topology theory and greatly extends previous results obtained in [Gianpiero Cattaneo et al., 2000; Giovanni Manzini and Luciano Margara, 1999] for linear CA over Z_m i.e. additive CA in which the alphabet is the cyclic group Z_m and the local rules are linear combinations with coefficients in Z_m. In our scenario, the alphabet is any finite abelian group and the global rule is any additive map. This class of CA strictly contains the class of linear CA over Z_m^n, i.e. , with the local rule defined by n x n matrices with elements in Z_m which, in turn, strictly contains the class of linear CA over Z_m. In order to further emphasize that finite abelian groups are more expressive than Z_m we prove that, contrary to what happens in Z_m, there exist additive CA over suitable finite abelian groups which are roots (with arbitrarily large indices) of the shift map. As a consequence of our results, we have that, for additive CA, ergodic mixing, weak ergodic mixing, ergodicity, topological mixing, weak topological mixing, topological total transitivity and topological transitivity are all equivalent properties. As a corollary, we have that invertible transitive additive CA are isomorphic to Bernoulli shifts. Finally, we provide a first characterization of strong transitivity for additive CA which we suspect it might be true also for the general case.

Cite as

Alberto Dennunzio, Enrico Formenti, Darij Grinberg, and Luciano Margara. Additive Cellular Automata Over Finite Abelian Groups: Topological and Measure Theoretic Properties. In 44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 138, pp. 68:1-68:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{dennunzio_et_al:LIPIcs.MFCS.2019.68,
  author =	{Dennunzio, Alberto and Formenti, Enrico and Grinberg, Darij and Margara, Luciano},
  title =	{{Additive Cellular Automata Over Finite Abelian Groups: Topological and Measure Theoretic Properties}},
  booktitle =	{44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019)},
  pages =	{68:1--68:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-117-7},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{138},
  editor =	{Rossmanith, Peter and Heggernes, Pinar and Katoen, Joost-Pieter},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2019.68},
  URN =		{urn:nbn:de:0030-drops-110126},
  doi =		{10.4230/LIPIcs.MFCS.2019.68},
  annote =	{Keywords: Cellular Automata, Symbolic Dynamics, Complex Systems}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.12

A New Class of Searchable and Provably Highly Compressible String Transformations

Authors: Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, and Marinella Sciortino

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domains of strings. However, efforts to find non-trivial alternatives of the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new lymph to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class includes also the Alternating BWT, another invertible string transforms recently introduced in connection with a generalization of Lyndon words.

Cite as

Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, and Marinella Sciortino. A New Class of Searchable and Provably Highly Compressible String Transformations. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{giancarlo_et_al:LIPIcs.CPM.2019.12,
  author =	{Giancarlo, Raffaele and Manzini, Giovanni and Rosone, Giovanna and Sciortino, Marinella},
  title =	{{A New Class of Searchable and Provably Highly Compressible String Transformations}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{12:1--12:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and P. Pissis, Solon},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.12},
  URN =		{urn:nbn:de:0030-drops-104833},
  doi =		{10.4230/LIPIcs.CPM.2019.12},
  annote =	{Keywords: Data Indexing and Compression, Burrows-Wheeler Transformation, Combinatorics on Words}
}

Document

DOI: 10.4230/LIPIcs.WABI.2018.2

Prefix-Free Parsing for Building Big BWTs

Authors: Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive - a characteristic that can be exploited and enable the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.

Cite as

Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-Free Parsing for Building Big BWTs. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 2:1-2:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{boucher_et_al:LIPIcs.WABI.2018.2,
  author =	{Boucher, Christina and Gagie, Travis and Kuhnle, Alan and Manzini, Giovanni},
  title =	{{Prefix-Free Parsing for Building Big BWTs}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{2:1--2:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.2},
  URN =		{urn:nbn:de:0030-drops-93044},
  doi =		{10.4230/LIPIcs.WABI.2018.2},
  annote =	{Keywords: Burrows-Wheeler Transform, prefix-free parsing, compression-aware algorithms, genomic databases}
}

Document

DOI: 10.4230/LIPIcs.WABI.2018.10

External memory BWT and LCP computation for sequence collections with applications

Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract

We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan based, external memory algorithms for three well known problems in bioinformatics: the computation of the all pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs.

Cite as

Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 10:1-10:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{egidi_et_al:LIPIcs.WABI.2018.10,
  author =	{Egidi, Lavinia and Louza, Felipe A. and Manzini, Giovanni and Telles, Guilherme P.},
  title =	{{External memory BWT and LCP computation for sequence collections with applications}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{10:1--10:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.10},
  URN =		{urn:nbn:de:0030-drops-93122},
  doi =		{10.4230/LIPIcs.WABI.2018.10},
  annote =	{Keywords: Burrows-Wheeler Transform, Longest Common Prefix Array, All pairs suffix-prefix overlaps, Succinct de Bruijn graph, Maximal repeats}
}

Document

DOI: 10.4230/LIPIcs.ESA.2017.38

An Encoding for Order-Preserving Matching

Authors: Travis Gagie, Giovanni Manzini, and Rossano Venturini

Published in: LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

Abstract

Encoding data structures store enough information to answer the queries they are meant to support but not enough to recover their underlying datasets. In this paper we give the first encoding data structure for the challenging problem of order-preserving pattern matching. This problem was introduced only a few years ago but has already attracted significant attention because of its applications in data analysis. Two strings are said to be an order-preserving match if the relative order of their characters is the same: e.g., (4, 1, 3, 2) and (10, 3, 7, 5) are an order-preserving match. We show how, given a string S[1..n] over an arbitrary alphabet of size sigma and a constant c >=1, we can build an O(n log log n)-bit encoding such that later, given a pattern P[1..m] with m >= log^c n, we can return the number of order-preserving occurrences of P in S in O(m) time. Within the same time bound we can also return the starting position of some order-preserving match for P in S (if such a match exists). We prove that our space bound is within a constant factor of optimal if log(sigma) = Omega(log log n); our query time is optimal if log(sigma) = Omega(log n). Our space bound contrasts with the Omega(n log n) bits needed in the worst case to store S itself, an index for order-preserving pattern matching with no restrictions on the pattern length, or an index for standard pattern matching even with restrictions on the pattern length. Moreover, we can build our encoding knowing only how each character compares to O(log^c n) neighbouring characters.

Cite as

Travis Gagie, Giovanni Manzini, and Rossano Venturini. An Encoding for Order-Preserving Matching. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 38:1-38:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.ESA.2017.38,
  author =	{Gagie, Travis and Manzini, Giovanni and Venturini, Rossano},
  title =	{{An Encoding for Order-Preserving Matching}},
  booktitle =	{25th Annual European Symposium on Algorithms (ESA 2017)},
  pages =	{38:1--38:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-049-1},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{87},
  editor =	{Pruhs, Kirk and Sohler, Christian},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.38},
  URN =		{urn:nbn:de:0030-drops-78726},
  doi =		{10.4230/LIPIcs.ESA.2017.38},
  annote =	{Keywords: Compact data structures, encodings, order-preserving matching}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.1

Wheeler Graphs: Variations on a Theme by Burrows and Wheeler

Authors: Giovanni Manzini

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

The famous Burrows-Wheeler Transform was originally defined for single strings but variations have been developed for sets of strings, labelled trees, de Bruijn graphs, alignments, etc. In this talk we propose a unifying view that includes many of these variations and that we hope will simplify the search for more. Somewhat surprisingly we get our unifying view by considering the Nondeterministic Finite Automata related to different pattern-matching problems. We show that the state graphs associated with these automata have common properties that we summarize with the concept of a Wheeler graph. Using the notion of a Wheeler graph, we show that it is possible to process strings efficiently even if the automaton is nondeterministic. In addition, we show that Wheeler graphs can be compactly represented and traversed using up to three arrays with additional data structures supporting efficient rank and select operations. It turns out that these arrays coincide with, or are substantially equivalent to, the output of many Burrows-Wheeler Transform variants described in the literature. This is joint work with Travis Gagie and Jouni Sirén.

Cite as

Giovanni Manzini. Wheeler Graphs: Variations on a Theme by Burrows and Wheeler. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{manzini:LIPIcs.CPM.2017.1,
  author =	{Manzini, Giovanni},
  title =	{{Wheeler Graphs: Variations on a Theme by Burrows and Wheeler}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{1:1--1:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.1},
  URN =		{urn:nbn:de:0030-drops-73343},
  doi =		{10.4230/LIPIcs.CPM.2017.1},
  annote =	{Keywords: compressed data structures, pattern matching}
}

11 Search Results for "Manzini, Giovanni"

Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)

Abstract

Cite as

Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space

Abstract

Cite as

Repetition- and Linearity-Aware Rank/Select Dictionaries

Abstract

Cite as

Compressing and Indexing Aligned Readsets

Abstract

Cite as

25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241)

Abstract

Cite as

Additive Cellular Automata Over Finite Abelian Groups: Topological and Measure Theoretic Properties

Abstract

Cite as

A New Class of Searchable and Provably Highly Compressible String Transformations

Abstract

Cite as

Prefix-Free Parsing for Building Big BWTs

Abstract

Cite as

External memory BWT and LCP computation for sequence collections with applications

Abstract

Cite as

An Encoding for Order-Preserving Matching

Abstract

Cite as

Wheeler Graphs: Variations on a Theme by Burrows and Wheeler

Abstract

Cite as

Thanks for your feedback!

Could not send message