DROPS

Document

DOI: 10.4230/LIPIcs.SEA.2024.10

Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
- build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
- for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes;
- find the minimum and maximum values stored in that interval;
- take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
- a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
- a minimizer digest;
- a KATKA kernel of a minimizer digest.

With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
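
The four indexing steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a plain sorted suffix array stands in for the augmented FM-index, the match string is assumed to be given (MEM finding is omitted), and the `$` separator, the toy genomes, the leaf names and the `parent` map are all hypothetical.

```python
from bisect import bisect_right

def build_index(genomes):
    """Concatenate genomes left to right; return text, suffix array, genome starts."""
    text = "".join(g + "$" for g in genomes)  # '$' separator is an assumption
    starts, pos = [], 0
    for g in genomes:
        starts.append(pos)
        pos += len(g) + 1
    # Naive suffix array; a real index would use an FM-index instead.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, starts

def sa_interval(text, sa, pattern):
    """Return the half-open suffix-array interval [lo, hi) of suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo

def lca(parent, u, v):
    """Lowest common ancestor by walking parent pointers (the root has no entry)."""
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = parent.get(u)
    while v not in ancestors:
        v = parent[v]
    return v

def classify_match(index, leaves, parent, match):
    """Map one exact match to the LCA of the leftmost and rightmost genomes containing it."""
    text, sa, starts = index
    lo, hi = sa_interval(text, sa, match)
    if lo == hi:
        return None  # the match does not occur in any genome
    positions = sa[lo:hi]
    first = leaves[bisect_right(starts, min(positions)) - 1]
    last = leaves[bisect_right(starts, max(positions)) - 1]
    return lca(parent, first, last)

# Hypothetical toy tree: root -> (A -> g0, g1), g2
genomes = ["ACGTACGT", "ACGTTTGA", "GGGCCCAA"]
leaves = ["g0", "g1", "g2"]
parent = {"g0": "A", "g1": "A", "A": "root", "g2": "root"}
index = build_index(genomes)
print(classify_match(index, leaves, parent, "ACGT"))  # occurs in g0 and g1 only
```

Because the genomes are concatenated in the left-to-right order of the leaves, the smallest and largest starting positions in the interval identify the leftmost and rightmost genomes containing the match, so a single LCA query on those two leaves suffices.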

Cite as

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro. Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{draesslerova_et_al:LIPIcs.SEA.2024.10,
  author =	{Draesslerov\'{a}, Dominika and Ahmed, Omar and Gagie, Travis and Holub, Jan and Langmead, Ben and Manzini, Giovanni and Navarro, Gonzalo},
  title =	{{Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{10:1--10:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.10},
  URN =		{urn:nbn:de:0030-drops-203756},
  doi =		{10.4230/LIPIcs.SEA.2024.10},
  annote =	{Keywords: Taxonomic classification, metagenomics, KATKA, maximal exact matches, string kernels, minimizer digests}
}

Document

DOI: 10.4230/LIPIcs.CPM.2024.21

BAT-LZ out of hell

Authors: Zsuzsanna Lipták, Francesco Masillo, and Gonzalo Navarro

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

Abstract

Despite consistently yielding the best compression on repetitive text collections, the Lempel-Ziv parsing has resisted all attempts at offering relevant guarantees on the cost to access an arbitrary symbol. This makes it less attractive for use on compressed self-indexes and other compressed data structures. In this paper we introduce a variant we call BAT-LZ (for Bounded Access Time Lempel-Ziv) where the access cost is bounded by a parameter given at compression time. We design and implement a linear-space algorithm that, in time O(n log³ n), obtains a BAT-LZ parse of a text of length n by greedily maximizing each next phrase length. The algorithm builds on a new linear-space data structure that solves 5-sided orthogonal range queries in rank space, allowing updates to the coordinate where the one-sided queries are supported, in O(log³ n) time for both queries and updates. This time can be reduced to O(log² n) if O(n log n) space is used. We design a second algorithm that chooses the sources for the phrases in a clever way, using an enhanced suffix tree, albeit no longer guaranteeing longest possible phrases. This algorithm is much slower in theory, but in practice it is comparable to the greedy parser, while achieving significantly superior compression. We then combine the two algorithms, resulting in a parser that always chooses the longest possible phrases, and the best sources for those. Our experimentation shows that, on most repetitive texts, our algorithms reach an access cost close to log₂ n on texts of length n, while incurring almost no loss in the compression ratio when compared with classical LZ-compression. Several open challenges are discussed at the end of the paper.

Cite as

Zsuzsanna Lipták, Francesco Masillo, and Gonzalo Navarro. BAT-LZ out of hell. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 21:1-21:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{liptak_et_al:LIPIcs.CPM.2024.21,
  author =	{Lipt\'{a}k, Zsuzsanna and Masillo, Francesco and Navarro, Gonzalo},
  title =	{{BAT-LZ out of hell}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{21:1--21:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.21},
  URN =		{urn:nbn:de:0030-drops-201317},
  doi =		{10.4230/LIPIcs.CPM.2024.21},
  annote =	{Keywords: Lempel-Ziv parsing, data compression, compressed data structures, repetitive text collections}
}

Document

DOI: 10.4230/LIPIcs.CPM.2023.24

Computing MEMs on Repetitive Text Collections

Authors: Gonzalo Navarro

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern P[1..m] on a large repetitive text collection T[1..n], which is represented as a (hopefully much smaller) run-length context-free grammar of size g_{rl}. We show that the problem can be solved in time O(m² log^ε n), for any constant ε > 0, on a data structure of size O(g_{rl}). Further, on a locally consistent grammar of size O(δ log(n/δ)), the time decreases to O(m log m (log m + log^ε n)). The value δ is a function of the substring complexity of T and Ω(δ log(n/δ)) is a tight lower bound on the compressibility of repetitive texts T, so our structure has optimal size in terms of n and δ.

Cite as

Gonzalo Navarro. Computing MEMs on Repetitive Text Collections. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 24:1-24:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{navarro:LIPIcs.CPM.2023.24,
  author =	{Navarro, Gonzalo},
  title =	{{Computing MEMs on Repetitive Text Collections}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{24:1--24:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.24},
  URN =		{urn:nbn:de:0030-drops-179787},
  doi =		{10.4230/LIPIcs.CPM.2023.24},
  annote =	{Keywords: grammar-based indices, maximal exact matches, locally consistent grammars, substring complexity}
}

Document

DOI: 10.4230/LIPIcs.CPM.2023.25

L-Systems for Measuring Repetitiveness

Authors: Gonzalo Navarro and Cristian Urbina

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

In order to use them for compression, we extend L-systems (without ε-rules) with two parameters d and n, and also a coding τ, which determines unambiguously a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system, and s is its axiom. The length of the shortest description of an L-system generating w is known as 𝓁, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper, we deepen the study of the measure 𝓁 and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that 𝓁 and δ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness. We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
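
The definition w = τ(φ^d(s))[1:n] can be made concrete with a short sketch. The function name and the Fibonacci morphism used as the example are assumptions chosen for illustration; they do not come from the paper.

```python
def lsystem_string(morphism, coding, axiom, d, n):
    """Return tau(phi^d(s))[1:n]: iterate the morphism d times on the axiom,
    apply the coding letter by letter, and keep the first n symbols."""
    word = axiom
    for _ in range(d):
        # No epsilon-rules: every letter maps to a non-empty string.
        word = "".join(morphism[c] for c in word)
    coded = "".join(coding[c] for c in word)
    return coded[:n]  # 1-based prefix [1:n] is the first n symbols

# Hypothetical example: Fibonacci morphism a -> ab, b -> a, axiom "a".
FIB = {"a": "ab", "b": "a"}
BINARY = {"a": "0", "b": "1"}  # a coding maps each letter to a single symbol
print(lsystem_string(FIB, BINARY, "a", 4, 5))
```

Here the description consists of the two rules, the coding, the axiom, and the two integers d and n, while the generated string can be far longer than that description.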

Cite as

Gonzalo Navarro and Cristian Urbina. L-Systems for Measuring Repetitiveness. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 25:1-25:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{navarro_et_al:LIPIcs.CPM.2023.25,
  author =	{Navarro, Gonzalo and Urbina, Cristian},
  title =	{{L-Systems for Measuring Repetitiveness}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{25:1--25:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.25},
  URN =		{urn:nbn:de:0030-drops-179792},
  doi =		{10.4230/LIPIcs.CPM.2023.25},
  annote =	{Keywords: L-systems, String morphisms, Repetitiveness measures, Text compression}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.ICDT.2023.2

Compact Data Structures Meet Databases (Invited Talk)

Authors: Gonzalo Navarro

Published in: LIPIcs, Volume 255, 26th International Conference on Database Theory (ICDT 2023)

Abstract

We describe two success stories on the application of compact data structures (cds) to solve the problem of the excessively redundant space requirements posed by worst-case-optimal (wco) algorithms for multijoins in databases, and particularly basic graph patterns on graph databases. The aim of cds is to represent the data and additional data structures on it, using total space close to that of the plain (and, sometimes, compressed) data, while efficiently simulating the data structure operations. Cds turn out to be a perfect approach for the described problem: We designed and implemented cds that effectively use space close to that of the plain or compressed data, which is orders of magnitude less than existing systems, while retaining worst-case optimality and performing competitively with those systems in query time, sometimes being even considerably faster.

Cite as

Gonzalo Navarro. Compact Data Structures Meet Databases (Invited Talk). In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 2:1-2:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{navarro:LIPIcs.ICDT.2023.2,
  author =	{Navarro, Gonzalo},
  title =	{{Compact Data Structures Meet Databases}},
  booktitle =	{26th International Conference on Database Theory (ICDT 2023)},
  pages =	{2:1--2:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-270-9},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{255},
  editor =	{Geerts, Floris and Vandevoort, Brecht},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2023.2},
  URN =		{urn:nbn:de:0030-drops-177446},
  doi =		{10.4230/LIPIcs.ICDT.2023.2},
  annote =	{Keywords: succinct data structures, tries, multidimensional grids, text searching}
}

Document

Complete Volume

DOI: 10.4230/LIPIcs.ESA.2022

LIPIcs, Volume 244, ESA 2022, Complete Volume

Authors: Shiri Chechik, Gonzalo Navarro, Eva Rotenberg, and Grzegorz Herman

Published in: LIPIcs, Volume 244, 30th Annual European Symposium on Algorithms (ESA 2022)

Abstract

LIPIcs, Volume 244, ESA 2022, Complete Volume

Cite as

30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 1-1406, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@Proceedings{chechik_et_al:LIPIcs.ESA.2022,
  title =	{{LIPIcs, Volume 244, ESA 2022, Complete Volume}},
  booktitle =	{30th Annual European Symposium on Algorithms (ESA 2022)},
  pages =	{1--1406},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-247-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{244},
  editor =	{Chechik, Shiri and Navarro, Gonzalo and Rotenberg, Eva and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022},
  URN =		{urn:nbn:de:0030-drops-169374},
  doi =		{10.4230/LIPIcs.ESA.2022},
  annote =	{Keywords: LIPIcs, Volume 244, ESA 2022, Complete Volume}
}

Document

Front Matter

DOI: 10.4230/LIPIcs.ESA.2022.0

Front Matter, Table of Contents, Preface, Conference Organization

Authors: Shiri Chechik, Gonzalo Navarro, Eva Rotenberg, and Grzegorz Herman

Published in: LIPIcs, Volume 244, 30th Annual European Symposium on Algorithms (ESA 2022)

Abstract

Front Matter, Table of Contents, Preface, Conference Organization

Cite as

30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 0:i-0:xxii, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{chechik_et_al:LIPIcs.ESA.2022.0,
  author =	{Chechik, Shiri and Navarro, Gonzalo and Rotenberg, Eva and Herman, Grzegorz},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{30th Annual European Symposium on Algorithms (ESA 2022)},
  pages =	{0:i--0:xxii},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-247-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{244},
  editor =	{Chechik, Shiri and Navarro, Gonzalo and Rotenberg, Eva and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.0},
  URN =		{urn:nbn:de:0030-drops-169382},
  doi =		{10.4230/LIPIcs.ESA.2022.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}

Document

DOI: 10.4230/LIPIcs.CPM.2022.11

Bi-Directional r-Indexes

Authors: Yuma Arakawa, Gonzalo Navarro, and Kunihiko Sadakane

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

Indexing highly repetitive texts is important in fields such as bioinformatics and versioned repositories. The run-length compression of the Burrows-Wheeler transform (BWT) provides a compressed representation particularly well-suited to text indexing. The r-index is one such index. It enables fast locating of occurrences of a pattern within O(r) words of space, where r is the number of equal-letter runs in the BWT. Its mechanism of locating is to maintain one suffix array sample along the backward-search of the pattern, and to compute all the pattern positions from that sample once the backward-search is complete. In this paper we develop this algorithm further, and propose a new bi-directional text index called the br-index, which supports extending the matched pattern both in forward and backward directions, and locating the occurrences of the pattern at any step of the search, within O(r+r_R) words of space, where r_R is the number of equal-letter runs in the BWT of the reversed text. Our experiments show that the br-index captures the long repetitions of the text, and outperforms the existing indexes in text searching allowing some mismatches except in an internal part.
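
To make the space parameters r and r_R concrete, the following sketch computes the BWT of a small text and of its reverse and counts their equal-letter runs. It builds the BWT naively from sorted rotations, which is only feasible for tiny inputs; the function names and the `$` terminator are assumptions for illustration, not part of the br-index.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform via sorted rotations of text + '$'."""
    text = text + "$"  # '$' is assumed to be smaller than every other symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s):
    """Number of maximal equal-letter runs in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

text = "banana"
r = count_runs(bwt(text))          # runs in the BWT of the text
r_rev = count_runs(bwt(text[::-1]))  # runs in the BWT of the reversed text
print(bwt(text), r, bwt(text[::-1]), r_rev)
```

On repetitive inputs both counts tend to stay far below the text length, which is what makes an O(r + r_R)-word index attractive.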

Cite as

Yuma Arakawa, Gonzalo Navarro, and Kunihiko Sadakane. Bi-Directional r-Indexes. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 11:1-11:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{arakawa_et_al:LIPIcs.CPM.2022.11,
  author =	{Arakawa, Yuma and Navarro, Gonzalo and Sadakane, Kunihiko},
  title =	{{Bi-Directional r-Indexes}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{11:1--11:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.11},
  URN =		{urn:nbn:de:0030-drops-161386},
  doi =		{10.4230/LIPIcs.CPM.2022.11},
  annote =	{Keywords: Compressed text indexes, Burrows-Wheeler Transform, highly repetitive text collections}
}

Document

DOI: 10.4230/LIPIcs.CPM.2022.29

Efficient Construction of the BWT for Repetitive Text Using String Compression

Authors: Diego Díaz-Domínguez and Gonzalo Navarro

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

We present a new semi-external algorithm that builds the Burrows-Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce the computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space, but it also speeds up the required computations. Our experiments show important savings in both space and computation time when the text is repetitive. On average, we are 3.7x faster than the baseline compressed approach, while maintaining a similar memory consumption. These results make our method stand out as the only one (to our knowledge) that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, and using only 27 GB of working memory.

Cite as

Diego Díaz-Domínguez and Gonzalo Navarro. Efficient Construction of the BWT for Repetitive Text Using String Compression. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 29:1-29:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{diazdominguez_et_al:LIPIcs.CPM.2022.29,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and Navarro, Gonzalo},
  title =	{{Efficient Construction of the BWT for Repetitive Text Using String Compression}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{29:1--29:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.29},
  URN =		{urn:nbn:de:0030-drops-161564},
  doi =		{10.4230/LIPIcs.CPM.2022.29},
  annote =	{Keywords: BWT, string compression, repetitive text}
}

Document

DOI: 10.4230/LIPIcs.CPM.2021.13

A Fast and Small Subsampled R-Index

Authors: Dustin Cobas, Travis Gagie, and Gonzalo Navarro

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, 𝒪(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to 𝒪(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.

Cite as

Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A Fast and Small Subsampled R-Index. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{cobas_et_al:LIPIcs.CPM.2021.13,
  author =	{Cobas, Dustin and Gagie, Travis and Navarro, Gonzalo},
  title =	{{A Fast and Small Subsampled R-Index}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{13:1--13:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.13},
  URN =		{urn:nbn:de:0030-drops-139647},
  doi =		{10.4230/LIPIcs.CPM.2021.13},
  annote =	{Keywords: Pattern matching, r-index, compressed text indexing, repetitive text collections}
}

Document

DOI: 10.4230/LIPIcs.CPM.2020.24

Text Indexing and Searching in Sublinear Time

Authors: J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

We introduce the first index that can be built in o(n) time for a text of length n, and can also be queried in o(q) time for a pattern of length q. On an alphabet of size σ, our index uses O(n log σ) bits, is built in O(n log σ / √{log n}) deterministic time, and computes the number of occurrences of the pattern in time O(q/log_σ n + log n log_σ n). Each such occurrence can then be found in O(log n) time. Other trade-offs between the space usage and the cost of reporting occurrences are also possible.

Cite as

J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Text Indexing and Searching in Sublinear Time. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 24:1-24:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{munro_et_al:LIPIcs.CPM.2020.24,
  author =	{Munro, J. Ian and Navarro, Gonzalo and Nekrich, Yakov},
  title =	{{Text Indexing and Searching in Sublinear Time}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{24:1--24:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.24},
  URN =		{urn:nbn:de:0030-drops-121497},
  doi =		{10.4230/LIPIcs.CPM.2020.24},
  annote =	{Keywords: data structures, string indexes}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2020.21

Optimal Joins Using Compact Data Structures

Authors: Gonzalo Navarro, Juan L. Reutter, and Javiel Rojas-Ledesma

Published in: LIPIcs, Volume 155, 23rd International Conference on Database Theory (ICDT 2020)

Abstract

Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now have several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality we either need to build completely new indexes, or we must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that may be non-negligible. We show that optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need of extra storage. Our representation is a compact quadtree for the static indexes, and a dynamic quadtree sharing subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, and show that the running time of this algorithm is worst-case optimal in data complexity. Remarkably, we can extend our framework to evaluate more expressive queries from relational algebra by introducing a lazy version of qdags (lqdags). Once again, we can show that the running time of our algorithms is worst-case optimal.

Cite as

Gonzalo Navarro, Juan L. Reutter, and Javiel Rojas-Ledesma. Optimal Joins Using Compact Data Structures. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 21:1-21:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{navarro_et_al:LIPIcs.ICDT.2020.21,
  author =	{Navarro, Gonzalo and Reutter, Juan L. and Rojas-Ledesma, Javiel},
  title =	{{Optimal Joins Using Compact Data Structures}},
  booktitle =	{23rd International Conference on Database Theory (ICDT 2020)},
  pages =	{21:1--21:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-139-9},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{155},
  editor =	{Lutz, Carsten and Jung, Jean Christoph},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2020.21},
  URN =		{urn:nbn:de:0030-drops-119453},
  doi =		{10.4230/LIPIcs.ICDT.2020.21},
  annote =	{Keywords: Join algorithms, Compact data structures, Quadtrees, AGM bound}
}

Document

DOI: 10.4230/DagRep.9.6.55

25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241)

Authors: Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Jens Stoye

Published in: Dagstuhl Reports, Volume 9, Issue 6 (2020)

Abstract

Dagstuhl Seminar 19241 ("25 Years of the Burrows-Wheeler Transform") took place from June 10th to 14th, 2019, and was attended by 45 people from 13 countries and the three fields of Algorithms and Data Structures, Bioinformatics, and Combinatorics on Words. There were four talks and a panel session for each field. Feedback was generally positive and we are confident the seminar fostered interdisciplinary connections and will eventually result in noteworthy joint publications.

Cite as

Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Jens Stoye. 25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241). In Dagstuhl Reports, Volume 9, Issue 6, pp. 55-68, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@Article{gagie_et_al:DagRep.9.6.55,
  author =	{Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Stoye, Jens},
  title =	{{25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241)}},
  pages =	{55--68},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2019},
  volume =	{9},
  number =	{6},
  editor =	{Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Stoye, Jens},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagRep.9.6.55},
  URN =		{urn:nbn:de:0030-drops-114874},
  doi =		{10.4230/DagRep.9.6.55},
  annote =	{Keywords: Bioinformatics, Burrows-Wheeler Transform, Combinatorics on Words, Data Compression, Data Structures, Indexing, Sequence Alignment}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.26

Simulating the DNA Overlap Graph in Succinct Space

Authors: Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the Burrows-Wheeler Transform (BWT), which gets close to that ideal. Our rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v, it can compute in O(k) time all the other nodes whose labels have an overlap of at least m characters with the label of v, with m being a parameter. If we choose the parameter k equal to the size of the reads (assuming that all have equal length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. Like most BWT-based structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the property of the DNA reverse complements to simulate bi-directionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSS-based assembler can process ~500K reads of 150 characters each (a FASTQ file of 185 MB) in less than 15 minutes and using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.

Cite as

Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro. Simulating the DNA Overlap Graph in Succinct Space. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 26:1-26:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{diazdominguez_et_al:LIPIcs.CPM.2019.26,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and Gagie, Travis and Navarro, Gonzalo},
  title =	{{Simulating the DNA Overlap Graph in Succinct Space}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{26:1--26:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and Pissis, Solon P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.26},
  URN =		{urn:nbn:de:0030-drops-104978},
  doi =		{10.4230/LIPIcs.CPM.2019.26},
  annote =	{Keywords: Overlap graph, de Bruijn graph, DNA sequencing, Succinct ordinal trees}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2018.68

Tree Path Majority Data Structures

Authors: Travis Gagie, Meng He, and Gonzalo Navarro

Published in: LIPIcs, Volume 123, 29th International Symposium on Algorithms and Computation (ISAAC 2018)

Abstract

We present the first solution to tau-majorities on tree paths. Given a tree of n nodes, each with a label from [1..sigma], and a fixed threshold 0<tau<1, such a query gives two nodes u and v and asks for all the labels that appear more than tau * |P_{uv}| times in the path P_{uv} from u to v, where |P_{uv}| denotes the number of nodes in P_{uv}. Note that the answer to any query is of size up to 1/tau. On a w-bit RAM, we obtain a linear-space data structure with O((1/tau)lg^* n lg lg_w sigma) query time. For any kappa > 1, we can also build a structure that uses O(n lg^{[kappa]} n) space, where lg^{[kappa]} n denotes the function that applies logarithm kappa times to n, and answers queries in time O((1/tau)lg lg_w sigma). The construction time of both structures is O(n lg n). We also describe two succinct-space solutions with the same query time as the linear-space structure. One uses 2nH + 4n + o(n)(H+1) bits, where H <= lg sigma is the entropy of the label distribution, and can be built in O(n lg n) time. The other uses nH + O(n) + o(nH) bits and is built in O(n lg n) time w.h.p.
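
To make the query in this abstract concrete, here is a brute-force Python sketch of a tau-majority query on a tree path. It only illustrates the query's semantics and is not the paper's data structure: it walks the path explicitly, so it takes time linear in the path length. The function names, the `parent` dictionary representation (root maps to `None`), and the example tree are all assumptions made for illustration.

```python
from collections import Counter

def tree_path(parent, u, v):
    """Return the nodes on the path from u to v (both included).

    `parent` maps each node to its parent; the root maps to None.
    (Hypothetical representation chosen for this sketch.)
    """
    # Collect u and all its ancestors, remembering their depth offset from u.
    ancestors_of_u = []
    x = u
    while x is not None:
        ancestors_of_u.append(x)
        x = parent[x]
    offset = {node: d for d, node in enumerate(ancestors_of_u)}

    # Climb from v until we hit an ancestor of u: that node is the LCA.
    climb_from_v = []
    x = v
    while x not in offset:
        climb_from_v.append(x)
        x = parent[x]

    return ancestors_of_u[:offset[x] + 1] + climb_from_v[::-1]

def path_tau_majorities(parent, labels, u, v, tau):
    """Labels appearing more than tau * |P_uv| times on the path from u to v."""
    path = tree_path(parent, u, v)
    counts = Counter(labels[x] for x in path)
    return sorted(l for l, c in counts.items() if c > tau * len(path))

# Hypothetical example tree: 0 is the root, 1 and 2 its children, 3 and 4 children of 1.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
labels = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a"}
print(path_tau_majorities(parent, labels, 3, 2, 0.4))
```

Because every reported label must occupy more than a tau fraction of the path, at most 1/tau labels can be returned, which matches the bound on the answer size stated in the abstract.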

Cite as

Travis Gagie, Meng He, and Gonzalo Navarro. Tree Path Majority Data Structures. In 29th International Symposium on Algorithms and Computation (ISAAC 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 123, pp. 68:1-68:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{gagie_et_al:LIPIcs.ISAAC.2018.68,
  author =	{Gagie, Travis and He, Meng and Navarro, Gonzalo},
  title =	{{Tree Path Majority Data Structures}},
  booktitle =	{29th International Symposium on Algorithms and Computation (ISAAC 2018)},
  pages =	{68:1--68:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-094-1},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{123},
  editor =	{Hsu, Wen-Lian and Lee, Der-Tsai and Liao, Chung-Shou},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2018.68},
  URN =		{urn:nbn:de:0030-drops-100166},
  doi =		{10.4230/LIPIcs.ISAAC.2018.68},
  annote =	{Keywords: Majorities on Trees, Succinct data structures}
}

Document

Complete Volume

DOI: 10.4230/LIPIcs.CPM.2018

LIPIcs, Volume 105, CPM'18, Complete Volume

Authors: Gonzalo Navarro, David Sankoff, and Binhai Zhu

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

Abstract

LIPIcs, Volume 105, CPM'18, Complete Volume

Cite as

29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@Proceedings{navarro_et_al:LIPIcs.CPM.2018,
  title =	{{LIPIcs, Volume 105, CPM'18, Complete Volume}},
  booktitle =	{29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-074-3},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{105},
  editor =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018},
  URN =		{urn:nbn:de:0030-drops-89341},
  doi =		{10.4230/LIPIcs.CPM.2018},
  annote =	{Keywords: Mathematics of computing, Discrete mathematics, Information theory, Information systems, Information retrieval, Theory of computation}
}

Document

Front Matter

DOI: 10.4230/LIPIcs.CPM.2018.0

Front Matter, Table of Contents, Preface, Conference Organization

Authors: Gonzalo Navarro, David Sankoff, and Binhai Zhu

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

Abstract

Front Matter, Table of Contents, Preface, Conference Organization

Cite as

29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{navarro_et_al:LIPIcs.CPM.2018.0,
  author =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)},
  pages =	{0:i--0:xvi},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-074-3},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{105},
  editor =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.0},
  URN =		{urn:nbn:de:0030-drops-86849},
  doi =		{10.4230/LIPIcs.CPM.2018.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2017.57

Fast Compressed Self-Indexes with Deterministic Linear-Time Construction

Authors: J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich

Published in: LIPIcs, Volume 92, 28th International Symposium on Algorithms and Computation (ISAAC 2017)

Abstract

We introduce a compressed suffix array representation that, on a text T of length n over an alphabet of size \sigma, can be built in O(n) deterministic time, within O(n\log\sigma) bits of working space, and counts the number of occurrences of any pattern P in T in time O(|P| + \log\log_w \sigma) on a RAM machine of w=\Omega(\log n)-bit words. This new index outperforms all the other compressed indexes that can be built in linear deterministic time, and some others. The only faster indexes can be built in linear time only in expectation, or require \Theta(n\log n) bits.

Cite as

J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Fast Compressed Self-Indexes with Deterministic Linear-Time Construction. In 28th International Symposium on Algorithms and Computation (ISAAC 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 92, pp. 57:1-57:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{munro_et_al:LIPIcs.ISAAC.2017.57,
  author =	{Munro, J. Ian and Navarro, Gonzalo and Nekrich, Yakov},
  title =	{{Fast Compressed Self-Indexes with Deterministic Linear-Time Construction}},
  booktitle =	{28th International Symposium on Algorithms and Computation (ISAAC 2017)},
  pages =	{57:1--57:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-054-5},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{92},
  editor =	{Okamoto, Yoshio and Tokuyama, Takeshi},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2017.57},
  URN =		{urn:nbn:de:0030-drops-82328},
  doi =		{10.4230/LIPIcs.ISAAC.2017.57},
  annote =	{Keywords: Succinct data structures, Self-indexes, Suffix arrays, Deterministic construction}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.4

Document Listing on Repetitive Collections with Guaranteed Performance

Authors: Gonzalo Navarro

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size N over alphabet [1,a] is composed of D copies of a string of size n, and s single-character edits are applied on the copies. We introduce the first document listing index with size O~(n + s), precisely O((n lg a + s lg^2 N) lg D) bits, and with useful worst-case time guarantees: Given a pattern of length m, the index reports the ndoc strings where it appears in time O(m^2 + m lg N (lg D + lg^e N) ndoc), for any constant e > 0.
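
As a point of reference for the problem statement above, the following Python sketch answers a document listing query by scanning every string. It is only a baseline that shows what the query returns, not the index described in the abstract; the function name and the sample collection are assumptions made for illustration.

```python
def document_listing(docs, pattern):
    """Return the indices of the strings in `docs` that contain `pattern`.

    Brute-force baseline: scans every document, so its cost grows with the
    total collection size N rather than with the number ndoc of reported strings.
    """
    return [d for d, text in enumerate(docs) if pattern in text]

# Hypothetical repetitive collection: copies of one string with single-character edits.
docs = ["GATTACA", "GATTAGA", "CATTACA"]
print(document_listing(docs, "TTAC"))
```

The number of indices returned corresponds to ndoc in the time bound quoted in the abstract.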

Cite as

Gonzalo Navarro. Document Listing on Repetitive Collections with Guaranteed Performance. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 4:1-4:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{navarro:LIPIcs.CPM.2017.4,
  author =	{Navarro, Gonzalo},
  title =	{{Document Listing on Repetitive Collections with Guaranteed Performance}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{4:1--4:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.4},
  URN =		{urn:nbn:de:0030-drops-73268},
  doi =		{10.4230/LIPIcs.CPM.2017.4},
  annote =	{Keywords: repetitive string collections, document listing, grammar compression, range minimum queries, succinct data structures}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.5

Path Queries on Functions

Authors: Travis Gagie, Meng He, and Gonzalo Navarro

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

Let f : [1..n] -> [1..n] be a function, and l : [1..n] -> [1..s] indicate a label assigned to each element of the domain. We design several compact data structures that answer various queries on the labels of paths in f. For example, we can find the minimum label in f^k (i) for a given i and any k >= 0 in a given range [k1..k2], using n lg n + O(n) bits, or the minimum label in f^(-k) (i) for a given i and k > 0, using 2n lg n + O(n) bits, both in time O(lg n/ lg lg n). By using n lg s + o(n lg s) further bits, we can also count, within the same time, the number of elements within a range of labels, and report each such element in O(1 + lg s / lg lg n) additional time. Several other possible queries are considered, such as top-t queries and t-majorities.
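
The first example query in this abstract, the minimum label among f^k(i) for k in [k1..k2], can be illustrated by directly iterating f. The sketch below is a naive baseline taking O(k2) time, not one of the compact structures from the paper; the function name, the dictionary representation of f and l, and the sample values are assumptions made for illustration.

```python
def min_label_on_path(f, label, i, k1, k2):
    """Minimum label among f^k(i) for k1 <= k <= k2, by direct iteration.

    `f` and `label` are dictionaries over the domain [1..n]
    (a hypothetical representation chosen for this sketch).
    """
    x = i
    for _ in range(k1):          # advance to f^{k1}(i)
        x = f[x]
    best = label[x]
    for _ in range(k2 - k1):     # visit f^{k1+1}(i), ..., f^{k2}(i)
        x = f[x]
        best = min(best, label[x])
    return best

# Hypothetical function on [1..4]: 1 -> 2 -> 3 -> 1 is a cycle, and 4 -> 2 leads into it.
f = {1: 2, 2: 3, 3: 1, 4: 2}
label = {1: 5, 2: 3, 3: 4, 4: 1}
print(min_label_on_path(f, label, 4, 1, 3))
```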

Cite as

Travis Gagie, Meng He, and Gonzalo Navarro. Path Queries on Functions. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 5:1-5:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{gagie_et_al:LIPIcs.CPM.2017.5,
  author =	{Gagie, Travis and He, Meng and Navarro, Gonzalo},
  title =	{{Path Queries on Functions}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{5:1--5:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.5},
  URN =		{urn:nbn:de:0030-drops-73274},
  doi =		{10.4230/LIPIcs.CPM.2017.5},
  annote =	{Keywords: succinct data structures, integer functions, range queries, trees and permutations}
}

Document

DOI: 10.4230/DagRep.6.10.99

Computation over Compressed Structured Data (Dagstuhl Seminar 16431)

Authors: Philip Bille, Markus Lohrey, Sebastian Maneth, and Gonzalo Navarro

Published in: Dagstuhl Reports, Volume 6, Issue 10 (2017)

Abstract

This report documents the program and the outcomes of Dagstuhl Seminar 16431 "Computation over Compressed Structured Data".

Cite as

Philip Bille, Markus Lohrey, Sebastian Maneth, and Gonzalo Navarro. Computation over Compressed Structured Data (Dagstuhl Seminar 16431). In Dagstuhl Reports, Volume 6, Issue 10, pp. 99-119, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@Article{bille_et_al:DagRep.6.10.99,
  author =	{Bille, Philip and Lohrey, Markus and Maneth, Sebastian and Navarro, Gonzalo},
  title =	{{Computation over Compressed Structured Data (Dagstuhl Seminar 16431)}},
  pages =	{99--119},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2017},
  volume =	{6},
  number =	{10},
  editor =	{Bille, Philip and Lohrey, Markus and Maneth, Sebastian and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagRep.6.10.99},
  URN =		{urn:nbn:de:0030-drops-69521},
  doi =		{10.4230/DagRep.6.10.99},
  annote =	{Keywords: algorithms on compressed structures, data compression, indexing, straight-line programs}
}

Document

DOI: 10.4230/LIPIcs.FSTTCS.2014.291

Asymptotically Optimal Encodings for Range Selection

Authors: Gonzalo Navarro, Rajeev Raman, and Srinivasa Rao Satti

Published in: LIPIcs, Volume 29, 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014)

Abstract

We consider the problem of preprocessing an array A[1..n] to answer range selection and range top-k queries. Given a query interval [i..j] and a value k, the former query asks for the position of the k-th largest value in A[i..j], whereas the latter asks for the positions of all the k largest values in A[i..j]. We consider the encoding version of the problem, where A is not available at query time, and an upper bound kappa on k, the rank that is to be selected, is given at construction time. We obtain data structures with asymptotically optimal size and query time on a RAM model with word size Theta(lg(n)): our structures use O(n*lg(kappa)) bits and answer range selection queries in time O(1+lg(k) / lg(lg(n))) and range top-k queries in time O(k), for any k <= kappa.
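
The two queries defined in this abstract can be pinned down with a naive Python reference that sorts the query range. Unlike the encodings in the paper, it needs the array A at query time and takes O((j-i) log(j-i)) time per query. The function names and the rule of breaking ties toward the leftmost position are assumptions made for illustration.

```python
def _positions_by_value_desc(A, i, j):
    """1-based positions in [i..j], ordered by decreasing value.

    Ties are broken toward the leftmost position (an assumption of this sketch).
    """
    return sorted(range(i, j + 1), key=lambda p: (-A[p - 1], p))

def range_select(A, i, j, k):
    """Position of the k-th largest value in A[i..j] (1-based, inclusive)."""
    return _positions_by_value_desc(A, i, j)[k - 1]

def range_top_k(A, i, j, k):
    """Positions of the k largest values in A[i..j], largest first."""
    return _positions_by_value_desc(A, i, j)[:k]

A = [5, 1, 9, 3, 7]
print(range_select(A, 2, 5, 2))   # position of the 2nd largest value in A[2..5]
print(range_top_k(A, 2, 5, 2))    # positions of the 2 largest values in A[2..5]
```

Note that both functions return positions rather than values, which is what makes an encoding that does not store A meaningful.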

Cite as

Gonzalo Navarro, Rajeev Raman, and Srinivasa Rao Satti. Asymptotically Optimal Encodings for Range Selection. In 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014). Leibniz International Proceedings in Informatics (LIPIcs), Volume 29, pp. 291-301, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2014)

@InProceedings{navarro_et_al:LIPIcs.FSTTCS.2014.291,
  author =	{Navarro, Gonzalo and Raman, Rajeev and Satti, Srinivasa Rao},
  title =	{{Asymptotically Optimal Encodings for Range Selection}},
  booktitle =	{34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014)},
  pages =	{291--301},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-939897-77-4},
  ISSN =	{1868-8969},
  year =	{2014},
  volume =	{29},
  editor =	{Raman, Venkatesh and Suresh, S. P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.FSTTCS.2014.291},
  URN =		{urn:nbn:de:0030-drops-48502},
  doi =		{10.4230/LIPIcs.FSTTCS.2014.291},
  annote =	{Keywords: Data Structures, Order Statistics, Succinct Data Structures, Space-efficient Data Structures}
}

Document

DOI: 10.4230/DagRep.3.6.22

Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232)

Authors: Sebastian Maneth and Gonzalo Navarro

Published in: Dagstuhl Reports, Volume 3, Issue 6 (2013)

Abstract

This report documents the program and the outcomes of Dagstuhl Seminar 13232 "Indexes and Computation over Compressed Structured Data".

Cite as

Sebastian Maneth and Gonzalo Navarro. Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232). In Dagstuhl Reports, Volume 3, Issue 6, pp. 22-37, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2013)

@Article{maneth_et_al:DagRep.3.6.22,
  author =	{Maneth, Sebastian and Navarro, Gonzalo},
  title =	{{Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232)}},
  pages =	{22--37},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2013},
  volume =	{3},
  number =	{6},
  editor =	{Maneth, Sebastian and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagRep.3.6.22},
  URN =		{urn:nbn:de:0030-drops-42558},
  doi =		{10.4230/DagRep.3.6.22},
  annote =	{Keywords: Compression; Indexes; Data Structures}
}

Document

DOI: 10.4230/LIPIcs.STACS.2009.1814

Compressed Representations of Permutations, and Applications

Authors: Jeremy Barbay and Gonzalo Navarro

Published in: LIPIcs, Volume 3, 26th International Symposium on Theoretical Aspects of Computer Science (2009)

Abstract

We explore various techniques to compress a permutation $\pi$ over $n$ integers, taking advantage of ordered subsequences in $\pi$, while supporting its application $\pi(i)$ and the application of its inverse $\pi^{-1}(i)$ in small time. Our compression schemes yield several interesting byproducts, in many cases matching, improving or extending the best existing results on applications such as the encoding of a permutation in order to support iterated applications $\pi^{k}(i)$ of it, of integer functions, and of inverted lists and suffix arrays.

Cite as

Jeremy Barbay and Gonzalo Navarro. Compressed Representations of Permutations, and Applications. In 26th International Symposium on Theoretical Aspects of Computer Science. Leibniz International Proceedings in Informatics (LIPIcs), Volume 3, pp. 111-122, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2009)

@InProceedings{barbay_et_al:LIPIcs.STACS.2009.1814,
  author =	{Barbay, Jeremy and Navarro, Gonzalo},
  title =	{{Compressed Representations of Permutations, and Applications}},
  booktitle =	{26th International Symposium on Theoretical Aspects of Computer Science},
  pages =	{111--122},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-939897-09-5},
  ISSN =	{1868-8969},
  year =	{2009},
  volume =	{3},
  editor =	{Albers, Susanne and Marion, Jean-Yves},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2009.1814},
  URN =		{urn:nbn:de:0030-drops-18148},
  doi =		{10.4230/LIPIcs.STACS.2009.1814},
  annote =	{Keywords: Compression, Permutations, Succinct data structures, Adaptive sorting}
}

Document

DOI: 10.4230/DagSemProc.08261.6

An In-Memory XQuery/XPath Engine over a Compressed Structured Text Representation

Authors: Angela Bonifati, Gregory Leighton, Veli Mäkinen, Sebastian Maneth, Gonzalo Navarro, and Andrea Pugliese

Published in: Dagstuhl Seminar Proceedings, Volume 8261, Structure-Based Compression of Complex Massive Data (2008)

Abstract

We describe the architecture and main algorithmic design decisions for an XQuery/XPath processing engine over XML collections which will be represented using a self-indexing approach, that is, a compressed representation that will allow for basic searching and navigational operations in compressed form. The goal is a structure that occupies little space and thus permits manipulating large collections in main memory.

Cite as

Angela Bonifati, Gregory Leighton, Veli Mäkinen, Sebastian Maneth, Gonzalo Navarro, and Andrea Pugliese. An In-Memory XQuery/XPath Engine over a Compressed Structured Text Representation. In Structure-Based Compression of Complex Massive Data. Dagstuhl Seminar Proceedings, Volume 8261, pp. 1-17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{bonifati_et_al:DagSemProc.08261.6,
  author =	{Bonifati, Angela and Leighton, Gregory and M\"{a}kinen, Veli and Maneth, Sebastian and Navarro, Gonzalo and Pugliese, Andrea},
  title =	{{An In-Memory XQuery/XPath Engine over a Compressed Structured Text Representation}},
  booktitle =	{Structure-Based Compression of Complex Massive Data},
  pages =	{1--17},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8261},
  editor =	{Stefan B\"{o}ttcher and Markus Lohrey and Sebastian Maneth and Wojciech Rytter},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08261.6},
  URN =		{urn:nbn:de:0030-drops-16776},
  doi =		{10.4230/DagSemProc.08261.6},
  annote =	{Keywords: Compressed self-index, compressed XML representation, XPath, XQuery}
}

Document

DOI: 10.4230/DagSemProc.08261.10

Storage and Retrieval of Individual Genomes

Authors: Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki

Published in: Dagstuhl Seminar Proceedings, Volume 8261, Structure-Based Compression of Complex Massive Data (2008)

Abstract

A repetitive sequence collection is one where portions of a base sequence of length $n$ are repeated many times with small variations, forming a collection of total length $N$. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies $O(N \log N)$ bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to $O(N \log \sigma)$ bits, where $\sigma$ is the alphabet size. In practice, the space reduction is more than $10$-fold, for example on the suffix tree of the Human Genome. However, this reduction remains a constant factor when more sequences are added to the collection. We develop a new self-index suited for the repetitive sequence collection setting. Its expected space requirement depends only on the length $n$ of the base sequence and the number $s$ of variations in its repeated copies. That is, the space reduction is no longer constant, but depends on $N/n$. We believe the structure developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

Cite as

Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and Retrieval of Individual Genomes. In Structure-Based Compression of Complex Massive Data. Dagstuhl Seminar Proceedings, Volume 8261, pp. 1-14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{makinen_et_al:DagSemProc.08261.10,
  author =	{M\"{a}kinen, Veli and Navarro, Gonzalo and Sir\'{e}n, Jouni and V\"{a}lim\"{a}ki, Niko},
  title =	{{Storage and Retrieval of Individual Genomes}},
  booktitle =	{Structure-Based Compression of Complex Massive Data},
  pages =	{1--14},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8261},
  editor =	{Stefan B\"{o}ttcher and Markus Lohrey and Sebastian Maneth and Wojciech Rytter},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08261.10},
  URN =		{urn:nbn:de:0030-drops-16743},
  doi =		{10.4230/DagSemProc.08261.10},
  annote =	{Keywords: Pattern matching, text indexing, compressed data structures, comparative genomics}
}

Search Results

Documents authored by Navarro, Gonzalo
