DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2024.13

PLA-index: A k-mer Index Exploiting Rank Curve Linearity

Authors: Md. Hasin Abrar and Paul Medvedev

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Abstract

Given a sorted list of k-mers S, the rank curve of S is the function mapping a k-mer from the k-mer universe to the location in S where it either first appears or would be inserted. An exciting recent development is the observation that, for certain datasets, the rank curve is predictable and can be exploited to create small search indices. In this paper, we develop a novel search index that first estimates a k-mer’s rank using a piece-wise linear approximation of the rank curve and then does a local search to determine the precise location of the k-mer in the list. We combine ideas from previous approaches and supplement them with an innovative data representation strategy that substantially reduces space usage. Our PLA-index uses an order of magnitude less space than Sapling and uses less than half the space of the PGM-index, for roughly the same query time. For example, using only 9 MiB of memory, it can narrow down the position of k-mer in the suffix array of the human genome to within 255 positions. Furthermore, we demonstrate the potential of our approach to impact a variety of downstream applications. First, the PLA-index halves the time of binary search on the suffix array of the human genome. Second, the PLA-index reduces the space of a direct-access lookup table by 76 percent, without increasing the run time. Third, we plug the PLA-index into a state-of-the-art read aligner Strobealign and replace a 2 GiB component with a PLA-index of size 1.5 MiB, without significantly effecting runtime. The software and reproducibility information is freely available at https://github.com/medvedevgroup/pla-index.

Cite as

Md. Hasin Abrar and Paul Medvedev. PLA-index: A k-mer Index Exploiting Rank Curve Linearity. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{abrar_et_al:LIPIcs.WABI.2024.13,
  author =	{Abrar, Md. Hasin and Medvedev, Paul},
  title =	{{PLA-index: A k-mer Index Exploiting Rank Curve Linearity}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{13:1--13:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.13},
  URN =		{urn:nbn:de:0030-drops-206578},
  doi =		{10.4230/LIPIcs.WABI.2024.13},
  annote =	{Keywords: K-mer index, Piece-wise linear approximation, Learned index}
}

Document

DOI: 10.4230/LIPIcs.WABI.2024.23

AlfaPang: Alignment Free Algorithm for Pangenome Graph Construction

Authors: Adam Cicherski, Anna Lisiecka, and Norbert Dojer

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Abstract

The success of pangenome-based approaches to genomics analysis depends largely on the existence of efficient methods for constructing pangenome graphs that are applicable to large genome collections. In the current paper we present AlfaPang, a new pangenome graph building algorithm. AlfaPang is based on a novel alignment-free approach that allows to construct pangenome graphs using significantly less computational resources than state-of-the-art tools. The code of AlfaPang is freely available at https://github.com/AdamCicherski/AlfaPang.

Cite as

Adam Cicherski, Anna Lisiecka, and Norbert Dojer. AlfaPang: Alignment Free Algorithm for Pangenome Graph Construction. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 23:1-23:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{cicherski_et_al:LIPIcs.WABI.2024.23,
  author =	{Cicherski, Adam and Lisiecka, Anna and Dojer, Norbert},
  title =	{{AlfaPang: Alignment Free Algorithm for Pangenome Graph Construction}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{23:1--23:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.23},
  URN =		{urn:nbn:de:0030-drops-206673},
  doi =		{10.4230/LIPIcs.WABI.2024.23},
  annote =	{Keywords: pangenome, variation graph, genome alignment, population genomics}
}

Document

DOI: 10.4230/LIPIcs.CPM.2021.16

Compressed Weighted de Bruijn Graphs

Authors: Giuseppe F. Italiano, Nicola Prezza, Blerina Sinaimeri, and Rossano Venturini

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

We propose a new compressed representation for weighted de Bruijn graphs, which is based on the idea of delta-encoding the variations of k-mer abundances on a spanning branching of the graph. Our new data structure is likely to be of practical value: to give an idea, when combined with the compressed BOSS de Bruijn graph representation, it encodes the weighted de Bruijn graph of a 16x-covered DNA read-set (60M distinct k-mers, k = 28) within 4.15 bits per distinct k-mer and can answer abundance queries in about 60 microseconds on a standard machine. In contrast, state of the art tools declare a space usage of at least 30 bits per distinct k-mer for the same task, which is confirmed by our experiments. As a by-product of our new data structure, we exhibit efficient compressed data structures for answering partial sums on edge-weighted trees, which might be of independent interest.

Cite as

Giuseppe F. Italiano, Nicola Prezza, Blerina Sinaimeri, and Rossano Venturini. Compressed Weighted de Bruijn Graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{italiano_et_al:LIPIcs.CPM.2021.16,
  author =	{Italiano, Giuseppe F. and Prezza, Nicola and Sinaimeri, Blerina and Venturini, Rossano},
  title =	{{Compressed Weighted de Bruijn Graphs}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.16},
  URN =		{urn:nbn:de:0030-drops-139675},
  doi =		{10.4230/LIPIcs.CPM.2021.16},
  annote =	{Keywords: weighted de Bruijn graphs, k-mer annotation, compressed data structures, partial sums}
}

Document

DOI: 10.4230/LIPIcs.ESA.2017.38

An Encoding for Order-Preserving Matching

Authors: Travis Gagie, Giovanni Manzini, and Rossano Venturini

Published in: LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

Abstract

Encoding data structures store enough information to answer the queries they are meant to support but not enough to recover their underlying datasets. In this paper we give the first encoding data structure for the challenging problem of order-preserving pattern matching. This problem was introduced only a few years ago but has already attracted significant attention because of its applications in data analysis. Two strings are said to be an order-preserving match if the relative order of their characters is the same: e.g., (4, 1, 3, 2) and (10, 3, 7, 5) are an order-preserving match. We show how, given a string S[1..n] over an arbitrary alphabet of size sigma and a constant c >=1, we can build an O(n log log n)-bit encoding such that later, given a pattern P[1..m] with m >= log^c n, we can return the number of order-preserving occurrences of P in S in O(m) time. Within the same time bound we can also return the starting position of some order-preserving match for P in S (if such a match exists). We prove that our space bound is within a constant factor of optimal if log(sigma) = Omega(log log n); our query time is optimal if log(sigma) = Omega(log n). Our space bound contrasts with the Omega(n log n) bits needed in the worst case to store S itself, an index for order-preserving pattern matching with no restrictions on the pattern length, or an index for standard pattern matching even with restrictions on the pattern length. Moreover, we can build our encoding knowing only how each character compares to O(log^c n) neighbouring characters.

Cite as

Travis Gagie, Giovanni Manzini, and Rossano Venturini. An Encoding for Order-Preserving Matching. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 38:1-38:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.ESA.2017.38,
  author =	{Gagie, Travis and Manzini, Giovanni and Venturini, Rossano},
  title =	{{An Encoding for Order-Preserving Matching}},
  booktitle =	{25th Annual European Symposium on Algorithms (ESA 2017)},
  pages =	{38:1--38:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-049-1},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{87},
  editor =	{Pruhs, Kirk and Sohler, Christian},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.38},
  URN =		{urn:nbn:de:0030-drops-78726},
  doi =		{10.4230/LIPIcs.ESA.2017.38},
  annote =	{Keywords: Compact data structures, encodings, order-preserving matching}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.30

Dynamic Elias-Fano Representation

Authors: Giulio Ermanno Pibiri and Rossano Venturini

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

We show that it is possible to store a dynamic ordered set S of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound and preserve, at the same time, the asymptotic time optimality of the operations. Our results leverage on the Elias-Fano representation of monotone integer sequences, which can be shown to be less than half a bit per element away from the information-theoretic minimum. In particular, considering a RAM model with memory word size Theta(log u) bits, when integers are drawn from a polynomial universe of size u = n^gamma for any gamma = Theta(1), we add o(n) bits to the static Elias-Fano representation in order to: 1. support static predecessor/successor queries in O(min{1+log(u/n), loglog n}); 2. make S grow in an append-only fashion by spending O(1) per inserted element; 3. describe a dynamic data structure supporting random access in O(log n / loglog n) worst-case, insertions/deletions in O(log n / loglog n) amortized and predecessor/successor queries in O(min{1+log(u/n), loglog n}) worst-case time. These time bounds are optimal.

Cite as

Giulio Ermanno Pibiri and Rossano Venturini. Dynamic Elias-Fano Representation. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 30:1-30:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{pibiri_et_al:LIPIcs.CPM.2017.30,
  author =	{Pibiri, Giulio Ermanno and Venturini, Rossano},
  title =	{{Dynamic Elias-Fano Representation}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{30:1--30:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.30},
  URN =		{urn:nbn:de:0030-drops-73244},
  doi =		{10.4230/LIPIcs.CPM.2017.30},
  annote =	{Keywords: succinct data structures, integer sets, predecessor problem, Elias-Fano}
}

5 Search Results for "Venturini, Rossano"

PLA-index: A k-mer Index Exploiting Rank Curve Linearity

Abstract

Cite as

AlfaPang: Alignment Free Algorithm for Pangenome Graph Construction

Abstract

Cite as

Compressed Weighted de Bruijn Graphs

Abstract

Cite as

An Encoding for Order-Preserving Matching

Abstract

Cite as

Dynamic Elias-Fano Representation

Abstract

Cite as

Thanks for your feedback!

Could not send message