DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2022.18

Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs

Authors: Adrián Goga and Andrej Baláž

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Abstract

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on the prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP representation of the text is generally much shorter than the original since individual phrases are used many times in the parse, thus reducing the size of the dictionary. To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.

Cite as

Adrián Goga and Andrej Baláž. Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 18:1-18:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{goga_et_al:LIPIcs.WABI.2022.18,
  author =	{Goga, Adri\'{a}n and Bal\'{a}\v{z}, Andrej},
  title =	{{Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{18:1--18:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.18},
  URN =		{urn:nbn:de:0030-drops-170529},
  doi =		{10.4230/LIPIcs.WABI.2022.18},
  annote =	{Keywords: Wheeler graphs, BWT tunnelling, prefix-free parsing, pangenomic graphs}
}

Document

DOI: 10.4230/LIPIcs.WABI.2021.13

Compressing and Indexing Aligned Readsets

Authors: Travis Gagie, Garance Gourdel, and Giovanni Manzini

Published in: LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Abstract

Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separators characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.

Cite as

Travis Gagie, Garance Gourdel, and Giovanni Manzini. Compressing and Indexing Aligned Readsets. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 13:1-13:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.WABI.2021.13,
  author =	{Gagie, Travis and Gourdel, Garance and Manzini, Giovanni},
  title =	{{Compressing and Indexing Aligned Readsets}},
  booktitle =	{21st International Workshop on Algorithms in Bioinformatics (WABI 2021)},
  pages =	{13:1--13:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-200-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{201},
  editor =	{Carbone, Alessandra and El-Kebir, Mohammed},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.13},
  URN =		{urn:nbn:de:0030-drops-143660},
  doi =		{10.4230/LIPIcs.WABI.2021.13},
  annote =	{Keywords: data compression, compact data structures, FM-index, Burrows-Wheeler Transform, EBWT, XBWT, DNA reads}
}

Document

DOI: 10.4230/LIPIcs.CPM.2018.3

On Undetected Redundancy in the Burrows-Wheeler Transform

Authors: Uwe Baier

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

Abstract

The Burrows-Wheeler-Transform (BWT) is an invertible permutation of a text known to be highly compressible but also useful for sequence analysis, what makes the BWT highly attractive for lossless data compression. In this paper, we present a new technique to reduce the size of a BWT using its combinatorial properties, while keeping it invertible. The technique can be applied to any BWT-based compressor, and, as experiments show, is able to reduce the encoding size by 8-16 % on average and up to 33-57 % in the best cases (depending on the BWT-compressor used), making BWT-based compressors competitive or even superior to today's best lossless compressors.

Cite as

Uwe Baier. On Undetected Redundancy in the Burrows-Wheeler Transform. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 3:1-3:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{baier:LIPIcs.CPM.2018.3,
  author =	{Baier, Uwe},
  title =	{{On Undetected Redundancy in the Burrows-Wheeler Transform}},
  booktitle =	{29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)},
  pages =	{3:1--3:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-074-3},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{105},
  editor =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.3},
  URN =		{urn:nbn:de:0030-drops-87049},
  doi =		{10.4230/LIPIcs.CPM.2018.3},
  annote =	{Keywords: Lossless data compression, BWT, Tunneling}
}

Document

DOI: 10.4230/LIPIcs.CPM.2016.23

Linear-time Suffix Sorting - A New Approach for Suffix Array Construction

Authors: Uwe Baier

Published in: LIPIcs, Volume 54, 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)

Abstract

This paper presents a new approach for linear-time suffix sorting. It introduces a new sorting principle that can be used to build the first non-recursive linear-time suffix array construction algorithm named GSACA. Although GSACA cannot keep up with the performance of state of the art suffix array construction algorithms, the algorithm introduces a couple of new ideas for suffix array construction, and therefore can be seen as an ’idea collection’ for further suffix array construction improvements.

Cite as

Uwe Baier. Linear-time Suffix Sorting - A New Approach for Suffix Array Construction. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 23:1-23:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{baier:LIPIcs.CPM.2016.23,
  author =	{Baier, Uwe},
  title =	{{Linear-time Suffix Sorting - A New Approach for Suffix Array Construction}},
  booktitle =	{27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)},
  pages =	{23:1--23:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-012-5},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{54},
  editor =	{Grossi, Roberto and Lewenstein, Moshe},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2016.23},
  URN =		{urn:nbn:de:0030-drops-60698},
  doi =		{10.4230/LIPIcs.CPM.2016.23},
  annote =	{Keywords: Suffix array, sorting algorithm, linear time}
}

4 Search Results for "Baier, Uwe"

Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs

Abstract

Cite as

Compressing and Indexing Aligned Readsets

Abstract

Cite as

On Undetected Redundancy in the Burrows-Wheeler Transform

Abstract

Cite as

Linear-time Suffix Sorting - A New Approach for Suffix Array Construction

Abstract

Cite as