DROPS

Document

DOI: 10.4230/LIPIcs.SEA.2023.7

Simple Runs-Bounded FM-Index Designs Are Fast

Authors: Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi, and Leena Salmela

Published in: LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)

Abstract

Given a string X of length n over an alphabet of size σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position, counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.
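
The two block layouts described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: all function names are hypothetical, blocks are kept as plain Python lists rather than a compact encoding, the block number for a 1-based position i is taken as (i - 1) // b (an assumption about how "block i/b" is rounded), and the predecessor query is answered with a binary search over the block starting positions.

```python
from bisect import bisect_right
from itertools import groupby

def run_length_encode(s):
    """Return s as a list of (symbol, run_length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(s)]

def scan(runs, c, remaining):
    """Count occurrences of c among the first `remaining` symbols of `runs`."""
    total = 0
    for sym, n in runs:
        if remaining <= 0:
            break
        take = min(n, remaining)
        if sym == c:
            total += take
        remaining -= take
    return total

def _pack(groups):
    """Attach to each group of runs its start offset and the symbol counts before it."""
    starts, blocks, counts, pos = [], [], {}, 0
    for runs in groups:
        starts.append(pos)                   # 0-based offset of the block in L
        blocks.append((dict(counts), runs))  # counts of every symbol before the block
        for sym, n in runs:
            counts[sym] = counts.get(sym, 0) + n
            pos += n
    return starts, blocks

def build_fixed_blocks(L, b):
    """Design 1: cut L into blocks of b symbols, run-length encode each block."""
    return _pack(run_length_encode(L[s:s + b]) for s in range(0, len(L), b))

def build_run_blocks(L, r):
    """Design 2: run-length encode all of L, then cut into blocks of r runs."""
    runs = run_length_encode(L)
    return _pack(runs[j:j + r] for j in range(0, len(runs), r))

def rank_fixed(index, b, c, i):
    """Occurrences of c in L[1..i] (1-based) using fixed-size blocks."""
    _, blocks = index
    if i <= 0 or not blocks:
        return 0
    k = min((i - 1) // b, len(blocks) - 1)   # block containing position i
    before, runs = blocks[k]
    return before.get(c, 0) + scan(runs, c, i - k * b)

def rank_runs(index, c, i):
    """Occurrences of c in L[1..i] (1-based) using blocks of equally many runs."""
    starts, blocks = index
    if i <= 0 or not blocks:
        return 0
    k = bisect_right(starts, i - 1) - 1      # predecessor query over block starts
    before, runs = blocks[k]
    return before.get(c, 0) + scan(runs, c, i - starts[k])

L = "aaabbbaacc$aab"
print(rank_fixed(build_fixed_blocks(L, 4), 4, "a", 8))  # 5
print(rank_runs(build_run_blocks(L, 2), "a", 8))        # 5
```

In both layouts the answer is the stored count before the block plus a short scan inside it; the layouts differ only in how the block containing i is located (a division in the first, a predecessor search in the second).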

Cite as

Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi, and Leena Salmela. Simple Runs-Bounded FM-Index Designs Are Fast. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 7:1-7:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{diazdominguez_et_al:LIPIcs.SEA.2023.7,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and D\"{o}nges, Saska and Puglisi, Simon J. and Salmela, Leena},
  title =	{{Simple Runs-Bounded FM-Index Designs Are Fast}},
  booktitle =	{21st International Symposium on Experimental Algorithms (SEA 2023)},
  pages =	{7:1--7:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-279-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{265},
  editor =	{Georgiadis, Loukas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.7},
  URN =		{urn:nbn:de:0030-drops-183579},
  doi =		{10.4230/LIPIcs.SEA.2023.7},
  annote =	{Keywords: data structures, efficient algorithms}
}

Document

DOI: 10.4230/LIPIcs.CPM.2022.29

Efficient Construction of the BWT for Repetitive Text Using String Compression

Authors: Diego Díaz-Domínguez and Gonzalo Navarro

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

We present a new semi-external algorithm that builds the Burrows-Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce the computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space, but it also speeds up the required computations. Our experiments show important savings in both space and computation time when the text is repetitive. On average, we are 3.7x faster than the baseline compressed approach, while maintaining a similar memory consumption. These results make our method stand out as the only one (to our knowledge) that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, and using only 27 GB of working memory.

Cite as

Diego Díaz-Domínguez and Gonzalo Navarro. Efficient Construction of the BWT for Repetitive Text Using String Compression. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 29:1-29:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{diazdominguez_et_al:LIPIcs.CPM.2022.29,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and Navarro, Gonzalo},
  title =	{{Efficient Construction of the BWT for Repetitive Text Using String Compression}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{29:1--29:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.29},
  URN =		{urn:nbn:de:0030-drops-161564},
  doi =		{10.4230/LIPIcs.CPM.2022.29},
  annote =	{Keywords: BWT, string compression, repetitive text}
}

Document

DOI: 10.4230/LIPIcs.WABI.2021.13

Compressing and Indexing Aligned Readsets

Authors: Travis Gagie, Garance Gourdel, and Giovanni Manzini

Published in: LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Abstract

Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens of reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of the readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that.
Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
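
The comparison metric mentioned above, the number of runs in a transformed string, can be illustrated with a small Python sketch. It uses the plain single-string BWT built by sorting rotations, not the EBWT or XBWT from the paper, and the function names are hypothetical. It assumes the sentinel "$" does not occur in the text and sorts below every text character.

```python
from itertools import groupby

def bwt(text, sentinel="$"):
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s):
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for _ in groupby(s))

print(bwt("banana"))              # annb$aa
print(count_runs(bwt("banana")))  # 5
```

Fewer runs means a run-length-encoded index needs fewer (symbol, length) pairs, which is why the run count serves as a proxy for index size on repetitive data.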

Cite as

Travis Gagie, Garance Gourdel, and Giovanni Manzini. Compressing and Indexing Aligned Readsets. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 13:1-13:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{gagie_et_al:LIPIcs.WABI.2021.13,
  author =	{Gagie, Travis and Gourdel, Garance and Manzini, Giovanni},
  title =	{{Compressing and Indexing Aligned Readsets}},
  booktitle =	{21st International Workshop on Algorithms in Bioinformatics (WABI 2021)},
  pages =	{13:1--13:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-200-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{201},
  editor =	{Carbone, Alessandra and El-Kebir, Mohammed},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.13},
  URN =		{urn:nbn:de:0030-drops-143660},
  doi =		{10.4230/LIPIcs.WABI.2021.13},
  annote =	{Keywords: data compression, compact data structures, FM-index, Burrows-Wheeler Transform, EBWT, XBWT, DNA reads}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.26

Simulating the DNA Overlap Graph in Succinct Space

Authors: Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the Burrows-Wheeler Transform (BWT), which gets close to that ideal. Our rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v, it can compute in O(k) time all the other nodes whose labels have an overlap of at least m characters with the label of v, with m being a parameter. If we choose the parameter k equal to the size of the reads (assuming that all have equal length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. Like most BWT-based structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the property of the DNA reverse complements to simulate bi-directionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSS-based assembler can process ~500K reads of 150 characters each (a FASTQ file of 185 MB) in less than 15 minutes and using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.

Cite as

Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro. Simulating the DNA Overlap Graph in Succinct Space. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 26:1-26:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{diazdominguez_et_al:LIPIcs.CPM.2019.26,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and Gagie, Travis and Navarro, Gonzalo},
  title =	{{Simulating the DNA Overlap Graph in Succinct Space}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{26:1--26:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and Pissis, Solon P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.26},
  URN =		{urn:nbn:de:0030-drops-104978},
  doi =		{10.4230/LIPIcs.CPM.2019.26},
  annote =	{Keywords: Overlap graph, de Bruijn graph, DNA sequencing, Succinct ordinal trees}
}

