3 Search Results for "Durbin, Richard"

Document

DOI: 10.4230/LIPIcs.CPM.2023.22

Merging Sorted Lists of Similar Strings

Authors: Gene Myers

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem typically solved practically in O(M log T) time with a priority-queue data structure, the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T+S) worst-case method that performs no more character comparisons than the sum of the lengths of all the strings S, and another O(M log (T/ē)+S) method that becomes progressively more efficient as a function of the fraction of equal elements ē = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
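
For orientation, the classic heap-based baseline that the abstract starts from can be sketched as follows. This is only the textbook O(M log T) merge with duplicate removal, not the paper's improved methods; the function name `merge_sorted_unique` and the sample lists are illustrative assumptions.

```python
import heapq
from typing import Iterator, List


def merge_sorted_unique(lists: List[List[str]]) -> Iterator[str]:
    """Merge T sorted, duplicate-free lists into one sorted, duplicate-free stream.

    Hypothetical helper: the plain heap baseline, comparing whole strings.
    """
    # Each heap entry is (current string, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    last = None  # most recently emitted string, used to drop duplicates
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value != last:
            yield value
            last = value
        # Advance the list that supplied the popped element.
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))


merged = list(
    merge_sorted_unique([["acg", "act", "gat"], ["acg", "cat"], ["act", "gat", "tag"]])
)
print(merged)
```

Every pop and push costs O(log T) string comparisons, and each comparison may rescan a long shared prefix. That repeated scanning is the cost the abstract says can be avoided by keeping auxiliary information with each heap node.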

Cite as

Gene Myers. Merging Sorted Lists of Similar Strings. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{myers:LIPIcs.CPM.2023.22,
  author =	{Myers, Gene},
  title =	{{Merging Sorted Lists of Similar Strings}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{22:1--22:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.22},
  URN =		{urn:nbn:de:0030-drops-179763},
  doi =		{10.4230/LIPIcs.CPM.2023.22},
  annote =	{Keywords: heap, trie, longest common prefix}
}

Document

DOI: 10.4230/LIPIcs.WABI.2021.13

Compressing and Indexing Aligned Readsets

Authors: Travis Gagie, Garance Gourdel, and Giovanni Manzini

Published in: LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Abstract

Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens of reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that.
Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
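
The run-count measure mentioned above can be illustrated on a single string. The sketch below uses the ordinary Burrows-Wheeler Transform of one text, built naively from sorted rotations; it is not the EBWT or XBWT construction from the paper, and the names `bwt` and `count_runs` as well as the `$` sentinel are assumptions made for the example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Assumes the sentinel does not occur in text and sorts before every
    character of it (true for '$' versus ASCII letters).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal blocks of equal consecutive characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")
print(transformed, count_runs(transformed))  # annb$aa 5
```

Fewer runs mean the transformed string compresses better under run-length encoding, which is why the abstract reports run counts as a proxy for index size.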

Cite as

Travis Gagie, Garance Gourdel, and Giovanni Manzini. Compressing and Indexing Aligned Readsets. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 13:1-13:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{gagie_et_al:LIPIcs.WABI.2021.13,
  author =	{Gagie, Travis and Gourdel, Garance and Manzini, Giovanni},
  title =	{{Compressing and Indexing Aligned Readsets}},
  booktitle =	{21st International Workshop on Algorithms in Bioinformatics (WABI 2021)},
  pages =	{13:1--13:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-200-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{201},
  editor =	{Carbone, Alessandra and El-Kebir, Mohammed},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.13},
  URN =		{urn:nbn:de:0030-drops-143660},
  doi =		{10.4230/LIPIcs.WABI.2021.13},
  annote =	{Keywords: data compression, compact data structures, FM-index, Burrows-Wheeler Transform, EBWT, XBWT, DNA reads}
}

Document

DOI: 10.4230/LIPIcs.WABI.2018.4

Haplotype-aware graph indexes

Authors: Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, and Richard Durbin

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract

The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Cite as

Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, and Richard Durbin. Haplotype-aware graph indexes. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 4:1-4:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{siren_et_al:LIPIcs.WABI.2018.4,
  author =	{Sir\'{e}n, Jouni and Garrison, Erik and Novak, Adam M. and Paten, Benedict J. and Durbin, Richard},
  title =	{{Haplotype-aware graph indexes}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{4:1--4:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.4},
  URN =		{urn:nbn:de:0030-drops-93060},
  doi =		{10.4230/LIPIcs.WABI.2018.4},
  annote =	{Keywords: FM-indexes, variation graphs, haplotypes}
}
