5 Search Results for "Myers, Gene"


Document
Simple Runs-Bounded FM-Index Designs Are Fast

Authors: Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi, and Leena Salmela

Published in: LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)


Abstract
Given a string X of length n on alphabet σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly-repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.

Cite as

Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi, and Leena Salmela. Simple Runs-Bounded FM-Index Designs Are Fast. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 7:1-7:16, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{diazdominguez_et_al:LIPIcs.SEA.2023.7,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and D\"{o}nges, Saska and Puglisi, Simon J. and Salmela, Leena},
  title =	{{Simple Runs-Bounded FM-Index Designs Are Fast}},
  booktitle =	{21st International Symposium on Experimental Algorithms (SEA 2023)},
  pages =	{7:1--7:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-279-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{265},
  editor =	{Georgiadis, Loukas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.7},
  URN =		{urn:nbn:de:0030-drops-183579},
  doi =		{10.4230/LIPIcs.SEA.2023.7},
  annote =	{Keywords: data structures, efficient algorithms}
}
Document
Merging Sorted Lists of Similar Strings

Authors: Gene Myers

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)


Abstract
Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem typically solved practically in O(M log T) time with a priority-queue data structure the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T+S) worst-case method that performs no more character comparisons than the sum of the lengths of all the strings S, and another O(M log (T/e¯)+S) method that becomes progressively more efficient as a function of the fraction of equal elements e¯ = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.

Cite as

Gene Myers. Merging Sorted Lists of Similar Strings. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 22:1-22:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{myers:LIPIcs.CPM.2023.22,
  author =	{Myers, Gene},
  title =	{{Merging Sorted Lists of Similar Strings}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{22:1--22:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.22},
  URN =		{urn:nbn:de:0030-drops-179763},
  doi =		{10.4230/LIPIcs.CPM.2023.22},
  annote =	{Keywords: heap, trie, longest common prefix}
}
Document
Accurate k-mer Classification Using Read Profiles

Authors: Yoshihiko Suzuki and Gene Myers

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)


Abstract
Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of a k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has been based on simply analyzing the histogram of counts of all the k-mers in a data set, thus ignoring the positional context and dependency between multiple k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between the positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method in every simulation dataset of fruit fly and human with various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads, in a typical fruit fly simulation data set with a 40× coverage. The resulting, more accurate k-mer classifications by ClassPro are in principle expected to improve any k-mer-based downstream analyses for sequenced reads such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning to name but a few. ClassPro is available at https://github.com/yoshihikosuzuki/ClassPro.

Cite as

Yoshihiko Suzuki and Gene Myers. Accurate k-mer Classification Using Read Profiles. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 10:1-10:20, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)


Copy BibTex To Clipboard

@InProceedings{suzuki_et_al:LIPIcs.WABI.2022.10,
  author =	{Suzuki, Yoshihiko and Myers, Gene},
  title =	{{Accurate k-mer Classification Using Read Profiles}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{10:1--10:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.10},
  URN =		{urn:nbn:de:0030-drops-170446},
  doi =		{10.4230/LIPIcs.WABI.2022.10},
  annote =	{Keywords: K-mer, K-mer count, K-mer classification, HiFi sequencing}
}
Document
Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs

Authors: Djamal Belazzougui and Fabio Cunial

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)


Abstract
Given a string T on an alphabet of size sigma, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log sigma) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.

Cite as

Djamal Belazzougui and Fabio Cunial. Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 10:1-10:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)


Copy BibTex To Clipboard

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2019.10,
  author =	{Belazzougui, Djamal and Cunial, Fabio},
  title =	{{Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{10:1--10:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and P. Pissis, Solon},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.10},
  URN =		{urn:nbn:de:0030-drops-104811},
  doi =		{10.4230/LIPIcs.CPM.2019.10},
  annote =	{Keywords: BWT, suffix tree, CDAWG, de Bruijn graph, maximal repeat, string depth, contraction, bidirectional index}
}
Document
Next Generation Sequencing (Dagstuhl Seminar 16351)

Authors: Gene Myers, Mihai Pop, Knut Reinert, and Tandy Warnow

Published in: Dagstuhl Reports, Volume 6, Issue 8 (2017)


Abstract
Next Generation Sequencing (NGS) data have begun to appear in many applications that are clinically relevant, such as resequencing of cancer patients, disease-gene discovery and diagnostics for rare diseases, microbiome analyses, and gene expression profiling. The analysis of sequencing data is demanding because of the enormous data volume and the need for fast turnaround time, accuracy, reproducibility, and data security. This Dagstuhl Seminar aimed at a free and deep exchange of ideas and needs between the communities of algorithmicists and theoreticians and practitioners from the biomedical field. It identified several relevant fields such as data structures and algorithms for large data sets, hardware acceleration, new problems in the upcoming age of genomes, etc. which were discussed in breakout groups.

Cite as

Gene Myers, Mihai Pop, Knut Reinert, and Tandy Warnow. Next Generation Sequencing (Dagstuhl Seminar 16351). In Dagstuhl Reports, Volume 6, Issue 8, pp. 91-130, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)


Copy BibTex To Clipboard

@Article{myers_et_al:DagRep.6.8.91,
  author =	{Myers, Gene and Pop, Mihai and Reinert, Knut and Warnow, Tandy},
  title =	{{Next Generation Sequencing (Dagstuhl Seminar 16351)}},
  pages =	{91--130},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2017},
  volume =	{6},
  number =	{8},
  editor =	{Myers, Gene and Pop, Mihai and Reinert, Knut and Warnow, Tandy},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagRep.6.8.91},
  URN =		{urn:nbn:de:0030-drops-68395},
  doi =		{10.4230/DagRep.6.8.91},
  annote =	{Keywords: Cancer, DNA Sequence Assembly, Expression Profiles, Next Generation Sequencing, Sequence analysis, Software Engineering (Tools \& Libraries)}
}
  • Refine by Author
  • 3 Myers, Gene
  • 1 Belazzougui, Djamal
  • 1 Cunial, Fabio
  • 1 Díaz-Domínguez, Diego
  • 1 Dönges, Saska
  • Show More...

  • Refine by Classification
  • 2 Theory of computation → Design and analysis of algorithms
  • 1 Applied computing → Molecular sequence analysis
  • 1 Theory of computation → Data structures design and analysis
  • 1 Theory of computation → Theory and algorithms for application domains

  • Refine by Keyword
  • 1 BWT
  • 1 CDAWG
  • 1 Cancer
  • 1 DNA Sequence Assembly
  • 1 Expression Profiles
  • Show More...

  • Refine by Type
  • 5 document

  • Refine by Publication Year
  • 2 2023
  • 1 2017
  • 1 2019
  • 1 2022

Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail