DROPS

Document

DOI: 10.4230/LIPIcs.CPM.2023.22

Merging Sorted Lists of Similar Strings

Authors: Gene Myers

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem typically solved practically in O(M log T) time with a priority-queue data structure the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T+S) worst-case method that performs no more character comparisons than the sum of the lengths of all the strings S, and another O(M log (T/e¯)+S) method that becomes progressively more efficient as a function of the fraction of equal elements e¯ = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.

Cite as

Gene Myers. Merging Sorted Lists of Similar Strings. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{myers:LIPIcs.CPM.2023.22,
  author =	{Myers, Gene},
  title =	{{Merging Sorted Lists of Similar Strings}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{22:1--22:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.22},
  URN =		{urn:nbn:de:0030-drops-179763},
  doi =		{10.4230/LIPIcs.CPM.2023.22},
  annote =	{Keywords: heap, trie, longest common prefix}
}

Document

DOI: 10.4230/LIPIcs.WABI.2022.10

Accurate k-mer Classification Using Read Profiles

Authors: Yoshihiko Suzuki and Gene Myers

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Abstract

Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of a k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has been based on simply analyzing the histogram of counts of all the k-mers in a data set, thus ignoring the positional context and dependency between multiple k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between the positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method in every simulation dataset of fruit fly and human with various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads, in a typical fruit fly simulation data set with a 40× coverage. The resulting, more accurate k-mer classifications by ClassPro are in principle expected to improve any k-mer-based downstream analyses for sequenced reads such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning to name but a few. ClassPro is available at https://github.com/yoshihikosuzuki/ClassPro.

Cite as

Yoshihiko Suzuki and Gene Myers. Accurate k-mer Classification Using Read Profiles. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 10:1-10:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{suzuki_et_al:LIPIcs.WABI.2022.10,
  author =	{Suzuki, Yoshihiko and Myers, Gene},
  title =	{{Accurate k-mer Classification Using Read Profiles}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{10:1--10:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.10},
  URN =		{urn:nbn:de:0030-drops-170446},
  doi =		{10.4230/LIPIcs.WABI.2022.10},
  annote =	{Keywords: K-mer, K-mer count, K-mer classification, HiFi sequencing}
}

Document

DOI: 10.4230/DagSemProc.04292.1

Data Mining: The Next Generation

Authors: Raghu Ramakrishnan, Rakesh Agrawal, Johann-Christoph Freytag, Toni Bollinger, Christopher W. Clifton, Saso Dzeroski, Jochen Hipp, Daniel Keim, Stefan Kramer, Hans-Peter Kriegel, Ulf Leser, Bing Liu, Heikki Mannila, Rosa Meo, Shinichi Morishita, Raymond Ng, Jian Pei, Prabhakar Raghavan, Myra Spiliopoulou, Jaideep Srivastava, and Vicenc Torra

Published in: Dagstuhl Seminar Proceedings, Volume 4292, Perspectives Workshop: Data Mining: The Next Generation (2005)

Abstract

Data Mining (DM) has enjoyed great popularity in recent years, with advances in both research and commercialization. The first generation of DM research and development has yielded several commercially available systems, both stand-alone and integrated with database systems; produced scalable versions of algorithms for many classical DM problems; and introduced novel pattern discovery problems. In recent years, research has tended to be fragmented into several distinct pockets without a comprehensive framework. Researchers have continued to work largely within the parameters of their parent disciplines, building upon existing and distinct research methodologies. Even when they address a common problem (for example, how to cluster a dataset) they apply different techniques, different perspectives on what the important issues are, and different evaluation criteria. While different approaches can be complementary, and such a diversity is ultimately a strength of the field, better communication across disciplines is required if DM is to forge a distinct identity with a core set of principles, perspectives, and challenges that differentiate it from each of the parent disciplines. Further, while the amount and complexity of data continues to grow rapidly, and the task of distilling useful insight continues to be central, serious concerns have emerged about social implications of DM. Addressing these concerns will require advances in our theoretical understanding of the principles that underlie DM algorithms, as well as an integrated approach to security and privacy in all phases of data management and analysis. Researchers from a variety of backgrounds assembled at Dagstuhl to re-assess the current directions of the field, to identify critical problems that require attention, and to discuss ways to increase the flow of ideas across the different disciplines that DM has brought together. The workshop did not seek to draw up an agenda for the field of DM. Rather, it offers the participants’ perspective on two technical directions – compositionality and privacy – and describes some important application challenges that drove the discussion. Both of these directions illustrate the opportunities for crossdisciplinary research, and there was broad agreement that they represent important and timely areas for further work; of course, the choice of these directions as topics for discussion also reflects the personal interests and biases of the workshop participants.

Cite as

Raghu Ramakrishnan, Rakesh Agrawal, Johann-Christoph Freytag, Toni Bollinger, Christopher W. Clifton, Saso Dzeroski, Jochen Hipp, Daniel Keim, Stefan Kramer, Hans-Peter Kriegel, Ulf Leser, Bing Liu, Heikki Mannila, Rosa Meo, Shinichi Morishita, Raymond Ng, Jian Pei, Prabhakar Raghavan, Myra Spiliopoulou, Jaideep Srivastava, and Vicenc Torra. Data Mining: The Next Generation. In Perspectives Workshop: Data Mining: The Next Generation. Dagstuhl Seminar Proceedings, Volume 4292, pp. 1-33, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

Copy BibTex To Clipboard

@InProceedings{ramakrishnan_et_al:DagSemProc.04292.1,
  author =	{Ramakrishnan, Raghu and Agrawal, Rakesh and Freytag, Johann-Christoph and Bollinger, Toni and Clifton, Christopher W. and Dzeroski, Saso and Hipp, Jochen and Keim, Daniel and Kramer, Stefan and Kriegel, Hans-Peter and Leser, Ulf and Liu, Bing and Mannila, Heikki and Meo, Rosa and Morishita, Shinichi and Ng, Raymond and Pei, Jian and Raghavan, Prabhakar and Spiliopoulou, Myra and Srivastava, Jaideep and Torra, Vicenc},
  title =	{{Data Mining: The Next Generation}},
  booktitle =	{Perspectives Workshop: Data Mining: The Next Generation},
  pages =	{1--33},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{4292},
  editor =	{Rakesh Agrawal and Johann Christoph Freytag and Raghu Ramakrishnan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.04292.1},
  URN =		{urn:nbn:de:0030-drops-2709},
  doi =		{10.4230/DagSemProc.04292.1},
  annote =	{Keywords: Data mining, databases, artificial intelligence, machine learning, statistics, semantics}
}

@InProceedings{ramakrishnan_et_al:DagSemProc.04292.1,
  author =	{Ramakrishnan, Raghu and Agrawal, Rakesh and Freytag, Johann-Christoph and Bollinger, Toni and Clifton, Christopher W. and Dzeroski, Saso and Hipp, Jochen and Keim, Daniel and Kramer, Stefan and Kriegel, Hans-Peter and Leser, Ulf and Liu, Bing and Mannila, Heikki and Meo, Rosa and Morishita, Shinichi and Ng, Raymond and Pei, Jian and Raghavan, Prabhakar and Spiliopoulou, Myra and Srivastava, Jaideep and Torra, Vicenc},
  title =	{{Data Mining: The Next Generation}},
  booktitle =	{Perspectives Workshop: Data Mining: The Next Generation},
  pages =	{1--33},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{4292},
  editor =	{Rakesh Agrawal and Johann Christoph Freytag and Raghu Ramakrishnan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.04292.1},
  URN =		{urn:nbn:de:0030-drops-2709},
  doi =		{10.4230/DagSemProc.04292.1},
  annote =	{Keywords: Data mining, databases, artificial intelligence, machine learning, statistics, semantics}
}

3 Search Results for "Morishita, Shinichi"

Merging Sorted Lists of Similar Strings

Abstract

Cite as

Accurate k-mer Classification Using Read Profiles

Abstract

Cite as

Data Mining: The Next Generation

Abstract

Cite as

Thanks for your feedback!

Could not send message