4 Search Results for "Leimeister, Chris"

Document

DOI: 10.4230/LIPIcs.WABI.2025.20

A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences

Authors: Haonan Wu, Antonio Blanca, and Paul Medvedev

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

K-mer-based analysis of genomic data is ubiquitous, but the presence of repetitive k-mers continues to pose problems for the accuracy of many methods. For example, the Mash tool (Ondov et al. 2016) can accurately estimate the substitution rate between two low-repetitive sequences from their k-mer sketches; however, it is inaccurate on repetitive sequences such as the centromere of a human chromosome. Follow-up work by Blanca et al. (2021) has attempted to model how mutations affect k-mer sets based on strong assumptions that the sequence is non-repetitive and that mutations do not create spurious k-mer matches. However, the theoretical foundations for extending an estimator like Mash to work in the presence of repeat sequences have been lacking. In this work, we relax the non-repetitive assumption and propose a novel estimator for the mutation rate. We derive theoretical bounds on our estimator’s bias. Our experiments show that it remains accurate for repetitive genomic sequences, such as the alpha satellite higher order repeats in centromeres. We demonstrate our estimator’s robustness across diverse datasets and various ranges of the substitution rate and k-mer size. Finally, we show how sketching can be used to avoid dealing with large k-mer sets while retaining accuracy. Our software is available at https://github.com/medvedevgroup/Repeat-Aware_Substitution_Rate_Estimator.

Cite as

Haonan Wu, Antonio Blanca, and Paul Medvedev. A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 20:1-20:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{wu_et_al:LIPIcs.WABI.2025.20,
  author =	{Wu, Haonan and Blanca, Antonio and Medvedev, Paul},
  title =	{{A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{20:1--20:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.20},
  URN =		{urn:nbn:de:0030-drops-239465},
  doi =		{10.4230/LIPIcs.WABI.2025.20},
  annote =	{Keywords: k-mers, sketching, mutation rates}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.7

Sequence Similarity Estimation by Random Subsequence Sketching

Authors: Ke Chen, Vinamratha Pattar, and Mingfu Shao

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Sequence similarity estimation is essential for many bioinformatics tasks, including functional annotation, phylogenetic analysis, and overlap graph construction. Alignment-free methods aim to solve large-scale sequence similarity estimation by mapping sequences to more easily comparable features that can approximate edit distances efficiently. Substrings or k-mers, as the dominant choice of features, face an unavoidable compromise between sensitivity and specificity when selecting the proper k-value. Recently, subsequence-based features have shown improved performance, but they are computationally demanding, and determining the ideal subsequence length remains an intricate art. In this work, we introduce SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences. The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. Through experiments on benchmark datasets, we demonstrate that SubseqSketch is both efficient and effective across various alignment-free tasks, including nearest neighbor search and phylogenetic clustering. A C++ implementation of SubseqSketch is openly available at https://github.com/Shao-Group/SubseqSketch.

Cite as

Ke Chen, Vinamratha Pattar, and Mingfu Shao. Sequence Similarity Estimation by Random Subsequence Sketching. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 7:1-7:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{chen_et_al:LIPIcs.WABI.2025.7,
  author =	{Chen, Ke and Pattar, Vinamratha and Shao, Mingfu},
  title =	{{Sequence Similarity Estimation by Random Subsequence Sketching}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{7:1--7:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.7},
  URN =		{urn:nbn:de:0030-drops-239332},
  doi =		{10.4230/LIPIcs.WABI.2025.7},
  annote =	{Keywords: Alignment-free sequence comparison, Phylogenetic clustering, Nearest neighbor search, Edit distance embedding}
}

Document

Research

DOI: 10.4230/OASIcs.Grossi.20

Subsequence-Based Indices for Genome Sequence Analysis

Authors: Giovanni Buzzega, Alessio Conte, Veronica Guerrini, Giulia Punzi, Giovanna Rosone, and Lorenzo Tattini

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

Compact indices are a fundamental tool in string analysis, even more so in bioinformatics, where genomic sequences can reach billions in length. This paper presents some recent results in which Roberto Grossi has been involved, showing how some of these indices do more than just efficiently represent data, but rather are able to bring out salient information within it, which can be exploited for their downstream analysis. Specifically, we first review a recently-introduced method [Guerrini et al., 2023] that employs the Burrows-Wheeler Transform to build reasonably accurate phylogenetic trees in an assembly-free scenario. We then describe a recent practical tool [Buzzega et al., 2025] for indexing Maximal Common Subsequences between strings, which can enable analysis of genomic sequence similarity. Experimentally, we show that the results produced by the one index are consistent with the expectations about the results of the other index.

Cite as

Giovanni Buzzega, Alessio Conte, Veronica Guerrini, Giulia Punzi, Giovanna Rosone, and Lorenzo Tattini. Subsequence-Based Indices for Genome Sequence Analysis. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 20:1-20:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{buzzega_et_al:OASIcs.Grossi.20,
  author =	{Buzzega, Giovanni and Conte, Alessio and Guerrini, Veronica and Punzi, Giulia and Rosone, Giovanna and Tattini, Lorenzo},
  title =	{{Subsequence-Based Indices for Genome Sequence Analysis}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{20:1--20:21},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.20},
  URN =		{urn:nbn:de:0030-drops-238199},
  doi =		{10.4230/OASIcs.Grossi.20},
  annote =	{Keywords: String Indices, Burrows-Wheeler Transform, Maximal Common Subsequences, Sequence Analysis, Phylogeny}
}

@InProceedings{buzzega_et_al:OASIcs.Grossi.20,
  author =	{Buzzega, Giovanni and Conte, Alessio and Guerrini, Veronica and Punzi, Giulia and Rosone, Giovanna and Tattini, Lorenzo},
  title =	{{Subsequence-Based Indices for Genome Sequence Analysis}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{20:1--20:21},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.20},
  URN =		{urn:nbn:de:0030-drops-238199},
  doi =		{10.4230/OASIcs.Grossi.20},
  annote =	{Keywords: String Indices, Burrows-Wheeler Transform, Maximal Common Subsequences, Sequence Analysis, Phylogeny}
}

Document

DOI: 10.4230/OASIcs.GCB.2013.24

Alignment-free sequence comparison with spaced k-mers

Authors: Marcus Boden, Martin Schöneich, Sebastian Horwege, Sebastian Lindner, Chris Leimeister, and Burkhard Morgenstern

Published in: OASIcs, Volume 34, German Conference on Bioinformatics 2013

Abstract

Alignment-free methods are increasingly used for genome analysis and phylogeny reconstruction since they circumvent various difficulties of traditional approaches that rely on multiple sequence alignments. In particular, they are much faster than alignment-based methods. Most alignment-free approaches work by analyzing the k-mer composition of sequences. In this paper, we propose to use 'spaced k-mers', i.e. patterns of deterministic and 'don't care' positions instead of contiguous k-mers. Using simulated and real-world sequence data, we demonstrate that this approach produces better phylogenetic trees than alignment-free methods that rely on contiguous k-mers. In addition, distances calculated with spaced k-mers appear to be statistically more stable than distances based on contiguous k-mers.

Cite as

Marcus Boden, Martin Schöneich, Sebastian Horwege, Sebastian Lindner, Chris Leimeister, and Burkhard Morgenstern. Alignment-free sequence comparison with spaced k-mers. In German Conference on Bioinformatics 2013. Open Access Series in Informatics (OASIcs), Volume 34, pp. 24-34, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2013)

Copy BibTex To Clipboard

@InProceedings{boden_et_al:OASIcs.GCB.2013.24,
  author =	{Boden, Marcus and Sch\"{o}neich, Martin and Horwege, Sebastian and Lindner, Sebastian and Leimeister, Chris and Morgenstern, Burkhard},
  title =	{{Alignment-free sequence comparison with spaced k-mers}},
  booktitle =	{German Conference on Bioinformatics 2013},
  pages =	{24--34},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-939897-59-0},
  ISSN =	{2190-6807},
  year =	{2013},
  volume =	{34},
  editor =	{Bei{\ss}barth, Tim and Kollmar, Martin and Leha, Andreas and Morgenstern, Burkhard and Schultz, Anne-Kathrin and Waack, Stephan and Wingender, Edgar},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.GCB.2013.24},
  URN =		{urn:nbn:de:0030-drops-42334},
  doi =		{10.4230/OASIcs.GCB.2013.24},
  annote =	{Keywords: Alignment-free sequence comparison, phylogeny reconstruction}
}

Refine by Type
4 Document/PDF
3 Document/HTML

Refine by Publication Year
3 2025
1 2013

Refine by Author
1 Blanca, Antonio
1 Boden, Marcus
1 Buzzega, Giovanni
1 Chen, Ke
1 Conte, Alessio
Show More...

Refine by Series/Journal
2 LIPIcs
2 OASIcs

Refine by Classification
2 Applied computing → Bioinformatics
2 Applied computing → Computational biology
1 Theory of computation → Data structures design and analysis
1 Theory of computation → Design and analysis of algorithms

Refine by Keyword
2 Alignment-free sequence comparison
1 Burrows-Wheeler Transform
1 Edit distance embedding
1 Maximal Common Subsequences
1 Nearest neighbor search
Show More...

4 Search Results for "Leimeister, Chris"

A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences

Abstract

Cite as

Sequence Similarity Estimation by Random Subsequence Sketching

Abstract

Cite as

Subsequence-Based Indices for Genome Sequence Analysis

Abstract

Cite as

Alignment-free sequence comparison with spaced k-mers

Abstract

Cite as

Thanks for your feedback!

Could not send message