DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2024.8

Applying the Safe-And-Complete Framework to Practical Genome Assembly

Authors: Sebastian Schmidt, Santeri Toivonen, Paul Medvedev, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Abstract

Despite the long history of genome assembly research, there remains a large gap between the theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs (simple omnitigs), giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the D. melanogaster and the C. elegans genomes. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible additional computational costs and either no or a small increase in the number of misassemblies.

Cite as

Sebastian Schmidt, Santeri Toivonen, Paul Medvedev, and Alexandru I. Tomescu. Applying the Safe-And-Complete Framework to Practical Genome Assembly. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 8:1-8:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{schmidt_et_al:LIPIcs.WABI.2024.8,
  author =	{Schmidt, Sebastian and Toivonen, Santeri and Medvedev, Paul and Tomescu, Alexandru I.},
  title =	{{Applying the Safe-And-Complete Framework to Practical Genome Assembly}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{8:1--8:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.8},
  URN =		{urn:nbn:de:0030-drops-206520},
  doi =		{10.4230/LIPIcs.WABI.2024.8},
  annote =	{Keywords: Genome assembly, Omnitigs, Safe-and-complete framework, graph algorithm, HiFi sequencing data, Assembly evaluation}
}

Document

DOI: 10.4230/LIPIcs.WABI.2024.22

Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads

Authors: Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, and Mingfu Shao

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Abstract

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequence from these anchor-enabled, ultra-high coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms leveraging anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a kmer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. The optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.

Cite as

Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, and Mingfu Shao. Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 22:1-22:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{zang_et_al:LIPIcs.WABI.2024.22,
  author =	{Zang, Xiaofei Carl and Li, Xiang and Metcalfe, Kyle and Ben-Yehezkel, Tuval and Kelley, Ryan and Shao, Mingfu},
  title =	{{Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{22:1--22:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.22},
  URN =		{urn:nbn:de:0030-drops-206660},
  doi =		{10.4230/LIPIcs.WABI.2024.22},
  annote =	{Keywords: Genome assembly, de Bruijn graph, synthetic long reads, anchor-guided assembly, LoopSeq}
}

Document

DOI: 10.4230/LIPIcs.STACS.2020.46

Information Distance Revisited

Authors: Bruno Bauwens

Published in: LIPIcs, Volume 154, 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)

Abstract

We consider the notion of information distance between two objects x and y introduced by Bennett, Gács, Li, Vitanyi, and Zurek [C. H. Bennett et al., 1998] as the minimal length of a program that computes x from y as well as computing y from x, and study different versions of this notion. In the above paper, it was shown that the prefix version of information distance equals max (K(x|y),K(y|x)) up to additive logarithmic terms. It was claimed by Mahmud [Mahmud, 2009] that this equality holds up to additive O(1)-precision. We show that this claim is false, but does hold if the distance is at least logarithmic. This implies that the original definition provides a metric on strings that are at superlogarithmically separated.

Cite as

Bruno Bauwens. Information Distance Revisited. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 46:1-46:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bauwens:LIPIcs.STACS.2020.46,
  author =	{Bauwens, Bruno},
  title =	{{Information Distance Revisited}},
  booktitle =	{37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)},
  pages =	{46:1--46:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-140-5},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{154},
  editor =	{Paul, Christophe and Bl\"{a}ser, Markus},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2020.46},
  URN =		{urn:nbn:de:0030-drops-119071},
  doi =		{10.4230/LIPIcs.STACS.2020.46},
  annote =	{Keywords: Kolmogorov complexity, algorithmic information distance}
}

Document

DOI: 10.4230/LIPIcs.WABI.2019.24

Synteny Paths for Assembly Graphs Comparison

Authors: Evgeny Polevikov and Mikhail Kolmogorov

Published in: LIPIcs, Volume 143, 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)

Abstract

Despite the recent developments of long-read sequencing technologies, it is still difficult to produce complete assemblies of eukaryotic genomes in an automated fashion. Genome assembly software typically output assembled fragments (contigs) along with assembly graphs, that encode all possible layouts of these contigs. Graph representation of the assembled genome can be useful for gene discovery, haplotyping, structural variations analysis and other applications. To facilitate the development of new graph-based approaches, it is important to develop algorithms for comparison and evaluation of assembly graphs produced by different software. In this work, we introduce synteny paths: maximal paths of homologous sequence between the compared assembly graphs. We describe Asgan - an algorithm for efficient synteny paths decomposition, and use it to evaluate assembly graphs of various bacterial assemblies produced by different approaches. We then apply Asgan to discover structural variations between the assemblies of 15 Drosophila genomes, and show that synteny paths are robust to contig fragmentation. The Asgan tool is freely available at: https://github.com/epolevikov/Asgan.

Cite as

Evgeny Polevikov and Mikhail Kolmogorov. Synteny Paths for Assembly Graphs Comparison. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 24:1-24:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{polevikov_et_al:LIPIcs.WABI.2019.24,
  author =	{Polevikov, Evgeny and Kolmogorov, Mikhail},
  title =	{{Synteny Paths for Assembly Graphs Comparison}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{24:1--24:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.24},
  URN =		{urn:nbn:de:0030-drops-110545},
  doi =		{10.4230/LIPIcs.WABI.2019.24},
  annote =	{Keywords: Assembly graphs, Genome assembly, Synteny blocks, Comparative Genomics}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2018.2

Plain Stopping Time and Conditional Complexities Revisited

Authors: Mikhail Andreev, Gleb Posobin, and Alexander Shen

Published in: LIPIcs, Volume 117, 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018)

Abstract

In this paper we analyze the notion of "stopping time complexity", the amount of information needed to specify when to stop while reading an infinite sequence. This notion was introduced by Vovk and Pavlovic [Vovk and Pavlovic, 2016]. It turns out that plain stopping time complexity of a binary string x could be equivalently defined as (a) the minimal plain complexity of a Turing machine that stops after reading x on a one-directional input tape; (b) the minimal plain complexity of an algorithm that enumerates a prefix-free set containing x; (c) the conditional complexity C(x|x*) where x in the condition is understood as a prefix of an infinite binary sequence while the first x is understood as a terminated binary string; (d) as a minimal upper semicomputable function K such that each binary sequence has at most 2^n prefixes z such that K(z)<n; (e) as maxC^X(x) where C^X(z) is plain Kolmogorov complexity of z relative to oracle X and the maximum is taken over all extensions X of x. We also show that some of these equivalent definitions become non-equivalent in the more general setting where the condition y and the object x may differ, and answer an open question from Chernov, Hutter and Schmidhuber [Alexey V. Chernov et al., 2007].

Cite as

Mikhail Andreev, Gleb Posobin, and Alexander Shen. Plain Stopping Time and Conditional Complexities Revisited. In 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 117, pp. 2:1-2:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{andreev_et_al:LIPIcs.MFCS.2018.2,
  author =	{Andreev, Mikhail and Posobin, Gleb and Shen, Alexander},
  title =	{{Plain Stopping Time and Conditional Complexities Revisited}},
  booktitle =	{43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018)},
  pages =	{2:1--2:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-086-6},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{117},
  editor =	{Potapov, Igor and Spirakis, Paul and Worrell, James},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2018.2},
  URN =		{urn:nbn:de:0030-drops-95842},
  doi =		{10.4230/LIPIcs.MFCS.2018.2},
  annote =	{Keywords: Kolmogorov complexity, stopping time complexity, structured conditional complexity, algorithmic information theory}
}

5 Search Results for "Kolmogorov, Mikhail"

Applying the Safe-And-Complete Framework to Practical Genome Assembly

Abstract

Cite as

Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads

Abstract

Cite as

Information Distance Revisited

Abstract

Cite as

Synteny Paths for Assembly Graphs Comparison

Abstract

Cite as

Plain Stopping Time and Conditional Complexities Revisited

Abstract

Cite as

Thanks for your feedback!

Could not send message