DROPS

Document

DOI: 10.4230/LIPIcs.SEA.2024.14

Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions

Authors: Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

Given a flow network, the Minimum Flow Decomposition (MFD) problem is finding the smallest possible set of weighted paths whose superposition equals the flow. It is a classical, strongly NP-hard problem that is proven to be useful in RNA transcript assembly and applications outside of Bioinformatics. We improve an existing ILP (Integer Linear Programming) model by Dias et al. [RECOMB 2022] for DAGs by decreasing the solver’s search space using solution safety and several other optimizations. This results in a significant speedup compared to the original ILP, of up to 34× on average on the hardest instances. Moreover, we show that our optimizations apply also to MFD problem variants, resulting in speedups that go up to 219× on the hardest instances. We also developed an ILP model of reduced dimensionality for an MFD variant in which the solution path weights are restricted to a given set. This model can find an optimal MFD solution for most instances, and overall, its accuracy significantly outperforms that of previous greedy algorithms while being up to an order of magnitude faster than our optimized ILP.

Cite as

Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, and Alexandru I. Tomescu. Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{grigorjew_et_al:LIPIcs.SEA.2024.14,
  author =	{Grigorjew, Andreas and Dias, Fernando H. C. and Cracco, Andrea and Rizzi, Romeo and Tomescu, Alexandru I.},
  title =	{{Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{14:1--14:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.14},
  URN =		{urn:nbn:de:0030-drops-203792},
  doi =		{10.4230/LIPIcs.SEA.2024.14},
  annote =	{Keywords: Flow decomposition, Integer Linear Programming, Safety, RNA-seq, RNA transcript assembly, isoform}
}

Document

DOI: 10.4230/LIPIcs.SEA.2024.26

Scalable Hard Instances for Independent Set Reconfiguration

Authors: Takehide Soh, Takumu Watanabe, Jun Kawahara, Akira Suzuki, and Takehiro Ito

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

The Token Jumping problem, also known as the independent set reconfiguration problem under the token jumping model, is defined as follows: Given a graph and two same-sized independent sets, determine whether one can be transformed into the other via a sequence of independent sets. Token Jumping has been extensively studied, mainly from the viewpoint of algorithmic theory, but its practical study has just begun. To develop a practically good solver, it is important to construct benchmark datasets that are scalable and hard. Here, "scalable" means the ability to change the scale of the instance while maintaining its characteristics by adjusting the given parameters; and "hard" means that the instance can become so difficult that it cannot be solved within a practical time frame by a solver. In this paper, we propose four types of instance series for Token Jumping. Our instance series is scalable in the sense that instance scales are controlled by the number of vertices. To establish their hardness, we focus on the numbers of transformation steps; our instance series requires exponential numbers of steps with respect to the number of vertices. Interestingly, three types of instance series are constructed by importing theories developed by algorithmic research. We experimentally evaluate the scalability and hardness of the proposed instance series, using the SAT solver and award-winning solvers of the international competition for Token Jumping.

Cite as

Takehide Soh, Takumu Watanabe, Jun Kawahara, Akira Suzuki, and Takehiro Ito. Scalable Hard Instances for Independent Set Reconfiguration. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 26:1-26:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{soh_et_al:LIPIcs.SEA.2024.26,
  author =	{Soh, Takehide and Watanabe, Takumu and Kawahara, Jun and Suzuki, Akira and Ito, Takehiro},
  title =	{{Scalable Hard Instances for Independent Set Reconfiguration}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{26:1--26:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.26},
  URN =		{urn:nbn:de:0030-drops-203913},
  doi =		{10.4230/LIPIcs.SEA.2024.26},
  annote =	{Keywords: Combinatorial reconfiguration, Benckmark dataset, Graph Algorithm, PSPACE-complete}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2024.76

Solution Discovery via Reconfiguration for Problems in P

Authors: Mario Grobler, Stephanie Maaz, Nicole Megow, Amer E. Mouawad, Vijayaragunathan Ramamoorthi, Daniel Schmand, and Sebastian Siebertz

Published in: LIPIcs, Volume 297, 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)

Abstract

In the recently introduced framework of solution discovery via reconfiguration [Fellows et al., ECAI 2023], we are given an initial configuration of k tokens on a graph and the question is whether we can transform this configuration into a feasible solution (for some problem) via a bounded number b of small modification steps. In this work, we study solution discovery variants of polynomial-time solvable problems, namely Spanning Tree Discovery, Shortest Path Discovery, Matching Discovery, and Vertex/Edge Cut Discovery in the unrestricted token addition/removal model, the token jumping model, and the token sliding model. In the unrestricted token addition/removal model, we show that all four discovery variants remain in P. For the token jumping model we also prove containment in P, except for Vertex/Edge Cut Discovery, for which we prove NP-completeness. Finally, in the token sliding model, almost all considered problems become NP-complete, the exception being Spanning Tree Discovery, which remains polynomial-time solvable. We then study the parameterized complexity of the NP-complete problems and provide a full classification of tractability with respect to the parameters solution size (number of tokens) k and transformation budget (number of steps) b. Along the way, we observe strong connections between the solution discovery variants of our base problems and their (weighted) rainbow variants as well as their red-blue variants with cardinality constraints.

Cite as

Mario Grobler, Stephanie Maaz, Nicole Megow, Amer E. Mouawad, Vijayaragunathan Ramamoorthi, Daniel Schmand, and Sebastian Siebertz. Solution Discovery via Reconfiguration for Problems in P. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 76:1-76:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{grobler_et_al:LIPIcs.ICALP.2024.76,
  author =	{Grobler, Mario and Maaz, Stephanie and Megow, Nicole and Mouawad, Amer E. and Ramamoorthi, Vijayaragunathan and Schmand, Daniel and Siebertz, Sebastian},
  title =	{{Solution Discovery via Reconfiguration for Problems in P}},
  booktitle =	{51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)},
  pages =	{76:1--76:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-322-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{297},
  editor =	{Bringmann, Karl and Grohe, Martin and Puppis, Gabriele and Svensson, Ola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2024.76},
  URN =		{urn:nbn:de:0030-drops-202195},
  doi =		{10.4230/LIPIcs.ICALP.2024.76},
  annote =	{Keywords: solution discovery, reconfiguration, spanning tree, shortest path, matching, cut}
}

@InProceedings{grobler_et_al:LIPIcs.ICALP.2024.76,
  author =	{Grobler, Mario and Maaz, Stephanie and Megow, Nicole and Mouawad, Amer E. and Ramamoorthi, Vijayaragunathan and Schmand, Daniel and Siebertz, Sebastian},
  title =	{{Solution Discovery via Reconfiguration for Problems in P}},
  booktitle =	{51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)},
  pages =	{76:1--76:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-322-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{297},
  editor =	{Bringmann, Karl and Grohe, Martin and Puppis, Gabriele and Svensson, Ola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2024.76},
  URN =		{urn:nbn:de:0030-drops-202195},
  doi =		{10.4230/LIPIcs.ICALP.2024.76},
  annote =	{Keywords: solution discovery, reconfiguration, spanning tree, shortest path, matching, cut}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2024.85

Optimal PSPACE-Hardness of Approximating Set Cover Reconfiguration

Authors: Shuichi Hirahara and Naoto Ohsaka

Published in: LIPIcs, Volume 297, 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)

Abstract

In the Minmax Set Cover Reconfiguration problem, given a set system ℱ over a universe 𝒰 and its two covers 𝒞^start and 𝒞^goal of size k, we wish to transform 𝒞^start into 𝒞^goal by repeatedly adding or removing a single set of ℱ while covering the universe 𝒰 in any intermediate state. Then, the objective is to minimize the maximum size of any intermediate cover during transformation. We prove that Minmax Set Cover Reconfiguration and Minmax Dominating Set Reconfiguration are PSPACE-hard to approximate within a factor of 2-(1/polyloglog N), where N is the size of the universe and the number of vertices in a graph, respectively, improving upon Ohsaka (SODA 2024) [Ohsaka, 2024] and Karthik C. S. and Manurangsi (2023) [Karthik C. S. and Manurangsi, 2023]. This is the first result that exhibits a sharp threshold for the approximation factor of any reconfiguration problem because both problems admit a 2-factor approximation algorithm as per Ito, Demaine, Harvey, Papadimitriou, Sideri, Uehara, and Uno (Theor. Comput. Sci., 2011) [Takehiro Ito et al., 2011]. Our proof is based on a reconfiguration analogue of the FGLSS reduction [Feige et al., 1996] from Probabilistically Checkable Reconfiguration Proofs of Hirahara and Ohsaka (STOC 2024) [Hirahara and Ohsaka, 2024]. We also prove that for any constant ε ∈ (0,1), Minmax Hypergraph Vertex Cover Reconfiguration on poly(ε^-1)-uniform hypergraphs is PSPACE-hard to approximate within a factor of 2-ε.

Cite as

Shuichi Hirahara and Naoto Ohsaka. Optimal PSPACE-Hardness of Approximating Set Cover Reconfiguration. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 85:1-85:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{hirahara_et_al:LIPIcs.ICALP.2024.85,
  author =	{Hirahara, Shuichi and Ohsaka, Naoto},
  title =	{{Optimal PSPACE-Hardness of Approximating Set Cover Reconfiguration}},
  booktitle =	{51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)},
  pages =	{85:1--85:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-322-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{297},
  editor =	{Bringmann, Karl and Grohe, Martin and Puppis, Gabriele and Svensson, Ola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2024.85},
  URN =		{urn:nbn:de:0030-drops-202283},
  doi =		{10.4230/LIPIcs.ICALP.2024.85},
  annote =	{Keywords: reconfiguration problems, hardness of approximation, probabilistic proof systems, FGLSS reduction}
}

Document

DOI: 10.4230/LIPIcs.WABI.2023.14

Exact Sketch-Based Read Mapping

Authors: Tizian Schulz and Paul Medvedev

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in {O}(|t| + |p| + 𝓁²) time and Θ(𝓁²) space, where |t| is the number of k-mers inside the sketch of the reference, |p| is the number of k-mers inside the read’s sketch and 𝓁 is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.

Cite as

Tizian Schulz and Paul Medvedev. Exact Sketch-Based Read Mapping. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{schulz_et_al:LIPIcs.WABI.2023.14,
  author =	{Schulz, Tizian and Medvedev, Paul},
  title =	{{Exact Sketch-Based Read Mapping}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{14:1--14:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.14},
  URN =		{urn:nbn:de:0030-drops-186403},
  doi =		{10.4230/LIPIcs.WABI.2023.14},
  annote =	{Keywords: Sequence Sketching, Long-read Mapping, Exact Algorithm, Dynamic Programming}
}

Document

DOI: 10.4230/LIPIcs.WABI.2023.17

Compression Algorithm for Colored de Bruijn Graphs

Authors: Amatur Rahman, Yoann Dufresne, and Paul Medvedev

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead.

Cite as

Amatur Rahman, Yoann Dufresne, and Paul Medvedev. Compression Algorithm for Colored de Bruijn Graphs. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{rahman_et_al:LIPIcs.WABI.2023.17,
  author =	{Rahman, Amatur and Dufresne, Yoann and Medvedev, Paul},
  title =	{{Compression Algorithm for Colored de Bruijn Graphs}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{17:1--17:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.17},
  URN =		{urn:nbn:de:0030-drops-186434},
  doi =		{10.4230/LIPIcs.WABI.2023.17},
  annote =	{Keywords: colored de Bruijn graphs, disk compression, k-mer sets, simplitigs, spectrum-preserving string sets}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.WABI.2022.1

Efficient Solutions to Biological Problems Using de Bruijn Graphs (Invited Talk)

Authors: Leena Salmela

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Abstract

The de Bruijn graph has become a standard method in the analysis of sequencing reads in computational biology due to its ability to represent the information contained in large read sets in small space. A de Bruijn graph represents a set of sequencing reads by its k-mers, i.e. the set of substrings of length k that occur in the reads. In the classical definition, the k-mers are the edges of the graph and the nodes are the k-1 bases long prefixes and suffixes of the k-mers. Usually only k-mers occurring several times in the read set are kept to filter out noise in the data. De Bruijn graphs have been used to solve many problems in computational biology including genome assembly [Ramana M. Idury and Michael S. Waterman, 1995; Pavel A. Pevzner et al., 2001; Anton Bankevich et al., 2012; Yu Peng et al., 2010], sequencing error correction [Leena Salmela and Eric Rivals, 2014; Giles Miclotte et al., 2016; Leena Salmela et al., 2017; Limasset et al., 2019], reference free variant calling [Raluca Uricaru et al., 2015], indexing read sets [Camille Marchet et al., 2021], and so on. Next I will discuss two of these problems in more depth. The de Bruijn graph first emerged in computation biology in the context of genome assembly [Ramana M. Idury and Michael S. Waterman, 1995; Pavel A. Pevzner et al., 2001] where the task is to reconstruct a genome based on sequencing reads. As the de Bruijn graph can represent large read sets compactly, it became the standard approach to assemble short reads [Anton Bankevich et al., 2012; Yu Peng et al., 2010]. In the theoretical framework of de Bruijn graph based genome assembly, a genome is thought to be the Eulerian path in the de Bruijn graph built on the sequencing reads. In practise, the Eulerian path is not unique and thus not useful in the biological context. Therefore, practical implementations report subpaths that are guaranteed to be part of any Eulerian path and thus part of the actual genome. Such models include unitigs, which are nonbranching paths of the de Bruijn graph, and more involved definitions such as omnitigs [Alexandru I. Tomescu and Paul Medvedev, 2017]. In genome assembly the choice of k is a crucial matter. A small k can result in a tangled graph, whereas a too large k will fragment the graph. Furthermore, a different value of k may be optimal for different parts of the genome. Variable order de Bruijn graphs [Christina Boucher et al., 2015; Djamal Belazzougui et al., 2016], which represent de Bruijn graphs of all orders k in a single data structure, have been proposed as a solution but no rigorous definition corresponding to unitigs has been presented. We give the first definition of assembled sequences, i.e. contigs, on such graphs and an algorithm for enumerating them. Another problem that can be solved with de Bruijn graphs is the correction of sequencing errors [Leena Salmela and Eric Rivals, 2014; Giles Miclotte et al., 2016; Leena Salmela et al., 2017; Limasset et al., 2019]. Because each position of a genome is sequenced several times, it is possible to correct sequencing errors in reads if we can identify data originating from the same genomic region. A de Bruijn graph can be used to represent compactly the reliable information and the individual reads can be corrected by aligning them to the graph.

Cite as

Leena Salmela. Efficient Solutions to Biological Problems Using de Bruijn Graphs (Invited Talk). In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 1:1-1:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{salmela:LIPIcs.WABI.2022.1,
  author =	{Salmela, Leena},
  title =	{{Efficient Solutions to Biological Problems Using de Bruijn Graphs}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{1:1--1:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.1},
  URN =		{urn:nbn:de:0030-drops-170357},
  doi =		{10.4230/LIPIcs.WABI.2022.1},
  annote =	{Keywords: de Bruijn graph, variable order de Bruijn graph, genome assembly, sequencing error correction, k-mers}
}

Document

DOI: 10.4230/LIPIcs.WABI.2020.11

A Graph-Theoretic Barcode Ordering Model for Linked-Reads

Authors: Yoann Dufresne, Chen Sun, Pierre Marijon, Dominique Lavenier, Cedric Chauve, and Rayan Chikhi

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

Considering a set of intervals on the real line, an interval graph records these intervals as nodes and their intersections as edges. Identifying (i.e. merging) pairs of nodes in an interval graph results in a multiple-interval graph. Given only the nodes and the edges of the multiple-interval graph without knowing the underlying intervals, we are interested in the following questions. Can one determine how many intervals correspond to each node? Can one compute a walk over the multiple-interval graph nodes that reflects the ordering of the original intervals? These questions are closely related to linked-read DNA sequencing, where barcodes are assigned to long molecules whose intersection graph forms an interval graph. Each barcode may correspond to multiple molecules, which complicates downstream analysis, and corresponds to the identification of nodes of the corresponding interval graph. Resolving the above graph-theoretic problems would facilitate analyses of linked-reads sequencing data, through enabling the conceptual separation of barcodes into molecules and providing, through the molecules order, a skeleton for accurately assembling the genome. Here, we propose a framework that takes as input an arbitrary intersection graph (such as an overlap graph of barcodes) and constructs a heuristic approximation of the ordering of the original intervals.

Cite as

Yoann Dufresne, Chen Sun, Pierre Marijon, Dominique Lavenier, Cedric Chauve, and Rayan Chikhi. A Graph-Theoretic Barcode Ordering Model for Linked-Reads. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 11:1-11:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{dufresne_et_al:LIPIcs.WABI.2020.11,
  author =	{Dufresne, Yoann and Sun, Chen and Marijon, Pierre and Lavenier, Dominique and Chauve, Cedric and Chikhi, Rayan},
  title =	{{A Graph-Theoretic Barcode Ordering Model for Linked-Reads}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{11:1--11:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.11},
  URN =		{urn:nbn:de:0030-drops-128001},
  doi =		{10.4230/LIPIcs.WABI.2020.11},
  annote =	{Keywords: DNA sequencing, graph algorithms, linked-reads, interval graphs, cliques}
}

Document

DOI: 10.4230/LIPIcs.WABI.2020.16

Disk Compression of k-mer Sets

Authors: Amatur Rahman, Rayan Chikhi, and Paul Medvedev

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

Cite as

Amatur Rahman, Rayan Chikhi, and Paul Medvedev. Disk Compression of k-mer Sets. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 16:1-16:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{rahman_et_al:LIPIcs.WABI.2020.16,
  author =	{Rahman, Amatur and Chikhi, Rayan and Medvedev, Paul},
  title =	{{Disk Compression of k-mer Sets}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{16:1--16:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.16},
  URN =		{urn:nbn:de:0030-drops-128057},
  doi =		{10.4230/LIPIcs.WABI.2020.16},
  annote =	{Keywords: de Bruijn graphs, compression, k-mer sets, spectrum-preserving string sets}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.29

Optimal Omnitig Listing for Safe and Complete Contig Assembly

Authors: Massimo Cairo, Paul Medvedev, Nidia Obscura Acosta, Romeo Rizzi, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

Genome assembly is the problem of reconstructing a genome sequence from a set of reads from a sequencing experiment. Typical formulations of the assembly problem admit in practice many genomic reconstructions, and actual genome assemblers usually output contigs, namely substrings that are promised to occur in the genome. To bridge the theory and practice, Tomescu and Medvedev [RECOMB 2016] reformulated contig assembly as finding all substrings common to all genomic reconstructions. They also gave a characterization of those walks (omnitigs) that are common to all closed edge-covering walks of a (directed) graph, a typical notion of genomic reconstruction. An algorithm for listing all maximal omnitigs was also proposed, by launching an exhaustive visit from every edge. In this paper, we prove new insights about the structure of omnitigs and solve several open questions about them. We combine these to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges). This is also optimal, as we show families of graphs whose total omnitig length is Omega(nm). We implement this algorithm and show that it is 9-12 times faster in practice than the one of Tomescu and Medvedev [RECOMB 2016].

Cite as

Massimo Cairo, Paul Medvedev, Nidia Obscura Acosta, Romeo Rizzi, and Alexandru I. Tomescu. Optimal Omnitig Listing for Safe and Complete Contig Assembly. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 29:1-29:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{cairo_et_al:LIPIcs.CPM.2017.29,
  author =	{Cairo, Massimo and Medvedev, Paul and Obscura Acosta, Nidia and Rizzi, Romeo and Tomescu, Alexandru I.},
  title =	{{Optimal Omnitig Listing for Safe and Complete Contig Assembly}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{29:1--29:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.29},
  URN =		{urn:nbn:de:0030-drops-73423},
  doi =		{10.4230/LIPIcs.CPM.2017.29},
  annote =	{Keywords: genome assembly, graph algorithm, edge-covering walk, strong bridge}
}

10 Search Results for "Medvedev, Paul"

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as