DROPS

Document

DOI: 10.4230/LIPIcs.SEA.2024.3

Practical Minimum Path Cover

Authors: Manuel Cáceres, Brendan Mumey, Santeri Toivonen, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

Computing a minimum path cover (MPC) of a directed acyclic graph (DAG) is a fundamental problem with a myriad of applications, including reachability. Although it is known how to solve the problem by a simple reduction to minimum flow, recent theoretical advances exploit this idea to obtain algorithms parameterized by the number of paths of an MPC, known as the width. These results obtain fast [Mäkinen et al., TALG 2019] and even linear time [Cáceres et al., SODA 2022] algorithms in the small-width regime. In this paper, we present the first publicly available high-performance implementation of state-of-the-art MPC algorithms, including the parameterized approaches. Our experiments on random DAGs show that parameterized algorithms are orders-of-magnitude faster on dense graphs. Additionally, we present new fast pre-processing heuristics based on transitive edge sparsification. We show that our heuristics improve MPC-solvers by orders of magnitude.
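
As a rough illustration of the width parameter mentioned in the abstract, the sketch below computes the size of an MPC of a small DAG. It does not use the minimum-flow reduction or the parameterized algorithms from the paper; it swaps in the textbook alternative of a maximum bipartite matching on the transitive closure, which allows paths to share vertices. The function name `mpc_width` and the edge-list input format are assumptions made for this example.

```python
def mpc_width(n, edges):
    """Number of paths in a minimum path cover of a DAG with vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    # reach[s] = vertices reachable from s by a nonempty path (transitive closure)
    reach = []
    for s in range(n):
        seen, stack = set(), list(adj[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        reach.append(seen)

    match_of = [-1] * n  # match_of[v] = vertex chosen as predecessor of v

    def try_augment(u, visited):
        # Kuhn's augmenting-path step for bipartite matching
        for v in reach[u]:
            if v not in visited:
                visited.add(v)
                if match_of[v] == -1 or try_augment(match_of[v], visited):
                    match_of[v] = u
                    return True
        return False

    matched = sum(try_augment(u, set()) for u in range(n))
    return n - matched  # each matched pair merges two paths into one


# Diamond DAG: vertices 1 and 2 are incomparable, so two paths are needed.
print(mpc_width(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # 2
```

This quadratic-space approach is only meant for toy inputs; it is not representative of the implementations evaluated in the paper.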

Cite as

Manuel Cáceres, Brendan Mumey, Santeri Toivonen, and Alexandru I. Tomescu. Practical Minimum Path Cover. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 3:1-3:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{caceres_et_al:LIPIcs.SEA.2024.3,
  author =	{C\'{a}ceres, Manuel and Mumey, Brendan and Toivonen, Santeri and Tomescu, Alexandru I.},
  title =	{{Practical Minimum Path Cover}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{3:1--3:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.3},
  URN =		{urn:nbn:de:0030-drops-203687},
  doi =		{10.4230/LIPIcs.SEA.2024.3},
  annote =	{Keywords: minimum path cover, directed acyclic graph, maximum flow, parameterized algorithms, edge sparsification, algorithm engineering}
}

Document

DOI: 10.4230/LIPIcs.SEA.2024.10

Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
- build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
- for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes;
- find the minimum and maximum values stored in that interval;
- take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
- a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
- a minimizer digest;
- a KATKA kernel of a minimizer digest.

With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
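
The four lookup steps listed in the abstract can be sketched on a toy input as follows. This is only an illustration under stated assumptions: a plain sorted suffix array stands in for the augmented FM-index, the match passed in is assumed to be a MEM already, and the helper names (`build_index`, `sa_interval`, `lca`, `classify`) and the example tree are hypothetical, not taken from the paper.

```python
# Toy sketch: a plain suffix array stands in for the augmented FM-index.

def build_index(genomes):
    """Concatenate genomes in left-to-right leaf order and index the result."""
    text, owner = "", []
    for gid, genome in enumerate(genomes):
        text += genome + "$"                # '$' is assumed absent from genomes
        owner += [gid] * (len(genome) + 1)  # genome id of every text position
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, owner

def sa_interval(text, sa, pattern):
    """Half-open suffix array interval of pattern (linear scan for clarity)."""
    hits = [k for k, i in enumerate(sa) if text.startswith(pattern, i)]
    return (hits[0], hits[-1] + 1) if hits else None

def lca(parent, a, b):
    """Lowest common ancestor in a tree given as a child -> parent dict."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b

def classify(match, text, sa, owner, parent):
    """Map one exact match to the LCA of the genomes at its extreme positions."""
    interval = sa_interval(text, sa, match)
    if interval is None:
        return None
    lo, hi = interval
    positions = sa[lo:hi]
    return lca(parent, owner[min(positions)], owner[max(positions)])

# Hypothetical tree: leaves 0 and 1 hang under "X"; "X" and leaf 2 under "root".
genomes = ["ACGTAC", "ACGTTT", "GGGCCC"]
parent = {0: "X", 1: "X", "X": "root", 2: "root"}
text, sa, owner = build_index(genomes)

print(classify("ACGT", text, sa, owner, parent))  # X
print(classify("GGC", text, sa, owner, parent))   # 2
```

Because the genomes are concatenated in leaf order, the genomes holding the minimum and maximum positions bracket every other occurrence, so their LCA covers all of them.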

Cite as

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro. Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{draesslerova_et_al:LIPIcs.SEA.2024.10,
  author =	{Draesslerov\'{a}, Dominika and Ahmed, Omar and Gagie, Travis and Holub, Jan and Langmead, Ben and Manzini, Giovanni and Navarro, Gonzalo},
  title =	{{Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{10:1--10:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.10},
  URN =		{urn:nbn:de:0030-drops-203756},
  doi =		{10.4230/LIPIcs.SEA.2024.10},
  annote =	{Keywords: Taxonomic classification, metagenomics, KATKA, maximal exact matches, string kernels, minimizer digests}
}

Document

DOI: 10.4230/LIPIcs.WABI.2020.11

A Graph-Theoretic Barcode Ordering Model for Linked-Reads

Authors: Yoann Dufresne, Chen Sun, Pierre Marijon, Dominique Lavenier, Cedric Chauve, and Rayan Chikhi

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

Considering a set of intervals on the real line, an interval graph records these intervals as nodes and their intersections as edges. Identifying (i.e., merging) pairs of nodes in an interval graph results in a multiple-interval graph. Given only the nodes and the edges of the multiple-interval graph without knowing the underlying intervals, we are interested in the following questions. Can one determine how many intervals correspond to each node? Can one compute a walk over the multiple-interval graph nodes that reflects the ordering of the original intervals? These questions are closely related to linked-read DNA sequencing, where barcodes are assigned to long molecules whose intersection graph forms an interval graph. Each barcode may correspond to multiple molecules, which complicates downstream analysis, and corresponds to the identification of nodes of the corresponding interval graph. Resolving the above graph-theoretic problems would facilitate analyses of linked-reads sequencing data, through enabling the conceptual separation of barcodes into molecules and providing, through the order of the molecules, a skeleton for accurately assembling the genome. Here, we propose a framework that takes as input an arbitrary intersection graph (such as an overlap graph of barcodes) and constructs a heuristic approximation of the ordering of the original intervals.

Cite as

Yoann Dufresne, Chen Sun, Pierre Marijon, Dominique Lavenier, Cedric Chauve, and Rayan Chikhi. A Graph-Theoretic Barcode Ordering Model for Linked-Reads. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 11:1-11:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{dufresne_et_al:LIPIcs.WABI.2020.11,
  author =	{Dufresne, Yoann and Sun, Chen and Marijon, Pierre and Lavenier, Dominique and Chauve, Cedric and Chikhi, Rayan},
  title =	{{A Graph-Theoretic Barcode Ordering Model for Linked-Reads}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{11:1--11:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.11},
  URN =		{urn:nbn:de:0030-drops-128001},
  doi =		{10.4230/LIPIcs.WABI.2020.11},
  annote =	{Keywords: DNA sequencing, graph algorithms, linked-reads, interval graphs, cliques}
}

Document

DOI: 10.4230/LIPIcs.WABI.2020.16

Disk Compression of k-mer Sets

Authors: Amatur Rahman, Rayan Chikhi, and Paul Medvedev

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

Cite as

Amatur Rahman, Rayan Chikhi, and Paul Medvedev. Disk Compression of k-mer Sets. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 16:1-16:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{rahman_et_al:LIPIcs.WABI.2020.16,
  author =	{Rahman, Amatur and Chikhi, Rayan and Medvedev, Paul},
  title =	{{Disk Compression of k-mer Sets}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{16:1--16:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.16},
  URN =		{urn:nbn:de:0030-drops-128057},
  doi =		{10.4230/LIPIcs.WABI.2020.16},
  annote =	{Keywords: de Bruijn graphs, compression, k-mer sets, spectrum-preserving string sets}
}

Document

DOI: 10.4230/LIPIcs.CPM.2018.18

Dualities in Tree Representations

Authors: Rayan Chikhi and Alexander Schönhuth

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

Abstract

A characterization of the tree T^* such that BP(T^*)=\overleftarrow{DFUDS(T)}, the reversal of DFUDS(T), is given. An immediate consequence is a rigorous characterization of the tree \hat{T} such that BP(\hat{T})=DFUDS(T). In summary, BP and DFUDS are unified within an encompassing framework, which might have the potential to imply future simplifications with regard to queries in BP and/or DFUDS. Immediate benefits displayed here are to identify so far unnoted commonalities in most recent work on the Range Minimum Query problem, and to provide improvements for the Minimum Length Interval Query problem.
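
For readers unfamiliar with the two encodings, the sketch below builds BP(T) and DFUDS(T) for a small ordinal tree using their standard definitions. BP writes an opening parenthesis when a depth-first traversal enters a node and a closing one when it leaves. DFUDS visits nodes in preorder and writes one opening parenthesis per child followed by a closing one, after a single extra leading opening parenthesis. The nested-list tree format and the function names are assumptions for this example, and the sketch does not construct the dual trees characterized in the paper.

```python
# A tree node is represented as the list of its children; a leaf is [].

def bp(node):
    """Balanced-parentheses encoding: '(' on entering a node, ')' on leaving."""
    return "(" + "".join(bp(child) for child in node) + ")"

def dfuds(root):
    """Depth-first unary degree sequence encoding of the tree."""
    out = ["("]                 # extra leading '(' keeps the sequence balanced
    stack = [root]
    while stack:
        node = stack.pop()
        out.append("(" * len(node) + ")")  # degree in unary, then ')'
        stack.extend(reversed(node))       # so children are popped left to right
    return "".join(out)

# Root with two children: an internal node with two leaves, then a leaf.
tree = [[[], []], []]
print(bp(tree))     # ((()())())
print(dfuds(tree))  # ((()(())))
```

Both strings use 2n parentheses for a tree with n nodes, which is what makes a correspondence between the two encodings possible in the first place.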

Cite as

Rayan Chikhi and Alexander Schönhuth. Dualities in Tree Representations. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 18:1-18:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{chikhi_et_al:LIPIcs.CPM.2018.18,
  author =	{Chikhi, Rayan and Sch\"{o}nhuth, Alexander},
  title =	{{Dualities in Tree Representations}},
  booktitle =	{29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)},
  pages =	{18:1--18:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-074-3},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{105},
  editor =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.18},
  URN =		{urn:nbn:de:0030-drops-86901},
  doi =		{10.4230/LIPIcs.CPM.2018.18},
  annote =	{Keywords: Data Structures, Succinct Tree Representation, Balanced Parenthesis Representation, Isomorphisms}
}

Document

DOI: 10.4230/LIPIcs.SEA.2017.25

Fast and Scalable Minimal Perfect Hashing for Massive Key Sets

Authors: Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo

Published in: LIPIcs, Volume 75, 16th International Symposium on Experimental Algorithms (SEA 2017)

Abstract

Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of 10^{10} elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^{12}. Source code: https://github.com/rizkg/BBHash

Cite as

Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. Fast and Scalable Minimal Perfect Hashing for Massive Key Sets. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 25:1-25:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{limasset_et_al:LIPIcs.SEA.2017.25,
  author =	{Limasset, Antoine and Rizk, Guillaume and Chikhi, Rayan and Peterlongo, Pierre},
  title =	{{Fast and Scalable Minimal Perfect Hashing for Massive Key Sets}},
  booktitle =	{16th International Symposium on Experimental Algorithms (SEA 2017)},
  pages =	{25:1--25:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-036-1},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{75},
  editor =	{Iliopoulos, Costas S. and Pissis, Solon P. and Puglisi, Simon J. and Raman, Rajeev},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2017.25},
  URN =		{urn:nbn:de:0030-drops-76196},
  doi =		{10.4230/LIPIcs.SEA.2017.25},
  annote =	{Keywords: Minimal Perfect Hash Functions, Algorithms, Data Structures, Big Data}
}

6 Search Results for "Chikhi, Rayan"
