DROPS

Document

DOI: 10.4230/LIPIcs.CPM.2023.26

MONI Can Find k-MEMs

Authors: Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

Suppose we are asked to index a text T[0..n - 1] such that, given a pattern P[0..m - 1], we can quickly report the maximal substrings of P that each occur in T at least k times. We first show how we can add O(r log n) bits to Rossi et al.’s recent MONI index, where r is the number of runs in the Burrows-Wheeler Transform of T, such that it supports such queries in O(k m log n) time. We then show how, if we are given k at construction time, we can reduce the query time to O(m log n).
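To make the query in this abstract concrete, the following is a minimal brute-force sketch of what a k-MEM is. It is not the MONI-based method and has none of its time or space bounds. The function names `count_occurrences` and `k_mems` are hypothetical, and the sketch assumes that occurrences in T may overlap.

```python
def count_occurrences(text, s):
    """Count (possibly overlapping) occurrences of a non-empty s in text."""
    count, pos = 0, text.find(s)
    while pos != -1:
        count += 1
        pos = text.find(s, pos + 1)
    return count


def k_mems(text, pattern, k):
    """Return (start, length) pairs of the maximal substrings of pattern
    that each occur at least k times in text. Brute force, for illustration."""
    m = len(pattern)
    result = []
    prev_end = 0  # largest end position reached by a match from an earlier start
    for i in range(m):
        # Extend to the right as long as the substring still occurs >= k times.
        length = 0
        while (i + length < m
               and count_occurrences(text, pattern[i:i + length + 1]) >= k):
            length += 1
        end = i + length
        # The match is left-maximal only if it ends strictly further right
        # than every match starting earlier in the pattern.
        if length > 0 and end > prev_end:
            result.append((i, length))
        prev_end = max(prev_end, end)
    return result


if __name__ == "__main__":
    print(k_mems("banana", "anan", 2))
```

With k = 1 this reduces to ordinary maximal exact matches. Raising k can only shorten or split the reported substrings, since every substring of a string that occurs at least k times also occurs at least k times.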

Cite as

Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie. MONI Can Find k-MEMs. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 26:1-26:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{tatarnikov_et_al:LIPIcs.CPM.2023.26,
  author =	{Tatarnikov, Igor and Shahrabi Farahani, Ardavan and Kashgouli, Sana and Gagie, Travis},
  title =	{{MONI Can Find k-MEMs}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{26:1--26:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.26},
  URN =		{urn:nbn:de:0030-drops-179802},
  doi =		{10.4230/LIPIcs.CPM.2023.26},
  annote =	{Keywords: Compact data structures, Burrows-Wheeler Transform, run-length compression, maximal exact matches}
}

Document

DOI: 10.4230/LIPIcs.WABI.2022.19

Pangenomic Genotyping with the Marker Array

Authors: Taher Mun, Naga Sai Kavya Vaddadi, and Ben Langmead

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Abstract

We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while avoiding the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes using less time and memory than existing graph-based methods.

Cite as

Taher Mun, Naga Sai Kavya Vaddadi, and Ben Langmead. Pangenomic Genotyping with the Marker Array. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 19:1-19:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{mun_et_al:LIPIcs.WABI.2022.19,
  author =	{Mun, Taher and Vaddadi, Naga Sai Kavya and Langmead, Ben},
  title =	{{Pangenomic Genotyping with the Marker Array}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{19:1--19:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.19},
  URN =		{urn:nbn:de:0030-drops-170530},
  doi =		{10.4230/LIPIcs.WABI.2022.19},
  annote =	{Keywords: Sequence alignment indexing genotyping}
}

Document

DOI: 10.4230/LIPIcs.SEA.2022.16

RLBWT Tricks

Authors: Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi

Published in: LIPIcs, Volume 233, 20th International Symposium on Experimental Algorithms (SEA 2022)

Abstract

Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation π, it stores an O(r)-space table - where r is the number of positions i where either i = 0 or π(i + 1) ≠ π(i) + 1 - that enables the computation of successive values of π(i) by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing π(i) is constant while maintaining O(r)-space. In this paper we refine Nishimoto and Tabei’s approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation π corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.
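The table look-up plus linear scan described in this abstract can be illustrated with a simplified sketch. It is not the authors' implementation: it omits the extra rows that bound the scan length and any compression, and the names `build_move_table` and `move` are hypothetical. A run is taken here to start at position i when i = 0 or π(i) ≠ π(i - 1) + 1.

```python
from bisect import bisect_right


def build_move_table(pi):
    """Build one table row per run of consecutive values in the permutation pi.

    Each row stores the run's start, the image of that start under pi, and the
    index of the run that contains that image."""
    n = len(pi)
    starts = [i for i in range(n) if i == 0 or pi[i] != pi[i - 1] + 1]
    images = [pi[p] for p in starts]
    # bisect is used only at construction time to locate the destination run.
    pointers = [bisect_right(starts, q) - 1 for q in images]
    return starts, images, pointers


def move(table, i, j):
    """Given position i and the index j of the run containing i, return
    (pi[i], index of the run containing pi[i])."""
    starts, images, pointers = table
    # Within a run, pi shifts every position by the same offset.
    i_next = images[j] + (i - starts[j])
    # Jump to the run holding the image of the run's start, then scan forward.
    j_next = pointers[j]
    while j_next + 1 < len(starts) and starts[j_next + 1] <= i_next:
        j_next += 1
    return i_next, j_next


if __name__ == "__main__":
    pi = [2, 3, 4, 5, 0, 1]
    table = build_move_table(pi)
    print(move(table, 3, 0))
```

Because each call returns the run index along with the new position, successive values of π can be computed without a predecessor or rank query at query time. Only the forward scan can take more than constant time in this simplified version.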

Cite as

Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT Tricks. In 20th International Symposium on Experimental Algorithms (SEA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 233, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{brown_et_al:LIPIcs.SEA.2022.16,
  author =	{Brown, Nathaniel K. and Gagie, Travis and Rossi, Massimiliano},
  title =	{{RLBWT Tricks}},
  booktitle =	{20th International Symposium on Experimental Algorithms (SEA 2022)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-251-8},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{233},
  editor =	{Schulz, Christian and U\c{c}ar, Bora},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2022.16},
  URN =		{urn:nbn:de:0030-drops-165500},
  doi =		{10.4230/LIPIcs.SEA.2022.16},
  annote =	{Keywords: Compressed String Indexes, Repetitive Text Collections, Burrows-Wheeler Transform}
}

Document

DOI: 10.4230/LIPIcs.SEA.2022.22

Computing Maximal Unique Matches with the r-Index

Authors: Sara Giuliani, Giuseppe Romana, and Massimiliano Rossi

Published in: LIPIcs, Volume 233, 20th International Symposium on Experimental Algorithms (SEA 2022)

Abstract

In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven themselves to be useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.’s approach to enable the computation of MUMs on the r-index, while preserving the space and time bounds. We add O(r) additional samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT, which permit the computation of the second longest match of the pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, which we call MUM-PHINDER, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that our method is up to 8 times smaller, while up to 19 times slower, when the dataset is not highly repetitive; on highly repetitive data, our method is up to 6.5 times slower and uses up to 25 times less memory.

Cite as

Sara Giuliani, Giuseppe Romana, and Massimiliano Rossi. Computing Maximal Unique Matches with the r-Index. In 20th International Symposium on Experimental Algorithms (SEA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 233, pp. 22:1-22:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{giuliani_et_al:LIPIcs.SEA.2022.22,
  author =	{Giuliani, Sara and Romana, Giuseppe and Rossi, Massimiliano},
  title =	{{Computing Maximal Unique Matches with the r-Index}},
  booktitle =	{20th International Symposium on Experimental Algorithms (SEA 2022)},
  pages =	{22:1--22:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-251-8},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{233},
  editor =	{Schulz, Christian and U\c{c}ar, Bora},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2022.22},
  URN =		{urn:nbn:de:0030-drops-165568},
  doi =		{10.4230/LIPIcs.SEA.2022.22},
  annote =	{Keywords: Burrows-Wheeler Transform, r-index, maximal unique matches, bioinformatics, pangenomics}
}

Document

DOI: 10.4230/LIPIcs.CPM.2022.25

A Theoretical and Experimental Analysis of BWT Variants for String Collections

Authors: Davide Cenzato and Zsuzsanna Lipták

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2.

Cite as

Davide Cenzato and Zsuzsanna Lipták. A Theoretical and Experimental Analysis of BWT Variants for String Collections. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 25:1-25:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{cenzato_et_al:LIPIcs.CPM.2022.25,
  author =	{Cenzato, Davide and Lipt\'{a}k, Zsuzsanna},
  title =	{{A Theoretical and Experimental Analysis of BWT Variants for String Collections}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{25:1--25:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.25},
  URN =		{urn:nbn:de:0030-drops-161529},
  doi =		{10.4230/LIPIcs.CPM.2022.25},
  annote =	{Keywords: Burrows-Wheeler-Transform, extended BWT, string collections, repetitiveness measures, r, compression}
}

Document

DOI: 10.4230/LIPIcs.SEA.2021.12

Document Retrieval Hacks

Authors: Simon J. Puglisi and Bella Zhukova

Published in: LIPIcs, Volume 190, 19th International Symposium on Experimental Algorithms (SEA 2021)

Abstract

Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics - to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space consuming than the best specialized indexes for the problem.

Cite as

Simon J. Puglisi and Bella Zhukova. Document Retrieval Hacks. In 19th International Symposium on Experimental Algorithms (SEA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 190, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{puglisi_et_al:LIPIcs.SEA.2021.12,
  author =	{Puglisi, Simon J. and Zhukova, Bella},
  title =	{{Document Retrieval Hacks}},
  booktitle =	{19th International Symposium on Experimental Algorithms (SEA 2021)},
  pages =	{12:1--12:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-185-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{190},
  editor =	{Coudert, David and Natale, Emanuele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2021.12},
  URN =		{urn:nbn:de:0030-drops-137848},
  doi =		{10.4230/LIPIcs.SEA.2021.12},
  annote =	{Keywords: String Processing, Pattern matching, Document listing, Document retrieval, Succinct data structures, Repetitive text collections}
}

Document

DOI: 10.4230/LIPIcs.WABI.2020.9

Fast and Efficient Rmap Assembly Using the Bi-Labelled de Bruijn Graph

Authors: Kingshuk Mukherjee, Massimiliano Rossi, Leena Salmela, and Christina Boucher

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there exist very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary method that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (2006) and Solve by Bionano Genomics on data from three genomes - E. coli, human, and climbing perch fish (Anabas testudineus). Our method was the only one able to successfully run on all three genomes. The method of Valouev et al. (2006) only successfully ran on E. coli and Bionano Solve successfully ran on E. coli and human but not on the fish genome. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies.

Cite as

Kingshuk Mukherjee, Massimiliano Rossi, Leena Salmela, and Christina Boucher. Fast and Efficient Rmap Assembly Using the Bi-Labelled de Bruijn Graph. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 9:1-9:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{mukherjee_et_al:LIPIcs.WABI.2020.9,
  author =	{Mukherjee, Kingshuk and Rossi, Massimiliano and Salmela, Leena and Boucher, Christina},
  title =	{{Fast and Efficient Rmap Assembly Using the Bi-Labelled de Bruijn Graph}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{9:1--9:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.9},
  URN =		{urn:nbn:de:0030-drops-127982},
  doi =		{10.4230/LIPIcs.WABI.2020.9},
  annote =	{Keywords: optical maps, de Bruijn graph, assembly}
}

Document

DOI: 10.4230/LIPIcs.SEA.2020.12

Pattern Discovery in Colored Strings

Authors: Zsuzsanna Lipták, Simon J. Puglisi, and Massimiliano Rossi

Published in: LIPIcs, Volume 160, 18th International Symposium on Experimental Algorithms (SEA 2020)

Abstract

We consider the problem of identifying patterns of interest in colored strings. A colored string is a string in which each position is colored with one of a finite set of colors. Our task is to find substrings that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically infer properties of the embedded system from the analysis of its simulation traces. We show that the number of interesting patterns is upper-bounded by 𝒪(n²) where n is the length of the string. We introduce a baseline algorithm with 𝒪(n²) running time which identifies all interesting patterns for all colors in the string satisfying certain minimality conditions. When one is interested in patterns related to only one color, we provide an algorithm that identifies patterns in 𝒪(n² log n) time, but is faster than the first algorithm in practice, both on simulated and on real-world patterns.

Cite as

Zsuzsanna Lipták, Simon J. Puglisi, and Massimiliano Rossi. Pattern Discovery in Colored Strings. In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 12:1-12:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{liptak_et_al:LIPIcs.SEA.2020.12,
  author =	{Lipt\'{a}k, Zsuzsanna and Puglisi, Simon J. and Rossi, Massimiliano},
  title =	{{Pattern Discovery in Colored Strings}},
  booktitle =	{18th International Symposium on Experimental Algorithms (SEA 2020)},
  pages =	{12:1--12:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-148-1},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{160},
  editor =	{Faro, Simone and Cantone, Domenico},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2020.12},
  URN =		{urn:nbn:de:0030-drops-120862},
  doi =		{10.4230/LIPIcs.SEA.2020.12},
  annote =	{Keywords: property testing, suffix tree, pattern mining}
}

8 Search Results for "Rossi, Massimiliano"
