Document

**Published in:** LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
- build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order;
- for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes;
- find the minimum and maximum values stored in that interval;
- take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
- a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
- a minimizer digest;
- a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro. Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{draesslerova_et_al:LIPIcs.SEA.2024.10, author = {Draesslerov\'{a}, Dominika and Ahmed, Omar and Gagie, Travis and Holub, Jan and Langmead, Ben and Manzini, Giovanni and Navarro, Gonzalo}, title = {{Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests}}, booktitle = {22nd International Symposium on Experimental Algorithms (SEA 2024)}, pages = {10:1--10:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-325-6}, ISSN = {1868-8969}, year = {2024}, volume = {301}, editor = {Liberti, Leo}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.10}, URN = {urn:nbn:de:0030-drops-203756}, doi = {10.4230/LIPIcs.SEA.2024.10}, annote = {Keywords: Taxonomic classification, metagenomics, KATKA, maximal exact matches, string kernels, minimizer digests} }

Document

**Published in:** LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

Within the field of haplotype analysis, the Positional Burrows-Wheeler Transform (PBWT) stands out as a key innovation, addressing numerous challenges in genomics. For example, Sanaullah et al. introduced a PBWT-based method that addresses the haplotype threading problem, which involves representing a query haplotype through a minimal set of substrings. To solve this problem using the PBWT data structure, they formulate the Minimal Positional Substring Cover (MPSC) problem, and then, subsequently present a solution for it. Additionally, they present and solve several variants of this problem: k-MPSC, leftmost MPSC, rightmost MPSC, and length-maximal MPSC. Yet, a full PBWT is required for each of their solutions, which yields a significant memory usage requirement. Here, we take advantage of the latest results on run-length encoding the PBWT, to solve the MPSC in a sublinear amount of space. Our methods involve demonstrating that k-Set Maximal Exact Matches (k-SMEMs) can be computed in a sublinear amount of space via efficient computation of k-Matching Statistics (k-MS). This leads to a solution that requires sublinear space for, not only the MPSC problem, but for all its variations proposed by Sanaullah et al. Most importantly, we present experimental results on haplotype panels from the 1000 Genomes Project data that show the utility of these theoretical results. We conclusively demonstrate that our approach markedly decreases the memory required to solve the MPSC problem, achieving a reduction of at least two orders of magnitude compared to the method proposed by Sanaullah et al. This efficiency allows us to solve the problem on large versions of the problem, where other methods are unable to scale to. In summary, the creation of {μ}-PBWT paves the way for new possibilities in conducting in-depth genetic research and analysis on a large scale. All source code is publicly available at https://github.com/dlcgold/muPBWT/tree/k-smem.

Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola. Solving the Minimal Positional Substring Cover Problem in Sublinear Space. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{bonizzoni_et_al:LIPIcs.CPM.2024.12, author = {Bonizzoni, Paola and Boucher, Christina and Cozzi, Davide and Gagie, Travis and Pirola, Yuri}, title = {{Solving the Minimal Positional Substring Cover Problem in Sublinear Space}}, booktitle = {35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)}, pages = {12:1--12:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-326-3}, ISSN = {1868-8969}, year = {2024}, volume = {296}, editor = {Inenaga, Shunsuke and Puglisi, Simon J.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.12}, URN = {urn:nbn:de:0030-drops-201225}, doi = {10.4230/LIPIcs.CPM.2024.12}, annote = {Keywords: Positional Burrows-Wheeler Transform, r-index, minimal position substring cover, set-maximal exact matches} }

Document

**Published in:** LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [Ferragina and Fischer, 2007] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [Deng et al., 2022] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing - which takes parameters that let us tune the average length of the phrases - instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38. And was consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it is very clear that our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory.

Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Acceleration of FM-Index Queries Through Prefix-Free Parsing. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{hong_et_al:LIPIcs.WABI.2023.13, author = {Hong, Aaron and Oliva, Marco and K\"{o}ppl, Dominik and Bannai, Hideo and Boucher, Christina and Gagie, Travis}, title = {{Acceleration of FM-Index Queries Through Prefix-Free Parsing}}, booktitle = {23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)}, pages = {13:1--13:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-294-5}, ISSN = {1868-8969}, year = {2023}, volume = {273}, editor = {Belazzougui, Djamal and Ouangraoua, A\"{i}da}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.13}, URN = {urn:nbn:de:0030-drops-186390}, doi = {10.4230/LIPIcs.WABI.2023.13}, annote = {Keywords: FM-index, pangenomics, scalability, word-based indexing, random access} }

Document

**Published in:** LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Suppose we are asked to index a text T [0..n - 1] such that, given a pattern P [0..m - 1], we can quickly report the maximal substrings of P that each occur in T at least k times. We first show how we can add O (r log n) bits to Rossi et al.’s recent MONI index, where r is the number of runs in the Burrows-Wheeler Transform of T, such that it supports such queries in O (k m log n) time. We then show how, if we are given k at construction time, we can reduce the query time to O (m log n).

Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie. MONI Can Find k-MEMs. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 26:1-26:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{tatarnikov_et_al:LIPIcs.CPM.2023.26, author = {Tatarnikov, Igor and Shahrabi Farahani, Ardavan and Kashgouli, Sana and Gagie, Travis}, title = {{MONI Can Find k-MEMs}}, booktitle = {34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)}, pages = {26:1--26:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-276-1}, ISSN = {1868-8969}, year = {2023}, volume = {259}, editor = {Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.26}, URN = {urn:nbn:de:0030-drops-179802}, doi = {10.4230/LIPIcs.CPM.2023.26}, annote = {Keywords: Compact data structures, Burrows-Wheeler Transform, run-length compression, maximal exact matches} }

Document

**Published in:** LIPIcs, Volume 244, 30th Annual European Symposium on Algorithms (ESA 2022)

We give a new and simple worst-case optimal algorithm for adaptive prefix-free coding that matches Gagie and Nekrich’s (2009) bounds except for lower-order terms, and uses no data structures more complicated than a lookup table.

Travis Gagie. Simple Worst-Case Optimal Adaptive Prefix-Free Coding. In 30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 57:1-57:5, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{gagie:LIPIcs.ESA.2022.57, author = {Gagie, Travis}, title = {{Simple Worst-Case Optimal Adaptive Prefix-Free Coding}}, booktitle = {30th Annual European Symposium on Algorithms (ESA 2022)}, pages = {57:1--57:5}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-247-1}, ISSN = {1868-8969}, year = {2022}, volume = {244}, editor = {Chechik, Shiri and Navarro, Gonzalo and Rotenberg, Eva and Herman, Grzegorz}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.57}, URN = {urn:nbn:de:0030-drops-169959}, doi = {10.4230/LIPIcs.ESA.2022.57}, annote = {Keywords: Adaptive prefix-free coding, Shannon coding, Lookup tables} }

Document

**Published in:** LIPIcs, Volume 233, 20th International Symposium on Experimental Algorithms (SEA 2022)

Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation π, it stores an O (r)-space table - where r is the number of positions i where either i = 0 or π (i + 1) ≠ π (i) + 1 - that enables the computation of successive values of π(i) by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing π(i) is constant while maintaining O (r)-space.
In this paper we refine Nishimoto and Tabei’s approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation π corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.

Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT Tricks. In 20th International Symposium on Experimental Algorithms (SEA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 233, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{brown_et_al:LIPIcs.SEA.2022.16, author = {Brown, Nathaniel K. and Gagie, Travis and Rossi, Massimiliano}, title = {{RLBWT Tricks}}, booktitle = {20th International Symposium on Experimental Algorithms (SEA 2022)}, pages = {16:1--16:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-251-8}, ISSN = {1868-8969}, year = {2022}, volume = {233}, editor = {Schulz, Christian and U\c{c}ar, Bora}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2022.16}, URN = {urn:nbn:de:0030-drops-165500}, doi = {10.4230/LIPIcs.SEA.2022.16}, annote = {Keywords: Compressed String Indexes, Repetitive Text Collections, Burrows-Wheeler Transform} }

Document

**Published in:** LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads.
The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT.
In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separators characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.

Travis Gagie, Garance Gourdel, and Giovanni Manzini. Compressing and Indexing Aligned Readsets. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 13:1-13:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.WABI.2021.13, author = {Gagie, Travis and Gourdel, Garance and Manzini, Giovanni}, title = {{Compressing and Indexing Aligned Readsets}}, booktitle = {21st International Workshop on Algorithms in Bioinformatics (WABI 2021)}, pages = {13:1--13:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-200-6}, ISSN = {1868-8969}, year = {2021}, volume = {201}, editor = {Carbone, Alessandra and El-Kebir, Mohammed}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.13}, URN = {urn:nbn:de:0030-drops-143660}, doi = {10.4230/LIPIcs.WABI.2021.13}, annote = {Keywords: data compression, compact data structures, FM-index, Burrows-Wheeler Transform, EBWT, XBWT, DNA reads} }

Document

**Published in:** LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, 𝒪(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to 𝒪(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.

Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A Fast and Small Subsampled R-Index. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{cobas_et_al:LIPIcs.CPM.2021.13, author = {Cobas, Dustin and Gagie, Travis and Navarro, Gonzalo}, title = {{A Fast and Small Subsampled R-Index}}, booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)}, pages = {13:1--13:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-186-3}, ISSN = {1868-8969}, year = {2021}, volume = {191}, editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.13}, URN = {urn:nbn:de:0030-drops-139647}, doi = {10.4230/LIPIcs.CPM.2021.13}, annote = {Keywords: Pattern matching, r-index, compressed text indexing, repetitive text collections} }

Document

**Published in:** Dagstuhl Reports, Volume 9, Issue 6 (2020)

Dagstuhl Seminar 19241 ("25 Years of the Burrows-Wheeler Transform") took place from June 10th to 14th, 2019, and was attended by 45 people from 13 countries and the three fields of Algorithms and Data Structures, Bioinformatics, and Combinatorics on Words. There were four talks and a panel session for each field. Feedback was generally positive and we are confident the seminar fostered interdisciplinary connections and will eventually result in noteworthy joint publications.

Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Jens Stoye. 25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241). In Dagstuhl Reports, Volume 9, Issue 6, pp. 55-68, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@Article{gagie_et_al:DagRep.9.6.55, author = {Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Stoye, Jens}, title = {{25 Years of the Burrows-Wheeler Transform (Dagstuhl Seminar 19241)}}, pages = {55--68}, journal = {Dagstuhl Reports}, ISSN = {2192-5283}, year = {2019}, volume = {9}, number = {6}, editor = {Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Stoye, Jens}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/DagRep.9.6.55}, URN = {urn:nbn:de:0030-drops-114874}, doi = {10.4230/DagRep.9.6.55}, annote = {Keywords: Bioinformatics, Burrows-Wheeler Transform, Combinatorics on Words, Data Compression, Data Structures, Indexing, Sequence Alignment} }

Document

**Published in:** LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the Burrows-Wheeler Transform (BWT), which gets close to that ideal. Our rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v, it can compute in O(k) time all the other nodes whose labels have an overlap of at least m characters with the label of v, with m being a parameter. If we choose the parameter k equal to the size of the reads (assuming that all have equal length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. As most BWT-based structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the property of the DNA reverse complements to simulate bi-directionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSS-based assembler can process ~500K reads of 150 characters long each (a FASTQ file of 185 MB) in less than 15 minutes and using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.

Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro. Simulating the DNA Overlap Graph in Succinct Space. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 26:1-26:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{diazdominguez_et_al:LIPIcs.CPM.2019.26, author = {D{\'\i}az-Dom{\'\i}nguez, Diego and Gagie, Travis and Navarro, Gonzalo}, title = {{Simulating the DNA Overlap Graph in Succinct Space}}, booktitle = {30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)}, pages = {26:1--26:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-103-0}, ISSN = {1868-8969}, year = {2019}, volume = {128}, editor = {Pisanti, Nadia and P. Pissis, Solon}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.26}, URN = {urn:nbn:de:0030-drops-104978}, doi = {10.4230/LIPIcs.CPM.2019.26}, annote = {Keywords: Overlap graph, de Bruijn graph, DNA sequencing, Succinct ordinal trees} }

Document

**Published in:** LIPIcs, Volume 123, 29th International Symposium on Algorithms and Computation (ISAAC 2018)

We present the first solution to tau-majorities on tree paths. Given a tree of n nodes, each with a label from [1..sigma], and a fixed threshold 0<tau<1, such a query gives two nodes u and v and asks for all the labels that appear more than tau * |P_{uv}| times in the path P_{uv} from u to v, where |P_{uv}| denotes the number of nodes in P_{uv}. Note that the answer to any query is of size up to 1/tau. On a w-bit RAM, we obtain a linear-space data structure with O((1/tau)lg^* n lg lg_w sigma) query time. For any kappa > 1, we can also build a structure that uses O(n lg^{[kappa]} n) space, where lg^{[kappa]} n denotes the function that applies logarithm kappa times to n, and answers queries in time O((1/tau)lg lg_w sigma). The construction time of both structures is O(n lg n). We also describe two succinct-space solutions with the same query time of the linear-space structure. One uses 2nH + 4n + o(n)(H+1) bits, where H <=lg sigma is the entropy of the label distribution, and can be built in O(n lg n) time. The other uses nH + O(n) + o(nH) bits and is built in O(n lg n) time w.h.p.

Travis Gagie, Meng He, and Gonzalo Navarro. Tree Path Majority Data Structures. In 29th International Symposium on Algorithms and Computation (ISAAC 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 123, pp. 68:1-68:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.ISAAC.2018.68, author = {Gagie, Travis and He, Meng and Navarro, Gonzalo}, title = {{Tree Path Majority Data Structures}}, booktitle = {29th International Symposium on Algorithms and Computation (ISAAC 2018)}, pages = {68:1--68:12}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-094-1}, ISSN = {1868-8969}, year = {2018}, volume = {123}, editor = {Hsu, Wen-Lian and Lee, Der-Tsai and Liao, Chung-Shou}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2018.68}, URN = {urn:nbn:de:0030-drops-100166}, doi = {10.4230/LIPIcs.ISAAC.2018.68}, annote = {Keywords: Majorities on Trees, Succinct data structures} }

Document

**Published in:** LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive - a characteristic that can be exploited and enable the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.

Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-Free Parsing for Building Big BWTs. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 2:1-2:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{boucher_et_al:LIPIcs.WABI.2018.2, author = {Boucher, Christina and Gagie, Travis and Kuhnle, Alan and Manzini, Giovanni}, title = {{Prefix-Free Parsing for Building Big BWTs}}, booktitle = {18th International Workshop on Algorithms in Bioinformatics (WABI 2018)}, pages = {2:1--2:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-082-8}, ISSN = {1868-8969}, year = {2018}, volume = {113}, editor = {Parida, Laxmi and Ukkonen, Esko}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.2}, URN = {urn:nbn:de:0030-drops-93044}, doi = {10.4230/LIPIcs.WABI.2018.2}, annote = {Keywords: Burrows-Wheeler Transform, prefix-free parsing, compression-aware algorithms, genomic databases} }

Document

**Published in:** LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016 and Algorithmica, CPM 2017) showed how we can use an augmented run-length compressed BWT (RLBWT) of the reverse T^R of a text T, to compute offline the LZ77 parse of T in O(n log r) time and O(r) space, where n is the length of T and r is the number of runs in the BWT of T^R. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text, to work with Policriti and Prezza's augmented RLBWT. This immediately implies that we can build online the LZ77 parse of T while still using O(n log r) time and O(r) space; it also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto's (IWOCA 2017) implementation of updating, show our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further - albeit making it static again and increasing its space by a factor proportional to the size of the alphabet - such that later, given another string S and O(log log n)-time random access to T, we can compute the matching statistics of S with respect to T in O(|S| log log n) time.

Hideo Bannai, Travis Gagie, and Tomohiro I. Online LZ77 Parsing and Matching Statistics with RLBWTs. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 7:1-7:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{bannai_et_al:LIPIcs.CPM.2018.7, author = {Bannai, Hideo and Gagie, Travis and I, Tomohiro}, title = {{Online LZ77 Parsing and Matching Statistics with RLBWTs}}, booktitle = {29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)}, pages = {7:1--7:12}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-074-3}, ISSN = {1868-8969}, year = {2018}, volume = {105}, editor = {Navarro, Gonzalo and Sankoff, David and Zhu, Binhai}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.7}, URN = {urn:nbn:de:0030-drops-87025}, doi = {10.4230/LIPIcs.CPM.2018.7}, annote = {Keywords: Lempel-Ziv 1977, Matching Statistics, Run-Length Compressed Burrows-Wheeler Transform} }

Document

**Published in:** LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

Encoding data structures store enough information to answer the queries they are meant to support but not enough to recover their underlying datasets. In this paper we give the first encoding data structure for the challenging problem of order-preserving pattern matching. This problem was introduced only a few years ago but has already attracted significant attention because of its applications in data analysis. Two strings are said to be an order-preserving match if the relative order of their characters is the same: e.g., (4, 1, 3, 2) and (10, 3, 7, 5) are an order-preserving match. We show how, given a string S[1..n] over an arbitrary alphabet of size sigma and a constant c >=1, we can build an O(n log log n)-bit encoding such that later, given a pattern P[1..m] with m >= log^c n, we can return the number of order-preserving occurrences of P in S in O(m) time. Within the same time bound we can also return the starting position of some order-preserving match for P in S (if such a match exists). We prove that our space bound is within a constant factor of optimal if log(sigma) = Omega(log log n); our query time is optimal if log(sigma) = Omega(log n). Our space bound contrasts with the Omega(n log n) bits needed in the worst case to store S itself, an index for order-preserving pattern matching with no restrictions on the pattern length, or an index for standard pattern matching even with restrictions on the pattern length. Moreover, we can build our encoding knowing only how each character compares to O(log^c n) neighbouring characters.

Travis Gagie, Giovanni Manzini, and Rossano Venturini. An Encoding for Order-Preserving Matching. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 38:1-38:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.ESA.2017.38, author = {Gagie, Travis and Manzini, Giovanni and Venturini, Rossano}, title = {{An Encoding for Order-Preserving Matching}}, booktitle = {25th Annual European Symposium on Algorithms (ESA 2017)}, pages = {38:1--38:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-049-1}, ISSN = {1868-8969}, year = {2017}, volume = {87}, editor = {Pruhs, Kirk and Sohler, Christian}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.38}, URN = {urn:nbn:de:0030-drops-78726}, doi = {10.4230/LIPIcs.ESA.2017.38}, annote = {Keywords: Compact data structures, encodings, order-preserving matching} }

Document

**Published in:** LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Let f : [1..n] -> [1..n] be a function, and l : [1..n] -> [1..s] indicate a label assigned to each element of the domain. We design several compact data structures that answer various queries on the labels of paths in f. For example, we can find the minimum label in f^k (i) for a given i and any k >= 0 in a given range [k1..k2], using n lg n + O(n) bits, or the minimum label in f^(-k) (i) for a given i and k > 0, using 2n lg n + O(n) bits, both in time O(lg n/ lg lg n). By using n lg s + o(n lg s) further bits, we can also count, within the same time, the number of elements within a range of labels, and report each such element in O(1 + lg s / lg lg n) additional time. Several other possible queries are considered, such as top-t queries and t-majorities.

Travis Gagie, Meng He, and Gonzalo Navarro. Path Queries on Functions. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 5:1-5:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{gagie_et_al:LIPIcs.CPM.2017.5, author = {Gagie, Travis and He, Meng and Navarro, Gonzalo}, title = {{Path Queries on Functions}}, booktitle = {28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)}, pages = {5:1--5:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-039-2}, ISSN = {1868-8969}, year = {2017}, volume = {78}, editor = {K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.5}, URN = {urn:nbn:de:0030-drops-73274}, doi = {10.4230/LIPIcs.CPM.2017.5}, annote = {Keywords: succinct data structures, integer functions, range queries, trees and permutations} }

Document

**Published in:** LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Important papers have appeared recently on the problem of indexing binary strings for jumbled pattern matching, and further lowering the time bounds in terms of the input size would now be a breakthrough with broad implications. We can still make progress on the problem, however, by considering other natural parameters. Badkobeh et al. (IPL, 2013) and Amir et al. (TCS, 2016) gave algorithms that index a binary string in O(n + r^2 log r) time, where n is the length and r is the number of runs, and Giaquinta and Grabowski (IPL, 2013) gave one that runs in O(n + r^2) time. In this paper we propose a new and very simple algorithm that also runs in O(n + r^2) time and can be extended either so that the index returns the position of a match (if there is one), or so that the algorithm uses only O(n) bits of space instead of O(n) words.

Luís Cunha, Simone Dantas, Travis Gagie, Roland Wittler, Luis Kowada, and Jens Stoye. Fast and Simple Jumbled Indexing for Binary Run-Length Encoded Strings. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 19:1-19:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{cunha_et_al:LIPIcs.CPM.2017.19, author = {Cunha, Lu{\'\i}s and Dantas, Simone and Gagie, Travis and Wittler, Roland and Kowada, Luis and Stoye, Jens}, title = {{Fast and Simple Jumbled Indexing for Binary Run-Length Encoded Strings}}, booktitle = {28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)}, pages = {19:1--19:9}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-039-2}, ISSN = {1868-8969}, year = {2017}, volume = {78}, editor = {K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.19}, URN = {urn:nbn:de:0030-drops-73418}, doi = {10.4230/LIPIcs.CPM.2017.19}, annote = {Keywords: string algorithms, indexing, jumbled pattern matching, run-length encoding} }

Document

**Published in:** Dagstuhl Seminar Proceedings, Volume 9281, Search Methodologies (2009)

A minimax tree is similar to a Huffman tree except that, instead of minimizing the weighted average of the leaves' depths, it minimizes the maximum of any leaf's weight plus its depth. Golumbic (1976) introduced minimax trees and gave a Huffman-like, $O (n log n)$-time algorithm for building them. Drmota and Szpankowski (2002) gave another $O (n log n)$-time algorithm, which takes linear time when the weights are already sorted by their fractional parts. In this paper we give the first linear-time algorithm for building minimax trees for unsorted real weights.

Pawel Gawrychowski and Travis Gagie. Minimax Trees in Linear Time with Applications. In Search Methodologies. Dagstuhl Seminar Proceedings, Volume 9281, pp. 1-11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2009)

Copy BibTex To Clipboard

@InProceedings{gawrychowski_et_al:DagSemProc.09281.4, author = {Gawrychowski, Pawel and Gagie, Travis}, title = {{Minimax Trees in Linear Time with Applications}}, booktitle = {Search Methodologies}, pages = {1--11}, series = {Dagstuhl Seminar Proceedings (DagSemProc)}, ISSN = {1862-4405}, year = {2009}, volume = {9281}, editor = {Rudolf Ahlswede and Ferdinando Cicalese and Ugo Vaccaro}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.09281.4}, URN = {urn:nbn:de:0030-drops-22421}, doi = {10.4230/DagSemProc.09281.4}, annote = {Keywords: Data structures, data compression, prefix-free coding} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail