Document

Complete Volume

**Published in:** LIPIcs, Volume 274, 31st Annual European Symposium on Algorithms (ESA 2023)

LIPIcs, Volume 274, ESA 2023, Complete Volume

31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 1-1700, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@Proceedings{grtz_et_al:LIPIcs.ESA.2023, title = {{LIPIcs, Volume 274, ESA 2023, Complete Volume}}, booktitle = {31st Annual European Symposium on Algorithms (ESA 2023)}, pages = {1--1700}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-295-2}, ISSN = {1868-8969}, year = {2023}, volume = {274}, editor = {G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2023}, URN = {urn:nbn:de:0030-drops-186529}, doi = {10.4230/LIPIcs.ESA.2023}, annote = {Keywords: LIPIcs, Volume 274, ESA 2023, Complete Volume} }

Document

Front Matter

**Published in:** LIPIcs, Volume 274, 31st Annual European Symposium on Algorithms (ESA 2023)

Front Matter, Table of Contents, Preface, Conference Organization

31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 0:i-0:xxii, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{grtz_et_al:LIPIcs.ESA.2023.0, author = {G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz}, title = {{Front Matter, Table of Contents, Preface, Conference Organization}}, booktitle = {31st Annual European Symposium on Algorithms (ESA 2023)}, pages = {0:i--0:xxii}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-295-2}, ISSN = {1868-8969}, year = {2023}, volume = {274}, editor = {G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2023.0}, URN = {urn:nbn:de:0030-drops-186535}, doi = {10.4230/LIPIcs.ESA.2023.0}, annote = {Keywords: Front Matter, Table of Contents, Preface, Conference Organization} }

Document

**Published in:** LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)

Given an alphabet Σ of σ = |Σ| symbols, a degenerate (or indeterminate) string X is a sequence X = X[0],X[1]…, X[n-1] of n subsets of Σ. Since their introduction in the mid 70s, degenerate strings have been widely studied, with applications driven by their being a natural model for sequences in which there is a degree of uncertainty about the precise symbol at a given position, such as those arising in genomics and proteomics. In this paper we introduce a new data structural tool for degenerate strings, called the subset wavelet tree (SubsetWT). A SubsetWT supports two basic operations on degenerate strings: subset-rank(i,c), which returns the number of subsets up to the i-th subset in the degenerate string that contain the symbol c; and subset-select(i,c), which returns the index in the degenerate string of the i-th subset that contains symbol c. These queries are analogs of rank and select queries that have been widely studied for ordinary strings. Via experiments in a real genomics application in which degenerate strings are fundamental, we show that subset wavelet trees are practical data structures, and in particular offer an attractive space-time tradeoff. Along the way we investigate data structures for supporting (normal) rank queries on base-4 and base-3 sequences, which may be of independent interest. Our C++ implementations of the data structures are available at https://github.com/jnalanko/SubsetWT.

Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi. Subset Wavelet Trees. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 4:1-4:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:LIPIcs.SEA.2023.4, author = {Alanko, Jarno N. and Biagi, Elena and Puglisi, Simon J. and Vuohtoniemi, Jaakko}, title = {{Subset Wavelet Trees}}, booktitle = {21st International Symposium on Experimental Algorithms (SEA 2023)}, pages = {4:1--4:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-279-2}, ISSN = {1868-8969}, year = {2023}, volume = {265}, editor = {Georgiadis, Loukas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.4}, URN = {urn:nbn:de:0030-drops-183549}, doi = {10.4230/LIPIcs.SEA.2023.4}, annote = {Keywords: degenerate strings, compressed data structures, succinct data structures, string processing, data structures, efficient algorithms} }

Document

**Published in:** LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)

Given a string X of length n on alphabet σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly-repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.

Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi, and Leena Salmela. Simple Runs-Bounded FM-Index Designs Are Fast. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 7:1-7:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{diazdominguez_et_al:LIPIcs.SEA.2023.7, author = {D{\'\i}az-Dom{\'\i}nguez, Diego and D\"{o}nges, Saska and Puglisi, Simon J. and Salmela, Leena}, title = {{Simple Runs-Bounded FM-Index Designs Are Fast}}, booktitle = {21st International Symposium on Experimental Algorithms (SEA 2023)}, pages = {7:1--7:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-279-2}, ISSN = {1868-8969}, year = {2023}, volume = {265}, editor = {Georgiadis, Loukas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.7}, URN = {urn:nbn:de:0030-drops-183579}, doi = {10.4230/LIPIcs.SEA.2023.7}, annote = {Keywords: data structures, efficient algorithms} }

Document

**Published in:** LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)

Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now cheap cost of DNA sequencing, such datasets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme hierarchical relative Lempel-Ziv (HRLZ) which form a rooted tree (or hierarchy) on the strings and then compress each string using RLZ with parent as reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a completed weighted digraph of the strings, with weights as the number of phrases in the RLZ parsing of the source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely effecting compression performance.

Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow. Hierarchical Relative Lempel-Ziv Compression. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 18:1-18:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.SEA.2023.18, author = {Bille, Philip and G{\o}rtz, Inge Li and Puglisi, Simon J. and Tarnow, Simon R.}, title = {{Hierarchical Relative Lempel-Ziv Compression}}, booktitle = {21st International Symposium on Experimental Algorithms (SEA 2023)}, pages = {18:1--18:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-279-2}, ISSN = {1868-8969}, year = {2023}, volume = {265}, editor = {Georgiadis, Loukas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.18}, URN = {urn:nbn:de:0030-drops-183680}, doi = {10.4230/LIPIcs.SEA.2023.18}, annote = {Keywords: Relative compression, Lempel-Ziv compression, RLZ, LZ77, string collections, compressed representation, data structures, efficient algorithms} }

Document

**Published in:** LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.

Zsuzsanna Lipták, Francesco Masillo, and Simon J. Puglisi. Suffix Sorting via Matching Statistics. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 20:1-20:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{liptak_et_al:LIPIcs.WABI.2022.20, author = {Lipt\'{a}k, Zsuzsanna and Masillo, Francesco and Puglisi, Simon J.}, title = {{Suffix Sorting via Matching Statistics}}, booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)}, pages = {20:1--20:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-243-3}, ISSN = {1868-8969}, year = {2022}, volume = {242}, editor = {Boucher, Christina and Rahmann, Sven}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.20}, URN = {urn:nbn:de:0030-drops-170545}, doi = {10.4230/LIPIcs.WABI.2022.20}, annote = {Keywords: Generalized suffix array, matching statistics, string collections, compressed representation, data structures, efficient algorithms} }

Document

Complete Volume

**Published in:** LIPIcs, Volume 202, 46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021)

LIPIcs, Volume 202, MFCS 2021, Complete Volume

46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 202, pp. 1-1560, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@Proceedings{bonchi_et_al:LIPIcs.MFCS.2021, title = {{LIPIcs, Volume 202, MFCS 2021, Complete Volume}}, booktitle = {46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021)}, pages = {1--1560}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-201-3}, ISSN = {1868-8969}, year = {2021}, volume = {202}, editor = {Bonchi, Filippo and Puglisi, Simon J.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2021}, URN = {urn:nbn:de:0030-drops-144396}, doi = {10.4230/LIPIcs.MFCS.2021}, annote = {Keywords: LIPIcs, Volume 202, MFCS 2021, Complete Volume} }

Document

Front Matter

**Published in:** LIPIcs, Volume 202, 46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021)

Front Matter, Table of Contents, Preface, Conference Organization

46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 202, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bonchi_et_al:LIPIcs.MFCS.2021.0, author = {Bonchi, Filippo and Puglisi, Simon J.}, title = {{Front Matter, Table of Contents, Preface, Conference Organization}}, booktitle = {46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021)}, pages = {0:i--0:xvi}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-201-3}, ISSN = {1868-8969}, year = {2021}, volume = {202}, editor = {Bonchi, Filippo and Puglisi, Simon J.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2021.0}, URN = {urn:nbn:de:0030-drops-144409}, doi = {10.4230/LIPIcs.MFCS.2021.0}, annote = {Keywords: Front Matter, Table of Contents, Preface, Conference Organization} }

Document

**Published in:** LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided 𝒪(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an 𝒪(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in 𝒪(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.

Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted Ancestors in Suffix Trees Revisited. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 8:1-8:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2021.8, author = {Belazzougui, Djamal and Kosolobov, Dmitry and Puglisi, Simon J. and Raman, Rajeev}, title = {{Weighted Ancestors in Suffix Trees Revisited}}, booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)}, pages = {8:1--8:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-186-3}, ISSN = {1868-8969}, year = {2021}, volume = {191}, editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.8}, URN = {urn:nbn:de:0030-drops-139594}, doi = {10.4230/LIPIcs.CPM.2021.8}, annote = {Keywords: suffix tree, weighted ancestors, irreducible LCP, deterministic substring hashing} }

Document

**Published in:** LIPIcs, Volume 190, 19th International Symposium on Experimental Algorithms (SEA 2021)

Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics - to name just two very different sources).
The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space consuming than the best specialized indexes for the problem.

Simon J. Puglisi and Bella Zhukova. Document Retrieval Hacks. In 19th International Symposium on Experimental Algorithms (SEA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 190, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{puglisi_et_al:LIPIcs.SEA.2021.12, author = {Puglisi, Simon J. and Zhukova, Bella}, title = {{Document Retrieval Hacks}}, booktitle = {19th International Symposium on Experimental Algorithms (SEA 2021)}, pages = {12:1--12:12}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-185-6}, ISSN = {1868-8969}, year = {2021}, volume = {190}, editor = {Coudert, David and Natale, Emanuele}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2021.12}, URN = {urn:nbn:de:0030-drops-137848}, doi = {10.4230/LIPIcs.SEA.2021.12}, annote = {Keywords: String Processing, Pattern matching, Document listing, Document retrieval, Succinct data structures, Repetitive text collections} }

Document

**Published in:** LIPIcs, Volume 160, 18th International Symposium on Experimental Algorithms (SEA 2020)

Compact hash tables store a set S of n key-value pairs, where the keys are from the universe U = {0,…,u-1}, and the values are v-bit integers, in close to B(u, n) + nv bits of space, where {b(u, n)} = log₂ binom(u,n) is the information-theoretic lower bound for representing the set of keys in S, and support operations insert, delete and lookup on S.
Compact hash tables have received significant attention in recent years, and approaches dating back to Cleary [IEEE T. Comput, 1984], as well as more recent ones have been implemented and used in a number of applications. However, the wins on space usage of these approaches are outweighed by their slowness relative to conventional hash tables. In this paper, we demonstrate that compact hash tables based upon a simple idea of bucketing practically outperform existing compact hash table implementations in terms of memory usage and construction time, and existing fast hash table implementations in terms of memory usage (and sometimes also in terms of construction time).
A related notion is that of a compact Hash ID map, which stores a set Ŝ of n keys from U, and implicitly associates each key in Ŝ with a unique value (its ID), chosen by the data structure itself, which is an integer of magnitude O(n), and supports inserts and lookups on Ŝ, while using close to B(u,n) bits. One of our approaches is suitable for use as a compact Hash ID map.

Dominik Köppl, Simon J. Puglisi, and Rajeev Raman. Fast and Simple Compact Hashing via Bucketing. In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{koppl_et_al:LIPIcs.SEA.2020.7, author = {K\"{o}ppl, Dominik and Puglisi, Simon J. and Raman, Rajeev}, title = {{Fast and Simple Compact Hashing via Bucketing}}, booktitle = {18th International Symposium on Experimental Algorithms (SEA 2020)}, pages = {7:1--7:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-148-1}, ISSN = {1868-8969}, year = {2020}, volume = {160}, editor = {Faro, Simone and Cantone, Domenico}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2020.7}, URN = {urn:nbn:de:0030-drops-120817}, doi = {10.4230/LIPIcs.SEA.2020.7}, annote = {Keywords: compact hashing, hash table, separate chaining} }

Document

**Published in:** LIPIcs, Volume 160, 18th International Symposium on Experimental Algorithms (SEA 2020)

We consider the problem of identifying patterns of interest in colored strings. A colored string is a string in which each position is colored with one of a finite set of colors. Our task is to find substrings that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically infer properties of the embedded system from the analysis of its simulation traces. We show that the number of interesting patterns is upper-bounded by 𝒪(n²) where n is the length of the string. We introduce a baseline algorithm with 𝒪(n²) running time which identifies all interesting patterns for all colors in the string satisfying certain minimality conditions. When one is interested in patterns related to only one color, we provide an algorithm that identifies patterns in 𝒪(n²log n) time, but is faster than the first algorithm in practice, both on simulated and on real-world patterns.

Zsuzsanna Lipták, Simon J. Puglisi, and Massimiliano Rossi. Pattern Discovery in Colored Strings. In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 12:1-12:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{liptak_et_al:LIPIcs.SEA.2020.12, author = {Lipt\'{a}k, Zsuzsanna and Puglisi, Simon J. and Rossi, Massimiliano}, title = {{Pattern Discovery in Colored Strings}}, booktitle = {18th International Symposium on Experimental Algorithms (SEA 2020)}, pages = {12:1--12:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-148-1}, ISSN = {1868-8969}, year = {2020}, volume = {160}, editor = {Faro, Simone and Cantone, Domenico}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2020.12}, URN = {urn:nbn:de:0030-drops-120862}, doi = {10.4230/LIPIcs.SEA.2020.12}, annote = {Keywords: property testing, suffix tree, pattern mining} }

Document

**Published in:** LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

A string S[1,n] is a power (or repetition or tandem repeat) of order k and period n/k, if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k to be a string composed of k distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string will have antiperiod t if it is prefix of an antipower with antiperiod t. In this paper, we describe efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time.

Hayam Alamro, Golnaz Badkobeh, Djamal Belazzougui, Costas S. Iliopoulos, and Simon J. Puglisi. Computing the Antiperiod(s) of a String. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 32:1-32:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{alamro_et_al:LIPIcs.CPM.2019.32, author = {Alamro, Hayam and Badkobeh, Golnaz and Belazzougui, Djamal and Iliopoulos, Costas S. and Puglisi, Simon J.}, title = {{Computing the Antiperiod(s) of a String}}, booktitle = {30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)}, pages = {32:1--32:11}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-103-0}, ISSN = {1868-8969}, year = {2019}, volume = {128}, editor = {Pisanti, Nadia and P. Pissis, Solon}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.32}, URN = {urn:nbn:de:0030-drops-105035}, doi = {10.4230/LIPIcs.CPM.2019.32}, annote = {Keywords: antiperiod, antipower, power, period, repetition, run, string} }

Document

**Published in:** LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Lastly, we demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time. Kohdista is available at https://github.com/mmuggli/KOHDISTA/.

Martin D. Muggli, Simon J. Puglisi, and Christina Boucher. A Succinct Solution to Rmap Alignment. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{muggli_et_al:LIPIcs.WABI.2018.12, author = {Muggli, Martin D. and Puglisi, Simon J. and Boucher, Christina}, title = {{A Succinct Solution to Rmap Alignment}}, booktitle = {18th International Workshop on Algorithms in Bioinformatics (WABI 2018)}, pages = {12:1--12:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-082-8}, ISSN = {1868-8969}, year = {2018}, volume = {113}, editor = {Parida, Laxmi and Ukkonen, Esko}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.12}, URN = {urn:nbn:de:0030-drops-93143}, doi = {10.4230/LIPIcs.WABI.2018.12}, annote = {Keywords: Optical mapping, index based data structures, FM-index, graph algorithms} }

Document

**Published in:** LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)

While long reads produced by third-generation sequencing technology from, e.g, Pacific Biosciences have been shown to increase the quality of draft genomes in repetitive regions, fundamental computational challenges remain in overcoming their high error rate and assembling them efficiently. In this paper we show that the de Bruijn graph built on the long reads can be efficiently and substantially disentangled using optical mapping data as auxiliary information. Fundamental to our approach is the use of the positional de Bruijn graph and a succinct data structure for constructing and traversing this graph. Our experimental results show that over 97.7% of directed cycles have been removed from the resulting positional de Bruijn graph as compared to its non-positional counterpart. Our results thus indicate that disentangling the de Bruijn graph using positional information is a promising direction for developing a simple and efficient assembly algorithm for long reads.

Bahar Alipanahi, Leena Salmela, Simon J. Puglisi, Martin Muggli, and Christina Boucher. Disentangled Long-Read De Bruijn Graphs via Optical Maps. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 1:1-1:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{alipanahi_et_al:LIPIcs.WABI.2017.1, author = {Alipanahi, Bahar and Salmela, Leena and Puglisi, Simon J. and Muggli, Martin and Boucher, Christina}, title = {{Disentangled Long-Read De Bruijn Graphs via Optical Maps}}, booktitle = {17th International Workshop on Algorithms in Bioinformatics (WABI 2017)}, pages = {1:1--1:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-050-7}, ISSN = {1868-8969}, year = {2017}, volume = {88}, editor = {Schwartz, Russell and Reinert, Knut}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.1}, URN = {urn:nbn:de:0030-drops-76614}, doi = {10.4230/LIPIcs.WABI.2017.1}, annote = {Keywords: Positional de Bruijn graph, Genome Assembly, Long Read Data, Optical maps} }

Document

Complete Volume

**Published in:** LIPIcs, Volume 75, 16th International Symposium on Experimental Algorithms (SEA 2017)

LIPIcs, Volume 75, SEA'17, Complete Volume

16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@Proceedings{iliopoulos_et_al:LIPIcs.SEA.2017, title = {{LIPIcs, Volume 75, SEA'17, Complete Volume}}, booktitle = {16th International Symposium on Experimental Algorithms (SEA 2017)}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-036-1}, ISSN = {1868-8969}, year = {2017}, volume = {75}, editor = {Iliopoulos, Costas S. and Pissis, Solon P. and Puglisi, Simon J. and Raman, Rajeev}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2017}, URN = {urn:nbn:de:0030-drops-76644}, doi = {10.4230/LIPIcs.SEA.2017}, annote = {Keywords: Analysis of Algorithms and Problem Complexity, Algorithms} }

Document

Front Matter

**Published in:** LIPIcs, Volume 75, 16th International Symposium on Experimental Algorithms (SEA 2017)

Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers

16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 0:i-0:xiv, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{iliopoulos_et_al:LIPIcs.SEA.2017.0, author = {Iliopoulos, Costas S. and Pissis, Solon P. and Puglisi, Simon J. and Raman, Rajeev}, title = {{Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers}}, booktitle = {16th International Symposium on Experimental Algorithms (SEA 2017)}, pages = {0:i--0:xiv}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-036-1}, ISSN = {1868-8969}, year = {2017}, volume = {75}, editor = {Iliopoulos, Costas S. and Pissis, Solon P. and Puglisi, Simon J. and Raman, Rajeev}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2017.0}, URN = {urn:nbn:de:0030-drops-76006}, doi = {10.4230/LIPIcs.SEA.2017.0}, annote = {Keywords: Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers} }

Document

**Published in:** LIPIcs, Volume 80, 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017)

The suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the LCPs for lexicographically adjacent suffixes of a string. Together the two arrays are roughly equivalent to the suffix tree with the LCP array representing the tree shape.
In order to better understand the combinatorics of LCP arrays, we consider the problem of inferring a string from an LCP array, i.e., determining whether a given array of integers is a valid LCP array, and if it is, reconstructing some string or all strings with that LCP array. There are recent studies of inferring a string from a suffix tree shape but using significantly more information (in the form of suffix links) than is available in the LCP array.
We provide two main results. (1) We describe two algorithms for inferring strings from an LCP array when we allow a generalized form of LCP array defined for a multiset of cyclic strings: a linear time algorithm for binary alphabet and a general algorithm with polynomial time complexity for a constant alphabet size. (2) We prove that determining whether a given integer array is a valid LCP array is NP-complete when we require more restricted forms of LCP array defined for a single cyclic or non-cyclic string or a multiset of non-cyclic strings. The result holds whether or not the alphabet is restricted to be binary. In combination, the two results show that the generalized form of LCP array for a multiset of cyclic strings is fundamentally different from the other more restricted forms.

Juha Kärkkäinen, Marcin Piatkowski, and Simon J. Puglisi. String Inference from Longest-Common-Prefix Array. In 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 80, pp. 62:1-62:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{karkkainen_et_al:LIPIcs.ICALP.2017.62, author = {K\"{a}rkk\"{a}inen, Juha and Piatkowski, Marcin and Puglisi, Simon J.}, title = {{String Inference from Longest-Common-Prefix Array}}, booktitle = {44th International Colloquium on Automata, Languages, and Programming (ICALP 2017)}, pages = {62:1--62:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-041-5}, ISSN = {1868-8969}, year = {2017}, volume = {80}, editor = {Chatzigiannakis, Ioannis and Indyk, Piotr and Kuhn, Fabian and Muscholl, Anca}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2017.62}, URN = {urn:nbn:de:0030-drops-74989}, doi = {10.4230/LIPIcs.ICALP.2017.62}, annote = {Keywords: LCP array, string inference, BWT, suffix array, suffix tree, NP-hardness} }

Document

**Published in:** LIPIcs, Volume 66, 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017)

Lyndon factorization and Lempel-Ziv (LZ) factorization are both important tools for analysing the structure and complexity of strings, but their combinatorial structure is very different. In this paper, we establish the first direct connection between the two by showing that while the Lyndon factorization can be bigger than the non-overlapping LZ factorization (which we demonstrate by describing a new, non-trivial family of strings) it is always less than twice the size.

Juha Kärkkäinen, Dominik Kempa, Yuto Nakashima, Simon J. Puglisi, and Arseny M. Shur. On the Size of Lempel-Ziv and Lyndon Factorizations. In 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 66, pp. 45:1-45:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{karkkainen_et_al:LIPIcs.STACS.2017.45, author = {K\"{a}rkk\"{a}inen, Juha and Kempa, Dominik and Nakashima, Yuto and Puglisi, Simon J. and Shur, Arseny M.}, title = {{On the Size of Lempel-Ziv and Lyndon Factorizations}}, booktitle = {34th Symposium on Theoretical Aspects of Computer Science (STACS 2017)}, pages = {45:1--45:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-028-6}, ISSN = {1868-8969}, year = {2017}, volume = {66}, editor = {Vollmer, Heribert and Vall\'{e}e, Brigitte}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2017.45}, URN = {urn:nbn:de:0030-drops-69878}, doi = {10.4230/LIPIcs.STACS.2017.45}, annote = {Keywords: Lempel-Ziv factorization, Lempel-Ziv parsing, LZ, Lyndon word, Lyndon factorization, Standard factorization} }

Document

**Published in:** LIPIcs, Volume 54, 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)

Given a string S of n symbols, a longest common extension query LCE(i,j) asks for the length of the longest common prefix of the $i$th and $j$th suffixes of S. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015:65-76) described several data structures for answering LCE queries that offers a space-time trade-off between data structure size and query time. In particular, for a parameter 1 <= tau <= n, their best deterministic solution is a data structure of size O(n/tau) which allows LCE queries to be answered in O(tau) time. However, the construction time for all deterministic versions of their data structure is quadratic in n. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of O(tau * min(log(tau),log(n/tau)) query time using O(n/tau) space, but significantly improve the construction time to O(n*tau).

Yuka Tanimura, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, Simon J. Puglisi, and Masayuki Takeda. Deterministic Sub-Linear Space LCE Data Structures With Efficient Construction. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 1:1-1:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{tanimura_et_al:LIPIcs.CPM.2016.1, author = {Tanimura, Yuka and I, Tomohiro and Bannai, Hideo and Inenaga, Shunsuke and Puglisi, Simon J. and Takeda, Masayuki}, title = {{Deterministic Sub-Linear Space LCE Data Structures With Efficient Construction}}, booktitle = {27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)}, pages = {1:1--1:10}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-012-5}, ISSN = {1868-8969}, year = {2016}, volume = {54}, editor = {Grossi, Roberto and Lewenstein, Moshe}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2016.1}, URN = {urn:nbn:de:0030-drops-60655}, doi = {10.4230/LIPIcs.CPM.2016.1}, annote = {Keywords: longest common extension, longest common prefix, sparse suffix array} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail