Document

**Published in:** LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

A factorization of a string S is a partition of S into substrings u_1, …, u_k such that S = u_1 u_2 ⋯ u_k. Such a partition is called equality-free if no two factors are equal: u_i ≠ u_j for all i, j with i ≠ j. The maximum equality-free factorization problem is to find, for a given string S, the largest integer k for which S admits an equality-free factorization with k factors.
Equality-free factorizations have lately received attention because of their applications in DNA self-assembly. The best approximation algorithm known for the problem is the natural greedy algorithm, which iteratively chooses, from left to right, the shortest factor that has not appeared before. This algorithm has a √n approximation ratio (SOFSEM 2020), and it is an open problem whether there is a better solution.
Our main result is to show that the natural greedy algorithm is a Θ(n^{1/4}) approximation algorithm for the maximum equality-free factorization problem. Thus, we disprove one of the conjectures of Mincu and Popa (SOFSEM 2020) according to which the greedy algorithm is a Θ(√n) approximation.
The most challenging part of the proof is to show that the greedy algorithm is an O(n^{1/4}) approximation. We obtain this bound via prefix free factor families, i.e., sets of non-overlapping factors of the string, none of which is a prefix of another. In the paper we show the relation between prefix free factor families and the maximum equality-free factorization. Moreover, as a byproduct, we present another approximation algorithm that achieves an approximation ratio of O(n^{1/4}); we believe it is of independent interest and may lead to improved algorithms. We then show, via a clever analysis, that the approximation ratio of the natural greedy algorithm is Ω(n^{1/4}), which establishes that the greedy algorithm is a Θ(n^{1/4}) approximation for the maximum equality-free factorization problem.
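The greedy strategy described in the abstract can be illustrated with a minimal Python sketch. The function name `greedy_equality_free_factorization` and the handling of a leftover suffix (appending it to the last factor) are assumptions made for illustration; the abstract does not spell out these details, and the sketch makes no attempt at an efficient implementation.

```python
def greedy_equality_free_factorization(s: str) -> list[str]:
    """Greedy sketch: scan left to right and repeatedly cut off the shortest
    prefix of the remaining text that has not been used as a factor yet."""
    factors: list[str] = []
    seen: set[str] = set()
    i, n = 0, len(s)
    while i < n:
        j = i + 1
        # Extend the candidate factor until it differs from all earlier factors.
        while j <= n and s[i:j] in seen:
            j += 1
        if j <= n:
            factor = s[i:j]
            factors.append(factor)
            seen.add(factor)
            i = j
        else:
            # Assumption: every prefix of the remaining suffix is already used,
            # so the leftover is appended to the last factor.
            factors[-1] += s[i:]
            break
    return factors


print(greedy_equality_free_factorization("abab"))  # ['a', 'b', 'ab']
```

Each factor is the shortest unused prefix at its position, so the slicing-based search takes quadratic time in the worst case; a trie over the chosen factors would be a natural refinement.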

Matan Kraus, Moshe Lewenstein, Alexandru Popa, Ely Porat, and Yonathan Sadia. String Factorization via Prefix Free Families. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 19:1-19:10, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)


@InProceedings{kraus_et_al:LIPIcs.CPM.2023.19,
  author =	{Kraus, Matan and Lewenstein, Moshe and Popa, Alexandru and Porat, Ely and Sadia, Yonathan},
  title =	{{String Factorization via Prefix Free Families}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{19:1--19:10},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.19},
  URN =		{urn:nbn:de:0030-drops-179738},
  doi =		{10.4230/LIPIcs.CPM.2023.19},
  annote =	{Keywords: string factorization, NP-hard problem, approximation algorithm}
}

Document

**Published in:** LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

We consider the problem of finding, given two documents of total length n, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic 𝒪(n)-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutions require Θ(n) space, which is prohibitive in many applications. To address this issue, Starikovskaya and Vildhøj (CPM 2013) showed that for n^{2/3} ≤ s ≤ n, the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/s) time. Kociumaka et al. (ESA 2014) generalized this tradeoff to 1 ≤ s ≤ n, thus providing a smooth time-space tradeoff from constant to linear space. In this paper, we obtain a significant speed-up for instances where the length L of the sought LCS is large. For 1 ≤ s ≤ n, we show that the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/(L⋅s) + n) time. The result is based on techniques originating from the LCS with Mismatches problem (Flouri et al., 2015; Charalampopoulos et al., CPM 2018), on space-efficient locally consistent parsing (Birenzwige et al., SODA 2020), and on the structure of maximal repetitions (runs) in the input documents.
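To make the problem statement concrete, the following Python sketch solves LCS with the textbook dynamic program, keeping only one row of the table at a time. It is a baseline for illustration only and not the algorithm of the paper: it runs in time proportional to the product of the two document lengths and uses space proportional to the shorter document, so it achieves none of the tradeoffs above. The function name `longest_common_substring` is an assumption.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Textbook DP baseline: the cell for (i, j) holds the length of the longest
    common suffix of a[:i] and b[:j]; only the previous row is kept."""
    if len(b) > len(a):
        a, b = b, a  # keep the DP row as long as the shorter document
    prev = [0] * (len(b) + 1)
    best_len = 0
    best_end = 0  # exclusive end index in `a` of the best match found so far
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("banana", "ananas"))  # anana
```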

Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, and Matan Kraus. Time-Space Tradeoffs for Finding a Long Common Substring. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 5:1-5:14, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)


@InProceedings{bennun_et_al:LIPIcs.CPM.2020.5,
  author =	{Ben-Nun, Stav and Golan, Shay and Kociumaka, Tomasz and Kraus, Matan},
  title =	{{Time-Space Tradeoffs for Finding a Long Common Substring}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{5:1--5:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.5},
  URN =		{urn:nbn:de:0030-drops-121302},
  doi =		{10.4230/LIPIcs.CPM.2020.5},
  annote =	{Keywords: longest common substring, time-space tradeoff, local consistency, periodicity}
}
