Longest Common Substring with Gaps and Related Problems

Authors: Aranya Banerjee, Daniel Gibney, and Sharma V. Thankachan

Published in: LIPIcs, Volume 308, 32nd Annual European Symposium on Algorithms (ESA 2024)

The longest common substring (also known as longest common factor) and longest common subsequence problems are two well-studied classical string problems. The former is solvable in optimal 𝒪(n) time for two strings of length m and n with m ≤ n, and the latter is solvable in 𝒪(nm) time, which is conditionally optimal under the Strong Exponential Time Hypothesis. In this work, we study the problem of longest common factor with gaps, that is, finding a set of at most k matching substrings obeying precedence conditions with maximum total length. For k = 1, this is equivalent to the longest common factor problem, and for k = m, this is equivalent to the longest common subsequence problem. Our work demonstrates that, for constant k, this problem can be solved in strongly subquadratic time, i.e., nm^{1 - Θ(1)}. Motivated by co-linear chaining applications in Computational Biology, we further demonstrate that the longest common factor with gaps results can be extended to the case where the matches are restricted to maximal exact matches (MEMs). To further demonstrate the applicability of our techniques, we show that a similar approach can be used for a restricted version of the episode matching problem where one seeks an ordered set of at most k matches whose concatenation equals a query pattern P and the length of the substring of T containing the matches is minimized. These solutions all run in strongly subquadratic time for constant k.
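
As an illustration of how the parameter k interpolates between the two classical problems, the following is a minimal sketch of a straightforward O(k·n·m) dynamic program. It is a baseline only, not the strongly subquadratic algorithm of the paper. The function name `lcf_with_gaps` and the reading of "precedence conditions" as blocks that appear in the same left-to-right order, without overlap, in both strings are assumptions made for this example.

```python
def lcf_with_gaps(a: str, b: str, k: int) -> int:
    """Maximum total length of at most k matching substrings that occur
    in the same order (and without overlap) in both a and b.

    Baseline O(k * n * m) dynamic program, for illustration only.
    """
    n, m = len(a), len(b)
    # prev[i][j]: best total using at most c-1 blocks inside a[:i] and b[:j]
    prev = [[0] * (m + 1) for _ in range(n + 1)]
    for _ in range(k):
        # end[i][j]: best total when the current block ends exactly at
        # a[i-1] and b[j-1] (0 if these characters differ)
        end = [[0] * (m + 1) for _ in range(n + 1)]
        # best[i][j]: best total using at most c blocks inside a[:i], b[:j]
        best = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if a[i - 1] == b[j - 1]:
                    # either extend the current block or start a new one
                    end[i][j] = 1 + max(end[i - 1][j - 1], prev[i - 1][j - 1])
                best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
        prev = best
    return prev[n][m]


print(lcf_with_gaps("abcde", "abxde", 1))  # longest common substring length
print(lcf_with_gaps("abcde", "abxde", 2))  # two blocks: "ab" and "de"
```

With k = 1 the recurrence reduces to the textbook longest common substring table, and once k reaches the length of the shorter string every matched character can form its own block, so the value equals the longest common subsequence length.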

Aranya Banerjee, Daniel Gibney, and Sharma V. Thankachan. Longest Common Substring with Gaps and Related Problems. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 16:1-16:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

The Maximum Duo-Preservation String Mapping Problem with Bounded Alphabet

Authors: Nicolas Boria, Laurent Gourvès, Vangelis Th. Paschos, and Jérôme Monnot

Published in: LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Given two strings A and B such that B is a permutation of A, the max duo-preservation string mapping (MPSM) problem asks to find a mapping π between them so as to preserve a maximum number of duos. A duo is any pair of consecutive characters in a string, and it is preserved by π if its two consecutive characters in A are mapped to the same two consecutive characters in B. This problem has received growing attention in recent years, partly as an alternative way to produce approximation algorithms for its minimization counterpart, min common string partition, a widely studied problem due to its applications in comparative genomics. Considering this favored field of application with a small alphabet, it is surprising that MPSM^𝓁, the variant of MPSM with bounded alphabet, has received so little attention, with a single yet impressive work that provides a 2.67-approximation achieved in O(n) time [Brubach, 2018], where n = |A| = |B|. Our work focuses on MPSM^𝓁, and our main contribution is the demonstration that this problem admits a Polynomial Time Approximation Scheme (PTAS) when 𝓁 = O(1). We also provide an alternate, somewhat simpler, proof of NP-hardness for this problem compared with the NP-hardness proof presented in [Haitao Jiang et al., 2012].
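
To make the objective concrete, here is a small sketch that counts the duos preserved by a given mapping. It only evaluates a candidate solution and does not implement the PTAS. The function name `preserved_duos` and the encoding of π as a list `pi`, where `pi[i]` is the position in B that receives position i of A, are assumptions made for this example.

```python
def preserved_duos(a: str, b: str, pi: list[int]) -> int:
    """Count the duos of a that the mapping pi preserves in b.

    pi must be a bijection between positions that maps every character
    of a onto an equal character of b.
    """
    n = len(a)
    if len(b) != n or sorted(pi) != list(range(n)):
        raise ValueError("pi must be a bijection between positions of a and b")
    if any(a[i] != b[pi[i]] for i in range(n)):
        raise ValueError("pi must map each character to an equal character")
    # the duo (i, i+1) is preserved when its images are consecutive in b
    return sum(1 for i in range(n - 1) if pi[i + 1] == pi[i] + 1)


# "aba" (positions 0..2 of A) is mapped onto "aba" (positions 1..3 of B)
print(preserved_duos("abab", "baba", [1, 2, 3, 0]))
```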

Nicolas Boria, Laurent Gourvès, Vangelis Th. Paschos, and Jérôme Monnot. The Maximum Duo-Preservation String Mapping Problem with Bounded Alphabet. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 5:1-5:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

A 2-Approximation Algorithm for the Complementary Maximal Strip Recovery Problem

Authors: Haitao Jiang, Jiong Guo, Daming Zhu, and Binhai Zhu

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

The Maximal Strip Recovery problem (MSR) and its complement (CMSR) are well-studied NP-hard problems in computational genomics. The input to these dual problems consists of two signed permutations. The goal is to delete some gene markers from both permutations such that, in the remaining permutations, each gene marker has at least one common neighbor. Equivalently, the resulting permutations can be partitioned into common strips of length at least two. MSR asks to maximize the number of remaining gene markers, while the objective of CMSR is to delete the minimum number of gene markers. In this paper, we present a new approximation algorithm for the Complementary Maximal Strip Recovery (CMSR) problem. Our approximation factor is 2, improving on the currently best 7/3-approximation algorithm. Although the improvement in the factor is not huge, the analysis is greatly simplified by a compensating method, commonly referred to as the non-oblivious local search technique. In such a method, a substitution may not always increase the value of the current solution (it may sometimes even decrease the solution value), though it always improves the value of another function seemingly unrelated to the objective function.

Haitao Jiang, Jiong Guo, Daming Zhu, and Binhai Zhu. A 2-Approximation Algorithm for the Complementary Maximal Strip Recovery Problem. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 5:1-5:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Can a permutation be sorted by best short swaps?

Authors: Shu Zhang, Daming Zhu, Haitao Jiang, Jingjing Ma, Jiong Guo, and Haodi Feng

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

A short swap switches two elements with at most one element between them. Sorting a permutation by short swaps asks to find a shortest sequence of short swaps that transforms one permutation into another. A short swap can eliminate at most three inversions. It is still open whether a permutation can be sorted by short swaps each of which eliminates three inversions. In this paper, we present a polynomial time algorithm to solve the problem, which can decide whether a permutation can be sorted by short swaps each of which eliminates 3 inversions in O(n) time, and if so, sort the permutation by such short swaps in O(n^2) time, where n is the number of elements in the permutation. A short swap can cause the total length of two element vectors to decrease by at most 4. We further propose an algorithm to recognize a permutation which can be sorted by short swaps each of which causes the element vector length sum to decrease by 4 in O(n) time, and if so, sort the permutation by such short swaps in O(n^2) time. This improves upon the O(n^2) algorithm proposed by Heath and Vergara to decide whether a permutation is so-called lucky.
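
The notions of short swap and eliminated inversions can be sketched as follows. This is a brute-force illustration of the definitions only, not the O(n) decision procedure of the paper; the helper names `inversions`, `short_swap`, `best_short_swaps`, and `max_inversion_drop` are assumptions made for this example.

```python
from itertools import permutations


def inversions(p):
    """Number of pairs (i, j) with i < j and p[i] > p[j]."""
    return sum(1 for i in range(len(p))
               for j in range(i + 1, len(p)) if p[i] > p[j])


def short_swap(p, i, j):
    """Swap positions i and j, which may have at most one element between them."""
    if not 1 <= abs(i - j) <= 2:
        raise ValueError("a short swap spans at most three positions")
    q = list(p)
    q[i], q[j] = q[j], q[i]
    return q


def best_short_swaps(p):
    """All short swaps (i, j) on p that eliminate exactly three inversions."""
    base = inversions(p)
    return [(i, j) for i in range(len(p)) for j in (i + 1, i + 2)
            if j < len(p) and base - inversions(short_swap(p, i, j)) == 3]


def max_inversion_drop(n):
    """Largest inversion decrease of any short swap over all permutations of size n."""
    return max(inversions(p) - inversions(short_swap(p, i, j))
               for p in permutations(range(n))
               for i in range(n) for j in (i + 1, i + 2) if j < n)


print(best_short_swaps([3, 2, 1]))  # swapping the two outer elements sorts it
print(max_inversion_drop(4))        # exhaustive check of the "at most three" bound
```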

Shu Zhang, Daming Zhu, Haitao Jiang, Jingjing Ma, Jiong Guo, and Haodi Feng. Can a permutation be sorted by best short swaps? In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 14:1-14:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Genomic Scaffold Filling Revisited

Authors: Haitao Jiang, Chenglin Fan, Boting Yang, Farong Zhong, Daming Zhu, and Binhai Zhu

Published in: LIPIcs, Volume 54, 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)

The genomic scaffold filling problem has attracted a lot of attention recently. The problem is to fill an incomplete sequence (scaffold) I into I', with respect to a complete reference genome G, such that the number of adjacencies between G and I' is maximized. The problem is NP-complete and APX-hard, and admits a 1.2-approximation. However, the sequence input I is not quite practical and does not fit most real datasets (where a scaffold is more often given as a list of contigs). In this paper, we revisit the genomic scaffold filling problem by considering this important case, where (1) a scaffold S is given, the missing genes X = c(G) - c(S) can only be inserted in between the contigs, and the objective is to maximize the number of adjacencies between G and the filled S'; and (2) a scaffold S is given, a subset of the missing genes X' ⊂ X = c(G) - c(S) can only be inserted in between the contigs, and the objective is still to maximize the number of adjacencies between G and the filled S''. For problem (1), we present a simple NP-completeness proof, then a factor-2 greedy approximation algorithm, and finally we show that the problem is FPT when each gene appears at most d times in G. For problem (2), we prove that the problem is W[1]-hard and then present a factor-2 FPT-approximation for the case when each gene appears at most d times in G.
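
The two quantities in the problem statement, the missing genes X = c(G) - c(S) and the number of adjacencies shared by G and a filled scaffold, can be sketched as below. The abstract does not define adjacencies, so this example assumes a common convention for unsigned sequences with repeated genes: an adjacency is an unordered pair of neighboring genes, and shared adjacencies are counted by multiset intersection. The function names are likewise assumptions made for this example.

```python
from collections import Counter


def missing_genes(genome, contigs):
    """Multiset difference c(G) - c(S) for a scaffold given as a list of contigs."""
    present = Counter(gene for contig in contigs for gene in contig)
    return Counter(genome) - present


def adjacency_multiset(seq):
    """Multiset of unordered neighboring pairs (assumed adjacency definition)."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))


def common_adjacencies(genome, filled):
    """Size of the multiset intersection of the two adjacency multisets."""
    shared = adjacency_multiset(genome) & adjacency_multiset(filled)
    return sum(shared.values())


G = "abcde"
S = ["ab", "e"]                          # scaffold as a list of contigs
print(missing_genes(G, S))               # genes c and d are missing
print(common_adjacencies(G, "abdce"))    # one way to fill S between its contigs
print(common_adjacencies(G, "abcde"))    # a filling that recovers every adjacency
```

Both fillings insert the missing genes only between the contigs `ab` and `e`, as problem (1) requires; they differ in how many adjacencies of G they recover.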

Haitao Jiang, Chenglin Fan, Boting Yang, Farong Zhong, Daming Zhu, and Binhai Zhu. Genomic Scaffold Filling Revisited. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 15:1-15:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

