Approximability of Longest Run Subsequence and Complementary Minimization Problems
Abstract
We study the polynomial-time approximability of the Longest Run Subsequence problem (LRS for short) and its complementary minimization variant, the Minimum Run Subsequence Deletion problem (MRSD for short). For a string S over an alphabet Σ, a subsequence of S is a sequence obtained from S by deleting zero or more characters without changing the order of the remaining characters. A run of a symbol a in S is a maximal substring of consecutive occurrences of a. A run subsequence of S is a subsequence of S in which every symbol occurs in at most one run. The co-subsequence of a subsequence S′ in S is the subsequence obtained by deleting all the characters of S′ from S. Given a string S, the goal of LRS (resp., MRSD) is to find a run subsequence S′ of S such that the length of S′ is maximized (resp., the number of symbols deleted from S is minimized) over all the run subsequences of S. Let p be the maximum number of occurrences of a symbol in the input S. It is known that LRS and MRSD are APX-hard even when p is a small constant. In this paper, we show that LRS can be approximated in polynomial time within improved factors, with separate ratios for small p and for general p. Furthermore, we show that MRSD can be approximated in linear time, with approximation factors given separately for even p and odd p.
Keywords and phrases: Longest run subsequence, minimum run subsequence deletion, approximation algorithm
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
Funding: The work was partially supported by the NSERC Canada, JSPS KAKENHI Grant Numbers JP20H05967, JP22H00513, JP22K11915, JP24H00697, JP24K02898, JP24K02902, JP24K14827, and JP25K03077, and CRONOS Grant Number JPMJCS24K2.
Editors: Broňa Brejová and Rob Patro
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Scaffolding is one of the key informatics processes in DNA sequencing. DNA sequencing is generally carried out through the following steps: (i) Tens to hundreds of millions of DNA fragments extracted from random positions are read via shotgun sequencing, (ii) the extracted random fragments (reads) are assembled into a series of contiguous sequences (contigs) using an assembly algorithm, and (iii) finally, the contigs are arranged in the correct order based on certain criteria. This step (iii) is called scaffolding, which serves as the original motivation of this study. One common approach to scaffolding is to rearrange contigs by comparing multiple incomplete assemblies of related samples (see [10] for example).
The formulation of contig rearrangement from multiple incomplete assemblies in the scaffolding phase as a string processing problem by Schrinner et al. [11, 12] is known as the Longest Run Subsequence problem (LRS). Let Σ be a finite alphabet of symbols. A string S is a sequence of characters, each of which is a symbol in Σ; two or more characters in S can be the same symbol in Σ. For a string S, |S| denotes the length of S. For two strings S1 and S2, S1 · S2 denotes the concatenation of S1 and S2. A subsequence of S is a sequence obtained from S by deleting zero or more characters without changing the order of the remaining characters. Let S[i] denote the character of S in the ith position for 1 ≤ i ≤ |S|, and let S[i..j] denote the substring of S that starts from the ith position and ends at the jth position. An a-run in S is a substring S[i..j] such that S[i] = S[i+1] = ⋯ = S[j] = a for a symbol a ∈ Σ, but S[i−1] ≠ a and S[j+1] ≠ a (whenever these positions exist). Given a string S on alphabet Σ, a run subsequence of S is a subsequence of S in which every symbol occurs in at most one run. For the string S = aabba, for example, the substring bb is a b-run, and aabb is a run subsequence of S.
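The defining property of a run subsequence (every symbol contributes at most one maximal block of equal characters) can be checked directly. The following is a small Python sketch; the function and variable names are our own choices, not taken from the paper:

```python
from itertools import groupby

def is_run_subsequence(t: str) -> bool:
    """Return True iff every symbol of t occurs in at most one run,
    i.e., no symbol starts a second maximal block of equal characters."""
    seen = set()
    for symbol, _ in groupby(t):  # groupby yields one group per maximal run
        if symbol in seen:        # symbol already appeared in an earlier run
            return False
        seen.add(symbol)
    return True
```

For example, `aabb` is a run subsequence, while `abab` is not, since both `a` and `b` would occur in two runs.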
Problem 1 (Longest Run Subsequence problem, LRS).
Given a string S in Σ*, the goal of LRS is to find a longest run subsequence S* of S, i.e., every symbol occurs in at most one run in S* and the length |S*| is maximized over all the run subsequences of S.
For the string S = aabbab, for example, the longest run subsequence of S is aabbb, of length five. If the maximum number of occurrences of each symbol in the input string is bounded by p, then the problem is called the p-Longest Run Subsequence problem (p-LRS). One sees that 1-LRS is trivial, since the length of every run in the input string is one, and thus the input itself is the optimal run subsequence. Unfortunately, Schrinner et al. [12] showed that LRS is generally NP-hard. Subsequently, Dondi and Sikora [5] showed that p-LRS is APX-hard even for constant p, while, as a positive result, they provided a polynomial-time approximation algorithm for p-LRS. Recently, Asahiro et al. [2] improved the approximation ratio for p-LRS, and showed that for small p, a still better approximation ratio can be achieved.
In this paper, we first derive further improved approximability results for -LRS:
Theorem 1.
-LRS can be approximated in time within factors of for or , and for every .
This paper also considers the complementary minimization variant of LRS, called the Minimum Run Subsequence Deletion problem (MRSD). The co-subsequence of a subsequence S′ in S is the subsequence obtained by deleting all the characters of S′ from S. For example, consider S = abcab. Then, for the subsequence S′ = ab consisting of the first a and the first b, the co-subsequence of S′ is cab. Note that for a subsequence S′ of a string S, the co-subsequence is not unique unless we specify the position of each character of S′ in S: for S = abaca and S′ = aa (without indices from which these a’s come), the candidates for the co-subsequence are bca, bac, and abc, all of the same length three. As will be seen in the following, only the number of deleted characters matters in some cases, but we often need to take care of where each character is deleted from.
Problem 2 (Minimum Run Subsequence Deletion problem, MRSD).
Given a string S in Σ*, the goal of MRSD is to find a run subsequence S* of S such that the number of symbols deleted from S is minimized over all the run subsequences of S.
Similarly to p-LRS, if the maximum number of occurrences of each symbol in the input string is bounded by p, the problem is called the p-Minimum Run Subsequence Deletion problem (p-MRSD). Since the run subsequence obtained by minimizing the number of deletions in MRSD corresponds exactly to the longest run subsequence in LRS, LRS and MRSD are essentially equivalent as decision problems. However, due to the difference in the objective functions, MRSD may exhibit different characteristics from LRS in terms of approximability. Thus, a natural question arises: which problem is easier/harder to approximate, or are they equally hard?
To gain insight into this question, we consider two examples of problem pairs that are essentially equivalent as decision problems but differ in their objective functions, similar to LRS and MRSD. The first example is the pair of Max-2SAT and its deletion variant, Min-2SAT Deletion. It is known that 2SAT can be solved in polynomial time [3]. However, in the case where the instance is unsatisfiable, the objective of Max-2SAT is to find a truth assignment that maximizes the number of satisfied clauses (a maximization problem). In contrast, the objective of Min-2SAT Deletion is to minimize the number of unsatisfied clauses (a minimization problem). For these problems, the following results are known: The maximization version, Max-2SAT, admits a 1.0638-approximation algorithm [9], but it is NP-hard to approximate within a factor of 1.0476 [6]. On the other hand, the best known approximation ratio for the minimization version, Min-2SAT Deletion, is O(√(log n)) [1], and it is known to be NP-hard to approximate within a factor of 1.36067 [4]. Thus, intuitively, the maximization version is easier to approximate than the deletion version for these two problems.
Another example is the pair consisting of the Maximum Independent Set problem (MaxIS) and the Minimum Vertex Cover problem (MinVC). For any graph G = (V, E) and any independent set I of G, the complement V ∖ I forms a vertex cover. Thus, MinVC can be seen as the complementary minimization variant of MaxIS. For these problems, the following results are known: MaxIS is NP-hard to approximate within a factor of n^{1−ε} for any ε > 0 [13]. In contrast, MinVC is known to admit a 2-approximation algorithm [7]. Thus, contrary to the previous pair, in this case, the maximization version is harder to approximate than the deletion version.
As the second contribution, this paper investigates the approximability of p-MRSD:
Theorem 2.
-MRSD can be approximated in linear time within a factor of for even , and for odd .
Namely, unlike the two aforementioned examples, LRS and MRSD can be considered as problems that currently share similar approximation ratios. We remark that the basic strategies of the proposed approximation algorithms for MRSD are very similar to those for LRS, but the analyses of the approximation ratios are quite different.
Notation.
For each symbol a ∈ Σ, a^q denotes a length-q a-run, and ℓ_a denotes the length of the longest a-run in the input string S. Let occ(a) be the number of occurrences of a in the input string S, and let p = max_{a ∈ Σ} occ(a). Without loss of generality, we assume that the number of occurrences of each symbol in S is at least one, and if it is exactly one, then we say the symbol is unique.
Example 3.
Consider a string . Then includes two , three , two , two , and one , where each length is one; and a length- -run . Therefore, the length of the longest run for , , , , or is , and for is , respectively. That is, and . The number of occurrences of is three, is unique, i.e., , and .
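The quantities used in this example (the length of the longest run of each symbol and its total number of occurrences) can be computed in a single scan. The following Python sketch uses function names of our own choosing:

```python
from collections import Counter
from itertools import groupby

def run_statistics(s: str):
    """For each symbol a in s, compute the length of the longest a-run
    and the total number of occurrences of a."""
    occurrences = Counter(s)          # number of occurrences per symbol
    longest_run: dict = {}
    for symbol, group in groupby(s):  # one group per maximal run
        longest_run[symbol] = max(longest_run.get(symbol, 0), len(list(group)))
    return longest_run, occurrences
```

For instance, in `aabbbab` the longest `a`-run has length two, the longest `b`-run has length three, and `a` occurs three times in total.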
2 Approximation algorithms for Longest Run Subsequence
In this section we consider LRS and design approximation algorithms for p-LRS. That is, we assume that the maximum number of symbol occurrences in the input is always bounded by p.
2.1 Preprocessing
We first introduce an inserting operation to preprocess the input string. For every symbol a whose longest a-run has length at least three, we create an auxiliary symbol. The auxiliary alphabet that contains all the auxiliary symbols is denoted as . The inserting operation inserts a copy of the auxiliary symbol after the first two consecutive symbols in an a-run of length at least three. We repeatedly apply the inserting operation until no run of length at least three remains for any symbol.
Operation (An inserting operation).
Given an a-run of length at least three in the string, the operation inserts a copy of the auxiliary symbol after the first two consecutive symbols aa.
Example 4.
Recall the string in Example 3 and . After the preprocessing, becomes , and the resulting string is .
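Our reading of the preprocessing can be sketched as follows: every maximal run of length at least three is broken up by inserting copies of an auxiliary symbol (rendered here as `'$' + symbol`, an encoding of our own) after each pair of consecutive equal characters, so that no run of length three or more survives. This is a hedged sketch, not necessarily the paper's exact procedure:

```python
from itertools import groupby

def preprocess(s: str) -> list:
    """Break every maximal run of length >= 3 by inserting an auxiliary
    symbol after each pair of consecutive equal characters."""
    out = []
    for symbol, group in groupby(s):      # one group per maximal run
        q = len(list(group))
        if q < 3:                         # short runs are left untouched
            out.extend([symbol] * q)
            continue
        aux = "$" + symbol                # auxiliary symbol for this symbol
        while q > 0:
            take = min(2, q)
            out.extend([symbol] * take)
            q -= take
            if q > 0:                     # more copies follow: break the run
                out.append(aux)
    return out
```

For instance, `aaab` becomes `a a $a a b`, in which every run has length at most two.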
One clearly sees that in the preprocessed string, the length of the longest run of any symbol is at most two (and every auxiliary symbol occurs only in runs of length one). Let denote the subset of symbols ’s such that , and then let .
2.2 The algorithm
We present an algorithm ALG1 to compute a run subsequence for the preprocessed string obtained by applying the above preprocessing to the input.
Definition 5.
For a symbol a that is not unique, if every two consecutive occurrences of a in the string are separated by at least two symbols, then a is a good symbol. Otherwise, a is a bad symbol, and every substring in the form axa, where x is a single occurrence of another symbol, is a bad segment associated with a.
Using Definition 5, we partition the alphabet into three subsets, containing all the unique symbols, all the good symbols, and all the bad symbols, respectively. One sees that such a partition can be computed by a single scan of the string.
Our algorithm constructs an initial solution and then applies two local search operations to update the solution. Initially, for each unique or good symbol, the algorithm picks its leftmost longest run into the solution; for each bad symbol, the algorithm picks the leftmost occurrence in the first bad segment associated with it into the solution. Note that all these picked runs are in the same order as they show up in the preprocessed string, i.e., the solution is a subsequence of it. We continue to illustrate using the string in Example 4.
Example 6.
Consider the string . We have , , , and . The initial solution is , which is obtained by picking the symbols in the boxes as follows:
Observe that if the symbol after the picked bad symbol in the associated bad segment is not picked, then we can add the second bad symbol in the bad segment to the solution to increase its length by one. In the sequel, we aim to do this by possibly swapping some picked good symbols with their respective copies at the other places.
To this purpose, we further partition into two subsets and , where contains those bad symbols, each of which has a length- run in the solution . At the beginning, and .
We design two local operations to repeatedly improve the initial solution . The first operation is almost the above observation, while the second goes slightly further to swap a picked symbol.
Operation (Local operation-1 for ).
Given a symbol in a bad segment such that its symbol is not picked into , the operation replaces in by the two copies of in this bad segment, and moves from to .
Operation (Local operation-2 for ).
Given a symbol in a bad segment such that its symbol is picked into , the operation finds another occurrence of in that does not break any length- runs in , replaces the picked in by this occurrence and replaces in by the two copies of in this bad segment, and moves from to .
We remark that while applying the above two local operations, the -run for each and the -run for each are untouched; for each , it appears exactly once in , while for each , it appears twice in and they are picked from a bad segment associated with . The goal of the local search is to reduce the number of bad symbols in as much as possible, and the process terminates when none of the two local operations is applicable. A high-level description of the algorithm is as follows:
Input: The sequence obtained from preprocessing .
Output: A subsequence of .
We examine the time complexity of ALG1. First notice that and . Therefore, the partition of into can be done in time. One sees that ALG1 executes at most local operations, and finding a symbol in to which the local operation-1 is applicable takes time and finding a symbol in to which the local operation-2 is applicable takes time. It follows that ALG1 runs in .
We continue to illustrate using the string in Example 4.
Example 7.
Consider the string . From Example 6, the initial solution is and , so that and .
The local operation-1 is applicable for which is associated with only one bad segment , where its is not picked. Thus is updated to by picking the symbols in the boxes as follows:
and and .
One sees that the local operation-1 is no longer applicable for symbols and in , but then the local operation-2 is applicable to in the first bad segment associated with , since the second occurrence of does not break any length- runs in . As a result, is updated to by picking the symbols in the boxes as follows:
and and . For the last symbol , we can apply neither the local operation-1 nor the local operation-2, and thus the algorithm terminates.
2.3 Post-processing
We process the achieved solution to produce a solution for the input . Recall that is the result of preprocessing by the inserting operations, each of which inserts a copy of the auxiliary symbol after the first two consecutive symbols in a run of length at least three. For each auxiliary symbol , we delete its single copy from and then replace the in by a longest -run in the input sequence . We remark that is either unique or good, and thus it appears exactly once in , and that no symbol of the longest -run in the input sequence breaks any length- run in .
We continue to illustrate using the string in Example 3.
Example 8.
Consider the string , for which , where is inserted by the inserting operation. From Example 7, the achieved solution for is by picking the symbols in the boxes as follows:
The post-processing gives the solution for by picking the symbols in the boxes as follows:
2.4 Performance analysis
In this section, we analyze the worst-case ratio of our algorithm. Let and denote the output sequence by our algorithm and an optimal solution when the input sequence is , respectively. Let .
Lemma 9.
For an input string , let be the result of preprocessing by the inserting operation. Then is satisfied.
Proof.
is a subsequence of . Therefore, .
We next prove . Note that for , it corresponds to a symbol with in . If in , then is unique in . Otherwise, in , then is a good symbol since in between every two symbols, there is a with . That is, and thus the length of in is exactly one. Therefore, the total length of symbols and in is exactly three. It indicates that since we choose the longest -run whose length is at least three in . Then the lemma is proved.
We remark that since the approximation ratio of the string is always worse than the one of by Lemma 9, we show a bound on the approximation ratio of instead of the original input .
Now, we are ready to prove the worst-case ratio of our algorithm. Let be the number of unique symbols whose length is in the optimal solution for . Recall that for each symbol , the length of the -run in must be one. We use to denote the number of symbols in whose length is in for . Similarly, let (, respectively) be the number of symbols in (, respectively) whose length is in for . Lastly, for each symbol , the length of the -run in is two. We denote by the number of symbols in whose length is in for . In conclusion, we have the following two equalities:
(1)
(2)
Let be the total number of symbols deleted from by the optimal algorithm to obtain . Notice that if the length of the -run is in and is not unique, then the total number of deleted symbols is at most . Therefore the upper bound on is
(3)
Then we estimate a lower bound on as follows.
Lemma 10.
For the number of deleted symbols , we have
Proof.
Since each symbol is a good symbol, there exist at least two other symbols in between every two -runs. In order to generate a length- -run in , at least symbols are deleted.
For each symbol , in . So, forming a length- -run must delete at least symbols.
Lastly, consider a symbol . Recall that . It indicates that we must delete at least one symbol to generate every length- -run. So for each symbol in , we must delete at least symbols.
This completes the proof of this lemma.
Lemma 11.
The summation of satisfies the following inequality:
Proof.
The proof appears in Appendix A.
We can show the following theorem:
Theorem 1. [Restated, see original statement.]
-LRS can be approximated in time within factors of for or , and for every .
Proof sketch..
Let be a constant; to get the best worst-case ratio, we will set for , and for later.
By Eq.(2.4) and Eq.(1), we have
(5)
(6)
We discuss two cases of . When , and and . By Lemma 11, Eqs. (2.4) and (2), the following inequality holds.
(7)
On the other hand, when , and thus . By Lemma 11, we have
(8)
Here we obtain several inequalities that are used in the following. If , then , and . Moreover, , and . Note that the following inequality holds:
By Eqs.(8) and (2), Eq.(2.4) can be simplified into the following inequality:
(9)
We note that the analysis on the approximation ratio of for or is strictly tight. However, for the ratio of , we know only a bad example for which the approximation ratio is ; there remains a slight gap. See Appendix C for details.
3 Simple approximation algorithm for MRSD
In this section, we consider the deletion variant -MRSD. Again, assume that is bounded by . As a warm-up, we design a simple approximation algorithm ALG2 with approximation ratios if is even, and if is odd:
Input: An input string in which every symbol appears at most times.
Output: A run subsequence of .
That is, ALG2 deletes all unselected runs for every symbol from .
Example 12.
Consider the input string of length for MRSD. The leftmost longest -run, -run, and -run are , , and , respectively. Therefore, the output subsequence of ALG2 is , and thus the co-subsequence of is . Therefore, the number of deleted characters from by our algorithm ALG2 is nine.
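The core of ALG2 (keep, for each symbol, only its leftmost longest run and delete every other occurrence) can be sketched as follows; function names are our own:

```python
from itertools import groupby

def alg2(s: str) -> str:
    """Keep only the leftmost longest run of each symbol; delete the rest.
    The result is a run subsequence of s."""
    # Locate all maximal runs as (symbol, start, length).
    runs, pos = [], 0
    for symbol, group in groupby(s):
        length = len(list(group))
        runs.append((symbol, pos, length))
        pos += length
    # Leftmost longest run per symbol (strict '>' keeps the leftmost on ties).
    best = {}
    for symbol, start, length in runs:
        if symbol not in best or length > best[symbol][1]:
            best[symbol] = (start, length)
    kept = sorted(best.values())  # restore left-to-right order
    return "".join(s[start:start + length] for start, length in kept)
```

For example, on `aabbaaab` the algorithm keeps the runs `bb` and `aaa` and outputs `bbaaa`, deleting three characters.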
Clearly, ALG2 can be implemented in linear time. We bound its approximation ratio in the following. Let S be an input string of p-MRSD, and consider the solutions obtained for S by an optimal algorithm OPT and by our algorithm ALG2, respectively. An outline of our proof of the approximation ratio is as follows: (I) we first obtain an upper bound on the number of characters deleted by ALG2; then, (II) we bound this upper bound from above by a factor (depending on the parity of p) times the number of characters deleted by OPT.
(I) We obtain an upper bound on the number of deleted characters by our algorithm ALG2. To do so, we first construct a new string , called a “marked” string, by replacing every character in the co-subsequence of with an auxiliary symbol , called a “marked” symbol (character). Let be the alphabet of marked symbols, where if . Then, if the th character of is in , then it is replaced with the corresponding marked symbol . Furthermore, we replace every marked character in with a new symbol , and call it the “-string” of . For example, consider as an input for MRSD again, where . One can verify that is an optimal solution. Then, we obtain the marked string , where is the alphabet of the marked symbols. By replacing every marked character in with , we obtain the -string . Let be the number of -runs in the -string. For example, includes four -runs, i.e., , three length- -runs and one length- -run. Note that holds.
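The construction of the marked string from a fixed optimal solution, and the counting of runs of marked characters, can be sketched as follows. Here we render every marked character as `*` (our own stand-in for the paper's marked alphabet), and the set of deleted positions is 0-indexed:

```python
from itertools import groupby

def star_string(s: str, deleted: set) -> str:
    """Replace every character at a deleted position (the co-subsequence
    of a fixed optimal solution) with '*'."""
    return "".join("*" if i in deleted else ch for i, ch in enumerate(s))

def count_star_runs(t: str) -> int:
    """Number of maximal runs of '*' in the marked string."""
    return sum(1 for symbol, _ in groupby(t) if symbol == "*")
```

For instance, deleting positions 1 and 2 of `abcab` yields the marked string `a**ab`, which contains one run of marked characters.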
Next, just for the sake of the analysis of the approximation ratios, we introduce a new algorithm ALG2’ that takes the -string as input. ALG2’ is very similar to ALG2, but it deletes all the -runs. It is important to note that ALG2’ is used only in the analysis, so we never actually need to find an optimal solution. Later, we show that the number of characters deleted by ALG2’ for the -string is a good upper bound on the number of characters deleted by ALG2.
Input: The -string of over the alphabet .
Output: A run subsequence of .
Example 13.
Consider the -string of . Then is the output of ALG2’.
Since all -runs are deleted from , must be feasible for -MRSD on the original input . Then, we get the following upper bound on :
Lemma 14.
For any input string and its -string , the following inequalities hold:
-
1.
;
-
2.
.
Proof.
(1) in the -string is at most in the original string for each . Therefore, holds. (2) From and , the number of characters deleted from is at least the number of deleted characters from .
(II) Next, we consider an upper bound on the number of deleted characters by ALG2’ on . The crux of the following estimation is the number of deleted characters from of by ALG2’.
Lemma 15.
For any input string and its -string for any optimal solution , the following inequality holds:
Proof.
We first divide -runs in the -string into two types, (type-i) for , and (type-ii) for an integer . Suppose that can be represented by . Recall that if we delete all ’s from , then the remaining sequence must be an optimal solution, i.e., a run subsequence. Therefore, if the middle -run is in (type-i) and (resp., ) includes a , then (resp., ) does not include . On the other hand, we can see that the second type -run partitions some -run in the optimal solution into the left -run and the right -run in the -string .
For example, look at a -string for and . We focus on the six -runs in . Since the left and the right runs (or characters) of the second (also, the third and the fourth) -run in are the same, it is in (type-ii). Here, we can see that one long -run of length is divided into four -runs, , , and . Note that ALG2’ selects the longest -run for each and deletes all other -runs from . Therefore, three -runs of the four -runs are deleted by ALG2’. If -runs divide a -run of length at most into -runs, then the length of each -run except for the longest -run is bounded above by . This implies that the number of ’s deleted from for a symbol is bounded above by per -run in (type-ii) in the worst case. On the other hand, the left and the right runs of the first (also, the fifth and the sixth) -run in are different, and thus it is in (type-i). For example, in the left substring of the fifth length- -run, neither nor appears since and are included in the right substring of the -run. Therefore, after deleting , we can independently count the numbers of characters in that are deleted from the left substring and from the right substring by ALG2’.
Let the number of -runs in (type-ii) be . Note that holds and the total number of ’s in (type-i) and (type-ii) deleted from is . Also, the total number of ’s deleted from for all symbols in is bounded above by . Hence, we obtain the following inequality on the number of characters deleted by ALG2’:
This completes the proof of this lemma.
Theorem 16.
ALG2 is a linear-time approximation algorithm with approximation ratios if is even, and if is odd.
4 Improved approximation algorithm for MRSD
In this section, we present an improved approximation algorithm ALG3 that runs in linear time and its approximation ratios are if is even, and if is odd.
4.1 Concatenation operation
We first define an alternating -run:
Definition 17.
An alternating -run (simply, an alternating run) in is a substring for an integer such that (i) , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ).
For example, consider a string . In the string , , , , , and are an alternating -run, an alternating -run, an alternating -run, an alternating -run, and an alternating -run, respectively.
Then, we introduce a concatenation operation to obtain a -run from an alternating -run in the string . Consider an alternating -run . Then, the concatenation operation deletes all characters that are not from , i.e., , , , , and obtains a -run. For a string , however, there are two possibilities, and by deleting two ’s and three ’s, respectively. The operation finds a concatenation (i.e., run) as long as possible using an optimal algorithm for the Interval Scheduling problem (see, e.g., [8]): We regard the alternating run as the interval of weight . A pair of two alternating runs and is independent if or holds. The concatenation operation aims to find a maximum weight subset of mutually independent alternating runs from , and obtain a long -run from the selected alternating -run by deleting all the characters that are not for every .
Operation (Concatenation operation).
Given the input string , the operation obtains a concatenated sequence :
- (Step 1)
-
Find all the alternating runs in .
- (Step 2)
-
Select a maximum subset of mutually independent alternating runs in .
- (Step 3)
-
Delete all characters from so that every alternating -run in becomes a -run, and obtain a concatenated sequence .
Example 18.
Consider again a string . In Step 1, we find an alternating -run of weight three, an alternating -run of weight three, an alternating -run of weight three, an alternating -run of weight seven, and an alternating -run of weight five. Then, in Step 2, we select whose total weight is . Finally, we delete four characters , , , from , and obtain .
We estimate the running time of the concatenation operation. Note that the number of alternating runs overlapped at the same position is at most two, and there are alternating runs in . Hence Step 1 can be implemented in time by scanning the string from left to right. Furthermore, during Step 1, we can sort the alternating runs according to their right-ends in . Among alternating runs, we can select the maximum independent set by greedily selecting the non-overlapped alternating run with the leftmost right-end [8] in Step 2, which takes time. Step 3 needs time. Therefore, the total running time of the concatenation operation is .
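The greedy selection in Step 2 (repeatedly take the compatible interval with the leftmost right end) is the classic interval-scheduling rule [8]. A sketch, with intervals represented as (left, right) pairs of our own devising:

```python
def max_independent_runs(intervals):
    """Greedily select a maximum-cardinality set of pairwise disjoint
    intervals by always taking the one with the leftmost right end."""
    chosen, last_right = [], float("-inf")
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if left > last_right:        # disjoint from everything chosen so far
            chosen.append((left, right))
            last_right = right
    return chosen
```

Sorting by right endpoint dominates the running time, matching the bound claimed above for Step 2.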
4.2 The algorithm
We present an improved approximation algorithm ALG3 with approximation ratios if is even, and if is odd.
Input: An input string in which every symbol appears at most times.
Output: A run subsequence of .
For example, if the concatenation operation produces from , then ALG3 outputs . It is clear that ALG3 can obtain a feasible solution in time.
4.3 Approximation ratios
Let and be the solutions obtained by an optimal algorithm OPT and by ALG3 for an input string , respectively. An outline of our proof of the approximation ratio is very similar to Section 3: (I) we first obtain an upper bound on the number of characters deleted by ALG3, compared with the number of characters deleted by the optimal algorithm; then, (II) we bound this upper bound from above by a factor, depending on the parity of p, times the optimal number of deletions. Again, we first construct the marked string from using the optimal algorithm, and design a slightly weaker algorithm ALG3’ than ALG3.
Similarly to Section 3, we construct the marked string by replacing every character in the co-subsequence with a symbol . Let be the alphabet of marked symbols. For example, consider a string of length . Then, is an optimal solution and is the marked string obtained from . Note that the optimal solution deletes and thus includes . Furthermore, does not appear in since must be the run subsequence and includes ’s. In addition, does not include since ’s appear in and .
Similarly to a -run for , for the marked symbol , a -run can be defined. Also, without distinguishing from , we consider “mixed” -runs, -type, -type, and -type, for some positive integers , , and in the following. That is, for example, is regarded as one mixed run of length five. Note that we do not need to consider the -type since marked strings are constructed based on optimal solutions. Furthermore, we define an alternating-“mixed” run without distinguishing from as follows:
Definition 19.
An alternating-mixed -run (or, simply alternating-mixed run) in is a substring which satisfies, (Case 1), (Case 2), or (Case 3):
- (Case 1)
-
-type. (i) , and for , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ). That is, and are the alternating -run and the alternating -run, respectively, and for a symbol .
- (Case 2)
-
-type. (i) , and for , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ). That is, and are the alternating -run and the alternating -run, respectively, and for a symbol .
- (Case 3)
-
-type. (i) , , and , for , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ). That is, and are the alternating -runs, and the middle is the alternating -run, and for symbols .
For example, consider the marked string of length . In the string , , and are the alternating -run, the alternating-mixed -run in (Case 3), and the alternating-mixed -run in (Case 2), respectively.
Here, we introduce the concatenation operation for marked strings by slightly modifying the concatenation operation.
Operation (Concatenation operation for marked strings).
Given the marked string of the input string , the operation obtains a concatenated sequence :
- (Step 1)
-
Find all the alternating runs and all the alternating-mixed runs on in .
- (Step 2)
-
Select a maximum subset of mutually independent alternating/alternating-mixed runs in .
- (Step 3)
-
Delete all characters that are neither nor from so that every alternating -run, every alternating -run and every alternating-mixed -run in become a -run, a -run and a mixed -run, respectively, and obtain the concatenated sequence .
Example 20.
Consider the marked string of length . In Step 1, we find an alternating -run of weight three, an alternating -run of weight three, an alternating -run of weight three, an alternating-mixed -run of weight seven, and an alternating-mixed -run of weight five. Then, in Step 2, we select whose total weight is . Finally, we delete four characters , , , and from , and obtain .
To obtain an upper bound on , we introduce the following algorithm ALG3’:
Input: The marked string of over the alphabet .
Output: A run subsequence of .
Note that the replacement in the second step of ALG3’ does not create a new -run, but only increases the length of the -run which originally appears in .
Example 21.
If the concatenated sequence is , then we replace in the rightmost mixed run, obtain , and finally output of length seven. Recall that an optimal solution is of length nine.
Using arguments very similar to the proof of Lemma 14, we obtain the following lemma:
Lemma 22.
For any input and its marked string , the following inequalities hold:
-
1.
;
-
2.
.
Recall that the optimal algorithm OPT deletes the number of characters from the input string . In the following, we show that, given the marked string of , the number of characters deleted by the worse algorithm ALG3’ from is bounded above by if is even, and if is odd.
We first consider the case :
Lemma 23.
Suppose that . Then, for and its marked string , the following inequality holds:
Proof.
We investigate the concatenation operation for marked strings. Consider a marked symbol . Assume that . Suppose that the marked string includes a substring . Then, even if the middle character is deleted in the concatenation operation for marked strings, this deletion of does not increase the number of deleted characters compared to the number of deleted characters by the optimal algorithm. Therefore, we assume that remains in the concatenated sequence after the concatenation operation for marked strings. Note that does not include a substring . The reason is as follows. The rightmost implies that the optimal algorithm deletes the rightmost , but if it is not deleted, then the length of the optimal solution increases by one, which is a contradiction. One sees that it is enough to count the deleted characters in independently within substrings of length three or four in , since here we assume that .
-
Suppose that includes a substring . If the middle is deleted, then is obtained by the concatenation operation for marked strings. ALG3’ deletes those two ’s in the third step. In other words, ALG3’ deletes one character in per two marked characters in (i.e., in per one in ).
-
Suppose that includes a substring . If the right is deleted, then we obtain and the middle is replaced by in the second step of ALG3’. Here, we can see that misses one but adds one , i.e., the number of deleted characters remains the same for the substring.
In summary, ALG3’ deletes one character in for every two marked characters in . That is, ALG3’ possibly deletes all the marked characters in and, in addition, at most nonmarked characters in total. From these considerations, the upper bound on the number of characters deleted by ALG3’ can be calculated as follows:
(10)
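The deletion counts appearing in bounds such as the one above can be sanity-checked by brute force on tiny strings. The following is an illustrative, exponential-time sketch based only on the definitions in the introduction (not on ALG3'); the function names are ours:

```python
from itertools import combinations

def is_run_subsequence(s):
    """True iff every symbol of s occurs in at most one maximal run of s."""
    seen, prev = set(), None
    for c in s:
        if c != prev:            # a new run starts here
            if c in seen:        # this symbol already formed an earlier run
                return False
            seen.add(c)
            prev = c
    return True

def min_run_subsequence_deletion(s):
    """Minimum number of characters whose deletion leaves a run
    subsequence of s (tries deletion counts k = 0, 1, ...; tiny s only)."""
    n = len(s)
    for k in range(n + 1):
        for kept in combinations(range(n), n - k):
            if is_run_subsequence("".join(s[i] for i in kept)):
                return k
    return n
```

For example, one deletion suffices for "abab" (delete the second a to obtain "abb"), while "ababab" needs two deletions.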
Next, we obtain the following lemma for , whose proof appears in Appendix B.
Lemma 24.
Suppose that . Then, for and its marked string , the following inequality holds:
Finally, we show the case :
Lemma 25.
Suppose that . Then, for and its marked string , the following inequality holds:
Proof.
We count the deleted characters that appear in , during the concatenation operation for marked strings. Note that if includes a substring consisting of at least two consecutive marked characters, then the substring may partition some -run in into the left -run and the right -run in , and the length of the shorter -run is at most .
Next, suppose that the marked string includes a substring , and that and hold in the substring. Then, all ’s are deleted in the concatenation operation for marked strings since the alternating -run is longer than the alternating -run. Here, the number of characters deleted by the concatenation operation remains the same as the number deleted by the optimal algorithm. If or , then all ’s in can be deleted by the concatenation operation for marked strings and the -run is produced, i.e., characters in can be additionally deleted by ALG3’.
-
Suppose that the length of the produced -run by the concatenation operation for marked strings is . This implies that ’s that appear in the optimal solution are deleted while ’s remain in the concatenated sequence obtained in the first step of ALG3’.
-
Suppose that after the first step of ALG3’, we obtain the concatenated sequence . Moreover, suppose that includes the -run of length at least two. Then, possibly includes a substring for . Namely, we can see that the -run of length is divided into two runs of length and by the -run of length . Recall that in the third step ALG3’ selects the longest -run. That is, if , then characters are deleted; otherwise, characters are deleted by ALG3’. From , at most characters are deleted if the length of the -run is at least two.
In summary, the number of characters deleted from in is one for each single , and for each substring of consecutive marked characters for .
Suppose that includes single marked characters whose left and right characters are not marked, and substrings of at least two consecutive marked characters . Note that , and now we assume that . Hence, we obtain the following inequality on the number of characters deleted by ALG3’ from the marked string :
This completes the proof.
Theorem 2. [Restated, see original statement.]
-MRSD can be approximated in linear time within a factor of for even , and for odd .
Finally, we can show that the analyses of the approximation ratios of and are strictly tight. For details, the reader is referred to Appendix D.
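As a point of comparison for the tight examples in Appendix D, the run-selection rule used there — keep, for each symbol, its leftmost longest run and delete every other occurrence — can be sketched as follows. This is a simplified illustration of that selection rule only, not the paper's full ALG3, and the function names are ours:

```python
def runs(s):
    """Maximal runs of s, e.g. "aabba" -> ["aa", "bb", "a"]."""
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] += c
        else:
            out.append(c)
    return out

def keep_leftmost_longest_runs(s):
    """For each symbol keep its leftmost longest run; delete the rest.
    The output is always a run subsequence of s."""
    rs = runs(s)
    best = {}  # symbol -> index of its leftmost longest run
    for i, r in enumerate(rs):
        if r[0] not in best or len(r) > len(rs[best[r[0]]]):
            best[r[0]] = i
    kept = set(best.values())
    return "".join(r for i, r in enumerate(rs) if i in kept)
```

On "abab" this rule keeps "ab", deleting two characters, while a single deletion already yields the run subsequence "abb" — the kind of gap that the tight examples in Appendix D exploit in a more refined form.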
References
- [1] Amit Agarwal, Moses Charikar, Konstantin Makarychev, and Yury Makarychev. approximation algorithms for min UnCut, min 2CNF deletion, and directed cut problems. In Harold N. Gabow and Ronald Fagin, editors, Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005, pages 573–581. ACM, 2005. doi:10.1145/1060590.1060675.
- [2] Yuichi Asahiro, Hiroshi Eto, Mingyang Gong, Jesper Jansson, Guohui Lin, Eiji Miyano, Hirotaka Ono, and Shunichi Tanaka. Approximation algorithms for the longest run subsequence problem. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 2:1–2:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.2.
- [3] Bengt Aspvall, Michael F. Plass, and Robert E. Tarjan. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Information Processing Letters, 8(3):121–123, 1979. doi:10.1016/0020-0190(79)90002-4.
- [4] Irit Dinur and Shmuel Safra. The importance of being biased. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 33–42. ACM, 2002. doi:10.1145/509907.509915.
- [5] Riccardo Dondi and Florian Sikora. The longest run subsequence problem: Further complexity results. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 14:1–14:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.CPM.2021.14.
- [6] Johan Håstad. Some optimal inapproximability results. Journal of the ACM, 48(4):798–859, 2001. doi:10.1145/502090.502098.
- [7] George Karakostas. A better approximation ratio for the vertex cover problem. In Luís Caires, Giuseppe F. Italiano, Luís Monteiro, Catuscia Palamidessi, and Moti Yung, editors, Automata, Languages and Programming, 32nd International Colloquium, ICALP 2005, Lisbon, Portugal, July 11-15, 2005, Proceedings, volume 3580 of Lecture Notes in Computer Science, pages 1043–1050. Springer, 2005. doi:10.1007/11523468_84.
- [8] Jon Kleinberg and Éva Tardos. Algorithm Design. Addison Wesley, 2006.
- [9] Michael Lewin, Dror Livnat, and Uri Zwick. Improved rounding techniques for the MAX 2-SAT and MAX DI-CUT problems. In William J. Cook and Andreas S. Schulz, editors, Integer Programming and Combinatorial Optimization, 9th International IPCO Conference, Cambridge, MA, USA, May 27-29, 2002, Proceedings, volume 2337 of Lecture Notes in Computer Science, pages 67–82. Springer, 2002. doi:10.1007/3-540-47867-1_6.
- [10] Junwei Luo, Yawei Wei, Mengna Lyu, Zhengjiang Wu, Xiaoyan Liu, Huimin Luo, and Chaokun Yan. A comprehensive review of scaffolding methods in genome assembly. Briefings Bioinform., 22(5), 2021. doi:10.1093/bib/bbab033.
- [11] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. The longest run subsequence problem. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 172 of LIPIcs, pages 6:1–6:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.WABI.2020.6.
- [12] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. Using the longest run subsequence problem within homology-based scaffolding. Algorithms Mol. Biol., 16(1):11, 2021. doi:10.1186/s13015-021-00191-8.
- [13] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory Comput., 3(1):103–128, 2007. doi:10.4086/TOC.2007.V003A006.
Appendix A Proof of Lemma 11
Lemma 11. [Restated, see original statement.]
The summation of satisfies the following inequality:
Proof.
Note that and . Moreover, . Therefore it is sufficient to show .
Then we design a mapping from to as follows: Consider a symbol . Since is a bad symbol, the algorithm chooses an -run in a bad segment with . Then we know that in this bad segment must be in . Otherwise, the local operation-1 is applicable to , which contradicts Step 4 of ALG1. If is unique, then we map to , which is called the Case-1 mapping. Now, we can assume that is not unique.
Since is in and is not unique, must be in since, otherwise, if were in , then a -run of length two would have to be chosen in . If is in , then we map to , which is called the Case-2 mapping. Otherwise, , and thus the length of in is exactly one. Note that is not unique and the local operation-2 is not applicable to , so there exists another that lies in a bad segment where a -run of length two is chosen by the algorithm. Therefore , and we map to , which is called the Case-3 mapping.
Then we prove that the mapping is injective, and we are done. Suppose that are two distinct symbols in that map to the same symbol . If is a unique symbol, then both are mapped to by the Case-1 mapping. That is, and are two substrings of , which is impossible since and is unique. So, we can assume that is not unique, i.e., and map to by the Case-2 mapping or the Case-3 mapping.
The first case is that both and map to by the Case-2 mapping. This indicates that and are two bad segments in where is in . However, this case is impossible since two -runs in must come from a bad segment associated with .
The second case is that exactly one of maps to by the Case-2 mapping and the other maps to by the Case-3 mapping. Without loss of generality, we assume that and map to by the Case-2 and Case-3 mappings, respectively. Recall that in the bad segment , the symbol is in . So, there is a substring or containing two bad segments in such that the -run of length two is in . By symmetry, we consider the substring . Since maps to by the Case-3 mapping, is the substring such that an and an are in , which leads to a contradiction since .
The remaining case is that both map to by the Case-3 mapping. In this case, there exist two substrings and in and . However, this is impossible since two -runs must come from a bad segment of , which completes the proof.
Appendix B Proof of Lemma 24
Lemma 24. [Restated, see original statement.]
Suppose that . Then, for and its marked string , the following inequality holds:
Proof.
Consider a marked symbol and assume again that . As before, we consider the case where remains in the concatenated sequence after the concatenation operation for marked strings. Note that does not include a substring . The reason is the same as before: the rightmost implies that the optimal algorithm deletes the rightmost , but if the rightmost is not deleted, then the length of the optimal solution increases by one, which is a contradiction.
-
Suppose that includes a substring . If the middle is deleted, then is obtained by the concatenation operation for marked strings, and ALG3’ deletes two characters in the third step. In other words, ALG3’ deletes one character in per two marked characters in .
-
Suppose that includes a substring . If the right is deleted, then we obtain and the middle is replaced by in the second step of ALG3’. Namely, misses one but adds one , i.e., the number of deleted characters remains the same for the substring.
-
Suppose that includes a substring . If the right two ’s are deleted, then we obtain and two are replaced by in the second step of ALG3’. Namely, misses two ’s but adds two ’s, i.e., the number of deleted characters remains the same for the substring.
Again, we conclude that ALG3’ deletes at least one character in for every two marked characters in . From these considerations, the upper bound on the number of characters deleted by ALG3’ can be calculated as follows:
(11)
Appendix C Bad examples for Theorem 1
We can provide tight examples for the analysis on the approximation ratios when and in Theorem 1.
-
(i)
First, suppose that . Then, consider the following string of length for some integer :
Then, an optimal solution is of length , and a solution of ALG1 is of length . Hence, .
-
(ii)
Next, suppose that . Then, the following string of length for some integer is a tight example:
Then, an optimal solution is of length , and a solution of ALG1 is the same as . Therefore, .
-
(iii)
For the case , the approximation-ratio analysis in Theorem 1 is almost tight, but there is a slight gap. Suppose that . Then, consider the following string of length for some integer :
Then, the following run subsequence of length is an optimal solution:
On the other hand, our algorithm ALG2 outputs the following subsequence of length :
Hence, .
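The optimal lengths quoted in the examples above can be validated on small instances by exhaustive search. The following is an illustrative exponential-time sketch based only on the LRS definition; the function names are ours:

```python
def is_run_subsequence(s):
    """True iff every symbol of s occurs in at most one maximal run of s."""
    seen, prev = set(), None
    for c in s:
        if c != prev:            # a new run starts here
            if c in seen:
                return False
            seen.add(c)
            prev = c
    return True

def lrs_length(s):
    """Length of a longest run subsequence of s (brute force over
    all 2^|s| subsequences; tiny inputs only)."""
    best = 0
    for mask in range(1 << len(s)):
        t = "".join(c for i, c in enumerate(s) if (mask >> i) & 1)
        if len(t) > best and is_run_subsequence(t):
            best = len(t)
    return best
```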
Appendix D Bad examples for Theorem 2
We can show that the analysis on the approximation ratios in Theorem 2 is tight.
-
(i)
First, suppose that is even. Then, consider the following string of length for some integer :
Then, the optimal solution is obtained by deleting characters as follows:
On the other hand, our algorithm ALG3 selects the leftmost longest -run for each and thus is as follows:
The total number of characters deleted from is . Hence, .
-
(ii)
Next, suppose that is odd. Then, consider the following string of length for some integer :
Very similarly, we obtain the following equality: . As a result, the analysis of the approximation ratios in the proof of Theorem 2 is tight.
