Abstract 1 Introduction 2 Approximation algorithms for Longest Run Subsequence 3 Simple approximation algorithm for MRSD 4 Improved approximation algorithm for MRSD References Appendix A Proof of Lemma 11 Appendix B Proof of Lemma 24 Appendix C Bad examples for Theorem 1 Appendix D Bad examples for Theorem 2

Approximability of Longest Run Subsequence and Complementary Minimization Problems

Yuichi Asahiro ORCID Kyushu Sangyo University, Fukuoka, Japan Mingyang Gong ORCID University of Alberta, Edmonton, Canada Jesper Jansson ORCID Kyoto University, Kyoto, Japan Guohui Lin ORCID University of Alberta, Edmonton, Canada Sichen Lu University of Alberta, Edmonton, Canada Eiji Miyano ORCID Kyushu Institute of Technology, Iizuka, Japan Hirotaka Ono ORCID Nagoya University, Nagoya, Japan Toshiki Saitoh ORCID Kyushu Institute of Technology, Iizuka, Japan Shunichi Tanaka Kyushu Institute of Technology, Iizuka, Japan
Abstract

We study the polynomial-time approximability of the Longest Run Subsequence problem (LRS for short) and its complementary minimization variant Minimum Run Subsequence Deletion problem (MRSD for short). For a string S=s1sn over an alphabet Σ, a subsequence S of S is S=si1sip, such that 1i1<i2<<ip|S|. A run of a symbol σΣ in S is a maximal substring of consecutive occurrences of σ. A run subsequence S of S is a subsequence of S in which every symbol σΣ occurs in at most one run. The co-subsequence S¯ of the subsequence S=si1sip in S is the subsequence obtained by deleting all the characters in S from S, i.e., S¯=sj1sjnp such that j1<j2<<jnp and {j1,,jnp}={1,,n}{i1,,ip}. Given a string S, the goal of LRS (resp., MRSD) is to find a run subsequence S of S such that the length |S| is maximized (resp., the number |S¯| of deleted symbols from S is minimized) over all the run subsequences of S. Let k be the maximum number of symbol occurrences in the input S. It is known that LRS and MRSD are APX-hard even if k=2. In this paper, we show that LRS can be approximated in polynomial time within factors of (k+2)/3 for k=2 or 3, and 2(k+1)/5 for every k4. Furthermore, we show that MRSD can be approximated in linear time within a factor of (k+4)/4 if k is even and (k+3)/4 if k is odd.

Keywords and phrases:
Longest run subsequence, minimum run subsequence deletion, approximation algorithm
Copyright and License:
[Uncaptioned image] © Yuichi Asahiro, Mingyang Gong, Jesper Jansson, Guohui Lin, Sichen Lu, Eiji Miyano, Hirotaka Ono, Toshiki Saitoh, and Shunichi Tanaka; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Design and analysis of algorithms
Funding:
The work was partially supported by the NSERC Canada, JSPS KAKENHI Grant Numbers JP20H05967, JP22H00513, JP22K11915, JP24H00697, JP24K02898, JP24K02902, JP24K14827, and JP25K03077, and CRONOS Grant Number JPMJCS24K2.
Editors:
Broňa Brejová and Rob Patro

1 Introduction

Scaffolding is one of the key informatics processes in DNA sequencing. DNA sequencing is generally carried out through the following steps: (i) Tens to hundreds of millions of DNA fragments extracted from random positions are read via shotgun sequencing, (ii) the extracted random fragments (reads) are assembled into a series of contiguous sequences (contigs) using an assembly algorithm, and (iii) finally, the contigs are arranged in the correct order based on certain criteria. This step (iii) is called scaffolding, which serves as the original motivation of this study. One common approach to scaffolding is to rearrange contigs by comparing multiple incomplete assemblies of related samples (see [10] for example).

The formulation of contig rearrangement from multiple incomplete assemblies in the scaffolding phase as a string processing problem by Schrinner et al. [11, 12] is known as the Longest Run Subsequence problem (LRS). Let Σ be a finite alphabet of symbols, and |Σ|=m. A string S=s1sn is a sequence of n characters, each of which is a symbol in Σ. Two or more characters in S can be the same symbol in Σ. For a string S=s1sn|S| denotes the length of S, i.e., |S|=n. For two strings S1 and S2, S1S2 denotes the concatenation of S1 and S2. A subsequence S of S is a sequence S=si1sip, such that 1i1<i2<<ip|S|. Let S[i] denote the character of S in the ith position for 1i|S|, and S[i,j] denote the substring of S that starts from the ith position and ends at the jth position. A σ-run in S is a substring S[i,j] such that S[]=σ, for any =i,i+1,,j, but S[i1]σ and S[j+1]σ. Given a string S on alphabet Σ, a run subsequence S of S is a subsequence of S in which every symbol σΣ occurs in at most one run. For a string S=abbbaab, for example, the substring S[2,4] is a b-run, and aaab is a run subsequence of S.

Problem 1 (Longest Run Subsequence problem, LRS).

Given a string S in Σn, the goal of LRS is to find a longest run subsequence S of S, i.e., every σΣ occurs in at most one run in S and the length |S| is maximized over all the run subsequences of S.

For the string S=abbbaab, the longest run subsequnece S of S is bbbaa of length five. If the maximum number of occurrences of each symbol in the input string S is bounded by k, then the problem is called the k-Longest Run Subsequence problem (k-LRS). One sees that 1-LRS is trivial since the length of all the runs in the input string S is one, and thus the input S itself is the optimal run subsequence. Unfortunately, Schrinner et al. [12] showed that LRS is generally NP-hard. Subsequently, Dondi and Sikora [5] showed 2-LRS is APX-hard, while, as a positive result, they provided a polynomial-time k-approximation algorithm for k-LRS. Recently, Asahiro et al. [2] improved the approximation ratio to (k+1)/2 for k-LRS, and have shown that for the case k=2, a better approximation ratio 4/3 than (k+1)/2=3/2 is achieved.

In this paper, we first derive further improved approximability results for k-LRS:

Theorem 1.

k-LRS can be approximated in O(mn2) time within factors of (k+2)/3 for k=2 or 3, and 2(k+1)/5 for every k4.

This paper also considers the complementary minimization variant of LRS, called the Minimum Run Subsequence Deletion problem (MRSD). The co-subsequence S¯ of the subsequence S=si1sip in S is the subsequence obtained by deleting all the characters in S from S, i.e., S¯=sj1sjnp such that j1<j2<<jnp and {j1,,jnp}={1,,n}{i1,,ip}. For example, consider S=abcde. Then, for a subsequence S=abd, the co-subsequence S¯ of S is S¯=ce. Note that for a subsequence S of a string S, S¯ is not unique unless we specify position of each character of S in S: for S=ababa and S=aa (without indices from which these a’s come), candidates of S¯ is bba, bab, and abb of the same length three. As will be seen in the following, only the number of deleted characters is important in some cases, but we often need to take care of from where a character is deleted.

Problem 2 (Minimum Run Subsequnce Deletion problem, MRSD).

Given a string S in Σn, the goal of MRSD is to find a run subsequence S of S such that the number |S¯| of deleted symbols from S is minimized over all the run subsequence of S.

Similarly to k-LRS, if the maximum number of occurrences of each symbol in the input string S is bounded by k, the problem is called the k-Minimum Run Subsequence Deletion problem (k-MRSD). Since the run subsequence obtained by minimizing the number of deletions in MRSD corresponds exactly to the longest run subsequence in LRS, LRS and MRSD are essentially equivalent as decision problems. However, due to the difference in the objective functions, MRSD may exhibit different characteristics from LRS in terms of approximability. Thus, a natural question arises: Which problem is easier/harder to approximate, or are they equally hard?

To gain insight into this question, we consider two examples of problem pairs that are essentially equivalent as decision problems but differ in their objective functions, similar to LRS and MRSD. The first example is the pair of Max-2SAT and its deletion variant, Min-2SAT Deletion. It is known that 2SAT can be solved in polynomial time [3]. However, in the case where the instance is unsatisfiable, the objective of Max-2SAT is to find a truth assignment that maximizes the number of satisfied clauses (a maximization problem). In contrast, the objective of Min-2SAT Deletion is to minimize the number of unsatisfied clauses (a minimization problem). For these problems, the following results are known: The maximization version, Max-2SAT, admits a 1.0638-approximation algorithm [9], but it is NP-hard to approximate within a factor of 1.0476 [6]. On the other hand, the best known approximation ratio for the minimization version, Min-2SAT Deletion, is O(logn) [1], and it is known to be NP-hard to approximate within a factor of 1.36067[4]. Thus, intuitively, the maximization version is easier to approximate than the deletion version for these two problems.

Another example is the pair consisting of the Maximum Independent Set problem (MaxIS) and the Minimum Vertex Cover problem (MinVC). For any graph G=(V,E) and any independent set 𝐼𝑆V of G, V𝐼𝑆 forms a vertex cover. Thus, MinVC can be seen as the complementary minimization variant of MaxIS. For these problems, the following results are known: MaxIS is NP-hard to approximate within a factor of n1ε for any ε>0 [13]. In contrast, MinVC is known to admit a 2Θ(1/logn)-approximation algorithm [7]. Thus, contrary to the previous pair, in this case, the maximization version is harder to approximate than the deletion version.

As the second contribution, the paper investigates the approximability of k-MRSD:

Theorem 2.

k-MRSD can be approximated in linear time within a factor of (k+4)/4 for even k, and (k+3)/4 for odd k.

Namely, unlike the two aforementioned examples, LRS and MRSD can be considered as problems that currently share similar approximation ratios of O(k). We remark that the basic strategies of the proposed approximation algorithms for MRSD are very similar to those for LRS, but the analyses of the approximation ratios are quite different.

Notation.

For each symbol σΣ, σh and max(σ) denote a length-h σ-run and the length of the longest σ-run in the input string S, respectively. Let occ(σ) be the number of occurrences of σ in the input string S. Let occmax(S)=maxσSocc(σ). Without loss of generality, we assume that the number of occurrences of each symbol σΣ in S is at least one, and if it is one, then we say σ is unique.

Example 3.

Consider a string S=babcbcadedggg. Then S includes two a1, three b1, two c1, two d1, and one e1, where each length is one; and a length-3 g-run g3. Therefore, the length of the longest run for a, b, c, d, or e is 1, and for g is 3, respectively. That is, max(a)=max(b)=max(c)=max(d)=max(e)=1 and max(g)=3. The number occ(b) of occurrences of b is three, e is unique, i.e., occ(e)=1, and occmax(S)=3.

2 Approximation algorithms for Longest Run Subsequence

In this section we consider LRS and design approximation algorithms for k-LRS. That is, we assume that the maximum number occmax(S) of symbol occurrences in the input S is always bounded by k.

2.1 Preprocessing

We first introduce an inserting operation to preprocess the input string S. For every symbol σ with max(σ)3, we create an auxiliary symbol σ. The auxiliary alphabet that contains all the auxiliary symbols is denoted as Σ. The inserting operation inserts a copy of a symbol σ after the first two consecutive symbols σσ in a σh with h3. We repeatedly apply the inserting operation until there is no σh for any symbol σ and any h3.

Operation (An inserting operation).

Given a σh with h3 in the string S, the operation inserts a copy of the symbol σΣ after the first two consecutive symbols σσ.

Example 4.

Recall the string S in Example 3 and max(g)=3. After the preprocessing, g3 becomes gggg, and the resulting string is S=babcbcadedgggg.

One clearly sees that in the preprocessed string, denoted as S, max(σ)2 for any symbol σΣ (and max(σ)=1 for any auxiliary symbol σΣ). Let ΠΣ denote the subset of symbols σ’s such that max(σ)=2, and then let Λ=(ΣΣ)Π.

2.2 The algorithm

We present an algorithm ALG1 to compute a run-subsequence ALG1(S) for the preprocessed string S obtained by applying the above preprocessing to S.

Definition 5.

For a symbol sΛ that is not unique, if every two consecutive occurrences of s in S are separated by at least two symbols, then s is a good symbol. Otherwise, s is a bad symbol and every substring of S in the form of sts, where t is another symbol, is a bad segment associated with s.

Using Definition 5, we partition Λ into three subsets Λ1,Λ2 and Λ3, where Λ1 contains all the unique symbols, Λ2 contains all the good symbols, and Λ3 contains all the bad symbols. One sees that such a partition can be done in O(|S|) time.

Our algorithm constructs an initial solution for S and then applies two local search operations to update the solution. During the algorithm ALG1, the current solution is always denoted as ALG1(S). Initially, for each symbol sΠΛ1Λ2, the algorithm picks its leftmost longest run in S into ALG1(S); for each bad symbol sΛ3, the algorithm picks the leftmost s in the first bad segment associated with s into ALG1(S). Note that all these picked runs are in the same order as they show up in S, i.e., ALG1(S) is a subsequence of S. We continue to illustrate using the string in Example 4.

Example 6.

Consider the string S=babcbcadedgggg. We have Π={g}, Λ1={e,g}, Λ2={a}, and Λ3={b,c,d}. The initial solution ALG1(S) is bacdeggg, which is obtained by picking the symbols in the boxes as follows:

babcbcadedgggg.

Observe that if the symbol t after the picked bad symbol s in the associated bad segment sts is not picked, then we can add the second bad symbol s in the bad segment to the solution to increase its length by one. In the sequel, we aim to do this by possibly swapping some picked good symbols with their respective copies at the other places.

To this purpose, we further partition Λ3 into two subsets Λ3 and Λ3′′, where Λ3′′ contains those bad symbols, each of which has a length-2 run in the solution ALG1(S). At the beginning, Λ3=Λ3 and Λ3′′=.

We design two local operations to repeatedly improve the initial solution ALG1(S). The first operation is almost the above observation, while the second goes slightly further to swap a picked symbol.

Operation (Local operation-1 for sΛ3).

Given a symbol sΛ3 in a bad segment sts such that its symbol t is not picked into ALG1(S), the operation replaces s in ALG1(S) by the two copies of s in this bad segment, and moves s from Λ3 to Λ3′′.

Operation (Local operation-2 for sΛ3).

Given a symbol sΛ3 in a bad segment sts such that its symbol tΛ2Λ3 is picked into ALG1(S), the operation finds another occurrence of t in S that does not break any length-2 runs in ALG1(S), replaces the picked t in ALG1(S) by this occurrence and replaces s in ALG1(S) by the two copies of s in this bad segment, and moves s from Λ3 to Λ3′′.

We remark that while applying the above two local operations, the σ2-run for each σΠ and the σ-run for each σΛ1 are untouched; for each σΛ2Λ3, it appears exactly once in ALG1(S), while for each σΛ3′′, it appears twice in ALG1(S) and they are picked from a bad segment associated with σ. The goal of the local search is to reduce the number of bad symbols in Λ3 as much as possible, and the process terminates when none of the two local operations is applicable. A high-level description of the algorithm is as follows:

Algorithm 1 A high-level description of the algorithm ALG1.

Input: The sequence S obtained from preprocessing S.
Output: A subsequence ALG1(S) of S.

 

We examine the time complexity of ALG1. First notice that |S|3n/2 and |Σ|m. Therefore, the partition of ΣΣ into Π,Λ1,Λ2,Λ3 can be done in O(n) time. One sees that ALG1 executes at most 2m local operations, and finding a symbol in Λ3 to which the local operation-1 is applicable takes O(n) time and finding a symbol in Λ3 to which the local operation-2 is applicable takes O(n2) time. It follows that ALG1 runs in O(mn2).

We continue to illustrate using the string in Example 4.

Example 7.

Consider the string S=babcbcadedgggg. From Example 6, the initial solution ALG1(S) is bacdeggg and Λ3={b,c,d}, so that Λ3={b,c,d} and Λ3′′=.

The local operation-1 is applicable for cΛ3 which is associated with only one bad segment cbc, where its b is not picked. Thus ALG1(S) is updated to baccdeggg by picking the symbols in the boxes as follows:

babcbcadedgggg,

and Λ3={b,d} and Λ3′′={c}.

One sees that the local operation-1 is no longer applicable for symbols b and d in Λ3, but then the local operation-2 is applicable to b in the first bad segment bab associated with b, since the second occurrence of a does not break any length-2 runs in ALG1(S). As a result, ALG1(S) is updated to bbccaedggg by picking the symbols in the boxes as follows:

babcbcadedgggg

and Λ3={d} and Λ3′′={b,c}. For the last symbol dΛ3, we cannot apply neither the local operation-1 nor the local operation-2, and thus the algorithm terminates.

2.3 Post-processing

We process the achieved solution ALG1(S) to produce a solution ALG1(S) for the input S. Recall that S is the result of preprocessing S by the inserting operations, each inserts a copy of the auxiliary symbol σ after the first two consecutive symbols σσ in a σh with h3. For each auxiliary symbol σΣ, we delete its single copy from ALG1(S) and then replace the σ2 in ALG1(S) by a longest σ-run in the input sequence S. We remark that σ is either unique or good, and thus it appears exactly once in ALG1(S), and that no symbol of the longest σ-run in the input sequence S breaks any length-2 run in ALG1(S).

We continue to illustrate using the string in Example 3.

Example 8.

Consider the string S=babcbcadedggg, for which S=babcbcadedgggg, where g is inserted by the inserting operation. From Example 7, the achieved solution ALG1(S) for S is bbccadegg by picking the symbols in the boxes as follows:

babcbcadedgggg,

The post-processing gives the solution ALG1(S)=bbccadeggg for S by picking the symbols in the boxes as follows:

babcbcadedggg.

2.4 Performance analysis

In this section, we analyze the worst-case ratio of our algorithm. Let ALG1(S) and OPT(S) denote the output sequence by our algorithm and an optimal solution when the input sequence is S, respectively. Let α(S)=|OPT(S)||ALG1(S)|.

Lemma 9.

For an input string S, let S be the result of preprocessing S by the inserting operation. Then α(S)α(S) is satisfied.

Proof.

S is a subsequence of S. Therefore, |OPT(S)||OPT(S)|.

We next prove |ALG1(S)||ALG1(S)|. Note that for σΣ, it corresponds to a symbol σ with max(σ)3 in S. If max(σ)=3 in S, then σ is unique in S. Otherwise, max(σ)4 in S, then σ is a good symbol since in between every two σ symbols, there is a σ2 with σσ. That is, σΛ1Λ2 and thus the length of σ in ALG1(S) is exactly one. Therefore, the total length of symbols σ and σ in ALG1(S) is exactly three. It indicates that |ALG1(S)||ALG1(S)| since we choose the longest σ-run whose length max(σ) is at least three in ALG1(S). Then the lemma is proved.

We remark that since the approximation ratio of the string S is always worse than the one of S by Lemma 9, we show a bound on the approximation ratio of S instead of the original input S.

Now, we are ready to prove the worst-case ratio of our algorithm. Let ni be the number of unique symbols whose length is i in the optimal solution OPT(S) for i=0,1. Recall that for each symbol σΛ2, the length of the σ-run in ALG1(S) must be one. We use xi to denote the number of symbols in Λ2 whose length is i in OPT(S) for i=0,1,,k. Similarly, let yi (zi, respectively) be the number of symbols in Λ3 (Λ3′′, respectively) whose length is i in OPT(S) for i=0,1,,k. Lastly, for each symbol σΠ, the length of the σ-run in ALG1(S) is two. We denote by ti the number of symbols in Π whose length is i in OPT(S) for i=0,1,,k. In conclusion, we have the following two equalities:

|OPT(S)| =n1+i=1ki(xi+yi+zi+ti),and (1)
|ALG1(S)| =n0+n1+i=0k(xi+yi+2zi+2ti). (2)

Let #D be the total number of symbols deleted from S by the optimal algorithm to obtain OPT(S). Notice that if the length of the σ-run is i in OPT(S) and σ is not unique, then the total number of deleted σ symbols is at most ki. Therefore the upper bound on #D is

#Dn0+i=0k1(ki)(xi+yi+zi+ti). (3)

Then we estimate a lower bound on #D as follows.

Lemma 10.

For the number of deleted symbols #D, we have

#Di=2k(2(i1)xi+(i1)(yi+zi)+(i21)ti).

Proof.

Since each symbol σΛ2 is a good symbol, there exist at least two other symbols in between every two σ-runs. In order to generate a length-i σ-run in OPT(S), at least 2(i1) symbols are deleted.

For each symbol σΛ3, max(σ)=1 in S. So, forming a length-i σ-run must delete at least i1 symbols.

Lastly, consider a symbol σΠ. Recall that max(σ)=2. It indicates that we must delete at least one symbol to generate every length-3 σ-run. So for each symbol in Π, we must delete at least i21 symbols.

This completes the proof of this lemma.

By Lemma 10 and Eq.(3), we have

i=2k(2(i1)xi+(i1)(yi+zi)+(i21)ti)n0+i=0k1(ki)(xi+yi+zi+ti),

which is equivalent to

2(k1)xk+(k1)(yk+zk)+(k21)tk
n0+k(x0+y0+z0+t0)+(k1)(x1+y1+z1+t1)
+i=2k1(k3i+2)xi+i=2k1(k2i+1)(yi+zi)+i=2k1(k3i2+1)ti. (4)
Lemma 11.

The summation of yi,zi satisfies the following inequality:

i=0kyin0+n1+i=0kzi.

Proof.

The proof appears in Appendix A.

We can show the following theorem:

Theorem 1. [Restated, see original statement.]

k-LRS can be approximated in O(mn2) time within factors of (k+2)/3 for k=2 or 3, and 2(k+1)/5 for every k4.

Proof sketch..

Let be a constant; to get the best worst-case ratio, we will set =1/3 for k=2,3, and =(2k3)/(5k5) for k4 later.

By Eq.(2.4) and Eq.(1), we have

k(xk+yk+zk+tk)
= 2(k1)xk+(k1)(yk+zk)+(k21)tk
+(k(2k2))xk+(k(k1))(yk+zk)+(k(k21))tk
n0+k(x0+y0+z0+t0)+(k1)(x1+y1+z1+t1)
+i=2k1(k3i+2)xi+i=2k1(k2i+1)(yi+zi)+i=2k1(k3i2+1)ti
+(k(2k2))xk+(k(k1))(yk+zk)+(k(k21))tk, and (5)
|OPT(S)|= n1+i=1k1i(xi+yi+zi+ti)+k(xk+yk+zk+tk)
n0+n1+k(x0+y0+z0+t0)+((k1)+1)(x1+y1+z1+t1)
+i=2k1((k3i+2)+i)xi+i=2k1((k2i+1)+i)(yi+zi)
+i=2k1((k3i2+1)+i)ti+(k(2k2))xk
+(k(k1))(yk+zk)+(k(k21))tk. (6)

We discuss two cases of k. When k=2,3, =13 and k+235k+212 and k+23min{2k+13,5k+26}. By Lemma 11, Eqs. (2.4) and (2), the following inequality holds.

|OPT(S)|k+23|ALG1(S)|. (7)

On the other hand, when k4, =2k35k5 and thus [13,12). By Lemma 11, we have

i=0k(yi+zi)13(n0+n1)+23i=0k(yi+2zi). (8)

Here we obtain several inequalities that are used in the following. If 2ik1, then (k3i+2)+i(k4)+2(k1)+1, and k(2k2)(k1)+1. Moreover, (k2i+1)+i(3k)+k1k(k1), and (k3i2+1)+i(5k)2+k1k(k21). Note that the following inequality holds:

(k1)+1k(k1)k(k21)2(k1)+2.

By Eqs.(8) and (2), Eq.(2.4) can be simplified into the following inequality:

|OPT(S)| 2(k+1)5|ALG1(S)|. (9)

Combining Eqs.(7) and (9), the theorem is proved.

We note that the analysis on the approximation ratio of (k+2)/3 for k=2 or 3 is strictly tight. However, for the (2k+2)/5 ratio of k4, we know only a bad example for which the approximation ratio is (2k+1)/5; there remains a slight gap. See Appendix C for details.

3 Simple approximation algorithm for MRSD

In this section, we consider the deletion variant k-MRSD. Again, assume that occmax(S) is bounded by k. As a warm-up, we design a simple approximation algorithm ALG2 with approximation ratios (k+2)/2 if k is even, and (k+1)/2 if k is odd:

Algorithm 2 A high-level description of our algorithm ALG2.

Input: An input string S in which every symbol σΣ appears at most k times.
Output: A run subsequence ALG2(S) of S.

 

That is, ALG2 deletes all unselected runs for every symbol from S.

Example 12.

Consider the input string S=a1b1a2b1c1a2b1c1b1c2b2 of length 15 for MRSD. The leftmost longest a-run, b-run, and c-run are S[3,4], S[14,15], and S[12,13], respectively. Therefore, the output subsequence of ALG2 is ALG2(S)=a2c2b2, and thus the co-subsequence of ALG2(S) is ALG2(S)¯=a1b2c1a2b1c1b1. Therefore, the number |ALG2(S)¯| of deleted characters from S by our algorithm ALG2 is nine.

Clearly, ALG2 can be implemented in linear time. We bound its approximation ratio in the following. Let S be an input string of k-MRSD. Suppose that OPT(S) and ALG2(S) are solutions obtained by an optimal algorithm OPT and our algorithm ALG2, respectively. An outline of our proof on the approximation ratio is as follows: (I) We first obtain an upper-bound on the number |ALG2(S)¯| of deleted characters by ALG2. Then, (II) we bound the upper-bound above by α|OPT(S)¯|, where α=(k+2)/2 if k is even and α=(k+1)/2 if k is odd.

(I) We obtain an upper bound on the number |ALG2(S)¯| of deleted characters by our algorithm ALG2. To do so, we first construct a new string S^, called a “marked” string, by replacing every character s=σ in the co-subsequence OPT(S)¯ of OPT(S) with an auxiliary symbol σ^Σ, called a “marked” symbol (character). Let Σ^ be the alphabet of marked symbols, where σ^Σ^ if σΣ. Then, if the ith character si=σ of S is in OPT(S)¯, then it is replaced with the corresponding marked symbol σ^. Furthermore, we replace every marked character in S^ with a new symbol γΣΣ^, and call it the “γ-string” Sγ of S. For example, consider S=a1b1a2b1c1a2b1c1b1c2b2 as an input for MRSD again, where Σ={a,b,c}. One can verify that OPT(S)=a5c3b2 is an optimal solution. Then, we obtain the marked string S^=a1b^1a2b^1c^1a2b^1c1b^1c2b2, where Σ^={a^,b^,c^} is the alphabet of the marked symbols. By replacing every marked character in Σ^ with γ, we obtain the γ-string Sγ=a1γ1a2γ1γ1a2γ1c1γ1c2b2=a1γ1a2γ2a2γ1c1γ1c2b2. Let r be the number of γ-runs in the γ-string. For example, Sγ includes four γ-runs, i.e., r=4, three length-1 γ-runs and one length-2 γ-run. Note that r|OPT(S)¯| holds.

Next, just for the sake of the analysis on approximation ratios, we introduce a new algorithm ALG2’ for the γ-string Sγ as input, which is very similar to ALG2, but ALG2’ deletes all the γ-runs. It is important to note that it is not necessary to find any optimal solution. Later, we show that the number |ALG2(Sγ)¯| of characters deleted by ALG2’ for Sγ is a good upper bound on |ALG2(S)¯|.

Algorithm 3 A high-level description of our algorithm ALG2’.

Input: The γ-string Sγ of S over the alphabet Σ{γ}.
Output: A run subsequence ALG2(Sγ) of Sγ.

 
Example 13.

Consider the γ-string Sγ=a1γ1a2γ2a2γ1c1γ1c2b2 of S. Then ALG2(Sγ)=a2c2b2 is the output of ALG2’.

Since all γ-runs are deleted from Sγ, ALG2(Sγ) must be feasible for k-MRSD on the original input S. Then, we get the following upper bound on |ALG2(S)¯|:

Lemma 14.

For any input string S and its γ-string Sγ, the following inequalities hold:

  1. 1.

    |ALG2(S)||ALG2(Sγ)|;

  2. 2.

    |ALG2(S)¯||ALG2(Sγ)¯|.

Proof.

(1) max(σ) in the γ-string Sγ is at most max(σ) in the original string S for each σΣ. Therefore, |ALG2(S)||ALG2(Sγ)| holds. (2) From |Sγ|=|S| and |ALG2(S)||ALG2(Sγ)|, the number |ALG2(Sγ)¯| of characters deleted from Sγ is at least the number |ALG2(S)¯| of deleted characters from S.

(II) Next, we consider an upper bound on the number |ALG2(Sγ)¯| of deleted characters by ALG2’ on Sγ. The crux of the following estimation is the number of deleted characters from OPT(S) of Sγ by ALG2’.

Lemma 15.

For any input string S and its γ-string Sγ for any optimal solution OPT(S), the following inequality holds:

|ALG2(Sγ)¯|(k2+1)|OPT(S)¯|

Proof.

We first divide γ-runs in the γ-string Sγ into two types, (type-i) σiγhσj for σiσj, and (type-ii) σiγhσi for an integer h>0. Suppose that Sγ can be represented by Sγ=Sγ1γhSγ2. Recall that if we delete all γ’s from Sγ, then the remaining sequence must be an optimal solution, i.e., a run subsequence. Therefore, if the middle γ-run is in (type-i) and Sγ1 (resp., Sγ2) includes a σΣ, then Sγ2 (resp., Sγ1) does not include σ. On the other hand, we can see that the second type γ-run partitions some σi-run in the optimal solution OPT(S) into the left σi-run and the right σi-run in the γ-string Sγ.

For example, look at a γ-string Sγ=ap1γh1bp2γh2bp3γh3bp4γh4bp5γh5cp6γh6dp7 for p1,,p71 and h1,,h61. We focus on the six γ-runs in Sγ. Since the left and the right runs (or characters) of the second (also, the third and the fourth) γ-run in Sγ are the same, it is in (type-ii). Here, we can see that one long b-run of length (p2+p3+p4+p5) is divided into four b-runs, bp2, bp3, bp4 and bp5. Note that ALG2’ selects the longest σi-run for each σiΣ and deletes all other σi-runs from Sγ. Therefore, three b-runs of the four b-runs are deleted by ALG2’. If q γ-runs divide a σ-run of length at most k into q+1 σ-runs, then the length of each σ-run except for the longest σ-run is bounded above by k/2. This implies that the number of σi’s deleted from OPT(S) for a symbol σiΣ is bounded above by k/2 per γ-run in (type-ii) in the worst case. On the other hand, the left and the right runs of the first (also, the fifth and the sixth) γ-run in Sγ are different, and thus it is in (type-i). For example, in the left substring of the fifth length-h5 γ-run, neither c nor d appears since c and d are included in the right substring of the γ-run. Therefore, after deleting γh5, we can independently count the numbers of characters in OPT(S) that are deleted from the left substring and from the right substring by ALG2’.

Let the number of γ-runs in (type-ii) be r. Note that r|OPT(S)¯| holds and the total number of γ’s in (type-i) and (type-ii) deleted from Sγ is |OPT(S)¯|. Also, the total number of σ’s deleted from OPT(S) for all symbols in Σ is bounded above by k/2r. Hence, we obtain the following inequality on the number of characters deleted by ALG2’:

|ALG2(Sγ)¯||OPT(S)¯|+k2r(k2+1)|OPT(S)¯|

This completes the proof of this lemma.

From Lemmas 14 and 15, we obtain the following theorem:

Theorem 16.

ALG2 is a linear-time approximation algorithm with approximation ratios (k+2)/2 if k is even, and (k+1)/2 if k is odd.

4 Improved approximation algorithm for MRSD

In this section, we present an improved approximation algorithm ALG3 that runs in linear time and its approximation ratios are (k+4)/4 if k is even, and (k+3)/4 if k is odd.

4.1 Concatenation operation

We first define an alternating σ-run:

Definition 17.

An alternating σ-run (simply, an alternating run) in S=S[1,n] is a substring S[i,i+2p] for an integer p1 such that (i) S[i]=S[i+2]==S[i+2p]=σ, (ii) S[i+1]σ, S[i+3]σ, , S[i+2p1]σ, and (iii) S[i2]σ (if i21), S[i1]σ (if i11), S[i+2p+1]σ (if i+2p+1n), and S[i+2p+2]σ (if i+2p+2n).

For example, consider a string S=S[1,15]=ababcababcbcbcc. In the string S, S[1,3]=aba, S[2,4]=bab, S[6,8]=aba, S[7,13]=babcbcb, and S[10,14]=cbcbc are an alternating a-run, an alternating b-run, an alternating a-run, an alternating b-run, and an alternating c-run, respectively.

Then, we introduce a concatenation operation to obtain a σ-run from an alternating σ-run in the string S. Consider an alternating σ-run S[i,i+2p]. Then, the concatenation operation deletes all characters that are not σ from S[i,i+2p], i.e., S[i+1], S[i+3], , S[i+2p1], and obtains a σ-run. For a string ababa, however, there are two possibilities, a3 and b2 by deleting two b’s and three a’s, respectively. The operation finds a concatenation (i.e., run) as long as possible using an optimal algorithm for the Interval Scheduling problem (see, e.g.,  [8]): We regard the alternating run S[i,i+2p] as the interval [i,i+2p] of weight 2p+1. A pair of two alternating runs S[i,i+2pi] and S[j,j+2pj] is independent if i+2pi<j or j+2pj<i holds. The concatenation operation aims to find a maximum weight subset of mutually independent alternating runs from S, and obtain a long σ-run from the selected alternating σ-run by deleting all the characters that are not σ for every σ.

Operation (Concatenation operation).

Given the input string S, the operation obtains a concatenated sequence Sc:

(Step 1)

Find all the alternating runs in S.

(Step 2)

Select a maximum subset of mutually independent alternating runs in S.

(Step 3)

Delete all characters from S so that every alternating σ-run in becomes a σ-run, and obtain a concatenated sequence Sc.

Example 18.

Consider again a string S=S[1,15]=ababcababcbcbcc. In Step 1, we find an alternating a-run S[1,3] of weight three, an alternating b-run S[2,4] of weight three, an alternating a-run S[6,8] of weight three, an alternating b-run S[7,13] of weight seven, and an alternating c-run S[10,14] of weight five. Then, in Step 2, we select ={S[1,3],S[6,8],S[10,14]} whose total weight is 11. Finally, we delete four characters S[2], S[7], S[11], S[13] from S, and obtain Sc=aabcaabcccc=a2bca2bc4.

We estimate the running time of the concatenation operation. Note that the number of alternating runs overlapped at the same position is at most two, and there are O(n) alternating runs in S. Hence Step 1 can be implemented in O(n) time by scanning the string S from left to right. Furthermore, during Step 1, we can sort the alternating runs according to their right-ends in S. Among O(n) alternating runs, we can select the maximum independent set by greedily selecting the non-overlapped alternating run with the leftmost right-end [8] in Step 2, which takes O(n) time. Step 3 needs O(n) time. Therefore, the total running time of the concatenation operation is O(n).

4.2 The algorithm

We present an improved approximation algorithm ALG3 with approximation ratios (k+4)/4 if k is even, and (k+3)/4 if k is odd.

Algorithm 4 A high-level description of our algorithm ALG3.

Input: An input string S in which every symbol σ appears at most k times.
Output: A run subsequence ALG3(S) of S.

 

For example, if the concatenation operation produces Sc=a2bca2bc4 from S, then ALG3 outputs ALG3(S)=a2bc4. It is clear that ALG3 can obtain a feasible solution in O(n) time.

4.3 Approximation ratios

Let OPT(S) and ALG3(S) be the solutions obtained by an optimal algorithm OPT, and ALG3 for an input string S, respectively. An outline of our proof on the approximation ratio is very similar to Section 3: (I) We first obtain an upper-bound on the number |ALG3(S)¯| of characters deleted by ALG3, comparing the number of characters deleted by the optimal algorithm. Then, (II) we bound the upper-bound above by α|OPT(S)¯|, where α=(k+4)/4 if k is even, and α=(k+3)/4 if k is odd. Again, we first construct the marked string S^ from S using the optimal algorithm, and design a little bit worse algorithm ALG3’ than ALG3.

Similarly to Section 3, we construct the marked string S^ by replacing every character s=σ in the co-subsequence OPT(S)¯ with a symbol σ^Σ. Let Σ^ be the alphabet of marked symbols. For example, consider a string S=ababcababcbcbcc of length 15. Then, OPT(S)=a4b2c3 is an optimal solution and S^=ab^ab^c^ab^abc^bcb^cc is the marked string obtained from OPT(S)¯. Note that the optimal solution deletes S[10]=c and thus S^ includes S^[10]=c^. Furthermore, c does not appear in S^[1,9] since OPT(S) must be the run subsequence and S^[11,15] includes c’s. In addition, S^[1,6] does not include b since b’s appear in S^[8,15] and S^[7]=b^.

Similarly to a σ-run for σΣ, for the marked symbol σ^Σ^, a σ^-run can be defined. Also, without distinguishing σ^ from σ, we consider “mixed” σ-runs, σh1σ^h2-type, σ^h1σh2-type, and σ^h1σh2σ^h3-type, for some positive integers h1, h2, and h3 in the following. That is, for example, σ2σ^3 is regarded as one mixed run of length five. Note that we do not need to consider the σh1σ^h2σh3-type since marked strings are constructed based on optimal solutions. Furthermore, we define an alternating-“mixed” run without distinguishing σ^ from σ as follows:

Definition 19.

An alternating-mixed σ-run (or, simply alternating-mixed run) in S^=S^[1,n] is a substring S^[i,i+2p] which satisfies, (Case 1), (Case 2), or (Case 3):

(Case 1)

σh1σ^h2-type. (i) S^[i]=S^[i+2]==S^[i+2p1]=σ, and S^[i+2(p1+1)]=S^[i+2(p1+2)]==S^[i+2p]=σ^ for 1p1<p, (ii) S^[i+1]{σ,σ^}, S^[i+3]{σ,σ^}, , S^[i+2p1]{σ,σ^}, and (iii) S^[i2]σ (if i21), S^[i1]σ (if i11), S^[i+2p+1]σ^ (if i+2p+1n), and S^[i+2p+2]σ^ (if i+2p+2n). That is, S^[i,i+2p1] and S^[i+2(p1+1),i+2p] are the alternating σ-run and the alternating σ^-run, respectively, and S^[1,i+2p]=S^[1,i+2p1]σS^[i+2(p1+1),i+2p] for a symbol σ{σ,σ^}.

(Case 2)

σ^h1σh2-type. (i) S^[i]=S^[i+2]==S^[i+2p1]=σ^, and S^[i+2(p1+1)]=S^[i+2(p1+2)]==S^[i+2p]=σ for 1p1<p, (ii) S^[i+1]{σ,σ^}, S^[i+3]{σ,σ^}, , S[i+2p1]{σ,σ^}, and (iii) S^[i2]σ^ (if i21), S^[i1]σ^ (if i11), S^[i+2p+1]σ (if i+2p+1n), and S^[i+2p+2]σ (if i+2p+2n). That is, S^[i,i+2p1] and S^[i+2(p1+1),i+2p] are the alternating σ^-run and the alternating σ-run, respectively, and S^[1,i+2p]=S^[1,i+2p1]σS^[i+2(p1+1),i+2p] for a symbol σ{σ,σ^}.

(Case 3)

σ^h1σh2σ^h3-type. (i) S^[i]=S^[i+2]==S^[i+2p1]=σ^, S^[i+2(p1+1)]=S^[i+2(p1+2)]==S^[i+2p2]=σ, and S^[i+2(p2+1)]=S^[i+2(p2+2)]==S^[i+2p]=σ^, for 1p1<p2<p, (ii) S^[i+1]{σ,σ^}, S^[i+3]{σ,σ^}, , S^[i+2p1]{σ,σ^}, and (iii) S^[i2]σ^ (if i21), S^[i1]σ^ (if i11), S[i+2p+1]σ^ (if i+2p+1n), and S^[i+2p+2]σ^ (if i+2p+2n). That is, S^[i,i+2p1] and S^[i+2(p2+1),i+2p] are the alternating σ^-runs, and the middle S^[i+2(p1+1),i+2p2] is the alternating σ-run, and S^[1,i+2p]=S^[1,i+2p1]σS^[i+2(p1+1),i+2p2]σ′′S^[i+2(p2+1),i+2p] for symbols σ,σ′′{σ,σ^}.

For example, consider the marked string S^=ab^ab^c^ab^abc^bcb^cc of length 15. In the string S^, S^[2,4]=b^ab^ S^[7,13]=b^abc^bcb^, and S^[10,14]=c^bcb^c are the alternating b-run, the alternating-mixed b-run in (Case 3), and the alternating-mixed c-run in (Case 2), respectively.

Here, we introduce the concatenation operation for marked strings by slightly modifying the concatenation operation.

Operation (Concatenation operation for marked strings).

Given the marked string S^ of the input string S, the operation obtains a concatenated sequence S^c:

(Step 1)

Find all the alternating runs and all the alternating-mixed runs on ΣΣ^ in S.

(Step 2)

Select a maximum subset of mutually independent alternating/alternating-mixed runs in S.

(Step 3)

Delete all characters that are neither σ nor σ^ from S so that every alternating σ-run, every alternating σ^-run and every alternating-mixed σ-run in become a σ-run, a σ^-run and a mixed σ-run, respectively, and obtain the concatenated sequence S^c.

Example 20.

Consider the marked string S^=ab^ab^c^ab^abc^bcb^cc of length 15. In Step 1, we find an alternating a-run S[1,3] of weight three, an alternating b^-run S[2,4] of weight three, an alternating a-run S[6,8] of weight three, an alternating-mixed b-run S[7,13] of weight seven, and an alternating-mixed c-run S[10,14] of weight five. Then, in Step 2, we select ={S[1,3],S[6,8],S[10,14]} whose total weight is 11. Finally, we delete four characters S[2], S[7], S[11], and S[13] from S^, and obtain S^c=aab^c^aabc^ccc=a2b^c^a2bc^c3.

To obtain an upper bound on |ALG3(S)¯|, we introduce the following algorithm ALG3’:

Algorithm 5 A high-level description of our algorithm ALG3’.

Input: The marked string S^ of S over the alphabet ΣΣ^.
Output: A run subsequence ALG3(S^) of S^.

 

Note that the replacement σ^ in the second step of ALG3’ does not create a new σ-run, but only increases the length of the σ-run which originally appears in S^c.

Example 21.

If the concatenated sequence is S^c=a2b^c^a2bc^c3, then we replace c^ in the rightmost mixed run, obtain a2b^c^a2bc4, and finally output a2bc4 of length seven. Recall that an optimal solution is OPT(S)=a4b2c3 of length nine.

Using arguments very similar to the proof of Lemma 14, we obtain the following lemma:

Lemma 22.

For any input S and its marked string S^, the following inequalities hold:

  1. 1.

    |ALG3(S)||ALG3(S^)|;

  2. 2.

    |ALG3(S)¯||ALG3(S^)¯|.

Recall that the optimal algorithm OPT deletes the number |OPT(S)¯| of characters from the input string S. In the following, we show that, given the marked string S^ of S, the number |ALG3(S^)¯| of characters deleted by the worse algorithm ALG3’ from S^ is bounded above by k+44|OPT(S)¯| if k is even, and k+34|OPT(S)¯| if k is odd.

We first consider the case k=2:

Lemma 23.

Suppose that k=2. Then, for S and its marked string S^, the following inequality holds:

|ALG3(S^)¯|32|OPT(S)¯|.

Proof.

We investigate the concatenation operation for marked strings. Consider a marked symbol σ^i. Assume that σiσj. Suppose that the marked string S^ includes a substring σjσ^iσj. Then, even if the middle character σ^i is deleted in the concatenation operation for marked strings, this deletion of σ^i does not increase the number of deleted characters compared to the number of deleted characters by the optimal algorithm. Therefore, we assume that σ^i remains in the concatenated sequence S^c after the concatenation operation for marked strings. Note that S^ does not include a substring σjσ^iσjσ^i. The reason is as follows. The rightmost σ^i implies that the optimal algorithm deletes the rightmost σi, but if it is not deleted, then the length of the optimal solution increases by one, which is a contradiction. One sees that it is enough to count the deleted characters in OPT(S) independently within substrings of length three or four in S^, since here we assume that k=2.

  • Suppose that S^ includes a substring σ^iσjσ^i. If the middle σj is deleted, then σ^i2 is obtained by the concatenation operation for marked strings. ALG3’ deletes those two σ^i’s in the third step. In other words, ALG3’ deletes one character in OPT(S) per two marked characters in OPT(S)¯ (i.e., 1/2 in OPT(S) per one in OPT(S)¯).

  • Suppose that S^ includes a substring σjσ^iσjσi. If the right σj is deleted, then we obtain σjσ^iσi and the middle σ^i is replaced by σi in the second step of ALG3’. Here, we can see that ALG3(S^) misses one σj but adds one σi, i.e., the number of deleted characters remains the same for the substring.

In summary, ALG3’ deletes one character in OPT(S) for two marked characters in OPT(S)¯. That is, ALG3’ deletes possibly all the marked characters in OPT(S)¯, and in addition, at most 12|OPT(S)¯| nonmarked characters in total. From these considerations, the upper bound on the number of deleted characters of ALG3’ can be calculated as follows:

|ALG3(S^)¯||OPT(S)¯|+12|OPT(S)¯|=32|OPT(S)¯|. (10)

Next, we obtain the following lemma for k=3, whose proof appears in Appendix B.

Lemma 24.

Suppose that k=3. Then, for S and its marked string S^, the following inequality holds:

|ALG3(S^)¯|32|OPT(S)¯|.

Finally, we show the case k4:

Lemma 25.

Suppose that k4. Then, for S and its marked string S^, the following inequality holds:

|ALG3(S^)¯|k2+22|OPT(S)¯|.

Proof.

We count the deleted characters that appear in OPT(S), during the concatenation operation for marked strings. Note that if S^ includes a substring consisting of at least two consecutive marked characters, then the substring may partition some σ-run in OPT(S) into the left σ-run and the right σ-run in S^, and the length of the shorter σ-run is at most k/2.

Now we observe the marked string S^ that includes a substring σσ^bσ^bσ^bσ^bσ^σ. Suppose that σ=b and σ=b in the substring. Then, all σ^’s are deleted in the concatenation operation for marked strings since the alternating b-run is longer than the alternating σ^-run. Here, the number of deleted characters by the concatenation operation remains the same as that by the optimal algorithm. If σb or σb, then all b’s in σ^bσ^bσ^bσ^bσ^ can be deleted by the concatenation operation for marked strings and the σ^-run is produced, i.e., characters in OPT(S) can be additionally deleted by ALG3’.

  • Suppose that the length of the produced σ^-run by the concatenation operation for marked strings is p. This implies that p1 b’s that appear in the optimal solution are deleted while p σ^’s remain in the concatenated sequence S^c obtained in the first step of ALG3’.

  • Suppose that after the first step of ALG3’, we obtain the concatenated sequence S^c. Moreover, suppose that S^c includes the σ^-run of length at least two. Then, S^c possibly includes a substring σ0p1σ^pσ0p2 for p2. Namely, we can see that the σ0-run of length p1+p2 is divided into two runs of length p1 and p2 by the σ^-run of length p2. Recall that in the third step ALG3’ selects the longest σ0-run. That is, if p1p2, then p1 characters are deleted; otherwise, p2 characters are deleted by ALG3’. From p1+p2k, at most k/2 characters are deleted if the length of the σ^-run is at least two.

In summary, the number of characters deleted from OPT(S) in S^ is one for every one single σ^11, and k/2 for every one substring of consecutive marked characters σ^1σ^p for p2.

Suppose that S^c includes r1 single marked characters whose left and right characters are not marked, and r2 substrings of at least two consecutive marked characters σ^1σ^p. Note that r1+2r2|OPT(S)¯|, and now we assume that k4. Hence, we obtain the following inequality on the number of characters deleted by ALG3’ from the marked string S^:

|ALG3(S^)¯| |OPT(S)¯|+r1+k2r2|OPT(S)¯|+12k2(r1+2r2)
k2+22|OPT(S)¯|

This completes the proof.

From Lemmas 22, 23, 24 and 25, we obtain the following theorem:

Theorem 2. [Restated, see original statement.]

k-MRSD can be approximated in linear time within a factor of (k+4)/4 for even k, and (k+3)/4 for odd k.

Finally, we can show that the analyses on the approximation ratios of (k+4)/4 and (k+3)/4 are strictly tight. For details, the reader is referred to Appendix D.

References

  • [1] Amit Agarwal, Moses Charikar, Konstantin Makarychev, and Yury Makarychev. O(logn) approximation algorithms for min UnCut, min 2CNF deletion, and directed cut problems. In Harold N. Gabow and Ronald Fagin, editors, Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005, pages 573–581. ACM, 2005. doi:10.1145/1060590.1060675.
  • [2] Yuichi Asahiro, Hiroshi Eto, Mingyang Gong, Jesper Jansson, Guohui Lin, Eiji Miyano, Hirotaka Ono, and Shunichi Tanaka. Approximation algorithms for the longest run subsequence problem. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 2:1–2:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.2.
  • [3] Bengt Aspvall, Michalel F. Plass, and Rovert E. Tarjan. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Information Processing Letters, 8(3):121–123, 1979. doi:10.1016/0020-0190(79)90002-4.
  • [4] Irit Dinur and Shmuel Safra. The importance of being biased. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 33–42. ACM, 2002. doi:10.1145/509907.509915.
  • [5] Riccardo Dondi and Florian Sikora. The longest run subsequence problem: Further complexity results. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 14:1–14:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.CPM.2021.14.
  • [6] Håstad Johan. Some optimal inapproximability results. Journal of the ACM, 48(4):498–859, 2001.
  • [7] George Karakostas. A better approximation ratio for the vertex cover problem. In Luís Caires, Giuseppe F. Italiano, Luís Monteiro, Catuscia Palamidessi, and Moti Yung, editors, Automata, Languages and Programming, 32nd International Colloquium, ICALP 2005, Lisbon, Portugal, July 11-15, 2005, Proceedings, volume 3580 of Lecture Notes in Computer Science, pages 1043–1050. Springer, 2005. doi:10.1007/11523468_84.
  • [8] Jon Kleinberg and Éva Tardos. Algorithm Design. Addison Wesley, 2006.
  • [9] Michael Lewin, Dror Livnat, and Uri Zwick. Improved rounding techniques for the MAX 2-SAT and MAX DI-CUT problems. In William J. Cook and Andreas S. Schulz, editors, Integer Programming and Combinatorial Optimization, 9th International IPCO Conference, Cambridge, MA, USA, May 27-29, 2002, Proceedings, volume 2337 of Lecture Notes in Computer Science, pages 67–82. Springer, 2002. doi:10.1007/3-540-47867-1_6.
  • [10] Junwei Luo, Yawei Wei, Mengna Lyu, Zhengjiang Wu, Xiaoyan Liu, Huimin Luo, and Chaokun Yan. A comprehensive review of scaffolding methods in genome assembly. Briefings Bioinform., 22(5), 2021. doi:10.1093/bib/bbab033.
  • [11] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. The longest run subsequence problem. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 172 of LIPIcs, pages 6:1–6:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.WABI.2020.6.
  • [12] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. Using the longest run subsequence problem within homology-based scaffolding. Algorithms Mol. Biol., 16(1):11, 2021. doi:10.1186/s13015-021-00191-8.
  • [13] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory Comput., 3(1):103–128, 2007. doi:10.4086/TOC.2007.V003A006.

Appendix A Proof of Lemma 11

Lemma 11. [Restated, see original statement.]

The summation of yi,zi satisfies the following inequality:

i=0kyin0+n1+i=0kzi.

Proof.

Note that i=0kyi=|Λ3| and i=0kzi=|Λ3′′|. Moreover, n0+n1=|Λ1|. Therefore it is sufficient to show |Λ3||Λ1|+|Λ3′′|=|Λ1Λ3′′|.

Then we design a mapping from Λ3 to Λ1Λ3′′ as follows: Consider a symbol sΛ3. Since s is a bad symbol, the algorithm chooses an s-run in a bad segment sts with ts. Then we know that t in this bad segment sts must be in ALG1(S). Otherwise, the local operation-1 is applicable to ALG1(S), which contradicts Step 4 of ALG1. If t is unique, then we maps s to t, which is called the Case-1 mapping. Now, we can assume t is not unique.

Since t is in ALG1(S) and t is not unique, then t must be in Λ2Λ3 since, otherwise, if t is in Π, then a t-run of length two must be chosen in ALG1(S). If t is in Λ3′′, then we map s to t, which is called the Case-2 mapping. Otherwise, tΛ2Λ3 and thus the length of t in ALG1(S) is exactly one. Note that t is not unique, and the local operation-2 is not applicable to ALG1(S). So, there exists another t such that it is in a bad segment wtw where a w-run of length two is chosen by the algorithm. Therefore wΛ3′′ and we map s to w, which is called the Case-3 mapping.

Then we prove that the mapping is injective and we are done. Suppose s1,s2 are two distinct symbols in Λ3 such that they are mapping to the same symbol wΛ1Λ3′′. If w is an unique symbol, then s1,s2 are mapping to w by the Case-1 mapping. That is, s1ws1 and s2ws2 are two substrings of S, which is impossible since s1s2 and w is unique. So, we can assume w is not unique, i.e., wΛ3′′ and s1,s2 are mapping to w by the Case-2 mapping or the Case-3 mapping.

The first case is that both s1, s2 are mapping to w by the Case-2 mapping. Then it indicates s1ws1 and s2ws2 are two bad segments in S where w is in Λ3′′. However, this case is impossible since two w-runs in ALG1(S) must come from a bad segment associated with w.

The second case is that exactly one of s1,s2 maps to w by the Case-2 mapping and the other one maps to w by the Case-3 mapping. Without loss of generality, we assume that s1 and s2 map to w by the Case-2 and Case-3 mapping, respectively. Recall that in the bad segment s1ws1, the symbol w is in Λ3′′. So, there is a substring ws1ws1 or s1ws1w containing two bad segments in S such that the w-run of length two is in ALG1(S). By symmetry, we consider the substring ws1ws1. Since s2 maps to w by Case-3 mapping, s2s1s2 is the substring such that an s1 and an s2 are in ALG1(S), which leads to a contradiction since s1Λ3.

The remaining case is that both s1,s2 map to w by the Case-3 mapping. In this case, there exist two substrings s1ws1 and s2ws2 in S and wΛ3′′. However, this is impossible since two w-runs must come from a bad segment of w, which completes the proof.

Appendix B Proof of Lemma 24

Lemma 24. [Restated, see original statement.]

Suppose that k=3. Then, for S and its marked string S^, the following inequality holds:

|ALG3(S^)¯|32|OPT(S)¯|.

Proof.

Consider a marked symbol σ^i and assume that σiσj again. As before, we consider the case where σ^i remains in the concatenated sequence S^c after the concatenation operation for marked strings. Note that S^ does not include a substring σjσ^iσjσ^iσjσ^i. The reason is the same as the previous: The rightmost σ^i implies that the optimal algorithm deletes the rightmost σi, but if the rightmost σi is not deleted, then the length of the optimal solution increases by one, which is a contradiction.

  • Suppose that S^ includes a substring σ^iσjσ^i. If the middle σj is deleted, then σ^i2 is obtained by the concatenation operation for marked strings, and ALG3’ deletes two characters σ^i2 in the third step. In other words, ALG3’ deletes one character in OPT(S) per two marked characters in OPT(S)¯.

  • Suppose that S^ includes a substring σjσ^iσjσi. If the right σj is deleted, then we obtain σjσ^iσi and the middle σ^i is replaced by σi in the second step of ALG3’. Namely, ALG3(S^) misses one σj but adds one σi, i.e., the number of deleted characters remains the same for the substring.

  • Suppose that S^ includes a substring σjσ^iσjσ^iσjσi. If the right two σj’s are deleted, then we obtain σjσ^i2σi and two σ^i2 are replaced by σi2 in the second step of ALG3’. Namely, ALG3(S^) misses two σj’s but adds two σi’s, i.e., the number of deleted characters remains the same for the substring.

Again, we conclude that ALG3’ deletes at least one character in OPT(S) for two marked characters in OPT(S)¯. From these considerations, the upper bound on the number of deleted characters of ALG3’ can be calculated as follows:

|ALG3(S^)¯||OPT(S)¯|+12|OPT(S)¯|=32|OPT(S)¯|. (11)

Appendix C Bad examples for Theorem 1

We can provide tight examples for the analysis on the approximation ratios when k=2 and k=3 in Theorem 1.

  1. (i)

    First, suppose that k=2. Then, consider the following string S2=S12S22Sp2 of length 6p for some integer p:

    S2=abcabcsubstring S12degdegsubstring S22σ3p2σ3p1σ3pσ3p2σ3p1σ3psubstring Sp2.

    Then, an optimal solution is OPT(S2)=a2bcd2egσ3p22σ3p1σ3p of length 4p, and a solution of ALG1 is ALG1(S2)=abcdegσ3p2σ3p1σ3p of length 3p Hence, |ALG1(S2)¯|/|OPT(S2)¯|=4/3.

  2. (ii)

    Next, suppose that k=3. Then, the following string S3=S13S23Sp3 of length 9p for some integer p is a tight example:

    S3= abcabcabcsubstring S13degdegdegsubstring S23
    σ3p2σ3p1σ3pσ3p2σ3p1σ3pσ3p2σ3p1σ3psubstring Sp3.

    Then, an optimal solution is OPT(S3)=a3bcd3egσ3p23σ3p1σ3p of length 5p, and an solution of ALG1 is the same as ALG1(S3). Therefore, |ALG1(S3)¯|/|OPT(S3)¯|=5/3.

  3. (iii)

    For the case k4, the approximation-ratio analysis in Theorem 1 is almost tight, but there is a slight gap. Suppose that k4. Then, consider the following string Sk=S1kS2kS2pk of length 6p for some integer 3k:

    Sk= aabaabaabsubstring S1k of length 3k/2 ccbccbccbsubstring S2k of length 3k/2 
    σ3p2σ3p2σ3p1σ3p2σ3p2σ3p1σ3p2σ3p2σ3p1substring S2p1k of length 3k/2
    σ3pσ3pσ3p1σ3pσ3pσ3p1σ3pσ3pσ3p1substring S2pk of length 3k/2

    Then, the following run subsequence OPT(Sk) of length (2k+1)p is an optimal solution:

    OPT(Sk)=akckbdkgkeσ3p2kσ3pkσ3p1

    On the other hand, our algorithm ALG2 outputs the following subsequence of length 5p:

    ALG2(Sk)=a2bc2d2eg2σ3p22σ3p1σ3p2.

    Hence, |ALG2(Sk)¯|/|OPT(Sk)¯|=(2k+1)/5.

Appendix D Bad examples for Theorem 2

We can show that the analysis on the approximation ratios in Theorem 2 is tight.

  1. (i)

    First, suppose that k is even. Then, consider the following string Se=S1eS2eSpe of length 2pk for some integer p:

    Se=ak2b2ak2bk2substring S1eck2d2ck2dk2substring S2eσ2p1k2σ2p2σ2p1k2σ2pk2substring Spe

    Then, the optimal solution OPT(Se) is obtained by deleting 2p characters as follows:

    OPT(Se)=akbk2ckdk2σ2p1kσ2pk2

    On the other hand, our algorithm ALG3 selects the leftmost longest σi-run for each i{1,,2p} and thus ALG3(Se) is as follows:

    ALG3(Se)=ak2bk2ck2dk2σ2p1k2σ2pk2

    The total number of characters deleted from Se is k+42p. Hence, |ALG3(Se)¯|/|OPT(Se)¯|=(k+4)/4.

  2. (ii)

    Next, suppose that k is odd. Then, consider the following string So=S1oS2oSpo of length 2pk for some integer p:

    So=ak+12b2ak12bk2substring S1ock+12d2ck12dk2substring S2oσ2p1k+12σ2p2σ2p1k12σ2pk2substring Spo

    Very similarly, we obtain the following equality: |ALG3(So)¯|/|OPT(So)¯|=(k+3)/4. As a result, the analysis of the approximation ratios in the proof of Theorem 2 is tight.