Approximability of Longest Run Subsequence and Complementary Minimization Problems
Abstract
We study the polynomial-time approximability of the Longest Run Subsequence problem (LRS for short) and its complementary minimization variant, the Minimum Run Subsequence Deletion problem (MRSD for short). For a string S over an alphabet Σ, a subsequence of S is a sequence obtained from S by deleting zero or more characters without changing the order of the remaining characters. A run of a symbol a in S is a maximal substring of consecutive occurrences of a. A run subsequence of S is a subsequence of S in which every symbol occurs in at most one run. The co-subsequence of a subsequence S′ in S is the subsequence obtained by deleting all the characters of S′ from S. Given a string S, the goal of LRS (resp., MRSD) is to find a run subsequence S′ of S such that the length of S′ is maximized (resp., the number of symbols deleted from S is minimized) over all the run subsequences of S. Let p be the maximum number of occurrences of a symbol in the input S. It is known that LRS and MRSD are APX-hard even when p is a small constant. In this paper, we show that LRS can be approximated in polynomial time within improved factors, with separate ratios for small p and for general p. Furthermore, we show that MRSD can be approximated in linear time, with approximation factors given separately for even p and odd p.
Keywords and phrases: Longest run subsequence, minimum run subsequence deletion, approximation algorithm
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
Funding: The work was partially supported by the NSERC Canada, JSPS KAKENHI Grant Numbers JP20H05967, JP22H00513, JP22K11915, JP24H00697, JP24K02898, JP24K02902, JP24K14827, and JP25K03077, and CRONOS Grant Number JPMJCS24K2.
Editors: Broňa Brejová and Rob Patro
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Scaffolding is one of the key informatics processes in DNA sequencing. DNA sequencing is generally carried out through the following steps: (i) Tens to hundreds of millions of DNA fragments extracted from random positions are read via shotgun sequencing, (ii) the extracted random fragments (reads) are assembled into a series of contiguous sequences (contigs) using an assembly algorithm, and (iii) finally, the contigs are arranged in the correct order based on certain criteria. This step (iii) is called scaffolding, which serves as the original motivation of this study. One common approach to scaffolding is to rearrange contigs by comparing multiple incomplete assemblies of related samples (see [10] for example).
The formulation of contig rearrangement from multiple incomplete assemblies in the scaffolding phase as a string processing problem by Schrinner et al. [11, 12] is known as the Longest Run Subsequence problem (LRS). Let Σ be a finite alphabet of symbols. A string S is a sequence of characters, each of which is a symbol in Σ; two or more characters in S can be the same symbol in Σ. For a string S, |S| denotes the length of S. For two strings S1 and S2, S1 · S2 denotes the concatenation of S1 and S2. A subsequence of S is a sequence obtained from S by deleting zero or more characters without changing the order of the remaining characters. Let S[i] denote the character of S in the ith position for 1 ≤ i ≤ |S|, and let S[i..j] denote the substring of S that starts from the ith position and ends at the jth position. An a-run in S is a substring S[i..j] such that S[i] = S[i+1] = ⋯ = S[j] = a for a symbol a ∈ Σ, but S[i−1] ≠ a and S[j+1] ≠ a (whenever these positions exist). Given a string S on alphabet Σ, a run subsequence of S is a subsequence of S in which every symbol occurs in at most one run. For the string S = aabba, for example, the substring bb is a b-run, and aabb is a run subsequence of S.
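The defining property of a run subsequence (every symbol contributes at most one maximal block of equal characters) can be checked directly. The following is a small Python sketch; the function and variable names are our own choices, not taken from the paper:

```python
from itertools import groupby

def is_run_subsequence(t: str) -> bool:
    """Return True iff every symbol of t occurs in at most one run,
    i.e., no symbol starts a second maximal block of equal characters."""
    seen = set()
    for symbol, _ in groupby(t):  # groupby yields one group per maximal run
        if symbol in seen:        # symbol already appeared in an earlier run
            return False
        seen.add(symbol)
    return True
```

For example, `aabb` is a run subsequence, while `abab` is not, since both `a` and `b` would occur in two runs.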
Problem 1 (Longest Run Subsequence problem, LRS).
Given a string S in Σ*, the goal of LRS is to find a longest run subsequence S* of S, i.e., every symbol occurs in at most one run in S* and the length |S*| is maximized over all the run subsequences of S.
For the string S = aabbab, for example, the longest run subsequence of S is aabbb, of length five. If the maximum number of occurrences of each symbol in the input string is bounded by p, then the problem is called the p-Longest Run Subsequence problem (p-LRS). One sees that 1-LRS is trivial, since the length of every run in the input string is one, and thus the input itself is the optimal run subsequence. Unfortunately, Schrinner et al. [12] showed that LRS is generally NP-hard. Subsequently, Dondi and Sikora [5] showed that p-LRS is APX-hard even for constant p, while, as a positive result, they provided a polynomial-time approximation algorithm for p-LRS. Recently, Asahiro et al. [2] improved the approximation ratio for p-LRS, and showed that for small p, a still better approximation ratio can be achieved.
In this paper, we first derive further improved approximability results for -LRS:
Theorem 1.
-LRS can be approximated in time within factors of for or , and for every .
This paper also considers the complementary minimization variant of LRS, called the Minimum Run Subsequence Deletion problem (MRSD). The co-subsequence of a subsequence S′ in S is the subsequence obtained by deleting all the characters of S′ from S. For example, consider S = abcab. Then, for the subsequence S′ = ab consisting of the first a and the first b, the co-subsequence of S′ is cab. Note that for a subsequence S′ of a string S, the co-subsequence is not unique unless we specify the position of each character of S′ in S: for S = abaca and S′ = aa (without indices from which these a’s come), the candidates for the co-subsequence are bca, bac, and abc, all of the same length three. As will be seen in the following, only the number of deleted characters matters in some cases, but we often need to take care of where each character is deleted from.
Problem 2 (Minimum Run Subsequence Deletion problem, MRSD).
Given a string S in Σ*, the goal of MRSD is to find a run subsequence S* of S such that the number of symbols deleted from S is minimized over all the run subsequences of S.
Similarly to p-LRS, if the maximum number of occurrences of each symbol in the input string is bounded by p, the problem is called the p-Minimum Run Subsequence Deletion problem (p-MRSD). Since the run subsequence obtained by minimizing the number of deletions in MRSD corresponds exactly to the longest run subsequence in LRS, LRS and MRSD are essentially equivalent as decision problems. However, due to the difference in the objective functions, MRSD may exhibit different characteristics from LRS in terms of approximability. Thus, a natural question arises: which problem is easier/harder to approximate, or are they equally hard?
To gain insight into this question, we consider two examples of problem pairs that are essentially equivalent as decision problems but differ in their objective functions, similar to LRS and MRSD. The first example is the pair of Max-2SAT and its deletion variant, Min-2SAT Deletion. It is known that 2SAT can be solved in polynomial time [3]. However, in the case where the instance is unsatisfiable, the objective of Max-2SAT is to find a truth assignment that maximizes the number of satisfied clauses (a maximization problem). In contrast, the objective of Min-2SAT Deletion is to minimize the number of unsatisfied clauses (a minimization problem). For these problems, the following results are known: The maximization version, Max-2SAT, admits a 1.0638-approximation algorithm [9], but it is NP-hard to approximate within a factor of 1.0476 [6]. On the other hand, the best known approximation ratio for the minimization version, Min-2SAT Deletion, is O(√(log n)) [1], and it is known to be NP-hard to approximate within a factor of 1.36067 [4]. Thus, intuitively, the maximization version is easier to approximate than the deletion version for these two problems.
Another example is the pair consisting of the Maximum Independent Set problem (MaxIS) and the Minimum Vertex Cover problem (MinVC). For any graph G = (V, E) and any independent set I of G, the complement V ∖ I forms a vertex cover. Thus, MinVC can be seen as the complementary minimization variant of MaxIS. For these problems, the following results are known: MaxIS is NP-hard to approximate within a factor of n^{1−ε} for any ε > 0 [13]. In contrast, MinVC is known to admit a 2-approximation algorithm [7]. Thus, contrary to the previous pair, in this case, the maximization version is harder to approximate than the deletion version.
As the second contribution, this paper investigates the approximability of p-MRSD:
Theorem 2.
-MRSD can be approximated in linear time within a factor of for even , and for odd .
Namely, unlike the two aforementioned examples, LRS and MRSD can be considered as problems that currently share similar approximation ratios. We remark that the basic strategies of the proposed approximation algorithms for MRSD are very similar to those for LRS, but the analyses of the approximation ratios are quite different.
Notation.
For each symbol a ∈ Σ, a^q denotes a length-q a-run, and ℓ_a denotes the length of the longest a-run in the input string S. Let occ(a) be the number of occurrences of a in the input string S, and let p = max_{a ∈ Σ} occ(a). Without loss of generality, we assume that the number of occurrences of each symbol in S is at least one, and if it is exactly one, then we say the symbol is unique.
Example 3.
Consider a string . Then includes two , three , two , two , and one , where each length is one; and a length- -run . Therefore, the length of the longest run for , , , , or is , and for is , respectively. That is, and . The number of occurrences of is three, is unique, i.e., , and .
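The quantities used in this example (the length of the longest run of each symbol and its total number of occurrences) can be computed in a single scan. The following Python sketch uses function names of our own choosing:

```python
from collections import Counter
from itertools import groupby

def run_statistics(s: str):
    """For each symbol a in s, compute the length of the longest a-run
    and the total number of occurrences of a."""
    occurrences = Counter(s)          # number of occurrences per symbol
    longest_run: dict = {}
    for symbol, group in groupby(s):  # one group per maximal run
        longest_run[symbol] = max(longest_run.get(symbol, 0), len(list(group)))
    return longest_run, occurrences
```

For instance, in `aabbbab` the longest `a`-run has length two, the longest `b`-run has length three, and `a` occurs three times in total.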
2 Approximation algorithms for Longest Run Subsequence
In this section we consider LRS and design approximation algorithms for p-LRS. That is, we assume that the maximum number of symbol occurrences in the input is always bounded by p.
2.1 Preprocessing
We first introduce an inserting operation to preprocess the input string. For every symbol a whose longest a-run has length at least three, we create an auxiliary symbol. The auxiliary alphabet that contains all the auxiliary symbols is denoted as . The inserting operation inserts a copy of the auxiliary symbol after the first two consecutive symbols in an a-run of length at least three. We repeatedly apply the inserting operation until no run of length at least three remains for any symbol.
Operation (An inserting operation).
Given an a-run of length at least three in the string, the operation inserts a copy of the auxiliary symbol after the first two consecutive symbols aa.
Example 4.
Recall the string in Example 3 and . After the preprocessing, becomes , and the resulting string is .
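Our reading of the preprocessing can be sketched as follows: every maximal run of length at least three is broken up by inserting copies of an auxiliary symbol (rendered here as `'$' + symbol`, an encoding of our own) after each pair of consecutive equal characters, so that no run of length three or more survives. This is a hedged sketch, not necessarily the paper's exact procedure:

```python
from itertools import groupby

def preprocess(s: str) -> list:
    """Break every maximal run of length >= 3 by inserting an auxiliary
    symbol after each pair of consecutive equal characters."""
    out = []
    for symbol, group in groupby(s):      # one group per maximal run
        q = len(list(group))
        if q < 3:                         # short runs are left untouched
            out.extend([symbol] * q)
            continue
        aux = "$" + symbol                # auxiliary symbol for this symbol
        while q > 0:
            take = min(2, q)
            out.extend([symbol] * take)
            q -= take
            if q > 0:                     # more copies follow: break the run
                out.append(aux)
    return out
```

For instance, `aaab` becomes `a a $a a b`, in which every run has length at most two.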
One clearly sees that in the preprocessed string, the length of the longest run of any symbol is at most two (and every auxiliary symbol occurs only in runs of length one). Let denote the subset of symbols ’s such that , and then let .
2.2 The algorithm
We present an algorithm ALG1 to compute a run subsequence for the preprocessed string obtained by applying the above preprocessing to the input.
Definition 5.
For a symbol a that is not unique, if every two consecutive occurrences of a in the string are separated by at least two symbols, then a is a good symbol. Otherwise, a is a bad symbol, and every substring in the form axa, where x is a single occurrence of another symbol, is a bad segment associated with a.
Using Definition 5, we partition the alphabet into three subsets, containing all the unique symbols, all the good symbols, and all the bad symbols, respectively. One sees that such a partition can be computed by a single scan of the string.
Our algorithm constructs an initial solution and then applies two local search operations to update the solution. Initially, for each unique or good symbol, the algorithm picks its leftmost longest run into the solution; for each bad symbol, the algorithm picks the leftmost occurrence in the first bad segment associated with it into the solution. Note that all these picked runs are in the same order as they show up in the preprocessed string, i.e., the solution is a subsequence of it. We continue to illustrate using the string in Example 4.
Example 6.
Consider the string . We have , , , and . The initial solution is , which is obtained by picking the symbols in the boxes as follows:
Observe that if the symbol after the picked bad symbol in the associated bad segment is not picked, then we can add the second bad symbol in the bad segment to the solution to increase its length by one. In the sequel, we aim to do this by possibly swapping some picked good symbols with their respective copies at the other places.
To this purpose, we further partition into two subsets and , where contains those bad symbols, each of which has a length- run in the solution . At the beginning, and .
We design two local operations to repeatedly improve the initial solution . The first operation is almost the above observation, while the second goes slightly further to swap a picked symbol.
Operation (Local operation-1 for ).
Given a symbol in a bad segment such that its symbol is not picked into , the operation replaces in by the two copies of in this bad segment, and moves from to .
Operation (Local operation-2 for ).
Given a symbol in a bad segment such that its symbol is picked into , the operation finds another occurrence of in that does not break any length- runs in , replaces the picked in by this occurrence and replaces in by the two copies of in this bad segment, and moves from to .
We remark that while applying the above two local operations, the -run for each and the -run for each are untouched; for each , it appears exactly once in , while for each , it appears twice in and they are picked from a bad segment associated with . The goal of the local search is to reduce the number of bad symbols in as much as possible, and the process terminates when none of the two local operations is applicable. A high-level description of the algorithm is as follows:
Input: The sequence obtained from preprocessing .
Output: A subsequence of .
We examine the time complexity of ALG1. First notice that and . Therefore, the partition of into can be done in time. One sees that ALG1 executes at most local operations, and finding a symbol in to which the local operation-1 is applicable takes time and finding a symbol in to which the local operation-2 is applicable takes time. It follows that ALG1 runs in .
We continue to illustrate using the string in Example 4.
Example 7.
Consider the string . From Example 6, the initial solution is and , so that and .
The local operation-1 is applicable for which is associated with only one bad segment , where its is not picked. Thus is updated to by picking the symbols in the boxes as follows:
and and .
One sees that the local operation-1 is no longer applicable for symbols and in , but then the local operation-2 is applicable to in the first bad segment associated with , since the second occurrence of does not break any length- runs in . As a result, is updated to by picking the symbols in the boxes as follows:
and and . For the last symbol , we can apply neither the local operation-1 nor the local operation-2, and thus the algorithm terminates.
2.3 Post-processing
We process the achieved solution to produce a solution for the input . Recall that is the result of preprocessing by the inserting operations, each of which inserts a copy of the auxiliary symbol after the first two consecutive symbols in a run of length at least three. For each auxiliary symbol , we delete its single copy from and then replace the in by a longest -run in the input sequence . We remark that is either unique or good, and thus it appears exactly once in , and that no symbol of the longest -run in the input sequence breaks any length- run in .
We continue to illustrate using the string in Example 3.
Example 8.
Consider the string , for which , where is inserted by the inserting operation. From Example 7, the achieved solution for is by picking the symbols in the boxes as follows:
The post-processing gives the solution for by picking the symbols in the boxes as follows:
2.4 Performance analysis
In this section, we analyze the worst-case ratio of our algorithm. Let and denote the output sequence by our algorithm and an optimal solution when the input sequence is , respectively. Let .
Lemma 9.
For an input string , let be the result of preprocessing by the inserting operation. Then is satisfied.
Proof.
is a subsequence of . Therefore, .
We next prove . Note that for , it corresponds to a symbol with in . If in , then is unique in . Otherwise, in , then is a good symbol since in between every two symbols, there is a with . That is, and thus the length of in is exactly one. Therefore, the total length of symbols and in is exactly three. It indicates that since we choose the longest -run whose length is at least three in . Then the lemma is proved.
We remark that since the approximation ratio of the string is always worse than the one of by Lemma 9, we show a bound on the approximation ratio of instead of the original input .
Now, we are ready to prove the worst-case ratio of our algorithm. Let be the number of unique symbols whose length is in the optimal solution for . Recall that for each symbol , the length of the -run in must be one. We use to denote the number of symbols in whose length is in for . Similarly, let (, respectively) be the number of symbols in (, respectively) whose length is in for . Lastly, for each symbol , the length of the -run in is two. We denote by the number of symbols in whose length is in for . In conclusion, we have the following two equalities:
(1)
(2)
Let be the total number of symbols deleted from by the optimal algorithm to obtain . Notice that if the length of the -run is in and is not unique, then the total number of deleted symbols is at most . Therefore the upper bound on is
(3)
Then we estimate a lower bound on as follows.
Lemma 10.
For the number of deleted symbols , we have
Proof.
Since each symbol is a good symbol, there exist at least two other symbols in between every two -runs. In order to generate a length- -run in , at least symbols are deleted.
For each symbol , in . So, forming a length- -run must delete at least symbols.
Lastly, consider a symbol . Recall that . It indicates that we must delete at least one symbol to generate every length- -run. So for each symbol in , we must delete at least symbols.
This completes the proof of this lemma.
Lemma 11.
The summation of satisfies the following inequality:
Proof.
The proof appears in Appendix A.
We can show the following theorem:
Theorem 1. [Restated, see original statement.]
-LRS can be approximated in time within factors of for or , and for every .
Proof sketch..
Let be a constant; to get the best worst-case ratio, we will set for , and for later.
By Eq.(2.4) and Eq.(1), we have
(5)
(6)
We discuss two cases of . When , and and . By Lemma 11, Eqs. (2.4) and (2), the following inequality holds.
(7)
On the other hand, when , and thus . By Lemma 11, we have
(8)
Here we obtain several inequalities that are used in the following. If , then , and . Moreover, , and . Note that the following inequality holds:
By Eqs.(8) and (2), Eq.(2.4) can be simplified into the following inequality:
(9)
We note that the analysis on the approximation ratio of for or is strictly tight. However, for the ratio of , we know only a bad example for which the approximation ratio is ; there remains a slight gap. See Appendix C for details.
3 Simple approximation algorithm for MRSD
In this section, we consider the deletion variant -MRSD. Again, assume that is bounded by . As a warm-up, we design a simple approximation algorithm ALG2 with approximation ratios if is even, and if is odd:
Input: An input string in which every symbol appears at most times.
Output: A run subsequence of .
That is, ALG2 deletes all unselected runs for every symbol from .
Example 12.
Consider the input string of length for MRSD. The leftmost longest -run, -run, and -run are , , and , respectively. Therefore, the output subsequence of ALG2 is , and thus the co-subsequence of is . Therefore, the number of deleted characters from by our algorithm ALG2 is nine.
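The core of ALG2 (keep, for each symbol, only its leftmost longest run and delete every other occurrence) can be sketched as follows; function names are our own:

```python
from itertools import groupby

def alg2(s: str) -> str:
    """Keep only the leftmost longest run of each symbol; delete the rest.
    The result is a run subsequence of s."""
    # Locate all maximal runs as (symbol, start, length).
    runs, pos = [], 0
    for symbol, group in groupby(s):
        length = len(list(group))
        runs.append((symbol, pos, length))
        pos += length
    # Leftmost longest run per symbol (strict '>' keeps the leftmost on ties).
    best = {}
    for symbol, start, length in runs:
        if symbol not in best or length > best[symbol][1]:
            best[symbol] = (start, length)
    kept = sorted(best.values())  # restore left-to-right order
    return "".join(s[start:start + length] for start, length in kept)
```

For example, on `aabbaaab` the algorithm keeps the runs `bb` and `aaa` and outputs `bbaaa`, deleting three characters.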
Clearly, ALG2 can be implemented in linear time. We bound its approximation ratio in the following. Let S be an input string of p-MRSD, and consider the solutions obtained for S by an optimal algorithm OPT and by our algorithm ALG2, respectively. An outline of our proof of the approximation ratio is as follows: (I) we first obtain an upper bound on the number of characters deleted by ALG2; then, (II) we bound this upper bound from above by a factor (depending on the parity of p) times the number of characters deleted by OPT.
(I) We obtain an upper bound on the number of deleted characters by our algorithm ALG2. To do so, we first construct a new string , called a “marked” string, by replacing every character in the co-subsequence of with an auxiliary symbol , called a “marked” symbol (character). Let be the alphabet of marked symbols, where if . Then, if the th character of is in , then it is replaced with the corresponding marked symbol . Furthermore, we replace every marked character in with a new symbol , and call it the “-string” of . For example, consider as an input for MRSD again, where . One can verify that is an optimal solution. Then, we obtain the marked string , where is the alphabet of the marked symbols. By replacing every marked character in with , we obtain the -string . Let be the number of -runs in the -string. For example, includes four -runs, i.e., , three length- -runs and one length- -run. Note that holds.
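The construction of the marked string from a fixed optimal solution, and the counting of runs of marked characters, can be sketched as follows. Here we render every marked character as `*` (our own stand-in for the paper's marked alphabet), and the set of deleted positions is 0-indexed:

```python
from itertools import groupby

def star_string(s: str, deleted: set) -> str:
    """Replace every character at a deleted position (the co-subsequence
    of a fixed optimal solution) with '*'."""
    return "".join("*" if i in deleted else ch for i, ch in enumerate(s))

def count_star_runs(t: str) -> int:
    """Number of maximal runs of '*' in the marked string."""
    return sum(1 for symbol, _ in groupby(t) if symbol == "*")
```

For instance, deleting positions 1 and 2 of `abcab` yields the marked string `a**ab`, which contains one run of marked characters.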
Next, just for the sake of the analysis of the approximation ratios, we introduce a new algorithm ALG2’ that takes the -string as input. ALG2’ is very similar to ALG2, but it deletes all the -runs. It is important to note that ALG2’ is used only in the analysis, so we never actually need to find an optimal solution. Later, we show that the number of characters deleted by ALG2’ for the -string is a good upper bound on the number of characters deleted by ALG2.
Input: The -string of over the alphabet .
Output: A run subsequence of .
Example 13.
Consider the -string of . Then is the output of ALG2’.
Since all -runs are deleted from , must be feasible for -MRSD on the original input . Then, we get the following upper bound on :
Lemma 14.
For any input string and its -string , the following inequalities hold:
-
1.
;
-
2.
.
Proof.
(1) in the -string is at most in the original string for each . Therefore, holds. (2) From and , the number of characters deleted from is at least the number of deleted characters from .
(II) Next, we consider an upper bound on the number of deleted characters by ALG2’ on . The crux of the following estimation is the number of deleted characters from of by ALG2’.
Lemma 15.
For any input string and its -string for any optimal solution , the following inequality holds:
Proof.
We first divide -runs in the -string into two types, (type-i) for , and (type-ii) for an integer . Suppose that can be represented by . Recall that if we delete all ’s from , then the remaining sequence must be an optimal solution, i.e., a run subsequence. Therefore, if the middle -run is in (type-i) and (resp., ) includes a , then (resp., ) does not include . On the other hand, we can see that the second type -run partitions some -run in the optimal solution into the left -run and the right -run in the -string .
For example, look at a -string for and . We focus on the six -runs in . Since the left and the right runs (or characters) of the second (also, the third and the fourth) -run in are the same, it is in (type-ii). Here, we can see that one long -run of length is divided into four -runs, , , and . Note that ALG2’ selects the longest -run for each and deletes all other -runs from . Therefore, three -runs of the four -runs are deleted by ALG2’. If -runs divide a -run of length at most into -runs, then the length of each -run except for the longest -run is bounded above by . This implies that the number of ’s deleted from for a symbol is bounded above by per -run in (type-ii) in the worst case. On the other hand, the left and the right runs of the first (also, the fifth and the sixth) -run in are different, and thus it is in (type-i). For example, in the left substring of the fifth length- -run, neither nor appears since and are included in the right substring of the -run. Therefore, after deleting , we can independently count the numbers of characters in that are deleted from the left substring and from the right substring by ALG2’.
Let the number of -runs in (type-ii) be . Note that holds and the total number of ’s in (type-i) and (type-ii) deleted from is . Also, the total number of ’s deleted from for all symbols in is bounded above by . Hence, we obtain the following inequality on the number of characters deleted by ALG2’:
This completes the proof of this lemma.
Theorem 16.
ALG2 is a linear-time approximation algorithm with approximation ratios if is even, and if is odd.
4 Improved approximation algorithm for MRSD
In this section, we present an improved approximation algorithm ALG3 that runs in linear time and its approximation ratios are if is even, and if is odd.
4.1 Concatenation operation
We first define an alternating -run:
Definition 17.
An alternating -run (simply, an alternating run) in is a substring for an integer such that (i) , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ).
For example, consider a string . In the string , , , , , and are an alternating -run, an alternating -run, an alternating -run, an alternating -run, and an alternating -run, respectively.
Then, we introduce a concatenation operation to obtain a -run from an alternating -run in the string . Consider an alternating -run . Then, the concatenation operation deletes all characters that are not from , i.e., , , , , and obtains a -run. For a string , however, there are two possibilities, and by deleting two ’s and three ’s, respectively. The operation finds a concatenation (i.e., run) as long as possible using an optimal algorithm for the Interval Scheduling problem (see, e.g., [8]): We regard the alternating run as the interval of weight . A pair of two alternating runs and is independent if or holds. The concatenation operation aims to find a maximum weight subset of mutually independent alternating runs from , and obtain a long -run from the selected alternating -run by deleting all the characters that are not for every .
Operation (Concatenation operation).
Given the input string , the operation obtains a concatenated sequence :
- (Step 1)
-
Find all the alternating runs in .
- (Step 2)
-
Select a maximum subset of mutually independent alternating runs in .
- (Step 3)
-
Delete all characters from so that every alternating -run in becomes a -run, and obtain a concatenated sequence .
Example 18.
Consider again a string . In Step 1, we find an alternating -run of weight three, an alternating -run of weight three, an alternating -run of weight three, an alternating -run of weight seven, and an alternating -run of weight five. Then, in Step 2, we select whose total weight is . Finally, we delete four characters , , , from , and obtain .
We estimate the running time of the concatenation operation. Note that the number of alternating runs overlapped at the same position is at most two, and there are alternating runs in . Hence Step 1 can be implemented in time by scanning the string from left to right. Furthermore, during Step 1, we can sort the alternating runs according to their right-ends in . Among alternating runs, we can select the maximum independent set by greedily selecting the non-overlapped alternating run with the leftmost right-end [8] in Step 2, which takes time. Step 3 needs time. Therefore, the total running time of the concatenation operation is .
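The greedy selection in Step 2 (repeatedly take the compatible interval with the leftmost right end) is the classic interval-scheduling rule [8]. A sketch, with intervals represented as (left, right) pairs of our own devising:

```python
def max_independent_runs(intervals):
    """Greedily select a maximum-cardinality set of pairwise disjoint
    intervals by always taking the one with the leftmost right end."""
    chosen, last_right = [], float("-inf")
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if left > last_right:        # disjoint from everything chosen so far
            chosen.append((left, right))
            last_right = right
    return chosen
```

Sorting by right endpoint dominates the running time, matching the bound claimed above for Step 2.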
4.2 The algorithm
We present an improved approximation algorithm ALG3 with approximation ratios if is even, and if is odd.
Input: An input string in which every symbol appears at most times.
Output: A run subsequence of .
For example, if the concatenation operation produces from , then ALG3 outputs . It is clear that ALG3 can obtain a feasible solution in time.
4.3 Approximation ratios
Let and be the solutions obtained by an optimal algorithm OPT and by ALG3 for an input string , respectively. An outline of our proof of the approximation ratio is very similar to Section 3: (I) we first obtain an upper bound on the number of characters deleted by ALG3, compared with the number of characters deleted by the optimal algorithm; then, (II) we bound this upper bound from above by a factor, depending on the parity of p, times the optimal number of deletions. Again, we first construct the marked string from using the optimal algorithm, and design a slightly weaker algorithm ALG3’ than ALG3.
Similarly to Section 3, we construct the marked string by replacing every character in the co-subsequence with a symbol . Let be the alphabet of marked symbols. For example, consider a string of length . Then, is an optimal solution and is the marked string obtained from . Note that the optimal solution deletes and thus includes . Furthermore, does not appear in since must be the run subsequence and includes ’s. In addition, does not include since ’s appear in and .
Similarly to a -run for , for the marked symbol , a -run can be defined. Also, without distinguishing from , we consider “mixed” -runs, -type, -type, and -type, for some positive integers , , and in the following. That is, for example, is regarded as one mixed run of length five. Note that we do not need to consider the -type since marked strings are constructed based on optimal solutions. Furthermore, we define an alternating-“mixed” run without distinguishing from as follows:
Definition 19.
An alternating-mixed -run (or, simply alternating-mixed run) in is a substring which satisfies, (Case 1), (Case 2), or (Case 3):
- (Case 1)
-
-type. (i) , and for , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ). That is, and are the alternating -run and the alternating -run, respectively, and for a symbol .
- (Case 2)
-
-type. (i) , and for , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ). That is, and are the alternating -run and the alternating -run, respectively, and for a symbol .
- (Case 3)
-
-type. (i) , , and , for , (ii) , , , , and (iii) (if ), (if ), (if ), and (if ). That is, and are the alternating -runs, and the middle is the alternating -run, and for symbols .
For example, consider the marked string of length . In the string , , and are the alternating -run, the alternating-mixed -run in (Case 3), and the alternating-mixed -run in (Case 2), respectively.
Here, we introduce the concatenation operation for marked strings by slightly modifying the concatenation operation.
Operation (Concatenation operation for marked strings).
Given the marked string of the input string , the operation obtains a concatenated sequence :
- (Step 1)
-
Find all the alternating runs and all the alternating-mixed runs on in .
- (Step 2)
-
Select a maximum subset of mutually independent alternating/alternating-mixed runs in .
- (Step 3)
-
Delete all characters that are neither nor from so that every alternating -run, every alternating -run and every alternating-mixed -run in become a -run, a -run and a mixed -run, respectively, and obtain the concatenated sequence .
Example 20.
Consider the marked string of length . In Step 1, we find an alternating -run of weight three, an alternating -run of weight three, an alternating -run of weight three, an alternating-mixed -run of weight seven, and an alternating-mixed -run of weight five. Then, in Step 2, we select whose total weight is . Finally, we delete four characters , , , and from , and obtain .
To obtain an upper bound on , we introduce the following algorithm ALG3’:
Input: The marked string of over the alphabet .
Output: A run subsequence of .
Note that the replacement in the second step of ALG3’ does not create a new -run, but only increases the length of the -run which originally appears in .
Example 21.
If the concatenated sequence is , then we replace in the rightmost mixed run, obtain , and finally output of length seven. Recall that an optimal solution is of length nine.
Using arguments very similar to the proof of Lemma 14, we obtain the following lemma:
Lemma 22.
For any input and its marked string , the following inequalities hold:
-
1.
;
-
2.
.
Recall that the optimal algorithm OPT deletes the number of characters from the input string . In the following, we show that, given the marked string of , the number of characters deleted by the worse algorithm ALG3’ from is bounded above by if is even, and if is odd.
We first consider the case :
Lemma 23.
Suppose that . Then, for and its marked string , the following inequality holds:
Proof.
We investigate the concatenation operation for marked strings. Consider a marked symbol . Assume that . Suppose that the marked string includes a substring . Then, even if the middle character is deleted in the concatenation operation for marked strings, this deletion of does not increase the number of deleted characters compared to the number of deleted characters by the optimal algorithm. Therefore, we assume that remains in the concatenated sequence after the concatenation operation for marked strings. Note that does not include a substring . The reason is as follows. The rightmost implies that the optimal algorithm deletes the rightmost , but if it is not deleted, then the length of the optimal solution increases by one, which is a contradiction. One sees that it is enough to count the deleted characters in independently within substrings of length three or four in , since here we assume that .
-
Suppose that includes a substring . If the middle is deleted, then is obtained by the concatenation operation for marked strings. ALG3’ deletes those two ’s in the third step. In other words, ALG3’ deletes one character in per two marked characters in (i.e., in per one in ).
-
Suppose that includes a substring . If the right is deleted, then we obtain and the middle is replaced by in the second step of ALG3’. Here, we can see that misses one but adds one , i.e., the number of deleted characters remains the same for the substring.
In summary, ALG3’ deletes one character in for every two marked characters in . That is, ALG3’ possibly deletes all the marked characters in and, in addition, at most nonmarked characters in total. From these considerations, the upper bound on the number of characters deleted by ALG3’ can be calculated as follows:
(10)
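The deletion counts appearing in bounds such as the one above can be sanity-checked by brute force on tiny strings. The following is an illustrative, exponential-time sketch based only on the definitions in the introduction (not on ALG3'); the function names are ours:

```python
from itertools import combinations

def is_run_subsequence(s):
    """True iff every symbol of s occurs in at most one maximal run of s."""
    seen, prev = set(), None
    for c in s:
        if c != prev:            # a new run starts here
            if c in seen:        # this symbol already formed an earlier run
                return False
            seen.add(c)
            prev = c
    return True

def min_run_subsequence_deletion(s):
    """Minimum number of characters whose deletion leaves a run
    subsequence of s (tries deletion counts k = 0, 1, ...; tiny s only)."""
    n = len(s)
    for k in range(n + 1):
        for kept in combinations(range(n), n - k):
            if is_run_subsequence("".join(s[i] for i in kept)):
                return k
    return n
```

For example, one deletion suffices for "abab" (delete the second a to obtain "abb"), while "ababab" needs two deletions.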
Next, we obtain the following lemma for , whose proof appears in Appendix B.
Lemma 24.
Suppose that . Then, for and its marked string , the following inequality holds:
Finally, we show the case :
Lemma 25.
Suppose that . Then, for and its marked string , the following inequality holds:
Proof.
We count the deleted characters that appear in , during the concatenation operation for marked strings. Note that if includes a substring consisting of at least two consecutive marked characters, then the substring may partition some -run in into the left -run and the right -run in , and the length of the shorter -run is at most .
Next, suppose that the marked string includes a substring , and that and hold in the substring. Then, all ’s are deleted in the concatenation operation for marked strings since the alternating -run is longer than the alternating -run. Here, the number of characters deleted by the concatenation operation remains the same as the number deleted by the optimal algorithm. If or , then all ’s in can be deleted by the concatenation operation for marked strings and the -run is produced, i.e., characters in can be additionally deleted by ALG3’.
-
Suppose that the length of the produced -run by the concatenation operation for marked strings is . This implies that ’s that appear in the optimal solution are deleted while ’s remain in the concatenated sequence obtained in the first step of ALG3’.
-
Suppose that after the first step of ALG3’, we obtain the concatenated sequence . Moreover, suppose that includes the -run of length at least two. Then, possibly includes a substring for . Namely, we can see that the -run of length is divided into two runs of length and by the -run of length . Recall that in the third step ALG3’ selects the longest -run. That is, if , then characters are deleted; otherwise, characters are deleted by ALG3’. From , at most characters are deleted if the length of the -run is at least two.
In summary, the number of characters deleted from in is one for each single , and for each substring of consecutive marked characters for .
Suppose that includes single marked characters whose left and right characters are not marked, and substrings of at least two consecutive marked characters . Note that , and now we assume that . Hence, we obtain the following inequality on the number of characters deleted by ALG3’ from the marked string :
This completes the proof.
Theorem 2. [Restated, see original statement.]
-MRSD can be approximated in linear time within a factor of for even , and for odd .
Finally, we can show that the analyses of the approximation ratios of and are strictly tight. For details, the reader is referred to Appendix D.
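As a point of comparison for the tight examples in Appendix D, the run-selection rule used there — keep, for each symbol, its leftmost longest run and delete every other occurrence — can be sketched as follows. This is a simplified illustration of that selection rule only, not the paper's full ALG3, and the function names are ours:

```python
def runs(s):
    """Maximal runs of s, e.g. "aabba" -> ["aa", "bb", "a"]."""
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] += c
        else:
            out.append(c)
    return out

def keep_leftmost_longest_runs(s):
    """For each symbol keep its leftmost longest run; delete the rest.
    The output is always a run subsequence of s."""
    rs = runs(s)
    best = {}  # symbol -> index of its leftmost longest run
    for i, r in enumerate(rs):
        if r[0] not in best or len(r) > len(rs[best[r[0]]]):
            best[r[0]] = i
    kept = set(best.values())
    return "".join(r for i, r in enumerate(rs) if i in kept)
```

On "abab" this rule keeps "ab", deleting two characters, while a single deletion already yields the run subsequence "abb" — the kind of gap that the tight examples in Appendix D exploit in a more refined form.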
References
- [1] Amit Agarwal, Moses Charikar, Konstantin Makarychev, and Yury Makarychev. approximation algorithms for min UnCut, min 2CNF deletion, and directed cut problems. In Harold N. Gabow and Ronald Fagin, editors, Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005, pages 573–581. ACM, 2005. doi:10.1145/1060590.1060675.
- [2] Yuichi Asahiro, Hiroshi Eto, Mingyang Gong, Jesper Jansson, Guohui Lin, Eiji Miyano, Hirotaka Ono, and Shunichi Tanaka. Approximation algorithms for the longest run subsequence problem. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 2:1–2:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.2.
- [3] Bengt Aspvall, Michael F. Plass, and Robert E. Tarjan. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Information Processing Letters, 8(3):121–123, 1979. doi:10.1016/0020-0190(79)90002-4.
- [4] Irit Dinur and Shmuel Safra. The importance of being biased. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 33–42. ACM, 2002. doi:10.1145/509907.509915.
- [5] Riccardo Dondi and Florian Sikora. The longest run subsequence problem: Further complexity results. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 14:1–14:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.CPM.2021.14.
- [6] Johan Håstad. Some optimal inapproximability results. Journal of the ACM, 48(4):798–859, 2001. doi:10.1145/502090.502098.
- [7] George Karakostas. A better approximation ratio for the vertex cover problem. In Luís Caires, Giuseppe F. Italiano, Luís Monteiro, Catuscia Palamidessi, and Moti Yung, editors, Automata, Languages and Programming, 32nd International Colloquium, ICALP 2005, Lisbon, Portugal, July 11-15, 2005, Proceedings, volume 3580 of Lecture Notes in Computer Science, pages 1043–1050. Springer, 2005. doi:10.1007/11523468_84.
- [8] Jon Kleinberg and Éva Tardos. Algorithm Design. Addison Wesley, 2006.
- [9] Michael Lewin, Dror Livnat, and Uri Zwick. Improved rounding techniques for the MAX 2-SAT and MAX DI-CUT problems. In William J. Cook and Andreas S. Schulz, editors, Integer Programming and Combinatorial Optimization, 9th International IPCO Conference, Cambridge, MA, USA, May 27-29, 2002, Proceedings, volume 2337 of Lecture Notes in Computer Science, pages 67–82. Springer, 2002. doi:10.1007/3-540-47867-1_6.
- [10] Junwei Luo, Yawei Wei, Mengna Lyu, Zhengjiang Wu, Xiaoyan Liu, Huimin Luo, and Chaokun Yan. A comprehensive review of scaffolding methods in genome assembly. Briefings Bioinform., 22(5), 2021. doi:10.1093/bib/bbab033.
- [11] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. The longest run subsequence problem. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 172 of LIPIcs, pages 6:1–6:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.WABI.2020.6.
- [12] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. Using the longest run subsequence problem within homology-based scaffolding. Algorithms Mol. Biol., 16(1):11, 2021. doi:10.1186/s13015-021-00191-8.
- [13] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory Comput., 3(1):103–128, 2007. doi:10.4086/TOC.2007.V003A006.
Appendix A Proof of Lemma 11
Lemma 11. [Restated, see original statement.]
The summation of satisfies the following inequality:
Proof.
Note that and . Moreover, . Therefore it is sufficient to show .
Then we design a mapping from to as follows: Consider a symbol . Since is a bad symbol, the algorithm chooses an -run in a bad segment with . Then we know that in this bad segment must be in . Otherwise, the local operation-1 is applicable to , which contradicts Step 4 of ALG1. If is unique, then we map to , which is called the Case-1 mapping. Now, we can assume that is not unique.
Since is in and is not unique, must be in since, otherwise, if were in , then a -run of length two would have to be chosen in . If is in , then we map to , which is called the Case-2 mapping. Otherwise, , and thus the length of in is exactly one. Note that is not unique and the local operation-2 is not applicable to , so there exists another that lies in a bad segment where a -run of length two is chosen by the algorithm. Therefore , and we map to , which is called the Case-3 mapping.
Then we prove that the mapping is injective, and we are done. Suppose that are two distinct symbols in that map to the same symbol . If is a unique symbol, then both are mapped to by the Case-1 mapping. That is, and are two substrings of , which is impossible since and is unique. So, we can assume that is not unique, i.e., and map to by the Case-2 mapping or the Case-3 mapping.
The first case is that both and map to by the Case-2 mapping. This indicates that and are two bad segments in where is in . However, this case is impossible since two -runs in must come from a bad segment associated with .
The second case is that exactly one of maps to by the Case-2 mapping and the other maps to by the Case-3 mapping. Without loss of generality, we assume that and map to by the Case-2 and Case-3 mappings, respectively. Recall that in the bad segment , the symbol is in . So, there is a substring or containing two bad segments in such that the -run of length two is in . By symmetry, we consider the substring . Since maps to by the Case-3 mapping, is the substring such that an and an are in , which leads to a contradiction since .
The remaining case is that both map to by the Case-3 mapping. In this case, there exist two substrings and in and . However, this is impossible since two -runs must come from a bad segment of , which completes the proof.
Appendix B Proof of Lemma 24
Lemma 24. [Restated, see original statement.]
Suppose that . Then, for and its marked string , the following inequality holds:
Proof.
Consider a marked symbol and assume again that . As before, we consider the case where remains in the concatenated sequence after the concatenation operation for marked strings. Note that does not include a substring . The reason is the same as before: the rightmost implies that the optimal algorithm deletes the rightmost , but if the rightmost is not deleted, then the length of the optimal solution increases by one, which is a contradiction.
-
Suppose that includes a substring . If the middle is deleted, then is obtained by the concatenation operation for marked strings, and ALG3’ deletes two characters in the third step. In other words, ALG3’ deletes one character in per two marked characters in .
-
Suppose that includes a substring . If the right is deleted, then we obtain and the middle is replaced by in the second step of ALG3’. Namely, misses one but adds one , i.e., the number of deleted characters remains the same for the substring.
-
Suppose that includes a substring . If the right two ’s are deleted, then we obtain and two are replaced by in the second step of ALG3’. Namely, misses two ’s but adds two ’s, i.e., the number of deleted characters remains the same for the substring.
Again, we conclude that ALG3’ deletes at least one character in for every two marked characters in . From these considerations, the upper bound on the number of characters deleted by ALG3’ can be calculated as follows:
(11)
Appendix C Bad examples for Theorem 1
We can provide tight examples for the analysis on the approximation ratios when and in Theorem 1.
-
(i)
First, suppose that . Then, consider the following string of length for some integer :
Then, an optimal solution is of length , and a solution of ALG1 is of length . Hence, .
-
(ii)
Next, suppose that . Then, the following string of length for some integer is a tight example:
Then, an optimal solution is of length , and a solution of ALG1 is the same as . Therefore, .
-
(iii)
For the case , the approximation-ratio analysis in Theorem 1 is almost tight, but there is a slight gap. Suppose that . Then, consider the following string of length for some integer :
Then, the following run subsequence of length is an optimal solution:
On the other hand, our algorithm ALG2 outputs the following subsequence of length :
Hence, .
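The optimal lengths quoted in the examples above can be validated on small instances by exhaustive search. The following is an illustrative exponential-time sketch based only on the LRS definition; the function names are ours:

```python
def is_run_subsequence(s):
    """True iff every symbol of s occurs in at most one maximal run of s."""
    seen, prev = set(), None
    for c in s:
        if c != prev:            # a new run starts here
            if c in seen:
                return False
            seen.add(c)
            prev = c
    return True

def lrs_length(s):
    """Length of a longest run subsequence of s (brute force over
    all 2^|s| subsequences; tiny inputs only)."""
    best = 0
    for mask in range(1 << len(s)):
        t = "".join(c for i, c in enumerate(s) if (mask >> i) & 1)
        if len(t) > best and is_run_subsequence(t):
            best = len(t)
    return best
```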
Appendix D Bad examples for Theorem 2
We can show that the analysis on the approximation ratios in Theorem 2 is tight.
-
(i)
First, suppose that is even. Then, consider the following string of length for some integer :
Then, the optimal solution is obtained by deleting characters as follows:
On the other hand, our algorithm ALG3 selects the leftmost longest -run for each and thus is as follows:
The total number of characters deleted from is . Hence, .
-
(ii)
Next, suppose that is odd. Then, consider the following string of length for some integer :
Very similarly, we obtain the following equality: . As a result, the analysis of the approximation ratios in the proof of Theorem 2 is tight.
