A Linear-time Algorithm for the Copy Number Transformation Problem

Problems of genome rearrangement are central in both evolution and cancer. Most evolutionary scenarios have been studied under the assumption that the genome contains a single copy of each gene. In contrast, tumor genomes undergo deletions and duplications, and thus the number of copies of genes varies. The number of copies of each gene along a chromosome is called its copy number profile. Understanding copy number profile changes can assist in predicting disease progression and treatment. To date, questions related to distances between copy number profiles gained little scientific attention. Here we focus on the following fundamental problem, introduced by Schwarz et al. (PLOS Comp. Biol., 2014): given two copy number profiles, u and v, compute the edit distance from u to v, where the edit operations are segmental deletions and amplifications. We establish the computational complexity of this problem, showing that it is solvable in linear time and constant space.


Introduction
The genome of a species evolves by undergoing small and large mutations over generations. Large mutations modify genome organization by rearrangement of genomic segments. Computational analysis of the process of genome rearrangement has been the subject of extensive research over the last two decades [5]. The majority of these studies to date were restricted to a single copy of each gene, and were concerned with the reordering of segments. Extant models that do not make this assumption often result in NP-hard problems [12, 14, 15].
While most work on genome rearrangements to date was done in the context of species evolution, there is today a great opportunity in the analysis of cancer genome evolution. Cancer is a dynamic process characterized by the rapid accumulation of somatic mutations, which produce complex tumor genomes. Species evolution happens over eons and changes are carried over from one generation to the next. In contrast, cancer evolution happens within a single individual over a few decades. In many tumor genomes, many of the changes are segmental deletions and amplifications [16]. As a result, the number of copies of each gene along a chromosome, known as its copy number profile, changes during cancer development, compared to the normal genome that has two copies (or alleles) for each gene. Understanding these changes can assist in predicting disease progression and the outcome of medical interventions.

However, computational questions related to distances between copy number profiles have received little scientific attention to date. Such questions are the topic of this paper.
Over the years, a variety of methods were used to determine the copy number profile of a cancer genome, at different resolutions. G-banding allows viewing of chromosome bands [11]. FISH measures the copy numbers of tens to hundreds of targeted genes [4]. Array comparative genomic hybridization gives a higher resolution of copy number estimation for a cell population [17]. Most recently, deep sequencing techniques yield copy number profiles by using read depth data [10]. While it would have been preferable to analyze the genome (karyotype) itself and not its copy number profile, detection of structural variations from sequencing data is still problematic [7, 1]. Today it is a routine procedure to obtain detailed copy number profiles of cancer genomes, but utilizing them to understand cancer evolution is still an open problem.
Given two copy number profiles, the healthy tissue's and the tumor's, evaluating the distance between them can help in understanding cancer progression. A naïve measure of distance is the Euclidean distance between the two profiles [13]. Chowdhury et al. defined an edit distance between copy number profiles obtained from FISH, where the edit operations are amplification or deletion of single genes, single chromosomes or the whole genome [3, 4, 2]. However, calculating these distances requires exponential time in the number of genes and therefore is limited to low resolution FISH data. The TuMult algorithm uses the number of breakpoints (loci where the copy numbers change) between two profiles as a simple distance measure [6].
Schwarz et al. introduced a model that admits amplification and deletion of contiguous segments [13]. The edit distance between two copy number profiles was defined as the minimum number of segmental deletions and duplications over all separations of the profiles into two alleles (a procedure known as phasing). Their algorithm MEDICC for computing the edit distance uses finite-state transducers (FSTs) [9] in order to model the profiles and efficiently compute the distance. However, the complexity of this method was not analyzed. Even without the phasing computation, the method needs to compose a 3-state transducer with itself $N$ times, resulting in a transducer with $3^N$ states [13, 8]. The running time of FST procedures relies on the number of states and transitions, and in some cases may be exponential [9, 8].
Copy Number Transformation. We investigate the following problem, which underlies the model of [13]: Given two copy number profiles (CNPs), u and v, compute the minimum number of segmental duplications and deletions needed to transform u into v. We call this problem the Copy Number Transformation Problem (CNTP). A CNP is represented by a vector of nonnegative integers (the number of copies of each gene). A segmental deletion (amplification) decreases (resp. increases) by 1 the values of a contiguous segment of the vector, where zero values are not affected. Formal definitions are given in Section 2.
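The effect of a single operation, as described informally above, can be sketched in a few lines of Python. This is a minimal illustration and not the paper's formal definition (which appears in Section 2): the function names `apply_operation` and `apply_cnt`, the 1-based inclusive segment bounds `lo` and `hi`, and the encoding of an operation as a triple `(lo, hi, w)` with `w` equal to -1 (deletion) or +1 (amplification) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Operation = Tuple[int, int, int]  # (lo, hi, w): hypothetical encoding, 1-based inclusive


def apply_operation(profile: Sequence[int], lo: int, hi: int, w: int) -> List[int]:
    """Apply one segmental deletion (w = -1) or amplification (w = +1).

    Every positive entry at positions lo..hi changes by w; zero entries
    are left untouched, as the problem statement requires.
    """
    if w not in (-1, 1):
        raise ValueError("w must be -1 (deletion) or +1 (amplification)")
    return [
        x + w if (lo <= k <= hi and x > 0) else x
        for k, x in enumerate(profile, start=1)
    ]


def apply_cnt(profile: Sequence[int], operations: Sequence[Operation]) -> List[int]:
    """Apply a sequence of operations from left to right."""
    current = list(profile)
    for lo, hi, w in operations:
        current = apply_operation(current, lo, hi, w)
    return current
```

For instance, deleting positions 2 to 5 of (3, 1, 2, 0, 1) gives (3, 0, 1, 0, 0), and a later amplification of positions 1 to 3 gives (4, 0, 2, 0, 0): once a value reaches zero, no amplification can restore it.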
Our Contribution. We show that CNTP is solvable in linear time and constant space. The algorithm relies on several properties of the problem that we establish in Section 3.1, which may also be relevant to the analysis of other problems involving CNPs. Exploiting these properties results in a pseudo-polynomial dynamic programming algorithm for CNTP, presented in Section 3.2. In Section 3.3, by establishing that a certain function in the dynamic programming recursion is piecewise linear, we improve its performance and obtain our main result. For lack of space, some proofs are omitted.

Preliminaries
In this section, we give definitions and notation that are used throughout the paper. Let $n \in \mathbb{N}$.
The CN distance (CND) from $S$ to $T$, $\mathrm{dist}(S, T)$, is the smallest size of a CNT $C$ that satisfies $C(S) = T$, where if no such CNT exists, $\mathrm{dist}(S, T) = \infty$. Note that dist is not symmetric. For example, for $S = (1)$ and $T = (0)$, $\mathrm{dist}(S, T) = 1$ but $\mathrm{dist}(T, S) = \infty$. Given two CNPs, $S = (s_1, s_2, \ldots, s_n)$ and $T = (t_1, t_2, \ldots, t_n)$, the Copy Number Transformation problem, CNTP, seeks $\mathrm{dist}(S, T)$ (if one exists). We say that a CNT $C$ is optimal if it realizes $\mathrm{dist}(S, T)$, i.e., $|C| = \mathrm{dist}(S, T)$ (there may exist several optimal CNTs). We let $N = \max\{\max_{i=1}^{n} s_i, \max_{i=1}^{n} t_i\}$ denote the maximum copy number in the input. Finally, for all $1 \le i \le n$, we define $u_i = s_i - t_i$.
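For very small inputs, the distance can be cross-checked by exhaustive search. The sketch below is not one of the paper's algorithms: it is a plain breadth-first search over profiles, and the name `brute_force_dist` is hypothetical. It caps intermediate copy numbers at $N$, which is an assumption justified by the later claim that some optimal transformation performs all deletions before all amplifications (so no value ever needs to exceed $N$). Its running time is exponential in $n$, so it is only a sanity check.

```python
import math
from collections import deque
from typing import Sequence


def brute_force_dist(S: Sequence[int], T: Sequence[int]) -> float:
    """Breadth-first search for the fewest operations turning S into T.

    Assumes non-empty profiles of equal length. Returns math.inf when
    T is unreachable from S.
    """
    start, target = tuple(S), tuple(T)
    n = len(start)
    cap = max(start + target)  # N, the maximum copy number in the input
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        profile, steps = queue.popleft()
        if profile == target:
            return steps
        for lo in range(n):
            for hi in range(lo, n):
                for w in (-1, 1):
                    # Zero entries are never affected by an operation.
                    nxt = tuple(
                        x + w if (lo <= k <= hi and x > 0) else x
                        for k, x in enumerate(profile)
                    )
                    if max(nxt) <= cap and nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, steps + 1))
    return math.inf
```

On the example from the text, `brute_force_dist((1,), (0,))` is 1 while `brute_force_dist((0,), (1,))` is infinite, which illustrates the asymmetry of dist.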

An Algorithm for CNTP
We first present an $O(nN^2)$-time, $O(N)$-space algorithm for CNTP that is based on dynamic programming (Sections 3.1 and 3.2). Recall that $N$ is the maximal integer in the input, so that the algorithm is pseudo-polynomial. Then, we modify this algorithm to run in linear time (Section 3.3). On a high level, the modification is based on the observation that […] the size of the table shrinks from $O(nN)$ to $O(n)$. The precise definitions of the table and the functions are given in Sections 3.2 and 3.3. Our proof of the correctness of the use of these functions requires a somewhat extensive case analysis that is presented separately in Section 3.4.

Key Propositions
We start by developing Alg1, an $O(nN^2)$-time dynamic programming algorithm for CNTP.
Let $(S = (s_1, s_2, \ldots, s_n), T = (t_1, t_2, \ldots, t_n))$ be the input. Observe that there exists a CNT $C$ such that $C(S) = T$ if and only if there does not exist an index $1 \le i \le n$ such that $s_i = 0$ and $t_i > 0$. Since the existence of such an index can be determined in linear time (where, if such an index is found, we return $\infty$), we will assume that $\mathrm{dist}(S, T) < \infty$. To simplify the presentation, we further assume w.l.o.g. that $t_1, t_n \neq 0$. Indeed, if $t_1 = 0$ or $t_n = 0$, we can solve the input $(S' = (1, s_1, s_2, \ldots, s_n, 1), T' = (1, t_1, t_2, \ldots, t_n, 1))$ instead, since it holds that $\mathrm{dist}(S, T) = \mathrm{dist}(S', T')$. Finally, we assume w.l.o.g. that for all $1 \le i \le n$, $s_i > 0$. Indeed, if there exists $1 \le i \le n$ such that $s_i = 0$, then also $t_i = 0$, and we can solve the input $(S' = (s_1, \ldots,$ […]
Alg1 exploits four key observations about the nature of the problem at hand, summarized as follows: (1) it is sufficient to examine CNTs where all of the deletions precede all of the amplifications; (2) it is sufficient to examine CNTs that do not contain both a deletion that affects $s_i$ but not $s_{i+1}$ and a deletion that affects $s_{i+1}$ but not $s_i$, and the same is true for amplifications; (3) when seeking an optimal solution, it is not necessary to store information indicating how many deletions/amplifications affect $s_i$ if $t_i = 0$; (4) the maximum number of deletions/amplifications that affect each $s_i$ can be bounded by $N$.
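The feasibility test and the simplifying assumptions above translate into a short preprocessing step. The following Python sketch uses hypothetical names (`is_transformable`, `preprocess`). The padding with 1s follows the text directly. The handling of positions with $s_i = 0$ is an assumption: the corresponding construction is truncated in the text, and the sketch simply drops such positions, which is safe because zero entries are never affected by any operation.

```python
from typing import List, Sequence, Tuple


def is_transformable(S: Sequence[int], T: Sequence[int]) -> bool:
    """True iff no index has s_i = 0 and t_i > 0 (so dist(S, T) is finite)."""
    return all(not (s == 0 and t > 0) for s, t in zip(S, T))


def preprocess(S: Sequence[int], T: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return profiles that satisfy s_i > 0 for all i and t_1, t_n != 0."""
    if not is_transformable(S, T):
        raise ValueError("dist(S, T) is infinite")
    # Assumption: positions with s_i = 0 (hence t_i = 0) are removed.
    kept = [(s, t) for s, t in zip(S, T) if s > 0]
    S2 = [s for s, _ in kept]
    T2 = [t for _, t in kept]
    # As in the text: pad both ends with 1 if t_1 = 0 or t_n = 0.
    if not T2 or T2[0] == 0 or T2[-1] == 0:
        S2 = [1] + S2 + [1]
        T2 = [1] + T2 + [1]
    return S2, T2
```

Both steps run in a single pass over the input, so they do not affect the overall linear running time.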
To formally state the first observation, we need the following definition.
Proposition 2. There exists an optimal ordered CNT.
Claim 3. Let $C = (c_1, c_2, \ldots, c_m)$ be an optimal CNT and let $i$ be an index such that $c_i = (\ell_i, h_i, 1)$ and $c_{i+1} = (\ell_{i+1}, h_{i+1}, -1)$. Then, there exists an optimal CNT $C' =$ […]
Proof. Consider the following exhaustive case analysis.
1. $h_i < \ell_{i+1}$ or $h_{i+1} < \ell_i$: In this case, the segments corresponding to $c_i$ and $c_{i+1}$ are disjoint. Thus, we can simply define $c'_i = c_{i+1}$ and $c'_{i+1} = c_i$. Then, Condition 2 is satisfied.
2. […] This argument holds because an application of $c_i$ which is followed by an application of $c_{i+1}$ does not change any entry $v_k$ such that $\ell_{i+1} \le k \le h_i$. We have that $C'(S) = T$. Since $|C'| = |C|$, $C'$ is an optimal CNT. Now, Condition 1 is satisfied.
The remaining cases […] are handled as in the second case: in each of them we obtain an optimal CNT that satisfies Condition 1.
As we show below, Claim 3 implies the existence of an ordered optimal CNT. In each of the cases in Claim 3, a local change is made in the CNT. Note however that just performing enough local operations does not guarantee reaching an ordered optimal CNT. For example, in a CNT with three consecutive CNOs, one may loop between changing $c_{i+1}$ into a deletion and then into an amplification.
Proof of Proposition 2. Let $\mathcal{C}$ be the set of optimal CNTs, and suppose, by way of contradiction, that it does not contain an ordered CNT. The three following phases sieve some solutions out of $\mathcal{C}$. Informally, we initially consider only optimal CNTs that minimize the sum of the sizes of the segments corresponding to their CNOs ($\mathcal{C}_1$); then, we further consider only the CNTs whose first amplification is as late as possible ($\mathcal{C}_2$); finally, we only take the CNTs whose first deletion after their first amplification is as early as possible ($\mathcal{C}_3$). Given […] let $z(C)$ be the smallest index $i \in \{y(C) + 1, \ldots, m\}$ such that $c_i$ is a deletion. By the definition of $y(C)$ and since $C$ is not ordered, we have that $z(C)$ is well-defined and $z(C) \ge y(C) + 2$. Let $\mathcal{C}_3$ be the set of every $C \in \mathcal{C}_2$ for which there does not exist $C' \in \mathcal{C}_2$ such that $z(C) > z(C')$.
Since $\mathcal{C} \neq \emptyset$, we have that $\mathcal{C}_3 \neq \emptyset$. Thus, we can let $C = (c_1, c_2, \ldots, c_m)$ be a solution in $\mathcal{C}_3$. Let $i$ be the smallest index such that $c_i$ is an amplification and $c_{i+1}$ is a deletion. Now, consider the conditions in Claim 3: if Condition 1 holds, we have a contradiction to the fact that $C \in \mathcal{C}_1$, while if Condition 2 holds, we have a contradiction either to the fact that $C \in \mathcal{C}_2$ (if $i = 1$ or $c_{i-1}$ is a deletion) or to the fact that $C \in \mathcal{C}_3$ (otherwise). Thus, we conclude that $\mathcal{C}$ contains an ordered CNT.
The other three propositions are stated without proof.
Equivalently, $C$ is elongated if no two amplifications (or deletions) "dovetail", i.e., one ending at $i$ and the other starting at $i+1$. It is clear that for any CNT $C$, the inequality $\ge$ holds above (since $\{(\ell, h, w) \in C : \ell \le i,\ i + 1 \le h\}$ is a subset of both $\{(\ell, h, w) \in C : \ell \le i \le h\}$ and $\{(\ell, h, w) \in C : \ell \le i + 1 \le h\}$). Our second key proposition implies the inequality $\le$ holds as well. An example of an elongated CNT is given in Fig. 1(B).

Proposition 5. Every ordered optimal CNT is elongated.
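The first two structural properties are easy to test on a concrete transformation. The sketch below is illustrative only: it encodes a CNT as a list of triples `(lo, hi, w)` (an assumed encoding with 1-based inclusive bounds), takes "ordered" to mean that all deletions precede all amplifications, as in key observation (1), and takes "elongated" in the dovetail sense stated above. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Operation = Tuple[int, int, int]  # (lo, hi, w) with w in {-1, +1}


def is_ordered(C: Sequence[Operation]) -> bool:
    """True iff no deletion appears after an amplification."""
    seen_amplification = False
    for _, _, w in C:
        if w == 1:
            seen_amplification = True
        elif seen_amplification:
            return False
    return True


def is_elongated(C: Sequence[Operation]) -> bool:
    """True iff no two operations of the same type dovetail.

    Two operations dovetail when one ends at position i and the other
    starts at position i + 1.
    """
    for w in (-1, 1):
        ends = {hi for _, hi, ww in C if ww == w}
        starts = {lo for lo, _, ww in C if ww == w}
        if any(hi + 1 in starts for hi in ends):
            return False
    return True
```

For example, the two deletions (1, 2, -1) and (3, 4, -1) dovetail at position 2, so that CNT is not elongated, whereas a deletion (1, 2, -1) followed by an amplification (3, 4, +1) is both ordered and elongated.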
To formalize our third key proposition, we need the following definition. In words, for a block of consecutive zeros in the target profile, all deletions that span the block also include its flanking positions. An example of a CNT that skips zeros is given in Fig. 2(A).

Proposition 7. There exists an optimal ordered CNT that skips zeros.
For a position with positive target value, knowing the number of deletions that affected it uniquely determines the number of amplifications that affected it. This simple fact will help the efficiency of our procedures. Formally: […]
Finally, we formalize our fourth key proposition.
Definition 9. A CNT $C$ is bounded if for all $1 \le i \le n$ and every $w \in \{-1, 1\}$, we have $\mathrm{op}(C, w, i) \le N$.
Proposition 10. Every optimal ordered CNT that skips zeros is bounded.

An $O(nN^2)$-Time Algorithm for CNTP
On a high level, the dynamic programming algorithm works as follows. It considers increasing prefixes $S_i = (s_1, s_2, \ldots, s_i)$ and $T_i = (t_1, t_2, \ldots, t_i)$ of the input. It computes a table $M$ having $n(N + 1)$ entries where $M[i, d]$ is the best value of a solution on $(S_i, T_i)$ that uses exactly $d$ deletions that affect the $i$-th position. The parameter $d$ ranges between zero and $N$, and the values for each $i$ are computed based on values $M[j, \cdot]$ for a single specific $j < i$. In particular, at each point of time, only two rows of the table $M$ are stored. By Propositions 2-10, the algorithm considers only ordered, elongated, zero-skipping and bounded solutions. We call such solutions good.
More formally, given $1 \le i \le n$ and $0 \le d \le N$, we say that a CNT $C$ is an $(i, d)$-CNT if $C(S_i) = T_i$, $d = \mathrm{op}(C, -1, i)$, and $C$ is good. We say that an $(i, d)$-CNT $C$ is optimal if there is no $(i, d)$-CNT $C'$ such that $|C'| < |C|$. Our goal will be to ensure that each entry $M[i, d]$ stores the size of an optimal $(i, d)$-CNT, where if no such CNT exists, it stores $\infty$. We do not compute entries $M[i, d]$ such that $t_i = 0$; indeed, by relying on Proposition 7, we are able to skip such entries (though our recursive formula does consider CNs $s_i$ referring to indices $i$ such that $t_i = 0$). In this context, observe that any ordered CNT $C$ such that $C(S) = T$ consists of at least $u_i$ deletions that affect $s_i$, and if $t_i > 0$, it cannot consist of more than $s_i - 1$ such deletions (since after decreasing $s_i$ to 0, it remains 0). Moreover, if $u_i \le d < s_i$, there exists an $(i, d)$-CNT, obtained by independently adjusting the value of each position $< i$ to its target value and the value at position $i$ with $d$ deletions, using operations of span 1.
Observation 11. Given $1 \le i \le n$ such that $t_i > 0$ and $0 \le d \le N$, there exists an $(i, d)$-CNT if and only if $u_i \le d < s_i$.
In case $s_i < t_i$, Observation 11 states that there exists an $(i, d)$-CNT if and only if $d < s_i$. In light of this observation, we will use the following assumption.

Assumption 12. In the computation below, we assume that max{u
By Observation 8, if a solution involved $d$ deletions at position $i$ with $t_i > 0$, then it involved $-u_i + d$ amplifications at that position. For convenience denote that number by $a(i, d) = -u_i + d$ for all $1 \le i \le n$ satisfying $t_i > 0$ and $\max\{u_i, 0\} \le d < s_i$, and $a(i, d) = \infty$ otherwise.
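The quantity $a(i, d)$ is a one-line computation. The sketch below (with the hypothetical name `num_amplifications`, taking the two copy numbers at a position rather than the index $i$) mirrors the definition above, including the value $\infty$ outside the feasible range of $d$.

```python
import math


def num_amplifications(s_i: int, t_i: int, d: int) -> float:
    """Return a(i, d) = d - u_i when t_i > 0 and max(u_i, 0) <= d < s_i.

    Outside that range the value is infinite, as in the definition.
    """
    u_i = s_i - t_i
    if t_i > 0 and max(u_i, 0) <= d < s_i:
        return d - u_i
    return math.inf
```

The returned value satisfies $s_i - d + a(i, d) = t_i$: for $s_i = 3$, $t_i = 2$ and $d = 2$ deletions, exactly one amplification is needed to get back to 2.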
For input profiles $S, T$, the algorithm precomputes two vectors, defined as follows. Given an index $1 < i \le n$ such that $t_i > 0$, let $\mathrm{prev}(i)$ denote the largest index $j < i$ such that $t_j > 0$. Moreover, if $\mathrm{prev}(i) = i - 1$, let $Q_i = 0$, and otherwise let $Q_i = \max_{\mathrm{prev}(i) < j < i} \{s_j\}$. A zero-skipping solution will skip the positions between $\mathrm{prev}(i)$ and $i$ in the computation, but will make sure to perform at least $Q_i$ deletions spanning the skipped positions.
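Both quantities can be obtained in one left-to-right pass. The following sketch (the name `prev_and_q` and the dictionary representation keyed by 1-based index are choices made for the example) keeps the running maximum of the source values seen since the last position with a positive target value.

```python
from typing import Dict, Sequence, Tuple


def prev_and_q(S: Sequence[int], T: Sequence[int]) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Compute prev(i) and Q_i for every 1-based index i > 1 with t_i > 0
    that has an earlier position with a positive target value."""
    prev: Dict[int, int] = {}
    Q: Dict[int, int] = {}
    last_positive = None  # largest index seen so far with t_j > 0
    run_max = 0           # max s_j over skipped positions since last_positive
    for i, (s, t) in enumerate(zip(S, T), start=1):
        if t > 0:
            if last_positive is not None:
                prev[i] = last_positive
                Q[i] = run_max  # stays 0 when prev(i) = i - 1
            last_positive = i
            run_max = 0
        else:
            run_max = max(run_max, s)
    return prev, Q
```

For S = (3, 1, 2, 3, 2, 1, 4) and T = (2, 0, 0, 0, 0, 0, 2), the only entry is prev(7) = 1 with Q_7 = 3, the largest source value among the skipped positions 2 to 6.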
Initialization. The initialization step sets all entries $M[1, d]$ […]
Recursion. If $t_i = 0$, position $i$ is skipped. Suppose that $i > 1$, $t_i > 0$ and $\max\{u_i, 0\} \le d < s_i$. The order of the computation is determined by the first argument. The computation is summarized in the following formula.
$$M[i, d] = \min_{0 \le d' \le N} \Big\{ M[\mathrm{prev}(i), d'] + \max\{d - d', 0\} + \max\{a(i, d) - a(\mathrm{prev}(i), d'), 0\} + \max\{Q_i - \max\{d, d'\}, 0\} \Big\}$$
Roughly speaking, to compute $M[i, d]$ we look back to the previous nonzero position in $T$, and for each value $d'$ in that position add the difference from $d'$ if needed, the number of amplifications to be added if needed, and the number of additional deletions if such are needed to take care of the skipped zero positions. After filling the table $M$, Alg1 returns $\min_{0 \le d \le N} M[n, d]$. An example of a filled table is given in Fig. 2(B).
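The dynamic program can be sketched compactly in Python. This is an illustration under stated assumptions rather than the authors' implementation: the name `alg1_distance` is hypothetical; the input is assumed to already satisfy $s_i > 0$ for all $i$ and $t_1, t_n > 0$; the recursion uses the four terms that also appear in the correctness proof (the stored value for $d'$, plus $\max\{d - d', 0\}$, $\max\{a(i, d) - a(\mathrm{prev}(i), d'), 0\}$ and $\max\{Q_i - \max\{d, d'\}, 0\}$); and the base case $M[1, d] = d + a(1, d)$ is an assumption, since the initialization formula is not legible in the text.

```python
from typing import Dict, Sequence


def alg1_distance(S: Sequence[int], T: Sequence[int]) -> int:
    """Pseudo-polynomial DP for dist(S, T), keeping one row of M at a time."""
    n = len(S)
    if n == 0 or any(s <= 0 for s in S) or T[0] <= 0 or T[-1] <= 0:
        raise ValueError("expects s_i > 0 for all i, and t_1, t_n > 0")
    u = [s - t for s, t in zip(S, T)]

    def feasible(i: int) -> range:
        # Values of d with a finite entry: max(u_i, 0) <= d < s_i.
        return range(max(u[i], 0), S[i])

    def a(i: int, d: int) -> int:
        # Amplifications implied by d deletions at a position with t_i > 0.
        return d - u[i]

    # Assumed base case: d deletions plus a(1, d) amplifications at position 1.
    row: Dict[int, int] = {d: d + a(0, d) for d in feasible(0)}
    j = 0  # 0-based index of the previous position with a positive target
    Q = 0  # max source value over the zero-target positions skipped since j
    for i in range(1, n):
        if T[i] == 0:
            Q = max(Q, S[i])
            continue
        new_row: Dict[int, int] = {}
        for d in feasible(i):
            new_row[d] = min(
                value
                + max(d - dp, 0)
                + max(a(i, d) - a(j, dp), 0)
                + max(Q - max(d, dp), 0)
                for dp, value in row.items()
            )
        row, j, Q = new_row, i, 0
    return min(row.values())
```

Only the row for the previous nonzero target position is kept, which matches the $O(N)$ space bound, and each of the $O(nN)$ entries is a minimum over $O(N)$ candidates. On S = (3, 1, 2, 3, 2, 1, 4) and T = (2, 0, 0, 0, 0, 0, 2) the sketch returns 3, realized by the deletions (1, 7), (2, 7) and (2, 6).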
Correctness. First, we claim that the entries of the table $M$ are computed properly.
Lemma 13. For all $1 \le i \le n$ such that $t_i > 0$ and for all $0 \le d \le N$, $M[i, d]$ stores the size of an optimal $(i, d)$-CNT, where if no such CNT exists, it stores $\infty$.
Proof. We prove the lemma by induction on the order of the computation.
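The correctness of the initialization step follows from the definition of an $(i, d)$-CNT and Observation 8. Now, fix $1 < i \le n$ such that $t_i > 0$, and fix $\max\{u_i, 0\} \le d < s_i$. Let $m$ be the size of an optimal $(i, d)$-CNT. Suppose that the lemma is correct for all $i' < i$ and $0 \le d' \le N$. We need to show that $M[i, d] = m$.
First Direction. First, we show that $M[i, d] \le m$. Let $C = (c_1, c_2, \ldots, c_m)$ be an optimal $(i, d)$-CNT, and for all $1 \le j \le m$, denote $c_j = (\ell_j, h_j, w_j)$. For all $1 \le j \le m$, let $c'_j = (\ell_j, \min\{h_j, \mathrm{prev}(i)\}, w_j)$. Now, define $C' = (c'_1, c'_2, \ldots, c'_m)$. We further let $\widehat{C} = (\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_q)$ denote the CNT obtained from $C'$ by removing all of the CNOs $c = (\ell, h, w)$ such that $h < \ell$. Denote $d' = \mathrm{op}(\widehat{C}, -1, \mathrm{prev}(i))$. Observe that $d' \le N$ and that $\widehat{C}$ is a $(\mathrm{prev}(i), d')$-CNT (because $C$ is an $(i, d)$-CNT). Therefore, by the induction hypothesis, $M[\mathrm{prev}(i), d'] \le q$.
First, suppose that $\mathrm{prev}(i) = i - 1$. Then $Q_i = 0$ and since $C$ is ordered and elongated, by Observation 8 we have that $m - q = \max\{d - d', 0\} + \max\{a(i, d) - a(\mathrm{prev}(i), d'), 0\}$. Thus, by the recursive formula, in this case we get that $M[i, d] \le m$.
Now, suppose that $\mathrm{prev}(i) < i - 1$. Then, since $C$ is ordered and skips zeros, and by the definition of $Q_i$, the two following conditions hold. […]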
Second Direction. Next, we show that $M[i, d] \ge m$. To this end, it is sufficient to show that there exists an $(i, d)$-CNT $C$ such that $M[i, d] \ge |C|$. Let $d'$ be an argument at which the value computed by using the recursive formula is minimized. By the inductive hypothesis, there exists a $(\mathrm{prev}(i), d')$-CNT $\widehat{C} = (\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_q)$ such that $M[\mathrm{prev}(i), d'] \ge q$. For all $1 \le j \le q$, denote $\hat{c}_j = (\hat{\ell}_j, \hat{h}_j, \hat{w}_j)$. Now, if $\mathrm{prev}(i) = i - 1$, define $C' = \widehat{C}$, and else define $C'$ as follows. For all $1 \le j \le q$, let $\tilde{c}_j = (\hat{\ell}_j, h, \hat{w}_j)$, where $h = \hat{h}_j$ if $\hat{h}_j < \mathrm{prev}(i)$ and $h = i - 1$ otherwise. Let $\widetilde{C} = (\tilde{c}_1, \tilde{c}_2, \ldots, \tilde{c}_q)$. Moreover, as long as there exists $\mathrm{prev}(i) < j < i$ such that $\mathrm{op}(\widetilde{C}, -1, j) < s_j$, choose the smallest such $j$, and append to the beginning of $\widetilde{C}$ the CNO $(j, i - 1, -1)$. Let $C'$ be the CNT obtained at the end of this process. Denote $C' = (c'_1, c'_2, \ldots, c'_r)$, and for all $1 \le j \le r$, denote $c'_j = (\ell'_j, h'_j, w'_j)$.
Now, let $p$ and $q'$ be the number of deletions and amplifications in $C'$ whose segments include $i - 1$, respectively. If $p < d$, append to the beginning of $C'$ $d - p$ "dummy" deletions of the form $(i, i - 1, -1)$, and if $q' < a(i, d)$, append to the end of $C'$ $a(i, d) - q'$ "dummy" amplifications of the form $(i, i - 1, 1)$. Let $C'' = (c''_1, c''_2, \ldots, c''_k)$ be the resulting CNT, and for all $1 \le j \le k$, denote $c''_j = (\ell''_j, h''_j, w''_j)$. Finally, we define $C$ as follows. Let $D$ (resp. $A$) be a set of exactly $d$ deletions (resp. $a(i, d)$ amplifications) in $C''$ whose second argument is $i - 1$. We let $C$ be defined as $C''$, except that each CNO $(\ell, h, w) \in D \cup A$ is replaced by the CNO $(\ell, i, w)$. It is straightforward to verify that $C$ is an $(i, d)$-CNT such that $|C| = q + \max\{d - d', 0\} + \max\{a(i, d) - a(\mathrm{prev}(i), d'), 0\} + \max\{Q_i - \max\{d, d'\}, 0\}$, which concludes the correctness of the second direction.
Now, we turn to consider the correctness and running time of Alg1.
Theorem 14. Alg1 solves CNTP in time $O(nN^2)$ and space $O(N)$.

Proof. The table $M$ contains $O(nN)$ entries, and each entry can be computed in time $O(N)$. Therefore, the time complexity of Alg1 is bounded by $O(nN^2)$. Moreover, for the computation of $M[i, \cdot]$, it is only necessary to keep $O(N)$ entries for position $\mathrm{prev}(i)$, and therefore the space complexity is bounded by $O(N)$. Since every $(n, d)$-CNT $C$ satisfies $C(S) = T$, and since for every good optimal CNT $C$, there exists $0 \le d \le N$ such that $C$ is an $(n, d)$-CNT, we have that Lemma 13 implies that Alg1 returns the smallest size of a good optimal CNT (if such a CNT exists). By Propositions 2-10, such a CNT indeed exists, and therefore Alg1 solves CNTP.

A Linear-Time Algorithm for CNTP
In this section we show how to modify Alg1 in order to obtain an algorithm, called Alg2, that solves CNTP in linear time. The central lemma that leads to this improvement states that each column in the table $M$ can be described by a piecewise linear function of at most three segments.

[Figure 2(A), panel title: Skipping zeros solution; profiles 3,1,2,3,2,1,4 and 2,0,0,0,0,0,2]
To present this lemma, we need the following notation. For all $i \in \{1, 2, \ldots, n\}$ such that $t_i > 0$, let $d^{\min}_i = \max\{u_i, 0\}$ and $d^{\max}_i = \max\{s_i - 1, 0\}$ be the least and largest values of $d$ for which $M[i, d]$ is finite. Now, the function $f_i$ is defined on $\{d^{\min}_i, \ldots, d^{\max}_i\}$ by $f_i(d) = M[i, d]$. Observe that the function $f_i$ is discrete. We stress that in this section, we do not explicitly compute the entries of $M$: the definition of the functions concerns the values that would have been stored in these entries if they were computed by using Alg1.

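Lemma 15. For each $i \in \{1, 2, \ldots, n\}$ such that $t_i > 0$, there exist $\mathrm{base}_i, a_i, b_i \in \mathbb{N} \cup \{0\}$ such that for all $d \in \{d^{\min}_i, \ldots, d^{\max}_i\}$:
$$f_i(d) = \begin{cases} \mathrm{base}_i & \text{if } d \le a_i, \\ \mathrm{base}_i + (d - a_i) & \text{if } a_i < d \le b_i, \\ \mathrm{base}_i + (b_i - a_i) + 2(d - b_i) & \text{if } b_i < d. \end{cases}$$
Moreover, $\mathrm{base}_1$, $a_1$ and $b_1$ can be computed in constant time, and for each $i \in \{2, 3, \ldots, n\}$ such that $t_i > 0$, given $\mathrm{base}_{\mathrm{prev}(i)}$, $a_{\mathrm{prev}(i)}$ and $b_{\mathrm{prev}(i)}$, the values $\mathrm{base}_i$, $a_i$ and $b_i$ can be computed in constant time.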
An example is given in Fig. 2(C). The proof is based on Lemma 13 and an exhaustive case analysis, which, for the sake of clarity of presentation, is handled separately in Section 3.4.
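A function of this kind is fully described by three numbers. The sketch below evaluates such a function under an assumed shape: constant up to $a$, slope 1 between $a$ and $b$, and slope 2 beyond $b$. That shape is an inference, suggested by the base case $f_0(d) = 2d$ with $a_0 = b_0 = \mathrm{base}_0 = 0$ used in Section 3.4 and by the term $2d - a_i - b_i$ that survives in the PiecewiseAlg pseudocode; the name `piecewise_value` is hypothetical.

```python
def piecewise_value(d: int, base: int, a: int, b: int) -> int:
    """Evaluate a three-segment piecewise linear function at d.

    Assumed shape (requires a <= b): slope 0 for d <= a, slope 1 for
    a < d <= b, and slope 2 for d > b.
    """
    if d <= a:
        return base
    if d <= b:
        return base + (d - a)
    return base + (b - a) + 2 * (d - b)
```

With `base = a = b = 0` the function reduces to $2d$, and for $d > b$ the value equals $\mathrm{base} + 2d - a - b$. Storing the triple instead of a full row of $M$ is what allows constant space per position.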
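Our algorithm, Alg2, performs the following computation:
1. […]
2. For $i = 1, 2, \ldots, n$:
   a. If $t_i = 0$, skip the rest of the current iteration.
   b. Compute $\mathrm{base}_i$, $a_i$ and $b_i$ using $\mathrm{base}_{\mathrm{prev}(i)}$, $a_{\mathrm{prev}(i)}$ and $b_{\mathrm{prev}(i)}$.
3. Return $\mathrm{base}_n$.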
We are now ready to prove our main result.

Case Analysis
The purpose of this section is to prove the correctness of Lemma 15. That is, we want to show that $f_i(d)$ is a piecewise linear function described by three parameters, and these parameters can be calculated in constant time. To this end, let $j = \mathrm{prev}(i)$ and $R_i = u_j - u_i$. Accordingly, the term $a(i, d) - a(j, d')$ can be written as $R_i + d - d'$. Moreover, let $d_{\mathrm{opt}}$ be the argument $d'$ that minimizes the recursive formula we use to compute $M[i, d]$ under certain conditions that will be clear from context.
We prove Lemma 15 by induction on $i$. To simplify the proof, let $a_0 = b_0 = \mathrm{base}_0 = 0$ and $f_0(d) = 2d$ for every $0 \le d \le N$. This definition is equivalent to adding the new entries $s_0 = t_0 = N + 1$ (which do not affect the distance from $S$ to $T$), and thus, it can serve as the basis of our induction. Next, suppose that Lemma 15 holds for $j = \mathrm{prev}(i) < i$, and we will prove that it holds for $i$.
The proof is based on an exhaustive case analysis that examines the position of $Q_i$ relative to $d^{\min}_j$, $a_j$, $b_j$ and $d^{\max}_j$, as well as the sign of $R_i$. For example, one of the cases is defined by the conditions […]. In each case, we analyze the behavior of $M[i, d]$ as we increase $d$. More precisely, we examine several intervals that together contain all of the values that can be assigned to $d$. For example, in the above mentioned case, we consider the intervals $d \le a$ […]. For each interval, we let $d_{\mathrm{opt}}$ be an argument $d'$ that minimizes $M[i, d]$ under the conditions of the examined case. These conditions along with $d_{\mathrm{opt}}$ allow us to remove the minimization and maximization functions from the formula defining $M[i, d]$, and thus we obtain $f_i(d)$.
In the latter example, if $d \le a_j - R_i$ we can choose $d_{\mathrm{opt}} = a_j$ and get $f_i(d) = M[i, d] = M[j, a_j] + \max\{d - a_j, 0\} + \max\{R_i + d - a_j, 0\} + \max\{Q_i - \max\{d, a_j\}, 0\} = \mathrm{base}_j$. As a corollary of the analysis, we get that indeed $f_i(d)$ is piecewise linear, and that $a_i$, $b_i$ and $\mathrm{base}_i$ can be calculated in constant time given $a_j$, $b_j$, $\mathrm{base}_j$, $R_i$ and $Q_i$.
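The three maximization terms in this example can be checked one at a time. The derivation below is a sketch under two assumptions that cannot be read off the text, because the conditions defining the case are missing: $R_i \ge 0$ and $Q_i \le a_j$. It also uses $M[j, a_j] = \mathrm{base}_j$, as in the equation above.

```latex
\begin{align*}
d \le a_j - R_i,\ R_i \ge 0 \;&\Longrightarrow\; d - a_j \le -R_i \le 0
  \;\Longrightarrow\; \max\{d - a_j, 0\} = 0, \\
d \le a_j - R_i \;&\Longrightarrow\; R_i + d - a_j \le 0
  \;\Longrightarrow\; \max\{R_i + d - a_j, 0\} = 0, \\
d \le a_j,\ Q_i \le a_j \;&\Longrightarrow\; Q_i - \max\{d, a_j\} = Q_i - a_j \le 0
  \;\Longrightarrow\; \max\{Q_i - \max\{d, a_j\}, 0\} = 0, \\
f_i(d) &= M[j, a_j] + 0 + 0 + 0 = \mathrm{base}_j .
\end{align*}
```

Under these assumptions no other choice of $d'$ can do better, since every added term is nonnegative and, if $f_j$ is nondecreasing with minimum value $\mathrm{base}_j$ (as the three-segment form implies), $M[j, d'] \ge \mathrm{base}_j$ for every feasible $d'$.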
Due to lack of space, the details of the case analysis are omitted. The analysis shows that in all cases, $f_i(d)$ is indeed a piecewise linear function with at most three linear segments defined by some $a_i$, $b_i$, $\mathrm{base}_i$. After applying straightforward operations that reorganize the analysis (to present the results in a compact manner), we obtain the algorithm PiecewiseAlg, whose pseudocode is given below. This algorithm performs step 2b of Alg2, i.e., it calculates $a_i$, $b_i$, $\mathrm{base}_i$ given $a_j$, $b_j$, $\mathrm{base}_j$ and $Q_i$ in constant time and space.
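PiecewiseAlg first calculates $R_i$, $d^{\min}_i$ and $d^{\max}_i$ based on $s_i$ and $t_i$. Next, according to the sign of $R_i$ and the relative position of $Q_i$ in comparison to the previous $a_j$ and $b_j$, the algorithm calculates the structure of $f_i(d)$ defined by $a_i$ and $b_i$. Finally, since $f_i(d)$ is defined only for the range $d^{\min}_i \le d \le d^{\max}_i$, we calculate $\mathrm{base}_i = f_i(d^{\min}_i)$. Similarly, we limit the values of $a_i$ and $b_i$ to that range.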

Conclusion
In this paper, we initiated the study of distances between CNPs from a theoretical point of view. We focused on one fundamental problem, CNTP, and showed that it is solvable in linear time and constant space. To this end, we proved several properties of CNTP that may be useful in solving other problems involving CNPs. Our algorithm can be modified to return a transformation that realizes $\mathrm{dist}(S, T)$ in linear time and linear space by backtracking the dynamic programming vector. We have implemented the algorithm as well as an ILP formulation of CNTP (the implementations are available upon request), and we intend to assess the performance of these approaches. Many computational and combinatorial aspects in the analysis of distances between CNPs require further research. Indeed, this paper can be viewed as a first step towards understanding them. We intend to investigate variants of CNTP where one seeks a CNP that minimizes the overall distance from it to two (or more) CNPs that are given as input. Such variants are relevant to phylogenetic reconstruction in cancer (see [13]). Additional directions for further research involve the introduction of edit operations other than basic segmental deletions and amplifications, dealing with phasing of the profiles, as well as the handling of noise.

Figure 1. Copy number transformations. (A) The CNT $C = (c_1, c_2, c_3)$ transforms $S$ into $T$. The size of $C$ is 3. Red and green blocks indicate deletions and amplifications, respectively. (B) Elongated and non-elongated CNTs. Bold lines indicate the range of deletions.


Figure 2. (A) A skipping-zeros solution. Bold lines indicate deletions. (B) The DP matrix $M[i, d]$ for the two CNPs in (A). (C) An example of the piecewise linear function $f_i(d)$ described in Lemma 15. The number of segments is three but can be smaller, depending on the values involved.

Theorem 16. Alg2 solves CNTP in time $O(n)$ and space $O(1)$.
Proof. According to Lemma 15, the function $f_i(d) = M[i, d]$ is a piecewise linear function described by three values. The correctness of Lemma 15 shows that step 2b calculates these values in constant time and space given the previous values. The time and space complexity of Alg2 follow directly. Now, by the correctness of Alg1, it is sufficient to prove that Alg2 returns the value $\min_{0 \le d \le N} M[n, d]$. By Observation 11, $\min_{0 \le d \le N} M[n, d] = \min_{d^{\min}_n \le d \le d^{\max}_n} M[n, d]$. By Lemma 15, we further have that $\min_{d^{\min}_n \le d \le d^{\max}_n} M[n, d] = \mathrm{base}_n$. Thus, by the inductive proof of Lemma 15, we conclude that Alg2 solves CNTP.


[…] $- a_i - b_i$ if $b_i < d^{\min}_i \le d^{\max}_i$
$a_i \leftarrow \max\{d^{\min}_i, \min\{a_i, d^{\max}_i\}\}$; $b_i \leftarrow \max\{a_i, \min\{b_i, d^{\max}_i\}\}$.


Algorithm 1. PiecewiseAlg
Input: $s_i$, $t_i$, $Q_i$, $a_j$, $b_j$, $\mathrm{base}_j$
Output: $a_i$, $b_i$, $\mathrm{base}_i$
$R_i \leftarrow u_j - u_i$
[…]