Recently, due to the genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNP for short) are getting more and more popular. A CNP is a vector where each component is a non-negative integer representing the number of copies of a specific segment of interest. The motivation is that in the late stage of certain types of cancer, the genomes are progressing rapidly by segmental duplications and deletions, and hence obtaining the exact sequences becomes difficult. Instead, the number of copies of important segments can be predicted from expression analysis and carries important biological information. Therefore, significant research has recently been devoted to the analysis of genomic data represented as CNP’s.

In this paper, we present two streams of results. The first is the negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. The Minimum Copy Number Generation (MCNG) is defined as follows: given a string S in which each character represents a gene or segment, and a CNP C, compute a string T from S, with the minimum number of segmental duplications and deletions, such that cnp(T)=C. It was shown by Qingge et al. that the problem is NP-hard if the duplications are tandem and they left the open question of whether the problem remains NP-hard if arbitrary duplications and/or deletions are used. We answer this question affirmatively in this paper; in fact, we prove that it is NP-hard to even obtain a constant factor approximation. This is achieved through a general-purpose lemma on set-cover reductions that require an exact cover in one direction, but not the other, which might be of independent interest. We also prove that the corresponding parameterized version is W[1]-hard, answering another open question by Qingge et al.

The other result is positive and is based on a new (and more general) problem regarding CNP’s. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNP’s C₁ and C₂, compute two strings S₁ and S₂ with cnp(S₁)=C₁ and cnp(S₂)=C₂ such that the distance between S₁ and S₂, d(S₁,S₂), is minimized. Here, d(S₁,S₂) is a very general term, which means it could be any genome rearrangement distance (like reversal, transposition, and tandem duplication, etc). We make the first step by showing that if d(S₁,S₂) is measured by the breakpoint distance then the problem is polynomially solvable. We expect that this will trigger some related research along the line in the near future.