Abstract 1 Introduction 2 Breakpoint medians 3 An algorithm to find medians 4 Experimental results 5 Conclusion References Appendix A Proof of Theorem 1

Identifying Breakpoint Median Genomes:
A Branching Algorithm Approach

Poly H. da Silva ORCID Columbia University, New York, NY, USA Arash Jamshidpey111corresponding author ORCID University of California at Berkeley, Berkeley, CA, USA David Sankoff ORCID University of Ottawa, Ottawa, Ontario, Canada
Abstract

Genome comparison often involves quantifying dissimilarities between genomes with identical gene sets, commonly using breakpoints – points where adjacent genes in one genome are not adjacent in another. The concept of a median genome, used for comparison of multiple genomes, aims to find a genome that minimizes the total distance to all genomes in a given set. While median genomes are useful for extracting common genomic information and estimating ancestral traits, the existence of multiple divergent medians raises concerns about their accuracy in reflecting the true ancestor. The median problem is known to be NP-hard, particularly for unichromosomal genomes, and solving it becomes increasingly challenging under different genome distance models. In this work, we introduce a novel branching algorithm to efficiently find all breakpoint medians of k linear unichromosomal genomes, represented as unsigned permutations. This algorithm constructs a rooted labeled tree, where the sequence of labels along each complete ray defines a genome, providing a structured and efficient way to explore the space of candidate medians by narrowing the search to a well-defined and significantly smaller subset of the permutation space. We validate our approach with experiments on randomly generated sets of three permutations. The results show that our method successfully finds the exact medians and also identifies many near-optimal approximations. Our experiments further show that most medians lie relatively close to the input permutations, in agreement with prior theoretical results.

Keywords and phrases:
Breakpoint distance, median genomes, phylogeny reconstruction, random permutations
Copyright and License:
[Uncaptioned image] © Poly H. da Silva, Arash Jamshidpey, and David Sankoff; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Mathematics of computing Combinatorial algorithms
; Mathematics of computing Permutations and combinations ; Applied computing Computational genomics
Editors:
Broňa Brejová and Rob Patro

1 Introduction

Comparing genomes with the same syntenic blocks involves computing their dissimilarities. This dissimilarity is often quantified by identifying breakpoints, points at which genes are adjacent in one genome but not in the other. Introduced formally by Sankoff and Blanchette in 1997 [13], the total number of breakpoints serves as a metric for dissimilarity. To compare multiple genomes, we can use the concept of median. Given a set of three or more genomes X={g1,,gk} and a distance d, a median for the set X is a genome that minimizes the total distance function dT():=i=1kd(gi,). The concept of the median was first employed by Sankoff et al. [14] in 1996 within the context of evolutionary gene order models. Motivated by the search for ancestral genomic information and its applications to small phylogeny problems, the median problem has since attracted significant attention [2, 3, 15, 7, 6, 16, 17].

However, the complexity of the median problem varies with different genome distances, often proving to be NP-hard [3, 15, 7] particularly for unichromosomal genomes. For instance, the breakpoint median problem was shown to be NP-hard by Bryant [2] for linear unichromosomal genomes. Moreover, identifying a median in which adjacency sets are contained within the union of the adjacency sets of the input genomes has also been proven to be NP-hard [2]. Following the reduction of the median problem to the Traveling Salesman Problem (TSP) by Sankoff and Blanchette [13], in 2012, Boyd and Haghighi [1], using Concorde (a fast software to find TSP solutions), presented a fast algorithm to find breakpoint medians of samples of large genomes.

While median genomes aim to extract common information among given genomes and estimate ancestral characteristics, the existence of multiple medians with considerable divergence [8] raises questions about their proximity to the true ancestor or their usability in providing ancestral insights. Additionally, determining which, if any, of these medians accurately reflects ancestral traits poses a significant challenge. In fact, Zheng and Sankoff [18], Jamshidpey and Sankoff [10] and Miardan et al. [12] showed that median may fail to approximate the ancestor for the long-time evolution of genomes, while for genomes involved in evolution for a shorter period of time medians may approximate the true ancestor.

To address the challenge of identifying relevant medians, we propose a novel branching algorithm for efficiently finding all breakpoint medians of k linear unichromosomal genomes represented by unsigned permutations of length n. This exponential algorithm constructs a rooted labeled tree, whose sequence of labels for each ray (a shortest path connecting the root to a leaf) with length n1 determines a unichromosomal genome (represented as a permutation). The set of all such unichromosomal genomes contains all medians of the k input genomes. We show that this tree construction reduces the median search space significantly compared to the full space of n! permutations (see Table 3).

This paper is organized as follows. We begin by laying the foundation. In Section 2 we introduce the basic concepts of a genomic space with breakpoint distance and review some essential prior results in the literature. In Section 3, we delve into the methodology behind our branching algorithms designed to identify all medians within a given set of genomes. Subsequently, in Section 4, we provide empirical validation of our approach through a series of experiments using sets of three random permutations. A key contribution of this experimental section is that our method is able to compute the median value exactly, even in cases where it remained unknown in previous work. We examine how the median value behaves as the permutation length increases and analyze the distribution of approximate medians in the reduced search space generated by our algorithm. The results indicate that, although not all permutations in this space are true medians, a substantial proportion have total distances very close to the minimum, making them effective median approximations. We also explore how far the medians tend to be from the input permutations and find that most lie relatively close, an observation consistent with prior theoretical results [8, 9, 11, 4]. We conclude the paper with a discussion of our findings, their implications, and potential avenues for future research.

2 Breakpoint medians

We represent an unichromosomal genome by a permutation π which is a bijection on [n]:={1,,n}. In other words, a permutation π can be represented by π(1),,π(n), which indicates a specific order on [n]. When there is no risk of ambiguity, we often write πi instead of π(i), and denote π:=π1πn. We define the set of adjacencies of π as 𝒜π={{πi,πi+1}1in1}, where each adjacency is treated as an unordered pair. Let Sn denote the set of all permutations of length n. Given x,ySn, we denote by 𝒜x,y:=𝒜x𝒜y the set of all common adjacencies of x and y. For a set XSn, we also denote by 𝒜X:=xX𝒜x the set of all common adjacencies of permutations in X. The breakpoint (bp) distance between x and y is define by d(x,y):=n1|𝒜x,y|.

The breakpoint distance is neither a geodesic distance nor an edit distance, and for this reason the notion of partial geodesics was introduced by Jamshidpey et al. [9]. We can consider the breakpoint distance as a generalized edit distance that determines the parsimonious (shortest) paths of transforming one permutation to another, but with many missing points in the parsimonious path. In other words, in edit distances the length of every jump from a point in the parsimonious path to its closest point in the the same path is one, while in generalized edit distances such as the breakpoint distance this length may be bigger. A partial geodesic [9] between x and y is a maximal chain x=π0,π1,,πk1,πk=y in Sn such that i=0k1d(πi,πi+1)=d(x,y). We denote by [x,y]¯ the set of all permutations lying on partial geodesics connecting x,ySn, and call them geodesic points of x and y.

For a set of three or more genomes X={x1,,xk}, a breakpoint median is a genome that minimizes the total distance function dT(,X):=i=1kd(xi,). The minimal value of dT is known as the median value of the set X, denoted by μ(X). The set of all breakpoint medians of X is denoted by M(X).

For a set of permutations X={x1,,xk} in Sn for which the pairwise breakpoint distances take the maximum value n1, Jamshidpey et al. [9] provide a necessary and sufficient condition for a permutation m to be a median of X, that is m is a median of X if and only if

𝒜mxX𝒜x.

Also from [9], a permutation π is a geodesic point of two permutations x and y, and so it is a median of {x,y}, if and only if 𝒜x,yπ𝒜x𝒜y. On the other hand, we do not have a result establishing a necessary and sufficient condition for a permutation to be a median of a general set of permutations X. In fact, it is known that there may exist a median that does not contain all common adjacencies of permutations in X, i.e., there may exist a median m such that 𝒜X𝒜m, as the example given by Bryant [2]. However, even though there may exist medians not containing all common adjacencies of elements of X, there always exists at least one median with this property, namely, there exists at least one median m such that 𝒜X𝒜m (cf. [2]). In addition, when we have a general set of permutations X, even counter-intuitively, it is not necessary that every adjacency of a median m is an adjacency of at least one of the permutations in X, that is, there may exist a median m such that

𝒜mxX𝒜x,

as is shown by Bryant [2]. However, [5] provides an upper bound for the maximum number of adjacencies of a median that are not in xX𝒜x as stated in Theorem 1 (whose proof is provided in Appendix A). Before the statement of Theorem 1, we need the following notation. Denote by 𝒫(S) the set of all subsets of a set or space S. Let X={x1,,xk}Sn and let XX=x1,,xkX:=𝒜x1,,xk. Then, for any j=1,,k, let

x1,,xj1,xj+1,xkX:=𝒜x1,,xj1,xj+1,xkx1,,xkX

Continuing this, for any i1,,ir[n] and U={xi1,,xir}X, we set

UX=xi1,,xirX:=𝒜U(UVVX).

In other words, UX includes all adjacencies that are common in every xU, but missing from every yXU. We have the following theorem.

Theorem 1 ([5]).

Let X={x1,,xk}Sn be such that

dT(xk,X)=mini=1kdT(xi,X),

and let mM(X). Then

|𝒜m(i=1k𝒜xi)|𝒪n(X):=r=2k1(r1)1i1<<ir<k|xi1,,xirX|. (1)

In particular, for k=3, for any mM(X)

|𝒜mi=13𝒜xi|𝒪n({x1,x2,x3}):=|x1,x2X|.
 Remark 2.

Note that the theorem makes use of the upper bound dT(m,X)dT(xk,X), for any mM(X). In particular, for x=3, dT(x3,X)=mini=1,2,3dT(xi,X) is equivalent to d(x1,x2)=maxi,jd(xi,xj), which itself is equivalent to |x1,x2X|=minij|xi,xjX|. In this case, 𝒪n(X)=|x1,x2X|=minij|xi,xjX| implies that the upper bound is the number of adjacencies common in the pair of farthest genomes, i.e. x1,x2, which are missing from x3.

This upper bound significantly restricts the median search space, and by making use of it, we develop an algorithm to find all breakpoint medians of a general set of permutations.

We first analyze exponential algorithms that construct specific rooted labeled trees, where each ray (a shortest path from the root to a leaf) of length n1 corresponds to a permutation determined by the sequence of labels along the path. The set of all such label sequences includes all medians, thereby significantly reducing the search space. Specifically, the new median search space consists of the set of all leaves of these trees. While the volume of this new search space is exponential, it is negligible compared to the size of the permutation group of length n.

3 An algorithm to find medians

To describe our algorithms, we first define the neighbors of a point (i.e., a number representing a syntenic block or gene) with respect to a given set of permutations. Specifically, for X={x1,,xk}Sn and i=1,n, we define

𝒩X(i)=𝒩x1,,xk(i)={j:{i,j}l=1k𝒜xl}.

Note that for each i, 1|𝒩X(i)|2k. The equality |𝒩X(i)|=1 holds when i satisfies both of the following conditions: i is either the first or last number in each permutation xl, for 1lk; and i is an extremity of an adjacency in 𝒜X. On the other hand, the equality |𝒩X(i)|=2k holds when i satisfies both of the following conditions: i is neither the first nor the last number of any permutation xl, for 1lk; and i is not an extremity of an adjacency in 𝒜xl,xp, for any lp. If X is such that d(xl,xp)=n1 for any lp, then k|𝒩X(i)|2k.

Our main goal in this paper is to find all medians for a given set of permutations XSn. To achieve this, we construct a family of labeled rooted trees of height n1 with the following properties: Each vertex v of the tree is assigned a label, denoted by (v), which is a number between 1 and n. In order for two vertices, u and v, to be connected by an edge, it is necessary that (v)𝒩X((u)). Furthermore, for each path of length n1 from the root to a leaf, the sequence of labels along the path forms a permutation y satisfying certain conditions. In particular, the labels of the root and leaf determine the first and last numbers in y, respectively, i.e., y1 and yn. We refer to y as a permutation given by a leaf.

In the rest of this paper, we first present an algorithm in Section 3.1 for constructing trees in which every permutation y given by a leaf satisfies

𝒜yxX𝒜x.

In this case, if the breakpoint distance between every pair of permutations in X attains the maximum value n1, then from Jamshidpey et al. [9], any permutation y given by a leaf at level n1 is a median of X. Consequently, the algorithm finds all medians of X.

Next, in Section 3.2, we construct trees where every permutation y given by a leaf satisfies

𝒜X𝒜yxX𝒜x.

In this case, if the upper bound given in (1) is zero – a weaker condition than requiring all pairwise distances in X to be maximal – then at least one of the permutations given by a leaf of the tree is a median of X (cf. [2]). This allows us to determine the median value within a relatively smaller search space.

Finally, in Section 3.3, we introduce a modification of the algorithm from Section 3.1, providing additional flexibility to identify all medians of a general set of permutations. This is achieved by allowing permutations to contain a limited number of adjacencies not present in xX𝒜x. The upper bound in (1) ensures that all medians of X are represented among the leaves of the tree constructed by this flexible algorithm.

3.1 Finding all medians of permutations with maximum pairwise distance to each other

Let id denote the identity permutation in Sn, and let xSn be a permutation such that d(id,x)=n1. We first describe the algorithm for the case of two permutations, id and x, and later extend it to k>2 permutations.

For each i=1,,n, we construct a tree whose root is labeled by i. We denote the root of this tree by . The root has |𝒩id,x(i)| children, denoted by 1,2,,|𝒩id,x(i)|. The label of each child is a number in 𝒩id,x(i), such that if jj, then (j)(j). In other words, there is a bijection between the set {(r)1r|𝒩id,x(i)|} and 𝒩id,x(i). By convention, we fix this bijection so that (r) is an increasing function of r; in particular, (1) and (|𝒩id,x(i)|) are the smallest and largest numbers in 𝒩id,x(i), respectively.

Each vertex j1, for 1j1|𝒩id,x(i)|, has |𝒩id,x((j1)){i}| children, denoted by j1j2, where 1j2|𝒩id,x((j1)){i}|, with (j1j2)𝒩id,x((j1)){i}. Moreover, if j2j2, then (j1j2)(j1j2). Continuing this process, the parent of a vertex j1j2jl1jl at level l is the vertex j1j2jl1. If

𝒩id,x((j1j2jl1jl)){(),(j1),,(j1j2jl1)},

then its children are j1j2jl1jljl+1, for

1jl+1|𝒩id,x((j1j2jl1jl)){(),(j1),,(j1j2jl1)}|,

where (j1j2jl1jljl+1)𝒩id,x((j1j2jl1jl)){(),(j1),,(j1j2jl1)}.

Again, if jl+1jl+1, then (j1j2jljl+1)(j1j2jljl+1). Since this is a finite process, it results in a labeled tree for each i as the label of the root, with 1in.

More precisely, the sequence of labels along every (,u)-path, where u is a leaf at level n1, represents a permutation y such that 𝒜y𝒜id𝒜x. Since 𝒜id𝒜x=, we have y[id,x]¯, meaning that we can identify all geodesic points of id and x when d(id,x)=n1. Furthermore, the number of permutations in [id,x]¯ is equal to the number of (,u)-paths of length n1 in all n trees. An example is illustrated in Figure 1.

For each i, 1in, we denote by 𝒯id,xi the tree constructed as above, where the root is labeled by i. We also define 𝒯id,x:={𝒯id,xi1in} as the set of all these n trees.

Figure 1: The representation of the tree 𝒯id,x1, for x=246315, with its labels. In this example we have 𝒩id,x(1)={2,3,5}, 𝒩id,x(2)={1,3,4}, 𝒩id,x(3)={1,2,4,6}, 𝒩id,x(4)={2,3,5,6}, 𝒩id,x(5)={1,4,6} and 𝒩id,x(6)={3,4,5}. Also, d(id,x)=n1=5, and so each path from the root to a leaf in level 5 constitutes a permutation in [id,x]¯. The list of permutations in [id,x]¯ given by this tree is: id=123456, 123465, 123645, 123654, 124365, 124563, 132456, 132465, 136542, 154236, 154632, 156432, 156423, 156342 and 156324. These are all the permutations in [id,x]¯ that start at 1. The bold edges represent the adjacencies of id and the other edges represent the adjacencies of x.

More generally, let X={x1,,xk}Sn be a set of permutations. Following the same steps just replacing 𝒩id,x(i) with 𝒩x1,,xk(i), we can construct n labeled rooted trees, 𝒯Xi, such that the sequence of labels along each (,u)path, where u is a leaf at level n1, forms a permutation y satisfying 𝒜yl=1k𝒜xl. Therefore, if X={x1,,xk}Sn satisfies d(xl,xp)=n1 for any lp, then the set of all permutations given by leaves at level n1 in the trees 𝒯Xi, for i=1,,n, is exactly the set of all medians of X. We denote

𝒯X:={𝒯Xi;1in}

and let Y(𝒯Xi) be the set of all permutations ySn that are given by a leaf of 𝒯Xi at level n1. Moreover, let

Y(𝒯X):=i=1nY(𝒯Xi).

Each vertex of a tree 𝒯Xi is a sequence j1j2jl where each ji, i=1,..,l, is a number between 1 and 2|X|. To construct a child vertex and its label from its parent and the parent’s label, we define the following operation. Given a sequence of symbols u=u1ul (e.g., numbers) and a symbol r, we define the operation ur as a new sequence of symbols ur:=u1ulr. We emphasize that in the above algorithm u1= and u2,..,ul and r are natural numbers in {1,,2|X|}. For each fixed tree of 𝒯X, we denote by Tu(𝒯Xi)=Tu the ordered sequence of labels assigned to the vertices along the (,u)path for a vertex u in 𝒯Xi. Observe that Tu is a sequence of digits, where each digit is between 1 and n, and all digits are distinct. We denote by dig(Tu) the set of labels appearing in Tu, and let 𝒩X(Tu):=𝒩X((u)). Additionally, we define Lj(𝒯Xi)=Lj to be the set of all vertices of the tree 𝒯Xi at level j, for 0jn1, considering the root at level zero. Using these notations, the tree construction process for k2 permutations is described in Algorithm 1. These notations will also be used in the subsequent sections.

Note that we can also view the tree construction in suffix-tree terms, as follows. Each tree 𝒯Xi has root label i, and every internal node that spells a prefix (i,u1,,uj) branches to a child uj+1 if and only if uj+1 is adjacent to uj in at least one genome in X, and uj+1 has not yet appeared in the prefix.

Algorithm 1 Gives permutations y such that 𝒜yxX𝒜x.
 Remark 3 (Finding permutations with maximum distance of a set X).

Given X={x1,,xk}Sn, denote by 𝒩¯X(i)=𝒩¯x1,,xk(i):=[n]𝒩X(i) the complement set of 𝒩X(i), for 1in. Note that, if we replace 𝒩X() by 𝒩¯X() in Algorithm 1, we obtain all permutations with maximum distance from X, i.e, we find all permutations y such that d(y,xi)=n1, for 1in.

3.2 Finding all geodesic points for a general set of permutations

A segment s of a set of adjacencies I𝒜π, for πSn, is a maximal set of consecutive adjacencies of I, i.e. it is a set

s={{π(r),π(r+1)},{π(r+1),π(r+2)},,{π(r+k1),π(r+k)}}I

such that {π(r1),π(r)},{π(r+k),π(r+k+1)}I, for r>1 and r+k<n. We often denote s by π(r),,π(r+k), and write s^I. We say that Int(s):={π(r+1),,π(r+k1)} are the internal points of s, and End(s):={π(r),π(r+k)} are the end points of s. Generalizing the idea, the internal and end points of I𝒜π are defined by

Int(I):=s^IInt(s),End(I):=s^IEnd(s).

Note that the above definitions do not depend on a specific choice of π, that is, the definitions remain intact if we replace π by any π for which I𝒜π.

Now consider the case where xSn satisfies d(id,x)<n1, that is, 𝒜id,x. We can apply a similar idea as in the case of maximum distance, but now with some restrictions. From [9], a permutation y[id,x]¯ if and only if 𝒜id,x𝒜y𝒜id𝒜x. As a result, if s=n0,,nl is a segment of 𝒜id,x, then the ordered sequence of digits n0nl must appear in the ordered sequence of labels of the (,u)-paths with length n1. In order for this to hold, first note that no internal point of 𝒜id,x can be a label of the root. In fact, if iInt(𝒜id,x), then there exist j and j with {i,j} and {i,j} in 𝒜id,x. Therefore, if i is the label of the root, any permutation y given by a leaf at the level n1 will contain either {i,j} or {i,j} (but not both), and thus cannot satisfy 𝒜id,x𝒜y. This implies that if iInt(𝒜id,x), then i can only be a label of an internal vertex of the tree. Moreover, since |𝒩id,x(i)|=2, the vertex of the label equal to i will have exactly one child. Therefore, for any segment s=n0,,nl^𝒜id,x, either n0 or nl should appear before n1,,nl1 in Tu for any leaf u. To ensure the condition 𝒜id,x𝒜y, it follows that each segment s=n0,,nl^𝒜id,x, if a vertex v has label (v)=n0 and nl is not in Tv (or the opposite, (v)=nl and n0 is not in Tv), then v must have exactly one child v1 with label (v1)=n1 (or (v1)=nl1).

To describe the tree construction process, for a given segment s^𝒜X and jEnd(s), we denote by j¯ the other end point of s and by j the unique point (number) such that adjacency {j,j}s. In the case where d(id,x)<n1, for each i[n]Int(𝒜id,x) we construct a rooted tree 𝒯¯id,xi with the root label i. At each level l, a vertex j1jl1jl is a child of j1jl1. Now, if 𝒩id,x((j1jl1jl)){(),(j1),,(j1j2jl1)} then j1jl1jl has children defined as follows. If (jl)End(s) and (jl)¯dig(Tj1jl), for some segment s^𝒜id,x, then j1jl has exactly one child j1jl1 with label (j1jl1)=(j1jl1jl). Otherwise, its children are j1jljl+1, for

1jl+1|𝒩id,x((j1jl1jl)){(),(j1),,(j1j2jl1)}|,

where (j1j2jl1jljl+1)𝒩id,x((j1jl1jl)){(),(j1),,(j1j2jl1)}, in the same way that if jl+1jl+1, then (j1j2jljl+1)(j1j2jljl+1). After a finite number of steps, we construct |[n]Int(Iid,x)| trees such that for each leaf u at the level n1, Tu gives a permutation y satisfying 𝒜id,x𝒜y𝒜id𝒜x.

We can generalize this idea to a set of k permutations X={x1,,xk}Sn. Following the same steps, just replacing 𝒩id,x(i) with 𝒩X(i) and 𝒜id,x with 𝒜X, we construct |[n]Int(𝒜X)| labeled rooted trees 𝒯¯Xi, such that for each leaf u at the level n1, the sequence Tu corresponds to a permutation y with 𝒜X𝒜yl=1k𝒜xl. Denote by

𝒯¯X={𝒯¯Xi;i[n]Int(𝒜X)}

and define Y(𝒯¯Xi) to be the set of all permutations ySn given by a leaf of 𝒯¯Xi, and let

Y(𝒯¯X):=i[n]Int(𝒜X)Y(𝒯¯Xi).

For a set X where the upper bound in (1) is equal to zero (recall that this condition is weaker than requiring all permutations in X to be at maximum pairwise distance), from [2], there exists at least one yY(𝒯¯X) that is a median of X. More precisely, in this case, any yY(𝒯¯X) such that

l=1kd(xl,y)=minyY(𝒯¯X)(l=1kd(xl,y)),

is a median of X. Thus, in addition to finding some medians of X, this algorithm also finds the median value of X efficiently. The tree construction process is described in Algorithm 2.

Algorithm 2 Gives all permutations y such that 𝒜x1,,xk𝒜yl=1k𝒜xl.

Note that, if X={x1,,xk} is a set of permutations such that d(xl,xp)=n1, for any lp, then 𝒜X=. Therefore, in Algorithm 2, the condition

(u)End(𝒜X) and (u)¯dig(Tu)

does not hold, and hence the algorithm proceeds directly to the“else” branch. In this case, Algorithm 2 yields exactly the same output as Algorithm 1. Furthermore, for a general set of permutations X={x1,,xk}, the tree 𝒯Xi produced by Algorithm 1 contains, as a subgraph, the tree 𝒯¯Xi generated by Algorithm 2, for all i[n]Int(𝒜X). The main properties of these subtrees are:

  • No internal point of 𝒜X can be used as the label of the root, as previously noted;

  • For any path starting at the root, once the path reaches one of the end points of a segment s^𝒜X, say j, the path continues without branching until it reaches the other end point of s, namely j¯.

3.3 An algorithm to find all medians of a general set of permutations

As seen in Theorem 1, 𝒪n(X), for XSn, is an upper bound for the number of adjacencies of any median mM(X) outside xX𝒜x. To apply this result to k independent random permutations, namely ξ1,,ξkSn, recall that a sequence of random variables (Zn)n+ converges in probability to a random variable Z, as n goes to infinity, if for any ε>0, (|ZnZ|>ε)0. We know that 𝒪n({ξ1,,ξk}) is very small, with high probability. More explicitly, from [5], we know that

𝒪n(X)an0,n,

in probability, for any sequence (an)n diverging to , such that an/n0, as n. Therefore, if we consider the flexibility of using 𝒪n(X) adjacencies out of xX𝒜x in Algorithm 1, then we obtain all permutations y with at most 𝒪n(X) adjacencies out of xX𝒜x, which we call 𝒪n(X)freedom permutations. These permutations include all medians of X with high probability. More generally, for a non-negative integer α0, we say a permutation π is αfreedom with respect to XSn, if |𝒜πxX𝒜x|α. In this section, we extend our algorithm to construct αfreedom medians of XSn for α=𝒪n(X), i.e. the medians of X that include at most α adjacencies out of xX𝒜x.

Let X={x1,,xk}Sn be a set of permutations such that 𝒪n(X)0. For every i=1,,n, we construct a tree with a root labeled by i. We denote by the root of this tree. Now for each vertex of a tree we add a new parameter, namely, for each vertex u we assign a number τu, with 0τu𝒪n(X), that determines the number of children of vertex u in the tree and the number of adjacencies that are not in xX𝒜x and appear in Tu, in the following way: if τu0 then u has n|dig(Tu)| children, i.e., we construct n|dig(Tu)| sequences of labels by adding to the Tu all possible numbers j, from 1 to n, that did not appear in Tu, and so we add the adjacency {(u),j} for each permutation y that is being constructed from the sequence of labels, which also includes adjacencies that are not in xX𝒜x. If τu=0 then any descendent vertex v of u has τv=0 and u has the same number of children given by Algorithm 1, which is |𝒩X((u))dig(Tu)|. So in this case, Tu already contain 𝒪n(X) adjacencies out of xX𝒜x. For the root we assign τ=𝒪n(X). So the root has n1 children, called 1, 2, …, (n1), with (j)=j, for j<i, and (j)=j+1, for ji. We assign τj=τ1 if (j)𝒩X(i), or τ=τj if (j)𝒩X(i). For each vertex j, if τj0 then j has n2 children, called jj, for 1jn2, with (jj)[n]{(),(j)} such that there is a bijection between set {(jj):1jn2} and [n]{(),(j)}. If (jj)𝒩X((j)) then τjj=τj1, and if (jj)𝒩X((j)) then τjj=τj. On the other hand, if τj=0, then j has |𝒩X((j)){i}| children, namely jj, for 1j|𝒩X((j)){i}| with τjj=0 and (jj)𝒩X((j)){i} in the way that if jj′′, then (jj)(jj′′). Continuing this process, the parent of a vertex j1j2jl1jl, in level l is the vertex j1j2jl1. If τj1j2jl1jl0, then j1j2jl1jl has n|dig(Tj1j2jl1jl)| children, called j1j2jljl+1, with (j1j2jl1jljl+1)[n]dig(Tj1j2jl1jl) such that there is a bijection between set

{(j1j2jljl+1):1jl+1n|dig(Tj1j2jl1jl)|}

and [n]dig(Tj1j2jl1jl). If (j1j2jljl+1)𝒩X((j1j2jl)) then τj1j2jljl+1=τj1j2jl1, and if (j1j2jljl+1)𝒩X((j1j2jl)) then τj1j2jljl+1=τj1j2jl. Now, in the case that τj1j2jl=0, the children of j1j2jl are labeled by 𝒩X((j1j2jl))dig(Tj1j2jl), as in Algorithm 1. After a finite number of steps, we construct the tree denoted by 𝒯X,𝒪i (or 𝒯X,αi for general α0) such that each permutation given by a leaf in the level n1 is an 𝒪n(X)freedom permutation (αfreedom permutation, respectively). We denote 𝒯X,𝒪:={𝒯X,𝒪i: 1in}, and 𝒯X,α:={𝒯X,αi: 1in}. We also let Y(𝒯X,𝒪i) be the set of all permutations ySn that are given by a leaf of 𝒯X,𝒪i in the level n1, and let Y(𝒯X,𝒪):=i=1nY(𝒯X,𝒪i). The definitions of Y(𝒯X,αi), and Y(𝒯X,α) are similar. The construction of such trees is described in the following Algorithm 3, for general α0.

Not only does Algorithm 3 give all 𝒪n(X)freedom permutations but also for each possible permutation in the level n1, the parameter τ indicates the exact number of adjacencies of the permutation from outside of xX𝒜x, e.g., if τu=i then (𝒪n(X)i) adjacencies are from outside in Tu. The trees constructed from Algorithm 3 have as subtrees the trees given by Algorithm 1, considering the same set of permutations. An example is given in Figure 2.

Figure 2: Representation of 𝒯X,𝒪1, for X={id=12345,52341,23145}, where 𝒪5(X)=1. The subtree induced by the blue edges is 𝒯¯X1 and the subtree induced by the blue and red edges is 𝒯X1. The median value of X is μ(X)=4 and 14523 is the unique median given by the tree 𝒯X,𝒪1 which is different from the input permutations. In this example, all medians given by the tree 𝒯X,𝒪1 are actually in the subtree 𝒯¯X1. Also, 13254 is an example of a permutation in the set {ySn;𝒜X𝒜yxX𝒜x} that is not a median for X.
Algorithm 3 α-freedom permutations w.r.t. X.

4 Experimental results

For each n from 6 to 15, we performed 100 independent runs of Algorithm 3 on a set X={x1,x2,x3}Sn of three permutations, where x1=id and x2, x3 are randomly generated such that 𝒪n(X)3. For each n, we compute the mean of the normalized median value. As shown in Figure 3, the mean of the normalized median value increases with n, and we expect it to approach 2 as n, which is in accordance with the last Theorem in [9].

Figure 3: The blue line represents the mean normalized median value for sets of three permutations {id,x2,x3}, where x2 and x3 are randomly and independently sampled (also independently for each run) such that 𝒪n({id,x1,x2})3. The red and green lines indicate the minimum and maximum normalized median values observed across 100 independent runs of Algorithm 3, for each genome size n=6,,15.

Although not all the permutations in Y(𝒯X,𝒪) are medians, we find that non-median permutations in Y(𝒯X,𝒪) often have total distances close to the median value, indicating that they serve as good approximations. To formalize this, we define

Kj(𝒯X,𝒪):={yY(𝒯X,𝒪);xXd(y,x)μ(X)=j},

and compute the mean of the proportion |Kj(𝒯X,𝒪)|/|Y(𝒯X,𝒪)| over 100 runs for each 6n15. Note that K0(𝒯X,𝒪)=M(X) and for small j>0 we can consider the permutations in Kj(𝒯X,𝒪) as approximate medians, since the total distance is close to the minimum total distance.

Figure 4 shows that a significant portion of permutations in Y(𝒯X,𝒪) have total distances concentrated near the minimum, indicating that while most are not exact medians, many are close approximations.

Figure 4: The mean of |Ki(𝒯X,𝒪)|/|Y(𝒯X,𝒪)|, for each 6n15.

In fact, across all tested values of n, the union K0K1K2 consistently contains over 33% of Y(𝒯X,𝒪), confirming the abundance of near-optimal solutions in the reduced space. For example, at n=6, this set accounts for 61.9% of candidates; for n=12, it still covers 36.4% despite the increase in size. Table 1 summarizes these proportions numerically for selected values of n.

Table 1: Proportion of permutations in K0K1K2 over Y(𝒯X,𝒪) for selected values of n.
n |K0| |K1| |K2| Total proportion (%)
6 7.77% 22.58% 31.55% 61.9%
8 4.91% 16.03% 25.08% 46.0%
10 4.07% 13.04% 21.01% 38.1%
12 4.25% 12.87% 19.29% 36.4%
14 4.46% 13.23% 20.20% 37.9%

To analyze the proportion of medians far from the input set, we denote by Mi:={mM(X);d(m,xk)i, for xkX}. Note that M0=M(X), MiMl for l<i, and Mi is empty set for i>2n/3. Figure 5 shows the mean of the ratio of |Mi|/|M(X)|, for 6n15. The results indicate that the proportion of medians far from all input permutations decreases rapidly, consistent with the observations and conjectures of Haghighi and Sankoff [8]. For example, when n=12, over 91% of medians are within distance 3 of all inputs, and fewer than 0.8% exceed distance 6. This illustrates the general trend that most medians tend to remain close to at least one input genome.

Figure 5: The mean of |Mi|/|M(X)|, for each 6n15.

However, as n increases, the number of medians that lie far from all inputs also grows. Table 2 reports the proportion of medians lying in Mi for values of i near 2n3, which corresponds to the breakpoint distance of a “midpoint” genome – that is, one that draws approximately one-third of its adjacencies from each of the three input genomes. As expected, the proportion of medians with distance at least 2n3 from all inputs is either zero or negligible across all tested values of n, reflecting the rarity of truly equidistant medians. Still, for slightly smaller values such as 2n31, 2n32, or 2n33, the proportion increases noticeably. For instance, when n=14, the set M6, consisting of medians at distance at least 6 from all three inputs, contains more than 11% of all medians, and M5 contains over 41%. These medians are still far from each input genome – at least 5 breakpoints away – yet appear with consistent frequency, indicating a non-negligible presence near the midpoint region as n increases.

Table 2: Mean proportion of medians in Mi for values of i near the midpoint distance 2n3.
n i=2n3 i=2n31 i=2n32 i=2n334
7 0 3.8% 42.54% -
8 0 0.57% 27.59% -
9 0 9.34% 50.74% -
10 0 2.97% 26.85% -
11 0 15.83% 52.02% -
12 0 0.08% 6.94% 36.54%
13 0 2.06% 18.16% 53.56%
14 0 0.76% 11.90% 41.63%
15 0 0 5.29% 29.11%

To quantify the algorithm’s efficiency, we compare the size of the reduced space to n!. Table 3 demonstrates that the number of permutations explored by Algorithm 3 represents only a tiny fraction of Sn, yet suffices to find all exact and many near-optimal medians. For instance, when n=15, the number of candidate medians generated by the algorithm – i.e., the search space – is less than 0.003% of the full 15!1.31×1012 permutations.

Table 3: Reduction in search space by Algorithm 3 for selected values of n (with 𝒪n(X)3).
n Total candidates Total medians n! candidatesn! (%)
6 180.94 11.46 720 25.13%
8 2981.10 82.86 40320 7.40%
10 24824.90 513.54 3628800 0.68%
12 353921.52 3387.82 4.79×108 0.07%
14 1882425.04 23815.54 8.72×1010 0.002%
15 36659,718 48372.52 1.31×1012 0.0028%

Although Algorithm 3 was run with 𝒪n(X)3, allowing up to three adjacencies outside the union of the input adjacencies, we observed that such instances were extremely rare – and when they occurred, each involved only a single external adjacency. For example, at n=6, only 0.04 medians per run (roughly 0.35% of all medians) included one adjacency not present in the union of the inputs. At n=12, the mean was 0.48 per run (under 0.015%). For all other values of n15, there was no external adjacency. These results indicate that nearly all medians are already covered when we allow zero-freedom, that is, when every adjacency is drawn from the input genomes. In practice, therefore, we can use Algorithm 1, which corresponds to the zero-freedom version of Algorithm 3, to recover most of the medians while achieving substantial speed-ups. When no adjacency is taken from outside the union xX𝒜x, the algorithm completes around 0.75 seconds and uses approximately 19.26 MB of memory for n=10; about 40 seconds and 104.16 MB for n=13; and around 3.5 minutes and 289.94 MB for n=15 (mean runtime and memory usage over 5 runs, measured on a 2.3 GHz quad-core Intel Core i7 machine with 32 GB RAM).

Finally, although it is known that, given a set of genomes X, there may exist medians that do not contain all adjacencies in 𝒜X, we verified that for the input sets tested (6n15), all medians returned by Algorithm 3 contained the full set of common adjacencies 𝒜X shared by the input genomes. As a result, Algorithm 3 produced the same set of medians as Algorithm 2 on all tested instances.

5 Conclusion

In this paper, we introduced a novel algorithmic framework to find all breakpoint medians of a given set of linear unsigned genomes. Unlike previous methods – which reduce the breakpoint median problem to an instance of the Traveling Salesman Problem (TSP) and return only a single median – our approach is based on the construction of rooted, labeled trees that allow us to find all medians, along with a substantial number of near-medians. Each path of length n1 from the root to a leaf encodes a unique permutation, and the tree structure is designed to efficiently capture the combinatorial space in which medians reside.

This structural strategy provides a new perspective on the median problem. It not only allows us to find all medians in exponential time, but also to systematically explore a constrained and meaningful subset of the permutation space. This is particularly valuable for comparative genomics, where the goal is often to infer an ancestral genome that minimizes evolutionary distance to the observed genomes. Having access to the entire set of medians makes it possible to evaluate and compare them based on additional biological or statistical criteria, such as similarity to known ancestral features or consistency with gene orientation and synteny.

From a theoretical point of view, we demonstrated that our method finds the exact median value, even in cases where prior methods could not. Experimentally, we showed that the number of candidate permutations generated by our trees is a vanishingly small fraction of the full symmetric group (e.g., less than 0.0028% of S15), yet this restricted space reliably captures all medians and a large portion of near-optimal solutions. In particular, we found that a substantial fraction of permutations in the output tree fall into K0K1K2, indicating that many are either exact or high-quality approximate medians. We also observed that even when allowing up to three adjacencies outside the input set, the inclusion of such external adjacencies was extremely rare, often occurring in fewer than 1% of medians.

Finally, we investigated how far medians tend to lie from all inputs using the Mi decomposition. While truly equidistant medians are rare, we found that a non-negligible proportion of medians are located near the theoretical midpoint region. Moreover, we observed that most medians are relatively close to the input permutations, an observation that aligns with theoretical results in the literature [8, 9, 4]. This suggests a layered structure in the space of medians that could be exploited for further biological modeling and inference.

While our work focuses on the breakpoint median problem for unsigned unichromosomal genomes, the algorithm and underlying methodology are not limited to this setting. The core tree-based construction and median search strategy naturally extend to more general models, including signed permutations and multichromosomal genomes. Overall, our method not only offers a new algorithmic contribution but also opens up a range of possibilities for deeper combinatorial and biological analysis of breakpoint medians and their role in gene order phylogeny.

References

  • [1] Sylvia Boyd and Maryam Haghighi. A fast method for large-scale multichromosomal breakpoint median problems. Journal of Bioinformatics and Computational Biology, 10(01):1240008, 2012. doi:10.1142/S0219720012400082.
  • [2] David Bryant. The complexity of the breakpoint median problem. Centre de recherches mathematiques, 1998.
  • [3] Alberto Caprara. The reversal median problem. INFORMS Journal on Computing, 15(1):93–113, 2003. doi:10.1287/IJOC.15.1.93.15155.
  • [4] Poly H da Silva, Arash Jamshidpey, and David Sankoff. Sampling gene adjacencies and geodesic points of random genomes. In RECOMB International Workshop on Comparative Genomics, pages 189–210. Springer, 2024. doi:10.1007/978-3-031-58072-7_10.
  • [5] Poly H da Silva, Arash Jamshidpey, and David Sankoff. On the number of breakpoint medians of random genomes. preprint (submitted), 2025.
  • [6] Pedro Feijão and João Meidanis. SCJ: a variant of breakpoint distance for which sorting, genome median and genome halving problems are easy. In International Workshop on Algorithms in Bioinformatics, pages 85–96. Springer, 2009. doi:10.1007/978-3-642-04241-6_8.
  • [7] G Fertin, A Labarre, I Rusu, E Tannier, and S Vialette. Combinatorics of genome rearrangements. The MIT Press, 2009.
  • [8] Maryam Haghighi and David Sankoff. Medians seek the corners, and other conjectures. BMC Bioinformatics, 13(19):S5, 2012. doi:10.1186/1471-2105-13-S19-S5.
  • [9] Arash Jamshidpey, Aryo Jamshidpey, and David Sankoff. Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints. BMC Genomics, 15(6):S3, 2014.
  • [10] Arash Jamshidpey and David Sankoff. Phase change for the accuracy of the median value in estimating divergence time. BMC Bioinformatics, 14(15):S7, 2013. doi:10.1186/1471-2105-14-S15-S7.
  • [11] Caroline Anne Larlee, Chunfang Zheng, and David Sankoff. Near-medians that avoid the corners; a combinatorial probability approach. BMC Genomics, 15(6):S1, 2014.
  • [12] Mona Meghdari Miardan, Arash Jamshidpey, and David Sankoff. Escape from parsimony of a double-cut-and-join genome evolution process. Journal of Computational Biology, 30(2):118–130, 2023. doi:10.1089/CMB.2021.0468.
  • [13] David Sankoff and Mathieu Blanchette. The median problem for breakpoints in comparative genomics. Computing and Combinatorics, pages 251–263, 1997. doi:10.1007/BFB0045092.
  • [14] David Sankoff, Gopalakrishnan Sundaram, and John Kececioglu. Steiner points in the space of genome rearrangements. International Journal of Foundations of Computer Science, 7(01):1–9, 1996. doi:10.1142/S0129054196000026.
  • [15] Eric Tannier, Chunfang Zheng, and David Sankoff. Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics, 10(1):120, 2009. doi:10.1186/1471-2105-10-120.
  • [16] Andrew Wei Xu. The median problems on linear multichromosomal genomes: Graph representation and fast exact solutions. Journal of Computational Biology, 17(9):1195–1211, 2010. doi:10.1089/CMB.2010.0106.
  • [17] João Paulo Pereira Zanetti, Priscila Biller, and João Meidanis. Median approximations for genomes modeled as matrices. Bulletin of Mathematical Biology, 78:786–814, 2016.
  • [18] Chunfang Zheng and David Sankoff. On the pathgroups approach to rapid small phylogeny. BMC Bioinformatics, 12(1):S4, 2011. doi:10.1186/1471-2105-12-S1-S4.

Appendix A Proof of Theorem 1

Below, we include the proof of Theorem 1, as presented in [5].

Proof.

For a permutation π and rk, let ε¯i1,,irX(π):=|𝒜πxi1,xirX|. To ease the notation, we let i1,,i=xi1,,xi. Let η=|𝒜mi=1k𝒜xi|. Then

η+r=1k1i1<<irkε¯i1,,irX(m)=n1.

As m is a median of X, we have

dT(m,X)=k(n1)r=1k[r1i1<<irkε¯i1,,irX(m)]=(k1)(n1)+ηr=2k[(r1)1i1<<irkε¯i1,,irX(m)]dT(xk,X)=(k1)(n1)(1i1<k|i1,kX|+21i1<i2<k|i1,i2,kX|++(k2)1i1<<ik2<k|i1,,ik2,kX|+(k1)|1,,kX|).

Hence,

η(r=2k(r1)1i1<<irkε¯i1,,irX(m))(r=2k(r1)1i1<<ir1<k|i1,,ir1,kX|)r=2k1(r1)1i1<<ir<k|i1,,irX|, (2)

where the last inequality holds because ε¯i1,,irX(m)|i1,,irX|, for any rk and 1i1<<irk.