Identifying Breakpoint Median Genomes:
A Branching Algorithm Approach
Abstract
Genome comparison often involves quantifying dissimilarities between genomes with identical gene sets, commonly using breakpoints – points where adjacent genes in one genome are not adjacent in another. The concept of a median genome, used for comparison of multiple genomes, aims to find a genome that minimizes the total distance to all genomes in a given set. While median genomes are useful for extracting common genomic information and estimating ancestral traits, the existence of multiple divergent medians raises concerns about their accuracy in reflecting the true ancestor. The median problem is known to be NP-hard, particularly for unichromosomal genomes, and solving it becomes increasingly challenging under different genome distance models. In this work, we introduce a novel branching algorithm to efficiently find all breakpoint medians of linear unichromosomal genomes, represented as unsigned permutations. This algorithm constructs a rooted labeled tree, where the sequence of labels along each complete ray defines a genome, providing a structured and efficient way to explore the space of candidate medians by narrowing the search to a well-defined and significantly smaller subset of the permutation space. We validate our approach with experiments on randomly generated sets of three permutations. The results show that our method successfully finds the exact medians and also identifies many near-optimal approximations. Our experiments further show that most medians lie relatively close to the input permutations, in agreement with prior theoretical results.
Keywords and phrases:
Breakpoint distance, median genomes, phylogeny reconstruction, random permutationsCopyright and License:
2012 ACM Subject Classification:
Mathematics of computing Combinatorial algorithms ; Mathematics of computing Permutations and combinations ; Applied computing Computational genomicsEditors:
Broňa Brejová and Rob PatroSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Comparing genomes with the same syntenic blocks involves computing their dissimilarities. This dissimilarity is often quantified by identifying breakpoints, points at which genes are adjacent in one genome but not in the other. Introduced formally by Sankoff and Blanchette in 1997 [13], the total number of breakpoints serves as a metric for dissimilarity. To compare multiple genomes, we can use the concept of median. Given a set of three or more genomes and a distance , a median for the set is a genome that minimizes the total distance function . The concept of the median was first employed by Sankoff et al. [14] in 1996 within the context of evolutionary gene order models. Motivated by the search for ancestral genomic information and its applications to small phylogeny problems, the median problem has since attracted significant attention [2, 3, 15, 7, 6, 16, 17].
However, the complexity of the median problem varies with different genome distances, often proving to be NP-hard [3, 15, 7] particularly for unichromosomal genomes. For instance, the breakpoint median problem was shown to be NP-hard by Bryant [2] for linear unichromosomal genomes. Moreover, identifying a median in which adjacency sets are contained within the union of the adjacency sets of the input genomes has also been proven to be NP-hard [2]. Following the reduction of the median problem to the Traveling Salesman Problem (TSP) by Sankoff and Blanchette [13], in 2012, Boyd and Haghighi [1], using Concorde (a fast software to find TSP solutions), presented a fast algorithm to find breakpoint medians of samples of large genomes.
While median genomes aim to extract common information among given genomes and estimate ancestral characteristics, the existence of multiple medians with considerable divergence [8] raises questions about their proximity to the true ancestor or their usability in providing ancestral insights. Additionally, determining which, if any, of these medians accurately reflects ancestral traits poses a significant challenge. In fact, Zheng and Sankoff [18], Jamshidpey and Sankoff [10] and Miardan et al. [12] showed that median may fail to approximate the ancestor for the long-time evolution of genomes, while for genomes involved in evolution for a shorter period of time medians may approximate the true ancestor.
To address the challenge of identifying relevant medians, we propose a novel branching algorithm for efficiently finding all breakpoint medians of linear unichromosomal genomes represented by unsigned permutations of length . This exponential algorithm constructs a rooted labeled tree, whose sequence of labels for each ray (a shortest path connecting the root to a leaf) with length determines a unichromosomal genome (represented as a permutation). The set of all such unichromosomal genomes contains all medians of the input genomes. We show that this tree construction reduces the median search space significantly compared to the full space of permutations (see Table 3).
This paper is organized as follows. We begin by laying the foundation. In Section 2 we introduce the basic concepts of a genomic space with breakpoint distance and review some essential prior results in the literature. In Section 3, we delve into the methodology behind our branching algorithms designed to identify all medians within a given set of genomes. Subsequently, in Section 4, we provide empirical validation of our approach through a series of experiments using sets of three random permutations. A key contribution of this experimental section is that our method is able to compute the median value exactly, even in cases where it remained unknown in previous work. We examine how the median value behaves as the permutation length increases and analyze the distribution of approximate medians in the reduced search space generated by our algorithm. The results indicate that, although not all permutations in this space are true medians, a substantial proportion have total distances very close to the minimum, making them effective median approximations. We also explore how far the medians tend to be from the input permutations and find that most lie relatively close, an observation consistent with prior theoretical results [8, 9, 11, 4]. We conclude the paper with a discussion of our findings, their implications, and potential avenues for future research.
2 Breakpoint medians
We represent an unichromosomal genome by a permutation which is a bijection on . In other words, a permutation can be represented by , which indicates a specific order on . When there is no risk of ambiguity, we often write instead of , and denote . We define the set of adjacencies of as , where each adjacency is treated as an unordered pair. Let denote the set of all permutations of length . Given , we denote by the set of all common adjacencies of and . For a set , we also denote by the set of all common adjacencies of permutations in . The breakpoint (bp) distance between and is define by .
The breakpoint distance is neither a geodesic distance nor an edit distance, and for this reason the notion of partial geodesics was introduced by Jamshidpey et al. [9]. We can consider the breakpoint distance as a generalized edit distance that determines the parsimonious (shortest) paths of transforming one permutation to another, but with many missing points in the parsimonious path. In other words, in edit distances the length of every jump from a point in the parsimonious path to its closest point in the the same path is one, while in generalized edit distances such as the breakpoint distance this length may be bigger. A partial geodesic [9] between and is a maximal chain in such that . We denote by the set of all permutations lying on partial geodesics connecting , and call them geodesic points of and .
For a set of three or more genomes , a breakpoint median is a genome that minimizes the total distance function . The minimal value of is known as the median value of the set , denoted by . The set of all breakpoint medians of is denoted by .
For a set of permutations in for which the pairwise breakpoint distances take the maximum value , Jamshidpey et al. [9] provide a necessary and sufficient condition for a permutation to be a median of , that is is a median of if and only if
Also from [9], a permutation is a geodesic point of two permutations and , and so it is a median of , if and only if . On the other hand, we do not have a result establishing a necessary and sufficient condition for a permutation to be a median of a general set of permutations . In fact, it is known that there may exist a median that does not contain all common adjacencies of permutations in , i.e., there may exist a median such that as the example given by Bryant [2]. However, even though there may exist medians not containing all common adjacencies of elements of , there always exists at least one median with this property, namely, there exists at least one median such that (cf. [2]). In addition, when we have a general set of permutations , even counter-intuitively, it is not necessary that every adjacency of a median is an adjacency of at least one of the permutations in , that is, there may exist a median such that
as is shown by Bryant [2]. However, [5] provides an upper bound for the maximum number of adjacencies of a median that are not in as stated in Theorem 1 (whose proof is provided in Appendix A). Before the statement of Theorem 1, we need the following notation. Denote by the set of all subsets of a set or space . Let and let . Then, for any , let
Continuing this, for any and , we set
In other words, includes all adjacencies that are common in every , but missing from every . We have the following theorem.
Theorem 1 ([5]).
Let be such that
and let . Then
| (1) |
In particular, for , for any
Remark 2.
Note that the theorem makes use of the upper bound for any . In particular, for , is equivalent to , which itself is equivalent to . In this case, implies that the upper bound is the number of adjacencies common in the pair of farthest genomes, i.e. , which are missing from .
This upper bound significantly restricts the median search space, and by making use of it, we develop an algorithm to find all breakpoint medians of a general set of permutations.
We first analyze exponential algorithms that construct specific rooted labeled trees, where each ray (a shortest path from the root to a leaf) of length corresponds to a permutation determined by the sequence of labels along the path. The set of all such label sequences includes all medians, thereby significantly reducing the search space. Specifically, the new median search space consists of the set of all leaves of these trees. While the volume of this new search space is exponential, it is negligible compared to the size of the permutation group of length .
3 An algorithm to find medians
To describe our algorithms, we first define the neighbors of a point (i.e., a number representing a syntenic block or gene) with respect to a given set of permutations. Specifically, for and , we define
Note that for each , . The equality holds when satisfies both of the following conditions: is either the first or last number in each permutation , for ; and is an extremity of an adjacency in . On the other hand, the equality holds when satisfies both of the following conditions: is neither the first nor the last number of any permutation , for ; and is not an extremity of an adjacency in , for any . If is such that for any , then .
Our main goal in this paper is to find all medians for a given set of permutations . To achieve this, we construct a family of labeled rooted trees of height with the following properties: Each vertex of the tree is assigned a label, denoted by , which is a number between and . In order for two vertices, and , to be connected by an edge, it is necessary that . Furthermore, for each path of length from the root to a leaf, the sequence of labels along the path forms a permutation satisfying certain conditions. In particular, the labels of the root and leaf determine the first and last numbers in , respectively, i.e., and . We refer to as a permutation given by a leaf.
In the rest of this paper, we first present an algorithm in Section 3.1 for constructing trees in which every permutation given by a leaf satisfies
In this case, if the breakpoint distance between every pair of permutations in attains the maximum value , then from Jamshidpey et al. [9], any permutation given by a leaf at level is a median of . Consequently, the algorithm finds all medians of .
Next, in Section 3.2, we construct trees where every permutation given by a leaf satisfies
In this case, if the upper bound given in (1) is zero – a weaker condition than requiring all pairwise distances in to be maximal – then at least one of the permutations given by a leaf of the tree is a median of (cf. [2]). This allows us to determine the median value within a relatively smaller search space.
Finally, in Section 3.3, we introduce a modification of the algorithm from Section 3.1, providing additional flexibility to identify all medians of a general set of permutations. This is achieved by allowing permutations to contain a limited number of adjacencies not present in The upper bound in (1) ensures that all medians of are represented among the leaves of the tree constructed by this flexible algorithm.
3.1 Finding all medians of permutations with maximum pairwise distance to each other
Let denote the identity permutation in , and let be a permutation such that . We first describe the algorithm for the case of two permutations, and , and later extend it to permutations.
For each , we construct a tree whose root is labeled by . We denote the root of this tree by . The root has children, denoted by . The label of each child is a number in , such that if , then . In other words, there is a bijection between the set and . By convention, we fix this bijection so that is an increasing function of ; in particular, and are the smallest and largest numbers in , respectively.
Each vertex , for , has children, denoted by , where , with . Moreover, if , then . Continuing this process, the parent of a vertex at level is the vertex . If
then its children are , for
where .
Again, if , then . Since this is a finite process, it results in a labeled tree for each as the label of the root, with .
More precisely, the sequence of labels along every -path, where is a leaf at level , represents a permutation such that Since , we have , meaning that we can identify all geodesic points of and when . Furthermore, the number of permutations in is equal to the number of -paths of length in all trees. An example is illustrated in Figure 1.
For each , , we denote by the tree constructed as above, where the root is labeled by . We also define as the set of all these trees.
More generally, let be a set of permutations. Following the same steps just replacing with , we can construct labeled rooted trees, , such that the sequence of labels along each path, where is a leaf at level , forms a permutation satisfying . Therefore, if satisfies for any , then the set of all permutations given by leaves at level in the trees , for , is exactly the set of all medians of . We denote
and let be the set of all permutations that are given by a leaf of at level . Moreover, let
Each vertex of a tree is a sequence where each , , is a number between and . To construct a child vertex and its label from its parent and the parent’s label, we define the following operation. Given a sequence of symbols (e.g., numbers) and a symbol , we define the operation as a new sequence of symbols . We emphasize that in the above algorithm and and are natural numbers in . For each fixed tree of , we denote by the ordered sequence of labels assigned to the vertices along the path for a vertex in . Observe that is a sequence of digits, where each digit is between and , and all digits are distinct. We denote by the set of labels appearing in , and let . Additionally, we define to be the set of all vertices of the tree at level , for , considering the root at level zero. Using these notations, the tree construction process for permutations is described in Algorithm 1. These notations will also be used in the subsequent sections.
Note that we can also view the tree construction in suffix-tree terms, as follows. Each tree has root label , and every internal node that spells a prefix branches to a child if and only if is adjacent to in at least one genome in , and has not yet appeared in the prefix.
Remark 3 (Finding permutations with maximum distance of a set ).
Given , denote by the complement set of , for . Note that, if we replace by in Algorithm 1, we obtain all permutations with maximum distance from , i.e, we find all permutations such that , for .
3.2 Finding all geodesic points for a general set of permutations
A segment of a set of adjacencies , for , is a maximal set of consecutive adjacencies of , i.e. it is a set
such that , for and . We often denote by , and write . We say that are the internal points of , and are the end points of . Generalizing the idea, the internal and end points of are defined by
Note that the above definitions do not depend on a specific choice of , that is, the definitions remain intact if we replace by any for which .
Now consider the case where satisfies , that is, . We can apply a similar idea as in the case of maximum distance, but now with some restrictions. From [9], a permutation if and only if . As a result, if is a segment of , then the ordered sequence of digits must appear in the ordered sequence of labels of the -paths with length . In order for this to hold, first note that no internal point of can be a label of the root. In fact, if , then there exist and with and in . Therefore, if is the label of the root, any permutation given by a leaf at the level will contain either or (but not both), and thus cannot satisfy . This implies that if , then can only be a label of an internal vertex of the tree. Moreover, since , the vertex of the label equal to will have exactly one child. Therefore, for any segment , either or should appear before in for any leaf . To ensure the condition , it follows that each segment , if a vertex has label and is not in (or the opposite, and is not in ), then must have exactly one child with label (or ).
To describe the tree construction process, for a given segment and , we denote by the other end point of and by the unique point (number) such that adjacency . In the case where , for each we construct a rooted tree with the root label . At each level , a vertex is a child of . Now, if then has children defined as follows. If and , for some segment , then has exactly one child with label . Otherwise, its children are , for
where in the same way that if , then . After a finite number of steps, we construct trees such that for each leaf at the level , gives a permutation satisfying .
We can generalize this idea to a set of permutations . Following the same steps, just replacing with and with , we construct labeled rooted trees , such that for each leaf at the level , the sequence corresponds to a permutation with . Denote by
and define to be the set of all permutations given by a leaf of , and let
For a set where the upper bound in (1) is equal to zero (recall that this condition is weaker than requiring all permutations in to be at maximum pairwise distance), from [2], there exists at least one that is a median of . More precisely, in this case, any such that
is a median of . Thus, in addition to finding some medians of , this algorithm also finds the median value of efficiently. The tree construction process is described in Algorithm 2.
Note that, if is a set of permutations such that , for any , then . Therefore, in Algorithm 2, the condition
and
does not hold, and hence the algorithm proceeds directly to the“else” branch. In this case, Algorithm 2 yields exactly the same output as Algorithm 1. Furthermore, for a general set of permutations , the tree produced by Algorithm 1 contains, as a subgraph, the tree generated by Algorithm 2, for all . The main properties of these subtrees are:
-
No internal point of can be used as the label of the root, as previously noted;
-
For any path starting at the root, once the path reaches one of the end points of a segment , say , the path continues without branching until it reaches the other end point of , namely .
3.3 An algorithm to find all medians of a general set of permutations
As seen in Theorem 1, , for , is an upper bound for the number of adjacencies of any median outside . To apply this result to independent random permutations, namely , recall that a sequence of random variables converges in probability to a random variable , as goes to infinity, if for any , . We know that is very small, with high probability. More explicitly, from [5], we know that
in probability, for any sequence diverging to , such that , as . Therefore, if we consider the flexibility of using adjacencies out of in Algorithm 1, then we obtain all permutations with at most adjacencies out of , which we call freedom permutations. These permutations include all medians of with high probability. More generally, for a non-negative integer , we say a permutation is freedom with respect to , if . In this section, we extend our algorithm to construct freedom medians of for , i.e. the medians of that include at most adjacencies out of .
Let be a set of permutations such that . For every , we construct a tree with a root labeled by . We denote by the root of this tree. Now for each vertex of a tree we add a new parameter, namely, for each vertex we assign a number , with , that determines the number of children of vertex in the tree and the number of adjacencies that are not in and appear in , in the following way: if then has children, i.e., we construct sequences of labels by adding to the all possible numbers , from to , that did not appear in , and so we add the adjacency for each permutation that is being constructed from the sequence of labels, which also includes adjacencies that are not in . If then any descendent vertex of has and has the same number of children given by Algorithm 1, which is . So in this case, already contain adjacencies out of . For the root we assign . So the root has children, called , , …, , with , for , and , for . We assign if , or if . For each vertex , if then has children, called , for , with such that there is a bijection between set and . If then , and if then . On the other hand, if , then has children, namely , for with and in the way that if , then . Continuing this process, the parent of a vertex , in level is the vertex . If , then has children, called , with such that there is a bijection between set
and . If then , and if then . Now, in the case that , the children of are labeled by , as in Algorithm 1. After a finite number of steps, we construct the tree denoted by (or for general ) such that each permutation given by a leaf in the level is an freedom permutation (freedom permutation, respectively). We denote and We also let be the set of all permutations that are given by a leaf of in the level , and let The definitions of and are similar. The construction of such trees is described in the following Algorithm 3, for general .
Not only does Algorithm 3 give all freedom permutations but also for each possible permutation in the level , the parameter indicates the exact number of adjacencies of the permutation from outside of , e.g., if then adjacencies are from outside in . The trees constructed from Algorithm 3 have as subtrees the trees given by Algorithm 1, considering the same set of permutations. An example is given in Figure 2.
4 Experimental results
For each from 6 to 15, we performed 100 independent runs of Algorithm 3 on a set of three permutations, where and , are randomly generated such that . For each , we compute the mean of the normalized median value. As shown in Figure 3, the mean of the normalized median value increases with , and we expect it to approach as , which is in accordance with the last Theorem in [9].
Although not all the permutations in are medians, we find that non-median permutations in often have total distances close to the median value, indicating that they serve as good approximations. To formalize this, we define
and compute the mean of the proportion over 100 runs for each . Note that and for small we can consider the permutations in as approximate medians, since the total distance is close to the minimum total distance.
Figure 4 shows that a significant portion of permutations in have total distances concentrated near the minimum, indicating that while most are not exact medians, many are close approximations.
In fact, across all tested values of , the union consistently contains over 33% of , confirming the abundance of near-optimal solutions in the reduced space. For example, at , this set accounts for 61.9% of candidates; for , it still covers 36.4% despite the increase in size. Table 1 summarizes these proportions numerically for selected values of .
| Total proportion (%) | ||||
|---|---|---|---|---|
| 6 | 7.77% | 22.58% | 31.55% | 61.9% |
| 8 | 4.91% | 16.03% | 25.08% | 46.0% |
| 10 | 4.07% | 13.04% | 21.01% | 38.1% |
| 12 | 4.25% | 12.87% | 19.29% | 36.4% |
| 14 | 4.46% | 13.23% | 20.20% | 37.9% |
To analyze the proportion of medians far from the input set, we denote by for . Note that , for , and is empty set for . Figure 5 shows the mean of the ratio of , for . The results indicate that the proportion of medians far from all input permutations decreases rapidly, consistent with the observations and conjectures of Haghighi and Sankoff [8]. For example, when , over 91% of medians are within distance 3 of all inputs, and fewer than 0.8% exceed distance 6. This illustrates the general trend that most medians tend to remain close to at least one input genome.
However, as increases, the number of medians that lie far from all inputs also grows. Table 2 reports the proportion of medians lying in for values of near , which corresponds to the breakpoint distance of a “midpoint” genome – that is, one that draws approximately one-third of its adjacencies from each of the three input genomes. As expected, the proportion of medians with distance at least from all inputs is either zero or negligible across all tested values of , reflecting the rarity of truly equidistant medians. Still, for slightly smaller values such as , , or , the proportion increases noticeably. For instance, when , the set , consisting of medians at distance at least 6 from all three inputs, contains more than 11% of all medians, and contains over 41%. These medians are still far from each input genome – at least 5 breakpoints away – yet appear with consistent frequency, indicating a non-negligible presence near the midpoint region as increases.
| 7 | % | - | ||
| 8 | - | |||
| 9 | - | |||
| 10 | - | |||
| 11 | - | |||
| 12 | ||||
| 13 | ||||
| 14 | ||||
| 15 |
To quantify the algorithm’s efficiency, we compare the size of the reduced space to . Table 3 demonstrates that the number of permutations explored by Algorithm 3 represents only a tiny fraction of , yet suffices to find all exact and many near-optimal medians. For instance, when , the number of candidate medians generated by the algorithm – i.e., the search space – is less than of the full permutations.
| Total candidates | Total medians | (%) | ||
|---|---|---|---|---|
| 6 | 180.94 | 11.46 | 720 | 25.13% |
| 8 | 2981.10 | 82.86 | 40320 | 7.40% |
| 10 | 24824.90 | 513.54 | 3628800 | 0.68% |
| 12 | 353921.52 | 3387.82 | 0.07% | |
| 14 | 1882425.04 | 23815.54 | 0.002% | |
| 15 | 36659,718 | 48372.52 | 0.0028% |
Although Algorithm 3 was run with , allowing up to three adjacencies outside the union of the input adjacencies, we observed that such instances were extremely rare – and when they occurred, each involved only a single external adjacency. For example, at , only 0.04 medians per run (roughly 0.35% of all medians) included one adjacency not present in the union of the inputs. At , the mean was 0.48 per run (under 0.015%). For all other values of , there was no external adjacency. These results indicate that nearly all medians are already covered when we allow zero-freedom, that is, when every adjacency is drawn from the input genomes. In practice, therefore, we can use Algorithm 1, which corresponds to the zero-freedom version of Algorithm 3, to recover most of the medians while achieving substantial speed-ups. When no adjacency is taken from outside the union , the algorithm completes around 0.75 seconds and uses approximately 19.26 MB of memory for ; about 40 seconds and 104.16 MB for ; and around 3.5 minutes and 289.94 MB for (mean runtime and memory usage over 5 runs, measured on a 2.3 GHz quad-core Intel Core i7 machine with 32 GB RAM).
Finally, although it is known that, given a set of genomes , there may exist medians that do not contain all adjacencies in , we verified that for the input sets tested (), all medians returned by Algorithm 3 contained the full set of common adjacencies shared by the input genomes. As a result, Algorithm 3 produced the same set of medians as Algorithm 2 on all tested instances.
5 Conclusion
In this paper, we introduced a novel algorithmic framework to find all breakpoint medians of a given set of linear unsigned genomes. Unlike previous methods – which reduce the breakpoint median problem to an instance of the Traveling Salesman Problem (TSP) and return only a single median – our approach is based on the construction of rooted, labeled trees that allow us to find all medians, along with a substantial number of near-medians. Each path of length from the root to a leaf encodes a unique permutation, and the tree structure is designed to efficiently capture the combinatorial space in which medians reside.
This structural strategy provides a new perspective on the median problem. It not only allows us to find all medians in exponential time, but also to systematically explore a constrained and meaningful subset of the permutation space. This is particularly valuable for comparative genomics, where the goal is often to infer an ancestral genome that minimizes evolutionary distance to the observed genomes. Having access to the entire set of medians makes it possible to evaluate and compare them based on additional biological or statistical criteria, such as similarity to known ancestral features or consistency with gene orientation and synteny.
From a theoretical point of view, we demonstrated that our method finds the exact median value, even in cases where prior methods could not. Experimentally, we showed that the number of candidate permutations generated by our trees is a vanishingly small fraction of the full symmetric group (e.g., less than 0.0028% of ), yet this restricted space reliably captures all medians and a large portion of near-optimal solutions. In particular, we found that a substantial fraction of permutations in the output tree fall into , indicating that many are either exact or high-quality approximate medians. We also observed that even when allowing up to three adjacencies outside the input set, the inclusion of such external adjacencies was extremely rare, often occurring in fewer than 1% of medians.
Finally, we investigated how far medians tend to lie from all inputs using the decomposition. While truly equidistant medians are rare, we found that a non-negligible proportion of medians are located near the theoretical midpoint region. Moreover, we observed that most medians are relatively close to the input permutations, an observation that aligns with theoretical results in the literature [8, 9, 4]. This suggests a layered structure in the space of medians that could be exploited for further biological modeling and inference.
While our work focuses on the breakpoint median problem for unsigned unichromosomal genomes, the algorithm and underlying methodology are not limited to this setting. The core tree-based construction and median search strategy naturally extend to more general models, including signed permutations and multichromosomal genomes. Overall, our method not only offers a new algorithmic contribution but also opens up a range of possibilities for deeper combinatorial and biological analysis of breakpoint medians and their role in gene order phylogeny.
References
- [1] Sylvia Boyd and Maryam Haghighi. A fast method for large-scale multichromosomal breakpoint median problems. Journal of Bioinformatics and Computational Biology, 10(01):1240008, 2012. doi:10.1142/S0219720012400082.
- [2] David Bryant. The complexity of the breakpoint median problem. Centre de recherches mathematiques, 1998.
- [3] Alberto Caprara. The reversal median problem. INFORMS Journal on Computing, 15(1):93–113, 2003. doi:10.1287/IJOC.15.1.93.15155.
- [4] Poly H da Silva, Arash Jamshidpey, and David Sankoff. Sampling gene adjacencies and geodesic points of random genomes. In RECOMB International Workshop on Comparative Genomics, pages 189–210. Springer, 2024. doi:10.1007/978-3-031-58072-7_10.
- [5] Poly H da Silva, Arash Jamshidpey, and David Sankoff. On the number of breakpoint medians of random genomes. preprint (submitted), 2025.
- [6] Pedro Feijão and João Meidanis. SCJ: a variant of breakpoint distance for which sorting, genome median and genome halving problems are easy. In International Workshop on Algorithms in Bioinformatics, pages 85–96. Springer, 2009. doi:10.1007/978-3-642-04241-6_8.
- [7] G Fertin, A Labarre, I Rusu, E Tannier, and S Vialette. Combinatorics of genome rearrangements. The MIT Press, 2009.
- [8] Maryam Haghighi and David Sankoff. Medians seek the corners, and other conjectures. BMC Bioinformatics, 13(19):S5, 2012. doi:10.1186/1471-2105-13-S19-S5.
- [9] Arash Jamshidpey, Aryo Jamshidpey, and David Sankoff. Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints. BMC Genomics, 15(6):S3, 2014.
- [10] Arash Jamshidpey and David Sankoff. Phase change for the accuracy of the median value in estimating divergence time. BMC Bioinformatics, 14(15):S7, 2013. doi:10.1186/1471-2105-14-S15-S7.
- [11] Caroline Anne Larlee, Chunfang Zheng, and David Sankoff. Near-medians that avoid the corners; a combinatorial probability approach. BMC Genomics, 15(6):S1, 2014.
- [12] Mona Meghdari Miardan, Arash Jamshidpey, and David Sankoff. Escape from parsimony of a double-cut-and-join genome evolution process. Journal of Computational Biology, 30(2):118–130, 2023. doi:10.1089/CMB.2021.0468.
- [13] David Sankoff and Mathieu Blanchette. The median problem for breakpoints in comparative genomics. Computing and Combinatorics, pages 251–263, 1997. doi:10.1007/BFB0045092.
- [14] David Sankoff, Gopalakrishnan Sundaram, and John Kececioglu. Steiner points in the space of genome rearrangements. International Journal of Foundations of Computer Science, 7(01):1–9, 1996. doi:10.1142/S0129054196000026.
- [15] Eric Tannier, Chunfang Zheng, and David Sankoff. Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics, 10(1):120, 2009. doi:10.1186/1471-2105-10-120.
- [16] Andrew Wei Xu. The median problems on linear multichromosomal genomes: Graph representation and fast exact solutions. Journal of Computational Biology, 17(9):1195–1211, 2010. doi:10.1089/CMB.2010.0106.
- [17] João Paulo Pereira Zanetti, Priscila Biller, and João Meidanis. Median approximations for genomes modeled as matrices. Bulletin of Mathematical Biology, 78:786–814, 2016.
- [18] Chunfang Zheng and David Sankoff. On the pathgroups approach to rapid small phylogeny. BMC Bioinformatics, 12(1):S4, 2011. doi:10.1186/1471-2105-12-S1-S4.
Appendix A Proof of Theorem 1
Proof.
For a permutation and , let To ease the notation, we let . Let . Then
As is a median of , we have
Hence,
| (2) |
where the last inequality holds because , for any and .
