Identifying Breakpoint Median Genomes: A Branching Algorithm Approach

da Silva, Poly H.; Jamshidpey, Arash; Sankoff, David

doi:10.4230/LIPIcs.WABI.2025.18


Poly H. da Silva
Columbia University, New York, NY, USA

Arash Jamshidpey (corresponding author)
University of California at Berkeley, Berkeley, CA, USA

David Sankoff
University of Ottawa, Ottawa, Ontario, Canada

Abstract

Genome comparison often involves quantifying dissimilarities between genomes with identical gene sets, commonly using breakpoints – points where adjacent genes in one genome are not adjacent in another. The concept of a median genome, used for comparison of multiple genomes, aims to find a genome that minimizes the total distance to all genomes in a given set. While median genomes are useful for extracting common genomic information and estimating ancestral traits, the existence of multiple divergent medians raises concerns about their accuracy in reflecting the true ancestor. The median problem is known to be NP-hard, particularly for unichromosomal genomes, and solving it becomes increasingly challenging under different genome distance models. In this work, we introduce a novel branching algorithm to efficiently find all breakpoint medians of $k$ linear unichromosomal genomes, represented as unsigned permutations. This algorithm constructs a rooted labeled tree, where the sequence of labels along each complete ray defines a genome, providing a structured and efficient way to explore the space of candidate medians by narrowing the search to a well-defined and significantly smaller subset of the permutation space. We validate our approach with experiments on randomly generated sets of three permutations. The results show that our method successfully finds the exact medians and also identifies many near-optimal approximations. Our experiments further show that most medians lie relatively close to the input permutations, in agreement with prior theoretical results.

Keywords and phrases:

Breakpoint distance, median genomes, phylogeny reconstruction, random permutations

2012 ACM Subject Classification:

Mathematics of computing $\rightarrow$ Combinatorial algorithms; Mathematics of computing $\rightarrow$ Permutations and combinations; Applied computing $\rightarrow$ Computational genomics


Event:

25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Editors:

Broňa Brejová and Rob Patro

Series and Publisher:

Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

Comparing genomes with the same syntenic blocks involves computing their dissimilarities. This dissimilarity is often quantified by identifying breakpoints, points at which genes are adjacent in one genome but not in the other. Introduced formally by Sankoff and Blanchette in 1997 [13], the total number of breakpoints serves as a metric for dissimilarity. To compare multiple genomes, we can use the concept of median. Given a set of three or more genomes $X=\{g_{1},...,g_{k}\}$ and a distance $d$ , a median for the set $X$ is a genome that minimizes the total distance function $d_{T}(\cdot):=\sum_{i=1}^{k}d(g_{i},\cdot)$ . The concept of the median was first employed by Sankoff et al. [14] in 1996 within the context of evolutionary gene order models. Motivated by the search for ancestral genomic information and its applications to small phylogeny problems, the median problem has since attracted significant attention [2, 3, 15, 7, 6, 16, 17].

However, the complexity of the median problem varies with different genome distances, often proving to be NP-hard [3, 15, 7], particularly for unichromosomal genomes. For instance, the breakpoint median problem was shown to be NP-hard by Bryant [2] for linear unichromosomal genomes. Moreover, identifying a median whose adjacency set is contained within the union of the adjacency sets of the input genomes has also been proven to be NP-hard [2]. Building on the reduction of the median problem to the Traveling Salesman Problem (TSP) by Sankoff and Blanchette [13], Boyd and Haghighi [1] in 2012 used Concorde, a fast TSP solver, to obtain a fast algorithm for finding breakpoint medians of samples of large genomes.

While median genomes aim to extract common information among given genomes and estimate ancestral characteristics, the existence of multiple medians with considerable divergence [8] raises questions about their proximity to the true ancestor or their usability in providing ancestral insights. Additionally, determining which, if any, of these medians accurately reflects ancestral traits poses a significant challenge. In fact, Zheng and Sankoff [18], Jamshidpey and Sankoff [10] and Miardan et al. [12] showed that a median may fail to approximate the ancestor when genomes have evolved over a long time, whereas for genomes that have evolved over a shorter period medians may approximate the true ancestor.

To address the challenge of identifying relevant medians, we propose a novel branching algorithm for efficiently finding all breakpoint medians of $k$ linear unichromosomal genomes represented by unsigned permutations of length $n$ . This exponential algorithm constructs a rooted labeled tree, whose sequence of labels for each ray (a shortest path connecting the root to a leaf) with length $n-1$ determines a unichromosomal genome (represented as a permutation). The set of all such unichromosomal genomes contains all medians of the $k$ input genomes. We show that this tree construction reduces the median search space significantly compared to the full space of $n!$ permutations (see Table 3).

This paper is organized as follows. We begin by laying the foundation. In Section 2 we introduce the basic concepts of a genomic space with breakpoint distance and review some essential prior results in the literature. In Section 3, we delve into the methodology behind our branching algorithms designed to identify all medians within a given set of genomes. Subsequently, in Section 4, we provide empirical validation of our approach through a series of experiments using sets of three random permutations. A key contribution of this experimental section is that our method is able to compute the median value exactly, even in cases where it remained unknown in previous work. We examine how the median value behaves as the permutation length increases and analyze the distribution of approximate medians in the reduced search space generated by our algorithm. The results indicate that, although not all permutations in this space are true medians, a substantial proportion have total distances very close to the minimum, making them effective median approximations. We also explore how far the medians tend to be from the input permutations and find that most lie relatively close, an observation consistent with prior theoretical results [8, 9, 11, 4]. We conclude the paper with a discussion of our findings, their implications, and potential avenues for future research.

2 Breakpoint medians

We represent a unichromosomal genome by a permutation $\pi$ , which is a bijection on $[n]:=\{1,\cdots,n\}$ . In other words, a permutation $\pi$ can be represented by $\pi(1),\cdots,\pi(n)$ , which indicates a specific order on $[n]$ . When there is no risk of ambiguity, we often write $\pi_{i}$ instead of $\pi(i)$ , and denote $\pi:=\pi_{1}...\pi_{n}$ . We define the set of adjacencies of $\pi$ as $\mathcal{A}_{\pi}=\{\{\pi_{i},\pi_{i+1}\}\mid 1\leq i\leq n-1\}$ , where each adjacency is treated as an unordered pair. Let $S_{n}$ denote the set of all permutations of length $n$ . Given $x,y\in S_{n}$ , we denote by $\mathcal{A}_{x,y}:=\mathcal{A}_{x}\cap\mathcal{A}_{y}$ the set of all common adjacencies of $x$ and $y$ . For a set $X\subset S_{n}$ , we also denote by $\mathcal{A}_{X}:=\bigcap_{x\in X}\mathcal{A}_{x}$ the set of all common adjacencies of permutations in $X$ . The breakpoint (bp) distance between $x$ and $y$ is defined by $d(x,y):=n-1-|\mathcal{A}_{x,y}|$ .
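The definitions above translate directly into a few lines of code. The following Python sketch is only an illustration: the function names `adjacencies` and `bp_distance` are hypothetical, and permutations are assumed to be given as tuples of the integers $1,\dots,n$.

```python
def adjacencies(perm):
    """Set A_pi of unordered adjacencies {pi_i, pi_{i+1}} of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance d(x, y) = n - 1 - |A_x intersect A_y|."""
    assert sorted(x) == sorted(y), "both genomes must carry the same gene set"
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


identity = (1, 2, 3, 4, 5, 6)
x = (2, 4, 6, 3, 1, 5)
print(bp_distance(identity, x))               # the two genomes share no adjacency
print(bp_distance(identity, identity[::-1]))  # a reversal keeps every adjacency
```

Because adjacencies are unordered pairs, a permutation and its reversal are at distance zero in this sketch, which is consistent with the definition of $\mathcal{A}_{\pi}$.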

The breakpoint distance is neither a geodesic distance nor an edit distance, and for this reason the notion of partial geodesics was introduced by Jamshidpey et al. [9]. We can consider the breakpoint distance as a generalized edit distance that determines the parsimonious (shortest) paths of transforming one permutation to another, but with many missing points in the parsimonious path. In other words, in edit distances the length of every jump from a point in the parsimonious path to its closest point in the same path is one, while in generalized edit distances such as the breakpoint distance this length may be bigger. A partial geodesic [9] between $x$ and $y$ is a maximal chain $x=\pi_{0},\pi_{1},...,\pi_{k-1},\pi_{k}=y$ in $S_{n}$ such that $\sum_{i=0}^{k-1}d(\pi_{i},\pi_{i+1})=d(x,y)$ . We denote by $\overline{[x,y]}$ the set of all permutations lying on partial geodesics connecting $x,y\in S_{n}$ , and call them geodesic points of $x$ and $y$ .

For a set of three or more genomes $X=\{x_{1},...,x_{k}\}$ , a breakpoint median is a genome that minimizes the total distance function $d_{T}(\cdot,X):=\sum_{i=1}^{k}d(x_{i},\cdot)$ . The minimal value of $d_{T}$ is known as the median value of the set $X$ , denoted by $\mu(X)$ . The set of all breakpoint medians of $X$ is denoted by $M(X)$ .
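For very small $n$ the definition can be checked by exhaustive search. The sketch below is a brute-force baseline written for illustration only; it is not the branching algorithm of this paper, it enumerates all $n!$ permutations, and the names `total_distance` and `brute_force_medians` are hypothetical.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacencies of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_x intersect A_y|."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def total_distance(y, X):
    """d_T(y, X): sum of breakpoint distances from y to every genome in X."""
    return sum(bp_distance(x, y) for x in X)


def brute_force_medians(X):
    """Return (mu(X), M(X)) by scanning all n! permutations; small n only."""
    n = len(X[0])
    best, medians = None, []
    for y in permutations(range(1, n + 1)):
        d = total_distance(y, X)
        if best is None or d < best:
            best, medians = d, [y]      # strictly better: restart the list
        elif d == best:
            medians.append(y)           # tie: one more median
    return best, medians


X = [(1, 2, 3, 4, 5), (1, 3, 2, 4, 5), (5, 4, 1, 2, 3)]
mu, medians = brute_force_medians(X)
print(mu, len(medians))
```

Since a permutation and its reversal have the same adjacency set, the medians returned by this sketch always come in reversed pairs.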

For a set of permutations $X=\{x_{1},...,x_{k}\}$ in $S_{n}$ for which the pairwise breakpoint distances take the maximum value $n-1$ , Jamshidpey et al. [9] provide a necessary and sufficient condition for a permutation $m$ to be a median of $X$ , that is $m$ is a median of $X$ if and only if

\mathcal{A}_{m}\subset\bigcup\limits_{x\in X}\mathcal{A}_{x}.
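The following short derivation is added as a reading aid and is not taken from [9]. It uses only the definitions above, with $y$ an arbitrary permutation in $S_{n}$, and relies on the observation that pairwise distance $n-1$ means the sets $\mathcal{A}_{x_{i}}$ are pairwise disjoint.

```latex
\begin{aligned}
d_{T}(y,X) &= \sum_{i=1}^{k}\bigl(n-1-|\mathcal{A}_{y}\cap\mathcal{A}_{x_{i}}|\bigr) \\
           &= k(n-1)-\Bigl|\mathcal{A}_{y}\cap\bigcup_{x\in X}\mathcal{A}_{x}\Bigr|
              && \text{(the sets } \mathcal{A}_{x_{i}} \text{ are pairwise disjoint)} \\
           &\geq k(n-1)-|\mathcal{A}_{y}| \;=\; (k-1)(n-1).
\end{aligned}
```

Equality holds exactly when $\mathcal{A}_{y}\subset\bigcup_{x\in X}\mathcal{A}_{x}$, and the bound is attained, for instance by $y=x_{1}$. Under this assumption the median value is therefore $\mu(X)=(k-1)(n-1)$.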

Also from [9], a permutation $\pi$ is a geodesic point of two permutations $x$ and $y$ , and so it is a median of $\{x,y\}$ , if and only if $\mathcal{A}_{x,y}\subset\mathcal{A}_{\pi}\subset\mathcal{A}_{x}\cup\mathcal{A}_{y}$ . On the other hand, we do not have a result establishing a necessary and sufficient condition for a permutation to be a median of a general set of permutations $X$ . In fact, it is known that there may exist a median that does not contain all common adjacencies of permutations in $X$ , i.e., there may exist a median $m$ such that $\mathcal{A}_{X}\nsubseteq\mathcal{A}_{m},$ as in the example given by Bryant [2]. However, even though there may exist medians not containing all common adjacencies of elements of $X$ , there always exists at least one median with this property, namely, there exists at least one median $m$ such that $\mathcal{A}_{X}\subset\mathcal{A}_{m}$ (cf. [2]). In addition, for a general set of permutations $X$ , and perhaps counter-intuitively, it is not necessary that every adjacency of a median $m$ is an adjacency of at least one of the permutations in $X$ , that is, there may exist a median $m$ such that

\mathcal{A}_{m}\nsubseteq\bigcup\limits_{x\in X}\mathcal{A}_{x},

as is shown by Bryant [2]. However, [5] provides an upper bound for the maximum number of adjacencies of a median that are not in $\bigcup_{x\in X}\mathcal{A}_{x}$ as stated in Theorem 1 (whose proof is provided in Appendix A). Before the statement of Theorem 1, we need the following notation. Denote by $\mathcal{P}(S)$ the set of all subsets of a set or space $S$ . Let $X=\{x_{1},...,x_{k}\}\subset S_{n}$ and let $\mathcal{B}_{X}^{X}=\mathcal{B}_{x_{1},...,x_{k}}^{X}:=\mathcal{A}_{x_{1},...,x_{k}}$ . Then, for any $j=1,\cdots,k$ , let

\mathcal{B}_{x_{1},...,x_{j-1},x_{j+1},...,x_{k}}^{X}:=\mathcal{A}_{x_{1},...,x_{j-1},x_{j+1},...,x_{k}}\setminus\mathcal{B}_{x_{1},...,x_{k}}^{X}

Continuing this, for any $i_{1},\cdots,i_{r}\in[k]$ and $U=\{x_{i_{1}},...,x_{i_{r}}\}\subset X$ , we set

\mathcal{B}_{U}^{X}=\mathcal{B}_{x_{i_{1}},...,x_{i_{r}}}^{X}:=\mathcal{A}_{U}\setminus(\bigcup\limits_{U\subsetneqq V}\mathcal{B}_{V}^{X}).

In other words, $\mathcal{B}_{U}^{X}$ includes all adjacencies that are common in every $x\in U$ , but missing from every $y\in X\setminus U$ . We have the following theorem.

Theorem 1 ([5]).

Let $X=\{x_{1},...,x_{k}\}\subset S_{n}$ be such that

d_{T}(x_{k},X)=\min\limits_{i=1...k}d_{T}(x_{i},X),

and let $m\in M(X)$ . Then

|\mathcal{A}_{m}\setminus(\bigcup\limits_{i=1}^{k}\mathcal{A}_{x_{i}})|\leq\mathcal{O}_{n}(X):=\sum\limits_{r=2}^{k-1}(r-1)\sum\limits_{1\leq i_{1}<...<i_{r}<k}|\mathcal{B}_{x_{i_{1}},...,x_{i_{r}}}^{X}|.\qquad(1)

In particular, for $k=3$ , for any $m\in M(X)$

|\mathcal{A}_{m}\setminus\bigcup\limits_{i=1}^{3}\mathcal{A}_{x_{i}}|\leq\mathcal{O}_{n}(\{x_{1},x_{2},x_{3}\}):=|\mathcal{B}_{x_{1},x_{2}}^{X}|.

$\blacktriangleright$ Remark 2.

Note that the theorem makes use of the upper bound $d_{T}(m,X)\leq d_{T}(x_{k},X),$ for any $m\in M(X)$ . In particular, for $k=3$ , $d_{T}(x_{3},X)=\min_{i=1,2,3}d_{T}(x_{i},X)$ is equivalent to $d(x_{1},x_{2})=\max_{i,j}d(x_{i},x_{j})$ , which itself is equivalent to $|\mathcal{B}_{x_{1},x_{2}}^{X}|=\min_{i\neq j}|\mathcal{B}_{x_{i},x_{j}}^{X}|$ . In this case, $\mathcal{O}_{n}(X)=|\mathcal{B}_{x_{1},x_{2}}^{X}|=\min_{i\neq j}|\mathcal{B}_{x_{i},x_{j}}^{X}|$ implies that the upper bound is the number of adjacencies common to the pair of farthest genomes, i.e. $x_{1},x_{2}$ , which are missing from $x_{3}$ .

This upper bound significantly restricts the median search space, and by making use of it, we develop an algorithm to find all breakpoint medians of a general set of permutations.
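To make the bound in (1) concrete, the sketch below groups every adjacency by the exact set of input genomes containing it, which gives the classes $\mathcal{B}_{U}^{X}$, and then evaluates $\mathcal{O}_{n}(X)$. It is an illustrative reading of the definitions, not the authors' code; the names `support_classes` and `outside_bound` are hypothetical, and the genomes are reordered inside the function so that the last one has the smallest total distance, as Theorem 1 requires.

```python
from collections import defaultdict


def adjacencies(perm):
    """Set of unordered adjacencies of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def total_distance(y, X):
    """Sum of breakpoint distances from y to every genome in X."""
    a_y = adjacencies(y)
    return sum(len(y) - 1 - len(a_y & adjacencies(x)) for x in X)


def support_classes(X):
    """Map each index set U to B_U: adjacencies found in exactly the genomes indexed by U."""
    support = defaultdict(set)
    for idx, x in enumerate(X):
        for adj in adjacencies(x):
            support[adj].add(idx)
    classes = defaultdict(set)
    for adj, idxs in support.items():
        classes[frozenset(idxs)].add(adj)
    return dict(classes)


def outside_bound(X):
    """O_n(X) from (1), after placing a genome of minimal total distance last."""
    ordered = sorted(X, key=lambda x: total_distance(x, X), reverse=True)
    last = len(ordered) - 1
    return sum((len(U) - 1) * len(B)
               for U, B in support_classes(ordered).items()
               if last not in U and len(U) >= 2)


X = [(1, 2, 3, 4, 5, 6), (1, 2, 6, 3, 4, 5), (1, 2, 5, 6, 3, 4)]
print(outside_bound(X))
```

In this example every pair of genomes shares exactly one adjacency that the third genome lacks, so for $k=3$ the bound reduces to the size of a single class $\mathcal{B}_{x_{1},x_{2}}^{X}$.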

We first analyze exponential algorithms that construct specific rooted labeled trees, where each ray (a shortest path from the root to a leaf) of length $n-1$ corresponds to a permutation determined by the sequence of labels along the path. The set of all such label sequences includes all medians, thereby significantly reducing the search space. Specifically, the new median search space consists of the set of all leaves of these trees. While the volume of this new search space is exponential, it is negligible compared to the size of the permutation group of length $n$ .

3 An algorithm to find medians

To describe our algorithms, we first define the neighbors of a point (i.e., a number representing a syntenic block or gene) with respect to a given set of permutations. Specifically, for $X=\{x_{1},...,x_{k}\}\subset S_{n}$ and $i=1,\cdots,n$ , we define

\mathcal{N}_{X}(i)=\mathcal{N}_{x_{1},...,x_{k}}(i)=\{j:\{i,j\}\in\bigcup\limits_{l=1}^{k}\mathcal{A}_{x_{l}}\}.

Note that for each $i$ , $1\leq|\mathcal{N}_{X}(i)|\leq 2k$ . The equality $|\mathcal{N}_{X}(i)|=1$ holds when $i$ satisfies both of the following conditions: $i$ is either the first or last number in each permutation $x_{l}$ , for $1\leq l\leq k$ ; and $i$ is an extremity of an adjacency in $\mathcal{A}_{X}$ . On the other hand, the equality $|\mathcal{N}_{X}(i)|=2k$ holds when $i$ satisfies both of the following conditions: $i$ is neither the first nor the last number of any permutation $x_{l}$ , for $1\leq l\leq k$ ; and $i$ is not an extremity of an adjacency in $\mathcal{A}_{x_{l},x_{p}}$ , for any $l\neq p$ . If $X$ is such that $d(x_{l},x_{p})=n-1$ for any $l\neq p$ , then $k\leq|\mathcal{N}_{X}(i)|\leq 2k$ .
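A minimal sketch of the neighbor sets, assuming permutations are stored as tuples; the function name `neighbors` is hypothetical. The example input is the pair $id=123456$, $x=246315$ used later in Figure 1.

```python
def neighbors(X):
    """N_X(i) for every i: labels adjacent to i in at least one permutation of X."""
    nbrs = {i: set() for i in X[0]}
    for x in X:
        for a, b in zip(x, x[1:]):
            nbrs[a].add(b)   # adjacencies are unordered, so record both directions
            nbrs[b].add(a)
    return nbrs


X = [(1, 2, 3, 4, 5, 6), (2, 4, 6, 3, 1, 5)]
for i, nbr in sorted(neighbors(X).items()):
    print(i, sorted(nbr))
```

Each permutation contributes at most two neighbors to a point, which is where the bound $|\mathcal{N}_{X}(i)|\leq 2k$ comes from.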

Our main goal in this paper is to find all medians for a given set of permutations $X\subset S_{n}$ . To achieve this, we construct a family of labeled rooted trees of height $n-1$ with the following properties: Each vertex $v$ of the tree is assigned a label, denoted by $\ell(v)$ , which is a number between $1$ and $n$ . In order for two vertices, $u$ and $v$ , to be connected by an edge, it is necessary that $\ell(v)\in\mathcal{N}_{X}(\ell(u))$ . Furthermore, for each path of length $n-1$ from the root to a leaf, the sequence of labels along the path forms a permutation $y$ satisfying certain conditions. In particular, the labels of the root and leaf determine the first and last numbers in $y$ , respectively, i.e., $y_{1}$ and $y_{n}$ . We refer to $y$ as a permutation given by a leaf.

In the rest of this paper, we first present an algorithm in Section 3.1 for constructing trees in which every permutation $y$ given by a leaf satisfies

\mathcal{A}_{y}\subset\bigcup_{x\in X}\mathcal{A}_{x}.

In this case, if the breakpoint distance between every pair of permutations in $X$ attains the maximum value $n-1$ , then from Jamshidpey et al. [9], any permutation $y$ given by a leaf at level $n-1$ is a median of $X$ . Consequently, the algorithm finds all medians of $X$ .

Next, in Section 3.2, we construct trees where every permutation $y$ given by a leaf satisfies

\mathcal{A}_{X}\subset\mathcal{A}_{y}\subset\bigcup\limits_{x\in X}\mathcal{A}_{x}.

In this case, if the upper bound given in (1) is zero – a weaker condition than requiring all pairwise distances in $X$ to be maximal – then at least one of the permutations given by a leaf of the tree is a median of $X$ (cf. [2]). This allows us to determine the median value within a relatively smaller search space.

Finally, in Section 3.3, we introduce a modification of the algorithm from Section 3.1, providing additional flexibility to identify all medians of a general set of permutations. This is achieved by allowing permutations to contain a limited number of adjacencies not present in $\bigcup_{x\in X}\mathcal{A}_{x}.$ The upper bound in (1) ensures that all medians of $X$ are represented among the leaves of the tree constructed by this flexible algorithm.

3.1 Finding all medians of permutations with maximum pairwise distance to each other

Let $id$ denote the identity permutation in $S_{n}$ , and let $x\in S_{n}$ be a permutation such that $d(id,x)=n-1$ . We first describe the algorithm for the case of two permutations, $id$ and $x$ , and later extend it to $k>2$ permutations.

For each $i=1,\dots,n$ , we construct a tree whose root is labeled by $i$ . We denote the root of this tree by $\varnothing$ . The root $\varnothing$ has $|\mathcal{N}_{id,x}(i)|$ children, denoted by $\varnothing 1,\varnothing 2,\dots,\varnothing|\mathcal{N}_{id,x}(i)|$ . The label of each child is a number in $\mathcal{N}_{id,x}(i)$ , such that if $j\neq j^{\prime}$ , then $\ell(\varnothing j)\neq\ell(\varnothing j^{\prime})$ . In other words, there is a bijection between the set $\{\ell(\varnothing r)\mid 1\leq r\leq|\mathcal{N}_{id,x}(i)|\}$ and $\mathcal{N}_{id,x}(i)$ . By convention, we fix this bijection so that $\ell(\varnothing r)$ is an increasing function of $r$ ; in particular, $\ell(\varnothing 1)$ and $\ell(\varnothing|\mathcal{N}_{id,x}(i)|)$ are the smallest and largest numbers in $\mathcal{N}_{id,x}(i)$ , respectively.

Each vertex $\varnothing j_{1}$ , for $1\leq j_{1}\leq|\mathcal{N}_{id,x}(i)|$ , has $|\mathcal{N}_{id,x}(\ell(\varnothing j_{1}))\setminus\{i\}|$ children, denoted by $\varnothing j_{1}j_{2}$ , where $1\leq j_{2}\leq|\mathcal{N}_{id,x}(\ell(\varnothing j_{1}))\setminus\{i\}|$ , with $\ell(\varnothing j_{1}j_{2})\in\mathcal{N}_{id,x}(\ell(\varnothing j_{1}))\setminus\{i\}$ . Moreover, if $j_{2}\neq j_{2}^{\prime}$ , then $\ell(\varnothing j_{1}j_{2})\neq\ell(\varnothing j_{1}j_{2}^{\prime})$ . Continuing this process, the parent of a vertex $\varnothing j_{1}j_{2}\dots j_{l-1}j_{l}$ at level $l$ is the vertex $\varnothing j_{1}j_{2}\dots j_{l-1}$ . If

\mathcal{N}_{id,x}(\ell(\varnothing j_{1}j_{2}\dots j_{l-1}j_{l}))\setminus\{\ell(\varnothing),\ell(\varnothing j_{1}),\dots,\ell(\varnothing j_{1}j_{2}\dots j_{l-1})\}\neq\emptyset,

then its children are $\varnothing j_{1}j_{2}\dots j_{l-1}j_{l}j_{l+1}$ , for

1\leq j_{l+1}\leq|\mathcal{N}_{id,x}(\ell(\varnothing j_{1}j_{2}\dots j_{l-1}j_{l}))\setminus\{\ell(\varnothing),\ell(\varnothing j_{1}),\dots,\ell(\varnothing j_{1}j_{2}\dots j_{l-1})\}|,

where $\ell(\varnothing j_{1}j_{2}...j_{l-1}j_{l}j_{l+1})\in\mathcal{N}_{id,x}(\ell(\varnothing j_{1}j_{2}...j_{l-1}j_{l}))\setminus\{\ell(\varnothing),\ell(\varnothing j_{1}),...,\ell(\varnothing j_{1}j_{2}...j_{l-1})\}$ .

Again, if $j_{l+1}\neq j_{l+1}^{\prime}$ , then $\ell(\varnothing j_{1}j_{2}\dots j_{l}j_{l+1})\neq\ell(\varnothing j_{1}j_{2}\dots j_{l}j_{l+1}^{\prime})$ . Since this is a finite process, it results in a labeled tree for each $i$ as the label of the root, with $1\leq i\leq n$ .

More precisely, the sequence of labels along every $(\varnothing,u)$ -path, where $u$ is a leaf at level $n-1$ , represents a permutation $y$ such that $\mathcal{A}_{y}\subset\mathcal{A}_{id}\cup\mathcal{A}_{x}.$ Since $\mathcal{A}_{id}\cap\mathcal{A}_{x}=\emptyset$ , we have $y\in\overline{[id,x]}$ , meaning that we can identify all geodesic points of $id$ and $x$ when $d(id,x)=n-1$ . Furthermore, the number of permutations in $\overline{[id,x]}$ is equal to the number of $(\varnothing,u)$ -paths of length $n-1$ in all $n$ trees. An example is illustrated in Figure 1.

For each $i$ , $1\leq i\leq n$ , we denote by $\mathcal{T}_{id,x}^{i}$ the tree constructed as above, where the root is labeled by $i$ . We also define $\mathcal{T}_{id,x}:=\{\mathcal{T}_{id,x}^{i}\mid 1\leq i\leq n\}$ as the set of all these $n$ trees.

Figure 1: The representation of the tree $\mathcal{T}_{id,x}^{1}$ , for $x=246315$ , with its labels. In this example we have $\mathcal{N}_{id,x}(1)=\{2,3,5\}$ , $\mathcal{N}_{id,x}(2)=\{1,3,4\}$ , $\mathcal{N}_{id,x}(3)=\{1,2,4,6\}$ , $\mathcal{N}_{id,x}(4)=\{2,3,5,6\}$ , $\mathcal{N}_{id,x}(5)=\{1,4,6\}$ and $\mathcal{N}_{id,x}(6)=\{3,4,5\}$ . Also, $d(id,x)=n-1=5$ , and so each path from the root to a leaf in level $5$ constitutes a permutation in $\overline{[id,x]}$ . The list of permutations in $\overline{[id,x]}$ given by this tree is: $id=123456$ , $123465$ , $123645$ , $123654$ , $124365$ , $124563$ , $132456$ , $132465$ , $136542$ , $154236$ , $154632$ , $156432$ , $156423$ , $156342$ and $156324$ . These are all the permutations in $\overline{[id,x]}$ that start at $1$ . The bold edges represent the adjacencies of $id$ and the other edges represent the adjacencies of $x$ .

More generally, let $X=\{x_{1},...,x_{k}\}\subset S_{n}$ be a set of permutations. Following the same steps, just replacing $\mathcal{N}_{id,x}(i)$ with $\mathcal{N}_{x_{1},...,x_{k}}(i)$ , we can construct $n$ labeled rooted trees, $\mathcal{T}_{X}^{i}$ , such that the sequence of labels along each $(\varnothing,u)$ -path, where $u$ is a leaf at level $n-1$ , forms a permutation $y$ satisfying $\mathcal{A}_{y}\subset\bigcup_{l=1}^{k}\mathcal{A}_{x_{l}}$ . Therefore, if $X=\{x_{1},...,x_{k}\}\subset S_{n}$ satisfies $d(x_{l},x_{p})=n-1$ for any $l\neq p$ , then the set of all permutations given by leaves at level $n-1$ in the trees $\mathcal{T}_{X}^{i}$ , for $i=1,...,n$ , is exactly the set of all medians of $X$ . We denote

\mathcal{T}_{X}:=\{\mathcal{T}_{X}^{i};1\leq i\leq n\}

and let $Y(\mathcal{T}_{X}^{i})$ be the set of all permutations $y\in S_{n}$ that are given by a leaf of $\mathcal{T}_{X}^{i}$ at level $n-1$ . Moreover, let

Y(\mathcal{T}_{X}):=\bigcup\limits_{i=1}^{n}Y(\mathcal{T}_{X}^{i}).

Each vertex of a tree $\mathcal{T}_{X}^{i}$ is a sequence $\varnothing j_{1}j_{2}...j_{l}$ where each $j_{m}$ , $m=1,\dots,l$ , is a number between $1$ and $2|X|$ . To construct a child vertex and its label from its parent and the parent’s label, we define the following operation. Given a sequence of symbols $u=u_{1}\dots u_{l}$ (e.g., numbers) and a symbol $r$ , we define the operation $u\oplus r$ as a new sequence of symbols $u\oplus r:=u_{1}...u_{l}r$ . We emphasize that in the above algorithm $u_{1}=\varnothing$ and $u_{2},\dots,u_{l}$ and $r$ are natural numbers in $\{1,...,2|X|\}$ . For each fixed tree of $\mathcal{T}_{X}$ , we denote by $T_{u}(\mathcal{T}^{i}_{X})=T_{u}$ the ordered sequence of labels assigned to the vertices along the $(\varnothing,u)$ -path for a vertex $u$ in $\mathcal{T}_{X}^{i}$ . Observe that $T_{u}$ is a sequence of digits, where each digit is between $1$ and $n$ , and all digits are distinct. We denote by $dig(T_{u})$ the set of labels appearing in $T_{u}$ , and let $\mathcal{N}_{X}(T_{u}):=\mathcal{N}_{X}(\ell(u))$ . Additionally, we define $L_{j}(\mathcal{T}^{i}_{X})=L_{j}$ to be the set of all vertices of the tree $\mathcal{T}^{i}_{X}$ at level $j$ , for $0\leq j\leq n-1$ , considering the root at level zero. Using these notations, the tree construction process for $k\geq 2$ permutations is described in Algorithm 1. These notations will also be used in the subsequent sections.

Note that we can also view the tree construction in suffix-tree terms, as follows. Each tree $\mathcal{T}^{i}_{X}$ has root label $i$ , and every internal node that spells a prefix $(i,u_{1},\dots,u_{j})$ branches to a child $u_{j+1}$ if and only if $u_{j+1}$ is adjacent to $u_{j}$ in at least one genome in $X$ , and $u_{j+1}$ has not yet appeared in the prefix.

Algorithm 1 Gives permutations $y$ such that $\mathcal{A}_{y}\subset\bigcup\limits_{x\in X}\mathcal{A}_{x}$ .
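The construction of Section 3.1 can be sketched as a depth-first enumeration. The Python code below is written from the prose description rather than from the authors' implementation: it does not materialize the trees $\mathcal{T}_{X}^{i}$ but visits their vertices recursively, and the names `neighbors`, `leaf_permutations` and `all_leaf_permutations` are hypothetical.

```python
def neighbors(X):
    """N_X(i): every label adjacent to i in at least one permutation of X."""
    nbrs = {i: set() for i in X[0]}
    for x in X:
        for a, b in zip(x, x[1:]):
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs


def leaf_permutations(X, root):
    """Permutations spelled by root-to-leaf paths of length n-1 in the tree rooted at `root`."""
    nbrs, n = neighbors(X), len(X[0])
    found = []

    def extend(path, used):
        if len(path) == n:               # leaf at level n-1: a full permutation
            found.append(tuple(path))
            return
        # children: unused neighbors of the current label, in increasing order
        for j in sorted(nbrs[path[-1]] - used):
            extend(path + [j], used | {j})
        # a path whose last label has no unused neighbor ends here without output

    extend([root], {root})
    return found


def all_leaf_permutations(X):
    """Y(T_X): the union over all n root labels."""
    return [y for i in sorted(X[0]) for y in leaf_permutations(X, i)]


X = [(1, 2, 3, 4, 5, 6), (2, 4, 6, 3, 1, 5)]
from_one = leaf_permutations(X, 1)
print(len(from_one))
print(["".join(map(str, y)) for y in from_one])
```

For the input of Figure 1, the call with root label $1$ should reproduce the permutations listed in the figure caption. The recursion depth equals $n$, so an explicit stack would be preferable for long permutations.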

$\blacktriangleright$ Remark 3 (Finding permutations at maximum distance from a set $X$ ).

Given $X=\{x_{1},...,x_{k}\}\subset S_{n}$ , denote by $\overline{\mathcal{N}}_{X}(i)=\overline{\mathcal{N}}_{x_{1},...,x_{k}}(i):=[n]\setminus\mathcal{N}_{X}(i)$ the complement set of $\mathcal{N}_{X}(i)$ , for $1\leq i\leq n$ . Note that, if we replace $\mathcal{N}_{X}(\cdot)$ by $\overline{\mathcal{N}}_{X}(\cdot)$ in Algorithm 1, we obtain all permutations with maximum distance from $X$ , i.e., we find all permutations $y$ such that $d(y,x_{l})=n-1$ , for $1\leq l\leq k$ .
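The variant described in Remark 3 can be sketched by swapping the neighbor sets for their complements in the same depth-first enumeration. This is an illustrative sketch only; the name `max_distance_permutations` is hypothetical, and the point $i$ itself is removed from its complement set explicitly, since a label can never follow itself.

```python
def max_distance_permutations(X):
    """All y sharing no adjacency with any genome in X, i.e. d(y, x) = n - 1 for every x."""
    labels = sorted(X[0])
    n = len(labels)
    nbrs = {i: set() for i in labels}
    for x in X:
        for a, b in zip(x, x[1:]):
            nbrs[a].add(b)
            nbrs[b].add(a)
    # complement neighbor sets: labels never adjacent to i in any genome of X
    far = {i: set(labels) - nbrs[i] - {i} for i in labels}
    found = []

    def extend(path, used):
        if len(path) == n:
            found.append(tuple(path))
            return
        for j in sorted(far[path[-1]] - used):
            extend(path + [j], used | {j})

    for root in labels:
        extend([root], {root})
    return found


print(max_distance_permutations([(1, 2, 3, 4)]))
```

For the identity of length $4$ the only permutations avoiding the adjacencies $\{1,2\},\{2,3\},\{3,4\}$ are $2413$ and its reversal $3142$.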

3.2 Finding all geodesic points for a general set of permutations

A segment $s$ of a set of adjacencies $I\subset\mathcal{A}_{\pi}$ , for $\pi\in S_{n}$ , is a maximal set of consecutive adjacencies of $I$ , i.e. it is a set

s=\{\{\pi(r),\pi(r+1)\},\{\pi(r+1),\pi(r+2)\},\cdots,\{\pi(r+k-1),\pi(r+k)\}\}\subset I

such that $\{\pi(r-1),\pi(r)\},\{\pi(r+k),\pi(r+k+1)\}\notin I$ , for $r>1$ and $r+k<n$ . We often denote $s$ by $\|\pi(r),\cdots,\pi(r+k)\|$ , and write $s\hat{\in}I$ . We say that $Int(s):=\{\pi(r+1),\cdots,\pi(r+k-1)\}$ are the internal points of $s$ , and $End(s):=\{\pi(r),\pi(r+k)\}$ are the end points of $s$ . Generalizing the idea, the internal and end points of $I\subset\mathcal{A}_{\pi}$ are defined by

Int(I):=\bigcup_{s\hat{\in}I}Int(s),\ \ \ End(I):=\bigcup_{s\hat{\in}I}End(s).

Note that the above definitions do not depend on a specific choice of $\pi$ , that is, the definitions remain intact if we replace $\pi$ by any $\pi^{\prime}$ for which $I\subset\mathcal{A}_{\pi^{\prime}}$ .
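As an illustration of these definitions, the sketch below computes the segments of $\mathcal{A}_{X}$ together with their internal and end points by scanning the first genome of $X$. The function names are hypothetical, and each segment $\|\pi(r),\cdots,\pi(r+k)\|$ is represented as a tuple of its points in the order they appear in that genome.

```python
def adjacencies(perm):
    """Set of unordered adjacencies of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def segments(X):
    """Maximal runs of consecutive common adjacencies A_X, read along X[0]."""
    common = set.intersection(*(adjacencies(x) for x in X))
    pi = X[0]
    segs, current = [], [pi[0]]
    for a, b in zip(pi, pi[1:]):
        if frozenset((a, b)) in common:
            current.append(b)            # the run continues
        else:
            if len(current) > 1:         # close a run with at least one adjacency
                segs.append(tuple(current))
            current = [b]
    if len(current) > 1:
        segs.append(tuple(current))
    return segs


def internal_and_end_points(segs):
    """Int(A_X) and End(A_X) as sets of labels."""
    internal = {p for s in segs for p in s[1:-1]}
    ends = {p for s in segs for p in (s[0], s[-1])}
    return internal, ends


X = [(1, 2, 3, 4, 5, 6), (3, 4, 5, 1, 2, 6)]
segs = segments(X)
print(segs)
print(internal_and_end_points(segs))
```

Scanning a different genome of $X$ may list a segment in reversed order, but the sets of internal and end points do not change, in line with the remark that the definitions do not depend on the choice of $\pi$.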

Now consider the case where $x\in S_{n}$ satisfies $d(id,x)<n-1$ , that is, $\mathcal{A}_{id,x}\neq\emptyset$ . We can apply a similar idea as in the case of maximum distance, but now with some restrictions. From [9], a permutation $y\in\overline{[id,x]}$ if and only if $\mathcal{A}_{id,x}\subset\mathcal{A}_{y}\subset\mathcal{A}_{id}\cup\mathcal{A}_{x}$ . As a result, if $s=\|n_{0},...,n_{l}\|$ is a segment of $\mathcal{A}_{id,x}$ , then the ordered sequence of digits $n_{0}...n_{l}$ , or its reverse, must appear in the ordered sequence of labels of the $(\varnothing,u)$ -paths with length $n-1$ . In order for this to hold, first note that no internal point of $\mathcal{A}_{id,x}$ can be a label of the root. In fact, if $i\in Int(\mathcal{A}_{id,x})$ , then there exist $j$ and $j^{\prime}$ with $\{i,j\}$ and $\{i,j^{\prime}\}$ in $\mathcal{A}_{id,x}$ . Therefore, if $i$ is the label of the root, any permutation $y$ given by a leaf at the level $n-1$ will contain either $\{i,j\}$ or $\{i,j^{\prime}\}$ (but not both), and thus cannot satisfy $\mathcal{A}_{id,x}\subset\mathcal{A}_{y}$ . This implies that if $i\in Int(\mathcal{A}_{id,x})$ , then $i$ can only be a label of an internal vertex of the tree. Moreover, since $|\mathcal{N}_{id,x}(i)|=2$ , a vertex with label $i$ will have exactly one child. Therefore, for any segment $s=\|n_{0},...,n_{l}\|\hat{\in}\mathcal{A}_{id,x}$ , either $n_{0}$ or $n_{l}$ should appear before $n_{1},\dots,n_{l-1}$ in $T_{u}$ for any leaf $u$ . To ensure the condition $\mathcal{A}_{id,x}\subset\mathcal{A}_{y}$ , it follows that for each segment $s=\|n_{0},...,n_{l}\|\hat{\in}\mathcal{A}_{id,x}$ , if a vertex $v$ has label $\ell(v)=n_{0}$ and $n_{l}$ is not in $T_{v}$ (or the opposite, $\ell(v)=n_{l}$ and $n_{0}$ is not in $T_{v}$ ), then $v$ must have exactly one child $v\oplus 1$ with label $\ell(v\oplus 1)=n_{1}$ (or $\ell(v\oplus 1)=n_{l-1}$ ).

To describe the tree construction process, for a given segment $s\hat{\in}\mathcal{A}_{X}$ and $j\in End(s)$ , we denote by $\overline{j}$ the other end point of $s$ and by $j^{*}$ the unique point (number) such that the adjacency $\{j,j^{*}\}\in s$ . In the case where $d(id,x)<n-1$ , for each $i\in[n]\setminus Int(\mathcal{A}_{id,x})$ we construct a rooted tree $\overline{\mathcal{T}}^{i}_{id,x}$ with the root label $i$ . At each level $l$ , a vertex $\varnothing j_{1}...j_{l-1}j_{l}$ is a child of $\varnothing j_{1}...j_{l-1}$ . Now, if $\mathcal{N}_{id,x}(\ell(\varnothing j_{1}...j_{l-1}j_{l}))\setminus\{\ell(\varnothing),\ell(\varnothing j_{1}),...,\ell(\varnothing j_{1}j_{2}...j_{l-1})\}\neq\emptyset$ then $\varnothing j_{1}...j_{l-1}j_{l}$ has children defined as follows. If, for some segment $s\hat{\in}\mathcal{A}_{id,x}$ , $\ell(\varnothing j_{1}...j_{l})\in End(s)$ and $\overline{\ell(\varnothing j_{1}...j_{l})}\notin dig(T_{\varnothing j_{1}...j_{l}})$ , then $\varnothing j_{1}...j_{l}$ has exactly one child $\varnothing j_{1}...j_{l}1$ with label $\ell(\varnothing j_{1}...j_{l}1)=\ell(\varnothing j_{1}...j_{l-1}j_{l})^{*}$ . Otherwise, its children are $\varnothing j_{1}...j_{l}j_{l+1}$ , for

1\leq j_{l+1}\leq|\mathcal{N}_{id,x}(\ell(\varnothing j_{1}...j_{l-1}j_{l}))\setminus\{\ell(\varnothing),\ell(\varnothing j_{1}),...,\ell(\varnothing j_{1}j_{2}...j_{l-1})\}|,

where $\ell(\varnothing j_{1}j_{2}...j_{l-1}j_{l}j_{l+1})\in\mathcal{N}_{id,x}(\ell(\varnothing j_{1}...j_{l-1}j_{l}))\setminus\{\ell(\varnothing),\ell(\varnothing j_{1}),...,\ell(\varnothing j_{1}j_{2}...j_{l-1})\},$ in the same way that if $j_{l+1}\neq j_{l+1}^{\prime}$ , then $\ell(\varnothing j_{1}j_{2}...j_{l}j_{l+1})\neq\ell(\varnothing j_{1}j_{2}...j_{l}j_{l+1}^{\prime})$ . After a finite number of steps, we construct $|[n]\setminus Int(\mathcal{A}_{id,x})|$ trees such that for each leaf $u$ at the level $n-1$ , $T_{u}$ gives a permutation $y$ satisfying $\mathcal{A}_{id,x}\subset\mathcal{A}_{y}\subset\mathcal{A}_{id}\cup\mathcal{A}_{x}$ .

We can generalize this idea to a set of $k$ permutations $X=\{x_{1},...,x_{k}\}\subset S_{n}$ . Following the same steps, just replacing $\mathcal{N}_{id,x}(i)$ with $\mathcal{N}_{X}(i)$ and $\mathcal{A}_{id,x}$ with $\mathcal{A}_{X}$ , we construct $|[n]\setminus Int(\mathcal{A}_{X})|$ labeled rooted trees $\overline{\mathcal{T}}^{i}_{X}$ , such that for each leaf $u$ at the level $n-1$ , the sequence $T_{u}$ corresponds to a permutation $y$ with $\mathcal{A}_{X}\subset\mathcal{A}_{y}\subset\bigcup\limits_{l=1}^{k}\mathcal{A}_{x_{l}}$ . Denote by

\overline{\mathcal{T}}_{X}=\{\overline{\mathcal{T}}_{X}^{i};i\in[n]\setminus Int(\mathcal{A}_{X})\}

and define $Y(\overline{\mathcal{T}}_{X}^{i})$ to be the set of all permutations $y\in S_{n}$ given by a leaf of $\overline{\mathcal{T}}_{X}^{i}$ , and let

Y(\overline{\mathcal{T}}_{X}):=\bigcup\limits_{i\in[n]\setminus Int(\mathcal{A}_{X})}Y(\overline{\mathcal{T}}_{X}^{i}).

For a set $X$ where the upper bound in (1) is equal to zero (recall that this condition is weaker than requiring all permutations in $X$ to be at maximum pairwise distance), from [2], there exists at least one $y\in Y(\overline{\mathcal{T}}_{X})$ that is a median of $X$ . More precisely, in this case, any $y^{\prime}\in Y(\overline{\mathcal{T}}_{X})$ such that

\sum\limits_{l=1}^{k}d(x_{l},y^{\prime})=\underset{y\in Y(\overline{\mathcal{T}}_{X})}{\min}\left(\sum\limits_{l=1}^{k}d(x_{l},y)\right),

is a median of $X$ . Thus, in addition to finding some medians of $X$ , this algorithm also finds the median value of $X$ efficiently. The tree construction process is described in Algorithm 2.
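The selection step in the display above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names (`adjacencies`, `bp_distance`, `total_distance`, `best_candidates`) are hypothetical, and the breakpoint distance is taken to be $n-1$ minus the number of shared adjacencies, which is how it enters the proof in Appendix A. The input set is the one from the caption of Figure 2 and is used here only to exercise the distance computation.

```python
def adjacencies(p):
    """Set of unordered adjacent pairs of a linear unsigned permutation."""
    return {frozenset(pair) for pair in zip(p, p[1:])}


def bp_distance(x, y):
    """Breakpoint distance: n - 1 minus the number of shared adjacencies."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def total_distance(y, X):
    """Sum of breakpoint distances from y to every permutation in X."""
    return sum(bp_distance(x, y) for x in X)


def best_candidates(candidates, X):
    """Minimum total distance over the candidates, and the candidates attaining it."""
    scored = [(total_distance(y, X), y) for y in candidates]
    best = min(score for score, _ in scored)
    return best, [y for score, y in scored if score == best]


# Input set taken from the caption of Figure 2.
X = [(1, 2, 3, 4, 5), (5, 2, 3, 4, 1), (2, 3, 1, 4, 5)]
# Two candidate permutations mentioned in the same caption.
candidates = [(1, 4, 5, 2, 3), (1, 3, 2, 5, 4)]
value, winners = best_candidates(candidates, X)
```

Only the candidates produced by the trees are scored, so the cost of this step is proportional to the number of leaves rather than to $n!$.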

Algorithm 2 Gives all permutations $y$ such that $\mathcal{A}_{x_{1},...,x_{k}}\subset\mathcal{A}_{y}\subset\bigcup\limits_{l=1}^{k}\mathcal{A}_{x_{l}}$.

Note that, if $X=\{x_{1},...,x_{k}\}$ is a set of permutations such that $d(x_{l},x_{p})=n-1$ , for any $l\neq p$ , then $\mathcal{A}_{X}=\emptyset$ . Therefore, in Algorithm 2, the condition

$\ell(u)\in End(\mathcal{A}_{X})$ and $\overline{\ell(u)}\notin dig(T_{u})$

does not hold, and hence the algorithm proceeds directly to the “else” branch. In this case, Algorithm 2 yields exactly the same output as Algorithm 1. Furthermore, for a general set of permutations $X=\{x_{1},...,x_{k}\}$, the tree $\mathcal{T}^{i}_{X}$ produced by Algorithm 1 contains, as a subgraph, the tree $\overline{\mathcal{T}}_{X}^{i}$ generated by Algorithm 2, for all $i\in[n]\setminus Int(\mathcal{A}_{X})$. The main properties of these subtrees are:

$\blacksquare$ No internal point of $\mathcal{A}_{X}$ can be used as the label of the root, as previously noted;

$\blacksquare$ For any path starting at the root, once the path reaches one of the end points of a segment $s\hat{\in}\mathcal{A}_{X}$, say $j$, the path continues without branching until it reaches the other end point of $s$, namely $\overline{j}$.
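The two properties above suggest a compact depth-first enumeration. The following Python sketch is not the paper's Algorithm 2 verbatim; it assumes that $\mathcal{N}_{X}(i)$ is the set of neighbours of $i$ in $\bigcup_{x\in X}\mathcal{A}_{x}$ and that an internal point is a label with two neighbours in $\mathcal{A}_{X}$. All function names are hypothetical, and `brute_force` is included only as a reference check for very small $n$.

```python
from itertools import permutations


def adjacencies(p):
    """Set of unordered adjacent pairs of a linear unsigned permutation."""
    return {frozenset(pair) for pair in zip(p, p[1:])}


def neighbour_map(adjs, n):
    """Map each label 1..n to the set of labels adjacent to it in adjs."""
    nb = {i: set() for i in range(1, n + 1)}
    for a in adjs:
        i, j = tuple(a)
        nb[i].add(j)
        nb[j].add(i)
    return nb


def constrained_permutations(X):
    """All y whose adjacencies contain A_X and lie in the union of the A_x."""
    n = len(X[0])
    sets = [adjacencies(x) for x in X]
    common, union = set.intersection(*sets), set.union(*sets)
    nb_common, nb_union = neighbour_map(common, n), neighbour_map(union, n)
    results = []

    def extend(path, used):
        if len(path) == n:
            results.append(tuple(path))
            return
        last = path[-1]
        # An unfinished common segment forces the next label (no branching).
        forced = nb_common[last] - used
        options = forced if forced else nb_union[last] - used
        for j in sorted(options):
            extend(path + [j], used | {j})

    for root in range(1, n + 1):
        if len(nb_common[root]) < 2:  # internal points cannot label a root
            extend([root], {root})
    return results


def brute_force(X):
    """Reference answer obtained by filtering all n! permutations."""
    sets = [adjacencies(x) for x in X]
    common, union = set.intersection(*sets), set.union(*sets)
    return [p for p in permutations(range(1, len(X[0]) + 1))
            if common <= adjacencies(p) <= union]
```

Paths that run out of admissible labels before reaching length $n$ are simply abandoned, which corresponds to leaves above level $n-1$ that yield no permutation.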

3.3 An algorithm to find all medians of a general set of permutations

As seen in Theorem 1, $\mathcal{O}_{n}(X)$, for $X\subset S_{n}$, is an upper bound for the number of adjacencies of any median $m\in M(X)$ outside $\cup_{x\in X}\mathcal{A}_{x}$. To apply this result to $k$ independent random permutations, namely $\xi_{1},...,\xi_{k}\in S_{n}$, recall that a sequence of random variables $(Z_{n})_{n\in\mathbb{Z}_{+}}$ converges in probability to a random variable $Z$, as $n$ goes to infinity, if for any $\varepsilon>0$, $\mathbb{P}(|Z_{n}-Z|>\varepsilon)\rightarrow 0$. We know that $\mathcal{O}_{n}(\{\xi_{1},...,\xi_{k}\})$ is very small, with high probability. More explicitly, from [5], we know that

\frac{\mathcal{O}_{n}(X)}{a_{n}}\rightarrow 0,\ n\rightarrow\infty,

in probability, for any sequence $(a_{n})_{n\in\mathbbm{N}}$ diverging to $\infty$ , such that $a_{n}/n\rightarrow 0$ , as $n\rightarrow\infty$ . Therefore, if we consider the flexibility of using $\mathcal{O}_{n}(X)$ adjacencies out of $\cup_{x\in X}\mathcal{A}_{x}$ in Algorithm 1, then we obtain all permutations $y$ with at most $\mathcal{O}_{n}(X)$ adjacencies out of $\cup_{x\in X}\mathcal{A}_{x}$ , which we call $\mathcal{O}_{n}(X)-$ freedom permutations. These permutations include all medians of $X$ with high probability. More generally, for a non-negative integer $\alpha\geq 0$ , we say a permutation $\pi$ is $\alpha-$ freedom with respect to $X\subset S_{n}$ , if $|\mathcal{A}_{\pi}\setminus\cup_{x\in X}\mathcal{A}_{x}|\leq\alpha$ . In this section, we extend our algorithm to construct $\alpha-$ freedom medians of $X\subset S_{n}$ for $\alpha=\mathcal{O}_{n}(X)$ , i.e. the medians of $X$ that include at most $\alpha$ adjacencies out of $\cup_{x\in X}\mathcal{A}_{x}$ .
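The definition of an $\alpha-$freedom permutation translates directly into a set difference. The sketch below uses hypothetical helper names (`adjacencies`, `freedom`, `is_alpha_freedom`) and the input set from the caption of Figure 2.

```python
def adjacencies(p):
    """Set of unordered adjacent pairs of a linear unsigned permutation."""
    return {frozenset(pair) for pair in zip(p, p[1:])}


def freedom(pi, X):
    """Number of adjacencies of pi lying outside the union of the A_x."""
    union = set().union(*(adjacencies(x) for x in X))
    return len(adjacencies(pi) - union)


def is_alpha_freedom(pi, X, alpha):
    """True when pi uses at most alpha adjacencies outside the union."""
    return freedom(pi, X) <= alpha


X = [(1, 2, 3, 4, 5), (5, 2, 3, 4, 1), (2, 3, 1, 4, 5)]
inside = freedom((1, 4, 5, 2, 3), X)    # every adjacency comes from some input
one_out = freedom((2, 1, 5, 4, 3), X)   # only {1,5} is outside the union
```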

Let $X=\{x_{1},...,x_{k}\}\subset S_{n}$ be a set of permutations such that $\mathcal{O}_{n}(X)\neq 0$. For every $i=1,...,n$, we construct a tree with a root labeled by $i$. We denote by $\varnothing$ the root of this tree. Each vertex $u$ of the tree now carries a new parameter $\tau_{u}$, with $0\leq\tau_{u}\leq\mathcal{O}_{n}(X)$, that determines both the number of children of $u$ and the number of adjacencies outside $\cup_{x\in X}\mathcal{A}_{x}$ that appear in $T_{u}$, in the following way. If $\tau_{u}\neq 0$ then $u$ has $n-|dig(T_{u})|$ children, i.e., we construct $n-|dig(T_{u})|$ sequences of labels by appending to $T_{u}$ every number $j$, from $1$ to $n$, that does not appear in $T_{u}$; this adds the adjacency $\{\ell(u),j\}$ to each permutation $y$ being constructed from the sequence of labels, and this adjacency may lie outside $\cup_{x\in X}\mathcal{A}_{x}$. If $\tau_{u}=0$ then any descendant vertex $v$ of $u$ has $\tau_{v}=0$ and $u$ has the same number of children as in Algorithm 1, which is $|\mathcal{N}_{X}(\ell(u))\setminus dig(T_{u})|$. In this case, $T_{u}$ already contains $\mathcal{O}_{n}(X)$ adjacencies outside $\cup_{x\in X}\mathcal{A}_{x}$. For the root we assign $\tau_{\varnothing}=\mathcal{O}_{n}(X)$. So the root has $n-1$ children, called $\varnothing 1$, $\varnothing 2$, …, $\varnothing(n-1)$, with $\ell(\varnothing j)=j$, for $j<i$, and $\ell(\varnothing j)=j+1$, for $j\geq i$. We assign $\tau_{\varnothing j}=\tau_{\varnothing}-1$ if $\ell(\varnothing j)\notin\mathcal{N}_{X}(i)$, or $\tau_{\varnothing j}=\tau_{\varnothing}$ if $\ell(\varnothing j)\in\mathcal{N}_{X}(i)$. 
For each vertex $\varnothing j$, if $\tau_{\varnothing j}\neq 0$ then $\varnothing j$ has $n-2$ children, called $\varnothing jj^{\prime}$, for $1\leq j^{\prime}\leq n-2$, with $\ell(\varnothing jj^{\prime})\in[n]\setminus\{\ell(\varnothing),\ell(\varnothing j)\}$ such that there is a bijection between the set $\{\ell(\varnothing jj^{\prime}):1\leq j^{\prime}\leq n-2\}$ and $[n]\setminus\{\ell(\varnothing),\ell(\varnothing j)\}$. If $\ell(\varnothing jj^{\prime})\notin\mathcal{N}_{X}(\ell(\varnothing j))$ then $\tau_{\varnothing jj^{\prime}}=\tau_{\varnothing j}-1$, and if $\ell(\varnothing jj^{\prime})\in\mathcal{N}_{X}(\ell(\varnothing j))$ then $\tau_{\varnothing jj^{\prime}}=\tau_{\varnothing j}$. On the other hand, if $\tau_{\varnothing j}=0$, then $\varnothing j$ has $|\mathcal{N}_{X}(\ell(\varnothing j))\setminus\{i\}|$ children, namely $\varnothing jj^{\prime}$, for $1\leq j^{\prime}\leq|\mathcal{N}_{X}(\ell(\varnothing j))\setminus\{i\}|$, with $\tau_{\varnothing jj^{\prime}}=0$ and $\ell(\varnothing jj^{\prime})\in\mathcal{N}_{X}(\ell(\varnothing j))\setminus\{i\}$, in such a way that if $j^{\prime}\neq j^{\prime\prime}$, then $\ell(\varnothing jj^{\prime})\neq\ell(\varnothing jj^{\prime\prime})$. Continuing this process, the parent of a vertex $\varnothing j_{1}j_{2}...j_{l-1}j_{l}$ in level $l$ is the vertex $\varnothing j_{1}j_{2}...j_{l-1}$. If $\tau_{\varnothing j_{1}j_{2}...j_{l-1}j_{l}}\neq 0$, then $\varnothing j_{1}j_{2}...j_{l-1}j_{l}$ has $n-|dig(T_{\varnothing j_{1}j_{2}...j_{l-1}j_{l}})|$ children, called $\varnothing j_{1}j_{2}...j_{l}j_{l+1}$, with $\ell(\varnothing j_{1}j_{2}...j_{l-1}j_{l}j_{l+1})\in[n]\setminus dig(T_{\varnothing j_{1}j_{2}...j_{l-1}j_{l}})$ such that there is a bijection between the set

\{\ell(\varnothing j_{1}j_{2}...j_{l}j_{l+1}):1\leq j_{l+1}\leq n-|dig(T_{\varnothing j_{1}j_{2}...j_{l-1}j_{l}})|\}

and $[n]\setminus dig(T_{\varnothing j_{1}j_{2}...j_{l-1}j_{l}})$. If $\ell(\varnothing j_{1}j_{2}...j_{l}j_{l+1})\notin\mathcal{N}_{X}(\ell(\varnothing j_{1}j_{2}...j_{l}))$ then $\tau_{\varnothing j_{1}j_{2}...j_{l}j_{l+1}}=\tau_{\varnothing j_{1}j_{2}...j_{l}}-1$, and if $\ell(\varnothing j_{1}j_{2}...j_{l}j_{l+1})\in\mathcal{N}_{X}(\ell(\varnothing j_{1}j_{2}...j_{l}))$ then $\tau_{\varnothing j_{1}j_{2}...j_{l}j_{l+1}}=\tau_{\varnothing j_{1}j_{2}...j_{l}}$. Now, in the case that $\tau_{\varnothing j_{1}j_{2}...j_{l}}=0$, the children of $\varnothing j_{1}j_{2}...j_{l}$ are labeled by $\mathcal{N}_{X}(\ell(\varnothing j_{1}j_{2}...j_{l}))\setminus dig(T_{\varnothing j_{1}j_{2}...j_{l}})$, as in Algorithm 1. After a finite number of steps, we construct the tree denoted by $\mathcal{T}_{X,\mathcal{O}}^{i}$ (or $\mathcal{T}_{X,\alpha}^{i}$ for general $\alpha\geq 0$) such that each permutation given by a leaf in the level $n-1$ is an $\mathcal{O}_{n}(X)-$freedom permutation ($\alpha-$freedom permutation, respectively). We denote $\mathcal{T}_{X,\mathcal{O}}:=\{\mathcal{T}_{X,\mathcal{O}}^{i}:\ 1\leq i\leq n\},$ and $\mathcal{T}_{X,\alpha}:=\{\mathcal{T}_{X,\alpha}^{i}:\ 1\leq i\leq n\}.$ We also let $Y(\mathcal{T}_{X,\mathcal{O}}^{i})$ be the set of all permutations $y\in S_{n}$ that are given by a leaf of $\mathcal{T}_{X,\mathcal{O}}^{i}$ in the level $n-1$, and let $Y(\mathcal{T}_{X,\mathcal{O}}):=\cup_{i=1}^{n}Y(\mathcal{T}_{X,\mathcal{O}}^{i}).$ The definitions of $Y(\mathcal{T}_{X,\alpha}^{i})$ and $Y(\mathcal{T}_{X,\alpha})$ are similar. The construction of such trees is described in the following Algorithm 3, for general $\alpha\geq 0$.

Not only does Algorithm 3 give all $\mathcal{O}_{n}(X)-$ freedom permutations but also for each possible permutation in the level $n-1$ , the parameter $\tau$ indicates the exact number of adjacencies of the permutation from outside of $\cup_{x\in X}\mathcal{A}_{x}$ , e.g., if $\tau_{u}=i$ then $(\mathcal{O}_{n}(X)-i)$ adjacencies are from outside in $T_{u}$ . The trees constructed from Algorithm 3 have as subtrees the trees given by Algorithm 1, considering the same set of permutations. An example is given in Figure 2.
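The role of the budget $\tau$ can be illustrated with a short recursive sketch in Python. It is a simplified reading of the construction described above, not the paper's Algorithm 3 itself: the function name `alpha_freedom_permutations` is hypothetical, and membership of $j$ in $\mathcal{N}_{X}(\ell(u))$ is assumed to mean that $\{\ell(u),j\}$ belongs to $\cup_{x\in X}\mathcal{A}_{x}$.

```python
def adjacencies(p):
    """Set of unordered adjacent pairs of a linear unsigned permutation."""
    return {frozenset(pair) for pair in zip(p, p[1:])}


def alpha_freedom_permutations(X, alpha):
    """Return (permutation, remaining budget tau) for every alpha-freedom permutation."""
    n = len(X[0])
    union = set().union(*(adjacencies(x) for x in X))
    leaves = []

    def extend(path, used, tau):
        if len(path) == n:
            leaves.append((tuple(path), tau))
            return
        last = path[-1]
        for j in range(1, n + 1):
            if j in used:
                continue
            if frozenset((last, j)) in union:
                # Adjacency taken from the inputs: the budget is unchanged.
                extend(path + [j], used | {j}, tau)
            elif tau > 0:
                # Adjacency outside the union: spend one unit of freedom.
                extend(path + [j], used | {j}, tau - 1)

    for root in range(1, n + 1):
        extend([root], {root}, alpha)
    return leaves
```

For a leaf returned with budget `tau`, the permutation uses exactly `alpha - tau` adjacencies outside the union, matching the interpretation of $\tau$ given in the text. With `alpha = 0` the recursion only follows adjacencies from the inputs.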

Figure 2: Representation of $\mathcal{T}_{X,\mathcal{O}}^{1}$, for $X=\{id=12345,52341,23145\}$, where $\mathcal{O}_{5}(X)=1$. The subtree induced by the blue edges is $\overline{\mathcal{T}}_{X}^{1}$ and the subtree induced by the blue and red edges is $\mathcal{T}_{X}^{1}$. The median value of $X$ is $\mu(X)=4$ and $14523$ is the unique median given by the tree $\mathcal{T}_{X,\mathcal{O}}^{1}$ which is different from the input permutations. In this example, all medians given by the tree $\mathcal{T}_{X,\mathcal{O}}^{1}$ are actually in the subtree $\overline{\mathcal{T}}_{X}^{1}$. Also, $13254$ is an example of a permutation in the set $\{y\in S_{n};\mathcal{A}_{X}\subset\mathcal{A}_{y}\subset\bigcup_{x\in X}\mathcal{A}_{x}\}$ that is not a median for $X$.

Algorithm 3 $\alpha$-freedom permutations w.r.t. $X$.

4 Experimental results

For each $n$ from 6 to 15, we performed 100 independent runs of Algorithm 3 on a set $X=\{x_{1},x_{2},x_{3}\}\subset S_{n}$ of three permutations, where $x_{1}=\mathrm{id}$ and $x_{2}$, $x_{3}$ are randomly generated such that $\mathcal{O}_{n}(X)\leq 3$. For each $n$, we compute the mean of the normalized median value. As shown in Figure 3, the mean of the normalized median value increases with $n$, and we expect it to approach $2$ as $n\to\infty$, which is in accordance with the last Theorem in [9].

Figure 3: The blue line represents the mean normalized median value for sets of three permutations $\{id,x_{2},x_{3}\}$, where $x_{2}$ and $x_{3}$ are randomly and independently sampled (also independently for each run) such that $\mathcal{O}_{n}(\{id,x_{2},x_{3}\})\leq 3$. The red and green lines indicate the minimum and maximum normalized median values observed across 100 independent runs of Algorithm 3, for each genome size $n=6,...,15$.

Although not all the permutations in $Y(\mathcal{T}_{X,\mathcal{O}})$ are medians, we find that non-median permutations in $Y(\mathcal{T}_{X,\mathcal{O}})$ often have total distances close to the median value, indicating that they serve as good approximations. To formalize this, we define

K_{j}(\mathcal{T}_{X,\mathcal{O}}):=\{y\in Y(\mathcal{T}_{X,\mathcal{O}});\sum\limits_{x\in X}d(y,x)-\mu(X)=j\},

and compute the mean of the proportion $|K_{j}(\mathcal{T}_{X,\mathcal{O}})|/|Y(\mathcal{T}_{X,\mathcal{O}})|$ over 100 runs for each $6\leq n\leq 15$ . Note that $K_{0}(\mathcal{T}_{X,\mathcal{O}})=M(X)$ and for small $j>0$ we can consider the permutations in $K_{j}(\mathcal{T}_{X,\mathcal{O}})$ as approximate medians, since the total distance is close to the minimum total distance.
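The proportions $|K_{j}|/|Y|$ for a single run can be computed as in the following sketch. The function `excess_proportions` and its helpers are hypothetical names, the median value `mu` is assumed to be known (for instance as the minimum total distance over the candidates), and the three candidates below are chosen by hand from the example of Figure 2 rather than produced by the trees.

```python
from collections import Counter


def adjacencies(p):
    """Set of unordered adjacent pairs of a linear unsigned permutation."""
    return {frozenset(pair) for pair in zip(p, p[1:])}


def total_distance(y, X):
    """Sum over x in X of the breakpoint distance n - 1 - |A_x ∩ A_y|."""
    ay = adjacencies(y)
    return sum(len(x) - 1 - len(adjacencies(x) & ay) for x in X)


def excess_proportions(candidates, X, mu):
    """Map j to |K_j| / |Y|, where K_j holds candidates with total distance mu + j."""
    counts = Counter(total_distance(y, X) - mu for y in candidates)
    return {j: counts[j] / len(candidates) for j in sorted(counts)}


X = [(1, 2, 3, 4, 5), (5, 2, 3, 4, 1), (2, 3, 1, 4, 5)]
candidates = [(1, 2, 3, 4, 5), (1, 4, 5, 2, 3), (1, 3, 2, 5, 4)]
props = excess_proportions(candidates, X, mu=4)
```

Averaging these dictionaries over independent runs gives the kind of quantity plotted in Figure 4.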

Figure 4 shows that a significant portion of permutations in $Y(\mathcal{T}_{X,\mathcal{O}})$ have total distances concentrated near the minimum, indicating that while most are not exact medians, many are close approximations.

Figure 4: The mean of $|K_{i}(\mathcal{T}_{X,\mathcal{O}})|/|Y(\mathcal{T}_{X,\mathcal{O}})|$, for each $6\leq n\leq 15$.

In fact, across all tested values of $n$ , the union $K_{0}\cup K_{1}\cup K_{2}$ consistently contains over 33% of $Y(\mathcal{T}_{X,\mathcal{O}})$ , confirming the abundance of near-optimal solutions in the reduced space. For example, at $n=6$ , this set accounts for 61.9% of candidates; for $n=12$ , it still covers 36.4% despite the increase in size. Table 1 summarizes these proportions numerically for selected values of $n$ .

Table 1: Proportion of permutations in $K_{0}\cup K_{1}\cup K_{2}$ over $Y(\mathcal{T}_{X,\mathcal{O}})$ for selected values of $n$.

$n$	$|K_{0}|/|Y(\mathcal{T}_{X,\mathcal{O}})|$	$|K_{1}|/|Y(\mathcal{T}_{X,\mathcal{O}})|$	$|K_{2}|/|Y(\mathcal{T}_{X,\mathcal{O}})|$	Total proportion (%)
6	7.77%	22.58%	31.55%	61.9%
8	4.91%	16.03%	25.08%	46.0%
10	4.07%	13.04%	21.01%	38.1%
12	4.25%	12.87%	19.29%	36.4%
14	4.46%	13.23%	20.20%	37.9%

To analyze the proportion of medians far from the input set, we define $M_{i}:=\{m\in M(X);\ d(m,x)\geq i\text{ for all }x\in X\}$. Note that $M_{0}=M(X)$, $M_{i}\subset M_{l}$ for $l<i$, and $M_{i}$ is the empty set for $i>2n/3$. Figure 5 shows the mean of the ratio $|M_{i}|/|M(X)|$, for $6\leq n\leq 15$. The results indicate that the proportion of medians far from all input permutations decreases rapidly, consistent with the observations and conjectures of Haghighi and Sankoff [8]. For example, when $n=12$, over 91% of medians are within distance 3 of all inputs, and fewer than 0.8% exceed distance 6. This illustrates the general trend that most medians tend to remain close to at least one input genome.
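Reading the definition of $M_{i}$ as requiring distance at least $i$ from every input, a median belongs to $M_{i}$ exactly when its distance to the closest input is at least $i$. The sketch below computes the ratios $|M_{i}|/|M(X)|$ under that reading; the function names are hypothetical and the two medians are taken from the example of Figure 2.

```python
def adjacencies(p):
    """Set of unordered adjacent pairs of a linear unsigned permutation."""
    return {frozenset(pair) for pair in zip(p, p[1:])}


def bp_distance(x, y):
    """Breakpoint distance: n - 1 minus the number of shared adjacencies."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def far_median_proportions(medians, X):
    """List whose i-th entry is |M_i| / |M(X)| for i = 0, ..., n - 1."""
    n = len(X[0])
    # Distance from each median to its closest input permutation.
    closest = [min(bp_distance(m, x) for x in X) for m in medians]
    return [sum(c >= i for c in closest) / len(medians) for i in range(n)]


X = [(1, 2, 3, 4, 5), (5, 2, 3, 4, 1), (2, 3, 1, 4, 5)]
medians = [(1, 2, 3, 4, 5), (1, 4, 5, 2, 3)]  # both have total distance 4
ratios = far_median_proportions(medians, X)
```

The resulting list is non-increasing in $i$, in line with the inclusion $M_{i}\subset M_{l}$ for $l<i$.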

Figure 5: The mean of $|M_{i}|/|M(X)|$, for each $6\leq n\leq 15$.

However, as $n$ increases, the number of medians that lie far from all inputs also grows. Table 2 reports the proportion of medians lying in $M_{i}$ for values of $i$ near $\left\lfloor\frac{2n}{3}\right\rfloor$ , which corresponds to the breakpoint distance of a “midpoint” genome – that is, one that draws approximately one-third of its adjacencies from each of the three input genomes. As expected, the proportion of medians with distance at least $\left\lfloor\frac{2n}{3}\right\rfloor$ from all inputs is either zero or negligible across all tested values of $n$ , reflecting the rarity of truly equidistant medians. Still, for slightly smaller values such as $\left\lfloor\frac{2n}{3}\right\rfloor-1$ , $\left\lfloor\frac{2n}{3}\right\rfloor-2$ , or $\left\lfloor\frac{2n}{3}\right\rfloor-3$ , the proportion increases noticeably. For instance, when $n=14$ , the set $M_{6}$ , consisting of medians at distance at least 6 from all three inputs, contains more than 11% of all medians, and $M_{5}$ contains over 41%. These medians are still far from each input genome – at least 5 breakpoints away – yet appear with consistent frequency, indicating a non-negligible presence near the midpoint region as $n$ increases.

Table 2: Mean proportion of medians in $M_{i}$ for values of $i$ near the midpoint distance $\left\lfloor\frac{2n}{3}\right\rfloor$.

$n$	$i=\left\lfloor\frac{2n}{3}\right\rfloor$	$i=\left\lfloor\frac{2n}{3}\right\rfloor-1$	$i=\left\lfloor\frac{2n}{3}\right\rfloor-2$	$i=\left\lfloor\frac{2n}{3}\right\rfloor-3\geq 4$
7	$0$	$3.8\%$	$42.54\%$	-
8	$0$	$0.57\%$	$27.59\%$	-
9	$0$	$9.34\%$	$50.74\%$	-
10	$0$	$2.97\%$	$26.85\%$	-
11	$0$	$15.83\%$	$52.02\%$	-
12	$0$	$0.08\%$	$6.94\%$	$36.54\%$
13	$0$	$2.06\%$	$18.16\%$	$53.56\%$
14	$0$	$0.76\%$	$11.90\%$	$41.63\%$
15	$0$	$0$	$5.29\%$	$29.11\%$

To quantify the algorithm’s efficiency, we compare the size of the reduced space to $n!$ . Table 3 demonstrates that the number of permutations explored by Algorithm 3 represents only a tiny fraction of $S_{n}$ , yet suffices to find all exact and many near-optimal medians. For instance, when $n=15$ , the number of candidate medians generated by the algorithm – i.e., the search space – is less than $0.003\%$ of the full $15!\approx 1.31\times 10^{12}$ permutations.

Table 3: Reduction in search space by Algorithm 3 for selected values of $n$ (with $\mathcal{O}_{n}(X)\leq 3$).

$n$	Total candidates	Total medians	$n!$	$\frac{\text{candidates}}{n!}$ (%)
6	180.94	11.46	720	25.13%
8	2981.10	82.86	40320	7.40%
10	24824.90	513.54	3628800	0.68%
12	353921.52	3387.82	$4.79\times 10^{8}$	0.07%
14	1882425.04	23815.54	$8.72\times 10^{10}$	0.002%
15	36659718	48372.52	$1.31\times 10^{12}$	0.0028%
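The last column of Table 3 is the ratio of the mean number of candidates to $n!$, expressed as a percentage. A minimal check of two rows, using a hypothetical helper `search_space_percent`:

```python
from math import factorial


def search_space_percent(candidates, n):
    """Candidates as a percentage of all n! permutations."""
    return 100.0 * candidates / factorial(n)


row_6 = round(search_space_percent(180.94, 6), 2)      # n = 6 row of Table 3
row_10 = round(search_space_percent(24824.90, 10), 2)  # n = 10 row of Table 3
```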

Although Algorithm 3 was run with $\mathcal{O}_{n}(X)\leq 3$, allowing up to three adjacencies outside the union of the input adjacencies, we observed that medians using such adjacencies were extremely rare, and when they occurred, each involved only a single external adjacency. For example, at $n=6$, only 0.04 medians per run (roughly 0.35% of all medians) included one adjacency not present in the union of the inputs. At $n=12$, the mean was 0.48 per run (under 0.015%). For all other values of $n\leq 15$, there was no external adjacency. These results indicate that nearly all medians are already covered when we allow zero-freedom, that is, when every adjacency is drawn from the input genomes. In practice, therefore, we can use Algorithm 1, which corresponds to the zero-freedom version of Algorithm 3, to recover most of the medians while achieving substantial speed-ups. When no adjacency is taken from outside the union $\cup_{x\in X}\mathcal{A}_{x}$, the algorithm completes in around 0.75 seconds and uses approximately 19.26 MB of memory for $n=10$; about 40 seconds and 104.16 MB for $n=13$; and around 3.5 minutes and 289.94 MB for $n=15$ (mean runtime and memory usage over 5 runs, measured on a 2.3 GHz quad-core Intel Core i7 machine with 32 GB RAM).

Finally, although it is known that, given a set of genomes $X$ , there may exist medians that do not contain all adjacencies in $\mathcal{A}_{X}$ , we verified that for the input sets tested ( $6\leq n\leq 15$ ), all medians returned by Algorithm 3 contained the full set of common adjacencies $\mathcal{A}_{X}$ shared by the input genomes. As a result, Algorithm 3 produced the same set of medians as Algorithm 2 on all tested instances.

5 Conclusion

In this paper, we introduced a novel algorithmic framework to find all breakpoint medians of a given set of linear unsigned genomes. Unlike previous methods – which reduce the breakpoint median problem to an instance of the Traveling Salesman Problem (TSP) and return only a single median – our approach is based on the construction of rooted, labeled trees that allow us to find all medians, along with a substantial number of near-medians. Each path of length $n-1$ from the root to a leaf encodes a unique permutation, and the tree structure is designed to efficiently capture the combinatorial space in which medians reside.

This structural strategy provides a new perspective on the median problem. It not only allows us to find all medians in exponential time, but also to systematically explore a constrained and meaningful subset of the permutation space. This is particularly valuable for comparative genomics, where the goal is often to infer an ancestral genome that minimizes evolutionary distance to the observed genomes. Having access to the entire set of medians makes it possible to evaluate and compare them based on additional biological or statistical criteria, such as similarity to known ancestral features or consistency with gene orientation and synteny.

From a theoretical point of view, we demonstrated that our method finds the exact median value, even in cases where prior methods could not. Experimentally, we showed that the number of candidate permutations generated by our trees is a vanishingly small fraction of the full symmetric group (e.g., less than 0.003% of $S_{15}$), yet this restricted space reliably captures all medians and a large portion of near-optimal solutions. In particular, we found that a substantial fraction of permutations in the output tree fall into $K_{0}\cup K_{1}\cup K_{2}$, indicating that many are either exact or high-quality approximate medians. We also observed that even when allowing up to three adjacencies outside the input set, the inclusion of such external adjacencies was extremely rare, occurring in fewer than 1% of medians.

Finally, we investigated how far medians tend to lie from all inputs using the $M_{i}$ decomposition. While truly equidistant medians are rare, we found that a non-negligible proportion of medians are located near the theoretical midpoint region. Moreover, we observed that most medians are relatively close to the input permutations, an observation that aligns with theoretical results in the literature [8, 9, 4]. This suggests a layered structure in the space of medians that could be exploited for further biological modeling and inference.

While our work focuses on the breakpoint median problem for unsigned unichromosomal genomes, the algorithm and underlying methodology are not limited to this setting. The core tree-based construction and median search strategy naturally extend to more general models, including signed permutations and multichromosomal genomes. Overall, our method not only offers a new algorithmic contribution but also opens up a range of possibilities for deeper combinatorial and biological analysis of breakpoint medians and their role in gene order phylogeny.

References

[1] Sylvia Boyd and Maryam Haghighi. A fast method for large-scale multichromosomal breakpoint median problems. Journal of Bioinformatics and Computational Biology, 10(01):1240008, 2012. doi:10.1142/S0219720012400082.
[2] David Bryant. The complexity of the breakpoint median problem. Centre de recherches mathematiques, 1998.
[3] Alberto Caprara. The reversal median problem. INFORMS Journal on Computing, 15(1):93–113, 2003. doi:10.1287/IJOC.15.1.93.15155.
[4] Poly H da Silva, Arash Jamshidpey, and David Sankoff. Sampling gene adjacencies and geodesic points of random genomes. In RECOMB International Workshop on Comparative Genomics, pages 189–210. Springer, 2024. doi:10.1007/978-3-031-58072-7_10.
[5] Poly H da Silva, Arash Jamshidpey, and David Sankoff. On the number of breakpoint medians of random genomes. preprint (submitted), 2025.
[6] Pedro Feijão and João Meidanis. SCJ: a variant of breakpoint distance for which sorting, genome median and genome halving problems are easy. In International Workshop on Algorithms in Bioinformatics, pages 85–96. Springer, 2009. doi:10.1007/978-3-642-04241-6_8.
[7] G Fertin, A Labarre, I Rusu, E Tannier, and S Vialette. Combinatorics of genome rearrangements. The MIT Press, 2009.
[8] Maryam Haghighi and David Sankoff. Medians seek the corners, and other conjectures. BMC Bioinformatics, 13(19):S5, 2012. doi:10.1186/1471-2105-13-S19-S5.
[9] Arash Jamshidpey, Aryo Jamshidpey, and David Sankoff. Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints. BMC Genomics, 15(6):S3, 2014.
[10] Arash Jamshidpey and David Sankoff. Phase change for the accuracy of the median value in estimating divergence time. BMC Bioinformatics, 14(15):S7, 2013. doi:10.1186/1471-2105-14-S15-S7.
[11] Caroline Anne Larlee, Chunfang Zheng, and David Sankoff. Near-medians that avoid the corners; a combinatorial probability approach. BMC Genomics, 15(6):S1, 2014.
[12] Mona Meghdari Miardan, Arash Jamshidpey, and David Sankoff. Escape from parsimony of a double-cut-and-join genome evolution process. Journal of Computational Biology, 30(2):118–130, 2023. doi:10.1089/CMB.2021.0468.
[13] David Sankoff and Mathieu Blanchette. The median problem for breakpoints in comparative genomics. Computing and Combinatorics, pages 251–263, 1997. doi:10.1007/BFB0045092.
[14] David Sankoff, Gopalakrishnan Sundaram, and John Kececioglu. Steiner points in the space of genome rearrangements. International Journal of Foundations of Computer Science, 7(01):1–9, 1996. doi:10.1142/S0129054196000026.
[15] Eric Tannier, Chunfang Zheng, and David Sankoff. Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics, 10(1):120, 2009. doi:10.1186/1471-2105-10-120.
[16] Andrew Wei Xu. The median problems on linear multichromosomal genomes: Graph representation and fast exact solutions. Journal of Computational Biology, 17(9):1195–1211, 2010. doi:10.1089/CMB.2010.0106.
[17] João Paulo Pereira Zanetti, Priscila Biller, and João Meidanis. Median approximations for genomes modeled as matrices. Bulletin of Mathematical Biology, 78:786–814, 2016.
[18] Chunfang Zheng and David Sankoff. On the pathgroups approach to rapid small phylogeny. BMC Bioinformatics, 12(1):S4, 2011. doi:10.1186/1471-2105-12-S1-S4.

Appendix A Proof of Theorem 1

Below, we include the proof of Theorem 1, as presented in [5].

Proof.

For a permutation $\pi$ and $r\leq k$, let $\bar{\varepsilon}_{i_{1},...,i_{r}}^{X}(\pi):=|\mathcal{A}_{\pi}\cap\mathcal{B}_{x_{i_{1}},...,x_{i_{r}}}^{X}|.$ To ease the notation, we let $\mathcal{B}_{i_{1},...,i_{\ell}}^{X}=\mathcal{B}_{x_{i_{1}},...,x_{i_{\ell}}}^{X}$. Let $\eta=|\mathcal{A}_{m}\setminus\cup_{i=1}^{k}\mathcal{A}_{x_{i}}|$. Then

\eta+\sum\limits_{r=1}^{k}\sum\limits_{1\leq i_{1}<...<i_{r}\leq k}\bar{\varepsilon}_{i_{1},...,i_{r}}^{X}(m)=n-1.

As $m$ is a median of $X$ , we have

\begin{array}{l}
d_{T}(m,X)=k(n-1)-\sum\limits_{r=1}^{k}\Big[r\sum\limits_{1\leq i_{1}<...<i_{r}\leq k}\bar{\varepsilon}_{i_{1},...,i_{r}}^{X}(m)\Big]\\
\quad=(k-1)(n-1)+\eta-\sum\limits_{r=2}^{k}\Big[(r-1)\sum\limits_{1\leq i_{1}<...<i_{r}\leq k}\bar{\varepsilon}_{i_{1},...,i_{r}}^{X}(m)\Big]\\
\quad\leq d_{T}(x_{k},X)=(k-1)(n-1)-\Big(\sum\limits_{1\leq i_{1}<k}|\mathcal{B}_{i_{1},k}^{X}|+2\sum\limits_{1\leq i_{1}<i_{2}<k}|\mathcal{B}_{i_{1},i_{2},k}^{X}|\\
\qquad\qquad+\cdots+(k-2)\sum\limits_{1\leq i_{1}<...<i_{k-2}<k}|\mathcal{B}_{i_{1},...,i_{k-2},k}^{X}|+(k-1)|\mathcal{B}_{1,...,k}^{X}|\Big).
\end{array}

Hence,

\begin{array}{rl}
\eta\leq & \Big(\sum\limits_{r=2}^{k}(r-1)\sum\limits_{1\leq i_{1}<...<i_{r}\leq k}\bar{\varepsilon}_{i_{1},...,i_{r}}^{X}(m)\Big)-\Big(\sum\limits_{r=2}^{k}(r-1)\sum\limits_{1\leq i_{1}<...<i_{r-1}<k}|\mathcal{B}_{i_{1},...,i_{r-1},k}^{X}|\Big)\\
\leq & \sum\limits_{r=2}^{k-1}(r-1)\sum\limits_{1\leq i_{1}<...<i_{r}<k}|\mathcal{B}_{i_{1},...,i_{r}}^{X}|,\qquad(2)
\end{array}

where the last inequality holds because $\bar{\varepsilon}_{i_{1},...,i_{r}}^{X}(m)\leq|\mathcal{B}_{i_{1},...,i_{r}}^{% X}|$ , for any $r\leq k$ and $1\leq i_{1}<...<i_{r}\leq k$ . $\hfill\blacktriangleleft$
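The last inequality in (2) can be unpacked by splitting the first sum according to whether the index tuple contains $k$. The following lines are a sketch of that step, using only the bound $\bar{\varepsilon}_{i_{1},...,i_{r}}^{X}(m)\leq|\mathcal{B}_{i_{1},...,i_{r}}^{X}|$ stated above, followed by the specialization to $k=3$, the case used in the experiments.

```latex
% Split the tuples 1 <= i_1 < ... < i_r <= k into those with i_r = k and those with i_r < k.
\sum_{r=2}^{k}(r-1)\sum_{1\leq i_{1}<\dots<i_{r}\leq k}\bar{\varepsilon}_{i_{1},\dots,i_{r}}^{X}(m)
  = \sum_{r=2}^{k}(r-1)\sum_{1\leq i_{1}<\dots<i_{r-1}<k}\bar{\varepsilon}_{i_{1},\dots,i_{r-1},k}^{X}(m)
  + \sum_{r=2}^{k-1}(r-1)\sum_{1\leq i_{1}<\dots<i_{r}<k}\bar{\varepsilon}_{i_{1},\dots,i_{r}}^{X}(m).

% Tuples containing k: each term is at most |B^X_{i_1,...,i_{r-1},k}|, so this part is
% cancelled by the subtracted sum in (2). Tuples avoiding k: each term is at most
% |B^X_{i_1,...,i_r}|, which gives the right-hand side of (2).

% Specialization to k = 3: only r = 2 with (i_1, i_2) = (1, 2) remains, hence
\eta \leq |\mathcal{B}_{1,2}^{X}|.
```

Since the argument compares $m$ with $x_{k}$ and the labeling of the inputs is arbitrary, the same bound holds with any input permutation playing the role of $x_{k}$.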
