Robust-Sorting and Applications to Ulam-Median
Abstract
Sorting is one of the most basic primitives in many algorithms and data analysis tasks. Comparison-based sorting algorithms, like quick-sort and merge-sort, are known to be optimal when the outcome of each comparison is error-free. However, many real-world sorting applications operate in scenarios where the outcome of each comparison can be noisy. In this work, we explore settings where a bounded number of comparisons are potentially corrupted by erroneous agents, resulting in arbitrary, adversarial outcomes.
We model the sorting problem as a query-limited tournament graph where edges involving erroneous nodes may yield arbitrary results. Our primary contribution is a randomized algorithm inspired by quick-sort that, in expectation, produces an ordering close to the true total order while querying only $\widetilde{O}(n)$ edges. We achieve a distance from the target order $\pi$ within $(3+\epsilon)|B|$, where $B$ is the set of erroneous nodes, balancing the competing objectives of minimizing both query complexity and misalignment with $\pi$. Our algorithm needs to carefully balance two aspects – identify a pivot that partitions the vertex set evenly and ensure that this partition is “truthful” and yet query as few “triangles” in the graph as possible. Since the nodes in $B$ can potentially hide in an intricate manner, our algorithm requires several technical steps that ensure that progress is made in each recursive step.
Additionally, we demonstrate significant implications for the Ulam $k$-median problem. This is a classical clustering problem where the metric is defined on the set of permutations over a set of $d$ elements. Chakraborty, Das, and Krauthgamer gave an FPT approximation algorithm for this problem that breaks the 2-approximation barrier, where the running time is super-linear in both $n$ (the number of input permutations) and $d$. We give the first FPT linear time approximation algorithm for this problem. Our main technical result gives a strengthening of the results of Chakraborty et al. by showing that a good 1-median solution can be obtained from a constant-size random sample of the input. We use our robust sorting framework to find a good solution from such a random sample. We feel that the notion of robust sorting should have applications in several such settings.
Keywords and phrases: Sorting, clustering, query complexity
Category: Track A: Algorithms, Complexity and Games
Funding: Ragesh Jaiswal: The author acknowledges the support from the SERB, MATRICS grant.
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
Acknowledgements: We thank anonymous reviewers for their valuable feedback and suggestions.
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
Sorting is one of the most basic primitives in many algorithms and data analysis tasks. The classical model of comparison-based sorting has been extensively studied, where one aims to sort a list of $n$ objects using pairwise comparisons. It is well-known that, in this model, sorting requires $\Omega(n \log n)$ comparisons in the worst case. Popular algorithms such as merge-sort, quick-sort, and heap-sort achieve this bound with $O(n \log n)$ comparisons.
However, in practical scenarios, sorting is often applied to very large datasets where errors or imperfections in the comparisons are unavoidable. In real-world applications involving noisy data or large-scale distributed systems, comparisons may occasionally be faulty due to hardware imperfections, data corruption, or other noisy sources. Thus, it is crucial to extend classical sorting algorithms to handle such imperfections effectively. An approach for dealing with such noisy errors in sorting was initiated by Feige, Raghavan, Peleg, and Upfal [7]. In their model, each comparison’s outcome is flipped independently with some known probability $p < 1/2$. Assuming $p$ is a constant, repeatedly querying a pair $O(\log n)$ times and taking the majority outcome would ensure that the outcome is correct with high probability. However, this would lead to $O(n \log^2 n)$ comparisons. [7] showed that one could instead obtain an algorithm that outputs the sorted order with high probability and performs $O(n \log n)$ comparisons. There has been much recent work [19, 12] on obtaining tight dependence on the parameter $p$.
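As a concrete illustration of the resampling strategy just described, here is a minimal sketch (the oracle `noisy_less` and the trial count are hypothetical stand-ins; the actual algorithm of [7] is more refined than naive repetition):

```python
import random

def noisy_less(u, v, p=0.1):
    """Stand-in noisy oracle: reports whether u < v, but flips the
    answer independently with probability p on every query."""
    return (u < v) if random.random() >= p else not (u < v)

def boosted_less(u, v, trials=41):
    """Majority vote over O(log n) independent re-queries; the answer
    is correct with probability 1 - exp(-Omega(trials))."""
    yes = sum(noisy_less(u, v) for _ in range(trials))
    return yes > trials // 2

# Plugging boosted_less into any O(n log n) sorting algorithm yields the
# naive O(n log^2 n)-comparison noisy sorter mentioned above.
```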
Our work is rooted in adversarial settings where each key is controlled by an agent, and some of the agents may be erroneous. Thus, any comparison involving one of the erroneous agents could result in an arbitrary, though deterministic, outcome. Such models are of practical relevance. For example, in distributed computing environments, each data element may be processed by different nodes, and some nodes could behave erratically due to hardware failures or software bugs. Similarly, in data integration tasks, comparisons may be unreliable due to discrepancies between data sources with inconsistent quality. In these scenarios, it is natural to seek sorting algorithms that are robust to a bounded number of adversarial errors.
To formalize this setting, we model the outcome of all pair-wise comparisons between keys as a tournament graph $T$ (recall that a tournament is a directed graph that has a directed arc between each pair of vertices) and assume that there is a total order $\pi$ on the vertices of $T$. If there were no erroneous edges, then the edges of $T$ would be consistent with $\pi$ (i.e., $T$ has a directed arc $(u,v)$ iff $u$ appears before $v$ in $\pi$). However, there is a set of “bad” nodes $B$ in this graph: querying any edge involving such a bad node as an end-point can lead to an (adversarially) arbitrary outcome. But the outcome of querying any other edge is consistent with $\pi$.
The algorithm queries some of the edges of $T$, and then outputs an ordering $\sigma$ on all the vertices of $T$. The goal is to output an ordering with minimum distance (measured in terms of the length of the longest common subsequence) with respect to $\pi$.
Observe that there are two competing objectives here: query as few edges of $T$ as possible and minimize the mismatch with the hidden input sequence $\pi$. If we query all the edges of $T$, it is not difficult to show that a 2-approximation algorithm for the feedback vertex set on tournaments [17] can be used to obtain an ordering that has distance at most $3|B|$ from $\pi$. In this work, we address the following question:
Can we get an efficient algorithm that queries $\widetilde{O}(n)$ edges in $T$, and outputs an ordering that has distance at most $O(|B|)$ from $\pi$?
We answer this question in the affirmative and give a randomized algorithm that comes within distance $(3+\epsilon)|B|$ of $\pi$. Furthermore, it runs in $\widetilde{O}(n)$ time. Our algorithm is based on careful identification of good pivots, which are likely to be outside the set $B$. Since the algorithm does not know the set $B$, we can only use indirect approaches for this. Our algorithm relies on finding directed triangles in $T$ – we know that there must be at least one vertex of $B$ in such a triangle. However, each failed attempt to identify such a triangle increases the query complexity. Thus, the algorithm needs to carefully balance these two aspects – identify a pivot that partitions the vertex set evenly and ensure that this partition is “truthful”, and yet query as few triangles in the graph as possible. Direct implementations of these ideas do not work for several subtle reasons: (i) a pivot belonging to the set $B$ may be able to hide itself by participating in few triangles, and yet may create a large number of misalignments if used for partitioning the set of elements; (ii) elements outside $B$ will result in truthful partitions, but may be involved in a lot of triangles (involving elements in $B$), and hence the algorithm may not use them as pivots. The solution lies in a technically more involved strategy: first, ensure by random sampling that there are not too many triangles left in the current subproblem. Subsequently, we show that an element that results in an almost balanced partition and is not involved in too many triangles can be used as a pivot. Proving this fact lies at the heart of our technical contribution.
Implications for the Ulam $k$-Median Problem.
Our result on robust sorting has interesting implications for another well-studied problem, namely the Ulam $k$-median problem. Given a positive integer $d$, let $[d]$ denote the set $\{1, 2, \ldots, d\}$ and $\mathcal{S}_d$ the set of all permutations over $[d]$. We can define a metric $\Delta$ over $\mathcal{S}_d$ as follows: given $\pi_1, \pi_2 \in \mathcal{S}_d$, $\Delta(\pi_1, \pi_2) = d - |\mathsf{LCS}(\pi_1, \pi_2)|$, where $|\mathsf{LCS}(\pi_1, \pi_2)|$ denotes the length of the longest common subsequence between $\pi_1$ and $\pi_2$. The metric $\Delta$ is also popularly known as the Ulam metric. An instance of this problem is specified by a subset $X \subseteq \mathcal{S}_d$ and an integer $k$. The goal is to find a subset $C$ of $k$ permutations in $\mathcal{S}_d$ (which may not lie in $X$) such that the objective value $\sum_{x \in X} \min_{c \in C} \Delta(x, c)$
is minimized. In other words, it is the $k$-median problem where the metric is given by $\Delta$ on $\mathcal{S}_d$. When $k = 1$, there is a simple 2-approximation algorithm (which works for any metric): output the best point in the input set $X$.
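Since permutations consist of distinct symbols, $\Delta$ can be computed in $O(d \log d)$ time by reducing LCS to longest increasing subsequence; a sketch in our notation (not code from [5, 6]):

```python
from bisect import bisect_left

def ulam_distance(p, q):
    """Delta(p, q) = d - |LCS(p, q)| for permutations p, q of the same
    d distinct symbols, via longest increasing subsequence."""
    pos = {sym: i for i, sym in enumerate(p)}   # symbol -> index in p
    tails = []                                  # patience-sorting piles
    for r in (pos[sym] for sym in q):           # q in p-coordinates
        i = bisect_left(tails, r)
        if i == len(tails):
            tails.append(r)
        else:
            tails[i] = r
    return len(p) - len(tails)                  # d - |LCS|

# Example: ulam_distance((1, 2, 3, 4), (2, 1, 3, 4)) == 1.
```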
Breaking the approximation guarantee of 2 for specific metrics, such as the Ulam metric, was open for a long time. Recently, a sequence of works [5, 6] breaks this 2-factor barrier for the Ulam $k$-median problem. The algorithm is a fixed-parameter tractable (FPT) algorithm (with respect to the parameter $k$) that gives an approximation guarantee of $2 - \delta$ for a very small but fixed constant $\delta > 0$. The running time of this algorithm can be written as $f(k) \cdot \mathrm{poly}(n, d)$ and hence is FPT. Note that the running time is super-linear in $n$ and cubic in $d$. This contrasts with several FPT approximation algorithms for the $k$-median/means problems in the literature [16, 13, 3, 2, 11] that have running time linear in the input size. For example, the running time of the FPT $(1+\epsilon)$-approximation algorithms for the Euclidean $k$-median/means problems in [16, 13, 3, 2] is linear in $nd$, where $d$ is the dimension. Similarly, the running time of the FPT constant factor approximation algorithms [11] in the metric setting is linear in $n$. Thus, we ask:
Is it possible to break the barrier of 2-approximation for the Ulam $k$-median problem using an FPT algorithm with a linear running time in the input size, i.e., $f(k) \cdot \widetilde{O}(nd)$?
We show that the answer is affirmative using two crucial ideas, one of which relies on the robust sorting problem. We build on the ideas of [6] for designing a $(2-\delta)$-approximation for the 1-median problem. They show that for any input set $X$, there exist five permutations in $X$ from which we can find a permutation that has a small distance with respect to the optimal 1-median $x^*$. To obtain linear dependence of the running time on $n$, we show that a constant-size randomly sampled subset of input permutations contains such five permutations with good probability. To obtain linear dependence on $d$, given guesses for the five permutations, we can assign each pair of symbols a direction based on the majority vote among these permutations. Now, it turns out that a good solution to the corresponding robust sorting instance is close to the optimal solution $x^*$.
2 Preliminaries and Problem Definition
We discuss two problems – Robust Sorting and Ulam $k$-median. We start with robust sorting.
Robust Sorting.
We are given a set of $n$ elements $V$. There is an (unknown) total order $\pi$ over $V$. However, we have imperfect access to the total order $\pi$. We are given an implicit directed graph $T = (V, E)$, where we have a directed edge between every pair of elements in $V$ (such graphs are often called tournaments). In the ideal (zero error) scenario, the edge set $E$ would correspond to $\pi$, i.e., for every distinct pair $u, v \in V$, we have the directed arc $(u, v)$ iff $u$ comes before $v$ in $\pi$. However, we allow the graph $T$ to represent $\pi$ in an imperfect manner as formalized below:
Definition 1 (Imperfect representation).
We say that a tournament $T = (V, E)$ is $B$-imperfect with respect to a total order $\pi$ on $V$, where $B$ is a subset of $V$, if for every pair of distinct $u, v \in V \setminus B$, $T$ has the arc $(u, v)$ iff $u$ comes before $v$ in $\pi$. For an integer $k$, we say that $T$ is $k$-imperfect w.r.t. $\pi$ if there is a subset $B$ of $V$ with $|B| \le k$, such that $T$ is $B$-imperfect with respect to $\pi$.
In other words, if $T$ is $B$-imperfect w.r.t. $\pi$, then the arcs with both end-points outside $B$ represent $\pi$ correctly, but we cannot give any guarantee for arcs incident with vertices in $B$. We shall often ignore the reference to $\pi$ if it is clear from the context. For an edge $(u, v) \in E$, we will sometimes use the notation $u \prec v$ or $v \succ u$. Note that these relations are not transitive, except when $B = \emptyset$. Similarly, we will sometimes say that $u$ is lesser than (resp. greater than) $v$ if $u \prec v$ (resp. $u \succ v$).
The Robust Sort problem is as follows: given a set $V$ of $n$ points with an implicit total order $\pi$, and query access to the edges in a tournament $T$ on $V$, output an ordering $\sigma$ on $V$ which maximizes $|\mathsf{LCS}(\sigma, \pi)|$ while using as few queries to $T$ as possible. Here $\mathsf{LCS}(\sigma, \pi)$ denotes the longest common subsequence between $\sigma$ and $\pi$, and $|\mathsf{LCS}(\sigma, \pi)|$ denotes its length. Observe that when $T$ is $0$-imperfect w.r.t. $\pi$, one can obtain the total order with $O(n \log n)$ queries to $T$.
The Robust Sort problem can be reduced to the well-studied Feedback Vertex Set problem on tournaments (FVST). A feedback vertex set in a directed graph is a subset of vertices whose removal makes the graph acyclic. The Feedback Vertex Set in Tournaments (FVST) problem is to find a feedback vertex set of the smallest size in a given tournament graph. If we are allowed to query all the edges in $T$, then an $\alpha$-approximation algorithm for FVST can be utilized to obtain an $(\alpha+1)$-approximation algorithm for Robust Sort as follows: find an $\alpha$-approximate feedback vertex set $F$ and then find a topological ordering $\sigma'$ on $V \setminus F$. Output the concatenation $\sigma$ of an arbitrary ordering on $F$ with $\sigma'$. Clearly, $|\mathsf{LCS}(\sigma, \pi)| \ge n - |F| - |B| \ge n - (\alpha + 1)|B|$, since $B$ itself is a feedback vertex set. Hence, using exponential time, we can find an optimal feedback vertex set and therefore a 2-approximate solution to Robust Sort. However, since feedback vertex set on tournaments is hard to approximate better than a factor of 2, and a polynomial time 2-approximation algorithm is known [17], 3-approximation is the best we can get from such a direct reduction to FVST. Indeed, consider a simple example where $V = \{v_1, \ldots, v_n\}$ and $\pi = (v_1, \ldots, v_n)$. Consider a $\{v_1\}$-imperfect tournament $T$, such that for all unordered pairs $\{v_i, v_j\}$ with $i < j$, there is an arc from $v_i$ to $v_j$ in $T$, except the direction of the arc is reversed for $\{v_1, v_3\}$ and $\{v_1, v_4\}$. In this example, the only directed triangles are on $\{v_1, v_2, v_3\}$ and $\{v_1, v_2, v_4\}$, and a 2-approximation algorithm might return $\{v_3, v_4\}$ as the feedback vertex set. Hence, $(v_1, v_2, v_5, \ldots, v_n)$ is the only topological ordering on the remaining vertices, and we might return $\sigma = (v_3, v_4, v_1, v_2, v_5, \ldots, v_n)$. Here, $|\mathsf{LCS}(\sigma, \pi)| = n - 2$, even though the ordering $\pi$ itself (with no loss) was achievable.
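The reduction is easy to make concrete. The sketch below (a brute-force illustration, not the 2-approximation of [17]) computes an optimal feedback vertex set, sorts the remaining transitive tournament by out-degree, and prepends the removed vertices in arbitrary order:

```python
from itertools import combinations

def has_triangle(vertices, less):
    """A tournament is acyclic iff it contains no directed triangle."""
    return any((less(a, b) and less(b, c) and less(c, a)) or
               (less(b, a) and less(a, c) and less(c, b))
               for a, b, c in combinations(vertices, 3))

def robust_sort_via_fvst(V, less):
    """Exponential-time reduction: find the smallest F with T[V \\ F]
    acyclic, then output the unique topological order of the transitive
    remainder. `less(u, v)` answers the (possibly corrupted) arc query."""
    for size in range(len(V) + 1):                 # smallest FVS first
        for F in map(set, combinations(V, size)):
            rest = [v for v in V if v not in F]
            if not has_triangle(rest, less):
                # In a transitive tournament, out-degree decreases along pi.
                deg = {v: sum(less(v, u) for u in rest if u != v) for v in rest}
                rest.sort(key=deg.__getitem__, reverse=True)
                return list(F) + rest
```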
Although the algorithm of Lokshtanov et al. [17] is also inspired by quick-sort, it queries $O(n^2)$ edges and runs in polynomial time. In our setting, we are constrained by near-linear running time and, hence, can only check a small subset of triangles for consistency. However, this causes several other issues: a bad pivot can masquerade as a good one, and a good pivot may not get a chance to partition the elements into almost equal halves. The fact that we are on a very tight budget in terms of the number of queries and errors makes the algorithm and analysis quite subtle.
Ulam median.
The Ulam $k$-median problem is simply the $k$-median problem defined over the Ulam metric – given a set $X \subseteq \mathcal{S}_d$ of $n$ elements and a positive integer $k$, find a set $C$ of $k$ elements (called centers) such that the objective function $\sum_{x \in X} \min_{c \in C} \Delta(x, c)$ is minimized. Here, $\mathcal{S}_d$ is the set of all permutations of $[d]$, which can also be seen as all $d$-length strings with distinct symbols from the set $[d]$. The distance function is defined as $\Delta(\pi_1, \pi_2) = d - |\mathsf{LCS}(\pi_1, \pi_2)|$, where $|\mathsf{LCS}(\pi_1, \pi_2)|$ denotes the length of the longest common subsequence of permutations $\pi_1, \pi_2$. There is a trivial 2-approximation algorithm, and it has been a long open challenge to obtain a better approximation. Breaking this barrier of 2 was recently achieved by Chakraborty et al. [6], who gave a parameterized algorithm (with parameter $k$) with a running time super-linear in $n$ and cubic in $d$, and an approximation guarantee of $2 - \delta$ for a small constant $\delta$. Our goal was to achieve the same using a parameterized algorithm with running time $f(k) \cdot \widetilde{O}(nd)$.
3 Our Results
Our first result gives an algorithm for Robust Sort that queries $\widetilde{O}(n)$ (here $\widetilde{O}(\cdot)$ hides poly-logarithmic factors) edges in $T$ and achieves nearly the same guarantees as that obtained by an efficient algorithm querying all the edges in $T$ followed by a reduction to FVST.
Theorem 2.
Consider an instance of Robust Sort given by a tournament graph $T = (V, E)$, where $|V| = n$, and a parameter $\epsilon > 0$. Suppose $T$ is $B$-imperfect w.r.t. an ordering $\pi$ on $V$. Then, there is an efficient algorithm that queries $\widetilde{O}(n)$ edges in $T$ and outputs a sequence $\sigma$ such that the expectation of $|\mathsf{LCS}(\sigma, \pi)|$ is at least $n - (3+\epsilon)|B|$. Further, the algorithm does not require knowledge of the quantity $|B|$ and has running time $\widetilde{O}(n)$ (assuming each query takes constant time).
It is worth emphasizing that the parameter $|B|$ may not be constant. In fact, much of the technical difficulty lies in handling cases when $|B|$ may be sub-linear in $n$. The following is our main result for the Ulam $k$-median problem.
Theorem 3.
There is a randomized algorithm for the Ulam $k$-median problem that runs in time $f(k) \cdot \widetilde{O}(nd)$ (for some computable function $f$) and returns a center set $C$ with objective value at most $(2 - \delta)$ times the optimal with constant probability, where $\delta$ is a small but fixed constant.
3.1 Our Techniques
We now give an overview of our techniques for the robust sorting and the Ulam $k$-median problems.
3.1.1 Robust Sorting
Consider an instance of ROBUST-SORT given by a graph $T$ on a vertex set $V$ of size $n$. Assume $T$ is $B$-imperfect for some $B \subseteq V$, $|B| = k$. We shall call the elements of $B$ “bad” elements and the rest “good” elements. Observe that every directed triangle in $T$ must contain at least one bad element. One potential idea is to keep finding and removing directed triangles in $T$, until $T$ is acyclic (recall that a tournament is acyclic if and only if it has no directed triangles). Suppose we remove $t$ triangles. As each removed triangle has at least one bad element, $t \le k$, and the number of remaining bad elements is at most $k - t$. Hence, the number of removed good elements is at most $2t \le 2k$. Thus, we produce an ordering $\sigma$ (a topological ordering of the surviving elements, followed by the removed ones) with $|\mathsf{LCS}(\sigma, \pi)| \ge n - 3k$. However, this approach uses a quadratic number of queries on $T$. On the other hand, there are many sorting algorithms that use $O(n \log n)$ queries in the classical setting, i.e., the 0-imperfect setting. Direct generalizations of such sorting algorithms fail to get the required guarantees on the LCS between the output and $\pi$.
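A sketch of this quadratic-query baseline (the triangle test is the same as in the earlier reduction sketch; the out-degree sort recovers the good elements in $\pi$-order on the transitive remainder):

```python
from itertools import combinations

def triangle_removal_sort(V, less):
    """Quadratic-query baseline: repeatedly discard some directed
    triangle; the surviving tournament is transitive, so sorting by
    out-degree recovers the good elements in pi-order. Discarded
    elements are appended at the end and counted towards the loss."""
    alive, discarded = list(V), []
    while True:
        tri = next((t for t in combinations(alive, 3)
                    if (less(t[0], t[1]) and less(t[1], t[2]) and less(t[2], t[0]))
                    or (less(t[1], t[0]) and less(t[0], t[2]) and less(t[2], t[1]))),
                   None)
        if tri is None:
            break
        alive = [x for x in alive if x not in tri]
        discarded.extend(tri)
    deg = {v: sum(less(v, u) for u in alive if u != v) for v in alive}
    alive.sort(key=deg.__getitem__, reverse=True)
    return alive + discarded
```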
For example, consider Merge Sort. Here, bad elements can cause the merge procedure to output permutations with a large distance with respect to $\pi$. Indeed, suppose we partition the input set into equal-sized parts $P_1$ and $P_2$ and recursively get good orderings $\sigma_1$ and $\sigma_2$ on them, respectively. Further, assume that all the good elements in $P_1$ appear before those in $P_2$. However, it is possible that the first element in $\sigma_1$ happens to be a bad element $b$, and $b$ turns out to be larger (according to the queries) than all the elements in $P_2$. In such a setting, the merge procedure would place all the elements in $P_2$ before $b$ (and hence before the rest of $\sigma_1$), which is clearly an undesirable outcome.
We now show that using randomized quick-sort directly would also lead to an undesirable outcome. In randomized quick-sort, one chooses a pivot at random and recursively solves the problem on the elements smaller than the pivot and those larger than the pivot, placing the pivot between the two recursive outputs. For the feedback arc set in tournaments (FAST) problem, where the goal is to minimize the number of inverted pairs, this algorithm was shown to return a 3-approximation [1] in expectation. However, this simple algorithm does not work for our problem. Indeed, let $k = \sqrt{n}$ and consider a random input where a subset $B$ of size $k$ is chosen at random as the set of bad elements, and a random permutation is selected among the set of good elements. Edges where both endpoints are good respect the permutation, and each edge incident on a bad element is oriented randomly. Now, the random quick-sort algorithm chooses a bad pivot with probability $k/n$ – assume this event happens. Let $S_\ell, S_r$ be the elements less than and greater than the pivot, respectively (with respect to the graph $T$). Since the pivot is bad, our assumption implies that $S_\ell, S_r$ form a roughly random partition of the remaining elements into equal-sized subsets. Thus, if $G$ denotes the set of good elements and $G_1, G_2$ denote the left and the right half of $G$ respectively, then roughly half the elements of $G_1$ will end up in $S_r$ and similarly for $G_2$. Note that $|G_1| = |G_2| \approx (n-k)/2$. Hence, roughly $\Omega(n)$ elements will be wrongly placed by the pivot. Let $D(n, k)$ denote the distance between the output produced by this algorithm on an instance of size $n$ containing $k$ bad elements and the ordering $\pi$. Then, we have shown that $D(n, k)$ is $\Omega(n)$ with high probability if we choose a bad pivot. Thus, we get the approximate recurrence (note that both $S_\ell, S_r$ will have roughly $k/2$ bad elements): $D(n, k) \approx \frac{k}{n} \cdot \Omega(n) + 2 \cdot D(n/2, k/2)$.
It is easy to verify that this results in $D(n, k) = \Omega(k \log n)$, whereas we desire an output for which this quantity is $O(k)$.
The above analysis indicates that we cannot afford to have “arbitrarily” bad elements as pivots. The following idea takes care of examples as above: when we choose a pivot $p$, check (a logarithmic number of times) if $p$ forms a triangle with a randomly sampled pair $(u, v)$. If a triangle is found, we can remove it (and hence, at least one bad element gets removed) and try a new pivot; otherwise, it is guaranteed that $p$ is involved in very few triangles (note that in the example above, a triangle would be found with constant probability if the pivot is a bad one). However, this simple idea also does not work. The reason is as follows: (i) randomly chosen good elements, which would ideally act as good pivots, may be involved in a lot of triangles and, hence, would not be selected as pivots, and (ii) there could be bad elements which are not involved in too many triangles, and hence would sneak in as pivots. It may seem that the latter scenario is not undesirable – if a bad element participates in a few triangles, then it perhaps acts like a good element and can be used to partition the input set. Unfortunately, the quantity $D(n, k)$ as defined above turns out to be large for this algorithm. This happens for the following subtle reason: say there are $k$ bad elements, and suppose each of the good elements participates in many triangles. Then, with probability $1 - k/n$ (which can be considered to be close to 1), the algorithm picks a good element as a random pivot and finds a triangle containing it. This would reduce the problem size by only three elements, i.e., the recursive problem has almost the same size. On the other hand, with probability $k/n$, which may be small, the algorithm partitions using a bad element as a pivot. As outlined above, when we pick a bad element as a pivot, the resulting partition may have many good elements on the wrong side of the partition. Thus, in the overall calculation, the large misalignment created due to these low probability events overwhelms the expected value of $D(n, k)$. In other words, one obtains a recurrence of the form: $D(n, k) \approx \left(1 - \frac{k}{n}\right) \cdot D(n - 3, k - 1) + \frac{k}{n} \cdot \left(\Delta + 2 \cdot D(n/2, k/2)\right),$
where $D(n-3, k-1)$ refers to the sub-problem obtained when a triangle is found, and $\Delta$ refers to the misalignment caused by a typical bad element. Since a bad element participates in few triangles, it is possible that $\Delta \ll n$ (comparing with the previous recurrence above), but still, $\Delta$ can be high enough to lead to a recurrence where $D(n, k)$ is not $O(k)$. The issue arises because the problem size does not shrink sufficiently to balance the expected misalignment. One way to handle this is to consider separately the case where there are too many triangles in the tournament. So, before picking a potential pivot and testing its goodness, in a pre-pivoting step, we check randomly chosen triples for triangles and remove them in case they are found. This ensures that at the time of pivot selection, there is a reasonable chance the problem size shrinks without too much increase in the expected misalignment. Our algorithm and analysis basically work by balancing these quantities in a carefully devised inductive argument. One of the key technical insights is the following: given two orderings $\sigma_\ell$ and $\sigma_r$ of the two sides $S_\ell$ and $S_r$ of a partition of $S$, we define the notion of concatenation loss – this captures the extra misalignment (with respect to the implicit ordering $\pi$) created by concatenating $\sigma_\ell$ and $\sigma_r$. Our key technical result shows that if such a partitioning is created by a pivot involved in a few triangles, then the corresponding concatenation loss of the orderings on $S_\ell$ and $S_r$ output by the algorithm is small.
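The two pivot tests just described can be sketched as follows (`t` and the `t // 8` balance threshold are illustrative placeholders, not the exact parameters fixed in Algorithm 1):

```python
import random

def passes_balance_test(S, p, less, t):
    """Sample t elements with replacement and check that both sides of
    the partition around pivot p receive a constant fraction of them."""
    others = [x for x in S if x != p]
    left = sum(less(x, p) for x in random.choices(others, k=t))
    return min(left, t - left) >= t // 8

def triangle_through_pivot(S, p, less, t):
    """Look for a directed triangle containing p among t random pairs;
    returns the triangle (to be discarded) or None."""
    others = [x for x in S if x != p]
    for _ in range(t):
        u, v = random.sample(others, 2)
        if (less(p, u) and less(u, v) and less(v, p)) or \
           (less(p, v) and less(v, u) and less(u, p)):
            return {p, u, v}
    return None
```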
3.1.2 Ulam $k$-Median
There is a trivial and well-known 2-approximation algorithm for the Ulam 1-median problem – output the best permutation from the input. To break the barrier of 2, one must consider a stronger version of the triangle inequality that holds specifically for the Ulam metric. Let $x_1, \ldots, x_n$ be the permutations in the input and let $x^*$ denote the optimal 1-median. Let $I(x_i)$ denote the subset of symbols that are not in $\mathsf{LCS}(x_i, x^*)$, i.e., the misaligned symbols in $x_i$ and $x^*$. So, $\Delta(x_i, x^*) = |I(x_i)|$. The following (stronger) version of the triangle inequality holds for the Ulam metric: $\Delta(x_i, x_j) \le \Delta(x_i, x^*) + \Delta(x_j, x^*) - |I(x_i) \cap I(x_j)|$. Chakraborty et al. [6] exploit this inequality to break the 2-approximation barrier. Even though the technical details are intricate, at a very high level, the key idea in [6] is to show that either one of the input permutations is a good center, or there are five permutations $x_{i_1}, \ldots, x_{i_5}$ such that for every pair $j \ne j'$, $|I(x_{i_j}) \cap I(x_{i_{j'}})|$ is small, i.e., the number of common misaligned symbols is small.
We strengthen this result as follows – we show that either there are a significant number of permutations that will be good centers, or there are a significant number of permutations with a small number of pair-wise common misaligned symbols. This allows us to argue that an $O(1)$-sized random subset of permutations is sufficient to find a good center, so we do not need to consider all $\binom{n}{5}$ possibilities for finding a good center. Chakraborty et al. [6] gave a similar sampling lemma. However, they required a random sample of super-constant size, and using their result would lead to an additional super-constant factor in the running time for the $k$-median problem. One of our key contributions is showing that a constant-sized sample suffices. The number of common misaligned symbols in the five permutations being small implies that for most pairs of symbols, their relative order in $x^*$ matches that in at least 3 out of the 5 permutations. This is used to find a permutation with a good agreement with $x^*$, and hence a good center. [6] uses a super-linear (in $d$) procedure to find such a center, whereas we improve this to $\widetilde{O}(d)$ by using our robust sorting algorithm. For the Ulam $k$-median problem, we use our sampling-based algorithm, ULAM1, within the $D$-sampling framework of [13] to obtain an $f(k) \cdot \widetilde{O}(nd)$-time algorithm. Here is the summary of the key ideas: Let $X_1, \ldots, X_k$ be the dataset partition that denotes the optimal clustering, and let $c_1, \ldots, c_k$ denote the optimal centers, respectively. Let us try to use ULAM1 to find good centers for each of $X_1, \ldots, X_k$. We would need uniformly sampled points from each $X_i$. The issue is that the optimal clustering is not known. If the clusters were balanced, i.e., $|X_i| = \Theta(n/k)$ for every $i$, then uniformly sampling sufficiently many points from $X$ and then considering all possible partitions of these points would give the required uniform samples from each of the optimal clusters. We can then use ULAM1 to find good center candidates for every cluster. In the general case, where the clusters may not be balanced, we use the $D$-sampling technique (sampling proportional to the distance of an element from the closest previously chosen center) to boost the probability of sampling from small-sized optimal clusters, which may get ignored when sampling uniformly at random from $X$.
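The majority-vote step that converts five guessed permutations into a robust-sorting instance is straightforward; a sketch (the function names are ours):

```python
def make_majority_oracle(perms):
    """Given (guesses for) the five permutations, return an arc oracle
    over the d symbols: a -> b iff a precedes b in a majority of the
    permutations. Symbols misaligned in many of the five permutations
    play the role of the 'bad' elements B of the robust-sort instance."""
    positions = [{s: i for i, s in enumerate(p)} for p in perms]
    def less(a, b):
        votes = sum(pos[a] < pos[b] for pos in positions)
        return 2 * votes > len(positions)
    return less

# A good ordering of this tournament (e.g., via ROBUST-SORT) is then a
# candidate 1-median center close to the optimal x*.
```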
The details of our algorithm for the Ulam median problem are given in the full version of the paper available on ArXiv (https://arxiv.org/abs/2502.07653).
3.2 Related Work
We have already seen the connection between robust sorting and the Feedback Vertex Set in Tournaments (FVST) problem [17, 18]. Another problem related to the FVST problem is the Feedback Arc Set in Tournaments (FAST) problem [1, 15], where the goal is to find an ordering of the nodes of a given tournament such that the number of edges going backward (an edge is said to go backward if it is directed from a node that comes later to a node that comes earlier as per the ordering) is minimized. This is a restricted variant of the maximum acyclic subgraph problem [8], where the goal is to find the maximum subset of edges that induces a directed acyclic graph (DAG) in a directed graph. The FAST problem may be seen as robust sorting under adversarial corruption of edges rather than adversarial corruption of nodes, as in our formulation. The FVST and FAST problems are known to be NP-hard. A 2-approximation, which is tight under the Unique Games Conjecture (UGC), is known [17] for the FVST problem, and a PTAS is known [15] for the FAST problem.
Several works have been done on sorting in the presence of a noisy comparison operator, also called noisy sorting. Feige et al. [7] consider a noise model in which the comparison operator gives the correct answer independently with probability at least $1 - p$ (for a constant $p < 1/2$) each time a query is made on an element pair. This can be regarded as noisy sorting with resampling, since we can get the correct answer for a pair (with high probability) by repeatedly querying the operator on the same pair. So, each time a comparison needs to be made for a pair, by repeatedly querying $O(\log n)$ times, one can obtain an $O(n \log^2 n)$ algorithm. A better algorithm with $O(n \log n)$ queries can be obtained [7, 14]. In more recent works [19, 12], a deeper investigation was made into the constant, which is dependent on the bias $p$, hidden in the sorting algorithm of [7] for noisy sorting with resampling. A more interesting noise model was considered by Braverman and Mossel [4], where the ordering algorithm cannot repeat a comparison query. (This can also be modeled by the constraint that the errors are independent but persistent, i.e., if a comparison is repeated, then you get the same answer.) This is called the noisy sorting without resampling (NSWR) problem. The NSWR problem can also be seen as a stochastic version of the Feedback Arc Set on Tournaments (FAST) problem – the tournament is generated using the noisy comparator (with respect to some total order $\pi$), and the goal is to find an ordering of the vertices such that the number of edges going backward is minimized. [4] gave a polynomial-time algorithm that outputs an optimal ordering with high probability. Another objective function in this setting is to minimize the maximum dislocation and the total dislocation of elements, where the dislocation of an element is the difference between its ranks in $\pi$ and in the output ordering. Optimal bounds for this have been achieved in a recent sequence of works [10, 9].
4 Algorithm for Robust Sort
In this section, we present our algorithm for Robust Sort (Algorithm 1). The algorithm discards a subset $D$ of $V$ and returns an ordering $\sigma$ on the remaining elements $V \setminus D$. We can obtain an ordering on $V$ by appending an arbitrary ordering on $D$ to $\sigma$. Since the size of $D$ shall turn out to be small, the resulting ordering will be close to $\pi$.
Algorithm 1 is based on a divide-and-conquer strategy similar to quick-sort. Consider a recursive sub-problem given by a subset $S$ of $V$. We choose a sampling parameter $t$ and a triangle-count threshold $\gamma$ (line 1.4), where $t$ is poly-logarithmic in $n$, and then proceed in four steps (a schematic rendering of the whole procedure follows the list):
1. Testing random triplets for triangles: In this step, we sample $t$ triplets of elements from $S$ uniformly at random (line 1.6). For each such triplet $(x, y, z)$, we check if it forms a directed triangle, i.e., if $T$ contains the arcs $(x, y)$, $(y, z)$ and $(z, x)$. If so, we discard (the elements of) the triangle (line 1.9) and go back to the beginning of the procedure. Note that checking whether a triplet forms a triangle requires three queries to $T$.
2. Finding a balanced pivot: In this step, the algorithm tries to find a good pivot – this pivot finding step is repeated $O(\log n)$ times (line 1.11) to ensure that one of these succeeds with high probability. We first select a uniformly random element $p$ of $S$ as the pivot (line 1.12). Then we check whether it is a balanced pivot as follows: we sample (with replacement) $t$ elements from $S$ and partition these elements with respect to $p$ (lines 1.15–1.20). Let $A_\ell$ and $A_r$ denote the partitioning of these elements with respect to $p$. In line 1.21, we check if both these sets are of size at least $t/8$. If not, we repeat the process of finding a pivot (line 1.22). Otherwise, we continue to the next step. Note that if this pivot finding iteration fails for all the trials, then we discard all elements of $S$ (line 1.35).
3. Testing for triangles involving the pivot: In this step, we test if the balanced pivot $p$ chosen in the previous step forms a triangle with randomly chosen pairs of elements from $S$. More formally, we repeat the following process $t$ times (line 1.23): sample two elements $u$ and $v$ (with replacement) from $S \setminus \{p\}$. If $(p, u, v)$ forms a triangle, we discard these three points from $S$ (line 1.26) and go back to the beginning of the procedure (line 1.5).
4. Recursively solving the subproblems: Assuming the pivot $p$ found in the second step above does not yield a triangle in the third step above, we partition the entire set $S$ with respect to $p$ into two sets $S_\ell$ and $S_r$ respectively (lines 1.28–1.33). We recursively call the algorithm on $S_\ell$ and $S_r$ and output the concatenation $\sigma_\ell \cdot p \cdot \sigma_r$ of the orderings returned by the recursive calls (line 1.34).
5 Analysis
In this section, we analyse Algorithm 1. Let the input instance be given by a tournament $T = (V, E)$, and assume that $T$ is $B$-imperfect with respect to a total order $\pi$ on $V$, for some subset $B$ of $V$. Let $G = V \setminus B$. Since $T$ is $B$-imperfect, we know that $T$ induced on $G$ is a DAG. We shall refer to the elements in $G$ as good elements. Let $\pi_G$ be the restriction of $\pi$ to the good elements. We begin with some key definitions:
Definition 4 (Balanced partition).
Let $p$ be an element of a subset $S \subseteq V$. A partition $(S_\ell, S_r)$ of $S \setminus \{p\}$ with respect to the pivot $p$ is said to be balanced if $\min(|S_\ell|, |S_r|) \ge |S|/16$.
Definition 5 (Support and loss of a sequence).
Let $\sigma$ be a sequence on a subset $S$ of elements in $V$. The support of the sequence $\sigma$, denoted $\mathrm{supp}(\sigma)$, is defined as $\mathsf{LCS}(\sigma, \pi_G)$, i.e., the longest subsequence of good elements in $\sigma$ which appear in the same order as in $\pi_G$. If there are multiple choices for $\mathrm{supp}(\sigma)$, we choose the one that is lexicographically smallest with respect to the indices in $\sigma$. Let $\mathrm{loss}(\sigma)$, the loss of $\sigma$, be defined as the number of good elements in $S$ that are not in $\mathrm{supp}(\sigma)$.
Definition 6 (Concatenation Loss).
Consider two sequences $\sigma_1$ and $\sigma_2$ (on disjoint subsets). Let $\sigma_1 \cdot \sigma_2$ be the sequence formed by the concatenation of $\sigma_1$ and $\sigma_2$. The concatenation loss of sequences $\sigma_1$ and $\sigma_2$, denoted $\mathrm{conloss}(\sigma_1, \sigma_2)$, is defined as $\mathrm{loss}(\sigma_1 \cdot \sigma_2) - \mathrm{loss}(\sigma_1) - \mathrm{loss}(\sigma_2)$.
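To make Definitions 5 and 6 concrete, here is one way to compute these quantities (a sketch; `good_rank` maps each good element to its position in $\pi_G$ and omits bad elements):

```python
from bisect import bisect_left

def support_size(sigma, good_rank):
    """|supp(sigma)|: longest subsequence of good elements of sigma
    appearing in pi_G-order, via longest increasing subsequence."""
    tails = []
    for r in (good_rank[x] for x in sigma if x in good_rank):
        i = bisect_left(tails, r)
        if i == len(tails):
            tails.append(r)
        else:
            tails[i] = r
    return len(tails)

def loss(sigma, S, good_rank):
    """Good elements of the ground set S missing from supp(sigma)."""
    n_good = sum(1 for x in S if x in good_rank)
    return n_good - support_size(sigma, good_rank)

def concat_loss(sig1, S1, sig2, S2, good_rank):
    """conloss(sig1, sig2) = loss(sig1 . sig2) - loss(sig1) - loss(sig2)."""
    return (loss(list(sig1) + list(sig2), list(S1) + list(S2), good_rank)
            - loss(sig1, S1, good_rank) - loss(sig2, S2, good_rank))
```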
Fix a subset $S \subseteq V$, and consider the recursive call corresponding to $S$. Observe that the set changes during the run of this recursive call – we shall use the index $i$ to denote an iteration of the while loop in line 1.5 and use $S^i$ to denote the set during this iteration. Note that each iteration of the while loop either ends with the removal of a triangle or with recursive calls to smaller subproblems ($S_\ell$ and $S_r$). We define several failure events with respect to an iteration of the while loop and show that these events happen with low probability:
- $\mathcal{E}_1$: $S^i$ has at least $\gamma$ triangles, yet no triangle is found in the for loop in lines 1.6–1.10.
- $\mathcal{E}_2$: some pivot chosen in line 1.12 is not balanced with respect to $S^i$, yet it passes the sampled balance test in line 1.21.
- $\mathcal{E}_3$: $S^i$ has at most $|S^i|/4$ bad elements, yet all the pivot-finding iterations of the for loop in line 1.11 fail (so line 1.35 is executed).
We now show that with high probability, none of the failure events happen.
Lemma 7.
Let $|S|$ be large enough (greater than some constant). Then $\Pr[\mathcal{E}_1 \cup \mathcal{E}_2 \cup \mathcal{E}_3] \le 1/\mathrm{poly}(n)$.
Proof.
We first consider $\mathcal{E}_1$. Suppose $S$ has at least $\gamma$ triangles. The probability that an iteration of the for loop in lines 1.6–1.10 does not find a triangle is at most $1 - \gamma/\binom{|S|}{3}$. Thus, the probability that none of the $t$ iterations finds a triangle is at most $\left(1 - \gamma/\binom{|S|}{3}\right)^{t} \le e^{-t\gamma/\binom{|S|}{3}}$, which is $1/\mathrm{poly}(n)$ for our choice of $t$ and $\gamma$.
We now analyze the event $\mathcal{E}_2$. Suppose we choose a pivot $p$ in line 1.12 which is not balanced w.r.t. $S$. We need to argue that we shall execute line 1.22 (which happens when $\min(|A_\ell|, |A_r|) < t/8$) with high probability. Let $S_\ell$ and $S_r$ denote the sets of elements in $S$ that are lesser than and greater than $p$, respectively. Assume $|S_\ell| < |S|/16$ (the other case is similar). Thus, the expected size of $A_\ell$ is at most $t/16$. It follows from standard Chernoff bounds that the probability that $|A_\ell| \ge t/8$ is at most $e^{-\Omega(t)}$. Since we can execute line 1.12 at most $O(\log n)$ times during a particular iteration of the while loop, it follows by the union bound that $\Pr[\mathcal{E}_2] \le 1/\mathrm{poly}(n)$ (for large enough $t$).
Finally, we consider the event $\mathcal{E}_3$. Suppose $S$ has at most $|S|/4$ bad elements, i.e., there are at least $3|S|/4$ good elements. Consider the middle (in the ordering $\pi_G$) $|S|/4$ good elements of $S$. Any such element has at least $|S|/4$ good elements on either side. It follows that if the pivot is chosen among these elements, then the expected size of each of the sets $A_\ell, A_r$ defined in lines 1.13–1.20 is at least $t/4$. Thus, for such a pivot, using the Chernoff bound, the probability of executing line 1.22 (that is, $\min(|A_\ell|, |A_r|) < t/8$) is at most $e^{-\Omega(t)}$. Hence, the probability that we do not execute line 1.35 in a particular iteration of the for loop in line 1.11 is at least $\frac{1}{4}\left(1 - e^{-\Omega(t)}\right) \ge \frac{1}{8}$. Thus, the probability that we execute line 1.35 despite failing in each of the $O(\log n)$ iterations of this for loop is at most $(7/8)^{O(\log n)} \le 1/\mathrm{poly}(n)$.
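For reference, the two standard multiplicative Chernoff bounds invoked above are, for a sum $X$ of independent 0/1 random variables with mean $\mu$ (the constants in the exponents vary across textbooks):

```latex
\Pr[X \ge 2\mu] \;\le\; e^{-\mu/3},
\qquad\qquad
\Pr[X \le \mu/2] \;\le\; e^{-\mu/8}.
```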
Induction Hypothesis.
Given a subset $S$ of $V$, let $\sigma_S$ be the sequence generated by ROBUST-SORT($S$). Note that $\sigma_S$ is a random sequence. Given integers $m$ and $b$, let $\mathcal{P}(m, b)$ denote all subsets of $V$ of size $m$ which have $b$ bad elements. Let $L(m, b)$ denote the maximum expected loss of the sequence output by Algorithm 1 when run on a subset in $\mathcal{P}(m, b)$. More formally, $L(m, b) = \max_{S \in \mathcal{P}(m, b)} \mathbb{E}[\mathrm{loss}(\sigma_S)]$.
We now state the induction hypothesis that shall prove the main result Theorem 2:
$$L(m, b) \;\le\; \alpha \, b, \qquad (1)$$
where $\alpha = \alpha(\epsilon)$ is a suitable constant (we suppress lower-order terms arising from the $1/\mathrm{poly}(n)$ failure probabilities).
We prove the above result by induction on $m$. When $m = 1$, the result follows trivially. Now assume it is true for all $(m', b')$ where $m' < m$. Now, we would like to show it for $(m, b)$ for some given $b$. Fix a subset $S \in \mathcal{P}(m, b)$. We need to show that
$$\mathbb{E}[\mathrm{loss}(\sigma_S)] \;\le\; \alpha \, b. \qquad (2)$$
We first check an easy case.
Claim 8.
Suppose $S$ has at least $\gamma$ triangles, where $\gamma$ is as stated in line 1.4. Then $\mathbb{E}[\mathrm{loss}(\sigma_S)] \le \alpha b$.
Proof.
We know by Lemma 7 that the failure event $\mathcal{E}_1$ happens with probability at most $1/\mathrm{poly}(n)$. Thus, with probability at least $1 - 1/\mathrm{poly}(n)$, the algorithm shall make a recursive call on a subset $S'$ obtained from $S$ by removing a triangle found in line 1.9 (notice that, whenever we remove a triangle, we start from scratch after replacing $S$ by $S'$). This triangle must contain at least 1 bad element. We can assume that it has exactly one bad element (otherwise, the induction hypothesis applied on $S'$ only gets stronger because the r.h.s. of Equation (1) is monotonically increasing with $b$). Thus, $\mathbb{E}[\mathrm{loss}(\sigma_S)] \le \mathbb{E}[\mathrm{loss}(\sigma_{S'})] + 2 + \frac{m}{\mathrm{poly}(n)}$, where the last term on the r.h.s. appears because the loss can never exceed $m$, and the term $2$ accounts for the (at most two) good elements of the removed triangle. Applying the induction hypothesis on $S'$, we see that the above is at most
$\alpha(b - 1) + 2 + \frac{m}{\mathrm{poly}(n)} \le \alpha b$, where the second inequality uses $\alpha \ge 3$ and assumes that $n$ exceeds a large enough constant (for a fixed $\alpha$). Henceforth, assume that $S$ has at most $\gamma$ triangles. Further, if the algorithm finds a triangle in line 1.9, then the same argument as above applies. Thus, we can condition on the event that no triangles are found in the for loop in lines 1.6–1.10. We can also assume that $b \le m/4$, otherwise the condition (2) follows trivially (as the loss is at most $m \le 4b \le \alpha b$ for $\alpha \ge 4$). Assume that $\mathcal{E}_3$ does not occur (we shall account for this improbable event later). In that case, we know that we shall find a balanced pivot $p$ of $S$ during one of the iterations of the for loop in line 1.11, and such that the test in line 1.21 passes. Let $S_\ell$ and $S_r$ be as defined in lines 1.28–1.33. We now give a crucial definition.
Definition 9.
Let $p \in S$ be a pivot and $(S_\ell, S_r)$ be the partition of $S \setminus \{p\}$ consisting of elements lesser than and greater than $p$, respectively. Define $\mathrm{cl}(p)$ as: $\mathrm{cl}(p) = \max_{\sigma_\ell, \sigma_r} \mathrm{conloss}(\sigma_\ell, \sigma_r)$, where $\sigma_\ell$ varies over all sequences of distinct elements of $S_\ell$ and $\sigma_r$ varies over all sequences of distinct elements of $S_r$.
Recall that we say that $u$ is less than $v$ if $(u, v)$ is an arc of $T$. Using the definition of loss, it is easy to verify that $\mathrm{cl}(p) = 0$ if $p$ is a good element. For an element $p \in S$, let $q_p$ denote the probability that no triangles are found in the for loop in line 1.23, conditioned on the fact that $p$ is chosen as the pivot in line 1.12 and the execution reaches the for loop in line 1.23. We have the following key lemma:
Lemma 10.
For any $p \in S$, $1 - \frac{2t\,\Delta_p}{(|S|-1)^2} \;\le\; q_p \;\le\; \exp\!\left(-\frac{t\,\mathrm{cl}(p)^2}{4|S|^2}\right)$, where $\Delta_p$ denotes the number of triangles in $S$ containing $p$.
Proof.
We first prove the lower bound on $q_p$. Consider an element $p$. The probability that during an iteration of the for loop in line 1.23, we pick a fixed (unordered) pair $\{u, v\}$ is at most $2/(|S|-1)^2$. Since there are $\Delta_p$ triangles containing $p$, the probability that we pick a triangle containing $p$ is at most $2\Delta_p/(|S|-1)^2$. Thus, the probability that we do not pick any such triangle during the $t$ iterations of the for loop in line 1.23 is at least $1 - \frac{2t\,\Delta_p}{(|S|-1)^2}$.
Now, we prove the upper bound. Assume $\mathrm{cl}(p) > 0$; otherwise, the claim is trivial. Let $S_\ell$ and $S_r$ denote the partition of $S \setminus \{p\}$ with respect to the pivot $p$. Let $\sigma_\ell$ and $\sigma_r$ be sequences over subsets of $S_\ell$ and $S_r$ respectively such that $\mathrm{conloss}(\sigma_\ell, \sigma_r) = \mathrm{cl}(p)$. Recall that $\mathrm{supp}(\sigma_\ell)$ (or $\mathrm{supp}(\sigma_r)$) is the longest common subsequence between $\sigma_\ell$ (or $\sigma_r$) and the optimal sequence $\pi_G$ on the good elements. In particular, $\mathrm{supp}(\sigma_\ell)$ and $\mathrm{supp}(\sigma_r)$ consist of sorted good elements.
Let $\tau_\ell$ denote the suffix of length $\mathrm{cl}(p)/2$ of $\mathrm{supp}(\sigma_\ell)$, and $\tau_r$ denote the prefix of length $\mathrm{cl}(p)/2$ of $\mathrm{supp}(\sigma_r)$. We claim that for any $u \in \tau_\ell$ and $v \in \tau_r$, $(p, u, v)$ forms a triangle. Suppose not. Then, $u$ is less than $v$ (both are good, so the arc between them is truthful). But then, consider concatenating the prefix of $\mathrm{supp}(\sigma_\ell)$ till $u$ and the suffix of $\mathrm{supp}(\sigma_r)$ from $v$ (see Figure 1). This sequence of good elements will be sorted and hence a subsequence of $\pi_G$. But the length of this sequence is larger than $|\mathrm{supp}(\sigma_\ell)| + |\mathrm{supp}(\sigma_r)| - \mathrm{cl}(p)$. This implies that $\mathrm{conloss}(\sigma_\ell, \sigma_r) < \mathrm{cl}(p)$, a contradiction. Thus, we see that there are at least $(\mathrm{cl}(p)/2)^2$ triangles involving $p$ and elements of $\tau_\ell \cup \tau_r$. So, any particular iteration of the for loop in line 1.23 shall find such a triangle with probability at least $\frac{\mathrm{cl}(p)^2}{4|S|^2}$. This proves the desired upper bound on $q_p$.
Let $P$ denote the set of all balanced pivots of $S$, that is, the elements $p$ of $S$ for which the partition $(S_\ell, S_r)$ of $S$ is balanced (Definition 4). For a pivot $p$, let $\beta_p$ denote the probability that, when $p$ is chosen as pivot (in line 1.12), we get a balanced partition of the sampled elements, that is, $\min(|A_\ell|, |A_r|) \ge t/8$ (and hence, line 1.22 is not executed). Recall that $b \le m/4$. Hence, assuming $\mathcal{E}_3$ does not occur, there must be some pivot, among the (up to) $O(\log n)$ sampled pivots, that passes the balanced partition test (i.e., line 1.22 is not executed for this pivot). Let $\rho_p$ denote the probability that this pivot is $p$. It is easy to see that $\sum_{p \in S} \rho_p = 1$.
Lemma 11.
Let $M$ denote the set of the middle (in the ordering $\pi_G$) $|S|/4$ good elements of $S$. We have:
(i) $\sum_{p \in M} \rho_p \ge 1/8$;
(ii) for all $p \in M$, $\beta_p \ge 1/2$;
(iii) for all $p \in S$, $\rho_p \le \frac{c \log n}{|S|}$ for some constant $c$.
Proof.
The first statement follows from the definition of $M$ and Lemma 7: each iteration of the for loop in line 1.11 picks a pivot from $M$ with probability $1/4$, and such a pivot passes the test with probability at least $1/2$. Now, notice that $\beta_p \ge 1 - e^{-\Omega(t)}$ for $p \in M$ (this was argued in the proof of Lemma 7 when bounding the probability of $\mathcal{E}_3$), and $e^{-\Omega(t)} \le 1/2$ for large enough $t$. Hence, $\beta_p \ge 1/2$ for all $p \in M$. For the third statement, note that each iteration of the for loop in line 1.11 picks $p$ as the candidate pivot with probability $1/|S|$. Also, there are at most $O(\log n)$ iterations of this for loop. Hence, $\rho_p \le \frac{c \log n}{|S|}$ for all $p \in S$.
Now, once a pivot passes the partition test, there are two possibilities:
(a) With probability $q_p$, we partition $S$ into $S_\ell$ and $S_r$ with respect to $p$ and recursively call ROBUST-SORT($S_\ell$) and ROBUST-SORT($S_r$). Let the output of these recursive calls be sequences $\sigma_\ell$ and $\sigma_r$, respectively (recall that the output of ROBUST-SORT($S_\ell$) or ROBUST-SORT($S_r$) can be a sequence on a subset of $S_\ell$ or $S_r$ respectively). Let $\sigma$ denote the concatenated sequence $\sigma_\ell \cdot p \cdot \sigma_r$. If $p$ is a good element, then $\mathrm{cl}(p) = 0$, and hence $\mathrm{loss}(\sigma) \le \mathrm{loss}(\sigma_\ell) + \mathrm{loss}(\sigma_r)$. If $p$ is a bad element, $\mathrm{loss}(\sigma)$ is the same as the loss of the sequence $\sigma_\ell \cdot \sigma_r$. The latter quantity, by the definition of $\mathrm{cl}(p)$, is at most $\mathrm{loss}(\sigma_\ell) + \mathrm{loss}(\sigma_r) + \mathrm{cl}(p)$. Let $S_\ell$ and $S_r$ have $b_\ell$ and $b_r$ bad elements, and let their sizes be $m_\ell$ and $m_r$. For a good pivot, $b_\ell + b_r = b$; for a bad pivot, $b_\ell + b_r = b - 1$. In either case,
$$\mathbb{E}[\mathrm{loss}(\sigma)] \;\le\; L(m_\ell, b_\ell) + L(m_r, b_r) + \mathrm{cl}(p) \;\le\; \alpha(b_\ell + b_r) + \mathrm{cl}(p) \;\le\; \alpha b + \mathrm{cl}(p),$$
where the second inequality follows from the induction hypothesis, and the last inequality uses $b_\ell + b_r \le b$.
(b) With probability $1 - q_p$, a triangle is found in lines 1.23–1.27. In this case, we effectively recurse on a subset $S'$ of $S$, where $S'$ has $m - 3$ elements and at most $b - 1$ bad elements. In this case, the induction hypothesis implies that the expected loss of the sequence returned by ROBUST-SORT($S'$) is at most $\alpha(b - 1) + 2$, where the expectation is over the choice of $S'$ and the ROBUST-SORT procedure (the additive $2$ accounts for the good elements of the removed triangle).
Putting everything together, we see that (conditioned on $\mathcal{E}_2 \cup \mathcal{E}_3$ not happening), $\mathbb{E}[\mathrm{loss}(\sigma_S)]$ is at most (recall that $B$ is the set of bad elements):
$$\sum_{p \in S} \rho_p \Big[ q_p\big(\alpha b + \mathrm{cl}(p)\big) + (1 - q_p)\big(\alpha(b-1) + 2\big) \Big] \;\le\; \alpha(b-1) + 2 + \sum_{p \in S} \rho_p\, q_p \big(\mathrm{cl}(p) + \alpha\big), \qquad (3)$$
where the first inequality uses the case analysis above, and the second uses the observation that $\sum_p \rho_p = 1$. The bounds on $\rho_p$ derived in Lemma 11 and the bounds on $q_p$ derived in Lemma 10 now control the remaining sum. Observe that the expression $q_p \cdot \mathrm{cl}(p) \le \exp\big(-t\,\mathrm{cl}(p)^2/(4|S|^2)\big) \cdot \mathrm{cl}(p)$, viewed as a function of $\mathrm{cl}(p)$, is maximized at $\mathrm{cl}(p) = \Theta(|S|/\sqrt{t})$. After Claim 8, we had assumed that $S$ has at most $\gamma$ triangles. Since $\sum_{p \in S} \Delta_p$ is three times the number of triangles in $S$, there can be at most $|S|/8$ elements for which $\Delta_p \ge 24\gamma/|S|$. It follows that there are at least $|S|/8$ elements of $M$ for which $\Delta_p \le 24\gamma/|S|$; by Lemma 10, any such pivot has $q_p$ close to 1, and (by the claim in the proof of Lemma 10) small concatenation loss, since $\mathrm{cl}(p) \le 2\sqrt{\Delta_p}$. Substituting these bounds in (3), a calculation using the choice of $t$ and $\gamma$ shows that the r.h.s. of (3) is at most $\alpha b$, where the last step assumes that $|S|$ exceeds a large enough constant (for a fixed $\epsilon$) and that $b \ge 1$ (for $b = 0$, (2) trivially holds). Finally, accounting for the improbable events $\mathcal{E}_2$ and $\mathcal{E}_3$, each of which occurs with probability at most $1/\mathrm{poly}(n)$ and contributes a loss of at most $m$, we see that the expected loss is at most $\alpha b$. This proves the statement about the quality of the output permutation in Theorem 2. It remains to calculate the expected number of queries made by the algorithm.
Lemma 12.
Conditioned on the failure event $\mathcal{E}_2$ not happening during any of the recursive calls of ROBUST-SORT, the number of queries made by the algorithm is $\widetilde{O}(n)$. The same bound holds for the running time of the algorithm.
Proof.
It is easy to verify that each iteration of the while loop (line 1.5) takes $\widetilde{O}(1)$ time, and either finds a triangle or divides the problem into two smaller subproblems (in this case taking an additional $O(|S|)$ time). Since $\mathcal{E}_2$ does not happen, we have that the subproblem sizes are at most $15|S|/16$. Hence, the time complexity recursion (which subsumes the query complexity) is: $T(|S|) \le T(|S_\ell|) + T(|S_r|) + \widetilde{O}(|S|)$, where $|S_\ell| + |S_r| < |S|$ and $\max(|S_\ell|, |S_r|) \le 15|S|/16$.
It is easy to inductively prove that this recursion solves to $T(n) = \widetilde{O}(n)$.
The probability of $\mathcal{E}_2$ happening during an iteration of the while loop (line 1.5) is at most $1/\mathrm{poly}(n)$ (Lemma 7). Since there are up to $O(n)$ iterations of the while loop overall, using the union bound, the probability that $\mathcal{E}_2$ never happens is at least $1 - 1/\mathrm{poly}(n)$. Also, the worst-case running time (and the query complexity) of ROBUST-SORT is $\widetilde{O}(n^2)$. Hence, the expected running time (also the expected query complexity) of the algorithm is at most: $\widetilde{O}(n) + \frac{1}{\mathrm{poly}(n)} \cdot \widetilde{O}(n^2) = \widetilde{O}(n)$.
Note that, when proving $L(m, b) \le \alpha b$, where $\alpha = \alpha(\epsilon)$, we assumed that $m$ is large enough at certain places. It can be easily verified that $m$ greater than a fixed constant $m_0$ (depending on $\epsilon$) suffices in all these places. To extend the small loss guarantee to smaller $m$, we can simply tweak our algorithm to run the triangle removal algorithm if $m \le m_0$, resulting in a loss of at most $3b$, with a run-time of $O(m_0^2) = O(1)$. This completes the proof of Theorem 2.
6 Conclusion and open problems
We give an algorithm for robust sorting. For a total order $\pi$ on a set of $n$ elements, we are given an imperfect comparison operator that behaves in the following manner: There is an (unknown) subset $B$ such that for every pair $u, v \notin B$, the comparator is consistent with the total order $\pi$, but if even one of $u, v$ is from $B$, then the comparator can give an arbitrary (but deterministic) response. We give an algorithm that outputs an ordering $\sigma$ such that, in expectation, the order of at least $n - (3+\epsilon)|B|$ elements is consistent with $\pi$ (i.e., $\mathbb{E}[|\mathsf{LCS}(\sigma, \pi)|] \ge n - (3+\epsilon)|B|$). Our algorithm runs in time $\widetilde{O}(n)$. This means that it does not compare every pair of elements. Even though a $(3+\epsilon)$ guarantee was sufficient for the Ulam median application, it is natural to ask whether the factor can be improved. More concretely, we identify the following open problem:
What is the best constant $c$, such that there exists a randomized algorithm that asks $\widetilde{O}(n)$ queries in expectation, and outputs an ordering $\sigma$ with $\mathbb{E}[|\mathsf{LCS}(\sigma, \pi)|] \ge n - c\,|B|$? In particular, as our algorithm provides a $(3+\epsilon)$-approximation, it would be interesting to see if $c < 3$ is achievable.
The above open question restricts the number of queries to $\widetilde{O}(n)$. However, the problem remains interesting even if there is no bound on the number of queries. In particular, we ask the following open question:
What is the best constant $c$, such that there exists a polynomial time randomized algorithm, with no bound on the number of queries, that outputs an ordering $\sigma$ with $\mathbb{E}[|\mathsf{LCS}(\sigma, \pi)|] \ge n - c\,|B|$? In particular, as a reduction to FVST yields a 3-approximation to Robust Sort, it would be interesting to see if $c < 3$ is achievable in polynomial time.
References
- [1] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5), November 2008. doi:10.1145/1411509.1411513.
- [2] Anup Bhattacharya, Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. On Sampling Based Algorithms for k-Means. In Nitin Saxena and Sunil Simon, editors, 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020), volume 182 of Leibniz International Proceedings in Informatics (LIPIcs), pages 13:1–13:17, Dagstuhl, Germany, 2020. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.FSTTCS.2020.13.
- [3] Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Faster algorithms for the constrained k-means problem. Theory of Computing Systems, 62(1):93–115, 2018. doi:10.1007/s00224-017-9820-7.
- [4] Mark Braverman and Elchanan Mossel. Noisy sorting without resampling. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08, pages 268–276, USA, 2008. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1347082.1347112.
- [5] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximating the Median under the Ulam Metric, pages 761–775. SIAM, 2021. doi:10.1137/1.9781611976465.48.
- [6] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Clustering Permutations: New Techniques with Streaming Applications. In Yael Tauman Kalai, editor, 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), volume 251 of Leibniz International Proceedings in Informatics (LIPIcs), pages 31:1–31:24, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ITCS.2023.31.
- [7] Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with noisy information. SIAM J. Comput., 23(5):1001–1018, 1994. doi:10.1137/S0097539791195877.
- [8] Alan Frieze and Ravi Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175–220, 1999. doi:10.1007/s004930050052.
- [9] Barbara Geissmann, Stefano Leucci, Chih-Hung Liu, and Paolo Penna. Optimal Sorting with Persistent Comparison Errors. In Michael A. Bender, Ola Svensson, and Grzegorz Herman, editors, 27th Annual European Symposium on Algorithms (ESA 2019), volume 144 of Leibniz International Proceedings in Informatics (LIPIcs), pages 49:1–49:14, Dagstuhl, Germany, 2019. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ESA.2019.49.
- [10] Barbara Geissmann, Stefano Leucci, Chih-Hung Liu, and Paolo Penna. Optimal dislocation with persistent errors in subquadratic time. Theory of Computing Systems, 64(3):508–521, 2020. doi:10.1007/s00224-019-09957-5.
- [11] Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. FPT Approximation for Constrained Metric k-Median/Means. In Yixin Cao and Marcin Pilipczuk, editors, 15th International Symposium on Parameterized and Exact Computation (IPEC 2020), volume 180 of Leibniz International Proceedings in Informatics (LIPIcs), pages 14:1–14:19, Dagstuhl, Germany, 2020. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.IPEC.2020.14.
- [12] Yuzhou Gu and Yinzhan Xu. Optimal bounds for noisy sorting. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, pages 1502–1515, New York, NY, USA, 2023. Association for Computing Machinery. doi:10.1145/3564246.3585131.
- [13] Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple $D^2$-sampling based PTAS for $k$-means and other clustering problems. Algorithmica, 70(1):22–46, 2014. doi:10.1007/s00453-013-9833-9.
- [14] Richard M. Karp and Robert Kleinberg. Noisy binary search and its applications. In SODA ’07, pages 881–890, USA, 2007. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1283383.1283478.
- [15] Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, STOC ’07, pages 95–103, New York, NY, USA, 2007. Association for Computing Machinery. doi:10.1145/1250790.1250806.
- [16] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1–5:32, February 2010. doi:10.1145/1667053.1667054.
- [17] Daniel Lokshtanov, Pranabendu Misra, Joydeep Mukherjee, Fahad Panolan, Geevarghese Philip, and Saket Saurabh. 2-approximating feedback vertex set in tournaments. ACM Trans. Algorithms, 17(2), April 2021. doi:10.1145/3446969.
- [18] Matthias Mnich, Virginia Vassilevska Williams, and László A. Végh. A 7/3-Approximation for Feedback Vertex Sets in Tournaments. In Piotr Sankowski and Christos Zaroliagis, editors, 24th Annual European Symposium on Algorithms (ESA 2016), volume 57 of Leibniz International Proceedings in Informatics (LIPIcs), pages 67:1–67:14, Dagstuhl, Germany, 2016. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ESA.2016.67.
- [19] Ziao Wang, Nadim Ghaddar, and Lele Wang. Noisy sorting capacity. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 2541–2546, 2022. doi:10.1109/ISIT50566.2022.9834370.