Randomized Binary and Tree Search Under Pressure
Abstract
We study a generalized binary search problem on the line and general trees. On the line (e.g., a sorted array), binary search finds a target node in $O(\log n)$ queries in the worst case, where $n$ is the number of nodes. In time-constrained applications, we might only have time to perform a sub-logarithmic number of queries. In this case, it is impossible to guarantee that the target will be found regardless of its position. Our main result is the construction of a randomized strategy that maximizes the minimum (over the target’s position) probability of finding the target. Such a strategy provides a natural solution when there is no a priori (stochastic) information about the target’s position. As with regular binary search, we can find and run the strategy in $O(\log n)$ time (and using only $O(\log n)$ random bits). Our construction is obtained by reinterpreting the problem as a two-player zero-sum game and exploiting an underlying number-theoretical structure.
For the more general case on trees, querying an edge returns the edge’s endpoint closest to the target. Given a bound $k$ on the number of queries, we quantify a the-less-queries-the-better approach by defining a seeker’s profit depending on the number of queries needed to locate the hider. For the linear programming formulation of the corresponding zero-sum game, we show that computing the best response for the seeker (that is, the separation problem of the underlying dual LP) can be done in time $2^{O(k)} \cdot n^{O(1)}$, where $n$ is the size of the tree. This result allows us to compute a Nash equilibrium in polynomial time whenever $k = O(\log n)$. In contrast, computing the best response for the seeker is NP-hard in general.
Keywords and phrases: Binary Search, Search Trees on Trees, Nash Equilibrium
Category: Track A: Algorithms, Complexity and Games
Funding: Agustín Caracci: Partially supported by Fondecyt-ANID Nr. 1221460.
2012 ACM Subject Classification: Theory of computation → Sorting and searching; Theory of computation → Algorithmic game theory; Theory of computation → Dynamic programming
Acknowledgements: We thank several anonymous reviewers. Their insightful comments helped to improve the presentation of this document significantly.
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Searching for an object on a line (or, equivalently, an ordered list) with $n$ vertices is one of the most fundamental tasks in computer science. Binary search, one of the most basic discrete algorithms, produces a search strategy that finds the target item with $O(\log n)$ comparisons, regardless of its position. However, when we encounter tight time constraints, we face the problem of searching under a potentially sub-logarithmic budget of comparisons. In this paper, we study search problems on paths and, more generally, trees, where the given number of queries does not allow us to search the complete set of nodes.
A prominent recent example arises in the search for infections on a sewer network, a strategy that attracted significant attention during the COVID-19 pandemic [1, 2, 17, 23, 26, 28]. In this context, we can query a node of the sewer network – via a PCR test – and ask whether traces of the virus are found in the sampled water. Since a sewer network can (usually) be modeled as an in-tree [6] – which carries water from different locations to the water-treatment plant, i.e., the root node – a query on a node $v$ tells us if the infection lies in the subtree rooted at $v$, or in its complement. As the water flows towards the root, querying $v$ is equivalent to querying the edge $\{v, p(v)\}$, where $p(v)$ is the parent of $v$. Hence, we can equivalently state the problem for an undirected tree $T$, where querying edge $e$ indicates the connected component of $T - e$ that contains the target node.
A core but poorly understood question considers the search for the first infection in a particular locality. Unlike other models found in the literature, after the infection is first detected at the water-treatment plant, the system designer faces intense pressure to find the source of the infection within a fixed time constraint: if the infected node is found within a pre-specified number of days, the spread of the infection can be prevented; otherwise, the infection will spread uncontrollably. Observe that this strategy is attractive because the virus can be detected in the wastewater before symptoms appear [20].
Before stating our model in general, consider the example shown in Figure 1, where the network is just a path with nodes $\{0, 1, \ldots, n-1\}$. A target (infected) node $t$ is hidden in the network. We can devise a search strategy that queries edges; each query answers whether the target is to the left or to the right of the queried edge. We aim to find (and certify) the target’s position within a given number of queries $k$. In our COVID motivation, $k$ is the number of queries we can perform between the moment the infection is detected in the network and the moment the infected person starts spreading the virus. As one can imagine, the number of queries might not be enough to search the entire path, that is, $k$ is sub-logarithmic. In Figure 1(a), we depict a search strategy (i.e., a binary tree) and the positions where the target would be found, which is, for this search strategy, a subpath. If we do not have a priori information on the position of $t$, selecting a deterministic strategy will leave blind spots (i.e., the white boxes). Choosing a randomized strategy allows us to guarantee a non-zero probability of finding the target, regardless of its position. We study the problem of devising a randomized strategy that maximizes this guaranteed probability, i.e., that maximizes the worst-case (with respect to the position of the target) probability of finding the target. Uniformly randomizing the strategy in Figure 1(a) with the mirrored strategy (which finds the target if it lies in the mirrored subpath) yields the optimal solution in this case, where the target is found with a probability of at least 0.5, regardless of its position. In contrast, the solution in Figure 1(b) yields a strategy that is never in the support of an optimal solution.
The described setting can be equivalently posed as a two-player zero-sum game on a path $P$. The first player, the hider, chooses a node $t$ to hide (i.e., the target). The second player, the seeker, must decide on a search strategy, i.e., an adaptive sequence of edges to query to find the hidden node $t$. If edge $e$ is queried, the seeker learns whether $t$ is to the left or to the right of $e$. The game finishes after $k$ queries or after the seeker can univocally determine the position of $t$. If the seeker finds $t$ within the given number of queries, she gets a profit of 1, and the hider obtains a profit of $-1$ (i.e., she pays a cost of 1). If after the $k$ queries the seeker cannot certify the position of $t$, then both players get a profit of 0. We want to find a mixed (i.e., randomized) strategy for each player that yields a Nash equilibrium. That is, the seeker (respectively, the hider) aims to maximize (resp., minimize) the probability that the target is found within $k$ queries. For the example in Figure 1, a Nash equilibrium consists of the mentioned optimal strategy for the seeker (randomizing between Figure 1(a) and the mirrored solution) and a solution for the hider that picks a random vertex from a suitable subset of the path. For a given randomized search strategy, let $p_v$ be the probability of finding the target if $t = v$. If the randomized solution is part of a Nash equilibrium, the hider’s best response is a probability distribution supported within $\arg\min_{v} p_v$, which yields a profit of $\min_{v} p_v$ for the seeker. Therefore, in a Nash equilibrium, the seeker maximizes $\min_{v} p_v$, which was our original goal.
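To make the objective concrete, the following small Python sketch computes $\min_v p_v$ for a mixed seeker strategy given by its covered sets; the covered sets and the instance below are hypothetical stand-ins in the spirit of Figure 1.

```python
from fractions import Fraction

def min_find_probability(n, mixed):
    """mixed: list of (probability, covered_set) pairs describing a
    randomized seeker strategy; returns min over nodes v of p_v."""
    p = [Fraction(0)] * n
    for prob, covered in mixed:
        for v in covered:
            p[v] += prob
    return min(p)

# Uniform mix of a strategy covering a left subpath and its mirror
# (hypothetical covered sets, in the spirit of Figure 1(a)):
n = 8
left = set(range(5))               # e.g., nodes {0,...,4}
right = {n - 1 - v for v in left}  # the mirrored strategy
half = Fraction(1, 2)
print(min_find_probability(n, [(half, left), (half, right)]))  # 1/2
```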
The advantage of interpreting our problem as a zero-sum game is twofold: (i) it allows us to interpret a hider’s Nash strategy as a dual certificate of optimality of the seeker’s strategy (by exploiting von Neumann’s minimax theorem or, equivalently, LP-duality; see Section 2 for details), and (ii) the Nash equilibrium explicitly captures situations when both the hider and seeker act strategically, in particular, when the target is hidden adversarially. This last point is particularly relevant in other applications. For example, similar search problems have been proposed to detect the presence of illicit drugs [30], pollution [16] and even illicit explosives [13] in water networks.
Inspired by these applications, we extend our model in two directions. First, we consider the case in which the graph is not only a path, but an arbitrary tree $T$. As before, the seeker queries edges. If $e$ is queried, she obtains as answer the connected component of $T - e$ that contains $t$. As a second extension, we further consider a time-dependent non-increasing profit function $\pi$, where $\pi(q) = 0$ if $q > k$. If the seeker finds the hider with $q$ queries, she obtains a profit of $\pi(q)$, and the hider perceives a cost of $\pi(q)$ (or, equivalently, a profit of $-\pi(q)$). We obtain the same problem as above if $\pi(q) = 1$ for $q \leq k$, which we call the unit-profit case. As before, the objective is to find a mixed Nash equilibrium efficiently.
Model discussion
Our paper studies a model where the seeker must certify exactly the position of the target at the end of the procedure. Guessing the target at the end – without spending queries to certify it – yields zero profit. This feature is motivated by cases where the seeker is accountable for actions taken after the target is found, which include, e.g., forcing quarantines or other extreme measures. It is interesting to consider extensions where some profit might be gained depending on the size of the target’s range after the queries. We leave such models for future work.
Related Literature
Previous literature mostly focuses on minimizing the worst-case number of queries needed to find the target. When the target is hidden in a tree, Lam and Yue [22] (see also [15]) provide a linear-time algorithm that generates an optimal search tree for the edge query model, a result that was later rediscovered in a sequence of papers, including the works by Ben-Asher et al. [3], Onak and Parys [27], and Mozes et al. [25]. A more general setting was later considered by Cicalese et al. [12], where each edge has a different cost and the objective is to find the target with the worst-case minimum total cost. This version turns out to be strongly NP-hard and admits an approximation algorithm with a slightly sublogarithmic approximation factor.
An alternative model considers queries on the vertices of the tree. If the target is not in the queried vertex, one learns which of the neighbors is closest to the target. For this model, Schäffer [29] provides a linear-time algorithm generating an optimal search tree. Minimizing the number of vertex queries on an arbitrary undirected graph becomes quite intriguing: the problem can be solved in quasi-polynomial time, and this dependency is necessary, as a significantly faster algorithm would contradict the Strong Exponential-Time Hypothesis (SETH) [18]. For directed acyclic graphs, a vertex query returns three types of answers: the target is the queried vertex, the target is reachable from the queried vertex, or it is not. In this model, minimizing the number of queries is NP-complete [9].
A different line of research considers a known probability distribution of the position of the target, and aims to find a search strategy (either for node or edge queries) that minimizes the expected number of queries until the target is found. Knuth [21] shows that dynamic programming can solve this problem on the line with running time $O(n^2)$, while Mehlhorn [24] shows that bisecting the probability mass in each iteration yields a constant-factor approximation algorithm. For the more complicated problem on trees in the edge query model, Cicalese et al. [10, 11] show that finding an optimal search strategy is NP-hard; moreover, it admits a 1.62-approximation algorithm and even an FPTAS on trees of bounded degree. On the other hand, for node queries, the complexity of the problem is still open. Recently, Berendsohn and Kozma [5] show that the problem admits a PTAS, and Berendsohn et al. [4] analyze the algorithm that iteratively selects the centroid of the tree. Besides giving a fast output-sensitive algorithm for finding the centroid, they show that the obtained strategy is a 2-approximation, and that the analysis is tight even when the target distribution is uniform.
Our Contribution
We study two scenarios: (i) for the unit-profit case on a line, we provide an algorithm that computes the optimal strategy for the seeker in $O(\log n)$ time; (ii) if $T$ is a general tree and the profit function is arbitrary, we show that we can find a Nash equilibrium in time $2^{O(k)} \cdot n^{O(1)}$, while computing the best response for the seeker is NP-hard.
Our main result is (i). Our algorithm samples a search strategy in $O(\log n)$ time (using $O(\log n)$ random bits). Afterwards, choosing the next edge to query takes constant time. The solution picks a random interval and performs binary search on it. The interval is chosen uniformly at random from a set constructed with a greedy algorithm based on modular arithmetic (see Figure 4). This construction naturally connects to Bézout’s identity and the Extended Euclidean algorithm, which we exploit to obtain the claimed running times. On the other hand, we provide a construction that yields an optimal strategy for the hider, which we then use to show that both strategies (hider and seeker) are optimal by linear programming duality. For some regimes of values of $n$ and $k$, the optimal hider’s solution requires an intricate construction, as she needs to unevenly distribute probability mass to avoid giving the seeker an advantage (see Figure 5).
For (ii), where $T$ is an arbitrary tree and $\pi$ is any non-increasing profit function, we again consider the corresponding zero-sum game as a linear program (LP), which has an exponential number of variables. We show that the separation problem of the dual – the problem of computing the best response of the seeker – can be solved in time $2^{O(k)} \cdot n^{O(1)}$. In contrast, when $k$ is part of the input, the separation problem is NP-hard, which justifies the exponential dependency on $k$ (observe, however, that this does not imply that computing a Nash equilibrium is NP-hard). The separation problem of the dual takes as input a distribution of the target node over $V(T)$. It consists of finding a search strategy with at most $k$ queries that maximizes the expected profit of the seeker. We devise a dynamic program that solves this problem. It is based on the concept of visibility vectors [22, 27, 15], introduced for the problem of minimizing the time to find the target in the worst case. By rooting the tree, we can apply a bottom-up approach to extend a solution for each possible visibility vector. The exponential dependency on $k$ comes from the fact that the number of different visibility vectors is exponential in $k$. Together with the Ellipsoid method, we obtain an algorithm to compute a Nash equilibrium in time $2^{O(k)} \cdot n^{O(1)}$.
Observe that in regimes where the budget $k$ is logarithmic, $k = O(\log n)$, our algorithm for the separation problem runs in polynomial time. Moreover, $k$ is logarithmic in many natural cases. For example, if $T$ has maximum degree $\Delta$ and we are in the unit-profit case, then there are search strategies that can unequivocally find the target within $O(\Delta \log n)$ queries [3, 18]. Therefore, if $\Delta$ is a constant, our model only makes sense when $k$ is at most logarithmic, as otherwise we have a strategy that finds the hider with probability 1. Hence, for these cases we can assume that $k = O(\log n)$, and our algorithm takes polynomial time. In particular, this yields a polynomial running time for the separation problem when $T$ is a line, which is significantly worse than our dedicated $O(\log n)$-time solution.
In Section 2 we introduce the problem formally and discuss the connection with zero-sum games. Our main result is given in Section 3, where we show our solution for the case that $T$ is a path. Finally, Section 4 shows the dynamic program for the general case. Some of the technical details are deferred to the appendix.
2 Problem Definition
For an arbitrary graph $G$, let $V(G)$ and $E(G)$ denote its node-set and edge-set, respectively. We will represent edges as sets $\{u, v\}$ or with the shortcut $uv$.
Let $T$ be a tree with $n$ nodes. To represent a search strategy, consider a rooted binary out-tree $S$, where each internal node $s$ is labeled with an edge $e(s) = uv \in E(T)$; see Figure 2 for an example. By definition, the root $r$ of $S$ is labeled, unless $S$ consists of a single node. The edge $e(r)$ corresponds to the first edge queried by the strategy. Let $T_u$ (resp. $T_v$) denote the connected component of $T - uv$ that contains $u$ (resp. $v$). One child of $r$ is associated to $T_u$, and the other child to $T_v$. The output of the query points to the child of $r$ associated to the connected component that contains the target. The search is then resumed in said child. Based on this notation, we give a recursive definition of a search strategy. We say that $S$ with root label $uv$ is a search strategy for $T$ if the subtrees of $r$ in $S$, denoted by $S_u$ and $S_v$, are also search strategies for $T_u$ and $T_v$, respectively. Finally, a tree $S$ consisting of a single (unlabeled) node is a search strategy for any tree $T$.
Let $S$ be a search strategy for $T$. Observe that each node $s$ of $S$ can be naturally associated to a set $V_s \subseteq V(T)$ that potentially contains the target node $t$. More precisely, let us define $V_r$ as $V(T)$ for the root $r$. If $e(r) = uv$, we define recursively $V_{r_u} = V_r \cap V(T_u)$ and $V_{r_v} = V_r \cap V(T_v)$, where $r_u$ (resp. $r_v$) is the root of $S_u$ (resp. $S_v$) and $V(T_u)$ (resp. $V(T_v)$) is the vertex-set of $T_u$ (resp. $T_v$). Hence, at the moment that we query according to node $s$, the set $V_s$ corresponds to the smallest node set in $T$ for which we can guarantee that $t \in V_s$.
Let $L(S)$ be the set of leaves of $S$. It is easy to see that $\{V_\ell : \ell \in L(S)\}$ defines a partition of $V(T)$ into connected components. We say that node $v$ is covered by $S$ if the singleton $\{v\}$ belongs to this partition. The set of nodes covered by $S$ is denoted by $C(S)$. Observe that if the hider chooses $t = v$, then applying the search strategy $S$ we can assert that $v$ is the target if and only if $v \in C(S)$.
Finally, the height of a search strategy is the height of the underlying binary tree, i.e., the length of the longest path from the root to a leaf. For a fixed tree $T$ and budget $k$, we represent with $\mathcal{S}_k$ the set of all the search strategies for $T$ of height at most $k$.
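The recursive definition above can be turned into a few lines of code. The following Python sketch – restricted to the path for brevity – builds a (truncated) binary search strategy and computes its covered set $C(S)$ and its height; the representation is ours and purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Strategy:
    """Search strategy on a sub-path lo..hi of the path 0..n-1:
    query edge {edge, edge+1}; a node with edge=None is a leaf."""
    edge: Optional[int] = None
    left: Optional["Strategy"] = None    # resumes in component lo..edge
    right: Optional["Strategy"] = None   # resumes in component edge+1..hi

def covered(s: Strategy, lo: int, hi: int):
    """Return (C(S), height) of strategy s on the sub-path lo..hi."""
    if s.edge is None:                   # leaf: its set V is {lo..hi}
        return ({lo} if lo == hi else set()), 0
    cl, hl = covered(s.left, lo, s.edge)
    cr, hr = covered(s.right, s.edge + 1, hi)
    return cl | cr, 1 + max(hl, hr)

def binary_search(lo: int, hi: int, k: int) -> Strategy:
    """Plain binary search on lo..hi, truncated to height at most k."""
    if lo == hi or k == 0:
        return Strategy()
    mid = (lo + hi) // 2
    return Strategy(mid, binary_search(lo, mid, k - 1),
                         binary_search(mid + 1, hi, k - 1))

print(covered(binary_search(0, 7, 3), 0, 7))  # all of {0,...,7}, height 3
```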
Let $\Delta(\mathcal{S}_k)$ be the set of distributions supported on $\mathcal{S}_k$, that is,
$$\Delta(\mathcal{S}_k) = \Big\{ x \in [0,1]^{\mathcal{S}_k} : \sum_{S \in \mathcal{S}_k} x_S = 1 \Big\}.$$
For the unit-profit case, our aim is to find a vector $x \in \Delta(\mathcal{S}_k)$ that maximizes the worst-case probability of finding the target, that is,
$$\max_{x \in \Delta(\mathcal{S}_k)} \; \min_{v \in V(T)} \; \sum_{S \in \mathcal{S}_k : v \in C(S)} x_S. \tag{1}$$
More generally, let $\pi : \mathbb{N} \to \mathbb{R}_{\geq 0}$ be a non-increasing function, where $\pi(q)$ represents the profit of finding the target with exactly $q$ queries, and $\pi(q) = 0$ for $q > k$. Our most general model considers the maximization of the worst-case expected profit,
$$\max_{x \in \Delta(\mathcal{S}_k)} \; \min_{v \in V(T)} \; \sum_{S \in \mathcal{S}_k} x_S \cdot \pi(q_S(v)), \tag{2}$$
where $q_S(v)$ is $\infty$ if $v \notin C(S)$ (and hence $\pi(q_S(v)) = 0$) and, otherwise, it equals the distance from the root of $S$ to the leaf $\ell$ such that $V_\ell = \{v\}$.
We can interpret this problem as a two-player zero-sum game as follows. The first player, the seeker, selects a search strategy $S \in \mathcal{S}_k$. The second player, the hider, selects a node $v \in V(T)$. Hence, for a given pure strategy pair $(S, v)$, the seeker obtains a profit of $\pi(q_S(v))$ while the hider incurs a cost of $\pi(q_S(v))$ (or a negative profit of $-\pi(q_S(v))$). A mixed strategy for the seeker is a probability vector $x \in \Delta(\mathcal{S}_k)$, where $x_S$ represents the probability of selecting the search strategy $S$. A mixed strategy for the hider is a vector $y \in \Delta(V(T))$, where $\Delta(V(T))$ denotes the set of probability distributions over $V(T)$. For a pair $(x, y)$, the expected profit of the seeker (cost of the hider) is $P(x, y) := \sum_{S \in \mathcal{S}_k} \sum_{v \in V(T)} x_S\, y_v\, \pi(q_S(v))$. A pair $(x^*, y^*)$ is said to be a Nash equilibrium if no player can unilaterally improve their profit or cost, that is, $P(x, y^*) \leq P(x^*, y^*) \leq P(x^*, y)$ for all $x \in \Delta(\mathcal{S}_k)$ and $y \in \Delta(V(T))$.
Von Neumann’s minimax theorem [31], a fundamental game-theoretic fact, states that a pair $(x^*, y^*)$ is a Nash equilibrium if and only if $x^*$ defines an optimal solution to the LP
$$\text{[P]} \qquad \max \; z \quad \text{s.t.} \quad \sum_{S \in \mathcal{S}_k} \pi(q_S(v))\, x_S \geq z \;\; \text{for all } v \in V(T), \qquad x \in \Delta(\mathcal{S}_k),$$
and $y^*$ is an optimal solution to
$$\text{[D]} \qquad \min \; w \quad \text{s.t.} \quad \sum_{v \in V(T)} \pi(q_S(v))\, y_v \leq w \;\; \text{for all } S \in \mathcal{S}_k, \qquad y \in \Delta(V(T)).$$
Observe that [P] and [D] are dual LPs, and hence they attain the same optimal value. We refer to this value as the optimal profit or the value of the game. Moreover, as in an optimal solution for [P] the value of $z$ is simply the minimum value of $\sum_{S} \pi(q_S(v))\, x_S$ over all $v \in V(T)$, [P] is a restatement of our original problem in Equation (2). For a given probability vector $x$ we say that $\min_{v \in V(T)} \sum_{S} \pi(q_S(v))\, x_S$ is its objective value. Similarly, for a vector $y$ we say that $\max_{S \in \mathcal{S}_k} \sum_{v} \pi(q_S(v))\, y_v$ is its objective value.
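For intuition, [P] can be solved directly for tiny instances once the payoff matrix $\pi(q_S(v))$ is written down explicitly. The following Python sketch does so with scipy’s LP solver; the payoff matrix below is a hypothetical toy example, not an instance derived from a particular tree.

```python
import numpy as np
from scipy.optimize import linprog

def solve_P(profits):
    """Solve [P] for profits[v][S] = pi(q_S(v)).
    Variables: one x_S per strategy column, plus the bound z."""
    V, N = profits.shape
    c = np.zeros(N + 1); c[-1] = -1.0               # maximize z
    A_ub = np.hstack([-profits, np.ones((V, 1))])   # z <= sum_S profits[v,S] x_S
    A_eq = np.ones((1, N + 1)); A_eq[0, -1] = 0.0   # sum_S x_S = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(V), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * N + [(None, None)], method="highs")
    return res.x[-1], res.x[:-1]                    # game value, seeker mix

# Hypothetical 5-node, 3-strategy unit-profit payoff matrix:
P = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0], [0, 0, 1]], float)
value, x = solve_P(P)
print(value, x.round(3))
```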
3 Search on a line
In this section we focus on the special case where $T$ is a line, that is, $V(T) = \{0, 1, \ldots, n-1\}$ and $E(T) = \{\{i, i+1\} : 0 \leq i \leq n-2\}$. Also, we restrict ourselves to unit profit functions, where $\pi(q) = 1$ for all $q \leq k$ and $\pi(q) = 0$ if $q > k$. To avoid trivial border cases, we assume that $k \geq 2$. Moreover, we can assume that $2^k < n$, as otherwise there exists a deterministic search strategy, namely binary search, that finds the target in every single node, and hence the value of the game is trivially 1. The aim of this section is to compute a highly structured Nash equilibrium.
We start by characterizing the sets that arise as covered sets of a search strategy $S \in \mathcal{S}_k$. Then, we restrict the set $\mathcal{S}_k$ to a smaller set of strategies, which we call efficient. Using our characterization of covered sets together with LP duality, we will be able to show that there exists an optimal solution for the seeker that only uses such strategies.
Covered Sets and Efficient Strategies
For a given search strategy $S$, recall that $C(S)$ represents the set of covered nodes. Much of our technical work is dedicated to showing that an optimal randomized solution selects only search strategies where $C(S)$ forms a set of consecutive nodes modulo $n$. For $a, b \in V(T)$, we define $[a, b]_n$ as $\{a, a+1, \ldots, b\}$ if $a \leq b$, and $\{a, \ldots, n-1\} \cup \{0, \ldots, b\}$ otherwise. For example, $[n-2, 1]_n$ is $\{n-2, n-1, 0, 1\}$. We refer to $[a, b]_n$ as an interval modulo $n$, or simply an interval. To simplify notation, we omit the subscript when it equals $n$, writing $[a, b]$ instead of $[a, b]_n$. Additionally, for an integer $\ell \geq 1$, we define $I_n(s, \ell) := [s, s+\ell-1]_n$, which we refer to as the interval that starts at $s$ of length $\ell$, or simply an interval of length $\ell$ starting at $s$. (The length of an interval is not to be confused with the length of a path: the former is given by the number of vertices and the latter by the number of edges in the path.) Observe that this notation describes intervals that “wrap around” the path in a concise manner.
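In code, the interval notation amounts to one line of modular arithmetic, as the following sketch shows.

```python
def interval(n, s, length):
    """I_n(s, length): the `length' consecutive nodes starting at s, mod n."""
    return [(s + i) % n for i in range(length)]

print(interval(10, 8, 4))  # [8, 9, 0, 1] -- an interval that wraps around
```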
It will be particularly useful to understand the sets that appear as a covered set $C(S)$ of a search strategy $S \in \mathcal{S}_k$. For a fixed value of $k$, for the rest of this section we define $m := 2^k - 2$. Observe that a search strategy that aims to cover a single interval avoiding $0$ and $n-1$ (i.e., an interval that does not wrap around or touch the endpoints of the path) is able to cover $m$ nodes: just apply binary search restricted to a given interval of length $m$. On the other hand, if the strategy covers a single interval that contains either $0$ or $n-1$ (or both, by wrapping around), it can cover $m+1$ nodes. Notice also that there cannot be a strategy that covers exactly $I(1, m)$, as this immediately implies that $0$ is also covered; similarly for the interval ending at $n-2$ and node $n-1$. Strategies that cover more than one interval cover a smaller number of nodes. For example, if a strategy covers two disjoint intervals (that do not contain $0$ or $n-1$), the number of covered nodes will be at most $m-1$; see Figure 3(b). The next proposition formalizes this intuition by giving a characterization of maximal covered sets (that is, the ones such that there is no $S' \in \mathcal{S}_k$ with $C(S) \subsetneq C(S')$) for search strategies in $\mathcal{S}_k$.
Proposition 1.
Let $k \geq 2$. Let $C \subseteq V(T)$ be a set where $C = I_1 \cup \cdots \cup I_p$ for pairwise disjoint intervals $I_1, \ldots, I_p$ (and $p$ is minimal). The set $C$ is a maximal covered set for some $S \in \mathcal{S}_k$ if and only if
i) if $0 \notin C$ then $1 \notin C$, and if $n-1 \notin C$ then $n-2 \notin C$,
ii) any two distinct intervals $I_j$ and $I_{j'}$ are at distance at least 2 from each other, and
iii) $|C| = 2^k - p - 1$ if $C \cap \{0, n-1\} = \emptyset$, and $|C| = 2^k - p$ otherwise.
The first two conditions in the proposition encompass the fact that the vertices not covered by a search strategy always have at least one neighbor that is also not covered. The third condition states a trade-off between $p$, the number of intervals in $C$, and the cardinality of $C$. Finally, it also says that if $C$ contains $0$ or $n-1$ (or both), the strategy gains an extra covered node. See Figure 3 for an illustration.
Of particular interest are search strategies that cover a single interval, i.e., with $p = 1$, as they maximize the number of covered elements. We call such strategies efficient. The following corollary is a direct consequence of Proposition 1.
Corollary 2.
Let $m = 2^k - 2$. For every $s \in V(T) \setminus \{1\}$ there exists an efficient search strategy that covers $I(s, m+1)$ if $I(s, m+1) \cap \{0, n-1\} \neq \emptyset$, and $I(s, m)$ otherwise.
Due to the first condition of Proposition 1, no efficient strategy covers an interval starting at $s = 1$. For $s \neq 1$, let $S_s$ be the efficient search strategy that covers $I(s, m)$ or $I(s, m+1)$, as in the corollary.
The Seeker’s Strategy
We will start by constructing a strategy for the seeker given by a greedy algorithm that defines the support of $x$. Given $\mathcal{F}$, the support of $x$, we will define the probability vector uniformly over $\mathcal{F}$, i.e., $x_S = 1/|\mathcal{F}|$ if $S \in \mathcal{F}$ and 0 otherwise.
Intuitively, our aim is to cover all nodes as evenly as possible with efficient strategies, that is, by the same number of strategies in $\mathcal{F}$. Consider, for example, the case depicted in Figure 5. Imagine that we choose $\mathcal{F}$ to be $\{S_0\}$. This alone yields an objective of 0, and hence a natural choice is to also add the efficient strategy that starts where $S_0$ ends. Continuing greedily in this way, the intervals eventually wrap around the path, which implies that some nodes are covered twice while the rest are covered only once. This imbalance suggests we should add more strategies, each starting where the previous one ended. By continually adding strategies to $\mathcal{F}$ in a greedy manner, we obtain the solution in Figure 5. This suggests Algorithm 1, which can be interpreted as a greedy rectangle packing algorithm, see Figure 4 and Figure 5.
Observe that in Line 4 of Algorithm 1 we take the starting point of the next interval modulo $n-1$ instead of modulo $n$, which might be counterintuitive. The reason for this choice is that if we remove node $0$, then for every $s$ the cardinality (length) of each set $C(S_s)$ is $m+1$. This avoids the case distinction of Corollary 2.
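Since the listing of Algorithm 1 is not reproduced here, the following Python sketch records our reading of it: intervals of length $m+1$ are packed greedily around the cycle obtained after removing node $0$, and the loop stops as soon as the next start is $0$ (even coverage) or $1$ (where $S_1$ is undefined). All concrete details are a reconstruction from the surrounding text.

```python
def greedy_support(n, k):
    """Sketch of Algorithm 1 (our reconstruction): greedily pack intervals
    of length m+1 = 2**k - 1 modulo n-1, collecting their starting points."""
    m1 = 2 ** k - 1            # m + 1 (assumption: maximal interval length)
    F, s = [], 0
    while True:
        F.append(s)            # add the efficient strategy S_s to the support
        s = (s + m1) % (n - 1) # Line 4: advance modulo n-1, not n
        if s in (0, 1):        # s=0: even coverage; s=1: S_1 is not defined
            return F

print(greedy_support(12, 2))   # starting points of the packed intervals
```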
Given a set $\mathcal{F}$, the objective value of $x$ is $\min_{v \in V(T)} |\{S \in \mathcal{F} : v \in C(S)\}| / |\mathcal{F}|$. If in the while loop we reach the case $s = 0$, then the probability of covering a vertex $v \neq 0$ is the same for every such $v$, as each element is covered by the same number of search strategies in $\mathcal{F}$. This is the reason why $s = 0$ terminates the while loop. If we reach the situation where $s = 1$ in the while loop, then we should also stop, since $S_1$ is not well defined. We observe that all nodes, except maybe for node $0$, are covered the same number of times by $\mathcal{F}$, i.e., $|\{S \in \mathcal{F} : v \in C(S)\}|$ is the same for all $v \neq 0$. We define this quantity as $h$ and we say that it corresponds to the height of $\mathcal{F}$. Similarly, let $c$ denote the cardinality of $\mathcal{F}$. With these parameters, the objective value of the solution $x$ is $h/c$. The following lemma yields an explicit relationship between $c$ and $h$, and shows how to compute these values with a faster algorithm. In particular, it relates $c$ and $h$ to $g := \gcd(n-1, m+1)$, the greatest common divisor of $n-1$ and $m+1$, and to Bézout’s identity. The proof of the lemma is based on the fact that the algorithm indeed finishes and that the values of $c$ and $h$ satisfy Bézout’s identity if $n-1$ and $m+1$ are coprime. The running times in the next lemma are considered in the RAM model.
Lemma 3.
Algorithm 1 terminates in finite time. If $\mathcal{F}$ denotes the output of Algorithm 1, then its objective value is $h/c$, where $c$ is the cardinality and $h$ is the height of $\mathcal{F}$. Moreover, $(c, h)$ is the pair of numbers that satisfies
$$c \cdot (m+1) \;=\; h \cdot (n-1) + r \quad \text{for some } r \in \{0, 1\}, \tag{3}$$
where $c > 0$ is minimal. In particular, we can compute $c$ and $h$, and hence the objective value of $x$, in $O(\log n)$ time. Moreover, if $n-1$ and $m+1$ are not coprime, the objective value simplifies to $(m+1)/(n-1) = h/c$, where $c = (n-1)/g$ and $h = (m+1)/g$.
Proof.
We have already argued about the objective value of $x$.
We now give a proof for Equation (3) and argue that the algorithm terminates. For a given iteration, consider the state of the algorithm right after running Line 4; in particular, we have a set $\mathcal{F}$ together with an updated value of $s$. Let $c$ be the cardinality of $\mathcal{F}$ and $h$ its height, defined as above.
We use a volume argument. Define the volume of $\mathcal{F}$ as $\sum_{S \in \mathcal{F}} |C(S)|$. Observe that the length of all intervals $C(S)$ for $S \in \mathcal{F}$ is $m+1$. Thus, the volume equals $c(m+1)$. On the other hand, the volume also equals $h(n-1) + s$. Notice that $s$ can also be seen as $c(m+1) \bmod (n-1)$, and that the algorithm terminates the first time that $s \in \{0, 1\}$. Hence, if $\mathcal{F}$ is the final set, then $c$ is the smallest number such that, for some $h$, it holds that $c(m+1) = h(n-1) + r$ with $r \in \{0, 1\}$.
We now show that the algorithm finishes and that $r = 1$ at termination if and only if $n-1$ and $m+1$ are coprime.
Recall that the basic properties of modular arithmetic and Bézout’s identity [7] imply that all numbers of the form $c(m+1) - h(n-1)$ for $c, h \in \mathbb{Z}$ are multiples of $g$. Moreover, there exists a pair $(c, h)$ that satisfies the equation $c(m+1) - h(n-1) = g$, and any pair that satisfies this equation is of the form $(c + t(n-1)/g, \; h + t(m+1)/g)$ for some $t \in \mathbb{Z}$. Therefore, among all such pairs there is one with $0 < c \leq (n-1)/g$.
Assume that $g = 1$. With the previous discussion, the smallest $c$ such that $c(m+1) \equiv 1 \pmod{n-1}$ satisfies $c < n-1$. This in particular implies that the algorithm terminates after at most $n-1$ iterations. Moreover, the pair $(c, h)$ with the smallest such $c$ satisfies $\gcd(c, h) = 1$, as any common divisor of $c$ and $h$ would divide $c(m+1) - h(n-1) = 1$. Since the residue $0$ is only attained at $c = n-1$, the smallest value of $c$ such that $c(m+1) \bmod (n-1) \in \{0, 1\}$ is attained when $c(m+1) - h(n-1) = 1$, and thus Equation (3) holds.
If $g \geq 2$, then there are no values of $c, h$ such that $c(m+1) - h(n-1) = 1$, and hence $c = (n-1)/g$ is the smallest number such that $c(m+1) \equiv 0 \pmod{n-1}$. This implies Equation (3) and that the algorithm terminates after $(n-1)/g$ many iterations.
To obtain a fast algorithm to compute $c$ and $h$, use the Extended Euclidean Algorithm to compute values $(c, h)$ such that $c(m+1) - h(n-1) = g$ and $c$ is minimal. This takes $O(\log n)$ time (in the RAM model) [14, Chapter 31]. If $g = 1$ then we are done. Otherwise, $c = (n-1)/g$ and $h = (m+1)/g$.
With this lemma, we observe that we can perform a search following $x$ in logarithmic time, even without the need of running Algorithm 1: simply compute $c$ and $h$ with the Extended Euclidean Algorithm and sample one of the $c$ starting points in $\mathcal{F}$ uniformly.
Corollary 4.
Let $\mathcal{F}$ be the output of Algorithm 1. We can sample a tree $S \in \mathcal{F}$ with probability $1/|\mathcal{F}|$ in $O(\log n)$ time and using $O(\log n)$ random bits, without the need to compute $\mathcal{F}$. After choosing $S$, each query of the tree can be determined in constant time.
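A minimal sketch of this sampling procedure, under the same reading of Algorithm 1 as above: the Extended Euclidean Algorithm yields $c$, and a uniformly random index then determines the starting point of the sampled efficient strategy.

```python
from random import randrange

def xgcd(a, b):
    """Extended Euclid: returns (g, u, w) with u*a + w*b == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, u, w = xgcd(b, a % b)
    return g, w, u - (a // b) * w

def sample_start(n, k):
    """Sample the start of an efficient strategy uniformly from the support
    of Algorithm 1 without constructing it (sketch of Corollary 4)."""
    m1, n1 = 2 ** k - 1, n - 1     # m + 1 and the reduced cycle length
    g, u, _ = xgcd(m1, n1)
    c = u % n1 if g == 1 else n1 // g   # minimal c from Equation (3)
    j = randrange(c)                    # O(log n) random bits
    return (j * m1) % n1                # j-th starting point of F

print(sample_start(100, 4))
```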
The proof of this result is provided in the Appendix. In order to show that the constructed solution is indeed optimal, we construct a dual solution with the same objective value. Hence, by weak duality, this implies that the pair $(x, y)$ is a Nash equilibrium. As suggested by the previous lemma, we should distinguish whether $n-1$ and $m+1$ are coprime or not.
Although our construction of the dual solution does not explicitly rely on complementary slackness, the intuition for our approach came from thinking about certain constraints that this concept imposes on the dual variables. In particular, observe that the solution of the seeker induces “segments” of vertices that are covered by the same subset of search strategies in $\mathcal{F}$. See, for example, vertices 3 and 4 in Figure 4 or vertices 5 and 6 in Figure 5. There are $c$ of these segments (ignoring vertex 0 in the not coprime case), and complementary slackness implies that each of them should have a mass of $1/c$ in an optimal solution for the hider (albeit it does not imply that the mass should be uniformly distributed within each segment). The solution we construct shares this structure: the line is split into segments, and each of them gets a probability of $1/c$ in total. How these segments are chosen and how the mass is distributed within them differs significantly depending on whether $n-1$ and $m+1$ are coprime or not. However, in both cases the solution has the property that any pair of disjoint intervals contains at most the same probability mass as an interval obtained by gluing them together according to Proposition 1. This then implies that the best-response strategy of the seeker is achieved with an efficient strategy, for which the covered mass will be precisely $h/c$.
We start with the non-coprime case, which turns out to admit a simpler proof since the segments can be chosen in a more regular manner. On the other hand, the coprime case is significantly more challenging from a technical point of view, as the segments cannot be constructed so regularly.
Optimality if $n-1$ and $m+1$ are not coprime
If $n-1$ and $m+1$ are not coprime, Lemma 3 implies that the objective value of the solution computed by Algorithm 1 equals $(m+1)/(n-1)$. In what follows we will explicitly define a solution $y$ for the dual [D] with the same objective value, thus implying optimality of both solutions. The dual solution that we will analyze is defined as
$$y_v = \begin{cases} \frac{1}{n-1} & \text{if } v \neq 0,\\[2pt] 0 & \text{if } v = 0, \end{cases} \tag{4}$$
for all $v \in V(T)$, where we recall that $g = \gcd(n-1, m+1)$, $c = (n-1)/g$ and $h = (m+1)/g$. We can interpret this solution by thinking of the vertices $\{1, \ldots, n-1\}$ as divided into $c$ segments, each containing the $g$ vertices $\{jg+1, \ldots, (j+1)g\}$ for $j = 0, \ldots, c-1$. Within each segment, the probability mass of $1/c$ is uniformly distributed among its vertices (see Figure 4). Although this interpretation may seem somewhat contrived, it serves to highlight the connection to the construction of the coprime case, which relies heavily on the use of segments.
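The feasibility of this solution for [D] can be checked by brute force on small instances. The following Python sketch enumerates all covered sets of strategies in $\mathcal{S}_k$ on a small path and verifies that none of them captures more than $(m+1)/(n-1)$ of the mass of $y$ as defined in (4).

```python
from fractions import Fraction

def covered_sets(lo, hi, k):
    """All covered sets C(S) of strategies of height <= k on path lo..hi."""
    if lo == hi:
        return {frozenset([lo])}
    if k == 0:
        return {frozenset()}
    out = {frozenset()}
    for e in range(lo, hi):                      # query edge {e, e+1}
        for L in covered_sets(lo, e, k - 1):
            for R in covered_sets(e + 1, hi, k - 1):
                out.add(L | R)
    return out

def check_dual(n, k):
    """Verify y from (4): no strategy captures more than (m+1)/(n-1)."""
    y = [Fraction(0)] + [Fraction(1, n - 1)] * (n - 1)
    bound = Fraction(2 ** k - 1, n - 1)          # (m+1)/(n-1)
    return all(sum(y[v] for v in C) <= bound
               for C in covered_sets(0, n - 1, k))

print(check_dual(7, 2))  # n-1 = 6 and m+1 = 3 are not coprime here
```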
In what follows we show that $y$ is indeed a probability distribution with objective value $(m+1)/(n-1)$.
Theorem 5.
The vector $y$ defined in Equation (4) is a probability distribution over $V(T)$ whose objective value is $(m+1)/(n-1)$.
For a set $U \subseteq V(T)$, let $y(U) := \sum_{v \in U} y_v$. To prove the theorem, observe that by weak duality the objective value of $y$ cannot be smaller than $(m+1)/(n-1)$, the objective value of $x$. Hence, it suffices to show that $y(C(S)) \leq (m+1)/(n-1)$ for all $S \in \mathcal{S}_k$. Also, observe that, apart from the special role of node $0$, the distribution $y$ is invariant under cyclic shifts; hence the mass of an interval essentially depends only on its length. Recall that $I(s, \ell)$ denotes the interval modulo $n$ of length $\ell$ starting at $s$. The next lemma will be useful to compute $y(C(S))$ if $S$ is an efficient strategy.
Lemma 6.
For any $v \in V(T)$ and positive integer $\ell$ it holds that $y(I(v, \ell)) \leq \frac{\ell}{n-1}$, with equality if and only if $0 \notin I(v, \ell)$.
Proof.
The definition of $y$ implies that for any $v$ and $\ell$ it holds that $y(I(v, \ell)) = \frac{\ell - \mu}{n-1}$, where $\mu$ is the number of elements of $I(v, \ell)$ that equal $0$. It is easy to see that $\mu = 0$ or $\mu = 1$. The lemma follows.
The next technical lemma will be useful to show that $y(C(S))$ is maximized when $S$ is an efficient strategy. In essence, it says that by gluing two intervals into one (and increasing by one the length of the interval, as suggested by Proposition 1), we can only increase the amount of captured probability mass. With this lemma we will have enough tools to show Theorem 5.
Lemma 7.
Consider two disjoint intervals $I(a, \ell_1)$ and $I(b, \ell_2)$. Then $y(I(a, \ell_1)) + y(I(b, \ell_2)) \leq y(I(a, \ell_1 + \ell_2 + 1))$.
Proof.
Observe that
$$y(I(a, \ell_1 + \ell_2 + 1)) = y(I(a, \ell_1)) + y(I(a + \ell_1, \ell_2 + 1)),$$
and hence it suffices to show that $y(I(b, \ell_2)) \leq y(I(a + \ell_1, \ell_2 + 1))$. Indeed,
$$y(I(b, \ell_2)) \leq \frac{\ell_2}{n-1} \leq y(I(a + \ell_1, \ell_2 + 1)),$$
where the first inequality follows by Lemma 6 and the second holds since an interval of length $\ell_2 + 1$ contains at most one node of mass zero.
Proof (Theorem 5).
First we check that $y$ is indeed a probability distribution. To do this, note that there are $n-1$ nodes with positive probability. Hence, we have that
$$\sum_{v \in V(T)} y_v = (n-1) \cdot \frac{1}{n-1} = 1,$$
and hence $y$ defines a probability distribution over $V(T)$.
In order to show that $y$ has an objective value of $(m+1)/(n-1)$, by weak duality it suffices to show that $y(C(S)) \leq \frac{m+1}{n-1}$ for all $S \in \mathcal{S}_k$. First of all, observe that for any efficient strategy $S$ the set $C(S)$ is an interval of length at most $m+1$, and hence
$$y(C(S)) \leq \frac{m+1}{n-1}$$
by Lemma 6. We conclude that the claimed bound holds for any efficient search strategy.
Let $S$ be any strategy in $\mathcal{S}_k$ with $C(S)$ maximal. We must show that $y(C(S)) \leq \frac{m+1}{n-1}$. By Proposition 1 we know that $C(S) = I_1 \cup \cdots \cup I_p$ where the intervals are pairwise disjoint and $p$ is minimal. Assume that $p \geq 2$, as otherwise $S$ is an efficient strategy and we are done. Observe that $|C(S)| = \sum_{j=1}^{p} \ell_j$, where $\ell_j \geq 1$ denotes the length of $I_j$. Reinterpreting condition (iii) in Proposition 1, and recalling that the intervals in $C(S)$ are maximal, it holds that $\sum_j \ell_j = 2^k - p - 1$ if $C(S) \cap \{0, n-1\} = \emptyset$ and $\sum_j \ell_j = 2^k - p$ otherwise.
Also by Proposition 1, there exists a search strategy $S'$ such that
$$C(S') = I' \cup I_3 \cup \cdots \cup I_p,$$
where $I'$ is an interval of length $\ell_1 + \ell_2 + 1$; that is, we merged the first and second interval, gaining one unit in the total length. Observe that regardless of whether $C(S) \cap \{0, n-1\}$ is empty or not, the new intervals satisfy the conditions of Proposition 1, guaranteeing the existence of $S'$. Moreover, by Lemma 7, we have that
$$y(C(S)) \leq y(C(S')).$$
Iterating this argument we obtain that $y(C(S)) \leq y(C(S^*))$ for some efficient strategy $S^*$. Weak duality implies the theorem.
The coprime case
It remains to show optimality of the solution in the case $\gcd(n-1, m+1) = 1$. We define a solution $y$ which assigns probability $0$ to the border cells $0$ and $n-1$. The remaining cells are partitioned into $c$ segments, and a probability mass of $1/c$ is uniformly distributed among the cells of each segment, see Figure 5.
The idea for the definition of the segments is as follows. Ideally, we would like to assign a mass of $\frac{h}{c(m+1)}$ to each cell. In this way, each efficient strategy covering $m+1$ cells would capture a probability of exactly $h/c$. However, this simple idea must be refined, as there are issues at the boundary, where efficient strategies have length $m+1$ instead of $m$. To handle this situation we define a function $\varphi$, the ideal mass function, which prescribes the cumulative probability mass that should ideally lie on the cells up to a given position. The general idea is that, for any vertex $v$, the cumulative mass of the solution up to $v$ remains bounded by $\varphi(v)$. To satisfy this property, we partition the nodes in $\{1, \ldots, n-2\}$ into segments of two possible lengths (where length refers to the number of nodes in a segment).
To define the segments, start at node $v = 1$, and choose the largest $\ell$ such that
$$y([1, v + \ell - 1]) \;\leq\; \varphi(v + \ell - 1); \tag{5}$$
then, $\ell$ is the length of the segment starting at $v$. Set $y_w = \frac{1}{c\,\ell}$ for $w \in [v, v+\ell-1]$. Update $v \leftarrow v + \ell$ and repeat the same selection rule until $v = n-1$. We call Equation (5) the segment rule.
The constructed $y$ can be used to establish the optimality of $x$. To do so, we need to show that the seeker cannot capture more than $h/c$ probability mass from $y$ with any strategy in $\mathcal{S}_k$. The proof is given in the full version [8]. It follows a similar structure as for the non-coprime case. However, as the segments are not of uniform length, the analysis is significantly more technical. Combining both cases, coprime and non-coprime, allows us to conclude our main theorem.
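As a sanity check, the value of the game on small paths can also be computed exactly by brute force: enumerate all covered sets of strategies in $\mathcal{S}_k$ and solve [P] directly. The following self-contained Python sketch does this; it makes no use of the constructions above.

```python
import numpy as np
from scipy.optimize import linprog

def covered_sets(lo, hi, k):
    """All covered sets C(S) of strategies of height <= k on path lo..hi."""
    if lo == hi:
        return {frozenset([lo])}
    if k == 0:
        return {frozenset()}
    out = {frozenset()}
    for e in range(lo, hi):
        for L in covered_sets(lo, e, k - 1):
            for R in covered_sets(e + 1, hi, k - 1):
                out.add(L | R)
    return out

def game_value(n, k):
    """Exact value of the unit-profit game on the path with n nodes."""
    sets = [C for C in covered_sets(0, n - 1, k) if C]
    P = np.array([[1.0 if v in C else 0.0 for C in sets] for v in range(n)])
    N = len(sets)
    c = np.zeros(N + 1); c[-1] = -1.0
    A_ub = np.hstack([-P, np.ones((n, 1))])
    A_eq = np.ones((1, N + 1)); A_eq[0, -1] = 0.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * N + [(None, None)], method="highs")
    return res.x[-1]

print(game_value(12, 2))  # exact optimal profit for a small instance
```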
4 A Dynamic Programming Algorithm for the General Case
In this section we consider problems [P] and [D] for the case when $T$ is a general tree and the profit $\pi$ is an arbitrary non-increasing function.
A natural approach for computing the optimal profit is using the Ellipsoid method [19] on the dual. For this, we need an algorithm to separate the first set of inequalities of [D], i.e., we need to solve $\max_{S \in \mathcal{S}_k} \sum_{v \in V(T)} \pi(q_S(v))\, y_v$ for a given $y$. In other words, we need to solve the best-response problem for the seeker: given a tree $T$, a mixed strategy $y$ of the hider and a non-increasing profit function $\pi$ with $\pi(q) = 0$ for $q > k$, what is the search strategy in $\mathcal{S}_k$ that maximizes the expected profit of the seeker? This problem turns out to be NP-hard; see the full version [8] for the proof.
Theorem 9.
Computing the best response for the seeker is NP-hard, even if $T$ has constant diameter or constant degree, and for a suitably restricted profit function $\pi$.
Next, we show that, for an arbitrary tree $T$, the best-response problem can be solved in time $2^{O(k)} \cdot n^{O(1)}$ via a dynamic program based on characterizing search strategies using edge labelings. With the Ellipsoid method, this implies an algorithm that computes an optimal solution with time complexity $2^{O(k)} \cdot n^{O(1)}$.
4.1 Edge Labelings to describe Search Strategies
Edge labelings are functions that map edges of a graph to numbers. In what follows, we specify the conditions an edge labeling must obey in order to properly encode a search strategy in $\mathcal{S}_k$. Such a labeling will be called valid. Subsequently, we formulate the objective function of the best-response problem in terms of valid labelings. Finally, we provide a characterization of valid labelings that will be used by the dynamic program defined in the next subsection.
The relationship between edge labelings and (unrestricted) search strategies was established independently in [15, 27]. We give a modified definition that captures the limited height of the search strategies in $\mathcal{S}_k$ by allowing 0 labels (representing that the edge is never queried).
Definition 10 (Valid Labeling).
A function $f : E(T) \to \{0, 1, \ldots, k\}$ is a valid labeling for tree $T$ if for each pair of distinct edges $e, e'$ such that $f(e) = f(e') \geq 1$, there is an edge $e''$ on the simple path from $e$ to $e'$ for which $f(e'') > f(e)$. The set of valid labelings with range $\{0, 1, \ldots, k\}$ is denoted by $\mathcal{L}_k$.
We remark that every valid labeling reaches its maximum at a unique edge trivially, except for the degenerate all-zero labeling. Intuitively, a valid labeling maps an edge to the remaining budget of a search strategy right before it queries that edge (see Figure 2 for an example). Observe that a search strategy finds a vertex the moment all of its incident edges are queried. This suggests the following notation: for a vertex $v$ and a valid labeling $f$, let $q_f(v) := k + 1 - \min_{e \ni v} f(e)$ if $f(e) \geq 1$ for every edge $e$ incident to $v$, and $q_f(v) := \infty$ otherwise. Consequently, the expected profit of a valid labeling $f$ is given by $\sum_{v \in V(T)} y_v\, \pi(q_f(v))$, and hence we can show that the best-response problem can be solved by maximizing this expression.
Proposition 11.
The maximum expected profit of a valid labeling is equal to the objective value of the best response of the seeker, that is,
$$\max_{f \in \mathcal{L}_k} \; \sum_{v \in V(T)} y_v\, \pi(q_f(v)) \;=\; \max_{S \in \mathcal{S}_k} \; \sum_{v \in V(T)} y_v\, \pi(q_S(v)).$$
We leave the proof of this proposition for the appendix. With this equivalence established, we turn our attention to obtaining a “local” description of valid labelings. Such a description will give us a recipe for constructing valid labelings in an algorithmic fashion. Let us orient $T$ by rooting it arbitrarily and directing the edges away from the root. Let $f$ be an arbitrary labeling (not necessarily valid). In this setting, an edge that points from vertex $u$ to $v$ is denoted $(u, v)$. We say that an edge $e'$ is visible from edge $e$ if, on the directed path starting at $e$ and ending in $e'$, there is no other edge $e''$ such that $f(e'') > f(e')$. In other words, the edges visible from $e$ are those that are not “screened” by greater values of $f$. The visibility sequence of $e$, denoted $\mathrm{vis}(e)$, is the enumeration in ascending order of the labels of edges visible from $e$. We remark that the first label of $\mathrm{vis}(e)$ equals $f(e)$. We extend the definition of visibility sequence to vertices. The visibility sequence of a vertex $v$, $\mathrm{vis}(v)$, is the union of the visibility sequences of its outgoing edges (see Figure 6). The following result gives us the conditions that we need to impose on visibility sequences to obtain a valid labeling. Its proof can be found in the full version [8].
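For concreteness, the following Python sketch brute-forces Definition 10 and the expected profit of Proposition 11 on a small path (where the simple path between two edges is just the edges between them); the unit-profit setting, the uniform hider distribution, and the reading of labels as remaining budget are illustrative choices on our part.

```python
from itertools import product

def valid_on_path(f):
    """Definition 10 on a path, where edge i joins nodes i and i+1: equal
    labels >= 1 must be separated by a strictly larger label."""
    return not any(
        f[i] == f[j] >= 1 and max(f[i + 1:j], default=0) <= f[i]
        for i in range(len(f)) for j in range(i + 1, len(f)))

def q(f, v, k, n):
    """Number of queries after which node v is certified (None if never):
    an edge labeled f(e) >= 1 is the (k - f(e) + 1)-th query."""
    inc = [f[e] for e in (v - 1, v) if 0 <= e < n - 1]
    return None if min(inc) == 0 else k + 1 - min(inc)

n, k = 5, 2   # unit profit, uniform hider: profit = fraction of covered nodes
best = max(
    (sum(q(f, v, k, n) is not None for v in range(n)) / n, f)
    for f in product(range(k + 1), repeat=n - 1) if valid_on_path(f))
print(best)   # best expected profit and one optimal valid labeling
```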
Proposition 12.
A labeling $f$ of tree $T$ is valid if and only if, for every possible rooting of the tree with edges directed away from the root, the following hold: (i) for every edge $(u, v)$, if $f((u,v)) \in \mathrm{vis}(v)$ then $f((u,v)) = 0$; and (ii) for every vertex $v$ and distinct outgoing edges $e, e'$ of $v$, the sets $\mathrm{vis}(e)$ and $\mathrm{vis}(e')$ are disjoint, except possibly for label $0$.
4.2 Dynamic Program
The idea of the dynamic program is to compute an optimal valid labeling with a bottom-up traversal, following an arbitrary rooting of the tree, with edges directed away from the root. A similar technique was used by Cicalese et al. [12, Proposition 1]. For every edge and vertex, we need to keep track of their visibility sequences and make sure that the conditions of Proposition 12 are satisfied. This is done using two tables, one indexed by edges and one by vertices, with recurrences interleaving one another. For an edge $(u, v)$, consider the sub-tree consisting of this edge and the sub-tree rooted at $v$. The edge table stores, for each possible visibility sequence of $(u, v)$, the maximum expected profit of a valid labeling of this sub-tree.
For a vertex $v$ with outgoing edges $e_1, \ldots, e_d$ and an integer $i \leq d$, consider the sub-tree consisting of the edges $e_1, \ldots, e_i$ together with the sub-trees attached to the endpoints of these edges. The vertex table stores, for each possible visibility sequence at $v$, the maximum profit of a valid labeling of this sub-tree. The Bellman equations for both tables are defined in a natural way, with special care for border cases. The overall running time is $2^{O(k)} \cdot n^{O(1)}$. Details are given in the full version [8].
Theorem 13.
The best-response problem can be solved with a running time of $2^{O(k)} \cdot n^{O(1)}$.
Together with the Ellipsoid method, we can directly conclude the following.
Corollary 14.
An optimal solution for the problem of searching on a tree with a budget of $k$ queries and an arbitrary non-increasing profit function can be computed in polynomial time if $k = O(\log n)$. In particular, if the input tree has a constant maximum degree and we are in the unit-profit case, then the optimal solution can be found in polynomial time regardless of the value of $k$.
5 Conclusions and Open Problems
In this article, we have introduced a model for search problems under pressure, with a strict budget on the number of queries available to find the target.
Several open questions remain. Arguably, the most fundamental concerns our search game on trees. As shown in Corollary 14, if $k = O(\log n)$, then the problem is polynomially solvable. Although computing the best response for the seeker is NP-hard for arbitrary profit functions (Theorem 9), for the unit-profit case the status of the problem is left open for larger values of $k$ and when the maximum degree is super-constant. Similarly, we leave open the problem of efficiently approximating the optimal solution for general profit functions.
In our model, a profit is only obtained if the target is localized with complete certainty. An exciting future direction is to consider generalized models where a partial profit is obtained if the target is localized to a small enough subset of vertices.
References
- [1] W. Ahmed, N. Angel, J. Edson, K. Bibby, A. Bivins, J. W. O’Brien, P. M. Choi, M. Kitajima, S. L. Simpson, J. Li, B. Tscharke, R. Verhagen, W. J.M. Smith, J. Zaugg, L. Dierens, P. Hugenholtz, K. V. Thomas, and J. F. Mueller. First confirmed detection of SARS-CoV-2 in untreated wastewater in Australia: A proof of concept for the wastewater surveillance of COVID-19 in the community. Science of The Total Environment, 728:138764, 2020.
- [2] J. Baboun, I. S. Beaudry, L. M. Castro, F. Gutierrez, A. Jara, B. Rubio, and J. Verschae. Identifying outbreaks in sewer networks: An adaptive sampling scheme under network’s uncertainty. Proceedings of the National Academy of Sciences (PNAS), 121(14):e2316616121, 2024.
- [3] Y. Ben-Asher, E. Farchi, and I. Newman. Optimal search in trees. SIAM Journal on Computing, 28(6):2090–2102, 1999. doi:10.1137/S009753979731858X.
- [4] B. A. Berendsohn, I. Golinsky, H. Kaplan, and L. Kozma. Fast Approximation of Search Trees on Trees with Centroid Trees. LIPIcs, Volume 261, ICALP 2023, 261:19:1–19:20, 2023. doi:10.4230/LIPICS.ICALP.2023.19.
- [5] B. A. Berendsohn and L. Kozma. Splay trees on trees. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’22, pages 1875–1900. Society for Industrial and Applied Mathematics, 2022.
- [6] A. Berko, V. Zhuk, and I. Sereda. Modeling of sewer networks by means of directed graphs. Environmental Problems, 2:97–100, 2017.
- [7] E. Bézout. Théorie générale des équations algébriques. De l’imprimerie de Ph. D. Pierres, 1779.
- [8] A. Caracci, C. Dürr, and J. Verschae. Randomized binary and tree search under pressure, 2024. doi:10.48550/arXiv.2406.06468.
- [9] R. Carmo, J. Donadelli, Y. Kohayakawa, and E. Laber. Searching in random partially ordered sets. Theoretical Computer Science, 321(1):41–57, 2004. doi:10.1016/J.TCS.2003.06.001.
- [10] F. Cicalese, T. Jacobs, E. Laber, and M. Molinaro. On the complexity of searching in trees and partially ordered structures. Theoretical Computer Science, 412:6879–6896, 2011. doi:10.1016/J.TCS.2011.08.042.
- [11] F. Cicalese, T. Jacobs, E. Laber, and M. Molinaro. Improved Approximation Algorithms for the Average-Case Tree Searching Problem. Algorithmica, 68:1045–1074, 2014. doi:10.1007/S00453-012-9715-6.
- [12] F. Cicalese, T. Jacobs, E. Laber, and C. Valentim. The binary identification problem for weighted trees. Theoretical Computer Science, 459:100–112, 2012. doi:10.1016/J.TCS.2012.06.023.
- [13] CORDIS. Explosive material production (hidden) agile search and intelligence system, HORIZON 2020. https://cordis.europa.eu/project/id/261381. Accessed: 2025-04-21.
- [14] T. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
- [15] D. Dereniowski. Edge ranking and searching in partial orders. Discrete Applied Mathematics, 156(13):2493–2500, 2008. doi:10.1016/J.DAM.2008.03.007.
- [16] C. Di Cristo and A. Leopardi. Pollution Source Identification of Accidental Contamination in Water Distribution Networks. Journal of Water Resources Planning and Management, 134(2):197–202, 2008.
- [17] E. Domokos, V. Sebestyén, V. Somogyi, A. J. Trájer, R. Gerencsér-Berta, B. Oláné H., E. G. Tóth, F. Jakab, G. Kemenesi, and J. Abonyi. Identification of sampling points for the detection of SARS-CoV-2 in the sewage system. Sustainable Cities and Society, 76:103422, 2022.
- [18] E. Emamjomeh-Zadeh, D. Kempe, and V. Singhal. Deterministic and probabilistic binary search in graphs. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, STOC ’16, pages 519–532, 2016.
- [19] M. Grötschel, L. Lovász, and A. Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1:169–197, 1981. doi:10.1007/BF02579273.
- [20] D. L. Jones, M. Baluja, D. W. Graham, A. Corbishley, J. E. McDonald, S. K. Malham, L. S. Hillary, T. R. Connor, W. H. Gaze, I. B. Moura, M. H. Wilcox, and K. Farkas. Shedding of SARS-CoV-2 in feces and urine and its potential role in person-to-person transmission and the environment-based spread of COVID-19. Science of The Total Environment, 749:141364, 2020.
- [21] D. E. Knuth. Optimum binary search trees. Acta Informatica, 1:14–25, 1971. doi:10.1007/BF00264289.
- [22] T. W. Lam and F. L. Yue. Optimal Edge Ranking of Trees in Linear Time. Algorithmica, 30:12–33, 2001. doi:10.1007/S004530010076.
- [23] R. C. Larson, O. Berman, and M. Nourinejad. Sampling manholes to home in on SARS-CoV-2 infections. PLOS ONE, 15(10):e0240007, 2020.
- [24] K. Mehlhorn. Nearly optimal binary search trees. Acta Informatica, 5(4):287–295, 1975. doi:10.1007/BF00264563.
- [25] S. Mozes, K. Onak, and O. Weimann. Finding an optimal tree searching strategy in linear time. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08, pages 1096–1105, 2008.
- [26] M. Nourinejad, O. Berman, and R. C. Larson. Placing sensors in sewer networks: A system to pinpoint new cases of coronavirus. PLOS ONE, 16:e0248893, 2021.
- [27] K. Onak and P. Parys. Generalization of binary search: Searching in trees and forest-like partial orders. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’06, pages 379–388. IEEE, 2006.
- [28] J. O’Keeffe. Wastewater-based epidemiology: current uses and future opportunities as a public health surveillance tool. Environmental Health Review, 64:44–52, 2021.
- [29] A. A. Schäffer. Optimal node ranking of trees in linear time. Information Processing Letters, 33:91–96, 1989. doi:10.1016/0020-0190(89)90161-0.
- [30] A. Sulej-Suchomska, A. Klupczynska, P. Derezinski, J. Matysiak, P. Przybylowski, and Z. Kokot. Urban wastewater analysis as an effective tool for monitoring illegal drugs, including new psychoactive substances, in the Eastern European region. Scientific Reports, 10(4885), 2020.
- [31] J. Von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295–320, 1928.
