Tree Path Majority Data Structures

We present the first solution to $\tau$-majorities on tree paths. Given a tree of $n$ nodes, each with a label from $[1..\sigma]$, and a fixed threshold $0<\tau<1$, such a query gives two nodes $u$ and $v$ and asks for all the labels that appear more than $\tau \cdot |P_{uv}|$ times in the path $P_{uv}$ from $u$ to $v$, where $|P_{uv}|$ denotes the number of nodes in $P_{uv}$. Note that the answer to any query is of size up to $1/\tau$. On a $w$-bit RAM, we obtain a linear-space data structure with $O((1/\tau)\log^* n \log\log_w \sigma)$ query time. For any $\kappa>1$, we can also build a structure that uses $O(n\log^{[\kappa]} n)$ space, where $\log^{[\kappa]} n$ denotes the function that applies logarithm $\kappa$ times to $n$, and answers queries in time $O((1/\tau)\log\log_w \sigma)$. The construction time of both structures is $O(n\log n)$. We also describe two succinct-space solutions with the same query time of the linear-space structure. One uses $2nH + 4n + o(n)(H+1)$ bits, where $H \le \lg\sigma$ is the entropy of the label distribution, and can be built in $O(n\log n)$ time. The other uses $nH + O(n) + o(nH)$ bits and is built in $O(n\log n)$ time w.h.p.


Introduction
Finding frequent elements in subsets of a multiset is a fundamental operation for data analysis and data mining [2,3]. When the sets have a certain structure, it is possible to preprocess the multiset to build data structures that efficiently find the frequent elements in any subset.
An early partial version of this article appeared in Proc. ISAAC 2018 [1]. * Corresponding author. Email addresses: travis.gagie@mail.udp.cl (Travis Gagie), mhe@cs.dal.ca (Meng He), gnavarro@dcc.uchile.cl (Gonzalo Navarro), cochoa@dcc.uchile.cl (Carlos Ochoa).
The best studied multiset structure is the sequence, where the subsets that can be queried are ranges (i.e., contiguous subsequences) of the sequence. Applications of this case include time sequences, linear-versioned structures, and one-dimensional models, for example. Data structures for finding the mode (i.e., the most frequent element) in a range require time $O(\sqrt{n/\lg n})$, and it is unlikely that this can be done much better within reasonable extra space [4]. Instead, listing all the elements whose relative frequency in a range is over some fraction $\tau$ (called the $\tau$-majorities of the range) is feasible within linear space and $O(1/\tau)$ time, which is worst-case optimal [5]. Mode and $\tau$-majority queries on higher-dimensional arrays have also been studied [6,4].
In this paper we focus on finding frequent elements when the subsets that can be queried are the labels on paths from one given node to another in a labeled tree. For example, given a minimum spanning tree of a graph, we might be interested in frequent node types on the path between two nodes. Path mode or τ -majority queries on multi-labeled trees could be useful when handling the tree of versions of a document or a piece of software, or a phylogenetic tree (which is essentially a tree of versions of a genome). If each node stores a list of the sections (i.e., chapters, modules, genes) on which its version differs from its parent's, then we can efficiently query which sections are changed most frequently between two given versions.
There has been relatively little previous work on finding frequent elements on tree paths. Krizanc et al. [7] considered path mode queries, obtaining $O(\sqrt{n}\lg n)$ query time. This was recently improved by Durocher et al. [8], who obtained $O(\sqrt{n/w}\,\lg\lg n)$ time on a RAM machine of $w = \Omega(\lg n)$ bits. As in the special case of sequences, these times are not likely to improve much. No previous work has considered path $\tau$-majority queries, which are more tractable than path mode queries. This is our focus.
We present the first data structures to support path $\tau$-majority queries on trees of $n$ nodes, with labels in $[1..\sigma]$, on a RAM machine. We first obtain a data structure using $O(n\lg n)$ space and $O((1/\tau)\lg\lg_w \sigma)$ time (Theorem 3). Building on this result, we manage to reduce the space to $O(n)$ without affecting the query time (Theorem 7). We then show that our linear-space data structure can be further compressed, to either $2nH + 4n + o(n)(H+1)$ bits or $nH + O(n) + o(nH)$ bits, where $H \le \lg\sigma$ is the entropy of the distribution of the labels in $T$, while increasing the query time of the linear-space data structure only slightly, to $O((1/\tau)\lg^* n \lg\lg_w \sigma)$ (Theorems 8 and 9). Finally, we extend the succinct results so as to allow $\tau$ to be specified at query time, at the cost of just $o(n\lg\sigma)$ further bits of space (Theorems 11 and 12).
Durocher et al. [8] also considered queries that look for the least frequent elements and $\tau$-minorities on paths. In Theorem 14, we slightly improve their query time to $O((1/\tau)\lg\lg_w \sigma)$ within linear space, and in Theorem 15 we show how to compress the data structure to fit in succinct space, with only a very slight increase in query time.
Finally, we describe how to adapt our results to multi-labeled trees and to path queries on functions, and discuss some open problems.
An early partial version of this paper appeared in Proc. ISAAC 2018 [1]. This version includes an improved complexity for the linear-space version (so the super-linear space version of the conference paper becomes obsolete), which is also simplified. It also includes new results for $\tau$ specified at query time, for $\tau$-minorities, and for extensions to path queries on functions. Finally, we have improved the writing, added more detail, and fixed some minor errors.

Definitions
We deal with rooted ordinal trees (or just trees) T . Further, our trees are labeled, that is, each node u of T has an integer label label(u) ∈ [1..σ]. We assume that, if our main tree has n nodes, then σ = O(n); if not, we can remap the labels to a range of size at most n without altering the semantics of the queries of interest in this paper.
The path between nodes $u$ and $v$ in a tree $T$ is the (only) sequence of nodes $P_{uv} = \langle u = z_1, z_2, \ldots, z_{k-1}, z_k = v \rangle$ such that there is an edge in $T$ between each pair $z_i$ and $z_{i+1}$, for $1 \le i < k$. The length of the path is $|P_{uv}| = k$; for example, the length of the path $P_{uu}$ is 1. Any path from $u$ to $v$ goes from $u$ to the lowest common ancestor of $u$ and $v$, and then from there it goes to $v$ (if $u$ is an ancestor of $v$ or vice versa, one of these two subpaths is empty).
Given a real number $0 < \tau < 1$, a $\tau$-majority of the path $P_{uv}$ is any label that appears (strictly) more than $\tau \cdot |P_{uv}|$ times among the labels of the nodes in $P_{uv}$. The path $\tau$-majority problem is, given $u$ and $v$, to list all the $\tau$-majorities in the path $P_{uv}$. Note that there can be up to $1/\tau$ such $\tau$-majorities.
Our results hold in the RAM model of computation, assuming a computer word of w = Ω(lg n) bits, supporting the standard operations.
Our logarithms are to the base 2 by default. By $\lg^{[k]} n$ we mean the function that applies logarithm $k$ times to $n$, i.e., $\lg^{[0]} n = n$ and $\lg^{[k]} n = \lg(\lg^{[k-1]} n)$. By $\lg^* n$ we denote the iterated logarithm, i.e., the minimum $k$ such that $\lg^{[k]} n \le 1$.
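The two functions just defined can be made concrete with a minimal Python sketch. The names `lg_iter` and `lg_star` are hypothetical, chosen here only for illustration, and floating-point logarithms are assumed to be accurate enough for the inputs of interest.

```python
import math

def lg_iter(n: float, k: int) -> float:
    """lg^[k] n: apply the base-2 logarithm k times (lg^[0] n = n)."""
    x = float(n)
    for _ in range(k):
        x = math.log2(x)
    return x

def lg_star(n: float) -> int:
    """lg* n: the minimum k such that lg^[k] n <= 1."""
    k, x = 0, float(n)
    while x > 1:
        x = math.log2(x)
        k += 1
    return k
```

For instance, `lg_star(65536)` follows the chain 65536, 16, 4, 2, 1 and therefore needs four applications of the logarithm.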
A bitvector $B[1..n]$ can be represented within $n + o(n)$ bits so that the following operations take constant time: $\mathrm{access}(B, i)$ returns $B[i]$, $\mathrm{rank}_b(B, i)$ returns the number of times bit $b$ appears in $B[1..i]$, and $\mathrm{select}_b(B, j)$ returns the position of the $j$th occurrence of $b$ in $B$ [9]. If $B$ has $m$ 1s, then it can be represented within $m\lg(n/m) + O(m)$ bits while retaining the same operation times [10]. Note the space is $o(n)$ bits if $m = o(n)$. Those structures can be built in linear time.
Analogous operations are defined on sequences $S[1..n]$ over alphabets $[1..\sigma]$. For example, one can represent $S$ within $nH + o(n)(H+1)$ bits, where $H \le \lg\sigma$ is the entropy of the distribution of symbols in $S$, so that rank takes time $O(\lg\lg_w \sigma)$, access takes time $O(1)$, and select takes any time in $\omega(1)$ [11, Thm. 8]. The construction takes linear time. While this rank time is optimal, we can answer partial rank queries, $\mathrm{prank}(S, i) = \mathrm{rank}_{S[i]}(S, i)$, in $O(1)$ time by adding $O(n(1 + \lg H))$ bits on top of a representation giving constant-time access [12, Sec. 3]. This construction requires linear randomized time.

Range τ -majorities on sequences
A special version of the path $\tau$-majority queries on trees is range $\tau$-majority queries on sequences $S[1..n]$, which have been studied in greater depth. Given $i$ and $j$, the problem is to return all the distinct symbols that appear more than $\tau \cdot (j - i + 1)$ times in $S[i..j]$. The most recent result on this problem [13,5] is a linear-space data structure, built in $O(n\lg n)$ time, that answers queries in the worst-case optimal time, $O(1/\tau)$.
For our succinct representations, we also use a data structure [5, Thm. 5.2] that requires $nH + o(n)(H+1)$ bits, and can answer range $\tau$-majority queries in any time in $(1/\tau)\cdot\omega(1)$. The structure is built on the sequence representation mentioned above [11, Thm. 8], and thus it includes its support for access, rank, and select queries on the sequence. To obtain the given times for $\tau$-majorities, the structure includes the support for partial rank queries [12, Sec. 3], and therefore its construction time is randomized. In this paper, however, it will be sufficient to obtain $O((1/\tau)\lg\lg_w \sigma)$ time, and therefore we can replace their prank queries by general rank operations. These take time $O(\lg\lg_w \sigma)$ instead of $O(1)$, but their support can be built in linear time. Therefore, this slightly slower structure can also be built in $O(n\lg n)$ deterministic time.
When a set has no structure, we can find its $\tau$-majorities in linear time. Misra and Gries [14] proposed an optimal solution that computes all $\tau$-majorities using $O(n\lg(1/\tau))$ comparisons. When implemented on a word RAM over an integer alphabet of size $\sigma$, the running time becomes $O(n)$ [3].
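To illustrate the unstructured case, the following is a minimal Python sketch of a counter-based scheme in the spirit of Misra and Gries. It is a simple hash-table variant written for illustration, not the comparison-optimal implementation cited above; the function names are hypothetical. It keeps $\lceil 1/\tau \rceil$ counters, which suffices for every $\tau$-majority to survive the first pass, and then verifies the surviving candidates with a second pass.

```python
from collections import Counter
from math import ceil

def misra_gries_candidates(labels, tau):
    """One pass with m = ceil(1/tau) counters; every tau-majority survives."""
    m = ceil(1 / tau)
    counters = {}
    for x in labels:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            # All counters are busy: decrement each, dropping those that reach 0.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return set(counters)

def tau_majorities(labels, tau):
    """Labels occurring strictly more than tau * len(labels) times (labels is a list)."""
    cands = misra_gries_candidates(labels, tau)
    exact = Counter(x for x in labels if x in cands)   # second, verifying pass
    return sorted(x for x in cands if exact[x] > tau * len(labels))
```

The second pass is needed because a surviving counter only indicates a candidate; its value is not the exact frequency.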

Tree operations
For tree nodes u and v, we define the operations root (the tree root), parent(u) (the parent of node u), depth(u) (the depth of node u, 0 being the depth of the root), preorder(u) (the rank of u in a preorder traversal of T ), postorder(u) (the rank of u in a postorder traversal of T ), subtreesize(u) (the number of nodes descending from u, including u), anc(u, d) (the ancestor of u at depth d), and lca(u, v) (the lowest common ancestor of u and v). All those operations can be supported in constant time and linear space on a static tree after a linear-time preprocessing, trivially with the exceptions of anc [15] and lca [16].
A less classical query is $\mathrm{labelanc}(u, \ell)$, which returns the nearest ancestor of $u$ (possibly $u$ itself) labeled $\ell$ (note that the label of $u$ need not be $\ell$). If $u$ has no ancestor labeled $\ell$, $\mathrm{labelanc}(u, \ell)$ returns null. This operation can be solved in time $O(\lg\lg_w \sigma)$ using linear space and preprocessing time [17,18,8].

Succinct tree representations
A tree $T$ of $n$ nodes can be represented as a sequence $P[1..2n]$ of parentheses (i.e., a bit sequence). In particular, we consider the balanced parentheses representation, where we traverse $T$ in depth-first order, writing an opening parenthesis when reaching a node and a closing one when leaving its subtree. A node is identified with the position $P[i]$ of its opening parenthesis. By using $2n + o(n)$ bits, all the tree operations defined in Section 2.4 (except those on labels) can be supported in constant time [19].
This representation also supports access, rank and select on the bitvector of parentheses, and the operations $\mathrm{close}(P, i)$ (the position of the parenthesis closing the one that opens at $P[i]$), $\mathrm{open}(P, i)$ (the position of the parenthesis opening the one that closes at $P[i]$), and $\mathrm{enclose}(P, i)$ (the position of the rightmost opening parenthesis whose corresponding parenthesis pair encloses $P[i]$; when $P$ represents a tree, this parenthesis represents the parent of the node to which $P[i]$ corresponds).
Labeled trees can be represented within nH + 2n + o(n)(H + 1) bits by adding the sequence S[1..n] of the node labels in preorder, so that label(i) = access(S, preorder(i)).
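As an illustration of the balanced parentheses representation, here is a minimal Python sketch with naive, linear-time versions of close and enclose. It is only meant to show what the operations compute; the succinct structures cited above answer them in constant time. The sketch assumes a tree given as a dictionary of child lists, uses 0-based positions (the text uses 1-based ones), and all function names are hypothetical.

```python
def balanced_parens(children, root=0):
    """Depth-first bit sequence: 1 on entering a node, 0 on leaving its subtree.
    Also returns pos[node], the 0-based position of the node's opening parenthesis."""
    bits, pos = [], {}
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            bits.append(0)
        else:
            pos[node] = len(bits)
            bits.append(1)
            stack.append((node, True))
            for c in reversed(children[node]):
                stack.append((c, False))
    return bits, pos

def close(bits, i):
    """Position of the parenthesis closing the one opened at i (naive scan)."""
    excess = 0
    for j in range(i, len(bits)):
        excess += 1 if bits[j] else -1
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")

def enclose(bits, i):
    """Opening parenthesis of the parent of the node at i, or None for the root."""
    excess = 0
    for j in range(i - 1, -1, -1):
        excess += 1 if bits[j] else -1
        if excess == 1:          # first unmatched opening parenthesis to the left
            return j
    return None
```

With this layout, the subtree size of the node at position `i` is `(close(bits, i) - i + 1) // 2`.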

An O(n lg n)-Space Solution
In this section we design a data structure answering path τ -majority queries on a tree of n nodes using O(n lg n) space and O((1/τ ) lg lg w σ) time. This introduces the basic ideas to obtain our final results.
We start by marking $O(\tau n)$ tree nodes, in a way that any node has a marked ancestor at distance $O(1/\tau)$. A simple way to obtain these bounds is to mark every node whose height is $\ge \lceil 1/\tau \rceil$ and whose depth is a multiple of $\lceil 1/\tau \rceil$. Therefore, every marked node is the nearest marked ancestor of at least $\lceil 1/\tau \rceil - 1$ distinct non-marked nodes, which guarantees that there are $\le \tau n$ marked nodes. On the other hand, any node is at distance at most $2\lceil 1/\tau \rceil - 1$ from its nearest marked ancestor.
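The marking rule can be sketched as follows in Python. This is a minimal illustration under assumptions that are not in the text: the tree is given as a parent array with node 0 as the root and `parent[v] < v` for every other node (e.g., nodes numbered in preorder), and the height of a node is counted in nodes, so a leaf has height 1. The name `mark_nodes` is hypothetical.

```python
from math import ceil

def mark_nodes(parent, tau):
    """Mark nodes whose depth is a multiple of ceil(1/tau) and whose height
    (in nodes, leaf = 1) is at least ceil(1/tau). Assumes parent[0] == -1
    and parent[v] < v for v > 0."""
    n = len(parent)
    step = ceil(1 / tau)
    depth = [0] * n
    for v in range(1, n):                    # parents are visited before children
        depth[v] = depth[parent[v]] + 1
    height = [1] * n
    for v in range(n - 1, 0, -1):            # children are visited before parents
        height[parent[v]] = max(height[parent[v]], height[v] + 1)
    return [v for v in range(n) if depth[v] % step == 0 and height[v] >= step]
```

On a path of 10 nodes with $\tau = 1/4$, only the nodes at depths 0 and 4 are marked: the node at depth 8 has too few nodes below it, which is what keeps the number of marked nodes within $\tau n$.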
For each marked node $x$, we will consider prefixes $P_i(x)$ of the labels in the path from $x$ to the root, of length $1 + 2^i$, that is, $P_i(x) = \langle \mathrm{label}(x), \mathrm{label}(\mathrm{parent}(x)), \mathrm{label}(\mathrm{parent}^2(x)), \ldots, \mathrm{label}(\mathrm{parent}^{2^i}(x)) \rangle$ (terminating the sequence at the root if we reach it). For each $0 \le i \le \lceil \lg \mathrm{depth}(x) \rceil$, we store $C_i(x)$, the set of $(\tau/2)$-majorities in $P_i(x)$. Note that $|C_i(x)| \le 2/\tau$ for any $x$ and $i$.
By successive applications of the next lemma we have that, to find all the $\tau$-majorities in the path from $u$ to $v$, we can partition the path into several subpaths and then consider just the $\tau$-majorities in each subpath.
Lemma 1. Let $u$ and $v$ be two tree nodes, and let $z$ be an intermediate node in the path. Then, a $\tau$-majority in the path from $u$ to $v$ is a $\tau$-majority in the path from $u$ to $z$ (including $z$) or a $\tau$-majority in the path from $z$ to $v$ (excluding $z$), or in both.
Proof. Let $d_{uz}$ be the distance from $u$ to $z$ (counting $z$) and $d_{zv}$ be the distance from $z$ to $v$ (not counting $z$). Then the path from $u$ to $v$ is of length $d = d_{uz} + d_{zv}$. If a label occurs at most $\tau \cdot d_{uz}$ times in the path from $u$ to $z$ and at most $\tau \cdot d_{zv}$ times in the path from $z$ to $v$, then it occurs at most $\tau(d_{uz} + d_{zv}) = \tau \cdot d$ times in the path from $u$ to $v$, and hence it is not a $\tau$-majority of that path.
Let us now show that the candidates we record for marked nodes are sufficient to find path τ -majorities towards their ancestors.
Lemma 2. Let $x$ be a marked node. All the $\tau$-majorities in the path from $x$ to a proper ancestor $z$ are included in $C_i(x)$ for some suitable $i$.
Proof. Let $d_{xz} = \mathrm{depth}(x) - \mathrm{depth}(z)$ be the distance from $x$ to $z$ (i.e., the length of the path from $x$ to $z$ minus 1). Let $i = \lceil \lg d_{xz} \rceil$. The prefix $P_i(x)$ contains all the nodes in an upward path of length $1 + 2^i$ starting at $x$, where $d_{xz} \le 2^i < 2d_{xz}$. Therefore, $P_i(x)$ contains node $z$, but its length is $|P_i(x)| < 1 + 2d_{xz}$. Therefore, any $\tau$-majority in the path from $x$ to $z$ appears more than $\tau \cdot (1 + d_{xz}) > (\tau/2) \cdot (1 + 2d_{xz}) > (\tau/2) \cdot |P_i(x)|$ times, and thus it is a $(\tau/2)$-majority recorded in $C_i(x)$.

Queries
With the properties above, we can find a candidate set of size $O(1/\tau)$ for the path $\tau$-majorities between arbitrary tree nodes $u$ and $v$. Let $z = \mathrm{lca}(u, v)$. If $v \neq z$, let us also define $z' = \mathrm{anc}(v, \mathrm{depth}(z) + 1)$, that is, the child of $z$ in the path to $v$. The path is then split into at most four subpaths, each of which can be empty:
1. The nodes from $u$ to its nearest marked ancestor, $x$, not including $x$. If $x$ does not exist or is a proper ancestor of $z$, then this subpath contains the nodes from $u$ to $z$. The length of this path is less than $2\lceil 1/\tau \rceil$ by the definition of marked nodes, and it is empty if $u = x$.
2. The nodes from $v$ to its nearest marked ancestor, $y$, not including $y$. If $y$ does not exist or is an ancestor of $z$, then this subpath contains the nodes from $v$ to $z'$. The length of this path is again less than $2\lceil 1/\tau \rceil$, and it is empty if $v = y$ or $v = z$.
3. The nodes from $x$ to $z$. This path exists only if $x$ exists and descends from $z$.
4. The nodes from $y$ to $z'$. This path exists only if $y$ exists and descends from $z'$.
By Lemma 1, any $\tau$-majority in the path from $u$ to $v$ must be a $\tau$-majority in some of these four paths. For the paths 1 and 2, we consider all their up to $2\lceil 1/\tau \rceil - 1$ nodes as candidates. For the paths 3 and 4, we use Lemma 2 to find suitable values $i$ and $j$ so that $C_i(x)$ and $C_j(y)$, both of size at most $2/\tau$, contain all the possible $\tau$-majorities in those paths. In total, we obtain a set of at most $8/\tau + O(1)$ candidates that contain all the $\tau$-majorities in the path from $u$ to $v$.
In order to verify whether a candidate is indeed a $\tau$-majority, we follow the technique of Durocher et al. [8]. Every tree node $u$ will store $\mathrm{count}(u)$, the number of times its label occurs in the path from $u$ to the root. We also make use of the operation $\mathrm{labelanc}(u, \ell)$. If $u$ has no ancestor labeled $\ell$, this operation returns null, and we define $\mathrm{count}(\mathrm{null}) = 0$. Therefore, the number of times label $\ell$ occurs in the path from $u$ to an ancestor $z$ of $u$ (including $z$) can be computed as $\mathrm{count}(\mathrm{labelanc}(u, \ell)) - \mathrm{count}(\mathrm{labelanc}(\mathrm{parent}(z), \ell))$. Each of our candidates $\ell$ can then be checked by counting its occurrences in the path from $u$ to $v$ using $(\mathrm{count}(\mathrm{labelanc}(u, \ell)) - \mathrm{count}(\mathrm{labelanc}(\mathrm{parent}(z), \ell))) + (\mathrm{count}(\mathrm{labelanc}(v, \ell)) - \mathrm{count}(\mathrm{labelanc}(z, \ell)))$, where $z = \mathrm{lca}(u, v)$.
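The counting formula can be checked on a small example with the following minimal Python sketch. It assumes a parent array with node 0 as the root, `parent[0] == -1`, and `parent[v] < v` otherwise. The `labelanc` and `lca` functions here are naive upward walks standing in for the cited data structures, and the candidate set is simply every label on the path; only the verification step mirrors the text. All function names are hypothetical.

```python
def labelanc(parent, label, u, ell):
    """Nearest ancestor of u (possibly u) labeled ell, or None (naive walk)."""
    while u != -1:
        if label[u] == ell:
            return u
        u = parent[u]
    return None

def build_count(parent, label):
    """count[u] = occurrences of label[u] on the path from u to the root."""
    count = [0] * len(parent)
    for u in range(len(parent)):             # parents are processed first
        a = labelanc(parent, label, parent[u], label[u])
        count[u] = 1 + (count[a] if a is not None else 0)
    return count

def lca(parent, u, v):
    """Lowest common ancestor by naive upward walks."""
    ancestors = set()
    while u != -1:
        ancestors.add(u)
        u = parent[u]
    while v not in ancestors:
        v = parent[v]
    return v

def path_occurrences(parent, label, count, u, v, ell):
    """Occurrences of ell on the path from u to v, via count and labelanc."""
    def c(x):                                # count(labelanc(x, ell)), 0 if null
        a = labelanc(parent, label, x, ell)
        return count[a] if a is not None else 0
    z = lca(parent, u, v)
    return (c(u) - c(parent[z])) + (c(v) - c(z))

def path_tau_majorities(parent, label, count, u, v, tau):
    """Verify every label on the path (a brute-force candidate set)."""
    z, nodes = lca(parent, u, v), []
    for x in (u, v):
        while x != z:
            nodes.append(x)
            x = parent[x]
    nodes.append(z)
    cands = {label[x] for x in nodes}
    return sorted(l for l in cands
                  if path_occurrences(parent, label, count, u, v, l) > tau * len(nodes))
```

Note that when $z$ is the root, `parent[z]` is `-1`, so the corresponding term is 0, matching the convention $\mathrm{count}(\mathrm{null}) = 0$.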
The time to perform query labelanc is $O(\lg\lg_w \sigma)$ using a linear-space data structure on the tree [17,18,8], and therefore we find all the path $\tau$-majorities in time $O((1/\tau)\lg\lg_w \sigma)$.
The space of our data structure is dominated by the O(lg n) candidate sets C i (x) we store for the marked nodes x. These amount to O((1/τ ) lg n) space per marked node, of which there are O(τ n). Thus, we spend O(n lg n) space in total.
Theorem 3. Let $T$ be a tree of $n$ nodes with labels in $[1..\sigma]$, and $0 < \tau < 1$. On a RAM machine of $w$-bit words, we can build an $O(n\lg n)$ space data structure that answers path $\tau$-majority queries in time $O((1/\tau)\lg\lg_w \sigma)$.

Construction
The construction of the data structure is easily carried out in linear time (including the fields count and the data structure to support labelanc [8]), except for the candidate sets $C_i(x)$ of the marked nodes $x$. We can compute the sets $C_i(x)$ for all $i$ in total time $O(\mathrm{depth}(x))$ using the linear-time algorithm of Misra and Gries [14] because we compute $(\tau/2)$-majorities of doubling-length prefixes $P_i(x)$. This amounts to time $O(mt)$ on a tree of $t$ nodes and $m$ marked nodes. In our case, where $t = n$ and $m \le \tau n$, this is $O(\tau n^2)$.
To reduce this time, we proceed as follows. First we build all the data structure components except the sets C i (x). We then decompose the tree into heavy paths [20] in linear time, and collect the labels along the heavy paths to form a set of sequences. On the sequences, we build in O(t lg t) time the range τ -majority data structure [13,5] that answers queries in time O(1/τ ). The prefix P i (x) for any marked node x then spans O(lg t) sequence ranges, corresponding to the heavy paths intersected by P i (x). We can then compute C i (x) by collecting and checking the O(1/τ ) (τ /2)-majorities from each of those O(lg t) ranges.
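The way an upward path splits into heavy-path ranges can be sketched as follows in Python. This is a minimal illustration assuming a parent array with node 0 as the root and `parent[v] < v` otherwise; it only computes the decomposition and the segments of a root-ward path, not the range $\tau$-majority structure built on the resulting sequences. The function names are hypothetical.

```python
def heavy_paths(parent):
    """Return head[v], the topmost node of the heavy path containing v.
    Assumes parent[0] == -1 and parent[v] < v for v > 0."""
    n = len(parent)
    size = [1] * n
    for v in range(n - 1, 0, -1):
        size[parent[v]] += size[v]
    heavy = [-1] * n                      # heavy[p] = child of p with largest subtree
    for v in range(1, n):
        p = parent[v]
        if heavy[p] == -1 or size[v] > size[heavy[p]]:
            heavy[p] = v
    head = list(range(n))
    for v in range(1, n):                 # head[parent[v]] is already final
        if heavy[parent[v]] == v:
            head[v] = head[parent[v]]
    return head

def ranges_to_root(parent, head, v):
    """Split the path from v to the root into (top, bottom) heavy-path segments."""
    segments = []
    while v != -1:
        segments.append((head[v], v))
        v = parent[head[v]]
    return segments
```

Each time the walk leaves a heavy path through a light edge, the subtree size at least doubles, which is why only $O(\lg t)$ segments are produced.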
Each prefix $P_i(x)$ is formed by some prefix $\pi_1, \ldots, \pi_{j-1}$ plus a prefix of $\pi_j$. We can then carry out a process similar to the one to compute the majorities of $\pi_1, \ldots, \pi_j$, but using only the proper prefix of $\pi_j$. The $O(\lg t)$ sets $C_i(x)$ are then computed in total time $O((1/\tau)\lg t \lg\lg_w \sigma)$. Added over the $m$ marked nodes, we obtain $O((1/\tau) m \lg t \lg\lg_w \sigma)$ construction time. The construction time in our case, where $t = n$ and $m \le \tau n$, is the following.
Corollary 5. The data structure of Theorem 3 can be built in $O(n\lg n\lg\lg_w \sigma)$ time.

A Linear-Space Solution
We can reduce the space of our data structure by stratifying our tree. First, let us create a separate structure to handle unary paths, that is, formed by nodes with only one child. The labels of upward maximal unary paths are laid out in a sequence, and the sequences of the labels of all the unary paths in T are concatenated into a single sequence, S, of length at most n. On S we build the linear-space data structure that solves range τ -majority queries in time O(1/τ ) [13,5]. Each node in a unary path of T points to its position in S. Each node also stores a pointer to its nearest branching ancestor (i.e., one with more than one child).
The stratification then proceeds as follows. We say that a tree node is large if it has more than $(1/\tau)\lg n$ descendant nodes (counting itself); other nodes are small. Then the subset of the large nodes, which is closed by parent, induces a subtree $T'$ of $T$ with the same root and containing at most $\tau n/\lg n$ leaves, because for each leaf in $T'$ there are at least $(1/\tau)\lg n - 1$ distinct nodes of $T$ not in $T'$. Further, $T - T'$ is a forest of trees $\{F_j\}$, each of size at most $(1/\tau)\lg n$.
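The split into large and small nodes can be illustrated with a minimal Python sketch. It assumes a parent array with node 0 as the root and `parent[v] < v` otherwise; the function name `stratify` and the returned list of small-tree roots are hypothetical conveniences, not part of the structure described in the text.

```python
from math import log2

def stratify(parent, tau):
    """Classify nodes as large (more than (1/tau) lg n descendants, counting
    themselves) or small, and list the roots of the small trees."""
    n = len(parent)
    size = [1] * n
    for v in range(n - 1, 0, -1):         # children are visited before parents
        size[parent[v]] += size[v]
    threshold = (1 / tau) * log2(n)
    large = [s > threshold for s in size]
    # A small tree is rooted at a small node whose parent is large (or absent).
    small_roots = [v for v in range(n)
                   if not large[v] and (parent[v] == -1 or large[parent[v]])]
    return large, small_roots
```

Because subtree sizes never decrease when moving to the parent, the set of large nodes is closed by parent, as the text requires.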
We will use for $T'$ a structure similar to the one from Section 3, with some changes to ensure linear space. Note that $T'$ may have $\Theta(n)$ nodes, but since it has at most $\tau n/\lg n$ leaves, $T'$ has only $O(\tau n/\lg n)$ branching nodes. We modify the marking scheme, so that we mark precisely the branching nodes in $T'$. Spending $O((1/\tau)\lg n)$ space for the candidate sets $C_i(x)$ over all branching nodes of $T'$ adds up to $O(n)$ space.
The procedure to solve path $\tau$-majority queries on $T'$ is then as follows. We split the path from $u$ to $v$ into four subpaths, exactly as in Section 3. The subpaths of type 1 and 2 can now be of arbitrary length, but they are unary, thus we obtain their (up to) $1/\tau$ candidates in time $O(1/\tau)$ from the corresponding range of $S$. Finally, we check all the $O(1/\tau)$ candidates in time $O((1/\tau)\lg\lg_w \sigma)$ as in Section 3.
The nodes $u$ and $v$ may, however, belong to some small tree $F_j$, which is of size $|F_j| \le (1/\tau)\lg n$. We preprocess all those trees $F_j$ in a way analogous to Section 3, using its same marking scheme to ensure that at most $\tau|F_j|$ nodes $x$ are marked. The definition of the prefixes, and consequently of their $(\tau/2)$-majorities, however, is slightly modified: $P'_i(x)$ is the sequence of the labels of the first $1 + 2^i/\tau$ nodes in the path from $x$ to the root of its small subtree $F_j$. For each $0 \le i \le \lceil \lg(\tau|F_j|) \rceil \le \lceil \lg\lg n \rceil$, we store $C'_i(x)$, the set of $(\tau/2)$-majorities in $P'_i(x)$.
The sizes $|C'_i(x)|$ are still at most $2/\tau$ for any $x$ and $i$. Lemma 2 applies with $C'_i(x)$ as well, as we show next.
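Lemma 6. Let $x$ be a marked node in a small tree $F_j$. All the $\tau$-majorities in the path from $x$ to a proper ancestor $z$ in $F_j$ are included in $C'_i(x)$ for some suitable $i$.
Proof. Let $d_{xz} = \mathrm{depth}(x) - \mathrm{depth}(z)$ be the distance from $x$ to $z$ (i.e., the length of the path from $x$ to $z$ minus 1). Let $i = \lceil \lg(\tau \cdot d_{xz}) \rceil \ge 0$. The path $P'_i(x)$ contains all the nodes in an upward path of length $1 + 2^i/\tau$ starting at $x$, where $d_{xz} \le 2^i/\tau < 2d_{xz}$. Therefore, $P'_i(x)$ contains node $z$, but its length is $|P'_i(x)| < 1 + 2d_{xz}$. Therefore, any $\tau$-majority in the path from $x$ to $z$ appears more than $\tau \cdot (1 + d_{xz}) > (\tau/2) \cdot (1 + 2d_{xz}) > (\tau/2) \cdot |P'_i(x)|$ times, and thus it is a $(\tau/2)$-majority recorded in $C'_i(x)$.
Note that, if $d_{xz} \le 1/(2\tau)$, we do not need to use any $C'_i(x)$; we can simply collect all the $O(1/\tau)$ elements in the path from $x$ to $z$.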
If the $O(\lg\lg n)$ candidate sets $C'_i(x)$, for a marked node $x$, were stored as in Section 3, they would require $O((1/\tau)\lg\sigma\lg\lg n)$ bits. Instead of storing the candidate labels $\ell$ directly, however, we will store $\mathrm{depth}(y)$, where $y$ is the nearest ancestor of $x$ with label $\ell$. We can then recover $\ell = \mathrm{label}(\mathrm{anc}(x, \mathrm{depth}(y)))$ in constant time. Since the depths in $F_j$ are also $O((1/\tau)\lg n)$, we need only $O(\lg((1/\tau)\lg n))$ bits per candidate. Further, by sorting the candidates by their $\mathrm{depth}(y)$ value, we can encode only the differences between consecutive depths using $\gamma$-codes [21]. Encoding $k$ increasing numbers in $[1..t]$ with this method requires $O(k\lg(t/k))$ bits; therefore we can encode our $O(1/\tau)$ candidates using $O((1/\tau)\lg\lg n)$ bits in total. Added over all the $O(\lg\lg n)$ values of $i$, the candidates $C'_i(x)$ require $O((1/\tau)(\lg\lg n)^2)$ bits per marked node. Added over all the $O(\tau|F_j|)$ marked nodes of $F_j$, this amounts to $O(|F_j|(\lg\lg n)^2)$ bits of space, and added over all the small trees $F_j$, this yields $O(n(\lg\lg n)^2)$ bits, or $o(n)$ words, in total. The other pointers of $F_j$, as well as node labels, can be represented normally, as they are $O(n)$ in total.
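The differential encoding of the candidate depths can be illustrated with a minimal Python sketch of Elias $\gamma$-codes. Bits are modeled as a string of `'0'` and `'1'` characters for readability; a real implementation would pack them into machine words. The sketch assumes the depths are distinct positive integers, as in the "increasing numbers in $[1..t]$" setting of the text, and the function names are hypothetical.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer: floor(lg x) zeros, then x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def encode_depths(depths):
    """Sort the depths and gamma-encode the gaps between consecutive values."""
    prev, out = 0, []
    for d in sorted(depths):
        out.append(gamma_encode(d - prev))
        prev = d
    return "".join(out)

def decode_depths(bits):
    """Inverse of encode_depths: recover the sorted list of depths."""
    depths, prev, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":             # unary part gives the binary length - 1
            zeros += 1
            i += 1
        prev += int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        depths.append(prev)
    return depths
```

For the depths 3, 4, and 9 the gaps are 3, 1, and 5, so small gaps cost few bits; this is the effect behind the $O(k\lg(t/k))$ bound quoted above.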
To solve a general path $\tau$-majority query from $u$ to $v$, we compute $z = \mathrm{lca}(u, v)$ and process the path from $u$ to $z$ as follows:
• If $u$ (and thus $z$) belongs to $T'$, then we proceed on $T'$ as explained.
• If $z$ (and thus $u$) belongs to some small tree $F_j$, then we proceed on $F_j$ as in Section 3, collecting $O(1/\tau)$ candidates in our path from $u$ to its nearest marked ancestor $x$, and then other $O(1/\tau)$ candidates from the corresponding set $C'_i(x)$.
• If $u$ is in some $F_j$ and $z$ is in $T'$, then let $u'$ be the root of $F_j$ (we have enough space to store a pointer to $u'$ for each node $u$), whose parent is a leaf in $T'$. Then we collect $O(1/\tau)$ candidates in the path from $u$ to $u'$ using the mechanism of $F_j$, and then other $O(1/\tau)$ candidates in the path from the parent of $u'$ to $z$ using the mechanism of $T'$.
Other $O(1/\tau)$ candidates are collected analogously in the path from $v$ to $z'$, where $z'$ is the child of $z$ in the path to $v$, that is, $z' = \mathrm{anc}(v, \mathrm{depth}(z) + 1)$. Finally, all the candidates are checked as in Section 3, each in time $O(\lg\lg_w \sigma)$.
The time to build the structures on $T'$, using the technique of Lemma 4, is $O(n\lg\lg_w \sigma)$ because $T'$ has $t = O(n)$ nodes and $m = O(\tau n/\lg n)$ marked nodes. For the small trees $F_j$, we can use the $O(mt)$-time method described in the first paragraph of Section 3.2. Since on $F_j$ it holds that $t = |F_j| \le (1/\tau)\lg n$ and $m \le \tau \cdot |F_j|$, the construction time is $O(|F_j|\lg n)$, which adds up to $O(n\lg n)$. Note that we also need $O(n\lg n)$ time to build the range majority data structure on $S$.
Example. Figure 1 shows an example tree $T$ where we have defined that a node is large if it has more than 7 descendant nodes (including itself). The large nodes of $T$ form $T'$, which has gray background. The branching nodes of $T'$ are circled; those are the sampled nodes of $T'$. We have chosen the path between a small node $u$ and a large node $v$. The node $u$ is then within a small subtree $F_j$, rooted at $u'$. The path $P_{uv}$ between $u$ and $v$ is split into several subpaths: (1) from $u$ to $u'$, which is handled within the subtree $F_j$ (possibly with a mix of brute force and the use of a set $C'_i(\cdot)$); (2) from the parent of $u'$ to its nearest marked ancestor $x$ in $T'$ (excluding $x$), which is a unary path and thus handled with a range query on the sequence $S$ (not drawn); (3) from $x$ to $z = \mathrm{lca}(u, v)$, which is handled with a set $C_i(x)$ for some $i$; (4) from $v$ to $z' = \mathrm{anc}(v, \mathrm{depth}(z) + 1)$, which is handled with a set $C_i(v)$ for some $i$, because $v$ belongs to $T'$ and is a sampled node.
Figure 1: An example tree $T$ where the labels are the letters of each node. The top tree $T'$ of large nodes (with more than 7 descendants) has gray background. The path between two chosen nodes $u$ and $v$ is highlighted in dashed lines; note that $u$ belongs to a small subtree $F_j$ rooted at $u'$, whereas $v$ belongs to $T'$. We also show the nodes $z = \mathrm{lca}(u, v)$ and $z' = \mathrm{anc}(v, \mathrm{depth}(z) + 1)$. Finally, we show the nearest branching ancestor $x$ of $u'$.
Figure 2 shows the nodes included in each prefix $P_i(x)$ in the path from $x$ to the root of $T$, for $i = 0$ to 4.
Assume $\tau = 1/3$. The subpaths (1) and (2) do not yield any candidate $\tau$-majority, since no label appears in more than a third of the subpath nodes. Instead, $C_4(x) = \{b, c\}$ (since $b$ and $c$ are the $(\tau/2 = 1/6)$-majorities in the path $P_4(x)$ from $x$ to the root of $T$) and $C_3(v) = \{a, b\}$ (since $a$ and $b$ are the $(\tau/2 = 1/6)$-majorities in the path $P_3(v)$ from $v$ to $z'$). We thus check the candidates $a$, $b$, and $c$, and report only $b$, because it appears $9 > \tau \cdot |P_{uv}| = (1/3) \cdot 25 \approx 8.33$ times in $P_{uv}$.

A Succinct Space Solution
To obtain a succinct-space structure from Theorem 7, we increase the thresholds that define the large nodes in Section 4 and generalize the stratification to several levels. Let us say that the original tree $T$ is of level 0. We now define the large nodes as those whose subtree size is larger than $(1/\tau)(\lg n)^3$; these form the nodes corresponding to $T'$ in Section 4. The small trees $F_j$ of Section 4, which here are of size $\le (1/\tau)(\lg n)^3$, are said to be of level 1. We recursively apply the same stratification on the small trees $F_j$. On those, we define large nodes as those whose subtree size is larger than $(1/\tau)(\lg\lg n)^3$; the resulting small trees are said to be of level 2. We iterate this process $\kappa$ times. In general, the trees of level $1 \le k \le \kappa$ are of size at most $(1/\tau)(\lg^{[k]} n)^3$. The large nodes of the trees of level $0 \le k < \kappa$ are those whose subtree size exceeds $(1/\tau)(\lg^{[k+1]} n)^3$. The smallest trees, of level $\kappa$, are of size at most $(1/\tau)(\lg^{[\kappa]} n)^3$ and are not further decomposed.
Level 0 can be handled exactly as $T'$ in Section 4. In this case, since $T'$ has $O(\tau n/\lg^3 n)$ branching nodes, the space for the sets $C_i(x)$ amounts to only $O(n/\lg n) = o(n)$ bits. In all the other levels, except the last one, we sample the branching nodes (as done for $T'$ in Section 4), but build on them the sets $C'_i(x)$ (as done for the subtrees $F_j$ in Section 4). A tree $F$ of level $1 \le k < \kappa$ has $t \le (1/\tau)(\lg^{[k]} n)^3$ nodes and $m \le \tau \cdot |F|/(\lg^{[k+1]} n)^3$ branching nodes. The representation of a set $C'_i(x)$ in such a tree $F$, using the described differential encoding, takes $O((1/\tau)\lg((\lg^{[k]} n)^3)) = O((1/\tau)\lg^{[k+1]} n)$ bits. Added over all the branching nodes, we obtain $O(|F|/(\lg^{[k+1]} n)^2)$ bits. Since every node belongs to one tree $F$, the total space amounts to $O(n/(\lg^{[\kappa]} n)^2)$ bits.
We aim to use about $\lg^* n$ levels. This will introduce a slowdown factor of the same order in query times, but in exchange the smallest trees will be small enough that they can be traversed by brute force, within the same penalty factor as well. We must carefully choose $\kappa$ so as to also obtain $o(n)$ bits of space for all the sets $C'_i(x)$. Thus we set $\kappa = 1 + \lg^* n - \lg^{**} n$, so that there are $\kappa = O(\lg^* n)$ levels, and the last-level subtrees are of […]
The general process to solve a path $\tau$-majority query from $u$ to $v$ is then as follows. We compute $z = \mathrm{lca}(u, v)$ and split the path from $u$ to $z$ into $k' - k + 1$ subpaths, where $k$ and $k'$ (note $k \le k' \le \kappa$) are the levels of the subtree where $z$ and $u$ belong, respectively. Let us call $u_i$ the root of the subtree of level $i$ that is an ancestor of $u$, except that we call $u_k = z$. For uniformity, the sets $C_i(x)$ of level 0 are called $C'_i(x)$ as well.
1. If $k' = \kappa$, then $u$ belongs to one of the smallest subtrees. We then collect the $o((1/\tau)\lg^* n)$ node labels in the path from $u$ to $u_\kappa$ one by one and include them in the set of candidates. We then move to the parent of that root, setting $u \leftarrow \mathrm{parent}(u_\kappa)$ and $k' \leftarrow \kappa - 1$.
2. At levels $k \le k' < \kappa$, if $u$ is a branching node, we collect the $2/\tau$ candidates from the corresponding set $C'_i(u)$, where $i$ is sufficient to cover $u_{k'}$ ($C'_i(u)$ will not store candidates beyond the subtree root). We then set $u \leftarrow \mathrm{parent}(u_{k'})$ and $k' \leftarrow k' - 1$.
3. At levels $k \le k' < \kappa$, if $u$ is not a branching node, let $x$ be the lowest between $\mathrm{parent}(z)$ and the nearest branching ancestor of $u$. Let also $p$ be the position of $u$ in $S$. Then we find the […]; see Section 5.1. We then continue from $u \leftarrow x$ and $k' \leftarrow k(x)$, where $k(x)$ is the level of the subtree where $x$ belongs. Note that $k(x)$ can be equal to $k'$, but it can also be any other level less than $k'$.
We stop when $u = \mathrm{parent}(z)$.
A similar procedure is followed to collect the candidates from $v$ to $z'$, where again $z' = \mathrm{anc}(v, \mathrm{depth}(z) + 1)$ is the child of $z$ in the path to $v$.
In total, since each path has at most one case 2 and one case 3 per level k, we collect at most 2κ = O(lg * n) candidate sets of size O(1/τ ), plus two of size o((1/τ ) lg * n). The total cost to verify all the candidates is then O((1/τ ) lg * n lg lg w σ).
The construction time, using Lemma 4 on level 0, is O(n lg lg w σ) as in Section 4. Applied on level 1, the lemma yields O((n/(lg lg n) 3 ) lg((1/τ ) lg n) lg lg w σ) = o(n lg n) construction time. For higher levels, we use the basic quadratic method described in the first paragraph of Section 3.2: a subtree 3 ) time for level k. This is maximized at level k = 2, yielding time O(n(lg lg n/ lg lg lg n) 3 ) = o(n lg n). All these costs are dominated by the O(n lg n) time to build the range majority data structure on S, which also absorbs the time to sort all the sets C i (x) by decreasing frequency.
We still need, however, to use succinct space for all the other linear-space components of the structure. The topology of the whole tree $T$ can be represented using a sequence $P$ of balanced parentheses in $2n + o(n)$ bits, supporting in constant time all the standard tree traversal operations we use [19]. We assume that opening and closing parentheses are represented with 1s and 0s in $P$, respectively. Let us now focus on the less standard operations needed.

Counting labels in paths
In Section 3, we count the number of times a label occurs in the path from u to the root by means of a query labelanc and by storing count fields in the nodes. In Section 4, we use in addition a string S to support range majority queries on the unary paths.
To solve labelanc queries, we use the representation of Durocher et al. [8, Lem. 7], which uses $nH + 2n + o(n)(H+1)$ bits in addition to the $2n + o(n)$ bits of the tree topology. This representation includes a string $S[1..n]$ where all the labels of $T$ are written in preorder; any implementation of $S$ supporting access, rank, and select in time $O(\lg\lg_w \sigma)$ can be used (e.g., [11]). This string can also play the role of the one we call $S$ in Section 4, because the labels of unary paths are contiguous in $S$, and any node $v$ can access its label from $S[\mathrm{preorder}(v)]$.
On top of this string we must also answer range τ -majority queries in time O((1/τ ) lg lg w σ). We can use the slow variant of the succinct structure described in Section 2.3, which requires only o(n)(H + 1) additional bits and also supports access in O(1) time and rank and select in time O(lg lg w σ). This variant of the structure is built in O(n lg n) time.
In addition to supporting operation labelanc, we need to store or compute the count fields. Durocher et al. [8] also require this field, but find no succinct way to represent it. We now show a way to obtain this value within succinct space.
The sequence $S$ lists the labels of $T$ in preorder, that is, aligned with the opening parentheses of $P$. Assume we have another sequence $S'[1..n]$ where the labels of $T$ are listed in postorder (i.e., aligned with the closing parentheses of $P$). Since the opened parentheses not yet closed in $P[1..i]$ are precisely node $i$ and its ancestors, we can compute the number of times a label $\ell$ appears in the path from $P[i]$ to the root as $\mathrm{rank}_\ell(S, \mathrm{rank}_1(P, i)) - \mathrm{rank}_\ell(S', \mathrm{rank}_0(P, i))$.
Therefore, we can support this operation with $nH + o(n)(H+1)$ additional bits. Note that, with this representation, we do not need the operation labelanc, since we do not need that $P[i]$ itself is labeled $\ell$.
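The rank difference above can be checked on a small tree. The following is a minimal sketch, not the succinct structure itself: it assumes plain Python lists for $P$, the preorder sequence and the postorder sequence, and uses linear-time scans in place of the rank structures. The names `rank_bit`, `rank_label`, `count_to_root`, `S_pre` and `S_post` are hypothetical.

```python
def rank_bit(P, b, i):
    """Number of bits equal to b in P[1..i] (1-indexed, inclusive)."""
    return sum(1 for x in P[:i] if x == b)

def rank_label(S, label, i):
    """Number of occurrences of label in S[1..i] (1-indexed, inclusive)."""
    return sum(1 for x in S[:i] if x == label)

def count_to_root(P, S_pre, S_post, label, i):
    """Occurrences of label on the path from the node opening at P[i] to the root."""
    opened = rank_bit(P, 1, i)   # nodes opened so far (preorder prefix)
    closed = rank_bit(P, 0, i)   # nodes already closed (postorder prefix)
    return rank_label(S_pre, label, opened) - rank_label(S_post, label, closed)

# Toy tree (an assumption for illustration): root 'a' with two children,
# 'b' (whose only child is 'a') and 'a' (whose only child is 'b').
P      = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]   # ((())(()))
S_pre  = ['a', 'b', 'a', 'a', 'b']        # labels in preorder
S_post = ['a', 'b', 'b', 'a', 'a']        # labels in postorder

# The node opening at position 7 is the last leaf, with root path b, a, a.
print(count_to_root(P, S_pre, S_post, 'a', 7))
```

Subtracting the postorder rank removes exactly the nodes that were opened and already closed before position $i$, which are the nodes outside the root path.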
If we do use operation labelanc, however, we can ensure that $P[i]$ is labeled $\ell$, and another solution is possible based on partial rank queries. Let $o = \mathrm{rank}_\ell(S, \mathrm{rank}_1(P, i))$ and $c = \mathrm{rank}_\ell(S', \mathrm{rank}_0(P, i))$ be the numbers of opening and closing parentheses labeled $\ell$ up to $P[i]$, respectively, so that we want to compute $o - c$. Since $P[i]$ is labeled $\ell$, it holds that $S[\mathrm{rank}_1(P, i)] = \ell$, and thus $o = \mathrm{prank}(S, \mathrm{rank}_1(P, i))$. To compute $c$, we do not store $S'$, but rather $S''[1..2n]$, so that $S''[i]$ is the label of the node whose opening or closing parenthesis is at $P[i]$ (i.e., $S''$ is formed by interleaving $S$ and $S'$). Then, $\mathrm{prank}(S'', i) = o + c$; therefore the answer we seek is $o - c = 2 \cdot \mathrm{prank}(S, \mathrm{rank}_1(P, i)) - \mathrm{prank}(S'', i)$.
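The partial-rank identity can be illustrated in the same naive setting. This is a sketch under stated assumptions: `prank` is simulated by a linear scan rather than the constant-time structure, the interleaved sequence is built explicitly with a stack, and the names `interleave_labels`, `prank` and `count_to_root_prank` are hypothetical.

```python
def interleave_labels(P, S_pre):
    """Label of the node owning each parenthesis of P (opening or closing)."""
    stack, out, nxt = [], [], 0
    for bit in P:
        if bit == 1:                 # opening parenthesis: next node in preorder
            stack.append(S_pre[nxt])
            nxt += 1
            out.append(stack[-1])
        else:                        # closing parenthesis: node on top of the stack
            out.append(stack.pop())
    return out

def prank(S, i):
    """Partial rank: occurrences of S[i] in S[1..i] (1-indexed, inclusive)."""
    return sum(1 for x in S[:i] if x == S[i - 1])

def count_to_root_prank(P, S_pre, S_inter, i):
    """Occurrences of the label of P[i] on its root path; P[i] must be an opening parenthesis."""
    opened = sum(P[:i])              # rank_1(P, i)
    return 2 * prank(S_pre, opened) - prank(S_inter, i)

P       = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]   # same toy tree: ((())(()))
S_pre   = ['a', 'b', 'a', 'a', 'b']
S_inter = interleave_labels(P, S_pre)
print(count_to_root_prank(P, S_pre, S_inter, 7))
```

Note that the query only works for the label stored at $P[i]$ itself, which is why this variant relies on labelanc to first move to a node carrying the label of interest.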
We use the structure for constant-time partial rank queries [12, Sec

Other data structures
The other fields stored at tree nodes, which we must now compute within succinct space, are the following:

Pointers to candidate sets $C_i(x)$. All the branching nodes in all subtrees except those of level $\kappa$ are marked, and there are $O(n/(\lg^{[\kappa]} n)^3) = o(n)$ such nodes. We can then mark their preorder ranks with 1s in a bitvector $M[1..n]$. Since $M$ has $o(n)$ 1s, it can be represented within $o(n)$ bits [10] while supporting constant-time rank and select operations. We can then find out when a node $i$ is marked (iff $M[\mathrm{preorder}(i)] = 1$), and if it is, its rank among all the marked nodes, $r = \mathrm{rank}_1(M, \mathrm{preorder}(i))$. The $C_i(x)$ sets of all the marked nodes $x$ of any level can be written down in a contiguous memory area of total size $o(n)$ bits, sorted by the preorder rank of $x$. A bitvector $C$ of length $o(n)$ marks the starting position of each new node $x$ in this memory area. Then the area for marked node $i$ starts at $p = \mathrm{select}_1(C, r)$. A second bitvector $D$ can mark the starting position of each $C_j(x)$ in the memory area of each node $x$, and thus we access the specific set $C_j(x)$ from position $\mathrm{select}_1(D, \mathrm{rank}_1(D, p - 1) + j)$.
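The two-level addressing with $M$, $C$ and $D$ can be sketched as follows. This is only an illustration with uncompressed lists and linear-time rank and select. The flat list `area`, the convention that a set ends where the next 1 of $D$ begins, and the function name `candidate_set` are assumptions made for the example.

```python
def rank1(B, i):
    """Number of 1s in B[1..i] (1-indexed, inclusive)."""
    return sum(B[:i])

def select1(B, k):
    """Position of the k-th 1 in B, or len(B) + 1 if there is none."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit
        if bit == 1 and seen == k:
            return pos
    return len(B) + 1

def candidate_set(M, C, D, area, preorder, j):
    """Return the j-th candidate set of the node with the given preorder rank."""
    if M[preorder - 1] != 1:
        return None                      # node is not marked
    r = rank1(M, preorder)               # rank among marked nodes
    p = select1(C, r)                    # start of this node's area
    k = rank1(D, p - 1) + j              # global index of the wanted set
    start, end = select1(D, k), select1(D, k + 1)
    return area[start - 1:end - 1]

# Toy layout (assumed): nodes with preorder ranks 2 and 5 are marked.
M    = [0, 1, 0, 0, 1, 0]
area = [7, 3, 3, 4, 4, 9, 1]             # node 2: [7,3],[3]; node 5: [4],[4,9],[1]
C    = [1, 0, 0, 1, 0, 0, 0]             # first position of each node's area
D    = [1, 0, 1, 1, 1, 0, 1]             # first position of each set
print(candidate_set(M, C, D, area, 5, 2))
```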
Pointers to subtree roots. We store an additional bitvector $B[1..2n]$, parallel to the parentheses bitvector $P[1..2n]$. In $B$, we mark with 1s the positions of the opening and closing parentheses that are roots of subtrees of any level. As there are $O(n/(\lg^{[\kappa]} n)^3) = o(n)$ such nodes, $B$ can be represented within $o(n)$ bits while supporting constant-time rank and select operations. We also store the sequence of $o(n)$ parentheses $P'$ corresponding to those in $P$ marked with 1s in $B$. The nearest subtree root containing node $P[i]$ is obtained by finding the nearest position to the left that is marked in $B$, i.e., $j = \mathrm{select}_1(B, r)$ with $r = \mathrm{rank}_1(B, i)$, and then considering the corresponding position $P'[r]$. If it is an opening parenthesis, then the nearest subtree root is the node whose parenthesis opens in $P[j]$. Otherwise, it is the one opening at $P[j']$, where $j' = \mathrm{select}_1(B, \mathrm{enclose}(P', \mathrm{open}(P', r)))$ (see [22, Sec. 4.1]).
Finding the nearest branching ancestor. A unary path looks like a sequence of opening parentheses followed by a sequence of closing parentheses. The nearest branching ancestor of $P[i]$ is obtained in constant time by finding the nearest closing parenthesis to the left of $P[i]$, $l = \mathrm{select}_0(P, \mathrm{rank}_0(P, i))$, and the nearest opening parenthesis to the right of the parenthesis that closes $P[i]$, $r = \mathrm{select}_1(P, \mathrm{rank}_1(P, \mathrm{close}(P, i)) + 1)$. Then the answer is the larger between $\mathrm{enclose}(P, \mathrm{open}(P, l))$ and $\mathrm{enclose}(P, r)$.
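The parenthesis manipulation for the nearest branching ancestor can be traced with a naive simulation. The sketch below assumes a plain list for $P$ and implements `close`, `open_` and `enclose` by linear scans instead of the constant-time operations of [19]. Returning `None` when no branching ancestor exists is a convention chosen for the example.

```python
def rank(P, b, i):
    return sum(1 for x in P[:i] if x == b)

def select(P, b, k):
    seen = 0
    for pos, x in enumerate(P, start=1):
        if x == b:
            seen += 1
            if seen == k:
                return pos
    return None

def close(P, i):
    """Position of the parenthesis closing the one opened at i."""
    depth = 0
    for j in range(i, len(P) + 1):
        depth += 1 if P[j - 1] == 1 else -1
        if depth == 0:
            return j

def open_(P, i):
    """Position of the parenthesis opening the one closed at i."""
    depth = 0
    for j in range(i, 0, -1):
        depth += 1 if P[j - 1] == 0 else -1
        if depth == 0:
            return j

def enclose(P, i):
    """Opening position of the parent of the node opening at i (None for the root)."""
    depth = 0
    for j in range(i - 1, 0, -1):
        depth += 1 if P[j - 1] == 0 else -1
        if depth < 0:
            return j
    return None

def nearest_branching_ancestor(P, i):
    candidates = []
    zeros = rank(P, 0, i)
    if zeros > 0:                                   # a node closes before P[i]
        l = select(P, 0, zeros)
        candidates.append(enclose(P, open_(P, l)))
    r = select(P, 1, rank(P, 1, close(P, i)) + 1)   # a node opens after P[i] closes
    if r is not None:
        candidates.append(enclose(P, r))
    candidates = [c for c in candidates if c is not None]
    return max(candidates) if candidates else None

# Toy tree (assumed): ((((()))())) where the node opening at 2 has two children.
P = [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(nearest_branching_ancestor(P, 5))
```

Taking the larger of the two positions selects the lower of the two ancestors, since an ancestor that opens later in $P$ is deeper in the tree.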
Determining the subtree level of a node. We can compute $s = \mathrm{subtreesize}(i)$ of a node $P[i]$ in constant time, so we can determine the corresponding level: if $s > (1/\tau)\lg^3 n$, it is level 1. Otherwise, we look up $\tau \cdot s$ in a precomputed table of size $O(\lg^3 n)$ that stores the level corresponding to each possible size.
Therefore, depending on whether we represent both $S$ and $S'$ or use partial rank structures, we obtain two results within succinct space.
Theorem 8. Let T be a tree of n nodes with labels in [1..σ], and 0 < τ < 1. On a RAM machine of w-bit words, we can build in O(n lg n) time a data structure using 2nH + 4n + o(n)(H + 1) bits, where H ≤ lg σ is the entropy of the distribution of the node labels, that answers path τ -majority queries in time O((1/τ ) lg * n lg lg w σ).
Theorem 9. Let T be a tree of n nodes with labels in [1..σ], and 0 < τ < 1. On a RAM machine of w-bit words, we can build in O(n lg n) time (w.h.p.) a data structure using nH +O(n)+o(nH) bits, where H ≤ lg σ is the entropy of the distribution of the node labels, that answers path τ -majority queries in time O((1/τ ) lg * n lg lg w σ).
We can also retain the same complexity of the linear-space version by using a constant number $\kappa$ of levels, at the cost of using a slightly superlinear number of bits. In this case, we do not use brute force on the last-level trees, but rather combine the marking scheme of Section 3 with the storage format of the sets $C_i(x)$. Level $\kappa$ then requires $O(n(\lg^{[\kappa+1]} n)^2)$ bits and its $O(1/\tau)$ candidates are obtained as in Section 3. Next we write $O(n(\lg^{[\kappa+1]} n)^2) \subset O(n \lg^{[\kappa]} n)$ for simplicity.
Theorem 10. Let T be a tree of n nodes with labels in [1..σ], and 0 < τ < 1. On a RAM machine of w-bit words, for any constant κ, we can build in O(n lg n) time (w.h.p.) a data structure using nH + O(n lg [κ] n) + o(nH) bits, where H ≤ lg σ is the entropy of the distribution of the node labels, that answers path τ -majority queries in time O((1/τ ) lg lg w σ).
We note that, within this space, all the typical tree navigation functionality, as well as access to labels, is supported.

Variable τ
Up to now, the value of $\tau$ is known at index construction time and cannot be changed later (we can obviously query for some $\tau' \ge \tau$ by using $\tau'$ when verifying the candidates, but the time is still proportional to $1/\tau$). We aim at a structure that is independent of $\tau$ and can receive it together with the query nodes $u$ and $v$, and answer in time proportional to $1/\tau$.
Note that, if τ = O(1/σ), we can simply test all the candidates of the alphabet in P uv in time O(σ lg lg w σ) = O((1/τ ) lg lg w σ). Therefore, we only care about values τ > 2/σ.
This solution increases the space by a factor of $\lg \sigma$. Note, however, that in the succinct solutions of Theorems 8 to 10, the space component $O(nH) + 4n + o(n)$ is due to the tree topology and the sequence $S$, which do not depend on the value of $\tau$. In particular, the representation of $S$ is used to perform $\tau$-majority queries on the unary paths, but it allows $\tau$ to be specified at query time [5].
The structures that do depend on $\tau$ (i.e., the information on levels and all the candidate sets $C_i(x)$) require only $o(n)$ bits in Theorems 8 and 9, and $O(n \lg^{[\kappa]} n)$ bits in Theorem 10. These spaces stay succinct or near-succinct even after our space increase. We then obtain results close to those of Theorems 8 to 10, now for any $\tau$ specified at query time.
The construction time of the structure is O(n(lg σ/ lg lg n) 2 + n lg n), which includes the time to build lg σ copies of the C i (x) structures.
Theorem 11. Let T be a tree of n nodes with labels in [1..σ]. On a RAM machine of w-bit words, we can build in O(n(lg σ/ lg lg n) 2 + n lg n) time a data structure using 2nH + 4n + o(n lg σ) bits, where H ≤ lg σ is the entropy of the distribution of the node labels, that answers path τ -majority queries for any 0 < τ < 1, in time O((1/τ ) lg * n lg lg w σ).
Theorem 12. Let T be a tree of n nodes with labels in [1..σ]. On a RAM machine of w-bit words, we can build in O(n(lg σ/ lg lg n) 2 + n lg n) time (w.h.p.) a data structure using nH + O(n) + o(n lg σ) bits, where H ≤ lg σ is the entropy of the distribution of the node labels, that answers path τ -majority queries for any 0 < τ < 1, in time O((1/τ ) lg * n lg lg w σ).
Theorem 13. Let T be a tree of n nodes with labels in [1..σ]. On a RAM machine of w-bit words, for any constant κ, we can build in O(n(lg σ/ lg lg n) 2 + n lg n) time (w.h.p.) a data structure using O(n lg σ lg [κ] n) bits, that answers path τ -majority queries for any 0 < τ < 1, in time O((1/τ ) lg lg w σ).

Path τ -Minorities
A path τ -minority query asks for a τ -minority in a given path P uv , that is, a label that appears at least once and at most τ · |P uv | times in this path. If we try A = 1 + 1/τ distinct elements in the path from u to v, then one of them will turn out to be a τ -minority. With this idea, we extend the technique of Chan et al. [23] to tree paths. To find a τ -minority, we will find A distinct labels (or all the labels, if there are not that many) in the path P uz , where z = lca(u, v), and check their frequency in P uv . We then run an analogous process on the path P vz . We will stop as soon as we find a label that is not a τ -majority. We describe the process on P uz , as P vz is analogous. Note that we need to know τ only at query time.
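The claim that one of the $A$ distinct labels must be a $\tau$-minority follows from a counting argument, spelled out below as a short derivation. It only assumes that $A$ is an integer with $A > 1/\tau$, which holds for the value of $A$ given above; the symbol $\mathrm{occ}(\ell_i)$ for the number of occurrences of $\ell_i$ in $P_{uv}$ is introduced here for illustration.

```latex
% Suppose, for contradiction, that all A distinct labels \ell_1, ..., \ell_A
% found in P_{uv} are \tau-majorities, i.e., occ(\ell_i) > \tau |P_{uv}|.
\[
  |P_{uv}| \;\ge\; \sum_{i=1}^{A} \mathrm{occ}(\ell_i)
           \;>\; A \cdot \tau \cdot |P_{uv}|
           \;>\; \frac{1}{\tau} \cdot \tau \cdot |P_{uv}|
           \;=\; |P_{uv}| ,
\]
% a contradiction. Hence some \ell_i has 1 \le occ(\ell_i) \le \tau |P_{uv}|,
% that is, \ell_i is a \tau-minority of P_{uv}.
```

If the path contains fewer than $A$ distinct labels, all of them are tested, so a $\tau$-minority is found whenever one exists.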
To find $A$ distinct labels, we will simulate on $P_{uz}$ the algorithm of Muthukrishnan [24], which finds $A$ distinct elements in any range of an array $E$. In his algorithm, Muthukrishnan defines the array $C$ where $C[i] = \max\{j < i : E[j] = E[i]\}$ ($C[i]$ is set to 0 if such a value does not exist) and builds on $C$ a range minimum query (RMQ) data structure; a range minimum query asks for the minimum element in a given subrange of the array. Then he finds $A$ (or all the) distinct elements in any range $E[i..j]$ via $O(A)$ RMQs.
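The array version of this procedure can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the RMQ is replaced by a linear scan, positions are 1-indexed as in the text, and the names `prev_occurrence` and `distinct_in_range` are hypothetical.

```python
def prev_occurrence(E):
    """C[p] = largest q < p with E[q] == E[p], or 0 if there is none (1-indexed)."""
    last, C = {}, []
    for pos, x in enumerate(E, start=1):
        C.append(last.get(x, 0))
        last[x] = pos
    return C

def distinct_in_range(E, C, i, j, limit):
    """Report up to `limit` distinct elements of E[i..j] (1-indexed, inclusive)."""
    out = []
    stack = [(i, j)]
    while stack and len(out) < limit:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive stand-in for the RMQ structure: position of the minimum of C[lo..hi].
        m = min(range(lo, hi + 1), key=lambda p: C[p - 1])
        if C[m - 1] >= i:
            continue          # nothing in [lo..hi] is a first occurrence within E[i..j]
        out.append(E[m - 1])  # E[m] occurs here for the first time in E[i..j]
        stack.append((lo, m - 1))
        stack.append((m + 1, hi))
    return out

E = [3, 1, 3, 2, 1, 3]
C = prev_occurrence(E)
print(sorted(distinct_in_range(E, C, 2, 6, limit=10)))
```

A position is reported only when its previous occurrence lies before $i$, so every reported element is distinct, and each reported element costs a constant number of range minimum queries.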
In our case, we store for each node u the field prevlabel(u) = depth(labelanc(parent(u), label(u))) , which is the depth of the nearest ancestor of u with its same label (and −1 if there is none). Then we conceptually define E and C over P uz , where E[i] = label(anc(u, depth(z) − 1 + i)) and C[i] = 1 + prevlabel(anc(u, depth(z) − 1 + i)). Note that we do not store E or C explicitly, but each entry of E or C can be computed in constant time using these formulas. To solve RMQs on C, we also build the linear-space data structure of Chazelle [25], which can return the minimum-weight node in any path of a weighted tree in constant time. This data structure is constructed over the tree T , for which we assign prevlabel(u) as the weight of each node u. With all these structures, we can run Muthukrishnan's algorithm and obtain A distinct labels of P uz . This yields our first result, which slightly reduces the O((1/τ ) lg lg n) time (within linear space) of Durocher et al. [8]. Note that the prevlabel fields are easily computed in O(n) time in a DFS traversal.
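The linear-time computation of the prevlabel fields can be sketched with one DFS that keeps, for each label, the depths of the ancestors currently carrying it. The sketch assumes the tree is given as child lists with the root at depth 0; the function name `compute_prevlabel` and this input format are assumptions for illustration.

```python
def compute_prevlabel(children, labels, root=0):
    """prevlabel[u] = depth of the nearest proper ancestor with u's label, or -1."""
    prevlabel = [-1] * len(labels)
    depths_by_label = {}              # label -> depths of ancestors on the current path
    todo = [(root, 0, True)]          # (node, depth, entering?)
    while todo:
        u, d, entering = todo.pop()
        stack = depths_by_label.setdefault(labels[u], [])
        if entering:
            prevlabel[u] = stack[-1] if stack else -1
            stack.append(d)
            todo.append((u, d, False))        # revisit u after its subtree
            for c in children[u]:
                todo.append((c, d + 1, True))
        else:
            stack.pop()                       # u leaves the current root path
    return prevlabel

# Toy tree (assumed): node 0 is the root with children 1 and 3;
# node 2 is the child of 1 and node 4 is the child of 3.
children = [[1, 3], [2], [], [4], []]
labels   = ['a', 'b', 'a', 'a', 'b']
print(compute_prevlabel(children, labels))
```

Each node is pushed and popped a constant number of times, which gives the $O(n)$ bound mentioned above (assuming constant-time dictionary access).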
Theorem 14. Let $T$ be a tree of $n$ nodes with labels in $[1..\sigma]$. On a RAM machine of $w$-bit words, we can build an $O(n)$ space data structure that answers path $\tau$-minority queries for any $0 < \tau < 1$, in time $O((1/\tau)\lg\lg_w \sigma)$.
The structure is built in linear time.
It is likely that the result of Durocher et al. [8] can be improved to match ours, by just using a faster predecessor data structure. We can, however, make our solution succinct by using our tree representation of 2n + o(n) bits [19]. Instead of storing field prevlabel, we compute it on the fly with the given formula. Using the structures of Durocher et al. [8,Lem. 7], we can compute labelanc in time O(lg lg w σ). Their structure uses 2n + o(n) bits in addition to the topology of T and the representation of S.
The structure for RMQs, on the other hand, can be replaced by the one of Chan et al. [26], which uses $2n + o(n)$ further bits and answers RMQs with $O(\alpha(n))$ queries $\mathrm{prevlabel}(u)$, where $\alpha$ is the inverse Ackermann function. Therefore, we can spot the $A$ candidates in time $O(A \cdot \alpha(n)\lg\lg_w \sigma)$ and then verify them in time $O(A \cdot \lg\lg_w \sigma)$. This yields the first result for path $\tau$-minority queries within succinct space.
Theorem 15. Let T be a tree of n nodes with labels in [1..σ]. On a RAM machine of w-bit words, we can build in O(n) time a data structure using nH + 6n + o(n)(H + 1) bits, where H ≤ lg σ is the entropy of the distribution of the node labels, that answers path τ -minority queries for any 0 < τ < 1, in time O((1/τ )α(n) lg lg w σ), where α is the inverse Ackermann function.

Multi-labeled trees
As mentioned in the Introduction, many applications of these results require that the trees are multi-labeled, that is, each node holds several labels. We can easily accommodate multi-labeled trees $T$ in our data structure, by building a new tree $T^*$ where each node $u$ of $T$ with $m(u)$ labels $\ell_1, \ldots, \ell_{m(u)}$ is replaced by an upward path of nodes $u_1, \ldots, u_{m(u)}$, each $u_i$ holding the label $\ell_i$ and being the only child of $u_{i+1}$ (and $u_{m(u)}$ being a child of $v_1$, where $v$ is the parent of $u$ in $T$). Path queries from $u$ to $v$ in $T$ are then transformed into path queries from $u_1$ to $v_1$ in $T^*$, except when $u$ ($v$) is an ancestor of $v$ ($u$), in which case we replace $u$ ($v$) by $u_{m(u)}$ ($v_{m(v)}$) in the query. All our complexities then hold on $T^*$, which is of size $n = |T^*| = \sum_{u \in T} m(u)$.
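The construction of $T^*$ can be sketched as follows. The sketch only builds the expanded tree exactly as described and does not translate queries. It assumes that $T$ is given as a parent array (with $-1$ for the root) and that every node has at least one label; the names `expand_multilabel`, `bottom` and `top` are hypothetical.

```python
def expand_multilabel(parent, labels):
    """Build T* from T. Returns (parent_star, label_star, bottom, top), where
    bottom[u] and top[u] are the indices in T* of u_1 and u_{m(u)}."""
    parent_star, label_star = [], []
    bottom, top = [], []
    for labs in labels:
        first = len(label_star)
        for offset, lab in enumerate(labs):
            label_star.append(lab)
            # u_i is the only child of u_{i+1}; the last copy is fixed below.
            parent_star.append(first + offset + 1)
        bottom.append(first)
        top.append(first + len(labs) - 1)
    for u, v in enumerate(parent):
        # u_{m(u)} hangs from v_1, where v is the parent of u in T.
        parent_star[top[u]] = bottom[v] if v != -1 else -1
    return parent_star, label_star, bottom, top

# Toy input (assumed): a root with labels a, b and two children.
parent = [-1, 0, 0]
labels = [['a', 'b'], ['c'], ['a', 'c', 'd']]
parent_star, label_star, bottom, top = expand_multilabel(parent, labels)
print(len(label_star))   # size of T*, the sum of m(u) over all nodes
```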

Queries on functions
Gagie et al. [27] consider path queries over a structure more general than trees. Let $f : [1..n] \to [1..n]$ be a function and $\ell : [1..n] \to [1..\sigma]$ an assignment of labels to the domain elements. The function defines a directed graph where nodes $v(i)$ are associated with the domain elements $i$ and the edges lead from $v(i)$ to $v(f(i))$. The general form of these graphs is a set of cycles with trees sprouting from the cycle nodes (arrows point upwards, toward the cycles). We are interested in the so-called "positive path queries": given $i$ and $0 \le k_1 \le k_2$, the path contains all the distinct elements $\{f^k(i),\ k_1 \le k \le k_2\}$. Our tree paths are then a particular case of positive path queries. They consider several queries on the labels of the path, and give general results like the following theorem.

Theorem 16. Let $f : [1..n] \to [1..n]$ be a function and $\ell : [1..n] \to [1..\sigma]$ an assignment of labels to the domain elements. Let there be a tree representation that computes in constant time the mapping between nodes and preorders, ancestor queries, depths of nodes, leftmost leaves of nodes, and lowest common ancestors, and in addition it solves a certain decomposable path query on $n$-node trees with labels in $[1..\sigma]$ in $T(n, \sigma)$ time, using in total $S(n, \sigma)$ bits of space. Then, there exists a data structure using $n \lg n + O(n) + S(n, \sigma)$ bits that answers the same query on the positive paths of $f$ in time $O(\lg n/\lg\lg n) + T(n, \sigma)$. There exists another data structure using $n \lg n (1 + 1/t) + O(n) + S(n, \sigma)$ bits that answers the query in time $O(t) + T(n, \sigma)$, for any $t > 0$.
Our results in this article allow, for the first time, using this result to answer $\tau$-majority queries on the positive paths of functions. Although $\tau$-majority queries are not decomposable (i.e., we cannot answer the query from the results on a partition of the path into subpaths), we can obtain a set of $O(1/\tau)$ candidates, with their frequencies, in each subpath of the partition. Lemma 1 shows that this is sufficient to find all the $\tau$-majorities. In the result of Theorem 16, the query is partitioned into a constant number of subpaths; therefore we can use the result of Theorem 16 as if the query were decomposable. For example, combining it with Theorems 7 and 9, we obtain the following results.

.σ] an assignment of labels to the domain elements. Then, there exists a

data structures use linear space, and even succinct space, whereas our query times are close to optimal, by a factor near log-logarithmic. We also obtained analogous results for path $\tau$-minorities.
Our query time for path τ -majorities and τ -minorities in linear space, O((1/τ ) lg lg w σ), is over the optimal time O(1/τ ) that can be obtained for the analogous range queries on sequences [5]. It is open whether we can obtain optimal time on trees within linear (or even near-linear) space. Another important open problem is how to support insertions and deletions of nodes in T while answering these queries, as achieved on sequences [28].