Abstract 1 Introduction 2 Preliminaries and Problem Formulation 3 Optimal Shifted Encoding 4 Optimal Ordered Encoding 5 Experimental Results References Appendix A Additional Material for Section 3 Appendix B Additional Material for Section 5

The Trie Measure, Revisited

Jarno N. Alanko ORCID University of Helsinki, Finland Ruben Becker ORCID Ca’ Foscari University of Venice, Italy Davide Cenzato ORCID Ca’ Foscari University of Venice, Italy Travis Gagie ORCID Dalhousie University, Halifax, Canada Sung-Hwan Kim ORCID Ca’ Foscari University of Venice, Italy Bojana Kodric ORCID Ca’ Foscari University of Venice, Italy Nicola Prezza ORCID Ca’ Foscari University of Venice, Italy
Abstract

In this paper, we study the following problem: given n subsets S1,,Sn of an integer universe U={0,,u1}, having total cardinality N=i=1n|Si|, find a prefix-free encoding enc:U{0,1}+ minimizing the so-called trie measure, i.e., the total number of edges in the n binary tries 𝒯1,,𝒯n, where 𝒯i is the trie packing the encoded integers {enc(x):xSi}. We first observe that this problem is equivalent to that of merging u sets with the cheapest sequence of binary unions, a problem which in [Ghosh et al., ICDCS 2015] is shown to be NP-hard. Motivated by the hardness of the general problem, we focus on particular families of prefix-free encodings. We start by studying the fixed-length shifted encoding of [Gupta et al., Theoretical Computer Science 2007]. Given a parameter 0a<u, this encoding sends each xU to (x+a)modu, interpreted as a bit-string of logu bits. We develop the first efficient algorithms that find the value of a minimizing the trie measure when this encoding is used. Our two algorithms run in O(u+Nlogu) and O(Nlog2u) time, respectively. We proceed by studying ordered encodings (a.k.a. monotone or alphabetic), and describe an algorithm finding the optimal such encoding in O(N+u3) time. Within the same running time, we show how to compute the best shifted ordered encoding, provably no worse than both the optimal shifted and optimal ordered encodings. We provide implementations of our algorithms and discuss how these encodings perform in practice.

Keywords and phrases:
Succinct data structures, degenerate strings, integer encoding
Funding:
Jarno N. Alanko: Funded by Academy of Finland grant 339070.
Travis Gagie: Funded by NSERC Discovery Grant RGPIN-07185-2020.
Copyright and License:
[Uncaptioned image] © Jarno N. Alanko, Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Bojana Kodric, and Nicola Prezza; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Data compression
; Theory of computation Design and analysis of algorithms ; Theory of computation Dynamic programming ; Theory of computation Discrete optimization
Supplementary Material:
Software: https://github.com/regindex/trie-measure
Funding:
Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Nicola Prezza: Funded by the European Union (ERC, REGINDEX, 101039208). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Related Version:
ArXiv Version: https://arxiv.org/abs/2504.10703
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

Consider the problem of encoding a set of integers SU={0,,u1} (without loss of generality, we assume u to be a power of two), so as to minimize the overall number of bits used to represent S. In their seminal work on data-aware measures, Gupta et al. [10] proposed and analyzed an encoding for sets of integers based on the idea of packing the integers (seen as strings of logu bits) into a binary trie: this allows to avoid storing multiple times shared prefixes among the encodings of the integers. The number of edges in such a trie is known as the trie measure of the set. See Figure 1 for an example.

Figure 1: Example of trie encoding the set of integers {3,4,6}{0,1,,7} over universe of size u=8. Black edges belong to the trie. Gray edges do not belong to the trie and are shown only for completeness. Each integer is encoded using log8=3 bits (logarithms are in base 2). The trie has 8 edges, so the trie measure for this set using the standard integer encoding is 8.

Gupta et al. also showed that this measure approaches worst-case entropy on expectation when the integers are shifted by a uniformly-random quantity modulo u, i.e. when building the trie over the set S+a{x+amodu:xS} for a uniform aU. See Figure 2.

Figure 2: A trie that stores the same set of integers {3,4,6}{0,1,,7} of Figure 1, but with the shifted integer encoding mapping each xU to (the binary string of logu bits) (x+1)mod8. The trie has 6 edges, so the shifted trie measure with shift a=1 is 6.

In this paper, we revisit this problem in two natural directions: (1) we move from one set to a sequence of sets, and (2) we study more general prefix-free integer encodings. More formally, given n subsets S1,,Sn of U, having total cardinality N=i=1n|Si|, we study the problem of finding a prefix-free encoding enc:U{0,1}+ minimizing the trie measure of these sets, i.e. the total number of edges in the n binary tries 𝒯1,,𝒯n, where 𝒯i is the trie packing the encoded integers {enc(x):xSi}. Solving the problem on set sequences finds applications in data structures for rank/select queries on set sequences – also known in the literature as degenerate strings – , an important toolbox in indexes for labeled graphs [8, 6, 5, 7]. Such applications stem from the recent work of Alanko et al. [1], who showed how to solve rank and select queries on set sequences with a data structure – the subset wavelet tree – implicitly encoding sets as tries. While [1] only considered the standard binary (balanced) encoding, their technique works for any prefix-free encoding. As a result, our work can directly be applied to optimize the space of their data structure. Another application is the offline set intersection problem where the goal to preprocess subsets of the universe, so that later given a set of indices of the sets, one can compute the intersection of the sets space-time efficiently. Arroyuelo and Castillo [2] presented an adaptive approach that uses a trie representation of sets and they showed that it can benefit from its space-efficient representation.

We begin by showing that finding the optimal such prefix-free encoding is equivalent to the following natural optimization problem: find the minimum-cost sequence of binary unions that merges u given sets, where the cost of a union is the sum of the two sets’ cardinalities. As this problem was shown to be NP-hard by Ghosh et al. [9], we focus on particular (hopefully easier) families of encodings. We start by studying the shifted encoding of Gupta et al., who did not discuss the problem of finding the optimal value of aU minimizing the corresponding trie measure. We describe an algorithm solving the problem in O(u+Nlogu) time. The algorithm is based on the fact that, under this encoding, the trie measure has a highly periodic sub-structure as a function of the shift a. We then remove the linear dependency on the universe size and achieve running time O(Nlog2u) by storing this periodic structure with a DAG. O(Nlogu) running time is also possible by merging the two ideas, but for space constraints we will describe it in the extended version of this article. We also want to remark that the general prefix-free encoding problem is not only difficult to compute, but also it requires to store the ordering of the elements in addition to the trie representation. By a simple enumeration, it needs O(Nlogu) bits of space where N is the cardinality of the union of the sets (i.e., the number of distinct elements over all sets). On the other hand, the shifted encoding only requires O(logu) bits since it is sufficient to store a single integer aU.

We then move to ordered encodings: here, the encoding must preserve lexicographically the order of the universe U (i.e. the standard integer order). In this case, we show that the textbook solution of Knuth [12] based on dynamic programming, running in O(N+u3)-time, can be adapted to our scenario. We observe that essentially the same solution allows also shifting the universe (like in the shifted encoding discussed above) at no additional cost. As a result, we obtain that the best shifted ordered encoding can be computed in O(N+u3)-time as well. This encoding is never worse than the best shifted and the best ordered encodings.

We conclude our paper with an experimental evaluation of our algorithms on real datasets, showing how these encodings perform with respect to the worst and average-case scenario.

2 Preliminaries and Problem Formulation

A bit-string is a finite sequence of bits, i.e., an element from {0,1}. We count indices from 1, so that for β{0,1}+, β[1] is the first element. With we denote the lexicographic order among bit-strings and with || we denote the length of bit-strings. For two strings α and β of possibly different length, each not being a prefix of the other, we use to denote the (non-commutative) operator defined as αβ=β[j|β|], where j1 is the length of the longest common prefix of α and β, i.e., α[i]=β[i] for 1i<j and α[j]β[j].

We denote with U:={0,1,,u1} the integer universe and we assume that u is a power of two. Logarithms are in base 2, so log(x) indicates log2(x). In particular, logu is an integer. Notation [n] indicates the set {1,2,,n}. For integers <r, we denote with [,r) an interval of integers {,+1,,r2,r1} and if =r, then [,r)= is the empty set. We use double curly braces {{}} to represent multisets. Given a multiset containing subsets of U and an integer aU, the depth of at position a is defined as the number of elements of that contain a. For integer a,b and p with p1, we say apb if and only if amodp=bmodp. For a predicate P, we use 𝟙[P] to denote the indicator function that is 1 if P holds and zero otherwise.

A prefix-free encoding of U is a function enc:U{0,1}+ that satisfies that for no two distinct x,yU, it holds that enc(x) is a prefix of enc(y). For a set SU, we let enc(S):={enc(x):xS} be the set of bitstrings obtained from encoding S via enc. We say that enc is ordered if and only if x<y implies enc(x)enc(y) for all x,yU.

We study the following trie measure, generalized from the particular cases studied in [10]:

Definition 1.

Let enc:U{0,1}+ be a prefix-free encoding and let S={x1,,xm}U such that i<j implies enc(xi)enc(xj) in lexicographic order. We define

trie(enc(S))=|enc(xi1)|+j=2m|enc(xij1)enc(xij)|.

In this article we work, more generally, with sequences of sets. The above definitions generalize naturally:

Definition 2.

Let enc:U{0,1}+ be a prefix-free encoding and let 𝒮=S1,,Sn be a sequence of subsets of U. We define enc(𝒮) to be the sequence enc(S1),,enc(Sn), and:

trie(enc(𝒮))=i=1ntrie(enc(Si))

Throughout the article, we will denote with N=i=1n|Si| the total cardinality of the input sets S1,,Sn. Definition 2 leads to the central problem explored in this paper, tackled in Sections 3 and 4: given a sequence 𝒮=S1,,Sn of n subsets of U, find the encoding enc minimizing trie(enc(𝒮)), possibly focusing on particular sub-classes of prefix-free encodings.

A particularly interesting such sub-class, originally introduced by Gupta et al. [10], is that of shifted encodings:

Definition 3.

Let SU and aU. Notation S+a denotes the application to S of the prefix-free encoding sending each xU to (x+a)modu, interpreted as a bit-string of logu bits. Similarly, for a sequence 𝒮=S1,,Sn of subsets of U we denote by 𝒮+a the sequence S1+a,,Sn+a.

We call trie(𝒮+a) the shifted trie measure. In Section 3 we study the problem of finding the value aU minimizing trie(𝒮+a), a problem left open by Gupta et al. [10].

2.1 An Equivalent Problem Formulation

Observe that our trie-encoding problem is equivalent to the following natural encoding problem, which we will resort to in Section 4. Let 𝒮=S1,,Sn be a sequence of subsets of U. For each xU, denote with Ax={i[n]:xSi} the set collecting all indices i of the subsets Si containing x. Without loss of generality, in this reformulation we assume that Ax for all xU and Si for all i[n] (otherwise, simply re-map integers and ignore empty sets Si). We furthermore do not require u to be a power of two (this will be strictly required only for the optimal shifted encoding in Section 3).

For a given prefix-free encoding enc, let Tenc be the binary trie storing the encodings enc(0),,enc(u1) such that, for every xU, the leaf of Tenc reached by enc(x) is labeled with x. For any binary trie T with leaves labeled by integers let moreover: r(T) be the root of T, Tv denote the subtree of T rooted at node v, and L(Tv) be the set of (integers labeling the) leaves of subtree Tv (i.e. the leaves below node v). We overload notation and identify with T also the set of T’s nodes (the use will be always clear by the context). Observe that the definition of the sets Ax allows us to reformulate the measure trie(enc(𝒮)) as follows:

trie(enc(𝒮))=i[n]trie(enc(Si))=vTenc{r(Tenc)}|xL(Tvenc)Ax|.

In other words, the cost of node v is equal to the cardinality of the union of the sets Ax1,,Axt corresponding to the leaves x1,,xt below v. The overall cost of the tree is the sum of the costs of its nodes, excluding the root. To see why this formulation is equivalent to the previous one observe that, among the n tries for enc(S1),,enc(Sn), (a copy of) the incoming edge of v is present in the trie for enc(Si) for all ixL(Tv)Ax. See Figure 3.

U={0,1,2,3}

S1={1,2}

S2={0,1}

S3={1,2,3}

Figure 3: Left. An ordered prefix-free encoding enc of the universe U={0,1,2,3}, represented as a binary trie Tenc. This encoding is used (right part) to encode sets S1={1,2},S2={0,1},S3={1,2,3}. The corresponding sets Ax are: A0={2},A1={1,2,3},A2={1,3},A3={3}. Highlighted in red, an edge leading to a node v with xTvencAx=A2A3={1,3}, meaning that the tries for (the encoded) S1 and S3 will contain a copy of the same edge. Right. Using the prefix-free encoding enc to encode S1,S2,S3 by packing their codes into three tries (gray edges do not belong to the tries and are shown only for completeness). The tries for S1,S2,S3 contain in total 12 edges, so trie(enc(S1,S2,S3))=12. In red: the two copies of the red edge on the left part of the figure, highlighting the equivalence of the two formulations of our trie-encoding problem. As a matter of fact, this is an optimal ordered code.

As there is a one-to-one relation between the set of all prefix-free binary encodings of the integers U and the set of all binary trees with leaves U, we can conclude that the problem of finding a prefix-free encoding enc that minimizes trie(enc(𝒮)) can equivalently be seen as the problem of finding a binary tree T with u leaves (labeled with the universe elements 0,,u1, not necessarily in this order) that minimizes the cost function c(T):=vT{r(T)}|xL(Tv)Ax|. We note that this problem is similar to the standard problems of (i) finding optimal binary search trees on a set of keys with given frequencies and (ii) finding the optimal prefix-free encoding for a source of symbols with given frequencies. As a matter of fact, if the sets A0,,Au1 are disjoint then our problem is equivalent to (i-ii) and Huffman’s algorithm finds the optimal solution.

Observe that this reformulation of the problem is equivalent, in turn, to the following optimization problem: given u sets A0,,Au1, find the minimum-cost sequence of (binary) set unions that merges all sets into i=0u1Ai, where the cost of merging sets A and A is |A|+|A|. Ghosh et al. in [9] proved this problem to be NP-hard, and provided tight approximations. This motivates us to study less general families of prefix-free encodings, which hopefully can be optimized more efficiently.

3 Optimal Shifted Encoding

Given a sequence 𝒮=S1,,Sn of n subsets of U, in this section we study the problem of finding an optimal shift aU that minimizes trie(𝒮+a). After finding a useful reformulation of trie(𝒮+a), we describe an algorithm for finding an optimal shift a. The algorithm is parameterized on an abstract data structure for integer sequences. Using a simple array, we obtain an O(u+Nlogu)-time algorithm. A DAG-compressed segment tree, on the other hand, gives an O(Nlog2u)-time algorithm.

3.1 Trie measure as a function of the shift 𝒂

Let S={x1,x2,,xm}U be a non-empty integer set with 0x1<x2<<xm<u. We assume enc(x) is the standard (logu)-bit binary representation of x throughout this section. Observe that for every non-negative integers 0x<y<u, it holds that

|enc(x)enc(y)|=maxj[x,y)|enc(j)enc(j+1)|. (1)

Consider the (infinite) sequence cjj0 with cj=|enc(jmodu)enc((j+1)modu)|. This sequence is the infinite copy of the first u/2 elements (see the first row in Figure 4) of the sequence known as “ruler function” (OEIS sequence A001511111https://oeis.org/A001511). This sequence can be decomposed into the sum of logu periodic binary sequences. For integers k[logu] and j0, let us define cj(k) as follows.

cj(k)={1 if j+1 is a multiple of 2k1,0 otherwise.

Then, for every j0, it holds that cj=k=1logucj(k). Together with (1), we obtain

|enc(x)enc(y)|=maxj[x,y)k=1logucj(k)=k=1logumaxj[x,y)cj(k), (2)
Figure 4: The nodes in the trie of S={x1,x2,x3,x4}={2,4,10,13} in universe u=16. The i-th box on each row spans range [xi,xi+1), with x5:=x1+u. The number of edges in the trie of the set is equal to the number of shaded boxes that contain at least one 1-bit. The shaded boxes correspond to the edges with the same color.

where we used that the maximum and the sum are interchangeable since, for every k>1, if cj(k)=1 then cj(k1)=1 by definition.

Recalling that the value of cj(k) is either 0 or 1, this representation allows us to represent the cost (i.e. the number of edges) of a trie for S={x1,,xm} as follows. Consider the logu binary sequences (cj(k))j0 for k[logu]. For each sequence, and for every 1im, consider a range [xi,xi+1) of indices of the sequence (cj(k))j0 where we define xm+1=x1+u. For each xi+1 with i[m1], observe that adding xi+1 to the trie containing the integers {x1,x2,,xi} creates a new edge at level k if and only if maxj[xi,xi+1)cj(k)=1. Notice also that x1 always creates logu edges forming the left-most path of the trie; at the same time, it always holds that maxj[xm,xm+1)cj(k)=1 for all k[logu] because cu1(k)=1 for every k[logu]. See Figure 4 for an example. More generally, the cost of a trie shifted by a with a0 can be represented as in the following lemma whose proof is deferred to Appendix A.1.

Lemma 4.

Let S={x1,,xm} be a set of m integers with 0x1<<xm<u. Let us define xm+1:=x1+u. For every aU, it holds that

trie(S+a)=k=1logui=1mmaxj[xi+a,xi+1+a)cj(k)

Lemma 4 allows us to compute trie(S+a) by summing over k[logu] the number of the shifted ranges [xi+a,xi+1+a) (for i[m]) such that maxj[xi+a,xi+1+a)cj(k)=1; see Figure 5 for an example. The following lemmata give an illustration of which values of a involve a cost of 1 with a range [xi,xi+1).

Figure 5: Representation of trie(S+a) based on cj(k) for S={2,4,10,13} at level k=3 and u=16. Each row represents cj+a(3), for aU. Boxes indicate ranges [xi,xi+1) and red boxes contain at least one 1. As an example, consider the pair x1,x2 (leftmost boxes). Among the shifts 0,,4, the only shifts for which we do not pay the cost of k=3 for this pair are a=2 and a=3; i.e., x1+a and x2+a share an edge at level 3 for a=2,3 in trie(S+a). Hence, in the figure we have two non-red boxes in the rows corresponding to a=2,3.
Lemma 5.

Let k[logu], and x,y[0,2u) with x+2k1y. Then it holds for every aU that maxj[x+a,y+a)cj(k)=1.

Proof.

Immediate from that any interval of length 2k1 contains a multiple of 2k1.

Lemma 6.

For aU, k[logu], and x,y[0,2u) with x<y, it holds that maxj[x+a,y+a)cj(k)=1 if and only if a2k1b for some b[2uy,2ux).

Proof.

Recall that max{cj(k):j[x+a,y+a)}=1 if and only if there exists j[x+a,y+a) such that j+1 is a multiple of 2k1 by definition of cj(k). Assume that j[x+a,y+a) is such that j+1 is a multiple of 2k1. Equivalently, j=t2k11 for some integer t and x+a<t2k1y+a. The latter is equivalent to a[t2k1y,t2k1x). This is then equivalent to a2k1b for some b[y,x). Using that 2u is a multiple of 2k1 and k[logu], this is in turn equivalent to a2k1b for some b[2uy,2ux).

Considering maxj[xi+a,xi+1+a)cj(k) as a function of a, observe that it has a period of 2k1 because of the periodicity of cj(k). This means that we do not need to consider all the values of aU but we can consider only a[0,2k1). For each pair of consecutive elements xi and xi+1 for i[m], let us define Ik(xi,xi+1) as:

Ik(xi,xi+1)={[0,2k1) if xi+1xi2k1,[,r) if <r,[0,r)[,2k1) otherwise. (3)

where :=(2uxi+1)mod2k1, and r:=(2uxi)mod2k1. Note that Ik(xi,xi+1) can be represented with one or two intervals.

Let k be the multiset k={{Ik(xi,xi+1):i[m]}}. Consider the number of edges of the trie at level k for the shifted set S+a for a specific point aU. By the inner sum of Lemma 4 along with Lemma 5 and Lemma 6, this number is equal to the number of members of k containing a specific point amod2k1, i.e., to the depth of k at position amod2k1. In the following lemma, we show that trie(S+a) can be expressed as the sum over k[logu] of the depth of k at position amod2k1.

Lemma 7.

Let S={x1,x2,,xm} be a set of integers with 0x1<x2<<xm<u. Let xm+1:=x1+u and aU. Then,

trie(S+a)=k=1logui=1m𝟙[(amod2k1)Ik(xi,xi+1)].
Proof.

According to Lemma 4, we have trie(S+a)=k=1logui=1mmaxj[xi+a,xi+1+a)cj(k). Hence it is sufficient to show that amod2k1Ik(xi,xi+1) iff maxj[xi+a,xi+1+a)cj(k)=1. We distinguish two cases: (Case 1) If xi+1xi2k1, we have Ik(xi,xi+1)=[0,2k1) and thus clearly amod2k1Ik(xi,xi+1). At the same time maxj[xi+a,xi+1+a)cj(k)=1 by Lemma 5 for any aU. (Case 2) Assume xi+1xi<2k1. Let =(2uxi+1)mod2k1 and r=(2uxi)mod2k1 as in Eq. (3). By Lemma 6, it holds that maxj[xi+a,xi+1+a)cj(k)=1 if and only if a2k1b for some b[2uxi+1,2uxi). The latter is equivalent to:

a2k1b for some b[,+xi+1xi) (4)

We have two cases. (Case 2a) Assume that <r holds. Since xi+1xi<2k1, <r if and only if +xi+1xi<2k1. Observing that r=(+xi+1xi)mod2k1, we have +xi+1xi=r. (Case 2b) Now assume that r. Then Eq. (4) is equivalent to a2k1b with b[,2k1) or b[2k1,r+2k1), which can be rewritten as a2k1b with b[0,r) or b[,2k1). This completes the proof. Lemma 7 can be naturally generalized to a sequence of sets 𝒮+a=S1+a,,Sn+a. Together with Definition 2, we obtain:

trie(𝒮+a)=i=1ntrie(Si+a)=k=1logui=1nj=1|Si|𝟙[(amod2k1)Ik(xj(i),xj+1(i))]. (5)

where xj(i) is the j-th smallest element of Si. In the following subsection, we develop algorithms to find an optimal shift a[0,u) minimizing trie(𝒮+a) using this formulation.

3.2 Algorithms for the optimal shift

Let S1,,SnU be n sets of integers. For i[n] and j[|Si|], let xj(i) denote the j-th smallest element of Si. For k[logu], let Dk[0..2k1) be a sequence of length 2k1 such that, for a[0,2k1),

Dk[a]=i=1nj=1|Si|𝟙[(amod2k1)Ik(xj(i),xj+1(i))]. (6)

Observe that Dk[a] is defined as the two inner sums of Eq. (5); in other words, Dk[a] is the number of interval segments at level k that contain a specific position a[0,2k1). To consider the cumulative sum of the number of interval segments that contain position a up to level k[logu], let Ck be a sequence of length 2k that, for a[0,2k), is defined as

Ck[a]=k=1kDk[amod2k1]. (7)

Then, by Equations (5) and (6) it holds that trie(𝒮+a)=Clogu[a] for every aU. Therefore, if we can compute Ck in an efficient way, this will allow us to find an optimal shift by computing argminaUClogu[a]. To do this, now consider an abstract data type 𝒟 that supports the following four operations:

  1. 1.

    𝒟.𝚒𝚗𝚒𝚝𝚒𝚊𝚕𝚒𝚣𝚎(): create a sequence A of length 1 and initialize it as A[0]0.

  2. 2.

    𝒟.𝚊𝚍𝚍(,r): update A[a]A[a]+1 for a[,r).

  3. 3.

    𝒟.𝚎𝚡𝚝𝚎𝚗𝚍(): duplicate the sequence as AAA.

  4. 4.

    𝒟.𝚊𝚛𝚐𝚖𝚒𝚗(): return argmin0a<|A|A[a].

Algorithm 1 General algorithm for finding an optimal shift of S1,,Sn.

Based on these operations, in Algorithm 1 we describe a general procedure to find an optimal shift. For each level k[logu] and every set Si, we iterate on every pair of consecutive elements xj(i),xj+1(i). In Line 5, we use the fact that Ik() can be represented as the union of at most two intervals on U, see Eq. (3). We increment A[a] by 1 for every a[,r) by calling 𝒟.𝚊𝚍𝚍(,r) (Line 6). After processing all consecutive pairs, we extend the universe into [0,2k) (Line 7). To see that the algorithm is correct, observe that the amount by which A[a] is incremented in the three inner loops equals Dk[a] according to Eq. (6). Eq. (7) yields that, for k[logu] and a[0,2k1), the number Ck can be written recursively as

Ck[a+2k1]=Ck[a]=Ck1[a]+Dk[a], (8)

where we define C0:=0 as the base case. Then, at the end of the k-th iteration of the outer for loop (after Line 7), the sequence A represented by 𝒟 is exactly Ck. The running time depends on how 𝒟 is implemented. In the following two subsections, we will present two different ways of implementing the data structure 𝒟. These allow us to find an optimal shift in O(u+Nlogu) with simple arrays and O(Nlog2u) time with a dynamic DAG structure.

3.2.1 𝑶(𝒖+𝑵𝐥𝐨𝐠𝒖)-time algorithm

Our first solution is simple: we represent A implicitly by storing an array Δ[0..|A|1] of length |A|, encoding the differences between adjacent elements of A. At the beginning, Δ is initialized as an array of length 1 containing the integer 0. Operation 𝒟.𝚊𝚍𝚍(,r) is implementing by incrementing Δ[] by 1 unit and (if r<|Δ|) decrementing Δ[r] by 1 unit, in O(1) time (Lines 45 of Algorithm 3 in Appendix A.2). As far as operations 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍() and 𝒟.𝚊𝚛𝚐𝚖𝚒𝚗() are concerned, they can be supported naively in linear O(|Δ|)=O(|A|) time with a constant number of scans of array Δ. See Algorithm 3 for the details.

Next, we analyze the running time of Algorithm 1 when using this simple implementation of 𝒟. Operation 𝒟.𝚊𝚍𝚍() is called O(Nlogu) times, each of which costs O(1) time. Operation 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍() is called logu times, and each call costs O(|A|) time. Each time this operation is called, the length of A doubles (starting from |A|=1). The overall cost of these calls is therefore O(1+2+4++u)=O(u). Finally, 𝒟.𝚊𝚛𝚐𝚖𝚒𝚗() is called only once when |A|=u, and therefore it costs O(u) time. We conclude that, using this implementation of 𝒟, Algorithm 1 runs in O(u+Nlogu) time. Within this running time we actually obtain a much more general result, since we can evaluate any entry of A:

Theorem 8.

Given a sequence 𝒮=S1,,Sn of n subsets of U, we can compute trie(𝒮+a) for all aU in total O(u+Nlogu) time.

3.2.2 𝑶(𝑵𝐥𝐨𝐠𝟐𝒖)-time algorithm

If u is too large compared to N, then the simple solution presented in the previous subsection is not efficient. In this case, we instead represent A with a DAG-compressed variant of the segment tree [4, 11]; see Algorithm 4 in Appendix A.3 for the details.

𝒟 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍() 𝒟.𝚊𝚍𝚍(3,6)
Figure 6: DAG-compressed representation of A. (Left) Each element A[i] is represented as the sum of the values stored in its corresponding root-to-leaf path. The two children of va and the left child of vb are represented by a single node vc. (Middle) Duplicating the whole content can be performed by creating a new root (vd) with the old root (ve) being referred as both of its left and right children. (Right) Incrementing A[i] by 1 for each i[3,6)=[3,5] is performed by incrementing the values in vg and vh by 1 each, which covers [3,3] and [4,5], respectively. We duplicate every node that has more than one incoming edges when it is visited so that the visited nodes (indicated with red nodes) should have unique paths from the root.

Intuitively, consider the complete binary tree of height log|A| where the root stores a counter associated to the whole array A, its left/right children store a counter associated to A[0,|A|/21] and A[|A|/2,|A|1], respectively, and so on recursively (up to the leaves, which cover individual values of the array). Assume that these counters are initialized to 0. Observe that for every 0<r|A|, there exists the coarsest partition of A[,r1] into O(log|A|) disjoint intervals covered by nodes of the tree. The idea is to support 𝒟.𝚊𝚍𝚍(,r) by incrementing by 1 unit the counter associated to those nodes. To avoid spending time O(|A|) for operation 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍(), however, we cannot afford duplicating the whole tree when this operation is called. In this case, our idea is simple: since 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍() duplicates the whole content of A, in O(1) time we simply create a new node covering AA and make both its two outgoing edges (left/right child) lazily point to the node covering A, i.e., the old root of the tree. In other words, we represented the tree as a DAG, collapsing nodes covering identical sub-arrays. This modification makes it necessary to (possibly) duplicate at most O(log|A|) nodes at each call of 𝒟.𝚊𝚍𝚍(,r): a duplication happens when, starting from the root, we reach a node x having more than one incoming edges. Without loss of generality, assume that we are moving from y that covers sub-array A[i,i+2k+11] to its left child. Since x has more than one incoming edges, A[i,i+2k1] is not the unique interval that x covers. We create a new node x by duplicating x, then make x the left child of y so that A[i,i+2k1] is the unique interval that x covers. Then proceed to x. Since A[,r1] is covered by O(log|A|) nodes, starting this procedure at the root duplicates at most O(log|A|) nodes of the DAG and operation 𝒟.𝚊𝚍𝚍(,r) therefore costs O(log|A|)O(logu) time. This slow-down (with respect to the O(1) cost of the same operation in the previous subsection) is paid off by the fact that we do not need to create O(|A|) new nodes (i.e. duplicate the tree) at each call of 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍(). With this representation, the operation 𝒟.𝚊𝚛𝚐𝚖𝚒𝚗() can be implemented in O(1) time as well by simply associating to each node the smallest value in the sub-array of A covered by that node, as well as its index (these values are updated inside 𝒟.𝚊𝚍𝚍(,r)). Operations 𝒟.𝚒𝚗𝚒𝚝𝚒𝚊𝚕𝚒𝚣𝚎() and 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍() trivially take O(1) time.

Plugging this structure into Algorithm 1, observe that the running time is dominated by 𝒟.𝚊𝚍𝚍(), which is called O(Nlogu) times. Since each call to this function takes O(logu) time as discussed above, we finally obtain:

Theorem 9.

Given a sequence 𝒮=S1,,Sn of n subsets of U, we can compute one value a minimizing trie(𝒮+a) in O(Nlog2u) time.

Finally, we remark that using more elaborate arguments one can obtain an algorithm running in O(Nlogu) time. Due to space limitations, we refrain from detailing this approach here, but will include it in the extended version of the article.

4 Optimal Ordered Encoding

To compute the optimal ordered encoding, we employ the equivalent problem formulation of Section 2.1. For any xU, let Ax={i[n]:xSi}. Recall that we can assume, w.l.o.g., that Ax for all xU and Si for all i[n]. Our goal is to find the binary tree T with leaves 0,,u1 (in this order) minimizing c(T):=vT{r(T)}|xL(Tv)Ax|. For a binary tree T, let T and Tr be the left and right sub-tree of the root of T, respectively. We show that the dynamic programming solution of Knuth [12] can be applied to our scenario. Consider the following alternative cost function: d(T):=|xL(T)Ax|+d(T)+d(Tr). The (recursive) function d(T) is related to c(T) by the following equality: c(T)=d(T)|xL(T)Ax|. It is not hard to turn the definition of d(T) into a set of dynamic programming formulas computing both the best tree T and its cost d(T) by creating, for every 0xy<u, a variable dx,y whose final value will be minTd(T) (where T runs over all trees such that L(T)={x,x+1,,y}), and a pre-computed constant ax,y=|t=xyAt|. Given the constants ax,y, the following set of dynamic programming formulas finds (bottom up, i.e., by increasing yx) the value c(T)=d(T)|xL(T)Ax|=d0,u1a0,u1 for the optimal tree T and, by standard backtracking, the tree T itself:

initialization:ax,y :=|t=xyAt| for all 0xyu1
dx,x :=ax,x for all xU
recursion:dx,y :=ax,y+minx<zydx,z1+dz,y for all 0x<yu1

The optimal tree’s topology can be retrieved by backtracking: if L(T)={x,x+1,,y}, then the number of leaves in T is equal to (argminx<zydx,z1+dz,y)i (we start from the root of T with x=0 and y=u1). Below, we show how to compute constants ax,y in O(N+u2) time. Since the above formulas can be evaluated bottom-up (i.e. by increasing yx) in O(u3) time222In his work [12], Knuth shows how to evalutate these recursive formulas to O(u2) time. In the simpler case considered in his article (corresponding to our problem with pairwise-disjoint A0,,Au1), one can bound argminx<zydx,z1+dz,y so that the overall cost of looking for all those optimal values of z amortizes to u2. Unfortunately, in our scenario we haven’t been able to prove that the same technique can still be applied., the overall running time of the algorithm is O(N+u3).

We now show how to compute efficiently the constants ax,y. We describe an overview of the algorithm (see Algorithm 2 for the pseudocode). We add to the sequence two sets A1=Au=[n], respectively at the beginning and end: the new sequence becomes A1,A0,,Au1,Au. The main idea behind the algorithm is to initially compute inside ax,y the number of elements from [n] that are missing from |t=ijAt|. Then, the substitution ax,ynax,y will yield the final result.

Algorithm 2 Compute ax,y.

We initialize ax,y0 for all 0x,yu. Let Ax+1,Ax+2,,Ay1 be a maximal contiguous subsequence of sets not containing a given i[n], that is, (i) iAx for all x=x+1,x+2,,y1, (ii) iAx, and (iii) iAy. Then, i|t=xyAt| for every ixyy. Imagine then adding one unit to ax,y for all (x,y)[x+1,y1]×[x+1,y1]. Clearly, if we can achieve this for every i[n] and every such maximal contiguous subsequence of sets not containing i, at the end each ax,y will contain precisely the value n|t=xyAt|. The issue is, of course, that adding one unit to ax,y for all x<xy<y, costs time O(xy) if done naively. The crucial observation is that this task can actually be performed in constant time using (bidimensional) partial sums: see Figure 7 for an example. Ultimately, this trick allows us computing all ax,y in the claimed O(N+u2) running time.

012301010101012101030101
012300110100112011030011
012300000101102012130011
Figure 7: Matrix ax,y. Suppose our goal is to add 1 unit to the two sub-matrices with indices [1,2]×[1,2] and [2,3]×[2,3]. We show how to achieve this by performing 4 updates for each of these two submatrices, and then performing two scans (partial sums) of total cost O(u2) over the full matrix. Letting t be the number of sub-matrices to update (in this example, t=2), this means that we can perform the t updates in total O(t+u2) time. Assume we start by a matrix ax,y containing only zeros. To update the sub-matrix with indices [x,y]×[x,y], we perform these 4 operations: (1) ay,yay,y+1, (2) ax1,x1ax1,y1+1, (3) ax1,yax1,y1, and (4) ay,x1ay,x11. Left: applying these four operations for each of the two sub-matrices [1,2]×[1,2] and [2,3]×[2,3]. Center: we compute partial sums row-wise, cumulating from right to left. Right: we compute partial sums column-wise, cumulating bottom-up. At the end, we correctly added 1 unit to the target sub-matrices.

4.1 Optimal Shifted Ordered Encoding

Observe that the optimal ordered encoding is not necessarily better than the optimal shifted encoding, because shifted encodings are not ordered (except the case a=0). It is actually very easy to get the best of both worlds and compute the ordered encoding enc:{0,1}{0,1} minimizing minaUtrie(enc(𝒮+a)). This encoding is guaranteed to be no worse than both the best shifted encoding and the best ordered encoding.

The solution is a straightforward extension of the technique used for the best ordered encoding. Let A0,,Au1 be the sets defined previously. Build the sequence of sets A0,,Au1,Au,,A2u1, where Au+x=Ai for all 0x<u (that is, we simply create an extra copy of the sets). Compute ax,y and dx,y as discussed previously, with the only difference that now the domain of x,y is doubled: 0x,y<2u. It is not hard to see that the optimal shifted ordered encoding has then cost c(T)=min0a<u(da,a+u1+aa,a+u1)=(min0a<uda,a+u1)+a0,u1. Again, the tree topology is obtained by backtracking starting from the optimal interval [a,a+u) given by a=argmin0a<uda,a+u1 (the optimal ordered encoding discussed previously is simply the particular case a=0). The asymptotic cost of computing this optimal shifted ordered encoding is still O(N+u3).

5 Experimental Results

We implemented our algorithms for computing the optimal shifted encoding and the optimal shifted ordered encoding in C++ and made them available at https://github.com/regindex/trie-measure. We computed the sizes of these encodings on thirteen datasets, the details of which can be found in Table 1 in Appendix 5. The datasets can be naturally split in four groups according to their origin: 1) eight sequences of sets from [3], 2) a dataset of paper tags from dblp.org xml dump [14], 3) a dataset containing amino acid sets from two protein sequence collections [16, 13], 4) and two datasets containing ratings and tags of 10000 popular books [15]. For the last three groups, we generated sequences of sets (i.e. the inputs of our algorithms) by extracting sets of features from the original datasets: in 2) we extracted sets of tags for all dblp entries, in 3) we extracted the set of amino acids contained in each protein sequence, and in the two datasets of group 4) we extracted a set of tags and ratings for each book, respectively. The repository above contains all the generated set sequences.

As far as the shifted trie measure trie(𝒮+a) of Section 3 is concerned, we evaluated it on the above datasets for all shifts aU and reported the following statistics in Table 2: the trie cost for the optimal and worst shifts, the average cost over all shifts, and the percentage differences between the average/worst shifts and the optimal shift. Our results indicate that the optimal, average, and worst shifts lead in practice to similar costs. In particular, only five datasets showed a difference larger than 5% between the optimal and worst shift (opt-shift/worst-shift(%) < 95). Among these, the differences range between 15.85% for DBLP.xml and 6.01% for tags-math-sx-seqs. The differences are even smaller when comparing the optimum with the average shift. In this case, only DBLP.xml shows an average difference (of 6.95%) being larger than 5% (opt-shift/avg-shift(%) < 95), meaning that the trie measure computed with a random shift on this dataset is, on average, 6.95% larger than with the optimal shift. These results suggest that data structures which encode integer sets as tries using the standard binary integer encoding (such as the subset wavelet tree of Alanko et al. [1]) are often efficient in practice since even arbitrary shifts allow to obtain an encoding being not far from the optimal shifted encoding.

As far as the optimal shifted ordered encoding of Section 4.1 is concerned, the last two columns of Table 2 show the size of this encoding for the ten datasets on which we could run our cubic dynamic programming algorithm. The results indicate that the size of this encoding tends to be significantly smaller than the optimal shifted encoding discussed in the previous paragraph: seven datasets showed a percentage difference between the optimal shifted encoding and the optimal shifted ordered encoding being larger than 10%, with a peak of 27.31% on tags-math-sx-seqs.

We leave it as an open question whether it is possible to compute the optimal shifted ordered encoding in sub-cubic time as a function of u. Another interesting research direction is to design fast heuristic (e.g. ILP formulations) for computing the globally-optimal prefix free encoding (NP-hard to compute).

References

  • [1] Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi. Subset wavelet trees. In Loukas Georgiadis, editor, 21st International Symposium on Experimental Algorithms, SEA 2023, July 24-26, 2023, Barcelona, Spain, volume 265 of LIPIcs, pages 4:1–4:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.SEA.2023.4.
  • [2] Diego Arroyuelo and Juan Pablo Castillo. Trie-Compressed Adaptive Set Intersection. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023), volume 259 of Leibniz International Proceedings in Informatics (LIPIcs), pages 1:1–1:19, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.CPM.2023.1.
  • [3] Austin R. Benson, Ravi Kumar, and Andrew Tomkins. Sequences of sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1148–1157, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3219819.3220100.
  • [4] Jon Louis Bentley and Derick Wood. An Optimal Worst Case Algorithm for Reporting Intersections of Rectangles. IEEE Transactions on Computers, C29(7):571–577, 1980. doi:10.1109/TC.1980.1675628.
  • [5] Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn Graphs. In Ben Raphael and Jijun Tang, editors, Algorithms in Bioinformatics, pages 225–235, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. doi:10.1007/978-3-642-33122-0_18.
  • [6] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - part i. J. ACM, 70(4), August 2023. doi:10.1145/3607471.
  • [7] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1), November 2009. doi:10.1145/1613676.1613680.
  • [8] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for bwt-based data structures. Theoretical Computer Science, 698:67–78, 2017. Algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo). doi:10.1016/j.tcs.2017.06.016.
  • [9] Mainak Ghosh, Indranil Gupta, Shalmoli Gupta, and Nirman Kumar. Fast compaction algorithms for NoSQL databases. In 2015 IEEE 35th International Conference on Distributed Computing Systems, pages 452–461, 2015. doi:10.1109/ICDCS.2015.53.
  • [10] Ankur Gupta, Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Compressed data structures: Dictionaries and data-aware measures. Theoretical Computer Science, 387(3):313–331, 2007. doi:10.1016/j.tcs.2007.07.042.
  • [11] Nick Kline and Richard T. Snodgrass. Computing temporal aggregates. In Proceedings of the 11th International Conferencec on Data Engineering, pages 222–231, 1995. doi:10.1109/ICDE.1995.380389.
  • [12] Donald E. Knuth. Optimum binary search trees. Acta informatica, 1:14–25, 1971. doi:10.1007/BF00264289.
  • [13] AFproject high-id proteins dataset. Accessed: 2024-11-15. URL: https://afproject.org/app/benchmark/protein/high-ident/dataset/.
  • [14] DBLP XML dump. Accessed: 2024-11-15. URL: https://dblp.org/xml/.
  • [15] goodbooks-10k dataset. Accessed: 2024-11-15. URL: https://github.com/zygmuntz/goodbooks-10k.
  • [16] UniProt database. Accessed: 2024-11-15. URL: https://www.uniprot.org/uniprotkb.

Appendix A Additional Material for Section 3

A.1 Proof of Lemma 4

See 4

Proof.

Let aU be arbitrary. We divide S into S<={xS:x+a<u} and S={xu:xS and x+au} to consider the effect of computing modulo u. We distinguish three cases.

First, consider the case where S=. Observe that cu1(k)=1 for every k[logu] by definition and the assumption that u is a power of 2. Therefore we have

logu=k=1logu1=k=1logucu1(k). (9)

Furthermore, we have cu1(k)=maxj[xm,xm+1)cj(k)=max[xm+a,xm+1+a)cj(k), using the assumption that xm+a<ux1+u=xm+1 as S=. From Definition 1, Equations (2) and (9), we then obtain

trie(S+a) =logu+k=1logui=1m1maxj[xi+a,xi+1+a)cj(k)
=k=1logu(maxj[xm+a,xm+1+a)cj(k)+i=1m1maxj[xi+a,xi+1+a)cj(k))
=k=1logui=1mmaxj[xi+a,xi+1+a)cj(k). (10)

The claim follows analogously when S<=, since then cj(k)=cj+u(k) for every k[logu] and every j0.

Now we assume that both S< and S are not empty. Observe that x1S< because it is the smallest element in S and S< is not empty. Recalling that xm+1:=x1+u, we can observe that the trie for (S{xm+1u})+a and the trie for S<+a share exactly one path from the root to x1+a. Therefore,

trie(S+a)=trie((S{xm+1u})+a)+trie(S<+a)logu. (11)

Let i[2,m] be the integer such that xi=minS. Then S<={x1,,xi1} and S{xm+1}={xiu,xi+1u,,xm+1u}. Since 0x1+a<uxi+a, for every k[logu] it holds that maxj[x1+a,xi+a)cj(k)=cu1(k)=1. Therefore we have

trie(S<+a) =logu+k=1logui=1i2maxj[xi+a,xi+1+a)cj(k)=k=1logui=1i1maxj[xi+a,xi+1+a)cj(k)

Applying this with Equation (10) to Equation (11), we obtain

trie(S+a) =k=1logu(i=im+1maxj[xi+a,xi+1+a)cj(k)+i=1i1maxj[xi+a,xi+1+a)cj(k))logu
=k=1logui=1mmaxj[xi+a,xi+1+a)cj(k)+k=1logumaxj[xm+1+a,xm+2+a)cj(k)logu.

where we define xm+2:=xi+u.

Now note that xm+1+a=x1+a+u<2u=u+uxi+a+u=xm+2+a and thus xm+1+a2u1<xm+2+a. Since c2u1(k)=1 by definition, it follows that the second maximum is always 1. Hence the second term equals logu, which is canceled with the last term, and thus the claim follows.

A.2 Implementation of 𝓓 for the 𝑶(𝒖+𝑵𝐥𝐨𝐠𝒖)-time algorithm

All functions implementing 𝒟 besides the 𝚎𝚡𝚝𝚎𝚗𝚍 function are self-explanatory. Thus, we proceed with a detailed explanation of 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍(), which corresponds to duplicating the sequence A to A=AA. Assume that A is of length i, i.e., contains elements A[0],,A[i1] and recall that Δ at every point has to encode A via differences, i.e., Δ[0]=A[0] and Δ[a]=A[a]A[a1] for any a[1,i). Hence, when duplicating the sequence A, we can simply duplicate Δ to Δ=ΔΔ but have to change the value at the “boundary”, i.e., Δ[i]. This value has to become A[i]A[i1]=A[0]A[i1]. We can simply obtain A[0] from Δ[0] and we can reconstruct A[i1] as a=0i1Δ[a].

Algorithm 3 Implementation of 𝒟 for the O(u+Nlogu)-time algorithm.

A.3 Implementation of 𝓓 for the 𝑶(𝑵𝐥𝐨𝐠𝟐𝒖)-time algorithm

The pseudocode implementing the interface of the DAG-compressed variant of dynamic segment trees that we described in Section 3.2.2 is given in Algorithm 4. Algorithm 5 describes instead the private functions called by Algorithm 4. In what follows, we give all implementation details.

Algorithm 4 Implementation of 𝒟 for the O(Nlog2u)-time algorithm.

The structure of the DAG-compressed tree is as follows. It is built in logu iterations. At each iteration k[logu], the height of the DAG (i.e. the height of the tree resulting from the expansion of the DAG) is k and it is equal to the number of nodes on the path from the root to the leaves. At any point, the expansion of the DAG we are building is a complete binary tree of height k. We store this height k in a variable 𝒟.𝚑𝚎𝚒𝚐𝚑𝚝. The root node is stored in 𝒟.𝚛𝚘𝚘𝚝, and it covers the range [0,2k1). We say that the height of the leaves in the tree is 1, parents of leaves are at height 2, and so on. A node v at height h1 covers a range [x,x+2h1) for some integer x. Such a node v may have zero or two children. If it has two children, the left child covers [x,x+2h2) and the right child covers [x+2h2,x+2h1), respectively. Note that in the pseudocodes, neither the value of x nor the value of h are stored in the node v explicitly, but they can be reconstructed while navigating the tree.

Node v, corresponding to the subsequence A[x,x+2h11], has four variables v.𝚟𝚊𝚕, v.𝚊𝚛𝚐𝚖𝚒𝚗, v.𝚖𝚒𝚗, and v.𝚛𝚎𝚏, in addition to the pointers to its left and right child v.𝚕𝚎𝚏𝚝 and v.𝚛𝚒𝚐𝚑𝚝. Recall that the purpose of 𝒟 is to maintain an integer array A, increment by one unit all entries belonging to range A[,r1] via operation 𝒟.𝚊𝚍𝚍(,r), and duplicate the array via operation 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍(). For a node v that corresponds to a range A[x,x+2h11], we store those increments that apply to the whole range in a variable v.𝚟𝚊𝚕. The variable v.𝚊𝚛𝚐𝚖𝚒𝚗 stores argmina[0,2h1)A[x+a], while v.𝚖𝚒𝚗 stores mina[0,2h1)A[x+a].

Finally, v.𝚛𝚎𝚏 stores the number of pointers of other nodes to the node v. The role of this reference counter is to duplicate nodes only when necessary (more details are given below). This reference counter v.𝚛𝚎𝚏 is managed in function 𝚛𝚎𝚊𝚕𝚕𝚘𝚌𝚊𝚝𝚎_𝚗𝚘𝚍𝚎() in Lines 1-11. This function takes three arguments u, wL and wR to create a new node. If unull, it makes a copy of u, and decrements u.𝚛𝚎𝚏 because it means one pointer will replace u with its copy. Then it sets the left and right child of the new node to wL and wR, and (if they are not null pointers) increment wL.𝚛𝚎𝚏 and wR.𝚛𝚎𝚏 by 1 each because the new node will point to them.

All functions implementing 𝒟 besides 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍() and 𝒟.𝚊𝚍𝚍(,r) are self-explanatory so we do not discuss them. Function 𝒟.𝚎𝚡𝚝𝚎𝚗𝚍() is also simple as it just allocates a new root, sets its left and right children pointers to the old root, and sets its height to the height of the old root incremented by one unit. Note that the counter 𝚛𝚎𝚏 associated with the old root is correctly set at 2. The counter 𝚛𝚎𝚏 associated with the new root is set at 1 (even though the new root is not referenced by any node, this value prevents the code from re-allocating the new root each time function 𝒟.𝚊𝚍𝚍(,r) is called).

Algorithm 5 Private functions of the data structure from Algorithm 4.

We proceed by discussing the details of 𝒟.𝚊𝚍𝚍(,r). When this function is invoked, we call a subroutine 𝚒𝚗𝚌𝚛𝚎𝚖𝚎𝚗𝚝() with arguments representing the root node, the interval [,r), and the height of the tree (Line 5). We start from the root node, and perform 𝚒𝚗𝚌𝚛𝚎𝚖𝚎𝚗𝚝() recursively. Suppose we arrive at a node v of height h covering a range [x,x+2h1). In Line 14, if v is referred by more than one pointer (i.e., if v.𝚛𝚎𝚏>1), this means that the range [x,x+2h1) is not the unique range that v is covering; in other words, v is being reused at more than one place. Thus we make a copy of v, and proceed with the copied node. This new node will be returned by the function (Lines 17,23) so that it can replace the old node properly (Lines 5,18-19). Now we are to perform an update according to the given interval. When [,r) is passed as an argument of the function, it means that we are to increment A[a] for a[x+,x+r). If the size of the interval is exactly 2h1, we need to increase A[a] by 1 unit for all a[x,x+2h1), which can be performed by incrementing v.𝚟𝚊𝚕 and v.𝚖𝚒𝚗 by 1 each (Lines 15-17). Otherwise, we split the interval at position x+2h2 (i.e., split [x+,x+r) into [x+,x+2h2) and [x+2h2,x+r)), then process each split segment with its left and right child, respectively (Lines 18-19). After processing the insertion at the child nodes, we update v.𝚖𝚒𝚗 and v.𝚊𝚛𝚐𝚖𝚒𝚗 according to the minimum computed in the children (Lines 20-22), which will propagate to the root node. If the minimum is from the right child, we add the offset 2h2 for updating v.𝚊𝚛𝚐𝚖𝚒𝚗 properly.

The cost of procedure 𝒟.𝚊𝚍𝚍() is proportional to O(log|A|). This is because in Line 15 we do not further recurse on the children of v if the local interval [,r) spans the entire range of length 2h1 associated with v; in turn, this means that the recursive calls to 𝚒𝚗𝚌𝚛𝚎𝚖𝚎𝚗𝚝 stop on the O(log|A|) nodes whose associated intervals partition the initial range [,r) on which function 𝒟.𝚊𝚍𝚍(,r) was called. Additionally, all ancestors of those nodes have at least two children (notice that, if 𝚒𝚗𝚌𝚛𝚎𝚖𝚎𝚗𝚝 is called recursively, the recursion always occurs on both children of the node: see Lines 18 and 19), therefore the total number of nodes recursively visited by 𝒟.𝚊𝚍𝚍(,r) is O(log|A|).

Appendix B Additional Material for Section 5

Table 1: Summary of the thirteen datasets. From left to right, we report the dataset id and name, the number of distinct universe elements belonging to the sets (σ), the size of the universe (u=2log|U|), the total length (N), the number of sets (n), and the average size of a set (N/n).
dataset σ u N n N/n
1 email-Enron-core-seqs 141 256 14148 10428 1.36
2 contact-prim-school-seqs 242 256 251546 174796 1.44
3 contact-high-school-seqs 327 512 377000 308990 1.22
4 email-Eu-core-seqs 937 1024 252872 202769 1.25
5 tags-mathoverflow-seqs 1399 2048 125056 44950 2.78
6 tags-math-sx-seqs 1650 2048 1177312 517810 2.27
7 coauth-Business-seqs 236226 2097152 849838 463070 1.84
8 coauth-Geology-seqs 525348 2097152 3905349 1438652 2.71
9 DBLP.xml 26 32 28815437 3939813 7.31
10 proteins-high-id 20 32 40664 2128 19.11
11 proteins-long 25 32 11051828 571282 19.35
12 book-ratings 5 8 48763 10000 4.88
13 tags-book 34252 65536 999904 10000 99.99
Table 2: Experimental result. From left to right, we report the dataset id, and the optimal shifted encoding size (opt-shift) followed by the average shifted encoding size (avg-shift), the ratio between opt-shift and avg-shift, the worst shifted encoding size (worst-shift), and the ratio between opt-shift and worst-shift. The last two columns show the optimal shifted ordered encoding size (opt-ord), and the ratio between opt-shift and opt-ord (only for datasets with small universe size u).
id opt-shift avg-shift opt-shiftavg-shift(%) worst-shift opt-shiftworst-shift(%) opt-ord opt-shift opt-ord(%)
1 105708 106787 98.99 108027 97.85 84843 124.59
2 1864240 1870395 99.67 1876731 99.33 1815243 102.70
3 3227940 3244486 99.49 3257990 99.07 2906848 111.05
4 2457044 2460375 99.86 2464644 99.69 2181173 112.65
5 1092079 1138803 95.90 1177890 92.71 880232 124.07
6 10727737 11069006 96.92 11413439 93.99 8426162 127.31
7 14940827 15073904 99.12 15171942 98.47 - -
8 66864377 67972793 98.37 68812124 97.17 - -
9 70403690 75662919 93.05 81715980 86.15 58285023 120.79
10 83178 86608 96.04 89503 92.93 80989 102.70
11 22454406 23371923 96.07 24144585 93.00 21868947 102.68
12 96467 97865 98.57 98714 97.80 87566 110.16
13 7923290 8022933 98.76 8146000 97.27 - -