Compact Data Structures for Collections of Sets

Jarno N. Alanko, Department of Computer Science, University of Helsinki, Finland
Philip Bille, Technical University of Denmark, Lyngby, Denmark
Inge Li Gørtz, Technical University of Denmark, Lyngby, Denmark
Gonzalo Navarro, Department of Computer Science, University of Chile, and Center for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
Simon J. Puglisi, Department of Computer Science, University of Helsinki, Finland
Abstract

We define a new entropy measure L(𝒮), called the containment entropy, for a set 𝒮 of sets, which considers the fact that some sets can be contained in others. We show how to represent 𝒮 within space close to L(𝒮) so that any element of any set can be retrieved in logarithmic time. We extend the result to predecessor and successor queries and show how some common set operations can be implemented efficiently.

Keywords and phrases:
Compressed data structures, entropy of sets, data compression
Category:
Research
Funding:
Jarno N. Alanko: Funded by the Helsinki Institute for Information Technology (HIIT).
Philip Bille: Danish Research Council grant DFF-8021-002498.
Inge Li Gørtz: Danish Research Council grant DFF-8021-002498.
Gonzalo Navarro: Basal Funds FB0001 and AFB240001, ANID, Chile.
Simon J. Puglisi: Academy of Finland grant 339070.
Copyright and License:
© Jarno N. Alanko, Philip Bille, Inge Li Gørtz, Gonzalo Navarro, and Simon J. Puglisi; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Data structures design and analysis
Acknowledgements:
This work was initiated at the NII Shonan Meeting no. 187 on “Theoretical Foundations of Nonvolatile Memory.”
Editors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter

1 Introduction

We consider the problem of representing a collection of sets 𝒮 = {S_1, …, S_s} from a universe 𝒰 of size u while supporting basic queries, including retrieving the kth element and predecessor and successor queries. The goal is to store the sets compactly while supporting fast queries. This problem has important applications, including the representation of postings lists in inverted indexes and adjacency-list representations of graphs.

To measure space, we consider the worst-case entropy $H(\mathcal{S}) = \sum_{i=1}^{s} \lg \binom{u}{|S_i|}$, a natural information-theoretic lower bound on the number of bits needed to store 𝒮 in the worst case. Using standard techniques [6, 5, 9], we can store 𝒮 within H(𝒮) + O(n + s log n) bits and support retrieval (i.e., accessing any kth element of any set) in constant time.

In this paper, we propose a finer measure of entropy for 𝒮 that can take advantage of the fact that some sets may be subsets of others. If S_i ⊆ S_j, we can encode S_i using $\lg \binom{|S_j|}{|S_i|}$ bits, by indicating which elements of S_j are also in S_i. We show how to construct a hierarchy from 𝒮 such that children are subsets of their parent. This leads to a new notion of entropy called the containment entropy, defined as

$$L(\mathcal{S}) = \sum_{i=1}^{s} \lg \binom{|p(S_i)|}{|S_i|},$$

where p(S_i) is the parent of S_i (see Section 3). It is easy to see that the containment entropy is a finer notion of entropy, since L(𝒮) ≤ H(𝒮). Our main result is that we can represent 𝒮 in space close to its containment entropy while supporting queries efficiently.

Theorem 1.

Let 𝒮 be a set of s sets of total size n, whose elements are drawn from a universe of size u. We can construct a data structure in O(s·n·log n) time that uses L(𝒮) + O(n + s log n) bits of space and supports retrieval, predecessor, and successor queries on any set S ∈ 𝒮 in time O(log(u/|S|)).

Thus, compared to the above-mentioned standard data structures, we replace the worst-case entropy H(𝒮) with L(𝒮) in our space bound in Theorem 1.

We also obtain several corollaries of Theorem 1. We show how to implement the common operations of set intersection, set union, and set difference. By combining Theorem 1 with techniques from Barbay and Kenyon [2] we obtain fast implementations of these operations in terms of the alternation bound [2] directly on our representation. We also show how to apply Theorem 1 to efficiently store a collection of bitvectors while supporting access, rank, and select queries.

Technically, the result is obtained by constructing a tree structure, where each node represents a set, and children represent subsets of their parent. We then represent each set as a sparse bitvector that indicates which of the elements of the parent are present in the child. This leads to a simple representation that achieves the desired space bound. Unfortunately, if we directly implement a retrieval query on this representation, we have to traverse the path from the set S to the root leading to a query time of Ω(h), where h is the height of the tree. We show how to efficiently shortcut paths in the tree without asymptotically increasing the space, yielding the O(log(u/|S|)) time. We then extend these techniques to handle predecessor and successor queries in the same time, leading to the full result of Theorem 1.

Relation with wavelet trees.

We note that the idea is reminiscent of wavelet trees [7], which represent a sequence of symbols as a binary tree where each node stores a bitvector. In the root’s bitvector, 0s/1s mark the sequence positions with symbols in the first/second half of the alphabet, and the left/right children recursively represent the subsequences formed by the symbols in the first/second half of the alphabet. Nodes handling single-symbol subsequences are wavelet tree leaves. If we define 𝒮 as the sets of positions in the sequence holding the symbols of each subalphabet considered in the wavelet tree, then our data structure built on 𝒮 has the same shape as the wavelet tree. The wavelet tree uses less space, however, because it corresponds to the particular case where the maximal subsets of each set partition it into two: while the wavelet tree can be compressed to the zero-order entropy of the sequence [7], L(𝒮) doubles that entropy.

2 Basic Concepts

A bitvector B[1..u] is a data structure that represents a sequence of u bits and supports the following operations:

  • access(B,i) returns B[i], the ith bit of B.

  • rank_b(B,i), where b ∈ {0,1}, returns the number of times bit b occurs in the prefix B[1..i] of the bitvector. We assume rank_b(B,0) = 0.

  • select_b(B,j) returns the position in B of the jth occurrence of bit b ∈ {0,1}. We assume select_b(B,0) = 0 and select_b(B,j) = u+1 if b does not occur j times in B.

Note that rank_0(B,i) = i − rank_1(B,i). By default we assume b = 1. It is possible to represent bitvector B using u + o(u) bits and solve all the operations in O(1) time [3, 8]. If n is the number of 1s in B, it is possible [6, 5, 9] to represent B using

$$n \lg\frac{u}{n} + 2n = \lg\binom{u}{n} + O(n + \log u)$$

bits (the formula on the left is zero if n = 0), so that select_1 is implemented in constant time, while access and rank_b are solved in time O(1 + log(u/n)). Operation select_0 can be solved in time O(log n) by binary search on select_1. This is called the sparse bitvector representation in this paper.
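For concreteness, the semantics of these operations can be illustrated with a toy Python sketch (our own, not the succinct structure itself: it stores the sorted positions of the 1s and answers queries by binary search, so only the operation semantics match the cited representations).

```python
from bisect import bisect_right

class SparseSet:
    """Toy stand-in for a sparse bitvector B[1..u]: stores the sorted
    (1-based) positions of the 1s instead of u bits."""

    def __init__(self, u, ones):
        self.u = u                      # bitvector length / universe size
        self.ones = sorted(ones)        # positions i with B[i] = 1

    def access(self, i):                # B[i]
        j = bisect_right(self.ones, i)
        return 1 if j > 0 and self.ones[j - 1] == i else 0

    def rank1(self, i):                 # number of 1s in B[1..i]
        return bisect_right(self.ones, i)

    def rank0(self, i):                 # i - rank1(B, i)
        return i - self.rank1(i)

    def select1(self, j):               # position of the jth 1 (0 if j = 0)
        if j == 0:
            return 0
        if j > len(self.ones):
            return self.u + 1           # one past the end: no jth occurrence
        return self.ones[j - 1]
```

For example, on B = 0110001000 (u = 10, ones at positions 2, 3, 7), rank1(5) = 2 and select1(3) = 7.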

We note that $\lg\binom{u}{n}$ is the entropy of the set of positions of the n 1s in B, or equivalently, the entropy of a set of n elements over a universe of size u. We will speak interchangeably of the operations access, rank, and select over sets S on the universe [1..u], referring to the corresponding bitvector B where B[i] = 1 iff i ∈ S. For example, select(S,k) is the kth smallest element of S.

3 Containment Entropy

Let 𝒮 = {S_1, S_2, …, S_s} be a set of s distinct sets, and let 𝒰 = S_1 ∪ S_2 ∪ ⋯ ∪ S_s be their union. For simplicity we assume 𝒰 = [1..u], so u = |𝒰|. Let n_i = |S_i| and let $n = \sum_{i=1}^{s} n_i$ be the total size of all sets; note n ≥ u.

A simple notion of entropy for 𝒮 is its worst-case entropy

$$H(\mathcal{S}) = \sum_{i=1}^{s} \lg\binom{u}{n_i}.$$

It is not hard to store 𝒮 within H(𝒮) + O(n + s log n) bits, by using the sparse bitvector representation of Section 2, which stores each S_i within $\lg\binom{u}{n_i} + O(n_i + \log u)$ bits and offers constant-time retrieval of any element of S_i via select(S_i,k). We use O(s log n) additional bits to address the representations of the s sets.

We now propose a finer measure of entropy for 𝒮, which exploits the fact that some sets can be subsets of others. If S_i ⊆ S_j, we can describe S_i using $\lg\binom{n_j}{n_i}$ bits, which indicate which elements of S_j are also in S_i. We define a tree structure whose nodes are the sets S_i, where the parent of S_i, p(S_i), is any smallest S_j such that S_i ⊆ S_j. If S_i is not contained in any other set, then its parent is the tree root, which represents 𝒰. We then define the following measure of entropy, which we call the containment entropy.

Definition 2.

Let 𝒮 be a set of sets S_i. Its hierarchy has 𝒰 = ∪_i S_i at the root, and the parent p(S_i) of S_i is any smallest set containing S_i (or 𝒰 if no such set exists). Let p_i = |p(S_i)|. The containment entropy of 𝒮 is

$$L(\mathcal{S}) = \sum_{i=1}^{s} \lg\binom{p_i}{n_i}.$$

Clearly L(𝒮) ≤ H(𝒮), because p_i ≤ u for every i. Note that L(𝒮) is arguably the optimal space we can achieve by representing sets as subsets of others in 𝒮, but we could obtain less space if other arrangements were allowed. For example, if many sets are subsets of both S_i and S_j, it could be convenient to introduce the new set S_i ∩ S_j in the hierarchy, even if it is not in 𝒮. An advantage of L(𝒮) is that it is easily computed in polynomial time from 𝒮, whereas more general notions, like the one that allows the creation of new sets, may not be.
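For exposition, both measures can be computed with a brute-force sketch (our own, quadratic in the number of sets; a practical construction is discussed in Section 6): each parent is a smallest strict superset among the other sets, defaulting to the universe.

```python
from math import comb, log2

def containment_entropy(sets, universe):
    """L(S) per Definition 2, computed naively: the parent of each set
    is a smallest superset among the other sets (the universe if none),
    and each set contributes lg C(|parent|, |set|) bits."""
    total = 0.0
    for S in sets:
        supers = [T for T in sets if T != S and S <= T]
        parent = min(supers, key=len) if supers else universe
        total += log2(comb(len(parent), len(S)))
    return total

def worst_case_entropy(sets, universe):
    """H(S): every set is encoded against the whole universe."""
    return sum(log2(comb(len(universe), len(S))) for S in sets)
```

For example, with 𝒰 = [1..8] and the sets {1,2,3,4}, {1,2}, {5,6}, the set {1,2} is encoded relative to {1,2,3,4}, so L(𝒮) = lg 70 + lg 6 + lg 28, strictly below H(𝒮) = lg 70 + lg 28 + lg 28.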

4 A Containment Entropy Bounded Representation with Fast Retrieval

It is not hard to represent 𝒮 within space close to L(𝒮): we can use the sparse bitvector representation of Section 2 to store each S_i relative to its parent p(S_i), within $\lg\binom{p_i}{n_i} + O(n_i + \log p_i)$ bits, which add up to L(𝒮) + O(n + s log u) bits. The problem is that now select(S_i,k) gives the position of the kth element of S_i within those of p(S_i), not within 𝒰, and thus select(S_i,k) is not directly the identity of the kth element of S_i. In order to obtain the identity of the element, we must compute select(p(S_i), select(S_i,k)) and so on, consecutively following the chain of ancestors of S_i up to the root 𝒰; see Algorithm 1. This may take time proportional to the height of the hierarchy, which can be up to O(s). We now show how to reduce this time to logarithmic, by introducing shortcuts in the hierarchy.

Algorithm 1 Retrieving the kth element of a set S in a hierarchy.
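The body of Algorithm 1 amounts to climbing the parent chain; the following minimal Python sketch is our own rendering, with each set stored as a sorted list of positions relative to its parent, and a `parent` map holding None at the root.

```python
def retrieve(name, k, parent, rel):
    """Return the identity in U of the kth smallest element of the set
    `name`.  rel[name][j-1] is the position of the set's jth element
    within parent[name]; at the root U, positions are identities."""
    v = k
    while parent[name] is not None:
        v = rel[name][v - 1]        # select(S, v), relative to the parent
        name = parent[name]
    return v                        # at the root, the position is the element
```

For example, with 𝒰 = [1..8], A = {2,3,7} stored as rel['A'] = [2,3,7], and B = {3,7} ⊆ A stored as rel['B'] = [2,3], retrieve('B', 2, …) climbs B → A → 𝒰 and returns 7.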
Definition 3.

The contracted hierarchy of 𝒮 has 𝒰 as its root, and the parent c(S_i) of S_i is the highest ancestor S_t of p(S_i) in the hierarchy of 𝒮 such that n_t ≤ 2n_i. If no such ancestor exists, then c(S_i) = p(S_i).

We now prove a couple of relevant properties of this contracted hierarchy.

Lemma 4.

The depth of node Si in the contracted hierarchy of 𝒮 is O(log(u/ni)). The height of the contracted hierarchy is O(logu).

Proof.

By definition, the grandparent of node S_i in the contracted hierarchy, if it exists, has size > 2n_i; thus the path towards the root (whose size is u) cannot be longer than 2 lg(u/n_i). An immediate consequence is that the height of the tree cannot exceed 2 lg u.

If we change p(S) to c(S) in Algorithm 1, the retrieval time becomes O(log(u/|S|)). We now show that the space is not asymptotically affected by contracting the hierarchy.

Lemma 5.

The contracted hierarchy of 𝒮 can be represented within L(𝒮)+O(n+slogn) bits.

Proof.

We start from our representation that uses L(𝒮) + O(n + s log n) bits, and show how it changes when we replace p(S_i) by c(S_i). In case p(S_i) ≠ c(S_i), Definition 3 yields |c(S_i)| ≤ 2|S_i|, and since S_i ⊆ p(S_i), it holds that |c(S_i)| ≤ 2|S_i| ≤ 2|p(S_i)|. Then, changing p(S_i) to c(S_i) increases the size of the representation from $n_i \lg\frac{p_i}{n_i} + 2n_i$ to at most $n_i \lg\frac{2p_i}{n_i} + 2n_i = n_i \lg\frac{p_i}{n_i} + 3n_i$. Thus, the total increase produced by changing all the p(S_i) to c(S_i) is bounded by n; we still use L(𝒮) + O(n + s log n) bits. The parent pointers that allow climbing paths also fit in O(s log n) bits.

In summary, we have the following result.

Lemma 6.

A set 𝒮 of s sets of total size n and universe size u can be represented within L(𝒮)+O(n+slogn) bits so that any element of any set S can be retrieved in time O(log(u/|S|)).

5 Other Operations

5.1 Predecessor and Successor in a Set

Given an element identifier k and a set S_i, the predecessor is the largest v ≤ k such that v ∈ S_i. Conversely, the successor is the smallest v ≥ k such that v ∈ S_i. We can find the predecessor in O(log u) time, as follows. We start at the node S_i and walk up the path to the root 𝒰, so as to determine the nodes in the path. On the return from the recursion, we start with the position v ← k in 𝒰, and whenever returning from a node S to its child S′, we compute v ← rank(S′,v), which gives the number of elements of S′ up to the element v of S, that is, the position in S′ of the predecessor of v. By the time we return to S_i, v is the position in S_i of the predecessor of k; see Algorithm 2. We then find the identity of v in 𝒰 using Algorithm 1. To find the successor, we use v ← 1 + rank(S′, v−1) instead.

Algorithm 2 Finding the position of the predecessor of k in the set S of a hierarchy.
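Algorithm 2 can be sketched in the same toy encoding used above for retrieval (sorted relative positions plus a `parent` map, our own rendering): the recursion walks up to the root and applies rank on the way back down.

```python
from bisect import bisect_right

def pred_position(name, k, parent, rel):
    """Position within the set `name` of the predecessor of k
    (0 if k has no predecessor there).  rel[name] lists the set's
    elements as positions within parent[name]; at the root U the
    predecessor of k in [1..u] sits at position k itself."""
    if parent[name] is None:            # base case: the root U
        return k
    v = pred_position(parent[name], k, parent, rel)
    return bisect_right(rel[name], v)   # rank(S, v) on the sorted positions
```

For example, with 𝒰 = [1..8], A = {2,3,7} (rel['A'] = [2,3,7]) and B = {3,7} ⊆ A (rel['B'] = [2,3]), the predecessor of 5 in B is its 1st element (the value 3), while 2 has no predecessor in B.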

Operation rank on the representation of a set S′ relative to its parent S takes time O(1 + log(|S|/|S′|)) in the sparse bitvector representation, as seen in Section 2. Therefore, the sum of the times along the path down to S_i telescopes to O(log(u/n_i)). This yields the following result.

Lemma 7.

On the same representation of Lemma 6, we can compute the predecessor and successor of any element of any set S in time O(log(u/|S|)).

5.2 Set Operations

These operations are useful to compute set operations, by mimicking any standard algorithm that traverses both ordered sets. In particular, we can implement an intersection algorithm whose cost is close to the alternation complexity lower bound δ [2]: any algorithm that can find the successor of any element in either list in time t can intersect both lists in time O(δ·t). Predecessor and successor on the complement of a set can also be solved in time O(t), by using rank_0 instead of rank (and handling the base case accordingly).

Corollary 8.

On the representation of Lemma 6, we can compute the intersection of any two sets S_i and S_j in time O(δ log u), where δ is the alternation complexity of the two sets. We can also compute their union in time O(|S_i ∪ S_j| · log u). The difference S_i ∖ S_j can be computed in time O(δ log u), where δ is now the alternation complexity of S_i and the complement S_j^c.
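To illustrate the kind of algorithm the corollary refers to, here is a successor-driven intersection sketch (our own, over plain sorted lists, with binary search standing in for our structure's successor queries; the loop performs O(δ) successor calls).

```python
from bisect import bisect_left

def succ(S, k):
    """Smallest element >= k in the sorted list S, or None."""
    i = bisect_left(S, k)
    return S[i] if i < len(S) else None

def intersect(A, B):
    """Intersect two sorted lists by alternating successor queries:
    each set jumps to the successor of the current candidate element."""
    out, k = [], 0
    while True:
        a = succ(A, k)
        if a is None:
            return out
        b = succ(B, a)
        if b is None:
            return out
        if a == b:              # both sets contain the candidate
            out.append(a)
            k = a + 1
        else:                   # b > a: nothing common below b
            k = b
```

Union and difference follow the same pattern, traversing both ordered sets and (for the difference) querying successors on the complement via rank_0.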

5.3 Back to Bitvectors

Returning to the bitvector semantics, our results allow us to store a set of sparse bitvectors B_i[1..u], representing subsets S_i of a universe 𝒰 of size u, so that operations access(B_i,k), rank(B_i,k), and select(B_i,k) take time O(log(u/n_i)). To implement access(B_i,k), we return 1 if the predecessor of k in S_i is k itself, and 0 otherwise. To implement rank(B_i,k), we compute the position v of the predecessor of k in the ordered set S_i, which is precisely rank(B_i,k). To implement select_1(B_i,k), we retrieve the kth element of S_i, and for select_0 we do binary search on select_1.

Corollary 9.

A set of s bitvectors B_i[1..u] can be represented in L(𝒮) + O(n + s log u) bits, where 𝒮 is the set of sets S_i = {k : B_i[k] = 1}, n_i is the number of 1s in B_i, and $n = \sum_i n_i$. This representation supports operations access, rank_b, and select_1 on any B_i in time O(log(u/n_i)), and select_0 in time O(log n_i · log(u/n_i)).

This is to be compared with storing each bitvector directly [9], which yields total space H(𝒮) + O(n + s log u) and supports access and rank_b in the same time O(log(u/n_i)), and select_b in the better times O(1) for b = 1 and O(log n_i) for b = 0.

6 Construction

We can build the hierarchical representation by adding one set S at a time, in decreasing order of size, which ensures it is correct to insert S as a leaf. To insert S in 𝒮, we first find a smallest set S′ ∈ 𝒮 such that S ⊆ S′. The sets that contain S form a connected subgraph of the hierarchy that includes the root, so we can traverse the hierarchy from the root, looking for the lowest nodes that contain S, and retain a smallest one among those. For each hierarchy node S_i to check, we take each element of S and verify that it exists in S_i via a predecessor query, which takes time O(log(u/|S_i|)) ⊆ O(log(u/|S|)) because |S| ≤ |S_i|. We thus find a smallest set S′ containing S in time O(s′ · |S| · log(u/|S|)), where s′ ≤ n/|S| is the current number of sets in 𝒮 and n is its final total size.

We then insert S in the hierarchy by setting p(S) = S′. To find c(S), we find the highest ancestor S′′ of S′ whose size is |S′′| ≤ 2|S| (or let S′′ = S′ if no such ancestor exists), and set c(S) = S′′. Finally, we build the representation of S relative to S′′, in time O(|S|) [9].

Note that, when we find the ancestor S′′, we traverse the upward path defined by the parent function p(·), not c(·). We still use c(·) during construction to answer, in logarithmic time, the predecessor queries that determine the inclusion of S.
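The construction loop can be sketched as follows (our own simplification: a linear scan over the already-inserted sets stands in for the root-to-leaf traversal, and Python subset tests stand in for the predecessor queries; sets are assumed to be proper subsets of the universe).

```python
def build_hierarchy(sets, universe):
    """Map each set to its parent p(S): sets are inserted in decreasing
    order of size, and the parent of each is a smallest previously
    inserted strict superset (the universe if none exists)."""
    parent = {universe: None}
    for S in sorted(sets, key=len, reverse=True):
        best = universe
        for T in parent:                # stand-in for the hierarchy walk
            if S < T and len(T) < len(best):
                best = T
        parent[S] = best
    return parent
```

Inserting in decreasing order of size guarantees that no later set can be a superset of an earlier one, so each new set is correctly added as a leaf.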

Lemma 10.

The representation of Lemma 6 can be built in time O(snlogn).

Combining Lemmas 6, 7, and 10 we have shown Theorem 1.

7 Concluding Remarks

We have described the containment entropy, a new entropy measure for collections of sets, which is sensitive to subset relationships between the sets. To our knowledge, this idea has not been considered before in the vast literature on efficient set representations, which has emerged primarily from efficiency concerns in information retrieval systems [11]. One could consider dictionary-based compression of sets (see, e.g., [10, 4]) as implicitly capturing some aspects of subset relationships, but the representations and analysis we have described here consider subset relationships explicitly.

An interesting direction for future work is to explore the practicality of these hierarchical set representations in various application contexts. An immediate concern is that of hierarchy construction. Our initial experiments with a collection of sets taken from the genomic search engine Themisto [1], applied to a set of 16 thousand bacterial genomes (over 80 GB of data), indicate that, even for large set collections, hierarchy construction is tractable and scales almost linearly with the total length of the sets in practice. On a collection of 10.55 million sets of average size 3,607, totaling more than 38 billion elements (10.52 million of the sets being subsets of at least one other set), we were able to find the smallest superset of every set in less than 40 hours, using just 16 threads of execution. The resulting representation used just 0.18 bits per element, compared to the 0.32 bits per element used by Themisto’s representation, which selects between different set representations based on set density.

References

  • [1] Jarno N. Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J. Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics, 39(Supplement-1):260–269, 2023. doi:10.1093/BIOINFORMATICS/BTAD233.
  • [2] J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Transactions on Algorithms, 4(1):1–18, 2008. doi:10.1145/1328911.1328915.
  • [3] D. R. Clark. Compact PAT Trees. PhD thesis, University of Waterloo, Canada, 1996.
  • [4] F. Claude and G. Navarro. Fast and compact web graph representations. ACM Transactions on the Web, 4(4):article 16, 2010.
  • [5] P. Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21:246–260, 1974. doi:10.1145/321812.321820.
  • [6] R. Fano. On the number of bits required to implement an associative memory. Memo 61, Computer Structures Group, Project MAC, Massachusetts, 1971.
  • [7] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841–850, 2003.
  • [8] J. I. Munro. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pages 37–42, 1996.
  • [9] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60–70, 2007.
  • [10] Giulio Ermanno Pibiri, Matthias Petri, and Alistair Moffat. Fast dictionary-based compression for inverted indexes. In Proc. 12th ACM International Conference on Web Search and Data Mining (WSDM), pages 6–14. ACM, 2019. doi:10.1145/3289600.3290962.
  • [11] Giulio Ermanno Pibiri and Rossano Venturini. Techniques for inverted index compression. ACM Computing Surveys, 53(6):125:1–125:36, 2021. doi:10.1145/3415148.