Compact Data Structures for Collections of Sets
Abstract
We define a new entropy measure $\mathcal{H}_c(\mathcal{S})$, called the containment entropy, for a set of sets $\mathcal{S}$, which accounts for the fact that some sets can be contained in others. We show how to represent $\mathcal{S}$ within space close to $\mathcal{H}_c(\mathcal{S})$ so that any element of any set can be retrieved in logarithmic time. We extend the result to predecessor and successor queries and show how some common set operations can be implemented efficiently.
Keywords and phrases: Compressed data structures, entropy of sets, data compression
Category: Research
Funding: Jarno N. Alanko: Funded by the Helsinki Institute for Information Technology (HIIT).
2012 ACM Subject Classification: Theory of computation → Data structures design and analysis
Acknowledgements: This work was initiated at the NII Shonan Meeting no. 187 on “Theoretical Foundations of Nonvolatile Memory.”
Editors: Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter
Series and Publisher: Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
We consider the problem of representing a collection $\mathcal{S} = \{S_1, \ldots, S_m\}$ of $m$ sets from a universe of size $u$ while supporting basic queries, including retrieving the $j$th element of any set and predecessor and successor queries. The goal is to store the sets compactly while supporting fast queries. This problem has important applications, including the representation of postings lists in inverted indexes and adjacency-list representations of graphs.
To measure space, we often consider the worst-case entropy, defined as $\mathcal{H}(\mathcal{S}) = \sum_{i=1}^m \log \binom{u}{|S_i|}$, a natural information-theoretic worst-case lower bound on the number of bits needed to store $\mathcal{S}$. Using standard techniques [6, 5, 9], we can store $\mathcal{S}$ within $\mathcal{H}(\mathcal{S}) + O(n)$ bits, where $n = \sum_{i=1}^m |S_i|$ is the total size of the sets, and support retrieval (i.e., accessing any $j$th element of any set) in constant time.
In this paper, we propose a finer measure of entropy for $\mathcal{S}$ that can take advantage of the fact that some sets may be subsets of others. If $S' \subseteq S$, we can encode $S'$ using $\log \binom{|S|}{|S'|}$ bits, by indicating which elements in $S$ are also in $S'$. We show how to construct a hierarchy from $\mathcal{S}$ such that children are subsets of their parents. This leads to a new notion of entropy, called the containment entropy, defined as
$$\mathcal{H}_c(\mathcal{S}) = \sum_{i=1}^m \log \binom{|P(S_i)|}{|S_i|},$$
where $P(S_i)$ is the parent of $S_i$ in the hierarchy (see Section 3). It is easy to see that the containment entropy is a finer notion of entropy, since $\mathcal{H}_c(\mathcal{S}) \le \mathcal{H}(\mathcal{S})$. Our main result is that we can efficiently represent $\mathcal{S}$ in space close to its containment entropy while supporting queries efficiently.
Theorem 1.
Let $\mathcal{S}$ be a set of $m$ sets of total size $n$, whose elements are drawn from a universe of size $u$. We can construct a data structure in $O(mn \log u)$ time that uses $\mathcal{H}_c(\mathcal{S}) + O(n + m \log n)$ bits of space and supports retrieval, predecessor, and successor queries on any set in time $O(\log u)$.
Thus, compared to the above-mentioned standard data structures, Theorem 1 replaces the worst-case entropy $\mathcal{H}(\mathcal{S})$ with $\mathcal{H}_c(\mathcal{S}) \le \mathcal{H}(\mathcal{S})$ in the space bound.
We also obtain several corollaries of Theorem 1. We show how to implement the common operations of set intersection, set union, and set difference. By combining Theorem 1 with techniques from Barbay and Kenyon [2], we obtain fast implementations of these operations, in terms of the alternation bound [2], directly on our representation. We also show how to apply Theorem 1 to efficiently store a collection of bitvectors while supporting access, rank, and select queries.
Technically, the result is obtained by constructing a tree structure where each node represents a set and children represent subsets of their parent. We then represent each set as a sparse bitvector that indicates which of the elements of the parent are present in the child. This leads to a simple representation that achieves the desired space bound. Unfortunately, if we directly implement a retrieval query on this representation, we have to traverse the path from the set to the root, leading to a query time of $O(h)$, where $h$ is the height of the tree and can be as large as $m$. We show how to efficiently shortcut paths in the tree without asymptotically increasing the space, yielding the $O(\log u)$ time. We then extend these techniques to handle predecessor and successor queries in the same time, leading to the full result of Theorem 1.
Relation with wavelet trees.
We note that the idea is reminiscent of wavelet trees [7], which represent a sequence of symbols as a binary tree where each node stores a bitvector. In the root’s bitvector, 0s/1s mark the sequence positions with symbols in the first/second half of the alphabet, and the left/right children recursively represent the subsequences formed by the symbols in the first/second half of the alphabet. Nodes handling single-symbol subsequences are wavelet tree leaves. If we define $\mathcal{S}$ as the sets of positions in the sequence holding the symbols of each subalphabet considered in the wavelet tree, then our data structure built on $\mathcal{S}$ has the same shape as the wavelet tree. The wavelet tree uses less space, however, because it corresponds to a particular case where the maximal subsets of each set partition it into two: while the wavelet tree can be compressed to the zero-order entropy of the sequence [7], $\mathcal{H}_c(\mathcal{S})$ doubles that entropy.
2 Basic Concepts
A bitvector $B[1..n]$ is a data structure that represents a sequence of $n$ bits and supports the following operations:
- $\mathrm{access}(B, i)$ returns $B[i]$, the $i$th bit of $B$.
- $\mathrm{rank}_b(B, i)$, where $b \in \{0, 1\}$, returns the number of times bit $b$ occurs in the prefix $B[1..i]$ of the bitvector. We assume $\mathrm{rank}_b(B, 0) = 0$.
- $\mathrm{select}_b(B, j)$ returns the position in $B$ of the $j$th occurrence of bit $b$. We assume $\mathrm{select}_b(B, 0) = 0$, and $\mathrm{select}_b(B, j) = n + 1$ if $b$ does not occur $j$ times in $B$.
Note that $\mathrm{rank}_b(B, \mathrm{select}_b(B, j)) = j$. By default we assume $b = 1$, that is, $\mathrm{rank} = \mathrm{rank}_1$ and $\mathrm{select} = \mathrm{select}_1$. It is possible to represent bitvector $B$ using $n + o(n)$ bits and solve all the operations in $O(1)$ time [3, 8]. If $m$ is the number of 1s in $B$, it is possible [6, 5, 9] to represent $B$ using $m \log\frac{n}{m} + O(m)$ bits (the formula is zero if $m = 0$), so that $\mathrm{select}$ is implemented in constant time, while $\mathrm{access}$ and $\mathrm{rank}$ are solved in time $O(\log\frac{n}{m})$. Operation $\mathrm{select}_0$ can be solved in $O(\log n)$ time by binary search. This is called the sparse bitvector representation in this paper.
We note that $m \log\frac{n}{m} + O(m) = \log\binom{n}{m} + O(m)$ is, up to lower-order terms, the entropy of the set of positions where $B$ contains 1s, or equivalently, the entropy of a set of $m$ elements over a universe of size $n$. We will indistinctly speak of the operations $\mathrm{access}$, $\mathrm{rank}$, and $\mathrm{select}$ over sets $S$ on the universe $[1..u]$, referring to the corresponding bitvector $B[1..u]$ where $B[i] = 1$ iff $i \in S$. For example, $\mathrm{select}(S, j)$ is the $j$th smallest element of $S$.
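To make this interface concrete, the following sketch implements the $\mathrm{access}$/$\mathrm{rank}$/$\mathrm{select}$ semantics over a set using a plain sorted array. It is only an illustration of the operation semantics under the conventions above (the class and method names are ours); it does not attempt the compressed space or the time bounds of the sparse bitvector representation.

```python
# Illustrative only: access/rank/select over a set S on universe [1..u],
# backed by a sorted Python list rather than a compressed bitvector.
import bisect

class SetAsBitvector:
    def __init__(self, elements, u):
        self.elems = sorted(set(elements))  # strictly increasing
        self.u = u

    def access(self, i):
        """B[i]: 1 iff i is in the set."""
        p = bisect.bisect_left(self.elems, i)
        return int(p < len(self.elems) and self.elems[p] == i)

    def rank(self, i):
        """rank_1(B, i): number of elements <= i; rank(0) = 0."""
        return bisect.bisect_right(self.elems, i)

    def select(self, j):
        """select_1(B, j): j-th smallest element; 0 if j = 0, u+1 if j too large."""
        if j == 0:
            return 0
        return self.elems[j - 1] if j <= len(self.elems) else self.u + 1

S = SetAsBitvector([2, 3, 7, 10], u=12)
assert S.select(3) == 7 and S.rank(7) == 3 and S.access(5) == 0
```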
3 Containment Entropy
Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ be a set of $m$ distinct sets, and let $U = S_1 \cup \cdots \cup S_m$ be their union. For simplicity we assume $U = [1..u]$, so $|U| = u$. Let $n_i = |S_i|$, and let $n = \sum_{i=1}^m n_i$ be the total size of all sets; note $u \le n$.
A simple notion of entropy for $\mathcal{S}$ is its worst-case entropy
$$\mathcal{H}(\mathcal{S}) = \sum_{i=1}^m \log \binom{u}{n_i}.$$
It is not hard to store $\mathcal{S}$ within $\mathcal{H}(\mathcal{S}) + O(n)$ bits by using the sparse bitvector representation of Section 2, which stores each $S_i$ within $\log\binom{u}{n_i} + O(n_i)$ bits and offers constant-time retrieval of any element of $S_i$ via $\mathrm{select}$. We use $O(m \log n)$ additional bits to address the representations of the $m$ sets.
We now propose a finer measure of entropy for $\mathcal{S}$, which exploits the fact that some sets can be subsets of others. If $S' \subseteq S$, we can describe $S'$ using $\log\binom{|S|}{|S'|}$ bits, which indicate which elements of $S$ are also in $S'$. We define a tree structure whose nodes are the sets $S_i$, where the parent $P(S_i)$ of $S_i$ is any smallest set $S_j$ that properly contains $S_i$. If $S_i$ is not contained in any other set, then its parent is the tree root, which represents $U$. We then define the following measure of entropy, which we call the containment entropy.
Definition 2.
Let $\mathcal{S}$ be a set of $m$ sets $S_1, \ldots, S_m$. Its hierarchy has $U = \bigcup_{i=1}^m S_i$ at the root, and the parent of each $S_i$ is any smallest set $S_j \supsetneq S_i$ (or $U$ if no such set exists). Let $P(S_i)$ denote the parent of $S_i$. The containment entropy of $\mathcal{S}$ is
$$\mathcal{H}_c(\mathcal{S}) = \sum_{i=1}^m \log \binom{|P(S_i)|}{n_i}.$$
Clearly $\mathcal{H}_c(\mathcal{S}) \le \mathcal{H}(\mathcal{S})$, because $|P(S_i)| \le u$ for every $i$. Note that $\mathcal{H}_c(\mathcal{S})$ is arguably the optimal space we can achieve by representing sets as subsets of others in $\mathcal{S}$, but we could obtain less space if other arrangements were possible. For example, if many sets are subsets of both $S_i$ and $S_j$, it could be convenient to introduce the new set $S_i \cap S_j$ in the hierarchy, even if it is not in $\mathcal{S}$. An advantage of $\mathcal{H}_c(\mathcal{S})$ is that it is easily computed in polynomial time from $\mathcal{S}$, whereas more general notions, like the one that allows the creation of new sets, may not be.
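As a concrete illustration, the following brute-force sketch builds the hierarchy of Definition 2 and evaluates $\mathcal{H}_c(\mathcal{S})$. The function names are ours, and the quadratic superset scan is for clarity only (Section 6 describes the actual construction).

```python
# Illustrative only: compute the containment entropy of Definition 2 by
# attaching each set to any smallest proper superset in the collection.
from math import comb, log2

def containment_entropy(sets):
    sets = [frozenset(s) for s in sets]
    U = frozenset().union(*sets)                     # the root of the hierarchy
    Hc = 0.0
    for S in sets:
        supersets = [T for T in sets if S < T]       # proper supersets of S
        parent = min(supersets, key=len, default=U)  # any smallest one, or U
        Hc += log2(comb(len(parent), len(S)))
    return Hc

sets = [{1, 2, 3, 4, 5, 6}, {2, 3, 5}, {3, 5}, {1, 4}]
u = 6
H  = sum(log2(comb(u, len(S))) for S in sets)        # worst-case entropy
Hc = containment_entropy(sets)
print(f"H = {H:.2f} bits, Hc = {Hc:.2f} bits")       # Hc <= H
```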
4 A Containment Entropy Bounded Representation with Fast Retrieval
It is not hard to represent $\mathcal{S}$ within space close to $\mathcal{H}_c(\mathcal{S})$: we can use the sparse bitvector representation of Section 2 to store each $S_i$ relative to its parent $P(S_i)$, within $\log\binom{|P(S_i)|}{n_i} + O(n_i)$ bits, which add up to $\mathcal{H}_c(\mathcal{S}) + O(n)$ bits. The problem is that now $\mathrm{select}(S_i, j)$ gives the position of the $j$th element of $S_i$ within the elements of $P(S_i)$, not within $U$, and thus it is not directly the identity of the $j$th element of $S_i$. In order to obtain that identity, we must compute $\mathrm{select}(P(S_i), \mathrm{select}(S_i, j))$ and so on, consecutively following the chain of ancestors of $S_i$ up to the root $U$; see Algorithm 1. This may take time proportional to the height of the hierarchy, which can be up to $m$. We now show how to reduce this time to logarithmic, by introducing shortcuts in the hierarchy.
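The following sketch illustrates the retrieval idea of Algorithm 1 (whose pseudocode is not reproduced here): each set stores the positions of its elements within its parent, and a query composes one $\mathrm{select}$ per level up to the root. The names, and the use of plain lists instead of sparse bitvectors, are our illustrative assumptions.

```python
# Illustrative only: retrieval by following the chain of ancestors.
class Node:
    def __init__(self, parent, positions):
        self.parent = parent        # None for the root U
        self.positions = positions  # sorted 1-based positions within the parent
                                    # (for the root: the universe elements)

def retrieve(node, j):
    """Identity of the j-th smallest element of the set at `node`."""
    p = node.positions[j - 1]       # select(S, j): position within the parent
    while node.parent is not None:  # climb until the root U
        node = node.parent
        p = node.positions[p - 1]   # select(parent, p): one level up
    return p

# Universe U = {1..8}; S1 = {2, 3, 7}; S2 = {3, 7}, a subset of S1.
U  = Node(None, [1, 2, 3, 4, 5, 6, 7, 8])
S1 = Node(U, [2, 3, 7])             # positions of 2, 3, 7 within U
S2 = Node(S1, [2, 3])               # 3 and 7 are the 2nd and 3rd elements of S1
assert retrieve(S2, 2) == 7
```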
Definition 3.
The contracted hierarchy of $\mathcal{S}$ has $U$ as its root, and the parent $P'(S)$ of each $S \in \mathcal{S}$ is the highest ancestor $S'$ of $S$ in the hierarchy of $\mathcal{S}$ such that $|S'| \le 2|S|$. If no such ancestor exists, then $P'(S) = P(S)$.
We now prove a couple of relevant properties of this contracted hierarchy.
Lemma 4.
The depth of a node $S$ in the contracted hierarchy of $\mathcal{S}$ is $O(\log\frac{u}{|S|})$. The height of the contracted hierarchy is therefore $O(\log u)$.
Proof.
By definition, the grandparent of a node $S$, if it exists, has size larger than $2|S|$: it is a proper ancestor of $P'(S)$ in the original hierarchy, and thus lies above the highest ancestor of $S$ of size at most $2|S|$. Hence the path towards the root (whose size is $u$) cannot be longer than $2\log\frac{u}{|S|}$. An obvious consequence is that the height of the tree cannot exceed $2\log u$.
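Since set sizes strictly increase towards the root, the contracted parent of Definition 3 can be found by climbing the original parent chain and keeping the last ancestor that still fits within the $2|S|$ bound. The following sketch, with our own names and over plain dictionaries, spells this out.

```python
# Illustrative only: P'(S) per Definition 3, given the original parent
# function P (the root maps to None) and the set sizes.
def contracted_parent(S, P, size):
    best = P[S]                        # fall back to P(S) if nothing qualifies
    anc = P[S]
    while anc is not None and size[anc] <= 2 * size[S]:
        best = anc                     # highest qualifying ancestor so far
        anc = P[anc]                   # sizes only grow: safe to stop at failure
    return best

P    = {"U": None, "A": "U", "B": "A", "C": "B"}
size = {"U": 100, "A": 10, "B": 8, "C": 6}
assert contracted_parent("C", P, size) == "A"  # |A| = 10 <= 2*6, |U| = 100 > 12
```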
If we change $P$ to $P'$ in Algorithm 1, then the retrieval time becomes $O(\log u)$. We now show that the space is not asymptotically affected by contracting the hierarchy.
Lemma 5.
The contracted hierarchy of $\mathcal{S}$ can be represented within $\mathcal{H}_c(\mathcal{S}) + O(n + m \log n)$ bits.
Proof.
We start from our representation that uses $\mathcal{H}_c(\mathcal{S}) + O(n)$ bits, and show how it changes when we replace $P$ by $P'$. In case $P'(S) \ne P(S)$, it holds that $|P'(S)| \le 2|S| \le 2|P(S)|$, because $|S| \le |P(S)|$. Then, changing $P(S)$ to $P'(S)$ increases the size of the representation of $S$ from $\log\binom{|P(S)|}{|S|}$ to at most $\log\binom{2|P(S)|}{|S|} \le \log\binom{|P(S)|}{|S|} + O(|S|)$. Thus, the total increase produced by changing all the parents $P(S)$ to $P'(S)$ is bounded by $O(n)$; we still use $\mathcal{H}_c(\mathcal{S}) + O(n)$ bits. The parent pointers that allow climbing paths also fit in $O(m \log m) \subseteq O(m \log n)$ bits.
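For completeness, the middle inequality above can be verified with the standard binomial bounds $(p/s)^s \le \binom{p}{s}$ and $\binom{2p}{s} \le (2ep/s)^s$, writing $p = |P(S)|$ and $s = |S|$ (our notation for this calculation only):
$$\log\binom{2p}{s} \;\le\; s\log\frac{2ep}{s} \;=\; s\log\frac{p}{s} + s\log(2e) \;\le\; \log\binom{p}{s} + O(s).$$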
In summary, we have the following result.
Lemma 6.
A set $\mathcal{S}$ of $m$ sets of total size $n$ over a universe of size $u$ can be represented within $\mathcal{H}_c(\mathcal{S}) + O(n + m \log n)$ bits so that any element of any set can be retrieved in time $O(\log u)$.
5 Other Operations
5.1 Predecessor and Successor in a Set
Given an element identifier $x \in [1..u]$ and a set $S$, the predecessor of $x$ is the largest $y \le x$ such that $y \in S$. Conversely, the successor of $x$ is the smallest $y \ge x$ such that $y \in S$. We can find the predecessor in $O(\log u)$ time, as follows. We start at the node $S$ and walk up the path to the root $U$, so as to determine the nodes in the path. In the return from the recursion, we start at the position $p = x$ in $U$, and whenever returning from a node $S'$ to its child $S''$, we compute $p \leftarrow \mathrm{rank}(S'', p)$, which gives the number of elements of $S''$ up to the element at position $p$ in $S'$, that is, the position in $S''$ of the predecessor of that element. By the time we return to $S$ again, $p$ is the position in $S$ of the predecessor of $x$; see Algorithm 2. We then find out the identity of that element using Algorithm 1. To find the successor, we use $p \leftarrow \mathrm{rank}(S'', p - 1) + 1$ instead.
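The following sketch mirrors the predecessor computation of Algorithm 2 on the same toy relative representation used in the retrieval sketch of Section 4, with bisect_right playing the role of $\mathrm{rank}$ on each sparse bitvector (again, names and representation are our illustrative assumptions).

```python
# Illustrative only: position within `node` of the predecessor of x.
import bisect

class Node:
    def __init__(self, parent, positions):
        self.parent = parent        # None for the root U
        self.positions = positions  # sorted 1-based positions within the parent

def pred_position(node, x):
    """Position in `node` of the predecessor of x, or 0 if x has none."""
    path = []
    while node is not None:         # walk up to the root U
        path.append(node)
        node = node.parent
    p = x                           # x is a position within the universe
    for nd in reversed(path):       # return from the recursion: one rank per node
        p = bisect.bisect_right(nd.positions, p)  # p <- rank(nd, p)
        if p == 0:
            return 0                # x precedes every element of this set
    return p

U  = Node(None, [1, 2, 3, 4, 5, 6, 7, 8])
S1 = Node(U, [2, 3, 7])
S2 = Node(S1, [2, 3])               # S2 = {3, 7} relative to S1 = {2, 3, 7}
assert pred_position(S2, 5) == 1    # the predecessor of 5 in S2 = {3, 7} is 3
```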
Operation $\mathrm{rank}$ on $S''$ takes time $O(\log\frac{|S'|}{|S''|})$ in the sparse bitvector representation of $S''$, as seen in Section 2. Therefore, the sum of the times along the path from $U$ to $S$ telescopes to $O(\log\frac{u}{|S|}) \subseteq O(\log u)$. This yields the following result.
Lemma 7.
On the same representation of Lemma 6, we can compute the predecessor and successor of any element in any set in time $O(\log u)$.
5.2 Set Operations
These operations are useful to compute set operations, by mimicking any standard algorithm that traverses both ordered sets. In particular, we can implement an intersection algorithm that is close to the alternation complexity lower bound [2]: any algorithm that can find the successor of any element in either set in time $t$ can intersect both sets in time $O(\delta \cdot t)$, where $\delta$ is the alternation complexity of the instance. Predecessor and successor on the complement of a set can also be solved in time $O(\log u)$, by using $\mathrm{rank}_0$ instead of $\mathrm{rank}_1$ (and managing the base case accordingly).
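As an illustration of how successor queries drive intersection in the style of Barbay and Kenyon [2], the following sketch intersects two sorted sets using successor calls only; each call would cost $O(\log u)$ on our representation (here it is a binary search, and all names are ours).

```python
# Illustrative only: intersection driven exclusively by successor queries.
import bisect

def succ(sorted_set, x):
    """Smallest element >= x, or None if there is none."""
    i = bisect.bisect_left(sorted_set, x)
    return sorted_set[i] if i < len(sorted_set) else None

def intersect(A, B):
    out, x = [], min(A[0], B[0])
    while True:
        a = succ(A, x)              # candidate from A
        if a is None:
            return out
        b = succ(B, a)              # smallest element of B that could match a
        if b is None:
            return out
        if a == b:                  # a is in both sets
            out.append(a)
            x = a + 1
        else:
            x = b                   # skip everything of A below b

assert intersect([1, 3, 5, 7, 9], [2, 3, 7, 8]) == [3, 7]
```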
Corollary 8.
On the representation of Lemma 6, we can compute the intersection between any two sets $S$ and $S'$ in time $O(\delta \log u)$, where $\delta$ is the alternation complexity of both sets. We can also compute their union in time $O((|S| + |S'|) \log u)$. The difference $S \setminus S'$ can be computed in time $O(\delta' \log u)$, where $\delta'$ is the alternation complexity of $S$ and the complement of $S'$.
5.3 Back to Bitvectors
Returning to the bitvector semantics, our results allow us to store a set of sparse bitvectors $B_1[1..u], \ldots, B_m[1..u]$, representing subsets of a universe of size $u$, so that operations $\mathrm{access}$, $\mathrm{rank}$, and $\mathrm{select}$ take time $O(\log u)$. To implement $\mathrm{access}(B_i, j)$, we return 1 if the predecessor of $j$ in $B_i$ is $j$ itself, and 0 otherwise. To implement $\mathrm{rank}(B_i, j)$, we compute the position of the predecessor of $j$ in the ordered set and return it. To implement $\mathrm{select}(B_i, j)$, we retrieve the $j$th element of $B_i$, and for $\mathrm{select}_0(B_i, j)$ we do binary search on $\mathrm{rank}_0$.
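The following sketch spells out these reductions on top of any structure that exposes the predecessor position and the $j$th element of the set of 1s; the helper names are ours, and a sorted list stands in for the compressed representation.

```python
# Illustrative only: access, rank, and select_0 reduced to predecessor
# position (pred_pos) and element retrieval (elem) over the set of 1s.
import bisect

S = [2, 3, 7]                                   # 1-positions of B[1..8]
u = 8
pred_pos = lambda j: bisect.bisect_right(S, j)  # position of the predecessor of j
elem = lambda p: S[p - 1]                       # p-th smallest element of the set

def access(j):
    p = pred_pos(j)
    return 1 if p > 0 and elem(p) == j else 0

def rank(j):
    return pred_pos(j)                          # elements of S that are <= j

def select0(j):
    lo, hi = 1, u                               # binary search on rank_0(i) = i - rank(i)
    while lo < hi:
        mid = (lo + hi) // 2
        if mid - rank(mid) >= j:
            hi = mid
        else:
            lo = mid + 1
    return lo

assert access(3) == 1 and rank(5) == 2
assert select0(3) == 5                          # the 0s of B are at 1, 4, 5, 6, 8
```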
Corollary 9.
A set of bitvectors $B_1[1..u], \ldots, B_m[1..u]$ can be represented in $\mathcal{H}_c(\mathcal{S}) + O(n + m \log n)$ bits, where $\mathcal{S}$ is the set of sets $S_i = \{j,\ B_i[j] = 1\}$, $n_i$ is the number of 1s in $B_i$, and $n = \sum_{i=1}^m n_i$. This representation supports operations $\mathrm{access}$, $\mathrm{rank}$, and $\mathrm{select}$ on any $B_i$ in time $O(\log u)$, and $\mathrm{select}_0$ in time $O(\log^2 u)$.
This is to be compared with storing each bitvector directly [9], which yields total space $\mathcal{H}(\mathcal{S}) + O(n)$ and supports $\mathrm{access}$ and $\mathrm{rank}$ in the same time $O(\log u)$, and $\mathrm{select}$ in the better times $O(1)$ for $\mathrm{select}_1$ and $O(\log u)$ for $\mathrm{select}_0$.
6 Construction
We can build the hierarchical representation by adding one set at a time, in decreasing order of size, which ensures that it is always correct to insert the new set $S$ as a leaf. To insert $S$ in $\mathcal{S}$, we first find a smallest set $S'$ such that $S \subsetneq S'$. The sets that contain $S$ form a connected subgraph of the hierarchy that includes the root, so we can traverse the hierarchy from the root, looking for the lowest nodes that contain $S$, and retain the smallest of those. For each hierarchy node $S''$ to check, we take each element of $S$ and verify that it exists in $S''$ via a predecessor query, which takes time $O(\log u)$, because the already inserted part of the hierarchy supports the queries of Lemma 7. We then find a smallest set containing $S$ in time $O(m' \cdot |S| \log u)$, where $m'$ is the current number of sets in the hierarchy; added over all insertions, this is $O(mn \log u)$, where $n$ is the final sum of set sizes.
We then insert $S$ in the hierarchy by setting $P(S) = S'$. To find $P'(S)$, we find the highest ancestor $S''$ of $S$ whose size is at most $2|S|$ (or let $S'' = S'$ if no such ancestor exists), and set $P'(S) = S''$. Finally, we build the representation of $S$ relative to $P'(S)$, in time $O(|S| \log u)$: we obtain the position of each element of $S$ within $P'(S)$ via $\mathrm{rank}$, and then build the sparse bitvector in linear time [9].
Note that, when we find the ancestor $S''$, we traverse the upward path defined by the original parent function $P$, not $P'$. We still use $P'$ during construction to answer, in logarithmic time, the predecessor queries that determine the inclusion of $S$.
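The following high-level sketch captures the insertion order; for clarity it replaces the root-to-leaf traversal and predecessor queries with direct subset tests over the already inserted sets (all names are ours).

```python
# Illustrative only: build the parent function P by inserting sets in
# decreasing order of size, so each new set can only go in as a leaf.
def build_hierarchy(sets):
    sets = [frozenset(s) for s in sets]
    order = sorted(range(len(sets)), key=lambda i: -len(sets[i]))
    parent, inserted = {}, []
    for i in order:
        S = sets[i]
        supersets = [j for j in inserted if S < sets[j]]  # subset tests stand in
                                                          # for predecessor queries
        parent[i] = min(supersets, key=lambda j: len(sets[j]), default=None)
        inserted.append(i)
    return parent                   # None means the parent is the root U

sets = [{1, 2, 3, 4}, {1, 3}, {1, 2, 3}, {3}]
print(build_hierarchy(sets))        # {0: None, 2: 0, 1: 2, 3: 1}
```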
Lemma 10.
The representation of Lemma 6 can be built in time $O(mn \log u)$.
7 Concluding Remarks
We have described the containment entropy, a new entropy measure for collections of sets, which is sensitive to subset relationships between the sets. To our knowledge, this idea has not previously been considered in the vast literature on efficient set representations, which has emerged primarily from efficiency concerns in information retrieval systems [11]. One could consider dictionary-based compression of sets (see, e.g., [10, 4]) as implicitly capturing some aspects of subset relationships, but the representations and analysis we have described here consider subset relationships explicitly.
An interesting direction for future work is to explore the practicality of these hierarchical set representations in various application contexts. An immediate concern is that of hierarchy construction. Our initial experiments with a collection of sets taken from the genomic search engine Themisto [1], applied to a set of 16 thousand bacterial genomes (over 80 GB of data), indicate that, even for large set collections, hierarchy construction is tractable and scales almost linearly with the total length of the sets in practice. On a collection of 10.55 million sets of average size 3,607 (10.52 million of which were subsets of at least one other set), totaling more than 38 billion elements, we were able to find the smallest superset of every set in less than 40 hours, using just 16 threads of execution. The resulting representation used just 0.18 bits per element, compared to the 0.32 bits per element used by Themisto’s representation, which selects between different set representations based on set density.
References
- [1] J. N. Alanko, J. Vuohtoniemi, T. Mäklin, and S. J. Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics, 39(Supplement-1):260–269, 2023. doi:10.1093/BIOINFORMATICS/BTAD233.
- [2] J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Transactions on Algorithms, 4(1):1–18, 2008. doi:10.1145/1328911.1328915.
- [3] D. R. Clark. Compact PAT Trees. PhD thesis, University of Waterloo, Canada, 1996.
- [4] F. Claude and G. Navarro. Fast and compact web graph representations. ACM Transactions on the Web, 4(4):article 16, 2010.
- [5] P. Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21:246–260, 1974. doi:10.1145/321812.321820.
- [6] R. Fano. On the number of bits required to implement an associative memory. Memo 61, Computer Structures Group, Project MAC, Massachusetts, 1971.
- [7] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841–850, 2003.
- [8] J. I. Munro. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pages 37–42, 1996.
- [9] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60–70, 2007.
- [10] G. E. Pibiri, M. Petri, and A. Moffat. Fast dictionary-based compression for inverted indexes. In Proc. 12th ACM International Conference on Web Search and Data Mining (WSDM), pages 6–14, 2019. doi:10.1145/3289600.3290962.
- [11] G. E. Pibiri and R. Venturini. Techniques for inverted index compression. ACM Computing Surveys, 53(6):125:1–125:36, 2021. doi:10.1145/3415148.
