Secure Compressed Suffix Arrays
Abstract
This paper proposes a secure compressed suffix array, which is a data oblivious and compressed version of the suffix array used for finding substrings of a large string. Secure compressed suffix arrays can be used for indexing a large collection of strings containing personal information such as DNA data.
Keywords and phrases:
suffix array, compression, encryption, oblivious algorithm, secure computationCategory:
Research2012 ACM Subject Classification:
Theory of computation Data structures design and analysisEditors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott VitterSeries and Publisher:
Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
As the amount of data increases, the following points are becoming more problematic.
-
Processing time for analyzing the data increases. We need efficient algorithms and data structures.
-
We need to care about the use of sensitive data such as personal information.
For the former problem, many researches on algorithms for compressed data have been conducted. Seminal results on this topic are succinct bit-vectors [6, 11], succinct ordered trees [9], and compressed suffix arrays [4, 2]. For the latter problem, data anonymization [16] and secure computation [17] have been proposed. Because data anonymization modifies input data, some information will be lost. We focus on secure computation, which is a technique to process encrypted data without decryption.
There are two main schemes for secure computation: secret sharing [13] and fully homomorphic encryption [3].
Assume that there are a client and a server. There are two main scenarios.
-
1.
A client has private data and wants to use a cloud service to process the data. Data are stored in the cloud server in an encrypted form. The client runs a program on itself. When some data are necessary, the client asks the server to obtain the data. Then the client decrypts the data, does some computation, encrypts the data and sends back to the server. In this scenario, the task of the server is to store the data and give accesses to a part of the data to the client. The data must be stored as an encrypted form on the server, but computation on encrypted data is not necessary. Therefore it is enough to hide the access pattern to the data on the server.
-
2.
The server stores encrypted data and the secret key is not known to the server. The client asks the server to run a program. The server does some computation and returns the answer to the client. The client decrypts the answer using the secret key. In this scenario, algorithms executed on the server must be data oblivious and the computation must be done on encrypted data.
The second scenario is preferable because the client does not require computation power for analyzing big data. However, we need to design special algorithms for the server which are data oblivious and which run on encrypted data. We call such algorithms secure algorithms.
This paper proposes secure compressed suffix array, which is a data oblivious and encrypted version of the compressed suffix array [4].
2 Preliminaries
2.1 Suffix arrays
Let be a string of length on alphabet of size . The -th character of is denoted by () and it belongs to . We assume that a unique terminator $, which is smaller than any character in , is appended to . That is, . A substring of is denoted by . The substring is called the -th suffix of and denoted by .
The suffix array [8] of the string is an integer array defined as () if is the lexicographically the -th suffix of . It always holds that , which corresponds to the shortest suffix consisting of only the terminator.
If we have and , we can support the following queries:
-
: returns the number of occurrences of in in time
-
: returns the positions of occurrences of in in time where
To support these queries, we use the following query:
-
: returns the maximal range so that for any the substring is equal to .
Let . Then it holds and .
The size of the suffix array is bits. We also need to store the string itself, which occupies bits. The suffix array requires a huge space compared with the string itself, especially for strings on small alphabets, such as DNA strings. For human DNA, , whereas . Then is more than times larger than .
2.2 Succinct bit-vectors
Succinct data structures are data structures storing objects in minimum number of bits. More precisely, consider storing an element of a set . Then the information-theoretic lower bound of the number of bits to represent is defined as . Hereafter we omit the base of logarithm. Let denote this value. Then a data structure for storing is called succinct if it uses bits, and compact if it uses bits.
The most fundamental succinct data structure is a bit-vector supporting rank and select queries. A bit-vector is a string on the binary alphabet and rank and select are defined as follows.
-
returns the number of ’s in . We define and if .
-
returns the position of the -th in . We define and if .
For a bit-vector of length , rankand selectare computed in constant time using the bit-vector itself and an bit auxiliary data structure [11].
We briefly review the rank data structure. The bit-vector is partitioned into super-blocks of length each, and each super-block is further partitioned into blocks of length each. We store rank values for all super-block boundaries in array using bits, and rank values for all block boundaries in array using bits. Inside a block, we count the number of ’s using table lookups for every bits. The size of the table is and the time complexity is . If we set and , we obtain the desired bound.
2.3 Secure algorithms
If we forget about time efficiency, any computation can be done on encrypted data if we can support additions and multiplications on two encrypted integers. There exist two such schemes.
-
Secret Sharing [13]. Data are distributed into two or more servers and for each server the stored data look like random values and no information is leaked. Additions can be done locally in each server, whereas for multiplications the servers must comminicate each other.
Both schemes have a drawback that the computation is much slower than plain (unencrypted) data. In Secret Sharing schemes the servers communicate each other for computing multiplications. This takes much more time than the computation on plain data in a single server. In FHE schemes, there are no communication because there is only one server. However multiplications are extremely slow. Therefore it is important to develop efficient secure algorithms.
In this paper, we assume that in both schemes, integer addition, multiplication, division, less-than comparison are done efficiently in unit time. Then our algorithm runs in both schemes.
2.4 Oblivious RAM
Oblivious RAMs [7, 15, 10], or ORAMs for short, are data structures supporting oblivious read and write to an array. Without loss of generality we can assume the array stores a binary string of length . can be regarded as an array of length storing -bit integers. An ORAM has a parameter called the block size. is partitioned into blocks of length each, and a block is accessed as a unit. To achieve oblivious access, more than one blocks are accessed to obtain one block. The ratio is called bandwidth blowup.
Obliviousness is defined as follows. Let be a sequence of accesses to an ORAM where is either read or write, is the address of the -th access, and is the content to be written in the -th block when is write. Let be denote the sequence of accesses to the server given . Then the ORAM is said to be oblivious if two any access sequences and of the same length, and are computationally indistinguishable by anyone but the client, and if the failure probability is negligible [15].
Table 1 shows existing oblivious RAM data structures. Onodera and Shibuya [10] give a succinct oblivious RAM, that is, the server space is bits. The other two data structures are compact, that is, the server space is bits.
| Server space (#bits) | Bandwidth blowup | User space (#blocks) | Reference |
|---|---|---|---|
| Kushilevitz et al. [7] | |||
| Stefanov et al. [15] | |||
| Onodera, Shibuya [10] |
3 Compressed Suffix Arrays
3.1 The original structure
Grossi and Vitter [5, 4] have proposed compressed suffix arrays, which are a compressed version of suffix arrays. Let be the suffix array of string of length on alphabet of size . The compressed suffix array is a data structure which efficiently supports the following operation:
-
: returns .
-
: returns such that (the inverse suffix array).
The core component of the compressed suffix arrays is the function, defined as follows.
if , and . The function has a good property that it is piecewise monotone. Precisely, let be the range of the suffix array such that for any where . Then if and , it holds . Then the following function is strictly increasing.
We can compress in bits so that any is computed in constant time. From , we can obtain and easily in constant time.
The encoding of is as follows. each entry of is a -bit integer. We partition it into higher part and lower part. The higher part has bits and the lower part has the rest ( bits). The lower parts for all entries are stored in an integer array in bits. The upper parts for all entries are represented by a bit-vector as follows. Let for (we assume ). Because is increasing, for any . We encode ’s by unary codes. That is, we write many zeros followed by a one, to . Then, can be computed in constant time as
From the definition of , if , it holds . That is, if we know the lexicographic order of a suffix , we can obtain that of the next suffix by computing . We sample the values of the suffix array for every entries and stores them in an array . We also store a bit-vector such that iff is sampled. Then is computed as follows.
The array uses bits and uses bits. If we set , the space for is bits and is computed in time. See Figure 1 for an example of
3.2 Self-indexes
The original compressed suffix array is a compressed representation of the suffix array and used together with the string itself. We can change it to a self-index, that is, a data structure for string matching which does not use the string. It is enough to add bits. We use a bit-vector showing that if then or . That is, iff lexicographically the -th suffix has a different head character from the -st suffix. We use an array of length such that . We define . The array uses bits (we assume ). We use another array storing in the range such that for any it holds . This array uses bits.
Using , we can recover a substring of . Assume the lexicographic order of a suffix is known (). Then the character is equal to . The substring is computed in time by iteratively computing . Computing the lexicographic order of is equivalent to computing , and it is done similarly to computing an entry of the suffix array using and the sampled suffix array [12].
To support , we use what we call backward search. Let be the length of . First we obtain the range for the shortest suffix by . Then, given the range for a suffix , we compute the range for the suffix . This is done as follows.
By a simple binary search, the new range is obtained in time, and therefore is done in time, which is the same as the suffix array [12].
We can simplify the algorithm as follows.
Now we do not need the array . This has another merit that it is easier to make it oblivious, which will be shown in the next section.
4 Secure Compressed Suffix Arrays
We show that the compressed suffix arrays can be modified so that all the operations , , , and . First we show that , , and can be done obliviously if is computed obliviously. We assume that the lengths of the string and the pattern , and the alphabet size are public.
4.1 Computing
As shown in Section 3.2, is done by many binary searches on . Given the range for , we compute the range for . To do so, we first compute and , which can be done obliviously. Then using a binary search on we compute . The initial range is , and in each step we update the range based on the result of a less-than comparison. Using is easier than using because we do not need the array that must support oblivious accesses. We can compute using many oblivious accesses to .
4.2 Computing and
can be done using the sampled array , the bit-vector , and . The original algorithm repeats computing until . This is not oblivious because the number of iteration depends on . Precisely, it holds where is the smallest number such that .
To change the algorithm oblivious, we fix the number of iterations to . Because the suffix array entries are sampled every entries in text order, for exactly one entry it holds among those for . Therefore we change the algorithm as follows.
-
1.
-
2.
for
-
3.
-
4.
-
5.
return
Here we need oblivious accesses to and , and oblivious computation of .
For , we use the Path ORAM [15] with block size . Then the space usage is bits and an entry of is obtained in many accesses to blocks, which can be done in time.
For , we use the succinct ORAM [10] with block size . Then a block of is obtained in in many accesses to blocks, which can be also done in time. For computing a rank on , we use the array defined in Section 2.2. This is stored using the Path ORAM. The space usage is bits. The rank inside a block is computed by logical operations in time. Therefore a rank value is computed in time.
The computation of is similar.
4.3 Computing
Finally, we show how to compute . As shown in Section 3.1, is represented by a bit-vector and an array of -bit integers. The array is stored using the succinct ORAMs. The bit-vector is stored similarly to , but here we also need to store the auxiliary data structure for . This can be stored using the Path ORAM.
To sum up, is computed in time and the space usage is bits.
4.4 Summary
takes time. A step of a backward search is a binary search on , and therefore it is done in time. Then takes time. For , we set . Then it takes time. The space usage is bits in total.
5 Concluding Remarks
We have proposed secure compressed suffix arrays. For a string of length on an alphabet of size , It uses bits of space and the query is done in time, and the query is done in time. Therefore there is an multiplicative overhead compared with the original compressed suffix arrays [4]. To improve the running time, we need more efficient succinct ORAM [10] and standard ORAMs [14]. Future work will be developing such oblivious RAM data structures and giving efficient implementions.
References
- [1] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. Tfhe: Fast fully homomorphic encryption over the torus. J. Cryptol., 33(1):34–91, January 2020. doi:10.1007/s00145-019-09319-x.
- [2] P. Ferragina and G. Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
- [3] Craig Gentry. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC ’09, pages 169–178, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1536414.1536440.
- [4] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
- [5] Roberto Grossi and Jeffrey S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In Proceedings of the Thirty-Second Annual ACM Symposium on the Theory of Computing, pages 397–406, Portland, OR, 2000.
- [6] G. Jacobson. Space-efficient Static Trees and Graphs. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 549–554, 1989.
- [7] Eyal Kushilevitz, Steve Lu, and Rafail Ostrovsky. On the (in)security of hash-based oblivious ram and a new balancing scheme. In Proceedings of the 2012 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 143–156, 2012. doi:10.1137/1.9781611973099.13.
- [8] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi:10.1137/0222058.
- [9] J. I. Munro and V. Raman. Succinct Representation of Balanced Parentheses and Static Trees. SIAM Journal on Computing, 31(3):762–776, 2001. doi:10.1137/S0097539799364092.
- [10] Taku Onodera and Tetsuo Shibuya. Succinct Oblivious RAM. In Rolf Niedermeier and Brigitte Vallée, editors, 35th Symposium on Theoretical Aspects of Computer Science (STACS 2018), volume 96 of Leibniz International Proceedings in Informatics (LIPIcs), pages 52:1–52:16, Dagstuhl, Germany, 2018. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.STACS.2018.52.
- [11] R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4), 2007. doi:10.1145/1290672.1290680.
- [12] Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. J. Algorithms, 48(2):294–313, 2003. doi:10.1016/S0196-6774(03)00087-7.
- [13] Adi Shamir. How to share a secret. Commun. ACM, 22(11):612–613, November 1979. doi:10.1145/359168.359176.
- [14] Elaine Shi, T. H. Hubert Chan, Emil Stefanov, and Mingfei Li. Oblivious ram with o((logn)3) worst-case cost. In Dong Hoon Lee and Xiaoyun Wang, editors, Advances in Cryptology – ASIACRYPT 2011, pages 197–214, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. doi:10.1007/978-3-642-25385-0_11.
- [15] Emil Stefanov, Marten van Dijk, Elaine Shi, Christopher Fletcher, Ling Ren, Xiangyao Yu, and Srinivas Devadas. Path oram: an extremely simple oblivious ram protocol. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, CCS ’13, pages 299–310, New York, NY, USA, 2013. Association for Computing Machinery. doi:10.1145/2508859.2516660.
- [16] Latanya Sweeney. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, October 2002. doi:10.1142/S0218488502001648.
- [17] Andrew Chi-Chih Yao. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science (sfcs 1986), pages 162–167, 1986. doi:10.1109/SFCS.1986.25.
