
Secure Compressed Suffix Arrays

Kunihiko Sadakane ORCID The University of Tokyo, Japan
Abstract

This paper proposes a secure compressed suffix array, which is a data oblivious and compressed version of the suffix array used for finding substrings of a large string. Secure compressed suffix arrays can be used for indexing a large collection of strings containing personal information such as DNA data.

Keywords and phrases:
suffix array, compression, encryption, oblivious algorithm, secure computation
Category:
Research
Copyright and License:
© Kunihiko Sadakane; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Data structures design and analysis
Editors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter

1 Introduction

As the amount of data increases, the following issues become more serious.

  • The processing time for analyzing the data increases. We need efficient algorithms and data structures.

  • We need to take care with sensitive data such as personal information.

For the former problem, much research has been conducted on algorithms for compressed data. Seminal results on this topic are succinct bit-vectors [6, 11], succinct ordered trees [9], and compressed suffix arrays [4, 2]. For the latter problem, data anonymization [16] and secure computation [17] have been proposed. Because data anonymization modifies the input data, some information is lost. We focus on secure computation, a technique for processing encrypted data without decryption.

There are two main schemes for secure computation: secret sharing [13] and fully homomorphic encryption [3].

Assume that there are a client and a server. There are two main scenarios.

  1. A client has private data and wants to use a cloud service to process the data. The data are stored on the cloud server in encrypted form. The client runs a program by itself. When some data are necessary, the client asks the server for them. The client then decrypts the data, does some computation, encrypts the result, and sends it back to the server. In this scenario, the task of the server is to store the data and give the client access to parts of it. The data must be stored in encrypted form on the server, but computation on encrypted data is not necessary. Therefore it is enough to hide the access pattern to the data on the server.

  2. The server stores encrypted data and the secret key is not known to the server. The client asks the server to run a program. The server does some computation and returns the answer to the client. The client decrypts the answer using the secret key. In this scenario, algorithms executed on the server must be data oblivious and the computation must be done on encrypted data.

The second scenario is preferable because the client does not require computation power for analyzing big data. However, we need to design special algorithms for the server which are data oblivious and which run on encrypted data. We call such algorithms secure algorithms.

This paper proposes the secure compressed suffix array, a data oblivious and encrypted version of the compressed suffix array [4].

2 Preliminaries

2.1 Suffix arrays

Let T be a string of length n on an alphabet 𝒜 of size σ. The j-th character of T is denoted by T[j] (j = 0, 1, …, n−1) and belongs to 𝒜. We assume that a unique terminator $, which is smaller than any character in 𝒜, is appended to T; that is, T[n] = $. A substring T[i]T[i+1]⋯T[j] of T is denoted by T[i, j]. The substring T[j, n] is called the j-th suffix of T and denoted by T_j.

The suffix array [8] of the string T is an integer array SA[0, n] defined by SA[i] = j (i = 0, 1, …, n) if T_j is lexicographically the i-th smallest suffix of T. It always holds that SA[0] = n, which corresponds to the shortest suffix, consisting only of the terminator.

If we have T and SA, we can support the following queries:

  • Count(P, T): returns the number of occurrences of P in T, in O(|P| log n) time.

  • Locate(P, T): returns the positions of the occurrences of P in T, in O(|P| log n + occ) time, where occ = Count(P, T).

To support these queries, we use the following query:

  • Range(P, T): returns the maximal range [s, e] ⊆ [0, n] such that for any i ∈ [s, e] the substring T[SA[i], SA[i]+|P|−1] is equal to P.

Let [s, e] = Range(P, T). Then it holds that Count(P, T) = e − s + 1 and Locate(P, T) = {SA[s], SA[s+1], …, SA[e]}.
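As a concrete (and deliberately naive) illustration of these queries, the following sketch builds the suffix array by sorting suffixes and answers Range, Count, and Locate by binary search. This is not the paper's data structure, only the plain-array semantics it must reproduce; the sentinel `"\x7f"` is an assumption that the alphabet is printable ASCII.

```python
# Naive sketch: SA by suffix sorting, queries by binary search on suffixes.
import bisect

def suffix_array(T):
    s = T + "$"                      # '$' is smaller than ASCII letters
    return sorted(range(len(s)), key=lambda j: s[j:])

def Range(P, T, SA):
    s = T + "$"
    sufs = [s[j:] for j in SA]       # materialized for clarity only
    lo = bisect.bisect_left(sufs, P)
    hi = bisect.bisect_right(sufs, P + "\x7f") - 1   # '\x7f' > any letter
    return lo, hi                    # empty range when lo > hi

def Count(P, T, SA):
    lo, hi = Range(P, T, SA)
    return max(0, hi - lo + 1)

def Locate(P, T, SA):
    lo, hi = Range(P, T, SA)
    return sorted(SA[i] for i in range(lo, hi + 1))

T = "abracadabra"
SA = suffix_array(T)
assert SA[0] == len(T)               # SA[0] always points to the terminator
assert Count("abra", T, SA) == 2
assert Locate("abra", T, SA) == [0, 7]
```

The O(n² log n) construction and the materialized suffix list are for exposition only; the compressed structures below exist precisely to avoid them.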

The size of the suffix array SA is n log n bits. We also need to store the string T itself, which occupies n log σ bits. The suffix array requires a huge space compared with the string itself, especially for strings on small alphabets such as DNA strings. For human DNA, σ = 4, whereas n > 2³¹. Then log n is more than 15 times larger than log σ.

2.2 Succinct bit-vectors

Succinct data structures are data structures storing objects in the minimum number of bits. More precisely, consider storing an element x of a set S. The information-theoretic lower bound on the number of bits needed to represent x is log₂ |S|. Hereafter we omit the base 2 of logarithms. Let Z denote this value. A data structure for storing x is called succinct if it uses Z + o(Z) bits, and compact if it uses O(Z) bits.

The most fundamental succinct data structure is a bit-vector supporting rank and select queries. A bit-vector is a string B[1, n] on the binary alphabet {0, 1}, and rank and select are defined as follows.

  • rank_c(B, i) returns the number of c's in B[1, i]. We define rank_c(B, 0) = 0 and rank_c(B, i) = rank_c(B, n) for i > n.

  • select_c(B, j) returns the position of the j-th c in B. We define select_c(B, 0) = 0 and select_c(B, j) = n + 1 if j > rank_c(B, n).

For a bit-vector of length n, rank and select are computed in constant time using the bit-vector itself and an O(n log log n / log n)-bit auxiliary data structure [11].

We briefly review the rank data structure. The bit-vector B is partitioned into super-blocks of length L₁ each, and each super-block is further partitioned into blocks of length L₂ each. We store the rank values at all super-block boundaries in an array R₁ using O((n log n)/L₁) bits, and the rank values at all block boundaries, relative to the enclosing super-block, in an array R₂ using O((n log L₁)/L₂) bits. Inside a block, we count the number of c's using table lookups, (1/2) log n bits at a time. The size of the table is O(√n log n log log n) = o(n) and the time complexity is O(L₂ / log n). If we set L₁ = Θ(log² n) and L₂ = (1/2) log n, we obtain the desired bounds: o(n) bits of extra space and constant query time.
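The two-level directory can be sketched as follows. This is illustrative only (the toy L₁, L₂ and the final linear scan, which stands in for the table lookup, are assumptions; a real implementation packs everything into machine words):

```python
# Two-level rank directory: R1 holds ranks at super-block boundaries,
# R2 holds ranks at block boundaries relative to the super-block.
class RankBV:
    def __init__(self, bits, L1=16, L2=4):   # toy parameters, L2 divides L1
        assert L1 % L2 == 0
        self.bits, self.L1, self.L2 = bits, L1, L2
        self.R1, self.R2 = [], []
        total = 0
        for i, b in enumerate(bits):
            if i % L1 == 0:
                self.R1.append(total)              # rank at super-block start
            if i % L2 == 0:
                self.R2.append(total - self.R1[-1])  # offset within super-block
            total += b
        self.total = total

    def rank1(self, i):
        """Number of ones in bits[0..i-1] (clamped to the vector length)."""
        if i <= 0:
            return 0
        if i >= len(self.bits):
            return self.total
        blk = i // self.L2
        sup = (blk * self.L2) // self.L1
        # scan inside the block stands in for the o(n)-bit lookup table
        return self.R1[sup] + self.R2[blk] + sum(self.bits[blk * self.L2:i])

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 7
bv = RankBV(bits)
assert all(bv.rank1(i) == sum(bits[:i]) for i in range(len(bits) + 1))
```

With L₁ = Θ(log² n) and L₂ = (1/2) log n, R₁ and R₂ occupy o(n) bits, matching the bound quoted above.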

2.3 Secure algorithms

If we disregard time efficiency, any computation can be done on encrypted data as long as we can perform additions and multiplications on two encrypted integers. There exist two such schemes.

  • Secret sharing [13]. Data are distributed among two or more servers, and for each server the stored data look like random values, so no information is leaked. Additions can be done locally in each server, whereas for multiplications the servers must communicate with each other.

  • Fully Homomorphic Encryption (FHE) [3, 1]. Any number of additions and multiplications on encrypted numbers can be done.

Both schemes have the drawback that computation is much slower than on plain (unencrypted) data. In secret sharing schemes, the servers communicate with each other to compute multiplications, which takes much more time than computation on plain data in a single server. In FHE schemes there is no communication because there is only one server, but multiplications are extremely slow. Therefore it is important to develop efficient secure algorithms.

In this paper, we assume that in both schemes, integer addition, multiplication, division, and less-than comparison are each done in unit time. Our algorithm then runs in both schemes.

2.4 Oblivious RAM

Oblivious RAMs [7, 15, 10], or ORAMs for short, are data structures supporting oblivious reads and writes to an array. Without loss of generality we can assume the array stores a binary string S of length n; S can be regarded as an array of n/w many w-bit integers. An ORAM has a parameter b called the block size. S is partitioned into blocks of b bits each, and a block is accessed as a unit. To achieve oblivious access, more than one block is accessed to obtain a single block. The ratio is called the bandwidth blowup.

Obliviousness is defined as follows. Let y = ((op₁, a₁, d₁), (op₂, a₂, d₂), …, (op_M, a_M, d_M)) be a sequence of accesses to an ORAM, where opᵢ is either read or write, aᵢ is the address of the i-th access, and dᵢ is the content to be written to the aᵢ-th block when opᵢ is write. Let A(y) denote the sequence of accesses to the server given y. The ORAM is said to be oblivious if, for any two access sequences y and z of the same length, A(y) and A(z) are computationally indistinguishable by anyone but the client, and the failure probability is negligible [15].

Table 1 shows existing oblivious RAM data structures. Onodera and Shibuya [10] give a succinct oblivious RAM, that is, one whose server space is n + o(n) bits. The other two data structures are compact, that is, their server space is O(n) bits.

Table 1: Oblivious RAM data structures for storing n bits. Bits are grouped into blocks of b log n bits each. Bandwidth blowup is the number of blocks that must be accessed to obtain one block obliviously.

Server space (#bits)  | Bandwidth blowup     | User space (#blocks) | Reference
O(n)                  | O(log²n / log log n) | O(1)                 | Kushilevitz et al. [7]
O(n)                  | O(log²n)             | ω(log n)             | Stefanov et al. [15]
n(1 + Θ(1/log n))     | O(log²n)             | O(log n · log log n) | Onodera, Shibuya [10]

3 Compressed Suffix Arrays

3.1 The original structure

Grossi and Vitter [5, 4] proposed the compressed suffix array, a compressed version of the suffix array. Let SA be the suffix array of a string T of length n on an alphabet 𝒜 of size σ. The compressed suffix array is a data structure which efficiently supports the following operations:

  • Lookup(i): returns SA[i].

  • Inverse(j): returns i such that j=SA[i] (the inverse suffix array).

The core component of the compressed suffix array is the Ψ function, defined by

Ψ[i] = SA⁻¹[SA[i]+1]

for i > 0, and Ψ[0] = SA⁻¹[0]. The Ψ function has the useful property of being piecewise monotone. Precisely, let [s_c, e_c] be the range of the suffix array such that T[SA[i]] = c for any i ∈ [s_c, e_c], where c ∈ 𝒜. Then if i, j ∈ [s_c, e_c] and i < j, it holds that Ψ[i] < Ψ[j]. Consequently the following function, in which characters are regarded as integers (with $ = 0), is strictly increasing:

Ψ′[i] = T[SA[i]] · (n+1) + Ψ[i]

We can compress Ψ′ into n(2 + log σ) + o(n) bits so that any Ψ′[i] is computed in constant time. From Ψ′[i], we can obtain T[SA[i]] and Ψ[i] easily in constant time.
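The definitions above can be checked directly on a small example. The sketch below computes Ψ and Ψ′ from the plain suffix array and verifies both the strict monotonicity of Ψ′ and the identity SA[Ψ[i]] = SA[i] + 1; the convention Ψ[0] = SA⁻¹[0] and the mapping of $ to the integer 0 are the assumptions stated in the text.

```python
# Sketch: Psi and Psi' from their definitions ('$' is character 0).
def build_psi(T):
    s = T + "$"
    n = len(T)
    SA = sorted(range(n + 1), key=lambda j: s[j:])
    inv = [0] * (n + 1)
    for i, j in enumerate(SA):
        inv[j] = i
    # Psi[i] = SA^{-1}[SA[i]+1] for i > 0; Psi[0] = SA^{-1}[0]
    psi = [inv[(SA[i] + 1) % (n + 1)] for i in range(n + 1)]
    char = {c: k for k, c in enumerate(sorted(set(s)))}   # '$' -> 0
    psi_prime = [char[s[SA[i]]] * (n + 1) + psi[i] for i in range(n + 1)]
    return SA, psi, psi_prime

SA, psi, psi_prime = build_psi("mississippi")
# Psi' is strictly increasing over the whole array
assert all(a < b for a, b in zip(psi_prime, psi_prime[1:]))
# and walking Psi advances the text position by one
assert all(SA[psi[i]] == SA[i] + 1 for i in range(1, len(SA)))
```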

The encoding of Ψ′ is as follows. Each entry of Ψ′ is a (log σ + log n)-bit integer. We partition it into a higher part and a lower part: the higher part has log n bits and the lower part has the remaining log σ bits. The lower parts of all entries are stored in an integer array L[0, n] using n log σ bits. The higher parts of all entries are represented by a bit-vector H[0, 2n] as follows. Let dᵢ = (Ψ′[i] ÷ σ) − (Ψ′[i−1] ÷ σ) for i = 0, 1, …, n (we let Ψ′[−1] = 0). Because Ψ′ is increasing, dᵢ ≥ 0 for any i. We encode the dᵢ by unary codes; that is, for each i we append dᵢ zeros followed by a one to H. Then the higher part of Ψ′[i] can be computed in constant time as

Ψ′[i] ÷ σ = select₁(H, i+1) − (i+1)
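A toy version of this H/L encoding, on an arbitrary increasing sequence with an illustrative σ = 4 (the slow linear-scan select stands in for the constant-time structure of Section 2.2):

```python
# Delta-encode the higher parts v // sigma in unary into H; lower parts in L.
def encode(values, sigma):
    H, L, prev = [], [], 0
    for v in values:
        d = v // sigma - prev
        H += [0] * d + [1]            # d zeros followed by a one
        L.append(v % sigma)
        prev = v // sigma
    return H, L

def select1(H, j):
    """1-based position of the j-th one in H (linear scan for exposition)."""
    cnt = 0
    for p, b in enumerate(H, start=1):
        cnt += b
        if cnt == j:
            return p
    raise ValueError("fewer than j ones")

def decode(H, L, i, sigma):
    high = select1(H, i + 1) - (i + 1)   # zeros before the (i+1)-th one
    return high * sigma + L[i]

vals = [2, 5, 6, 13, 21]
H, L = encode(vals, 4)
assert [decode(H, L, i, 4) for i in range(len(vals))] == vals
```

The number of zeros preceding the (i+1)-th one is d₀ + ⋯ + dᵢ, which telescopes to Ψ′[i] ÷ σ, so decode recovers every entry exactly.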

From the definition of Ψ, if SA[i] = j then SA[Ψ[i]] = j + 1. That is, if we know the lexicographic order i of a suffix T_j, we can obtain that of the next suffix T_{j+1} by computing Ψ[i]. We sample the values of the suffix array every L₃ entries and store them in an array A. We also store a bit-vector F[0, n] such that F[i] = 1 iff SA[i] is sampled. Then Lookup(i) is computed as follows.

Lookup(i) = A[rank₁(F, i)]        if F[i] = 1
Lookup(i) = Lookup(Ψ[i]) − 1      if F[i] = 0

The array A uses O((n log n)/L₃) bits and F uses n + o(n) bits. If we set L₃ = Θ(log n), the space for A is O(n) bits and Lookup(i) is computed in O(log n) time. See Figure 1 for an example.

Figure 1: An example of compressed suffix arrays.
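The recurrence can be sketched as follows. One detail is an assumption of this sketch rather than part of the text: positions congruent to n (mod L₃) are the sampled ones, so the forward walk j, j+1, … always reaches a sample without wrapping past the end of the text (the text only requires sampling every L₃ positions in text order).

```python
# Sketch of Lookup via the sampled array A and the recurrence
# Lookup(i) = Lookup(psi[i]) - 1.
def make_lookup(T, L3=3):
    s = T + "$"
    n = len(T)
    SA = sorted(range(n + 1), key=lambda j: s[j:])
    inv = [0] * (n + 1)
    for i, j in enumerate(SA):
        inv[j] = i
    psi = [inv[(SA[i] + 1) % (n + 1)] for i in range(n + 1)]
    # sample text positions n, n-L3, n-2*L3, ...
    F = [1 if SA[i] % L3 == n % L3 else 0 for i in range(n + 1)]
    A = [SA[i] for i in range(n + 1) if F[i]]      # in suffix-array order
    rank1 = lambda i: sum(F[:i + 1])               # slow rank over F[0..i]
    def Lookup(i):
        return A[rank1(i) - 1] if F[i] else Lookup(psi[i]) - 1
    return SA, Lookup

SA, Lookup = make_lookup("mississippi")
assert [Lookup(i) for i in range(len(SA))] == SA
```

Each recursive step advances the text position by one, so a sample is reached within L₃ − 1 steps, matching the O(log n) bound for L₃ = Θ(log n).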

3.2 Self-indexes

The original compressed suffix array is a compressed representation of the suffix array and is used together with the string itself. We can change it into a self-index, that is, a data structure for string matching which does not use the string. It is enough to add n + o(n) + O(σ log n) bits. We use a bit-vector D[0, n] such that D[i] = 1 iff i = 0 or T[SA[i]] ≠ T[SA[i−1]]. That is, D[i] = 1 iff lexicographically the i-th suffix has a different head character from the (i−1)-st suffix. We use an array C of length σ such that C[rank₁(D, i)] = T[SA[i]], and define Head(i) = C[rank₁(D, i)]. The array uses σ log σ = O(σ log n) bits (we assume σ ≤ n). We use another array C₁[1, σ] storing in C₁[c] the range [s_c, e_c] ⊆ [0, n] such that T[SA[i]] = c for any i ∈ [s_c, e_c]. This array uses O(σ log n) bits.

Using C, we can recover a substring of T. Assume the lexicographic order i of a suffix T_j is known (SA[i] = j). Then the character T[j+k] is equal to Head(Ψᵏ[i]). The substring T[j, j+ℓ−1] is computed in O(ℓ) time by iteratively computing i := Ψ[i]. Computing the lexicographic order i of T_j is equivalent to computing Inverse(j), and it is done similarly to computing an entry of the suffix array using Ψ and the sampled suffix array [12].

To support Range(P, T), we use what we call backward search. Let m be the length of P. First we obtain the range for the shortest suffix P[m, m] as [s, e] := C₁[P[m]]. Then, given the range [s, e] for a suffix P[j+1, m], we compute the range [s′, e′] for the suffix P[j, m]. This is done as follows.

s′ := min { i ∈ C₁[P[j]] : Ψ[i] ≥ s }
e′ := max { i ∈ C₁[P[j]] : Ψ[i] ≤ e }

By a simple binary search, the new range is obtained in O(logn) time, and therefore Range(P,T) is done in O(|P|logn) time, which is the same as the suffix array [12].

We can simplify the algorithm as follows.

s′ := min { i ∈ [0, n] : Ψ′[i] ≥ P[j]·(n+1) + s }
e′ := max { i ∈ [0, n] : Ψ′[i] ≤ P[j]·(n+1) + e }

Now we do not need the array C₁. Another merit is that this version is easier to make oblivious, as will be shown in the next section.
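The simplified backward search can be sketched directly: because Ψ′ is strictly increasing, each step is just two binary searches, here done with Python's `bisect` (the naive construction of Ψ′ inside the function is for self-containment only).

```python
# Backward search over the strictly increasing Psi' (two bisects per step).
import bisect

def backward_search(P, T):
    s_ = T + "$"
    n = len(T)
    SA = sorted(range(n + 1), key=lambda j: s_[j:])
    inv = [0] * (n + 1)
    for i, j in enumerate(SA):
        inv[j] = i
    psi = [inv[(SA[i] + 1) % (n + 1)] for i in range(n + 1)]
    char = {c: k for k, c in enumerate(sorted(set(s_)))}   # '$' -> 0
    psi_p = [char[s_[SA[i]]] * (n + 1) + psi[i] for i in range(n + 1)]
    s, e = 0, n                               # range of the empty suffix
    for c in reversed(P):
        if c not in char:
            return None
        s = bisect.bisect_left(psi_p, char[c] * (n + 1) + s)
        e = bisect.bisect_right(psi_p, char[c] * (n + 1) + e) - 1
        if s > e:
            return None                       # no occurrence
    return s, e

T = "mississippi"
SA = sorted(range(len(T) + 1), key=lambda j: (T + "$")[j:])
r = backward_search("issi", T)
assert r is not None
assert sorted(SA[i] for i in range(r[0], r[1] + 1)) == [1, 4]
```

Note that the very first iteration, applied to the range [0, n] of the empty suffix, yields exactly C₁[P[m]], which is why the array C₁ is no longer needed.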

4 Secure Compressed Suffix Arrays

We show that the compressed suffix array can be modified so that all the operations Ψ′, Range, Lookup, and Inverse are computed obliviously. First we show that Range, Lookup, and Inverse can be done obliviously if Ψ′ is computed obliviously. We assume that the lengths of the string T and the pattern P, and the alphabet size σ, are public.

4.1 Computing Range

As shown in Section 3.2, Range(P, T) is done by |P| many binary searches on Ψ′. Given the range [s, e] for P[j+1, m], we compute the range [s′, e′] for P[j, m]. To do so, we first compute P[j]·(n+1) + s and P[j]·(n+1) + e, which can be done obliviously. Then, using a binary search on Ψ′, we compute s′; the initial search range is [0, n], and in each step we update the range based on the result of a less-than comparison. The value e′ is computed analogously. Using Ψ′ is easier than using Ψ because we do not need the array C₁, which would have to support oblivious accesses. In total, Range(P, T) uses O(|P| log n) many oblivious accesses to Ψ′.

4.2 Computing Lookup and Inverse

Lookup(i) can be done using the sampled array A, the bit-vector F, and Ψ. The original algorithm repeats i := Ψ[i] until F[i] = 1. This is not oblivious because the number k of iterations depends on i. Precisely, it holds that Lookup(i) = A[rank₁(F, Ψᵏ[i])] − k, where k ≥ 0 is the smallest number such that F[Ψᵏ[i]] = 1.

To make the algorithm oblivious, we fix the number of iterations to L₃. Because the suffix array entries are sampled every L₃ entries in text order, exactly one k ∈ {0, 1, …, L₃−1} satisfies F[Ψᵏ[i]] = 1. Therefore we change the algorithm as follows.

  1. x := 0
  2. for k = 0, 1, …, L₃−1:
  3.   x := x + F[i] · (A[rank₁(F, i)] − k)
  4.   i := Ψ[i]
  5. return x

Here we need oblivious accesses to F and A, and oblivious computation of rank1(F,i).
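The fixed-iteration loop above can be sketched as follows. ORAM reads are simulated by plain array accesses here; the choice of sampling positions congruent to n (mod L₃) is an assumption of this toy setup (with n = 11 and L₃ = 3) under which exactly one of the L₃ visited positions is sampled.

```python
# Data-oblivious Lookup: always L3 iterations; the 0/1 flag F[i] makes
# exactly one iteration contribute, so control flow never depends on i.
def oblivious_lookup(i, psi, F, A, rank1, L3):
    x = 0
    for k in range(L3):
        # When F[i] = 0 this is a dummy read of A, multiplied away by 0.
        x += F[i] * (A[rank1(i) - 1] - k)
        i = psi[i]
    return x

T = "mississippi"
n, L3 = len(T), 3
s = T + "$"
SA = sorted(range(n + 1), key=lambda j: s[j:])
inv = [0] * (n + 1)
for i, j in enumerate(SA):
    inv[j] = i
psi = [inv[(SA[i] + 1) % (n + 1)] for i in range(n + 1)]
F = [1 if SA[i] % L3 == n % L3 else 0 for i in range(n + 1)]
A = [SA[i] for i in range(n + 1) if F[i]]
rank1 = lambda i: sum(F[:i + 1])
assert [oblivious_lookup(i, psi, F, A, rank1, L3) for i in range(n + 1)] == SA
```

In the secure setting, the reads of F, A, rank₁, and Ψ in the loop body are exactly the ORAM accesses discussed next, and the multiply-by-flag replaces the data-dependent branch of the original algorithm.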

For A, we use the Path ORAM [15] with block size b = log² n. Then the space usage is O(n) bits and an entry of A is obtained in O(log² n) many accesses to blocks, which can be done in O(log³ n) time.

For F, we use the succinct ORAM [10] with block size b = log² n. Then a block of F is obtained in O(log² n) many accesses to blocks, which can also be done in O(log³ n) time. For computing a rank on F, we use the array R₁ defined in Section 2.2, stored using the Path ORAM; its space usage is O(n / log n) bits. The rank inside a block is computed by logical operations in O(log n) time. Therefore a rank value is computed in O(log³ n) time.

The computation of Inverse(j) is similar.

4.3 Computing Ψ′

Finally, we show how to compute Ψ′[i]. As shown in Section 3.1, Ψ′ is represented by a bit-vector H[0, 2n] and an array L[0, n] of log σ-bit integers. The array L is stored using the succinct ORAM. The bit-vector H is stored similarly to F, but here we also need to store the auxiliary data structure for select₁, which can be stored using the Path ORAM.

To sum up, Ψ′[i] is computed in O(log³ n) time and the space usage is n(2 + log σ) + o(n log σ) bits.

4.4 Summary

Computing Ψ′[i] takes O(log³ n) time. A step of the backward search is a binary search on Ψ′ and is therefore done in O(log⁴ n) time, so Range(P, T) takes O(|P| log⁴ n) time. For Lookup(i), we set L₃ = Θ(log n); then it takes O(log⁴ n) time. The space usage is (n + o(n)) log σ + O(n) bits in total.

5 Concluding Remarks

We have proposed secure compressed suffix arrays. For a string of length n on an alphabet of size σ, the data structure uses (n + o(n)) log σ + O(n) bits of space, the Count(P, T) query is done in O(|P| log⁴ n) time, and the Lookup(i) query is done in O(log⁴ n) time. Therefore there is an O(log³ n) multiplicative overhead compared with the original compressed suffix array [4]. To improve the running time, we need succinct ORAMs more efficient than [10] and standard ORAMs more efficient than [14]. Future work will be developing such oblivious RAM data structures and giving efficient implementations.

References

  • [1] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. TFHE: fast fully homomorphic encryption over the torus. J. Cryptol., 33(1):34–91, January 2020. doi:10.1007/s00145-019-09319-x.
  • [2] P. Ferragina and G. Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
  • [3] Craig Gentry. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC ’09, pages 169–178, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1536414.1536440.
  • [4] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [5] Roberto Grossi and Jeffrey S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In Proceedings of the Thirty-Second Annual ACM Symposium on the Theory of Computing, pages 397–406, Portland, OR, 2000.
  • [6] G. Jacobson. Space-efficient Static Trees and Graphs. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 549–554, 1989.
  • [7] Eyal Kushilevitz, Steve Lu, and Rafail Ostrovsky. On the (in)security of hash-based oblivious RAM and a new balancing scheme. In Proceedings of the 2012 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 143–156, 2012. doi:10.1137/1.9781611973099.13.
  • [8] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi:10.1137/0222058.
  • [9] J. I. Munro and V. Raman. Succinct Representation of Balanced Parentheses and Static Trees. SIAM Journal on Computing, 31(3):762–776, 2001. doi:10.1137/S0097539799364092.
  • [10] Taku Onodera and Tetsuo Shibuya. Succinct Oblivious RAM. In Rolf Niedermeier and Brigitte Vallée, editors, 35th Symposium on Theoretical Aspects of Computer Science (STACS 2018), volume 96 of Leibniz International Proceedings in Informatics (LIPIcs), pages 52:1–52:16, Dagstuhl, Germany, 2018. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.STACS.2018.52.
  • [11] R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4), 2007. doi:10.1145/1290672.1290680.
  • [12] Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. J. Algorithms, 48(2):294–313, 2003. doi:10.1016/S0196-6774(03)00087-7.
  • [13] Adi Shamir. How to share a secret. Commun. ACM, 22(11):612–613, November 1979. doi:10.1145/359168.359176.
  • [14] Elaine Shi, T. H. Hubert Chan, Emil Stefanov, and Mingfei Li. Oblivious RAM with O((log N)³) worst-case cost. In Dong Hoon Lee and Xiaoyun Wang, editors, Advances in Cryptology – ASIACRYPT 2011, pages 197–214, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. doi:10.1007/978-3-642-25385-0_11.
  • [15] Emil Stefanov, Marten van Dijk, Elaine Shi, Christopher Fletcher, Ling Ren, Xiangyao Yu, and Srinivas Devadas. Path ORAM: an extremely simple oblivious RAM protocol. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, CCS ’13, pages 299–310, New York, NY, USA, 2013. Association for Computing Machinery. doi:10.1145/2508859.2516660.
  • [16] Latanya Sweeney. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, October 2002. doi:10.1142/S0218488502001648.
  • [17] Andrew Chi-Chih Yao. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science (sfcs 1986), pages 162–167, 1986. doi:10.1109/SFCS.1986.25.