On Using Toeplitz and Circulant Matrices for Johnson-Lindenstrauss Transforms

The Johnson-Lindenstrauss lemma is one of the corner stone results in dimensionality reduction. It says that for any set of vectors $X \subset \mathbb{R}^n$, there exists a mapping $f : X \to \mathbb{R}^m$ such that $f(X)$ preserves all pairwise distances between vectors in $X$ to within $(1 \pm \varepsilon)$ if $m = O(\varepsilon^{-2} \lg N)$. Much effort has gone into developing fast embedding algorithms, with the Fast Johnson-Lindenstrauss transform of Ailon and Chazelle being one of the most well-known techniques. The current fastest algorithm that yields the optimal $m = O(\varepsilon^{-2}\lg N)$ dimensions has an embedding time of $O(n \lg N + \varepsilon^{-2} \lg^3 N)$. An exciting approach towards improving this, due to Hinrichs and Vyb\'iral, is to use a random $m \times n$ Toeplitz matrix for the embedding. Using Fast Fourier Transform, the embedding of a vector can then be computed in $O(n \lg m)$ time. The big question is of course whether $m = O(\varepsilon^{-2} \lg N)$ dimensions suffice for this technique. If so, this would end a decades long quest to obtain faster and faster Johnson-Lindenstrauss transforms. The current best analysis of the embedding of Hinrichs and Vyb\'iral shows that $m = O(\varepsilon^{-2}\lg^2 N)$ dimensions suffices. The main result of this paper, is a proof that this analysis unfortunately cannot be tightened any further, i.e., there exists a set of $N$ vectors requiring $m = \Omega(\varepsilon^{-2} \lg^2 N)$ for the Toeplitz approach to work.


Introduction
The performance of many geometric algorithms depends heavily on the dimension of the input data. A widely used technique to combat this "curse of dimensionality", is to preprocess the input via dimensionality reduction while approximately preserving important geometric properties. Running the algorithm on the lower dimensional data then uses less resources (time, space, etc.) and an approximate result for the high dimensional data can be derived from the low dimensional result.
Dimensionality reduction approximately preserving pairwise Euclidean distances has found uses in a wide variety of applications, including: Nearest-neighbor search [2,12], clustering [6,8], linear programming [21], streaming algorithms [18], compressed sensing [7,11], numerical linear algebra [24], graph sparsification [19], and differential privacy [5]. See more applications in [20,14]. The most fundamental result in this regime is the Johnson-Lindenstrauss (JL) lemma [15], which says the following: Theorem 1 (Johnson-Lindenstrauss lemma). Let X ⊂ R n be a set of N vectors, then for any 0 < ε < 1/2, there exists a map f : X → R m for some m = O(ε −2 lg N ) such that This result dates back to 1984 and says that to preserve pairwise Euclidian distances amongst a set of N points/vectors in R n to within a factor (1 ± ε), it suffices to use just m = O(ε −2 lg N ) dimensions. The bound on m was very recently proven optimal [17].
The standard technique for constructing a map with the properties of Theorem 1 is the following: Let A be an m × n matrix with entries independently sampled as either N (0, 1) random variables (as in [10]) or Rademacher (uniform among {−1, +1}) random variables (as in [1]). Once such entries have been drawn, let f : R n → R m be defined as: To prove that the map f satisfies the guarantees in Theorem 1, it is first shown that for any vector x, the probability that f (x) 2 2 is not within (1 ± ε) x 2 2 is less than 1/N 2 . This probability is called the error probability and denoted δ. Using linearity of f and a union bound over all pairs x, y ∈ X, the probability that all pairwise distances (i.e. the norm of the vector x − y) are preserved can be shown to be at least 1/2.

Time Complexity
Examining the classic Johnson-Lindenstrauss reduction above, we see that to embed a vector, we need to multiply with a dense matrix and the embedding time becomes O(nm) (or equivalently O(nε −2 lg N )). This may be prohibitively large for many applications (recall one prime usage of dimensionality reduction is to speed up algorithms), and much research has been devoted to obtaining faster embedding time.
Fast Johnson-Lindenstrauss Transform. Ailon and Chazelle [2] were the first to address the question of faster Johnson-Lindenstrauss transforms. In their seminal paper, they introduced the so-called Fast Johnson-Lindenstrauss transform for speeding up dimensionality reduction. The basic idea in their paper is to first "precondition" the input data by multiplying with a diagonal matrix with random signs, followed by multiplying with a Hadamard matrix. This has the effect of "spreading" out the mass of the input vectors, allowing for the dense matrix A above to be replaced with a sparse matrix. Since we can multiply with a Hadamard matrix using Fast Fourier Transform, this gives an embedding time of O(n lg n + ε −2 lg 3 N ) for embedding into the optimal m = O(ε −2 lg N ) dimensions. For m = ε −2 lg N ≤ n 1/2−γ for any constant γ > 0, the embedding complexity was improved even further down to O(n lg m) in [3].
Another approach to achieve the O(n lg m) embedding time, but without the restriction on ε −2 lg N ≤ n 1/2−γ , is to sacrifice the target dimension. This was done in [4], where the embedding complexity was O(n lg m) at the cost of an increased target dimension m = O(ε −4 lg N lg 4 n).
Sparse Vectors. Another approach to improve the performance of JL transforms, is to assume the input data is sparse, i.e. has few non-zero coordinates. Designing an algorithm based on the work in [23], Dasgupta et al. [9] achieved an embedding complexity of O( x 0 ε −1 lg 2 (mN ) lg N ), where x 0 = |{i | x i = 0}|. This was later improved to O( x 0 ε −1 lg N ) in [16].
Toeplitz Matrices. Finally, another very exciting approach is to use Toeplitz matrices or partial circulant matrices for the embedding. We first introduce the terminology.
An m × n Toeplitz matrix is an m × n matrix, where every entry on a diagonal has the same value: A partial circulant matrix is a special kind of Toeplitz matrix, where every row, except the first, is the previous row rotated once: Hinrichs and Vybíral [13] proposed the following algorithm for generating a JL embedding based on a Toeplitz matrix 1 : Let t −(m−1) , t −(m−2) , t n−1 and d 1 , . . . , d n be i.i.d. Rademacher 2 random variables, and T be a Toeplitz matrix defined from t −(m−1) , t −(m−2) , . . . , t n−1 such that entry (i, j) takes values t j−i for i = 1, . . . , m and j = 1, . . . , n. Let D be an n × n diagonal matrix with the random variable d i giving the i'th diagonal entry. Define the map f as Multiplying with a Toeplitz matrix corresponds to computing a convolution and can be done using Fast Fourier Transform. By appropriately blocking the input coordinates, the complexity of embedding a vector x is just O(n lg m) for any target dimension m. The big question is of course, how low can the target dimension m be, while preserving the distances between vectors up to a factor of 1 ± ε?
In the original paper [13], the authors proved that setting the target dimension to m = O(ε −2 lg 3 (1/δ)), the norm of any vector would be preserved to within (1 ± ε) with probability at least 1 − δ. Setting δ = 1/N 2 , a union bound over all pairwise difference vectors (as in the classic construction) shows that dimension m = O(ε −2 lg 3 N ) suffices. Later, the analysis was refined in [22], which lowered the target dimension to m = O(ε −2 lg 2 (1/δ)) for preserving norms to within (1 ± ε) with probability 1 − δ. Again, setting δ = 1/N 2 , this gives m = O(ε −2 lg 2 N ) target dimension. Now if the analysis could be tightened even further to give the optimal m = O(ε −2 lg N ) dimensions, this would end the decades long quest for faster and faster embedding algorithms! Our Contribution. Our main result unfortunately shows that the analysis of Vybíral [22] cannot be tightened to give an even lower target dimensionality. More specifically, we prove that the upper bound given in [22] is optimal: Theorem 2. Let T and D be the m × n Toeplitz and n × n diagonal matrix in the embedding proposed by [13]. For all 0 < ε < C, where C is a universal constant, and any desired error probability δ > 0, if the following holds for every unit vector x ∈ R n : then it must be the case that m = Ω(ε −2 lg 2 (1/δ)).
While Theorem 2 already shows that one cannot tighten the analysis of Vybíral for preserving the norm of just one vector, Theorem 2 does leave open the possibility that one would not need to union bound over all N 2 pairs of difference vectors when trying to preserve all pairwise distances amongst a set of N vectors. It could still be the case that there somehow was a strong positive correlation between distances being preserved (though this seems extremely unlikely, and would be something not seen in any previous approach to JL). To complete the picture, we indeed show that this is not the case, at least for N somewhat smaller than the dimension n: Theorem 3. Let T and D be the m × n Toeplitz and n × n diagonal matrix in the embedding proposed by [13]. For all 0 < ε < C, where C is a universal constant, if the following holds for every set of N vectors X ⊂ R n : then it must be the case that either m = Ω(ε −2 lg 2 N ) or m = Ω(n/N ).
We remark that our proofs also work if we replace T be a partial circulant matrix (which was also proposed in [13]). Furthermore, we expect that minor technical manipulations to our proof would also show the above theorems when the entries of T and D are N (0, 1) distributed rather than Rademacher (this was also proposed in [13]).

Lower Bound for One Vector
Let T be m× n Toeplitz matrix defined from random variables t −(m−1) , t −(m−2) , . . . , t n−1 such that entry (i, j) takes values t j−i for i = 1, . . . , m and j = 1, . . . , n. Let D be an n × n diagonal matrix with the random variable d i giving the i'th diagonal entry. This section shows the following: and furthermore, all but the first O( √ m) coordinates of x are 0.
It follows from Theorem 4 that if we want to have probability at least 1 − δ of preserving the norm of any unit vector x to within (1 ± ε), it must be the case that ε √ m = Ω(lg(1/δ)), i.e. m = Ω(ε −2 lg 2 (1/δ)). This is precisely the statement of Theorem 2. Thus we set out to prove Theorem 4.
To prove Theorem 4, we wish to invoke the Paley-Zygmund inequality, which states, that if X is a random variable with finite variance and 0 ≤ θ ≤ 1, then We carefully choose a unit vector x, and define the random variable for Paley-Zygmund to be the k'th moment of the difference between the norm of x transformed and 1.
Proof. Let k be an even positive integer less than m/4 and define s := 4k. Note that s ≤ m. Let x be an arbitrary n-dimensional unit vector such that the first s coordinates are in {−1/ √ s, +1/ √ s}, while the remaining n − s coordinates are 0. Define the random variable parameterized by k Since k is even, the random variable Z k is non-negative. We wish to lower bound E[Z k ] and upper bound E[Z 2 k ] in order to invoke Paley-Zygmund. The bounds we prove are as follows: Before proving Lemma 5 we show how to use it together with Paley-Zygmund to complete the proof of Theorem 4.
We start by invoking Paley-Zygmund and then rewriting the expectations according to Lemma 5, Here C 0 is some constant greater than 0. For any This choice of k satisfies k ≤ √ m as required by Lemma 5. We have thus shown that: Remark. Theorem 4 can easily be extended to partial circulant matrices. The difference between partial circulant and Toeplitz matrices is the dependence between the values in the first m and last m columns. However, as only the first s = 4k ≤ 4 √ m entries in x are nonzero, the last m columns are ignored, and so partial circulant and Toeplitz matrices behave identically in our proof.
Proof of Lemma 5. Before we prove the two bounds in Lemma 5 individually, we rewrite E[Z k ], as this benefits both proofs.
Observe that for j > s or h > s the product becomes 0, as either x j or x h is 0. By removing all these terms, we simplify the sum to x h is 0 if one of the following two things are true: • A d j occurs an odd number of times in the product.
• A variable t a occurs an odd number of times in the product.
To see this, note that by the independence of the random variables, we can write the expectation of the product, as a product of expectations where each term in the product has all the occurrences of the same random variable. Since the d j 's and t a 's are Rademachers, the expectation of any odd power of one of these random variables is 0. Thus if just a single random variable amongst the d j 's and t a 's occurs an odd number of times, we have Similarly, we observe that if every random variable occurs an even number of times, then the expectation of the product is exactly 1/s k since each x j also occurs an even number of times. If Γ k denotes the number of tuples S ∈ ([m]×[s]×[s]) k such that ∀(i, j, h) ∈ S we have h = j and furthermore: Then we conclude Note that Z 2 k = Z 2k . Therefore, To complete the proof of Lemma 5 we need lower and upper bounds for Γ k and Γ 2k . The bounds we prove are Lemma 6. If k ≤ √ m, then Γ k and Γ 2k satisfy: The proofs of the two bounds in Lemma 6 are in Sections 2.1 and 2.2. Substituting the bounds from Lemma 6 in (1) and (2) we get which are the bounds we sought for Lemma 5.

Lower Bounding Γ k
We first recall that the definition of Γ k is the number of tuples S ∈ ([m] × [s] × [s]) k satisfying that ∀(i, j, h) ∈ S we have h = j and furthermore: • For all columns a ∈ [s], |{(i, j, h) ∈ S | j = a ∨ h = a}| mod 2 = 0.

We view a triple (i, j, h) ∈ ([m] × [s] × [s]) as two entries (i, j) and (i, h)
in an m × s matrix. Furthermore, when we say that a triple touches a column or diagonal, a matrix entry of the triple lie on that column or diagonal, so (i, j, h) touches columns j and h and diagonals j − i and h − i. Similarly, we say that a tuple S ∈ ([m] × [s] × [s]) k touches a given column or diagonal l times, if l triples in S touches that column or diagonal.
We intent to prove a lower bound for Γ k by constructing a big family of tuples F ⊆ ([m] × [s] × [s]) k , where each tuple satisfies, that each column and diagonal touched by that tuple is touched exactly twice. As each column and diagonal is touched an even number of times, the number of tuples in the family is a lower bound for Γ k .
Proof of Γ k = m k/2 s k k k 2 −O(k) . We describe how to construct a family of tuples F ⊆ ([m] × [s] × [s]) k satisfying that ∀S ∈ F, ∀(i, j, h) ∈ S we have h = j and furthermore: From this and the definition of Γ k it is clear that |F | ≤ Γ k . When constructing an S ∈ F, we view S as consisting of two halves S 1 and S 2 , such that S 1 touches exactly the same columns and diagonals as S 2 and both S 1 and S 2 touches each column and diagonal at most once. To capture this, we give the following definition, where S is meant to be the family of such halves S 1 and S 2 .

Definition 7. Let S be the set of all tuples
Definition 7 mimics the definition of Γ k , and the first item in Definition 7 ensures that the triples in a tuple in S are of the same form as in Γ k . The final two items ensure that each column and diagonal, respectively, is touched at most once. This is exactly the properties we wanted of S 1 and S 2 individually.
We can now construct F as all pairs of (half) tuples S 1 , S 2 ∈ S, such that S 1 touches exactly the same columns and diagonals as S 2 . To capture that S 1 and S 2 touch the same columns and diagonals, we introduce the notion of a signature. A signature of S i is the set of columns and diagonals touched by S i .
To have S 1 and S 2 touch exactly the same columns and diagonals, it is necessary and sufficient that they have the same signature. We introduce the following notation: B denotes the number of signatures with at least one member, and by enumerating the signatures, we let b i denote the number of (half) tuples in S with signature i.
We recall that a (half) tuple S 1 ∈ S touches each column and diagonal at most once, and if S 1 and S 2 share the same signature, they touch exactly the same columns and diagonals. Therefore, using • to mean concatenation, S = S 1 • S 2 ∈ F, as each column and diagonal touched is touched exactly twice. Therefore |F | is a lower bound for Γ k . Note that for a given signature i, the number of choices of S 1 and S 2 with that signature is b 2 i . This gives the following inequality, We now apply the Cauchy-Schwarz inequality: To get a lower bound on |S| 2 /B (and in turn Γ k ), we need a lower bound on |S| and an upper bound on B. These bounds are stated in the following lemmas Lemma 8. |S| = Ω(m k/2 s k 2 −k ).

Lemma 9. B = O m+s k/2 s k/2 s k
Before proving any of these lemmas, we show that they together with (3) give the desired lower bound on Γ k : Because s = 4k, we have (k/2) k/2 s k/2 = 2 −Θ(k) , and because s ≤ m: . With this we can simplify (4), which is the lower bound we sought.
is the set of (half) tuples that touch each column and diagonal at most once, and, for each triple (i, j, h) in these (half) tuples, we have j = h.
We prove Lemma 8 by analysing how we can create a large number of distinct S ∈ S by choosing the triples in S iteratively.
For each triple, we choose a row and two distinct entries on this row. We choose the row among any of the m rows.
However, because S ∈ S, when choosing entries on the row, we cannot choose entries that lie on columns or diagonals touched by previously chosen triples. Instead we choose the two entries among any of the other entries. Therefore, whenever we choose a triple, this triple prevents at most four row entries from being chosen for every subsequent triple, as the two diagonals and two columns touched by the chosen triple intersect with at most four entries on the rows of the subsequent triples. This leads to the following recurrence, describing a lower bound for the number of triples where r is the number of rows to choose from, c is the minimum number of choosable entries in any row, and t is the number of triples left to choose. Inspecting (5), we can see that F can equivalently be defined as If t ≤ c 8 then the terms inside the product in (6) are greater than c 2 , so we can bound F from below: We now insert the values for r, c and t to find a lower bound for |S|, noting that s = 4k ensures that t ≤ c 8 : Proof of Lemma 9. Recall that for at triple S ∈ S we define the signature as the set of columns and diagonals touched by S. Furthermore, viewing a triple (i, j, h) ∈ ([m] × [s] × [s]) as the two entries (i, j) and (i, h) in an m × s matrix, we define the left endpoint as (i, min{j, h} and the right endpoint as (i, max{j, h}).
The claim to prove is This is proven by first showing an upper bound on the number of choices for the diagonals of left endpoints, then diagonals of right endpoints and finally for columns.
In an m × s matrix there are m + s different diagonals and as the chosen diagonals have to be distinct, there are m+s k/2 choices for the diagonals corresponding to left endpoints in a triple. As the right endpoint of a triple has to be in the same row as the left endpoint, there are at most s choices for the diagonal corresponding to the right endpoint when the left endpoint has been chosen (which it has in our case). This gives a total of s k/2 choices for diagonals corresponding to right endpoints.
Finally, there are s columns to choose from and the chosen columns have to be distinct, and so the total number of choices of columns is s k . The product of these number of choices gives the upper bound sought.

Upper Bounding Γ 2k
Proof. Recall that Γ 2k is defined as the number of tuples S ∈ ([m] × [s] × [s]) 2k such that ∀(i, j, h) ∈ S we have h = j and furthermore: be the family of tuples satisfying these conditions, and so |F | = Γ 2k . To prove an upper bound on Γ 2k , we show how to encode a tuple S ∈ F using at most k lg m + 2k lg s + 2k lg k + O(k) bits, such that S can be decoded from this encoding. Since any S ∈ F can be encoded using k lg m + 2k lg s + 2k lg k + O(k) bits and |F | = Γ 2k , we can conclude: Let σ denote the encoding function and σ −1 denote the decoding function. If S ∈ F and t ∈ S, σ(t) denotes the encoding of the triple t, σ(S) denotes the encoding of the entire tuple S, and σ(F ) denotes the image of σ.
A tuple S ∈ F consists of triples t 1 , t 2 , . . . , t 2k such that S = t 1 • t 2 • · · · • t 2k . To encode S ∈ F we encode each of the triples and store them in the same order: We will first describe a graph view of a tuple S which will be useful for encoding and decoding, then we will show an encoding algorithm and finally a decoding algorithm.
Graph A tuple S ∈ F forms a (multi)graph structure, where every triple (i, j, h) ∈ S is a vertex. Since S ∈ F, there lies an even number of triple endpoints on each diagonal. We can thus pair endpoints lying on the same diagonal, such that every endpoint is paired with exactly one other endpoint. When two triples have endpoints that are paired, the triples have an edge between them in the graph. As every triple has two endpoints, every vertex has degree two, and so the graph consists entirely of simple cycles of length at least two.
Encoding To encode an S ∈ F, we first encode each cycle by itself by defining the σ(t)'s for the triples t in the cycle. After this, we order the defined σ(t)'s as the t's were ordered in the input.
1. For each cycle we perform the following: (a) We pick any vertex t = (i, j, h) of the cycle and give it the type head. Define σ(t) as the concatenation of its type head, its row i, and two columns j and h. This uses lg m+2 lg s+O(1) bits. (b) We iterate through the cycle starting after head and give vertices, except the last, type mid.
The last vertex just before head is given the type last. (c) For each triple t of type mid we store its type and two columns explicitly. However, instead of storing its row we store the index of its predecessor in the cycle order as well as how they are connected: If we typed t r just before t s when iterating through the cycle, when encoding t s we store r as well as whether t r and t s are connected by the left or right endpoint in t r and left or right endpoint in t s . So define σ(t) as the concatenation of mid, the two columns, the predecessor index and how it is connected to the predecessor. All in all we spend lg k + 2 lg s + O(1) bits encoding each mid.
(d) Finally, to encode the triple t, which is typed last, we define σ(t) as the concatenation of its type, its predecessors index, how it is connected to its predecessor, and the column of the endpoint on the predecessor's diagonal. We thus spend lg k + lg s + O(1) bits to encode a last. However, since s = 4k the number of bits per encoded last is equivalent to 2 lg k + O(1), which turns out to simplify the analysis later.
Note that for each triple, the type is encoded in the first 2 bits 3 in the encoding of the triple. This will be important during decoding.
2. After encoding all cycles, we order the encoded triples in the same order as the triples in the input, and output the concatenation of the encoded triples: To analyse the number of bits needed in total, we look at the average number of bits per triple inside a cycle. Since all cycles have a length of at least two, each cycle has a head and a last triple. These two use an average of lg m 2 + lg s + lg k + O(1) bits. We now claim that the number of bits per mid is bounded by this average. Recall that we assumed √ m ≥ k and s = 4k: Since the average of all 2k triples is at most lg m 2 + lg s + lg k + O(1) bits, the number of bits for all triples is at most k lg m + 2k lg s + 2k lg k + O(k).
First we need to extract the encodings of the individual triples, then we decode each cycle, and finally we restore the order of triples. The following steps describes this algorithm in more detail.
1. We first extract the individual σ(t i )'s, by iterating through the bit string σ(S). By looking at the type in the first two bits of an encoded triple we know the length of the encoded triple. From this we can extract σ(t i ) as well as know where σ(t i+1 ) begins.
2. We now wish to decode each cycle to get the row and two columns of every triple, so do the following for each head t i : (a) t ← t i (b) Look at the first two bits in t to determine its type, and while t is not a last, do: i. If t is a head, the row and two columns are stored explicitly.
ii. If t is a mid, the two columns are stored explicitly. From the reference to t's predecessor, we can calculate the diagonal shared between t and its predecessor. This is can be done as we have stored which endpoint (left or right) of the predecessor lies on the diagonal, and as we decode in the same order as we encoded, we have already decoded the predecessor. We know which of t's columns the shared diagonal intersects, and so we can calculate the row.
If the predecessor is a head, take note of which diagonal is shared with it. iii. t ← the triple that has t as its predecessor. (c) t is now a last, so we know the predecessor's index as well as a column. However, as the graph consists of cycles rather than just linked lists, last shares a diagonal with its predecessor as well as sharing a diagonal with head. We have noted which diagonal has already been used for head, so the other diagonal head touches is shared with last. From these two diagonals and the column, it is possible to calculate the row by the intersection between the predecessor diagonal and the column, and the other column by the intersection of the row and the other diagonal.
3. Finally order the triples as they were, when the encoded versions were extracted in step 1.

Lower Bound for N Vectors
In this section, we generalize the result of Section 2 to obtain a lower bound for preserving all pairwise distances amongst a set of N vectors. Our proof uses Theorem 4 as a building block. Recall that Theorem 4 guarantees that there is a vector x such that