Optimal White-Box Adversarial Streaming
Lower Bounds for Approximating LIS Length
Abstract
The space complexity of deterministic streaming algorithms for approximating the length of the longest increasing subsequence (LIS) in a string of length has been known to be for almost two decades. In contrast, the space complexity of this problem for randomized streaming algorithms remains one of the few longstanding open problems in one-pass streaming. In fact, no better than lower bounds are known, and the best upper bounds are no better than their deterministic counterparts.
In this paper, we push the limits of our understanding of the streaming space complexity of the approximate LIS length problem by studying it in the white-box adversarial streaming model. This model is an intermediate model between deterministic and randomized streaming algorithms that has recently attracted attention. In the white-box model, the streaming algorithm can draw fresh randomness when processing each incoming element, but an adversary generating the stream observes all previously used randomness and adaptively chooses the subsequent elements of the stream.
We prove a tight (up to logarithmic factors) space lower bound for any white-box streaming algorithm that approximates the length of the LIS of a stream of length to within a factor better than . Thus, for this problem, white-box algorithms offer no improvement over deterministic ones.
Keywords and phrases:
White-bos streaming, Longest increasing subsequenceFunding:
Raghuvansh R. Saxena: We acknowledge support of the Department of Atomic Energy, Government of India, under project no. RTI4001.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Streaming, sublinear and near linear time algorithmsEditor:
Shubhangi SarafSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The LIS problem.
The length of the Longest Increasing Subsequence (LIS) is a fundamental measure of sequences that has been extensively studied, with applications in various areas, including bioinformatics, string-matching, text processing and more. See the books by Gusfield [15] and Pevzner [21] and the survey by Aldous and Diaconis [2].
Formally, given a sequence of length over an ordered alphabet , an increasing subsequence in is a subsequence such that . For , if the longest LIS in is of length , a -approximation of the length of the LIS in is any value between and .
The streaming model.
In this paper, we study the space complexity of approximating the length of the LIS in the data stream model of computation. In this model, the input arrives as a stream of elements, the algorithm is allowed only one or a few passes over the stream, and it should use limited memory. The streaming model has been extensively studied in recent years, driven by the need to process massive data sets arising in domains such as network traffic monitoring, database management, machine learning, social networks, and large-scale graph analysis (see the surveys by Muthukrishnan [20] and Babcock et al. [4]). In these applications, the input is either stored on an external device, or obtained on the fly from some source, and the algorithm does not have sufficient working memory to store the entire data set.
LIS in the streaming model.
The LIS length approximation problem is particularly motivated in the streaming model. For example, since efficient sorting is infeasible in this setting, some applications wish to estimate how close their input is to being sorted. The length of the LIS is a natural measure of this closeness.
Streaming algorithms for computing or approximating LIS length have received considerable attention. Computing the length of the LIS exactly via a streaming algorithm is known to require space, even for randomized algorithms [14, 25]. This shows that to get sublinear space, one needs to settle for approximation. Indeed, [14] gave a deterministic one-pass streaming algorithm that approximates LIS length to within any multiplicative constant and uses space. A matching (up to factors) lower bound of was shown by [12, 8] for any alphabet size .
As discussed above, the space complexity of (deterministic or randomized) streaming algorithms for computing LIS length exactly, as well as the space complexity of deterministic streaming algorithms for approximating the length of LIS, are well understood.
In contrast, as far as we know, for randomized streaming algorithms that -approximate the length of the LIS, the lower bound of Sun and Woodruff [25] is still the best known for any (arbitrarily small) constant . Thus, the optimal space complexity of randomized streaming algorithms for approximating the length of LIS remains wide open.
The white-box adversarial streaming model.
In this paper, we push the limits of our understanding of the LIS length approximation problem by considering it in the white-box adversarial streaming model, an intermediate model between randomized and deterministic streaming algorithms that received a lot of attention lately [1].
In the standard data stream model, it is assumed that the entire input is fixed in advance, and a randomized streaming algorithm has to be correct on any input stream with high probability (over the algorithm’s randomness). This can also be interpreted as the input is chosen by an adversary, where the adversary knows the algorithm, but does not have any information about the algorithm’s internal states (memory and random coins) during the execution of the algorithm.
Recently, streaming algorithms in stronger adversarial settings received a lot of attention, where instead of fixing the entire input in advance, the adversary is allowed to choose the next element of the stream after observing parts of the algorithm’s behavior up to the current step.
The white-box adversarial streaming model, introduced by [1], allows the adversary to have full access to the internal states of the algorithm. More precisely, upon receiving each element, the algorithm can use fresh randomness to update its memory. However, the white-box adversary that generates the stream can observe all previously used random coins during the execution of the algorithm up to the current step (and therefore also knows the algorithm’s memory content), before fixing the next element of the stream. The algorithm has to be correct with high probability over both the algorithm’s and the adversary’s random choices, for every white-box adversary.
The white-box model captures scenarios where algorithms lack access to a securely private source of randomness, and, as pointed out in [1, 10, 9], also some non-adversarial settings. For instance, a server may publish its initial state and users may subsequently generate their data based on this state and send it back to the server. Analogous notions of the white-box streaming model are studied in areas such as dynamic data structures, machine learning, and persistent data structures. For a broader survey of white-box models across different domains, see [1, 10, 9].
1.1 Our Result
We prove the following lower bound on the space complexity of any white-box streaming algorithm for approximating LIS length.
Theorem 1.
Any (randomized) white-box streaming algorithm that, with probability , outputs a -approximation to the length of the LIS of a stream of length with alphabet size requires space.
The space lower bound in Theorem 1 is tight (up to factors) as a deterministic (and therefore also white-box) streaming algorithm with space complexity is known [14]. Additionally, for alphabet size , there are deterministic one-pass streaming algorithms that compute the length of LIS exactly using space (see Section 1.2). Therefore, the space lower bound in Theorem 1 can only hold for large enough alphabet size .
Our proof of Theorem 1 differs from the proofs of previous white-box streaming lower bounds. Known white-box lower bounds were obtained by proving one-way, two-party deterministic communication lower bounds for the corresponding communication problem, as [1] show that such bounds imply white-box streaming lower bounds. This approach, however, cannot be used to prove Theorem 1 since, for every constant , there is a dynamic-programming-based two-party one-way deterministic communication protocol with only communication that gives a -approximation of LIS length. More generally, for , there is a -party one-way deterministic communication protocol with communication [14, 25]. This suggests that to prove Theorem 1 we should consider communication protocols with many parties. Unfortunately, [1] shows that their result cannot be extended to the case of multi-party communication protocols.
Another intermediate model between deterministic and randomized streaming that attracted attention recently is the adversarially robust streaming model [5]. In this model the adversary sees only part of the algorithm’s internal state (see Section 1.2), and therefore any white-box algorithm is also adversarially robust. While our Theorem 1 shows that LIS length approximation is hard in the white-box model, the work of [5] implies that the space complexity of LIS length approximation in the adversarially robust model is essentially the same as that of randomized algorithms. See Section 1.2 for more details. In this sense, our result proves a lower bound in the strongest studied model that does not imply a randomized lower bound.
1.2 Prior Work
LIS in other models.
A classical dynamic-programming-based algorithm, called patience sorting [2], finds a longest increasing subsequence, and thus also computes the length of LIS exactly, in nearly linear time using space. Fredman [11] proved that the running time of patience sorting is optimal for computing the length of LIS exactly. Sublinear-time and sublinear-space algorithms for approximating the LIS length have also been extensively studied in various settings; see, e.g., [3, 22, 24, 13, 17, 7].
LIS in the streaming model.
Patience Sorting can naturally be viewed as a one-pass space streaming algorithm for computing the length of LIS exactly. More precisely, the space bound is where is the length of the LIS. Thus, the one-pass deterministic streaming complexity is .
Gopalan, Jayram, Kumar and Krauthgamer [14] and Sun and Woodruff [25] proved an space lower bound for computing the length of LIS exactly, even for randomized algorithms.
[14] gave a deterministic one-pass streaming algorithm to compute a -approximation of the length of LIS using space for any . Gál and Gopalan [12] and Ergun and Jowhari [8] proved matching lower bounds showing that space is necessary for deterministic constant-pass streaming algorithms for any and large enough alphabet size . More precisely, the bound of [12] is of the form for -pass streaming algorithms. Li and Zheng [18] extended the deterministic lower bounds of [12, 8] to smaller alphabet sizes, proving lower bounds of the form for constant-pass streaming algorithms. In addition, Li and Zheng [19] extended the above space lower bounds to the streaming model where the stream may be provided in other than the “standard” order of the input sequence . Saks and Seshadhri [23] gave a space randomized streaming algorithm that approximates LIS length within additive error for any constant .
The adversarially robust streaming model.
Another adversarial streaming model, closely related to the white-box model, is the adversarially robust streaming model [5], where the adversary is allowed to observe outputs of the algorithm up to the current step before fixing the next element of the stream, but cannot see the random coins of the algorithm. It is known that space complexity in the “standard” randomized streaming model may be strictly smaller than in adversarially robust streaming [16], which in turn can be strictly smaller than white-box adversarial streaming space complexity [1].
On the other hand, it was shown in [5] that for a certain class of problems “standard” randomized streaming algorithms can be simulated by adversarially robust streaming algorithms without using much more space. In particular, for approximating LIS length this means that the space complexity of adversarially robust streaming is within factor of the space complexity of standard randomized streaming. Thus, proving space lower bounds in the adversarially robust streaming model for approximating the length of LIS would be equivalent to proving lower bounds for standard randomized streaming, up to factors.
In light of this, we focus our attention on the white-box adversarial model, as an intermediate model between randomized and deterministic streaming.
2 Proof Sketch
We overview our proof of Theorem 1 in this section. The proof has two main parts. First, we define a communication game (Less Than Matrix) and show a reduction saying that a streaming algorithm for approximating the length of the LIS in the input stream implies a protocol that solves the communication game. With this reduction, it suffices to prove a lower bound on the length of the longest message in any protocol solving the communication game , which forms the second part of the proof.
2.1 The Communication Game
The communication game is parameterized by two parameters, an integer and a real number . In the communication game , there are parties and the input of each party is an bit vector. We interpret the input vectors of the odd-numbered parties as the rows of a matrix and therefore call these parties , and interpret the rows of the even-numbered parties as the rows of a matrix and call these parties . The goal of the parties is to distinguish between the case when , which we call the no case, and the case where at least an fraction of the entries in any row of are strictly smaller than the corresponding entries in , which we call the yes case.
We study one-way protocols for the game when the parties’ inputs are determined by a white-box adversary. That is, first an all-powerful adversary selects an input for the party , who then uses its randomness and the selected input to send a message to the party . The adversary can see the sampled randomness (and therefore, the message) and then selects an input for the party , who then uses this input together with fresh randomness to send a message to the party . This continues till the last party computes the output, which is either yes or no. The protocol succeeds in solving the game if whenever the selected input is in the no (respectively, yes) case, the output is no (resp. yes) with high probability.
The game can also be viewed in the white-box adversarial streaming setting, where the input vectors of the parties are streamed coordinate by coordinate in that order. Note that, when viewed this way, the length of the resulting stream is and that the white-box adversary has more power as it can change the subsequent entries in the same vector after seeing how a streaming algorithm behaved in a prefix. In contrast, the white-box adversary in the communication version commits to the entire vector at once. This means that a lower bound on the length of the longest message in a white-box communication protocol implies a lower bound on the space needed by a white-box streaming algorithm. We use these two views interchangeably when we talk about our reduction.
2.2 Reducing to Approximating the Length of the LIS
We now show that a white-box streaming algorithm for estimating the length of the LIS implies a white-box streaming algorithm that solves the game . This means that a lower bound on the latter suffices to show Theorem 1.
For this, for any row and any column , we define a fixed offset and consider the sequence formed by adding the offset to the corresponding entry in both the matrices and . As the original entries in and are just bits, we can define the offsets so that regardless of the original entries, the offsetted entries in any column are decreasing in both and , and any entry in any column exceeds any entry in any preceding column in both and .
Observe that because of our choice of offsets, any increasing subsequence (when the input is viewed as a stream) has length at most . Indeed, as the columns are decreasing in both and , at most one entry can be selected from each column in each matrix, giving a total of at most entries. Moreover, the only way to have a column such that an entry from column of is in the LIS and an entry from column of is in the LIS is if it is the same entry in both the matrices and . Indeed, it must be the same entry as if not, the fact that the columns are decreasing would violate the LIS property. Moreover, given that it is the same entry, the entry for comes before the entry for in the stream, and therefore we must have . Combining, we conclude that when the inputs come from the no case of , the length of the resulting LIS is .
Consider now the case when the inputs come from the yes case of . By the definition of the yes case, we have that for any row , it holds that for at least an fraction of the columns . This makes for a total of such pairs which by the pigeonhole principle means that at least such pairs would be on the same (parallel to the top-left to bottom-right) diagonal. These pairs would each contribute entries to an increasing subsequence, which when combined with one entry from all the columns not participating in these pairs would give an increasing subsequence of length at least .
As we have shown that inputs from the no case of give rise to sequences where the length of the LIS is and inputs from the yes case of give rise to sequences where the length of the LIS is at least , we get that an algorithm that gives a -approximation to the length of the LIS with high probability would solve the game with the same probability, finishing the proof of our reduction.
2.3 A Lower Bound for
We now show that any white-box communication protocol that solves the game will have at least one message with length . To this end, note that while it may seem that being able to decide the input of the players after seeing the randomness used by the players so far gives the adversary a lot of power, this is hardly the case, as the definition of the game itself imposes a lot of restrictions on the inputs the adversary can give the parties.
For example, in the no case of , the adversary needs to ensure that which is possible only if the input of the party is the same as the input of the party for all . This means that, despite being able to see the randomness used by , once the adversary has selected the input for the party , he cannot use that ability to his advantage, as the input for the party must be the same as .
Moreover, the fact that the definition of treats all the rows “separately” with nothing connecting the inputs for different rows (for ) means that selecting the input for the party and then looking at the player’s randomness may not even help the adversary in selecting the inputs for the parties and onwards. Thus, despite the apparent power a white-box adversary has in selecting the input after seeing the parties’ randomness, for the no case in our lower bound, we have to consider very simple adversaries. Specifically, the adversary we consider takes a (balanced) binary error correcting code and samples an independent and uniformly random codeword as the inputs for all the odd parties and gives the same input to the corresponding even parties. Note that the reason we sample a uniformly random codeword instead of a uniformly random bit string is that any two distinct (balanced) codewords would have many coordinates where one of them is larger than the other, a property that we need in the yes case.
We now discuss how our adversary selects the inputs for the parties in the yes case. As the goal of the adversary is to make this case indistinguishable from the no case, the adversary must select the inputs from in this case as well. This means that all the adversary has to do in order to satisfy the promise in the yes case is to ensure that for all , the parties and always receive different codewords in . Then, the fact that they are balanced would imply that the codeword for is strictly smaller than the codeword for in a large fraction of the coordinates. Additionally, all a protocol has to do in order to solve the game is to check whether the codewords for and are the same or not.
While the requirements of the yes case force the adversary to select different codewords for and , the adversary wants to ensure that the party cannot detect that the codewords are different by looking at its input and the message it gets from . These two requirements are incompatible for general protocols: The party can simply send its input to the party who can then detect that their inputs are different by comparing this input with the received message. However, for short protocols where the messages between the parties have length , it is not possible for to send its entire input, and this protocol does not work. Still the question remains whether the adversary can select a different codeword for without it being able to detect that its different.
Selecting the input of .
In order to show that the adversary can always select an input for , we show a general lemma (see Lemma 7) saying that for any random variable such that no value in the support is taken with probability larger than , there exists a way to couple it with an identically distributed random variable such that in the joint distribution, we have with probability . The assumption that no value is taken with probability larger than is tight, as if it is false, there exists such that . By a union bound, we get that contradicting the existence of a coupling for which with probability .
To prove this lemma, we interpret all the values the random variable can take as the vertices in a weighted graph and treat the value as the desired degree of this vertex. Then our goal is assign weights to all the edges in the graph (no self-loops) such that the sum of the weights on the edges incident to any given vertex equals its desired degree. If we can achieve this, then, we can treat the weight on any edge as the probability that and also the probability that , there by defining a coupling between and . As the sum of the weights on the edges incident to any vertex equals its desired degree, we get that and are identically distributed with the desired distribution, and as there are no self-loops, we get that with probability .
We use an iterative “greedy” procedure that adds weight to the edge between two vertices with the highest desired degree, till either the weight adds up to the desired degree for some vertex, or conditioned on the weights already assigned, the probability of the random variable taking some value reaches . In the former case, we can use our procedure again with that vertex removed while in the latter case, we can connect the value whose probability is with all other vertices in a star configuration and the desired weights, finishing the proof.
With this lemma in hand, the adversary looks at the message sent by to and applies the lemma on the distribution of the input of conditioned on this message. We show that, if this distribution assigns probability more than to some input, then it is as if sent that input in its entirety which means the protocol must be long. Otherwise, we can use the lemma, to generate an input for that is guaranteed to be different from the input of and also guaranteed to have the right conditional probability given the message, as desired.
A hybrid argument.
The above analysis shows that one can generate the input of any one party in a way so that it is hard for it to distinguish between the yes case and the no case. To generate the input of all the parties, we use this argument times, once for each . For this, we define adversaries , where for all , the adversary is such that the first rows of the input matrices look like the no case and the remaining look like the yes case. Thus, any two consecutive adversaries only differ in one of the input rows meaning we can apply the argument above and show that any two consecutive adversaries are indistinguishable. Then, by a triangle inequality, we conclude that the adversaries and , corresponding to the yes and no case are also indistinguishable as desired.
For this hybrid argument to work, the simplicity of our adversary above is helpful. Recall that our adversary in the no case treats each row independently, and thus, we can “stitch” together distributions for different copies without worrying about undesired correlations. Finally, for the triangle inequality to be useful, we need that the probability of distinguishing between any two consecutive adversaries is , which we show holds. In fact, we show that the probability is exponentially small in .
3 Model and Preliminaries
For a set , we use to denote the set of distributions supported on .
3.1 The White-Box Streaming/One-Way Communication Model
In the white-box streaming model, an algorithm is defined by a set of states , a designated initial state , a stream length111As our main result is a lower bound, letting the algorithm know the stream length only makes the lower bound stronger. , an alphabet and a transition function to modify its state. The algorithm has access to randomness which is included in the transition function as a bit-string . Such an algorithm runs in the presence of an adversary whose task is to generate an input stream for the algorithm.
The execution of an algorithm in the presence of the adversary takes place as follows: The algorithm starts in the initial state proceeds in steps, where at the beginning of step , the algorithm is in state and has sampled random strings . In step , the adversary, that knows the algorithm and the random strings , generates an input based on its knowledge. The algorithm then samples a random string and to update its state from to and the random string is added to the knowledge of the adversary. The state obtained after steps is the output of the algorithm and number of bits needed to represent a state is the memory of the algorithm.
The white-box, one-way, communication model is similar, except that is now the set of messages in the protocol, is a dummy message, is the number of parties, is the input space for (all the) parties, and is the function the parties use to compute their messages. Note that the identity of the party can be included in and does not need to be given explicitly to the message function. Also, note that when we reduce streaming problems to communication problems, we combine many symbols in the input stream to obtain the input of a single party in the communication model. This will decrease the number of parties and blowup and also limit the power of the adversary who now has to commit to more symbols at once. In particular, lower bounds proved for the communication model will translate to the streaming model as well.
3.2 Technical Lemmas
Lemma 2.
Let and be finite sets. Let , and , be random variables supported on and respectively. Suppose there is a function such that for all , we have . Let be a random variable independent of . For all functions with a finite range, we have that:
Proof.
Let be the range of . For all sets , we have:
(Definition of )
(Independence of and )
(As for all , we have )
3.3 Error Correcting Codes
Lemma 3.
Let be a large enough even integer. There exist a subset of size such that all have Hamming weight222For convenience, we ignore divisibility issues and omit floor/ceiling signs. and for all , the Hamming distance between and is at least .
Proof.
Let be independent and uniformly random elements of . By a Chernoff bound, we have for all that:
By a union bound over all , there exists that satisfy the (fractional) Hamming distance property. For the Hamming weight property, simply concatenate all the with their complement (which preserves the fractional Hamming distance).
4 Reduction to a Communication Game
We now present a communication problem (Less Than Matrix) and show that a low-space white-box streaming algorithm that solves LIS implies a low communication white-box protocol for . The problem is parameterized by an integer and a parameter . We write when we want to make these parameters explicit. The input to consists of two matrices . For all , we shall use and to denote the -th row of and respectively. The goal of is to distinguish between the “no” case and the “yes” case when for all , we have that for at least different values of . If neither case holds, the protocol can output anything.
We consider one-way, -party communication protocols for the problem , where there are parties and and for all , the inputs for party is and the input for party is . The messages go one-way where the first message is sent from to , then second message is from to , and so on till the party computes the output, which we treat as the last message. We now present our reduction.
Lemma 4.
Let and . For all , if there exists a white-box streaming algorithm that computes a -approximation to the length of the LIS in -space with probability , then there exists a one-way, -party, white-box communication protocol that solves the problem with probability where the length of every message is at most bits.
Proof.
Fix a streaming algorithm as in the lemma statement and consider the following one-way communication protocol . As above, the parties send messages in the order where for all , the message sent by parties and is computed as follows: The party interprets the message received by it as the current memory state of the streaming algorithm (or the initial state, if the party is ) and changes the entries of whatever input she gets from the white-box adversary by adding to the -th entry, for all . It then simulates the algorithm on this changed input and sends the resulting memory state to the next party. In case this party is the last party , then the party outputs no if outputs a number at most and outputs yes otherwise.
We claim that the protocol solves the problem . For this, we show that whenever the inputs to the parties belong to the no case of , then they simulate on a sequence with LIS at most and whenever the inputs belong to the yes case of , then they simulate on a sequence with LIS at least . Thus, a -approximation to the length of the LIS is enough to determine the output of .
For this, consider any execution of the protocol and let and be the matrices formed by the inputs of the parties. If these inputs belong to the no case, i.e., if , then for all , the inputs for the parties and are the same. Moreover, for all and all , we have that:
| (As ) | ||||
It follows that, for all , the -th coordinate of the parties’ inputs forms a non-increasing sequence which implies that the length of the LIS of the input stream to the algorithm is at most . Next, consider the case when these inputs belong to the yes case. That is, for all , we have that for at least different values of . Thus, the number of pairs such that is at least . By the pigeonhole principle, at least of these pairs have the same value of .
Let be this value of and let be these values sorted in increasing order of the first coordinate. Define for notational convenience and consider the values:
Observe that after adding our offsets, these values form a subsequence of the sequence is simulated on. We finish the proof by showing that these values are increasing. Because of our choice of these pairs, this follows if we show that, for all and all , we have:
| (As ) | |||
5 The Communication Lower Bound
The goal of this section is to show the following theorem.
Theorem 5.
Let be large enough. Any one-way, -party, white-box communication protocol that solves the problem with probability at least must have at least one message of length at least .
We prove Theorem 5 by contradiction. Fix an and suppose that there is a protocol such that every message in has length at most . Let be the space of all such messages and be an error-correcting code with codewords where each codeword has Hamming weight and the pairwise distance between any two codewords is at least . Such codes are well known to exist and we also include an easy proof of their existence in Lemma 3.
All the white-box adversaries for the protocol that we consider will be such that, for all , the party will get as input an independent, uniformly random element . We will use to denote the random variable corresponding to333This means that we use to denote both the player and the random variable corresponding to the player’s input. . Moreover, for all , the input for the party will be sampled from a distribution that is determined by the message that it receives from the party and the input sampled for the party . In other words, for our purposes, a white-box adversary for the protocol can be defined by a vector of functions where for all , the function is a function mapping a message and an input for the party to a distribution of inputs for the party .
Execution in the presence of an adversary.
An execution of the protocol in the presence of an adversary as above proceeds in steps, where for all , step involves the party sending a message to the party and the party sending a message to the party (or computing the output, if ). In more details, for any , let be the message sent by the party to the party in the previous step (or the empty message, if ). Then, in step , the party gets an independent uniformly random element . This party then uses its private randomness to compute a message , which is sent to the party . Then, the party gets an input sampled from the distribution and uses its private randomness to compute a message , which is sent to the party , and the protocol moves to the next step.
This means that, for the protocol , every adversary induces a (joint) distribution on the tuple consisting of the messages and the inputs of all the parties. We define to be the random variables corresponding to this distribution. We extend this definition to the case by defining those random variables to be some dummy variables that are the same for all . For any random variable , we use the notation to denote its distribution. From the discussion above, we have that, for all adversaries , all , and all pairs , it holds that (letting denote the uniform distribution on ):
| (1) |
The adversary .
For all , let be the function that ignores its first argument and outputs a distribution that is a point mass on its second argument (equivalently, outputs the second argument with probability ). We define the adversary . Observe that any input generated by the adversary is supported only on inputs that belong to the no case of . Throughout this proof, we will often abbreviate the random variables corresponding to by writing instead of , instead of , etc.. We will also abbreviate events of the form to .
Bad messages.
Recall that in all adversaries we consider, for all , the party gets as input an independent and uniformly random element . Thus, each individual input has probability . We define a message to be “bad” if conditioning on increases the probability of some input to a value more than . Formally, the set of all such messages is:
| (2) |
The following lemma shows that messages are unlikely to be bad.
Lemma 6.
For all , it holds that:
Proof.
For all , we have from Equation 2 that there exists such that . As , this means that for all . As , we get:
5.1 Auxiliary Distributions
Intuitively, Equation 2 says that a message is not in only if the distribution is such that the probabilities of all are most . For such distributions, we show the following lemma (stated for general functions).
Lemma 7.
Let be a finite, non-empty set and be a non-negative function such that . There exists a symmetric, non-negative function such that for all , we have and .
Proof.
Define for convenience. Proof by induction on . The base case is trivial. For the case , note that implies that takes a fixed value, say , on all of . In this case, defining for all satisfies the conditions of the lemma.
We show the result for assuming it holds for smaller values. As , we can sort all the elements in increasing order of and define to be the three largest values in this order (breaking ties arbitrarily). Assume for now that . In this case, define the function according to the following matrix:
It can be verified that satisfies the conditions of the lemma. Now, assume that . In this case, we apply the induction hypothesis on the set and the function that matches everywhere except that it is not defined on and that . As , we have that and satisfy the condition in the induction hypothesis and we get a non-negative function such that for all , we have and . Now, define the function to match when neither of its arguments equals . Otherwise, we define for all . Observe that this satisfies the conditions of the lemma.
Observe that when Lemma 7 is applied to a function that is a probability distribution over , then the resulting function represents a probability distribution on .
Definition 8 (Auxiliary distribution).
Let be a finite, non-empty set and be a probability distribution on such that . Let be the probability distribution (on the set ) obtained by applying Lemma 7 on . For all , we will use to denote the distribution conditioned on the first element being .
5.2 White-Box Adversaries
We now define the other white-box adversaries we consider. For this, we define functions where each such function is a map from to . Let . We define the function to be such that for all and all , it outputs the distribution if and otherwise, it outputs a uniformly random element in . Observe that the fact that means that the former is well-defined. Using these, for all , we define the adversary . Observe that and that the following holds:
Observation 9.
Let . Then, it holds that .
Lemma 10.
For all , we have that:
Proof.
To show the lemma, we show that the term
upper bounds the expression on the left and lower bounds the expression on the right. For the former, we apply Lemma 2 with , and , and , the private randomness of the party . Observe that the condition in Lemma 2 is satisfied as for all , we have and we also have from Equation 1. These are the same by Observation 9.
For the latter, we use , and , and , the private randomness of the party when applying Lemma 2. Observe that the condition in Lemma 2 is satisfied as for all , we have that both distributions and are just the uniform distribution over , and hence identical.
Lemma 11.
For all , we have that:
Proof.
Fix an arbitrary and observe that the case is trivial. Moreover, due to Lemma 10, it suffices to consider the case . For this, we will show that:
This suffices as then we use Lemma 2 with , and , and , the private randomness of the party . The condition in Lemma 2 is satisfied as and are just the second coordinates of and respectively and Lemma 2 then finishes the proof. To see why the foregoing equation holds, note from the definition of and Equation 1 that . This means that it suffices to show that:
Next, note that the behavior of the adversaries , , and is the same for parties up to and including . This means that and it suffices to show that:
| (3) |
To show Equation 3, note that for all such that , we have that:
| (Equation 1) | |||
| (Definition of and as ) | |||
| (As ) | |||
| (Lemmas 7 and 8) | |||
With this equation, Equation 3 now follows as we get:
| (As ) | |||
5.3 Finishing the Proof
Observe that the adversary always satisfies for all and thus, always generates inputs in the no case of . We now show that the adversary always generates inputs in the yes case of . As and only has codewords with Hamming weight and the pairwise distance between any two codewords is at least , it suffices to show that the adversary always satisfies . This is because of the definition of and Lemma 7. However, applying Lemma 11 with and using the triangle inequality, we get that:
As is large enough, it follows that either the probability that the last message (which is the output) of the protocol is yes when the adversary is is at most or the probability that the last message of the protocol is yes when the adversary is is at least . Theorem 5 follows straightforwardly in either case.
Finishing the proof of Theorem 1.
References
- [1] Miklós Ajtai, Vladimir Braverman, T. S. Jayram, Sandeep Silwal, Alec Sun, David P. Woodruff, and Samson Zhou. The white-box adversarial data stream model. In International Conference on Management of Data (PODS), pages 15–27, 2022. doi:10.1145/3517804.3526228.
- [2] David Aldous and Persi Diakonis. Longest increasing subsequences: From patience sorting to the baik–deift–johansson theorem. Bull. Amer. Math. Soc. (N. S.), 36:413–432, 1999.
- [3] Alexandr Andoni, Negev Shekel Nosatzki, Sandip Sinha, and Clifford Stein. Estimating the longest increasing subsequence in nearly optimal time. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 708–719, 2022. doi:10.1109/FOCS54457.2022.00073.
- [4] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 1–16. ACM, 2002. doi:10.1145/543613.543615.
- [5] Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, , and Eylon Yogev. A framework for adversarially robust streaming algorithms. Journal of the ACM, 69(2), 2022. doi:10.1145/3498334.
- [6] Amit Chakrabarti. A note on randomized streaming space bounds for the longest increasing subsequence problem. Information Processing Letters, 2012.
- [7] Kuan Cheng, Alireza Farhadi, MohammadTaghi Hajiaghayi, Zhengzhong Jin, Xin Li, Aviad Rubinstein, Saeed Seddighin, and Yu Zheng. Streaming and small space approximation algorithms for edit distance and longest common subsequence. In 48th International Colloquium on Automata, Languages, and Programming, ICALP, volume 198 of LIPIcs, pages 54:1–54:20, 2021. doi:10.4230/LIPIcs.ICALP.2021.54.
- [8] Funda Ergün and Hossein Jowhari. On the monotonicity of a data stream. Comb., 35(6):641–653, 2015. doi:10.1007/S00493-014-3035-1.
- [9] Ying Feng, Aayush Jain, and David P. Woodruff. Fast white-box adversarial streaming without a random oracle. In Forty-first International Conference on Machine Learning, ICML, 2024.
- [10] Ying Feng and David P. Woodruff. Improved algorithms for white-box adversarial streams. In International Conference on Machine Learning (ICML), volume 202, pages 9962–9975, 2023. URL: https://proceedings.mlr.press/v202/feng23d.html.
- [11] Michael L. Fredman. On computing the length of longest increasing subsequences. Discrete Math., 11(1):29–35, January 1975. doi:10.1016/0012-365X(75)90103-X.
- [12] Anna Gál and Parikshit Gopalan. Lower bounds on streaming algorithms for approximating the length of the longest increasing subsequence. SIAM Journal on Computing, 2010.
- [13] Paweł Gawrychowski and Wojciech Janczewski. Fully dynamic approximation of lis in polylogarithmic time. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 654–667, 2021.
- [14] Parikshit Gopalan, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Estimating the sortedness of a data stream. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 318–327. SIAM, 2007. URL: http://dl.acm.org/citation.cfm?id=1283383.1283417.
- [15] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, MA, 1997. doi:10.1017/CBO9780511574931.
- [16] Haim Kaplan, Yishay Mansour, Kobbi Nissim, and Uri Stemmer. Separating adaptive streaming from oblivious streaming using the bounded storage model. In Advances in Cryptology - 41st Annual International Cryptology Conference, CRYPTO Proceedings, Part III, volume 12827 of Lecture Notes in Computer Science, pages 94–121, 2021. doi:10.1007/978-3-030-84252-9_4.
- [17] Tomasz Kociumaka and Saeed Seddighin. Improved dynamic algorithms for longest increasing subsequence. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 640–653, 2021. doi:10.1145/3406325.3451026.
- [18] Xin Li and Yu Zheng. Lower bounds and improved algorithms for asymmetric streaming edit distance and longest common subsequence. In 41st IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS, volume 213 of LIPIcs, pages 27:1–27:23, 2021. doi:10.4230/LIPIcs.FSTTCS.2021.27.
- [19] Xin Li and Yu Zheng. Streaming and query once space complexity of longest increasing subsequence. In Computing and Combinatorics: 29th International Conference, COCOON, pages 29–60, 2023. doi:10.1007/978-3-031-49190-0_3.
- [20] S. Muthukrishnan. Data Streams: Algorithms and Applications. Found. Trends Theor. Comput. Sci., Now Publisher, Hanover, MA, 2005.
- [21] Pavel Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge, MA, 2003.
- [22] Aviad Rubinstein, Saeed Seddighin, Zhao Song, and Xiaorui Sun. Approximation algorithms for lcs and lis with truly improved running times. In Symposium on Foundations of Computer Science (FOCS), 2019.
- [23] Michael E. Saks and C. Seshadhri. Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1698–1709, 2013. doi:10.1137/1.9781611973105.122.
- [24] Michael E. Saks and C. Seshadhri. Estimating the longest increasing sequence in polylogarithmic time. SIAM J. Comput., 46(2):774–823, 2017. doi:10.1137/130942152.
- [25] Xiaoming Sun and David P. Woodruff. The communication and streaming complexity of computing the longest common and increasing subsequences. In Symposium on Discrete Algorithms (SODA), 2007.
