Extensions of Self-Improving Sorters

Ailon et al. (SIAM J. Comput. 40(2):350–375, 2011) proposed a self-improving sorter that tunes its performance to an unknown input distribution in a training phase. The input numbers $x_1, x_2, \ldots, x_n$ come from a product distribution, that is, each $x_i$ is drawn independently from an arbitrary distribution $\mathcal{D}_i$. We study two relaxations of this requirement. The first extension models hidden classes in the input: we consider the case that numbers in the same class are governed by linear functions of the same hidden random parameter. The second extension considers a hidden mixture of product distributions.


Introduction
Self-improving algorithms, proposed by Ailon et al. [1], can tune their computational performance to the input distribution. There is a training phase in which the algorithm learns certain input features and computes some auxiliary structures. After the training phase, the algorithm uses these auxiliary structures in the operation phase to obtain an expected time complexity that is no worse, and possibly smaller, than the best worst-case complexity known. The expected time complexity in the operation phase is called the limiting complexity. This computational model addresses two issues. First, the worst-case scenario may not happen, so the best time complexity for the input encountered may be smaller than the worst-case optimal bound. Second, previous efforts for mitigating worst-case scenarios often consider average-case complexities, and the input distributions are assumed to be simple distributions like Gaussian, uniform, Poisson, etc., whose parameters are given beforehand. In contrast, Ailon et al. only assume that individual input items are independently distributed, while the distribution of an input item can be arbitrary. No other information is needed.
The problems of sorting and two-dimensional Delaunay triangulation are studied by Ailon et al. [1]. An input instance $I$ for the sorting problem has $n$ numbers. The $i$-th number $x_i$ is drawn independently from a hidden distribution $\mathcal{D}_i$. The joint distribution $\prod_{i=1}^{n}\mathcal{D}_i$ is called a product distribution. Let $\pi(I)$ denote the sequence of the ranks of the $x_i$'s, which is a permutation of $[n]$. It is shown that for any $\varepsilon \in (0,1)$, there is a self-improving algorithm with limiting complexity $O(\varepsilon^{-1}(n + H_\pi))$, where $H_\pi$ is the entropy of the distribution of $\pi(I)$. By Shannon's theory [4], any comparison-based sorting algorithm requires $\Omega(n + H_\pi)$ expected time. The self-improving sorter uses $O(n^{1+\varepsilon})$ space. The training phase processes $O(n^\varepsilon)$ input instances in $O(n^{1+\varepsilon})$ time, and it succeeds with probability at least $1 - 1/n$, i.e., the probability of achieving the desired limiting complexity is at least $1 - 1/n$. For two-dimensional Delaunay triangulations, Ailon et al. also obtained an optimal limiting complexity for product distributions.
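As a concrete illustration of the quantity $H_\pi$, the entropy of the rank permutation can be estimated empirically from a collection of training instances. The sketch below is illustrative only and is not part of the algorithm of Ailon et al.; the function names are our own.

```python
import math
from collections import Counter

def rank_permutation(instance):
    """Return the permutation of ranks pi(I) for an input instance I."""
    order = sorted(range(len(instance)), key=lambda i: instance[i])
    ranks = [0] * len(instance)
    for r, i in enumerate(order):
        ranks[i] = r
    return tuple(ranks)

def empirical_entropy(instances):
    """Estimate H_pi (in bits) from sample instances by counting
    how often each rank permutation occurs."""
    counts = Counter(rank_permutation(I) for I in instances)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For instance, two equally frequent orderings of two numbers give an estimated entropy of one bit.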
Subsequently, Clarkson et al. [3] developed self-improving algorithms for two-dimensional coordinatewise maxima and convex hulls, assuming that the input comes from a product distribution. The limiting complexities for the maxima and the convex hull problems are $O(\mathrm{Opt}_M + n)$ and $O(\mathrm{Opt}_C + n\log\log n)$, where $\mathrm{Opt}_M$ and $\mathrm{Opt}_C$ are the expected depths of optimal linear decision trees for the maxima and convex hull problems, respectively.
On one hand, the product distribution requirement is very strong; on the other hand, Ailon et al. showed that $\Omega(2^{n\log n})$ bits of storage are necessary for optimal sorting if the $n$ numbers are drawn from an arbitrary distribution. We study two extensions of the input model that are natural and yet possess enough structure for efficient self-improving algorithms to be designed.
The first extension models the situation in which some input elements depend on each other. We consider a hidden partition of the input $I = (x_1, \ldots, x_n)$ into classes $S_k$'s. The input numbers in a class $S_k$ are distinct linear functions of the same hidden random parameter $z_k$. The distributions of the $z_k$'s are arbitrary, and each $z_k$ is drawn independently. We call this model a product distribution with hidden linear classes. Our first result is a self-improving sorter with optimal limiting complexity under this model.

Theorem 1.1 For any $\varepsilon \in (0,1)$, there exists a self-improving sorter for any product distribution with hidden linear classes that has a limiting complexity of $O(n/\varepsilon + H_\pi/\varepsilon)$. The storage needed by the operation phase is $O(n^2)$. The training phase processes $O(n^\varepsilon)$ input instances in $O(n^2\log^3 n)$ time and $O(n^2)$ space. The success probability is at least $1 - 1/n$.
In the second extension, the distribution of $I$ is a mixture $\sum_{q=1}^{\kappa}\lambda_q\mathcal{D}_q$, where $\kappa$ and the $\lambda_q$'s are hidden, and every $\mathcal{D}_q$ is a hidden product distribution of $n$ real numbers. In other words, over a large collection of input instances, for all $q \in [1, \kappa]$, a fraction $\lambda_q$ of them are expected to be drawn from $\mathcal{D}_q$. Although $\kappa$ is unknown, we are given an upper bound $m$ of $\kappa$. We call this model a hidden mixture of product distributions. Our second result is a self-improving sorter under this model.

Theorem 1.2 For any $\varepsilon \in (0,1)$, there is a self-improving sorter for any hidden mixture of at most $m$ product distributions that has a limiting complexity of $O((n\log m)/\varepsilon + H_\pi/\varepsilon)$. The storage needed by the operation phase is $O(mn + m^\varepsilon n^{1+\varepsilon})$. The training phase processes $O(mn\log(mn))$ input instances in $O(mn\log^2(mn) + m^\varepsilon n^{1+\varepsilon}\log(mn))$ time using $O(mn\log(mn) + m^\varepsilon n^{1+\varepsilon})$ space. The success probability is at least $1 - 1/(mn)$.
In the interesting special case of $m = O(1)$, the limiting complexity is $O(n/\varepsilon + H_\pi/\varepsilon)$, which is optimal.

Hidden Linear Classes
There is a hidden partition of $[n]$ into classes. For every $i \in [1, n]$, the distribution of $x_i$ is degenerate if $x_i$ is equal to a fixed value. Each such $x_i$ will be recognized in the training phase. For the remaining $i$'s, the distributions of the $x_i$'s are non-degenerate, and we use $S_1, \ldots, S_g$ to denote the hidden classes formed by them. Numbers in the same class $S_k$ are generated by linear functions of the same hidden random parameter $z_k$. Different classes are governed by different random parameters. We know that the functions are linear, but no other information is given to us.

Let $\mathcal{D}_k$ denote the distribution of $z_k$. There is a technical condition that is required of the $\mathcal{D}_k$'s: there exists a constant $\rho \in (0,1)$ such that for every $k \in [1, g]$ and every $c \in \mathbb{R}$, $\Pr[z_k = c] \le 1 - \rho$. This condition says that $\mathcal{D}_k$ does not concentrate too much on any single value, which is quite a natural phenomenon. Our algorithm does not need to know $\rho$, but $\rho$ affects the probabilistic guarantees on the correctness and limiting complexity. The input size must be at least $e^{3/\rho^2}$ for Theorem 1.1 to hold.

Learning the Linear Classes
We learn the classes and the linear functions using $3\ln^2 n$ input instances. Denote these instances by $I_1, I_2, \ldots, I_{3\ln^2 n}$. Let $x^{(a)}_i$ denote the $i$-th input number in $I_a$. We first recognize the degenerate distributions by checking which $x_i$'s take the same value in all of these instances; applying the union bound establishes that this step succeeds with high probability (Lemma 2.1).
Assume that the degenerate distributions are taken out of consideration. If $i$ and $j$ belong to the same class $S_k$, then $x^{(a)}_i$ and $x^{(a)}_j$ are linearly related as $a$ varies. Conversely, if $i$ and $j$ belong to different classes, it is highly unlikely that $x^{(a)}_i$ and $x^{(a)}_j$ remain linearly related as $a$ varies, because they are governed by independent random parameters. We check whether the triple of points $(x^{(3a-2)}_i, x^{(3a-2)}_j)$, $(x^{(3a-1)}_i, x^{(3a-1)}_j)$, and $(x^{(3a)}_i, x^{(3a)}_j)$ is collinear for every $a \in [1, \ln^2 n]$ and every distinct pair of $i$ and $j$ from $[1, n]$. We quantify this intuition in the following result.

Lemma 2.2 Let $i$ and $j$ be two distinct indices in $[1, n]$ that belong to different classes. Assume that $n \ge e^{3/\rho^2}$. Then the probability that the points $(x^{(3a-2)}_i, x^{(3a-2)}_j)$, $(x^{(3a-1)}_i, x^{(3a-1)}_j)$, and $(x^{(3a)}_i, x^{(3a)}_j)$ are collinear for every $a \in [1, \ln^2 n]$ is at most $n^{-3}$.

Proof (sketch) The instances $I_b$ and $I_{b'}$ are independent for all $b \ne b'$; in particular, $x_j$ in one instance $I_b$ does not influence $x_j$ in a different instance $I_{b'}$. Let $E^{(3a)}_{ij}$ denote the event that the triple of points above is not collinear. Condition on fixed values $x^{(3a-2)}_i = c_1$ and $x^{(3a-1)}_i = c_2$. Since $i$ and $j$ are in different classes, $x^{(3a)}_j$ is governed by a random parameter independent of the one governing $x^{(3a)}_i$; by the non-concentration condition on the $\mathcal{D}_k$'s, the probability of $E^{(3a)}_{ij}$ conditioned on these fixed values is at least $\rho^2$. The events $E^{(3a)}_{ij}$ for distinct $a$ are independent of each other. Therefore, the probability that all triples are collinear is at most $(1-\rho^2)^{\ln^2 n}$. Since $n \ge e^{3/\rho^2}$, we get $(1-\rho^2)^{\ln^2 n} \le e^{-\rho^2\ln^2 n} \le e^{-3\ln n} = n^{-3}$, establishing the lemma.
By Lemmas 2.1 and 2.2 and the union bound, we can generate the classes based on collinearity in $O(n^2\log^3 n)$ time. The classification is correct with probability at least $1 - 1/n$. We label the classes as $S_1, S_2$, and so on. We use $g$ to denote the number of classes identified.
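The collinearity-based classification can be sketched as follows. This is a simplified illustration (floating-point tolerance aside) that assumes the degenerate indices have already been removed; the function names are ours, and a union-find structure stands in for the grouping step.

```python
def collinear(p, q, r, eps=1e-9):
    # zero cross product <=> the three points are collinear
    return abs((q[0]-p[0])*(r[1]-p[1]) - (q[1]-p[1])*(r[0]-p[0])) <= eps

def learn_classes(instances):
    """Group indices i, j into the same class when the points
    (x_i, x_j) from every triple of training instances are collinear."""
    n = len(instances[0])
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            ok = all(
                collinear((instances[a][i],   instances[a][j]),
                          (instances[a+1][i], instances[a+1][j]),
                          (instances[a+2][i], instances[a+2][j]))
                for a in range(0, len(instances) - 2, 3))
            if ok:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())
```

For example, if $x_1 = 2z_1 + 1$ and $x_2 = -z_1$ share the parameter $z_1$ while $x_3$ follows an independent parameter, the sketch groups indices 1 and 2 together and leaves index 3 alone (with high probability over the draws).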

Lemma 2.3
Assume that $n \ge e^{3/\rho^2}$. Using $3\ln^2 n$ input instances, we can correctly identify all linear classes in $O(n^2\log^3 n)$ time and $O(n\log^2 n)$ space with probability at least $1 - 1/n$.

Structures for the Operation Phase
In addition to learning the linear classes, we need to construct a data structure in the training phase that will allow the operation phase to run efficiently.We first give an overview of what this data structure will do.
The construction and operation of this data structure require the determination of a sorted sequence of values $v_1 \le v_2 \le \cdots \le v_n$, called the V-list, where $v_0$ and $v_{n+1}$ denote $-\infty$ and $\infty$, respectively. They divide the real line into $n+1$ intervals $[v_r, v_{r+1})$ for $r \in [0, n]$, where we use $[v_0, v_1)$ to denote $(-\infty, v_1)$. For every input instance $I = (x_1, x_2, \ldots, x_n)$ in the operation phase, the data structure supports the following three operations.

F1: For every class $S_k$, retrieve the sorted order of the numbers in $I$ with indices in $S_k$. Denote this sorted order as $\sigma_k$.

F2: For every class $S_k$, every $i \in S_k$, and every number $x_i \in I$, determine the largest $v_r$ in the V-list that is less than or equal to $x_i$.

F3: For every interval $[v_r, v_{r+1})$, collect the non-empty sorted lists $\sigma_k \cap [v_r, v_{r+1})$ over all classes $S_k$; denote this collection of lists by $Z_r$.

We describe how to compute the V-list and the data structure in the following.

V-list
The determination of the V-list requires taking another $\ln n$ input instances. Sort all numbers in these instances into one sorted list $L$. Then, for $i \in [1, n]$, $v_i$ in the V-list is the number of rank $i\ln n$ in $L$. Note that if the distribution of $x_i$ is degenerate, the same $x_i$ appears $\ln n$ times in the sorted list $L$, which implies that $x_i$ must be selected to be an element of the V-list.
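A minimal sketch of this V-list construction, under the assumption that the training instances are given as lists of equal length (the function name is ours):

```python
def build_v_list(instances):
    """Sort all numbers from roughly ln(n) training instances into one
    pooled list and take every t-th one (rank i*t for i = 1..n, where
    t is the number of instances)."""
    n = len(instances[0])
    t = len(instances)                 # intended to be about ln n
    pooled = sorted(x for I in instances for x in I)
    # v_0 = -inf and v_{n+1} = +inf bound the n selected pivots
    return ([float('-inf')]
            + [pooled[i * t - 1] for i in range(1, n + 1)]
            + [float('inf')])
```

With three instances of three numbers each, the pivots are the ranks 3, 6, and 9 of the pooled list.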
The data structure is based on the following arrangements of lines and their refinement into vertical slabs.
• For each class $S_k$, fix an arbitrary index $s_k \in S_k$. For each $i \in S_k$, we associate with $i$ the equation of the line $\ell_i$ that expresses $x_i$ as a linear function of $x_{s_k}$. This can be done by computing the equation of the support line through the points $(x^{(a)}_{s_k}, x^{(a)}_i)$ and $(x^{(b)}_{s_k}, x^{(b)}_i)$ for two arbitrary, distinct input instances $I_a$ and $I_b$ in $O(1)$ time. The total processing time over all classes is $O(n)$.

• For every class $S_k$, let $A_k$ be the arrangement formed by the $n$ horizontal lines induced by $v_1, v_2, \ldots, v_n$ and the lines $\ell_i$ for all $i \in S_k$. The size of $A_k$ is $O(n|S_k|)$.

• Draw vertical lines through the vertices of $A_k$. Two adjacent vertical lines bound a vertical slab. Denote by $W_k$ the set of slabs obtained. The size of $W_k$ is $O(n|S_k|)$.

Within each slab in $W_k$, each line $\ell_i$ in $A_k$ lies between two consecutive values $v_r$ and $v_{r+1}$, i.e., $v_r$ is the predecessor of $\ell_i$ in the V-list. Moreover, the bottom-to-top order of the lines for $S_k$ is fixed within a slab.
We compute A k and store W k as a collection of ordered lists of lines as follows.
2. Each slab in $W_k$ is represented as a list of the lines for $S_k$ ordered from bottom to top. Each line $\ell_i$ is associated with its predecessor $v_r$ in the V-list within the slab. These ordered lists of lines for $W_k$ are stored in a persistent search tree [5] in order to save storage and processing time. A persistent search tree is a collection of balanced search trees of different versions. Given a tree of a specific version, it can be searched in logarithmic time. When the first version is constructed, it is just an ordinary balanced search tree. When an update (including insertion, deletion, and changing the content of a node) on the current version is specified, instead of modifying the current version, a new version is generated that incorporates the update. Each update uses $O(1)$ extra amortized space and takes logarithmic time.
The construction of the persistent search tree for $W_k$ is done as follows.

3. Initialize the first version of the search tree to store the lines for $S_k$ in the leftmost slab of $W_k$ in decreasing order of their slopes (which is the same as the bottom-to-top order). Lines with positive slopes are labelled with $v_0$ as their predecessors in this slab.

4. Given an input instance $I$ in the operation phase, we need to provide fast access to different versions of the persistent search tree for all classes. This is done as follows.
(a) Take another $n^\varepsilon$ input instances for any choice of $\varepsilon \in (0,1)$. For every class $S_k$, record the frequencies of $x_{s_k}$ falling into the slabs in $W_k$ among these $n^\varepsilon$ instances (via binary search among the slabs). Then, for every class $S_k$, we build a binary search tree $T_k$ on these slabs whose expected search time is asymptotically optimal with respect to the recorded frequencies. Each $T_k$ has $O(n|S_k|)$ nodes and can be constructed in $O(n|S_k|)$ time [6, 8].

(b) Each node in $T_k$ corresponds to a slab in $W_k$. We associate with this node a pointer to the version of the persistent search tree for the corresponding slab. A very low frequency cannot give a good estimate of the probability distribution of $x_{s_k}$, so navigating down $T_k$ to a node of very low frequency may be too time-consuming. Thus, if a search of $T_k$ reaches a node at depth below $\frac{\varepsilon}{3}\log_2 n$, we answer the query by performing a binary search among the slabs in $W_k$, which takes $O(\log n)$ time. Note that the slab also stores a pointer to the corresponding version of the persistent search tree.
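The paper builds the trees $T_k$ with the classical nearly optimal binary-search-tree constructions of [6, 8]. As a hedged illustration of the idea only, the following sketch uses a simpler weight-balancing heuristic (choose a root that roughly balances the subtree frequencies), which also keeps the expected search depth within a constant factor of the entropy of the frequency distribution; it is not the construction used in the paper, and the function name is ours.

```python
def build_weighted_bst(keys, freqs):
    """Build a nearly optimal BST over sorted keys: place at the root
    the key that best balances the left and right subtree weights.
    Returns nested tuples (key, left_subtree, right_subtree)."""
    def build(lo, hi):
        if lo > hi:
            return None
        total = sum(freqs[lo:hi + 1])
        left_weight = 0
        best, best_diff = lo, float('inf')
        for k in range(lo, hi + 1):
            # weight difference if keys[k] becomes the root
            diff = abs(left_weight - (total - left_weight - freqs[k]))
            if diff < best_diff:
                best, best_diff = k, diff
            left_weight += freqs[k]
        return (keys[best], build(lo, best - 1), build(best + 1, hi))
    return build(0, len(keys) - 1)
```

With frequencies heavily concentrated on one key, that key ends up at the root, so the frequent searches terminate quickly.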
We explain how to use the data structure to support the operations F1, F2 and F3 described earlier.
Let $I = (x_1, x_2, \ldots, x_n)$ be an input instance in the operation phase. For every class $S_k$, we query $T_k$ with $x_{s_k}$ to find the slab in $W_k$ whose span of $x$-coordinates contains $x_{s_k}$. This provides access to the version of the persistent search tree for that slab. Denote this version by $T$. An inorder traversal of $T$ gives the sorted order of the lines $\ell_i$'s for all $i \in S_k$ in $O(|S_k|)$ time. Each line $\ell_i$ stores its predecessor $v_r$ in the V-list. The above handles F1 and F2. Consider F3. For $k = 1, 2, \ldots, g$, we walk through the sorted list of lines $\ell_i$'s in $S_k$ produced by the inorder traversal of $T$, and for each $\ell_i$ encountered in the traversal, let $v_r$ be the predecessor of $\ell_i$, and we append $x_i$ to the list in $Z_r$ under construction, i.e., the list that represents $\sigma_k \cap [v_r, v_{r+1})$. Afterwards, we scan all intervals and output $\sigma_k \cap [v_r, v_{r+1})$ for all $k$ and $r$.
We summarize the above processing in the following result.
Lemma 2.4 Assume that the hidden classes $S_1, S_2, \ldots, S_g$ have been determined.

(i) Using $\ln n$ input instances, we can set the V-list so that $v_i$ is the number of rank $i\ln n$ in the sorted list of all numbers in the $\ln n$ input instances.

(ii) Given the V-list, there is a data structure that performs functions F1, F2, and F3 in $O(E + n)$ expected time for every input instance in the operation phase, where $E$ is the total expected time to query the $T_k$'s. The data structure uses $O(n^2)$ space and can be constructed in $O(n^2\log n)$ time using $n^\varepsilon$ input instances.

Operation Phase
Given an input instance I = (x 1 , . . ., x n ), the operation phase proceeds as follows.
1. During the construction of the V-list in the training phase, for each $x_i$ that is degenerately distributed, $x_i$ must appear $\ln n$ times when we sort the concatenation of the $\ln n$ input instances. Therefore, for each degenerately distributed $x_i$, there is a unique $v_r$ in the V-list that is equal to $x_i$, and we mark $v_r$.

2. Use Lemma 2.4(ii) to determine, for every class $S_k$, the sorted sequence $\sigma_k$ of numbers belonging to $S_k$, and, for every interval $[v_r, v_{r+1})$, the collection $Z_r$ of sorted lists.

3. For every interval $[v_r, v_{r+1})$, merge all lists in $Z_r$ into one sorted list. The merging is facilitated by a min-heap that stores the next element from each list in $Z_r$. Thus, each step of the merging takes $O(\log|Z_r|)$ time.

4. Finally, we concatenate in $O(n)$ time the marked $v_r$'s and the merged lists for all $Z_r$'s to form the output sorted list.
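Step 3, the min-heap merge of the lists in one $Z_r$, can be sketched as a standard $k$-way merge (we spell out the heap steps rather than use `heapq.merge` to make the $O(\log|Z_r|)$ cost per step visible; the function name is ours):

```python
import heapq

def merge_sorted_lists(lists):
    """k-way merge of sorted lists via a min-heap that holds the next
    element of each list; each pop/push costs O(log k)."""
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, idx, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(lists[idx]):
            heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
    return out
```

For example, merging `[[1, 4], [2, 3], [5]]` yields `[1, 2, 3, 4, 5]`.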
Correctness is obvious. The limiting complexity has two main components: first, the sum of expected query times of all $T_k$'s in Lemma 2.4(ii); second, the total time spent on merging the lists in $Z_r$ for $r \in [0, n]$. The remaining processing time is $O(n)$. We give the analysis in the next section to show that the first two components sum to $O(n/\varepsilon + H_\pi/\varepsilon)$. Recall that $\pi(I)$ is the sequence of the ranks of numbers in $I$, which is a permutation of $[n]$, and $H_\pi$ is the entropy of the distribution of $\pi(I)$.

Analysis
in this order.Similarly, assign labels n + 2 to 2n + 1 to the input numbers x 1 , . . ., x n in this order.
Define the random variable $B_V$ to be the permutation of the labels that appear from left to right after sorting $\{v_0, \ldots, v_{n+1}\} \cup \{x_1, \ldots, x_n\}$ in increasing order.

For each $k \in [1, g]$, define a random variable $B_{V_k}$ to be the permutation of the labels that appear from left to right after performing the following operations: (1) sort $\{v_0, \ldots, v_{n+1}\} \cup \{x_i : i \in S_k\}$ in increasing order, and (2) remove all $v_r$'s that do not immediately precede some $x_i$'s in the sorted list. Let $H_{V_k}$ denote the entropy of the distribution of $B_{V_k}$. Determining $B_{V_k}$ takes at least $H_{V_k}$ expected time by Shannon's theory [4].
Our algorithm uses Lemma 2.4(ii) to construct $\sigma_k \cap [v_r, v_{r+1})$ for all $k$ and $r$ in $O(E + n)$ expected time, where $E$ is the total expected time to query the $T_k$'s. Then, $|Z_r|$ is the number of classes that have numbers falling into $[v_r, v_{r+1})$. As shown in Lemma 3.4 in [1] and the discussion that immediately follows its proof, the expected query complexity of $T_k$ is $O(H_{V_k}/\varepsilon)$. The limiting complexity is thus equal to

$$O\Bigl(n + \frac{1}{\varepsilon}\sum_{k=1}^{g} H_{V_k} + \sum_{r=0}^{n} E\bigl[|Z_r|\log|Z_r|\bigr]\Bigr). \qquad (2)$$

We bound the two sums in (2) in the rest of this section.

Lemma 2.7 $\sum_{k=1}^{g} H_{V_k} = O(n + H_\pi)$.

Proof Suppose that we are given a setting of $B_V$, i.e., the permutation of labels from left to right in the sorted order of $\{v_0, \ldots, v_{n+1}\} \cup \{x_1, \ldots, x_n\}$. We scan the sorted list from left to right and maintain the most recently scanned $v_r$. Suppose that we see a number $x_i$. Let $S_k$ be the class to which $x_i$ belongs. If this is the first time that we encounter an index in $S_k$ after seeing $v_r$, we initialize an output list for $B_{V_k}$ that contains the label of $v_r$ followed by the label of $x_i$. If this is not the first time that we encounter an index in $S_k$ after seeing $v_r$, we append the label of $x_i$ to the output list for $B_{V_k}$. Clearly, we obtain the settings of all $B_{V_k}$'s correctly from $B_V$. The number of comparisons needed is $O(n)$. Therefore, Lemmas 2.5 and 2.6 imply that $\sum_{k=1}^{g} H_{V_k} = O(n + H(B_V))$. Given $(I, \pi(I))$, we use $\pi(I)$ to sort $I$ and then merge the sorted order with $(v_0, \ldots, v_{n+1})$. Afterwards, we scan the sorted list to output the labels of the numbers. This gives the setting of $B_V$. Clearly, $O(n)$ comparisons suffice, and so Lemma 2.6 implies that $H(B_V) = O(n + H_\pi)$.

Lemma 2.7 takes care of the first term in (2). We will show that the second term in (2) is $O(n)$ with high probability. We first prove that $E[|Z_r|] = O(1)$ for all $r \in [0, n]$ with high probability. Our proof is modeled after the proof of a similar result in [1]. There is a small twist due to the handling of the classification.

Lemma 2.8 It holds with probability at least $1 - 1/n$ that for all $r \in [0, n]$, $E[|Z_r|] = O(1)$.
Proof Let $I_1, \ldots, I_{\ln n}$ denote the input instances used in the training phase for building the V-list. Let $y_1, y_2, \ldots, y_{n\ln n}$ denote the sequence formed by concatenating $I_1, \ldots, I_{\ln n}$ in this order. We adopt the notation that for each $\alpha \in [1, n\ln n]$, $y_\alpha$ belongs to the class $S_{k_\alpha}$ and the input instance $I_{a_\alpha}$. Since we take every $\ln n$-th number in forming the V-list, we want to discuss the probability of $Y^\beta_\alpha > \ln n$, where the random variable $Y^\beta_\alpha$ is defined below. For every $r \in [0, n+1]$, let $y_{\alpha_r}$ denote $v_r$, where $y_{\alpha_0} = -\infty$ and $y_{\alpha_{n+1}} = \infty$. Fix a particular $r \in [0, n+1]$. By construction, there are at most $\ln n$ numbers among $y_1, \ldots, y_{n\ln n}$ that fall in $[v_r, v_{r+1})$, which guarantees the event $Y^{\alpha_{r+1}}_{\alpha_r} \le \ln n$. Let $X_{kr}$ be an indicator random variable such that $X_{kr} = 1$ if some element of the input instance that belongs to $S_k$ falls into $[v_r, v_{r+1})$, and $X_{kr} = 0$ otherwise. The random process that generates the input instances is independent of the training phase. It follows that $\ln n\sum_{k=1}^{g} E[X_{kr}] \le E\bigl[Y^{\alpha_{r+1}}_{\alpha_r}\bigr] + 2$, because the index pairs $(a_{\alpha_r}, k_{\alpha_r})$ and $(a_{\alpha_{r+1}}, k_{\alpha_{r+1}})$ are excluded from $J^{\alpha_{r+1}}_{\alpha_r}$.
Lemma 2.9 It holds with probability at least $1 - 1/n$ that $\sum_{r=0}^{n} E\bigl[|Z_r|\log|Z_r|\bigr] = O(n)$.

Proof For each $k$ and $r$, let $n_{kr}$ denote the number of input numbers with indices in $S_k$ that fall into $[v_r, v_{r+1})$, and let $z_r = |Z_r|$ denote the number of classes that have numbers falling into $[v_r, v_{r+1})$. The largest possible values of $n_{kr}$ and $z_r$ are $n$ and $g$, respectively.
The range of $i$ can be reduced to $[1, gn]$ without changing the sum. The last equality follows from the fact that if $j \ne j'$ or $l \ne l'$, then the events $z_r = j \wedge n_{kr} = l$ and $z_r = j' \wedge n_{kr} = l'$ are disjoint. Let $y_{kr}$ be a random variable that counts the number of classes other than $S_k$ that have numbers in $[v_r, v_{r+1})$. In the event of $n_{kr} = l$ for some $l \in [1, n]$, the class $S_k$ has number(s) in $[v_r, v_{r+1})$, implying that $z_r = y_{kr} + 1$. In the last step, the equality of $\Pr[y_{kr} = j \wedge n_{kr} = l]$ and $\Pr[y_{kr} = j] \cdot \Pr[n_{kr} = l]$ follows from the independence of the events $y_{kr} = j$ and $n_{kr} = l$. For all $k \in [1, g]$, $z_r \ge y_{kr}$ by their definitions, and so $E[z_r] \ge E[y_{kr}]$. By Lemma 2.8, it holds with probability at least $1 - 1/n$ that $E[z_r] = O(1)$ for all $r \in [0, n]$, and hence $E[y_{kr}] = O(1)$ as well.

By (2) and Lemmas 2.7 and 2.9, we conclude that the limiting complexity of the sorter is $O(n/\varepsilon + H_\pi/\varepsilon)$ as stated in Theorem 1.1. The $O(n^2)$ space needed by the operation phase follows from Lemma 2.4(ii). In the training phase, the space usage, the number of input instances, and the processing time required follow from Lemmas 2.3 and 2.4. The success probability of $1 - 1/n$ follows from Lemma 2.9. This completes the proof of Theorem 1.1.

Mixture of Product Distributions
Let $\kappa$ be the number of product distributions in the mixture. Although $\kappa$ is hidden, we are given an upper bound $m$ of $\kappa$. Let $\mathcal{D}_q$, $q \in [1, \kappa]$, denote the hidden product distributions in the mixture. The input distribution is $\sum_{q=1}^{\kappa}\lambda_q\mathcal{D}_q$ for some hidden positive $\lambda_q$'s such that $\sum_{q=1}^{\kappa}\lambda_q = 1$.
To facilitate the operation phase, we group the $mn+1$ intervals into $n$ buckets as follows. We group the first $m$ intervals into the first bucket, the next $m$ intervals into the second bucket, and so on. There are $n$ buckets. Each bucket contains $m$ intervals except for the last one, which contains $m+1$ intervals. Each interval keeps a pointer to the bucket that contains it. Also, each bucket is associated with an initially empty van Emde Boas tree [10] with the intervals in that bucket as the universe. Each tree has $O(m)$ size and can be initialized in $O(m)$ time.

Use another $O(m^\varepsilon n^\varepsilon)$ input instances to record the frequency $f_{ir}$ of $x_i$ falling into $[v_r, v_{r+1})$. The frequencies are determined by locating the numbers in these $O(m^\varepsilon n^\varepsilon)$ input instances among the intervals using binary search. The total time needed is $O(m^\varepsilon n^{1+\varepsilon}\log(mn))$. Then, for every $i \in [1, n]$, build an asymptotically optimal binary search tree $T_i$ with respect to the $f_{ir}$'s on the intervals with positive frequencies. Each $T_i$ has $O(m^\varepsilon n^\varepsilon)$ size and can be constructed in $O(m^\varepsilon n^\varepsilon)$ time [6, 8]. If a search of $T_i$ reaches a node at depth below $\frac{\varepsilon}{3}\log_2(mn)$ or is unsuccessful, we answer the query by performing a binary search among the $mn+1$ intervals in $O(\log(mn))$ time.
Let $P_i$ be a random variable indicating the predecessor of $x_i$ in the V-list. Let $H(P_i)$ denote the entropy of the distribution of $P_i$. As shown in [1, Lemma 3.4], querying $T_i$ takes $O(H(P_i)/\varepsilon)$ expected time (including the binary search among the $mn+1$ intervals, if applicable).
We summarize the processing in the training phase in Lemma 3.1. Each bucket keeps an initially empty van Emde Boas tree with the intervals in that bucket as the universe; the processing time and space needed are $O(mn)$. In the operation phase, after sorting the numbers that fall into each interval, we need to concatenate the sorted lists together. One easy way to perform the concatenation is to scan all $mn+1$ intervals from left to right, but this takes $O(mn)$ time. We describe an improvement below.
1. By Lemma 3.1(ii), the $mn+1$ intervals are grouped into $n$ buckets in the training phase. For each bucket $B$, let $U_B$ denote the van Emde Boas tree for $B$, which is initially empty. The universe for $U_B$ is the set of intervals in $B$. We merge the $N_r$'s for the intervals within each bucket as follows.

2. For each input number $x_i$, we perform the following steps.
(a) Let $[v_r, v_{r+1})$ be the interval containing $x_i$, which has been located using $T_i$; append $x_i$ to the list $N_r$.

(b) Let $B$ be the bucket pointed to by $[v_r, v_{r+1})$. If the interval $[v_r, v_{r+1})$ is not already in $U_B$, insert it into $U_B$; otherwise, do nothing.
3. By now, for each bucket $B$, $U_B$ stores all non-empty intervals in $B$. We have already discussed the sorting of each $N_r$. We scan the $n$ buckets in left-to-right order. For each bucket $B$ encountered, we find the minimum element in $U_B$ and then find successors in $U_B$ iteratively. This allows us to visit the non-empty $N_r$'s in $B$ in increasing order of $r$, so we can output the sorted $N_r$'s in increasing order. At the end, we delete all elements from $U_B$ for each bucket $B$ in preparation for sorting the next input instance.
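The bucket scan can be sketched as follows. Python's standard library has no van Emde Boas tree, so a sorted list maintained with `bisect` stands in for $U_B$ (giving $O(\log m)$ rather than $O(\log\log m)$ per operation); the class and function names are ours, and `sorted_lists` maps each non-empty interval index $r$ to its sorted $N_r$.

```python
import bisect

class BucketIndex:
    """Stand-in for a per-bucket van Emde Boas tree: tracks which
    intervals in the bucket are non-empty and reports them in order.
    (A real vEB tree would support these operations in O(log log m).)"""
    def __init__(self):
        self.present = []
    def insert(self, r):
        pos = bisect.bisect_left(self.present, r)
        if pos == len(self.present) or self.present[pos] != r:
            self.present.insert(pos, r)
    def scan(self):
        return list(self.present)   # non-empty intervals, increasing r
    def clear(self):
        self.present = []

def output_in_order(m, n, nonempty_intervals, sorted_lists):
    """Scan the n buckets left to right, emitting the sorted N_r's in
    increasing order of r (interval r lives in bucket min(r // m, n-1),
    so the last bucket absorbs the extra interval)."""
    buckets = [BucketIndex() for _ in range(n)]
    for r in nonempty_intervals:
        buckets[min(r // m, n - 1)].insert(r)
    out = []
    for b in buckets:
        for r in b.scan():
            out.extend(sorted_lists[r])
        b.clear()                   # reset for the next input instance
    return out
```

Only the non-empty intervals are visited, so the concatenation costs time proportional to the output plus the index operations rather than $O(mn)$.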

Analysis
Let $I$ be an input instance. Let $X_{ir}$ be a random variable such that if $x_i$ falls into $[v_r, v_{r+1})$, then $X_{ir} = 1$; otherwise, $X_{ir} = 0$. We first bound $\sum_{i=1}^{n}\Pr[X_{ir} = 1 \wedge I \sim \mathcal{D}_q]$ for each $q$.

Lemma 3.3 Let $I$ be an input instance. Let $X_{ir}$ be a random variable that is 1 if $x_i \in [v_r, v_{r+1})$ and 0 otherwise. It holds with probability at least $1 - 1/(mn)$ that for every $q \in [1, \kappa]$ and every $r \in [0, mn]$, $\sum_{i=1}^{n}\Pr[X_{ir} = 1 \wedge I \sim \mathcal{D}_q] = O(1/m)$.

Proof In building the V-list in the training phase, we constructed the list $(s_1, s_2, \ldots, s_{mn\ln(mn)})$ using $mn\ln(mn)$ input instances $I_1, \ldots, I_{mn\ln(mn)}$, where $s_a$ is equal to $x_i$ in $I_a$ for every $i \in [1, n]$ and every $a \in [(i-1)m\ln(mn)+1, \, im\ln(mn)]$.
For any $\alpha, \beta \in [1, mn\ln(mn)]$ such that $s_\alpha < s_\beta$, let $J^\beta_\alpha$ denote the relevant index set, and for $i \in J^\beta_\alpha$, let $Y^\beta_\alpha(i)$ be the indicator of $s_i \in [s_\alpha, s_\beta)$; define $Y^\beta_\alpha = \sum_{i \in J^\beta_\alpha} Y^\beta_\alpha(i)$. Among all $i \in J^\beta_\alpha$, the variables $Y^\beta_\alpha(i)$ are independent of each other because the $s_i$'s are taken from independent input instances. By Chernoff's bound, for any $\mu \in [0,1]$, $\Pr\bigl[Y^\beta_\alpha > (1-\mu)E[Y^\beta_\alpha]\bigr] \ge 1 - e^{-\mu^2 E[Y^\beta_\alpha]/2}$. Since we take every $\ln(mn)$-th number in forming the V-list, we want to discuss the probability of $Y^\beta_\alpha > \ln(mn)$. This motivates us to consider $E[Y^\beta_\alpha] > \ln(mn)/(1-\mu)$. We also want the probability bound $1 - e^{-\mu^2 E[Y^\beta_\alpha]/2}$ of $Y^\beta_\alpha > \ln(mn)$ to be at least $1 - m^{-5}n^{-5}$. This allows us to apply the union bound over at most $mn\ln(mn)\,(mn\ln(mn)-1)$ choices of $\alpha$ and $\beta$ to obtain a probability bound of at least $1 - \ln^2(mn)/(m^3n^3)$. We conclude that: it holds with probability at least $1 - \ln^2(mn)/(m^3n^3)$ that for any $\alpha, \beta \in [1, mn\ln(mn)]$ such that $s_\alpha < s_\beta$, if $E[Y^\beta_\alpha] > \ln(mn)/(1-\mu)$, then $Y^\beta_\alpha > \ln(mn)$.

For every $r \in [0, mn+1]$, let $s_{\alpha_r}$ denote $v_r$, where $s_{\alpha_0} = -\infty$ and $s_{\alpha_{mn+1}} = \infty$. Fix a particular $r \in [0, mn]$. By construction, there are at most $\ln(mn)$ numbers among $s_1, \ldots, s_{mn\ln(mn)}$ that fall in $[v_r, v_{r+1})$, which guarantees the event $Y^{\alpha_{r+1}}_{\alpha_r} \le \ln(mn)$. Our conclusion above implies that, with probability at least $1 - \ln^2(mn)/(m^3n^3)$, $E\bigl[Y^{\alpha_{r+1}}_{\alpha_r}\bigr] = O(\ln(mn))$.
The random process that generates the input is independent of the training phase. In the training phase, for each $i \in [1, n]$, we sample $m\ln(mn)$ $x_i$'s from $m\ln(mn)$ input instances to form $(s_1, \ldots, s_{mn\ln(mn)})$. Therefore, the bound on $E\bigl[Y^{\alpha_{r+1}}_{\alpha_r}\bigr]$ translates into the claimed bound for each fixed $r$. Apply the union bound over $r \in [0, mn]$. The probability bound is thus at least $1 - (mn+1)\ln^2(mn)/(m^3n^3) \ge 1 - 1/(mn)$.
Recall that $N_r$ is the subset of numbers that fall into $[v_r, v_{r+1})$ in the operation phase when sorting an input instance. We bound the expected total time $E\bigl[\sum_{r=0}^{mn}|N_r|\log|N_r|\bigr]$. Both $X_{ir}$ and $X_{jr}$ are indicator random variables: if $X_{ir} = 1$ and $X_{jr} = 1$, then $X_{ir}X_{jr} = 1$; otherwise, $X_{ir}X_{jr} = 0$. Let $I$ denote an input instance. Conditioned on $i \ne j$ and $I \sim \mathcal{D}_q$ for some $q \in [1, \kappa]$, $X_{ir} = 1$ and $X_{jr} = 1$ are two independent events, and so $\Pr[X_{ir} = 1 \wedge X_{jr} = 1 \mid I \sim \mathcal{D}_q] = \Pr[X_{ir} = 1 \mid I \sim \mathcal{D}_q] \cdot \Pr[X_{jr} = 1 \mid I \sim \mathcal{D}_q]$. We can therefore bound $\sum_{i \ne j}\sum_{r=0}^{mn}\Pr[X_{ir} = 1 \wedge X_{jr} = 1]$ by expanding the outermost summation over all $i \in [1, n]$ and $j \in [1, n]$ and replacing $\Pr[X_{jr} = 1 \mid I \sim \mathcal{D}_q] \cdot \Pr[I \sim \mathcal{D}_q]$ by $\Pr[X_{jr} = 1 \wedge I \sim \mathcal{D}_q]$. By Lemma 3.3, it holds with probability at least $1 - 1/(mn)$ that for every $q \in [1, \kappa]$ and every $r \in [0, mn]$, the quantity $\sum_{j=1}^{n}\Pr[X_{jr} = 1 \wedge I \sim \mathcal{D}_q]$ is $O(1/m)$. Therefore,

$$\sum_{i \ne j}\sum_{r=0}^{mn}\Pr[X_{ir} = 1 \wedge X_{jr} = 1] = O\Biggl(\frac{1}{m}\sum_{q=1}^{\kappa}\sum_{i=1}^{n}\sum_{r=0}^{mn}\Pr[X_{ir} = 1 \mid I \sim \mathcal{D}_q]\Biggr).$$
Conditioned on a product distribution, $x_i$ must fall into one of the $mn+1$ intervals, and so $\sum_{r=0}^{mn}\Pr[X_{ir} = 1 \mid I \sim \mathcal{D}_q] = 1$, implying that $\sum_{i=1}^{n}\sum_{r=0}^{mn}\Pr[X_{ir} = 1 \mid I \sim \mathcal{D}_q] = n$. We conclude that $\sum_{i \ne j}\sum_{r=0}^{mn}\Pr[X_{ir} = 1 \wedge X_{jr} = 1] = O(\kappa n/m) = O(n)$, since $\kappa \le m$.

Conclusion
There are several possible directions for future research. One is to extend the hidden classification to allow the $x_i$'s in the same class $S_k$ to be more general functions of the random parameter $z_k$. Linear functions of $z_k$ have the nice property that any $x_i$ and $x_j$ in the same class are linearly related; this helps us to learn the hidden classes. Another direction is to improve the performance in the case of a hidden mixture of product distributions. It would also be interesting to design self-improving algorithms for other problems and possibly other input settings as well.

For any $(a, k) \in J^\beta_\alpha$, let $Y^\beta_\alpha(a, k)$ be an indicator random variable such that if some element of the input instance $I_a$ that belongs to $S_k$ falls into $[y_\alpha, y_\beta)$, then $Y^\beta_\alpha(a, k) = 1$; otherwise, $Y^\beta_\alpha(a, k) = 0$. Define $Y^\beta_\alpha = \sum_{(a,k)\in J^\beta_\alpha} Y^\beta_\alpha(a, k)$. Among the $(a, k)$'s in $J^\beta_\alpha$, the random variables $Y^\beta_\alpha(a, k)$ are independent of each other. By Chernoff's bound, for any $\mu \in [0, 1]$, $\Pr\bigl[Y^\beta_\alpha > (1-\mu)E[Y^\beta_\alpha]\bigr] \ge 1 - e^{-\mu^2 E[Y^\beta_\alpha]/2}$. Since we take every $\ln n$-th number in forming the V-list, we want to discuss the probability of $Y^\beta_\alpha > \ln n$. This motivates us to consider $E[Y^\beta_\alpha] > \ln n/(1-\mu)$. We also want the probability bound $1 - e^{-\mu^2 E[Y^\beta_\alpha]/2}$ of $Y^\beta_\alpha > \ln n$ to be at least $1 - n^{-5}$. This allows us to apply the union bound over at most $(n\ln n)(n\ln n - 1)$ choices of $\alpha$ and $\beta$ to obtain a probability bound of at least $1 - \ln^2 n/n^3$. These requirements are satisfied by setting $\mu = \sqrt{35} - 5 \approx 0.9161$. We conclude that: it holds with probability at least $1 - \ln^2 n/n^3$ that for any pair of distinct indices $\alpha, \beta \in [1, n\ln n]$ such that $y_\alpha \le y_\beta$, if $E[Y^\beta_\alpha] > \ln n/(1-\mu)$, then $Y^\beta_\alpha > \ln n$ with probability at least $1 - n^{-5}$.

The index pairs $(a_{\alpha_r}, k_{\alpha_r})$ and $(a_{\alpha_{r+1}}, k_{\alpha_{r+1}})$ are excluded from $J^{\alpha_{r+1}}_{\alpha_r}$, but they are considered in $\sum_{a=1}^{\ln n}\sum_{k=1}^{g} E[X_{kr}]$. We have shown previously that $E\bigl[Y^{\alpha_{r+1}}_{\alpha_r}\bigr] = O(\ln n)$ with probability at least $1 - \ln^2 n/n^3$. It follows that $E[|Z_r|] = O(1)$ with probability at least $1 - \ln^2 n/n^3$. Since the above statement holds for every fixed $r \in [0, n]$, by the union bound, it holds with probability at least $1 - 1/n$ that $E[|Z_r|] = O(1)$ for all $r \in [0, n]$.

Lemma 3.1 The training phase constructs the following structures.

(i) The V-list $(v_0, v_1, \ldots, v_{mn+1})$ is constructed in $O(mn\log^2(mn))$ time using $mn\ln(mn)$ input instances and $O(mn\log(mn))$ space, where $v_0 = -\infty$, $v_{mn+1} = \infty$, and for $i \in [1, mn]$, $v_i$ is the number of rank $i\ln(mn)$ in $\bigcup_{i=1}^{n}\{x^{(a)}_i : a \in [(i-1)m\ln(mn)+1, \, im\ln(mn)]\}$.

(ii) The $mn+1$ intervals induced by the V-list are organized as $n$ consecutive buckets of $m$ intervals each, except for the last bucket, which contains $m+1$ intervals.

Lemma 3.4 It holds with probability at least $1 - 1/(mn)$ that $E\bigl[\sum_{r=0}^{mn}|N_r|\log|N_r|\bigr] = O(n)$, where $\sum_{r=0}^{mn}|N_r|\log|N_r|$ is the time to sort the $N_r$'s.

Proof $E\bigl[\sum_{r=0}^{mn}|N_r|\log|N_r|\bigr] \le E\bigl[\sum_{r=0}^{mn}\sum_{i=1}^{n}\sum_{j=1}^{n} X_{ir}X_{jr}\bigr]$.

Similarly, lines with negative slopes are labelled with $v_n$. The construction of this version takes $O(|S_k|\log|S_k|)$ time and $O(|S_k|)$ space. Run a plane sweep over $A_k$ from left to right. We exit the current slab and enter a new slab when crossing a vertex of $A_k$. If we cross an intersection between two lines $\ell_i$ and $\ell_j$, then we swap $\ell_i$ and $\ell_j$ in the persistent search tree (by swapping node contents). Suppose that we cross an intersection between a horizontal line $y = v_r$ and a line $\ell_i$. If $\ell_i$ is above $y = v_r$ to the right of this intersection, then we update the predecessor of $\ell_i$ to $v_r$; otherwise, we update the predecessor of $\ell_i$ to $v_{r-1}$. As a result, we obtain a new version of the persistent search tree in $O(\log|S_k|)$ time and $O(1)$ extra amortized space. Constructing all versions thus takes $O(n|S_k|\log|S_k|)$ time and $O(n|S_k|)$ space. Notice that there is one version for each slab in $W_k$.
We need two technical results.

Lemma 2.5 Let $H(X_1, \ldots, X_n)$ be the joint entropy of independent random variables $X_1, \ldots, X_n$. Then $H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i)$.

Lemma 2.6 [1, Lemma 2.3] Let $X : U \to \mathcal{X}$ and $Y : U \to \mathcal{Y}$ be two random variables obtained with respect to the same arbitrary distribution over the universe $U$. Suppose that the function $f : (I, X(I)) \mapsto Y(I)$, $I \in U$, can be computed by a comparison-based algorithm with $C$ expected comparisons, where the expectation is over the distribution on $U$. Then $H(Y) \le C + O(H(X))$.
(iii) Search trees $T_i$ for $i \in [1, n]$ are built on the intervals $[v_0, v_1), \ldots, [v_{mn}, v_{mn+1})$ using $O(m^\varepsilon n^\varepsilon)$ input instances. The processing time is $O(m^\varepsilon n^{1+\varepsilon}\log(mn))$ and the search trees use $O(m^\varepsilon n^{1+\varepsilon})$ space. For any input instance $(x_1, \ldots, x_n)$ in the operation phase, $T_i$ can be queried to find the interval that contains $x_i$ in $O(H(P_i)/\varepsilon)$ expected time.

Given an input instance $I = (x_1, \ldots, x_n)$, for each $i \in [1, n]$, we search $T_i$ to place $x_i$ in the interval $[v_r, v_{r+1})$ that contains it. For each $r \in [0, mn]$, the interval $[v_r, v_{r+1})$ keeps a list $N_r$ of the $x_i$'s that fall into it. We sort each $N_r$ in $O(|N_r|\log|N_r|)$ time. Recall that querying $T_i$ with $x_i$ takes $O(H(P_i)/\varepsilon)$ expected time, where $P_i$ is the random variable indicating the predecessor of $x_i$ in the V-list. Therefore, the total time for processing $I$ is $O\bigl(\frac{1}{\varepsilon}\sum_{i=1}^{n} H(P_i) + \sum_{r=0}^{mn}|N_r|\log|N_r|\bigr)$ plus the time to concatenate the sorted lists. The concatenation takes $O(n)$ time plus the time for manipulating the $n$ van Emde Boas trees. The van Emde Boas tree [10] supports ordered dictionary operations in $O(\log\log N)$ worst-case time each, where $N$ is the size of the universe; this is $O(\log\log m)$ time in our case.
Lemma 3.2 In the operation phase, the search trees $T_i$'s, the V-list, and the van Emde Boas trees require $O(m^\varepsilon n^{1+\varepsilon})$, $O(mn)$, and $O(mn)$ space, respectively. Sorting an input instance takes $O\bigl(n\log\log m + \frac{1}{\varepsilon}\sum_{i=1}^{n} H(P_i) + E\bigl[\sum_{r=0}^{mn}|N_r|\log|N_r|\bigr]\bigr)$ expected time.