Generalized List Decoding

This paper concerns itself with the question of list decoding for general adversarial channels, e.g., bit-flip ($\textsf{XOR}$) channels, erasure channels, $\textsf{AND}$ ($Z$-) channels, $\textsf{OR}$ channels, real adder channels, noisy typewriter channels, etc. We precisely characterize when exponential-sized (or positive rate) $(L-1)$-list decodable codes (where the list size $L$ is a universal constant) exist for such channels. Our criterion asserts that: "For any given general adversarial channel, it is possible to construct positive rate $(L-1)$-list decodable codes if and only if the set of completely positive tensors of order-$L$ with admissible marginals is not entirely contained in the order-$L$ confusability set associated to the channel." The sufficiency is shown via random code construction (combined with expurgation or time-sharing). The necessity is shown by 1. extracting equicoupled subcodes (a generalization of equidistant codes) from any large code sequence using the hypergraph Ramsey theorem, and 2. significantly extending the classic Plotkin bound in coding theory to list decoding for general channels using duality between the completely positive tensor cone and the copositive tensor cone. In the proof, we also obtain a new fact regarding asymmetry of joint distributions, which may be of independent interest. Other results include 1. List decoding capacity with asymptotically large $L$ for general adversarial channels; 2. A tight list size bound for most constant composition codes (a generalization of constant weight codes); 3. Rederivation and demystification of Blinovsky's [Bli86] characterization of the list decoding Plotkin points (the threshold at which large codes are impossible); 4. Evaluation of general bounds ([WBBJ]) for unique decoding in the error correction code setting.

Before introducing general notions, motivating general problems, and stating our general theorems, we first go through concrete numerical examples that are special cases of our results.
Suppose Alice can transmit a length-$n$ bit string (codeword) to Bob and an adversary James can flip $np$ ($0 \le p \le 1$) of these bits. Consider first the classic coding theory question. 1) Error correction. For what values of $p$ can one construct a code (collection of codewords) of positive rate (i.e., size at least $2^{Rn}$ for some constant $R > 0$) such that Bob can uniquely decode? The classic Plotkin bound tells us that this is impossible for $p > 1/4$, and the classic Gilbert--Varshamov (GV) bound tells us that this is possible for $p < 1/4$. 2) List decoding. For what values of $p$ can one construct a code of positive rate that is 3-list decodable (i.e., regardless of which $np$ bits James flips, Bob can always decode the received word to a list of at most 3 codewords, one of which is the codeword transmitted by Alice)? Due to work by Blinovsky, it is known that this is possible if and only if $p \le 5/16$. In this work, we are able to rederive all the above thresholds, but are also able to derive the corresponding thresholds for a vast variety of general adversarial channels, such as bit-flip channels, erasure channels, $\textsf{AND}$ ($Z$-) channels, $\textsf{OR}$ channels, adder channels, noisy typewriter channels, etc.
In this section, let us revisit the answers to questions 1 and 2 in the technical language we develop in this paper. 1) Error correction. Consider any pair of codewords $\mathbf{x}_1, \mathbf{x}_2$ that are resilient to $np$ bit-flips. They must therefore be at Hamming distance larger than $2np$. Said differently, the joint type (i.e., the $2\times 2$ matrix whose $(x_1, x_2)$-th entry is the fraction of locations $i$ such that $\mathbf{x}_1(i) = x_1$ and $\mathbf{x}_2(i) = x_2$)
$$\tau_{\mathbf{x}_1,\mathbf{x}_2} = \begin{bmatrix} t(0,0) & t(0,1) \\ t(1,0) & t(1,1) \end{bmatrix}$$
of these two codewords must satisfy the condition
C1: $t(0,1) + t(1,0) \ge 2p$.
a) In [Bli86], [Pol16], [ABP18] and [WBBJ], it was shown that if a code $\mathcal{C}$ of size $2^{Rn}$ exists, then there must exist a positive rate subcode $\mathcal{C}' \subset \mathcal{C}$ such that for every pair of codewords $\mathbf{x}_1, \mathbf{x}_2$ in $\mathcal{C}'$, their joint type is approximately the same (as, say, $P_{x_1,x_2}$). b) In [WBBJ], it was shown that it is possible to construct positive rate codes with joint types (close to) $P_{x_1,x_2}$ if and only if $P_{x_1,x_2}$ is a completely positive (CP) distribution, i.e., a joint distribution that can be written as a convex combination $\sum_{i=1}^{k} \lambda_i P_{x_i} \otimes P_{x_i}$ of products of independent and identical distributions, for some positive integer $k$, convex combination coefficients $\{\lambda_i\}_{1\le i\le k}$ and probability vectors $\{P_{x_i}\}_{1\le i\le k}$. For example,
$$\begin{bmatrix} \frac{1+\lambda}{4} & \frac{1-\lambda}{4} \\ \frac{1-\lambda}{4} & \frac{1+\lambda}{4} \end{bmatrix} \quad (1)$$
is CP for $\lambda \in [0,1]$ since it can be written as $\lambda \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix} + (1-\lambda) \begin{bmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{bmatrix}$; one can check that for $\lambda < 0$, matrix (1) is not CP. For condition C1 to be satisfied by some CP distribution of this form, it must be the case that $2p \le 2 \cdot (1-\lambda) \cdot (1/4)$ for some $\lambda \in [0,1]$. This is impossible if $p > 1/4$. As a consequence, the classic Plotkin bound is recovered in this convex geometry language, since the non-CP matrices of the form (1) with negative $\lambda$ correspond to codes with minimum pairwise fractional distance $\frac{1+|\lambda|}{2}$ (hence correspond to $p = \frac{1+|\lambda|}{4} > 1/4$), which, by the Plotkin bound, cannot have positive rate. 2) List decoding. Now let us move to the list decoding question at hand.
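Before turning to list decoding, the convex-geometry recovery of the Plotkin point can be checked numerically. The sketch below (a toy script; the $\lambda$-parametrized family is the one from the error-correction discussion, a $\lambda$-mixture of the "both bits equal" matrix and the i.i.d. uniform matrix) maximizes the admissible $p$ under condition C1 over the CP members $\lambda \in [0,1]$ and recovers the threshold $1/4$.

```python
# Joint types tau(lam) = lam * [[1/2,0],[0,1/2]] + (1-lam) * [[1/4,1/4],[1/4,1/4]].
# For lam in [0,1] this is CP (a convex combination of products of i.i.d.
# distributions). Condition C1 requires t(0,1) + t(1,0) >= 2p.

def tau(lam):
    eq = [[0.5, 0.0], [0.0, 0.5]]        # both bits equal: CP
    unif = [[0.25, 0.25], [0.25, 0.25]]  # i.i.d. uniform bits: CP
    return [[lam * eq[i][j] + (1 - lam) * unif[i][j] for j in range(2)]
            for i in range(2)]

def max_p(num_points=10001):
    # Largest p such that some CP member of the family satisfies C1, i.e.,
    # the maximum of (t(0,1) + t(1,0)) / 2 = (1 - lam)/4 over lam in [0,1].
    best = 0.0
    for k in range(num_points):
        lam = k / (num_points - 1)
        t = tau(lam)
        best = max(best, (t[0][1] + t[1][0]) / 2)
    return best

print(max_p())  # -> 0.25, the classic Plotkin point
```

The maximum is attained at $\lambda = 0$ (the fully i.i.d. uniform matrix), matching the claim that no CP distribution satisfies C1 once $p > 1/4$.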
For a code to be 3-list decodable, it must be the case that for any quadruple $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$, there is no $\mathbf{y}$ such that the Hamming distance from $\mathbf{x}_i$ to $\mathbf{y}$ is at most $np$ for every $i \in \{1,2,3,4\}$. In this case, the appropriate object is therefore a $2\times2\times2\times2$ tensor (or a joint distribution of $(x_1, x_2, x_3, x_4)$) $P_{x_1,x_2,x_3,x_4}$ such that
C2: any extension $P_{x_1,x_2,x_3,x_4,y}$ of it (i.e., a coupling of $(x_1,x_2,x_3,x_4)$ and $y$, or a $2\times2\times2\times2\times2$ tensor such that $P_{x_1,x_2,x_3,x_4}(\cdot) = P_{x_1,x_2,x_3,x_4,y}(\cdot, 0) + P_{x_1,x_2,x_3,x_4,y}(\cdot, 1)$) satisfies the condition that $P_{x_i,y}(0,1) + P_{x_i,y}(1,0) > p$ for at least one $i \in \{1,2,3,4\}$.
a) Again, by [Bli86], [Pol16], [ABP18] and our work, we can restrict our attention to codes in which every $L$-tuple of codewords has joint type close to some $P_{x_1,\cdots,x_L}$, since we can find such a subcode which is sufficiently large in any positive rate code. b) Generalizing [WBBJ], we show that positive rate codes with order-4 joint types (close to) $P_{x_1,x_2,x_3,x_4}$ exist if and only if $P_{x_1,x_2,x_3,x_4}$ is a completely positive tensor of order 4, i.e., a joint distribution that can be written as a convex combination of products of independent and identical distributions. One can check that a distribution of the analogous $\lambda$-parametrized form (a $\lambda$-mixture of the "all bits equal" distribution and the i.i.d. uniform distribution) is CP if and only if $\lambda \in [0,1]$. On the other hand, for condition C2 to be satisfied by some tensor of this form, it turns out, as shown by Blinovsky [Bli86] and us, that $p$ has to be no larger than $5/16$. Of course, bit-flips are just one of the simplest models of corruption that may occur in real-world communication/storage systems. Perhaps, under certain circumstances, we are allowed to transmit length-$n$ codewords taking values in $\{0,1,2,3,4,5\}$, but each legitimate codeword $\mathbf{x}$ has to satisfy certain constraints inherently associated to the system on its type $\tau_{\mathbf{x}}$, where $\tau_{\mathbf{x}}(x)$ denotes the fraction of occurrences of $x$ in $\mathbf{x}$.
An adversary is allowed to change symbols in the transmitted codeword only from smaller values to larger values; the cost he pays for changing an $i$ to a $j$ ($0 \le i < j \le 5$) is $j - i$ dollars, and he has a budget of $2.3n$ dollars in total. The fundamental type of question we are able to answer in this paper is: is it possible for us to design exponentially large codes so that, no matter which codeword is transmitted and how a legitimate adversary corrupts it, the decoder is always able to output a list of at most 10 codewords which contains the correct one?
The answer can be stated in a similar manner. This is possible if and only if there is a CP tensor of order 11 and dimension 6 which does not lie inside the confusability set determined by the channel. In particular, the confusability set is the set of joint distributions which fail to meet the conditions similar to C1 or C2 that are determined by the channel.
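The confusability check for this cost-constrained channel can be sketched concretely. The code below is a minimal illustration, assuming the model just described (symbols may only increase, per-symbol cost $j - i$, total budget $2.3n$); the two codewords are hypothetical toy inputs. Since the adversary may only increase symbols, a common output $\mathbf{y}$ reachable from every codeword in a tuple must dominate each of them coordinate-wise, so the coordinate-wise maximum is the cheapest candidate.

```python
def confusable(codewords, budget_per_symbol=2.3):
    """Check whether a single output y is reachable from every codeword in
    the tuple within the adversary's budget.  Symbols may only be increased
    and the cost of raising i to j is j - i, so the coordinate-wise maximum
    is the cheapest common candidate for y."""
    n = len(codewords[0])
    y = [max(x[i] for x in codewords) for i in range(n)]
    budget = budget_per_symbol * n
    return all(sum(y[i] - x[i] for i in range(n)) <= budget for x in codewords)

# Two hypothetical length-10 codewords over {0,...,5}:
x1 = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3]
x2 = [5, 4, 3, 2, 1, 0, 5, 4, 3, 2]
print(confusable([x1, x2]))  # True: both reach the coordinate-wise max within 23 dollars
```

An $(L-1)$-list decodable code for this channel must avoid any $L$-tuple of codewords on which such a check succeeds; the full criterion in the paper replaces this finite check by a condition on joint types.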
Our results tell us that if one only aims to search for exponentially large $(L-1)$-list decodable codes (instead of optimizing their size) for a given general adversarial channel, it is sufficient (and obviously necessary) to restrict our attention to codes that are chunk-wise random-like. Such codes correspond to some CP distribution $\sum_{i=1}^{k} \lambda_i P_{x_i}^{\otimes L}$. If a random code of positive rate in which the $\lambda_i n$ ($1 \le i \le k$) components in the $i$-th chunk of each codeword are sampled from distribution $P_{x_i}$ does not work with high probability (w.h.p.), then we can never find positive rate codes of any other form that work for this channel.
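The chunk-wise random-like construction can be sketched as follows. This is a toy instantiation, not the paper's construction: the two chunks, their weights $\lambda_i$, and the per-chunk marginals are hypothetical choices made purely for illustration.

```python
import random

def sample_codeword(n, chunks):
    """chunks: list of (lambda_i, P_i) pairs, where P_i is a distribution
    over the input alphabet given as {symbol: probability}.  The i-th chunk
    occupies lambda_i * n coordinates, each drawn i.i.d. from P_i."""
    word = []
    for lam, dist in chunks:
        symbols, probs = zip(*dist.items())
        word += random.choices(symbols, weights=probs, k=round(lam * n))
    return word

# Hypothetical weights and marginals for a two-chunk CP distribution:
chunks = [(0.5, {0: 0.9, 1: 0.1}),   # first half biased toward 0
          (0.5, {0: 0.5, 1: 0.5})]   # second half uniform
code = [sample_codeword(100, chunks) for _ in range(8)]
print(len(code), len(code[0]))  # 8 codewords of blocklength 100
```

Every $L$-tuple of codewords drawn this way has joint type concentrating around the corresponding CP distribution, which is exactly the regime the criterion addresses.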
By setting the list size $L - 1 = 1$, the results in [WBBJ] are recovered by our work.

II. INTRODUCTION
The main contribution of this work is to strictly generalize notions that have so far been primarily studied for "Hamming metric" channels. Before we precisely define general channels, let us reprise in this section what is known for Hamming metric channels.

A. Error correction codes and Plotkin bound
The theory of error correction codes is about protecting data from errors. In classical coding theory, a code, say $\mathcal{C}$, is just a collection of codewords (which are usually binary length-$n$ sequences, where $n$ is called the blocklength). The most well-studied error model is bit-flip. When a certain codeword is transmitted, an adversary can arbitrarily flip at most $np$ ($0 < p < 1/2$) bits. It is easy to see that two codewords are not confusable if and only if their Hamming distance (the number of locations where they differ, denoted $d_H(\cdot,\cdot)$) is at least $2np + 1$. Let $d_{\min}(\mathcal{C}) = \min_{\mathbf{x}_1 \ne \mathbf{x}_2 \in \mathcal{C}} d_H(\mathbf{x}_1, \mathbf{x}_2)$ denote the minimum pairwise distance of codewords in $\mathcal{C}$. The goal is to pack as many codewords as possible into the Hamming space $\mathbb{F}_2^n$ while ensuring that the minimum distance is at least $2np + 1$. By a simple volume argument (the Gilbert--Varshamov (GV) bound [Gil52], [Var57]), it is known that exponentially many such vectors can be packed when $p < 1/4$. The fundamental quantity that coding theorists seek when faced with any communication model is the largest achievable rate, i.e., the capacity. The rate of a code is its normalized cardinality, $R(\mathcal{C}) = \frac{\log|\mathcal{C}|}{n}$. The capacity $C$ measures asymptotically, as the blocklength grows, the largest fraction of bits (out of $n$) that can be reliably transmitted despite $np$ adversarial bit-flips; $C$ is formally defined as the limit, as $n \to \infty$, of the largest rate $R(\mathcal{C})$ of any code of blocklength $n$ resilient to $np$ bit-flips. For the aforementioned bit-flip model, as said, the problem of finding the capacity can also be cast as determining the sphere packing density. It is notoriously difficult and still open to date. However, we do know that $p = 1/4$ is the threshold below which exponential packing exists (as suggested by the Gilbert--Varshamov (GV) bound) and above which it is impossible. The latter fact is the famous Plotkin bound. Formally,
Theorem 2 (Plotkin bound [Plo60]). If $p = 1/4 + \varepsilon$, then any code $\mathcal{C}$ of distance larger than $2np$ has cardinality at most $1 + \frac{1}{4\varepsilon}$ (and hence zero rate).
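The two bounds just discussed can be evaluated side by side. The sketch below is a numerical illustration ($H$ denotes the binary entropy): the GV rate $1 - H(2p)$ is positive exactly when $p < 1/4$, while at $p = 1/4 + \varepsilon$ the Plotkin bound caps the code size at the constant $1 + \frac{1}{4\varepsilon}$.

```python
from math import log2

def H(x):
    """Binary entropy function."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def gv_rate(p):
    # Gilbert--Varshamov achievable rate for minimum distance 2np
    return 1 - H(2 * p)

def plotkin_size(eps):
    # Plotkin bound: max code size at p = 1/4 + eps
    return 1 + 1 / (4 * eps)

print(gv_rate(0.2) > 0)         # True: positive rate at p = 0.2 < 1/4
print(round(gv_rate(0.25), 6))  # 0.0: the GV rate vanishes at p = 1/4
print(plotkin_size(0.05))       # 6.0: constant-size codes at p = 0.3
```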
We will call the value of $p$ at which the capacity hits zero the Plotkin point. Note that the Plotkin bound actually tells us that, above the Plotkin point, any code/packing not only has size $2^{o(n)}$ (hence rate zero), but must in fact have size at most a constant (independent of the blocklength $n$). Coupled with the achievability result given by the GV bound, the phase transition threshold for exponential-sized packing is thereby identified precisely.

B. List decoding and list decoding Plotkin bound
We now introduce another important notion: list decoding. List decodability still requires codewords to be separated out, but in a more relaxed sense. It requires that only a few codewords can be captured by a ball of some radius, no matter where it is put.
Definition 3 (List decodability [Eli57], [Woz58]). A code $\mathcal{C}$ is $(p, L-1)$-list decodable (or $(p, <L)$-list decodable) if for all $\mathbf{y} \in \mathbb{F}_2^n$, $|\mathcal{C} \cap B_H(\mathbf{y}, np)| < L$, where $B_H(\mathbf{y}, np)$ denotes the Hamming ball centered at $\mathbf{y}$ of radius $np$. Of course we want the list size $L$ to be as small as possible. In particular, the problem is trivial when $L = |\mathcal{C}|$. (The decoder ignores the channel output and outputs the full code.) When $L = 2$, it becomes precisely packing. As the admissible $L$ grows, the problem is expected to become easier.
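Definition 3 can be checked directly for toy parameters. The sketch below is exhaustive over all $2^n$ centers, so it only runs for tiny blocklengths; the four-codeword code is a hypothetical example chosen so that all pairwise distances are at least 3.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def is_list_decodable(code, radius, L):
    """(p, L-1)-list decodability per Definition 3: no center y captures
    L or more codewords within the given radius.  Exhaustive over all
    2^n candidate centers."""
    n = len(code[0])
    for y in product((0, 1), repeat=n):
        if sum(hamming(y, c) <= radius for c in code) >= L:
            return False
    return True

# Toy code with n = 6 and all pairwise distances >= 3:
code = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1),
        (0, 0, 0, 1, 1, 1), (1, 1, 1, 0, 0, 0)]
print(is_list_decodable(code, radius=1, L=2))  # True: no ball of radius 1 holds 2 codewords
```

With `radius=2` the same call returns `False`, since two codewords at distance 3 share a common center within distance 2 of each, illustrating how the guarantee degrades as the radius grows.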
List decoding is an important and well-studied subject in coding theory. It is a natural mathematical question to pose for understanding high-dimensional geometry in discrete spaces. It also serves as a useful primitive that shows power within and beyond the scope of coding theory. For instance, in many communication problems (e.g., [Ahl73], [CJM15]), a proof technique is to let the decoder first perform list decoding and get a short list (usually $\mathrm{poly}(n)$ suffices) of candidate messages, then use other information to disambiguate the list and get the truly transmitted message. List decoding also finds applications in complexity theory, cryptography, etc. [Gur06]. For instance, it is used for amplifying hardness and constructing extractors, pseudorandom generators and other pseudorandom objects [DMOZ19]. The idea of relaxing the problem by asking the solver to just output a list (ideally as small as possible) of solutions that is guaranteed to contain the correct one, instead of insisting on a unique answer, is also adopted in many other fields of computer science [DKS18], [RY19], [KKK19].

Fig. 1: Packing (uniquely decodable codes) vs. multiple packing (list decodable codes). (a) An $(L-1)$-packing for $L = 2$, i.e., disjoint packing. (b) An $(L-1)$-packing for $L = 3$, i.e., packing with multiplicity 2. The geometry depicted in these figures may be misleading compared with the truth in binary Hamming space.

In the context of high-dimensional geometry in finite fields, list decoding is equivalent to multiple packing just like error correction codes are equivalent to sphere packing. Multiple packing is a natural generalization of the famous sphere packing problem in which, instead of insisting on disjoint balls, overlap is allowed but with bounded multiplicity.
Definition 4 (Multiple packing). A subset $\mathcal{C} \subseteq \mathbb{F}_2^n$ is a $(p, L-1)$-multiple packing if, when we put balls of radius $np$ around each vector in $\mathcal{C}$, no point in the space simultaneously lies in at least $L$ of these balls.
See Fig. 1 for examples of packing and multiple packing in Hamming space. Surprisingly, the list decoding capacity is known if we allow $L$ to be asymptotically large. In some sense, list decoding makes the problem information-theoretic, since in many (but not all) cases the list decoding capacity coincides with the corresponding Shannon channel capacity for which the noise is random with the same "power" (e.g., in the bit-flip/erasure case, the random noise is independently and identically distributed (i.i.d.) according to a Bernoulli distribution per component with mean $p$).
Theorem 5 (List decoding capacity (folklore)). Given any $\delta > 0$, there exists an infinite sequence of $(p, O(1/\delta))$-list decodable codes $\mathcal{C}$ of rate $1 - H(p) - \delta$. Indeed, a random code (each codeword sampled uniformly at random from $\mathbb{F}_2^n$) of rate $1 - H(p) - \delta$ is $(p, O(1/\delta))$-list decodable w.h.p. On the other hand, any infinite sequence of codes of rate $1 - H(p) + \delta$ has list size at least $2^{\Omega(n\delta)}$, i.e., is at best $(p, 2^{\Omega(n\delta)})$-list decodable. We call $1 - H(p)$ the $p$-list decoding capacity (without specifying a particular $L$). In particular, the Plotkin point for $p$-list decoding when $L$ is sufficiently large is $1/2$.
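The capacity expression in Theorem 5 is easy to evaluate. The sketch below computes $1 - H(p)$ and illustrates that it stays positive for every $p < 1/2$ and vanishes exactly at $p = 1/2$, the large-$L$ Plotkin point.

```python
from math import log2

def H(x):
    """Binary entropy function."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def list_decoding_capacity(p):
    # p-list decoding capacity 1 - H(p) from Theorem 5
    return 1 - H(p)

print(round(list_decoding_capacity(0.1), 4))  # 0.531
print(list_decoding_capacity(0.49) > 0)       # True: still positive just below 1/2
print(list_decoding_capacity(0.5))            # 0.0: the large-L Plotkin point
```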
Though the fundamental limit of the relaxed problem for large constant $L$ is essentially understood, $(p, L-1)$-list decodability for small $L$ (e.g., an absolute constant, say 3, 8, 100, etc.; or sublinear in $1/\delta$, say $(1/\delta)^{1/2}$, $(1/\delta)^{1/3}\log(1/\delta)$, $\log\log(1/\delta)$) is far from being understood. Indeed, it is believed (at least for absolute constant $L$) to be as hard as the sphere packing problem. Formally, the question of understanding the role of $L$ can be cast as follows. Note first that when $L = 2$, the (unknown) capacity lies somewhere between the Gilbert--Varshamov bound and the Linear Programming bound ([Del73], [Mac63], [WMR74], [MRRW77], [NS09]). When $L = O(1/\delta)$, the list decoding capacity $1 - H(p)$ is much larger than the unique decoding capacity. As we increase $L$, the $(p, L-1)$-list decoding capacity should be gradually lifted and the Plotkin point should somehow move rightwards from $1/4$ to $1/2$. The final goal is to completely understand the dynamics of this evolution.
Remark 6. In this paper, we explicitly distinguish the list decoding capacity for large $L$ and for small $L$. When we say that $L$ is asymptotically large, we refer to $L = \Omega(1/\delta)$, which suffices to approach the $p$-list decoding capacity within gap $\delta$. When we say that $L$ is small without further specification, we refer to absolute constant $L$. The $p$-list decoding capacity for large $L$ is fully characterized as in Theorem 5, denoted $C$, yet the $(p, L-1)$-list decoding capacity for small $L$ is widely open and is denoted by $C_{L-1}$.
Again, for any absolute constant $L$, the $(p, L-1)$-list decoding capacity is poorly understood. We only have non-matching lower and upper bounds. To the best of our knowledge, the current record holders are still the bounds by Blinovsky from the 80s [Bli86], [Bli05], [Bli08], except for sporadic values of $L$ in some regimes of $p$. Specifically, for $L = 3$, Ashikhmin--Barg--Litsyn [ABL00] uniformly improve Blinovsky's upper bound for all values of $p$. For even $L$ at least 4, Polyanskiy [Pol16] partially beats Blinovsky's bounds in the low rate regime.
Though the speed of convergence in $L$ is not exactly known, Blinovsky's bounds do resolve the dynamics of the Plotkin point evolution! Let $P_{L-1}$ denote the Plotkin point for $(p, L-1)$-list decoding, and let $L = 2k$ or $2k+1$ ($k \ge 1$). Then Blinovsky's results imply that $P_{L-1}$ is precisely given by the following formula:
$$P_{L-1} = \sum_{i=1}^{k} \frac{1}{i}\binom{2i-2}{i-1} 4^{-i}.$$
Later, Alon--Bukh--Polyanskiy [ABP18] recovered this result with a simpler-looking formula:
$$P_{L-1} = \frac{1}{2} - \binom{2k}{k} 2^{-2k-1}.$$
For instance, $P_1 = P_2 = 1/4$, $P_3 = P_4 = 5/16$, etc. As can be noted, the Plotkin point moves periodically! The fact that the above two formulas always evaluate to the same value is implicit in [ABP18] and formally justified in Appendix D.
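The agreement of the two formulas can be checked numerically. The sketch below evaluates Blinovsky's sum $\sum_{i=1}^{k} \frac{1}{i}\binom{2i-2}{i-1}4^{-i}$ and the Alon--Bukh--Polyanskiy closed form $\frac{1}{2} - \binom{2k}{k}2^{-2k-1}$ side by side (function names are ours).

```python
from math import comb

def blinovsky(k):
    # Blinovsky's sum for P_{L-1} with L = 2k or 2k+1
    return sum(comb(2 * i - 2, i - 1) / (i * 4 ** i) for i in range(1, k + 1))

def abp(k):
    # The Alon--Bukh--Polyanskiy closed form for the same quantity
    return 0.5 - comb(2 * k, k) / 2 ** (2 * k + 1)

# The two formulas agree for every k (here checked for k = 1, ..., 7):
for k in range(1, 8):
    assert abs(blinovsky(k) - abp(k)) < 1e-12

print(blinovsky(1), blinovsky(2))  # 0.25 0.3125, i.e., P_1 = P_2 = 1/4 and P_3 = P_4 = 5/16
```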

III. OUR CONTRIBUTIONS
Our motivation comes from a well-known connection between list decodability and reliability of communication over adversarial channels. A binary code is $(p, L-1)$-list decodable if and only if it has zero error when used over the following adversarial bit-flip channel (Fig. 4a).
The above system depicts a one-way point-to-point communication in which the encoder (Alice) randomly picks a message $m$ from $2^{nR}$ of them and encodes it into an $n$-bit string; the adversary (James) stares at this codeword and maliciously flips at most $np$ bits of it; the decoder (Bob) receives the corrupted word $\mathbf{y}$ and is required to output a short list of messages which is guaranteed to contain $m$ with probability 1.
In the above model, the adversary is power constrained in the sense that he only has a budget of $np$ bit-flips. But the encoder is not constrained -- she can encode the message into any vector in $\mathbb{F}_2^n$. In some scenarios, codewords are also weight constrained. It makes sense to pose the same question (understanding the list decoding capacity) for input constrained channels. Indeed, this was also studied [GN13], and the list decoding capacity is $H(w) - H(p)$ when each codeword has weight at most $nw$. Note that it vanishes at $p = w$. That is, the Plotkin point for weight constrained adversarial bit-flip channels is $w$.
Motivated by this connection, we significantly generalize the bit-flip model and define list decodability for general adversarial channels. We consider a large family of channels in which the encoder is allowed to encode the message into a length-n sequence x over any alphabet X of constant size, the adversary is allowed to design an adversarial noise pattern s over any alphabet S and the channel can be any deterministic component-wise function taking a pair of strings from X nˆS n , outputting a sequence y over any alphabet Y of the same length. The system designer can incorporate a large family of constraints on x and s in terms of their types (i.e., empirical distributions). The above family of adversarial channels includes but is not limited to 1) The standard adversarial bit-flip channels and adversarial erasure channels; 2) Z-channels in which the adversary can only flip 1 to 0 but not the other way around; 3) Adder channels in which the output is the sum of inputs over the reals rather than modulo the input alphabet size; 4) Channels equipped with Lee distance instead of Hamming metric. Indeed, our framework covers most popular error models and more that potentially have not been studied in the literature.
However, since we require the channel transition function to act on each component of the input codeword independently, a well-studied family of channels is excluded: the adversarial deletion channels. In this model, the adversary can delete at most $np$ entries of the transmitted codeword, and the decoder receives a vector of smaller length (but at least $(1-p)n$) without knowing the original locations of the symbols he receives. Determining the Plotkin point for this channel is a long-standing open problem. It is known [BGH16] that for binary channels, it lies between $\sqrt{2} - 1 \approx 0.414$ and $0.5$; for $k$-ary channels, it lies between $1 - \frac{2}{k + \sqrt{k}}$ and $1 - \frac{1}{k}$. The capacity of this channel is even less well understood.
For technical simplicity, we also assume that the channel transition function is deterministic, i.e., the output symbol is a deterministic function of the codeword symbol $x$ and the error symbol $s$. Furthermore, without loss of generality one can assume that none of the encoder, decoder and adversary has private randomness to randomize their strategies. This is because there are reductions showing that, given a randomized encoder/decoder, we can construct a deterministic coding scheme with essentially the same rate. Similarly, given a randomized adversarial error function, we can turn it into a deterministic one which is equally malicious in terms of rate. Therefore, for the encoder, it suffices to only consider deterministic codes, i.e., each message is mapped to a unique codeword with probability 1. For the adversary, we can assume the error pattern is a deterministic function of the transmitted codeword. Note that the error function does not have to be component-wise independent: the $i$-th component $s(i)$ of the noise pattern $\mathbf{s}$ can depend on every entry of $\mathbf{x}$, not only on the corresponding $x(i)$. Moreover, the decoder's decision of the estimated message given the received word can also be assumed to be deterministic. That is, we can require that the decoder outputs the correct message with zero error probability. Hence, the problem is purely combinatorial and all desirable events should happen with probability one.
In this work, we precisely characterize the Plotkin point for list decoding over any channel from the above large family of general adversarial channels. That is, we provide a criterion (a sufficient and necessary condition) under which a positive $(L-1)$-list decoding rate is possible for such channels.
In the context of high-dimensional geometry over finite spaces, the result can also be cast as pinning down the location of the phase transition threshold for $(L-1)$-multiple packing using general shapes (not necessarily Hamming balls) corresponding to the defining constraints for codewords and errors of the channel, below which exponential-sized multiple packings exist and above which they are impossible.
This criterion can be summarized in one sentence: exponential-sized $(L-1)$-list decodable codes for general adversarial channels (or $(L-1)$-multiple packings using general shapes) exist if and only if the completely positive tensor cone of order $L$ is not entirely contained in the $(L-1)$-list decoding confusability set of the channel. The jargon in this informal statement will become understandable once we formalize the problem setup and present rigorous claims. The proof consists of a sufficiency part and a necessity part. At a very high level, the sufficiency part follows from a random coding argument and its generalization inspired by the time-sharing argument frequently used in Network Information Theory. The necessity part builds upon and significantly generalizes the classical Plotkin bound: it proceeds by first extracting an equicoupled subcode using Ramsey theory and then applying a double counting trick.
Other results include the following.
1) We pin down the list decoding capacity of any given general adversarial channel for asymptotically large $L$. This generalizes the classic list decoding capacity in the bit-flip case. The lower bound is achieved by a purely random code. The upper bound follows from volume packing.
2) We determine the exact order (in terms of $\delta$) of the list sizes for a large fraction (exponentially close to one) of constant composition codes (all codewords have the same type) achieving the list decoding capacity of a given general adversarial channel within gap $\delta$. It turns out that if we pick a constant composition code from the set of all such codes, with high probability, it is exactly $\Theta(1/\delta)$-list decodable.
3) We give a lower bound on the $(L-1)$-list decoding capacity of a given general adversarial channel. It coincides with the generalized Gilbert--Varshamov bound obtained by [WBBJ] when $L-1$ is set to 1. Our bound is given by a random code construction assisted by expurgation, generalizing a classic construction for $(p, L-1)$-list decoding in the bit-flip case [Gur04]. Note that this construction differs from [WBBJ]'s construction for unique decoding using greedy packing.
4) In the special case where $L = 2$, i.e., the unique decoding setting, we evaluate the Gilbert--Varshamov-type bound and an achievable rate expression of cloud codes (codes constructed from CP distributions) obtained by [WBBJ] under the bit-flip model. In particular, we show that the Gilbert--Varshamov-type bound for general adversarial channels matches the classic GV bound in the theory of error correction codes. We also provide an explicit convex program for evaluating achievable rates of codes arising from CP distributions.
5) By evaluating our general criterion under the bit-flip model, we numerically recover Blinovsky's [Bli86] characterization of the Plotkin point for $(p, L-1)$-list decoding. This boils down to checking the feasibility of an explicit linear program with a structured coefficient matrix. Though the LP has size exponential in $L$, its feasibility can be checked in constant time since our results are tailored for constant $L$, with no dependence on the blocklength $n$ (which typically approaches infinity for many of our results to hold).
6) By utilizing facts discovered in this paper, we rigorously recover Blinovsky's [Bli86] characterization of the Plotkin point for $(p, L-1)$-list decoding. Our proof avoids the harder calculations and demystifies Blinovsky's formula. In particular, our lower bound on the Plotkin point explains, in the low rate regime, the fact that average-radius list decoding is equivalent to the classic notion of list decoding. We believe that this fact was first observed and rigorously justified by Blinovsky. It was later rediscovered many times and became the basic starting point of many papers, especially those regarding list decoding of random $q$-ary linear codes. Our upper bound relates the Plotkin point $P_{L-1}$ to the expected translation distance of a one-dimensional unbiased random walk after $L$ steps. In summary, using connections between codes and random variables, we are able to reinterpret the formulas given by Blinovsky [Bli86] and Alon--Bukh--Polyanskiy [ABP18] and provide a new intuitive formula which matches the known formulas.
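The random-walk interpretation mentioned in item 6 can be made concrete. The sketch below assumes the relation $P_{L-1} = \frac{1}{2} - \frac{\mathbb{E}|S_L|}{2L}$, where $S_L$ is the position of an unbiased $\pm 1$ random walk after $L$ steps (our reading of the "expected translation distance" statement; it is consistent with $P_1 = P_2 = 1/4$ and $P_3 = P_4 = 5/16$).

```python
from math import comb

def expected_abs_walk(L):
    # E|S_L| for an unbiased +-1 random walk of L steps:
    # S_L = L - 2k with probability C(L, k) / 2^L
    return sum(abs(L - 2 * k) * comb(L, k) for k in range(L + 1)) / 2 ** L

def plotkin_point(L):
    # Assumed identity: P_{L-1} = 1/2 - E|S_L| / (2L)
    return 0.5 - expected_abs_walk(L) / (2 * L)

print([plotkin_point(L) for L in (2, 3, 4, 5)])  # [0.25, 0.25, 0.3125, 0.3125]
```

The pairwise-constant pattern of the output reflects the fact that $\mathbb{E}|S_L|$ does not change when going from an even $L$ to $L + 1$, matching the periodic movement of the Plotkin point noted earlier.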

IV. OVERVIEW OF TECHNIQUES
Our paper is closely related to a sister paper [WBBJ] in which a subset of the authors are involved. That paper provides a generalized Plotkin bound for unique decoding over general adversarial channels. The authors showed that exponential-sized uniquely decodable codes, or hard packings, exist if and only if the set of completely positive matrices is not entirely contained in the confusability set associated to the given channel. This answers, for the $L = 2$ case, the question we posed at the beginning of the paper. We generalize their results to any universal constant $L$. Almost all results in [WBBJ] can be recovered by setting $L = 2$ in our paper.
We review the techniques used in this paper and highlight the similarities and differences between [WBBJ] and our work.
1) The general adversarial channel models that both papers are concerned with belong to a larger family of channels known as Arbitrarily Varying Channels (AVCs) in the Information Theory community. We want to emphasize that the bulk of the AVC literature deals with oblivious channels, in which the adversary has to pick his noise pattern before the codeword is chosen from the codebook by the encoder. This makes the problem significantly easier, and the capacity of such channels is precisely known. The channels that [WBBJ] and we are considering are such that the adversary gets to design the error pattern with knowledge of the transmitted codeword. This problem is far more difficult and the capacity is, again, widely open even for simple models such as the bit-flip channels. Indeed, the subclass of AVCs that [WBBJ] and we defined is motivated by the bit-flip channels and their various variants, e.g., weight constrained channels, $q$-ary channels, etc.

Footnote 8: In fact, Blinovsky provided upper and lower bounds for the $(p, L-1)$-list decoding capacity which happen to vanish at the same value of $p$.
Footnote 9: $(p, L-1)$-average-radius list decodability requires that the average distance (instead of the maximum distance required by the classic notion of $(p, L-1)$-list decodability) between any $L$-tuple of codewords and their centroid is larger than $np$. Average-radius list decodability is a more stringent requirement since it implies classic list decodability. However, it is easier to analyze since the problem is linearized. Indeed, it shows power in a long line of work understanding the bit-flip model [GN13], [Woo13], [RW14], [RW15], [RW18].
Footnote 10: Though the work by Wang--Budkuley--Bogdanov--Jaggi [WBBJ] has been accepted to ISIT 2019, the conference version is limited to 5 pages and contains essentially no proofs. At the time this paper was written, we did not have a publicly available full version of [WBBJ], and the following comparison is w.r.t. the current status of a draft of [WBBJ] that the authors kindly shared with us.
2) The connection between codes and random variables or distributions is classical in Theoretical Computer Science. The idea of realizing binary error correction codes using $\{-1,1\}$-valued random variables or functions supported on the Boolean hypercube $\{-1,1\}^n$ is spread throughout the literature, explicitly or in disguise. Such tricks show power since they allow people to borrow tools from other fields of Theoretical Computer Science, e.g., the theory of expander graphs, randomness extractors, small-bias distributions, discrete Fourier analysis, etc. ([SS96], [BADTS18], [TS17], [BL14]) to understand, construct and analyze codes. 3) With respect to (w.r.t.) codes for general adversarial channels, the specific idea of collecting admissible types of good codes and studying the set of corresponding distributions was used in [WBBJ]. In particular, they defined similar notions of self-couplings and confusability sets, which are submanifolds of matrices. Such objects only capture pairwise interactions of codewords, which are insufficient for understanding list decoding. We generalize their notions to tensors, which capture the (empirical) joint distributions of lists of codewords. Some properties in [WBBJ] continue to hold when objects in matrix versions are extended to tensor versions.
Other properties fail to hold, as we will see in the rest of the paper. We also encounter issues which simply do not exist in the unique decoding setting. As is well known, tensors are much more delicate [HL13] to handle than matrices. 4) To prove upper bounds on capacity, it is also an old idea to extract structured subcodes from any infinite sequence of good codes. Depending on the application, the types of structures and the techniques for extracting them may vary. To the best of our knowledge, in coding theory, the use of Ramsey theory for obtaining symmetric subcodes dates back at least to Blinovsky [Bli86]. His techniques are applied in a similar manner in follow-up work by Polyanskiy [Pol16] and Alon--Bukh--Polyanskiy [ABP18]. [WBBJ] generalizes this idea and manages to extract subcodes from arbitrary codes for general adversarial channels. Since they work with unique decoding, pairwise equicoupledness suffices. In our setup, we would like a sequence of subcodes which are $L$-wise equicoupled, in the sense that the (empirical) joint distribution of any $L$-tuple of codewords from the extracted subcode is approximately the same and close to some $\widehat{P}_{x_1,\cdots,x_L}$. This resembles but generalizes Polyanskiy's [Pol16] techniques. One of the downsides of invoking Ramsey theory is that the reduction usually causes a terrible detriment to the rate of the code, since the smallest size of a combinatorial object that is guaranteed to contain abundant structure is generally poorly understood in combinatorics. However, we can tolerate such a rate loss since we only care about the positivity of the list decoding capacity. 5) To show lower bounds on capacity, we use a random coding argument aided by expurgation. In the prior work [WBBJ], the achievability result is obtained by greedy packing. This is reminiscent of a classical technique in Coding Theory for proving the existence of good codes of a certain size.
Since, in the unique decoding (hard packing) setting, goodness of a code relies merely on pairwise statistics, the size of a greedy packing can be lower bounded using a standard volume counting argument. Indeed, this idea can be implemented in the general setting by counting the volume of the "forbidden region" of any codeword [WBBJ]. However, in the list decoding setting, the notion of confusability is defined for tuples of codewords and does not translate to non-intersection of forbidden regions of codewords. It is also not clear how to pack codewords in a greedy manner while ensuring the non-existence of local dense clusters. Instead, our code construction is more information-theoretic. We apply ideas of random coding with expurgation, commonly used in the study of error exponents in information theory. A random code may be mildly locally clustered, but this only occurs at rare locations in the space of all length-$n$ sequences over the input alphabet. Indeed, we are able to show that, with high probability, a random code carefully massaged by shoveling off a small number of codewords attains a GV-type bound for general channels. 6) The most difficult part of our work is the converse. a) First assume that the distribution $P_{x_1,\cdots,x_L}$ associated to the subcode obtained by the Ramsey reduction is symmetric. To show that no large code exists for general adversarial channels when $P_{x_1,\cdots,x_L}$ is not completely positive, we derive contradicting upper and lower bounds, valid once the code size exceeds a certain constant (not even depending on the codeword length!), on the inner product between a copositive witness of the non-complete positivity of $P_{x_1,\cdots,x_L}$ and the empirical distribution, averaged over all $L$-tuples in the symmetric equicoupled subcode. We review this double counting trick (for unique and list decoding under special settings that appeared in prior work) in Section V. The $L = 2$ case is proved in [WBBJ].
The existence of a witness of non-complete positivity is guaranteed by the duality of certain matrix cones. We generalize the calculations in [WBBJ] to joint distributions of more than two random variables. Similar notions of complete positivity and copositivity for tensors exist in the literature, and duality continues to hold. b) If $P_{x_1,\cdots,x_L}$ is asymmetric, we use a completely different argument. We reduce the problem, in a nontrivial way, to the $L = 2$ case, which is known to be true [WBBJ]. The $L = 2$ case itself is proved in [WBBJ] by viewing the task of constructing a long sequence of random variables with prescribed asymmetric marginals as a zero-sum game and using discrete Fourier analysis to provide conflicting bounds on the value of the game, if the sequence is longer than a certain constant (again independent of the blocklength).
V. PRIOR WORK

Among various ideas, our results are built upon prior work which applies a double counting trick to obtain upper bounds on code sizes. We first review this technique, which can be found in the proof of the classical Plotkin bound and its generalizations.
One way to prove Theorem 2 is by lower and upper bounding the expected pairwise distance of any given code $C$ with minimum distance larger than $2np$ (where $p = 1/4 + \varepsilon$), with $x, x'$ picked uniformly and independently from $C$. First note that pairs with $x = x'$ do not contribute to the expectation. On the one hand, the expectation is at least
$$|C|^{-2}|C|(|C|-1) d_{\min} \ge |C|^{-1}(|C|-1) \cdot 2np = |C|^{-1}(|C|-1) \cdot 2n(1/4 + \varepsilon).$$
On the other hand, if we stack the codewords into a $2^{nR} \times n$ matrix and let $S_j$ denote the number of $1$'s in the $j$-th column, then from the columns' perspective, the above expectation equals
$$\frac{2}{|C|^2} \sum_{j=1}^{n} S_j (|C| - S_j).$$
The coefficient $2$ is because we need to count $(x, x')$ and $(x', x)$ separately. This quantity is at most $n/2$ by concavity of the summands. Comparing the upper and lower bounds, we have $|C| \le 1 + \frac{1}{4\varepsilon}$, as claimed in Theorem 2.
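The double counting identity behind this argument can be checked numerically. Below is a minimal sketch on a toy binary code (the code and its parameters are illustrative, not taken from the paper): the pairwise view and the column view compute the same expectation, and the column view is bounded by $n/2$.

```python
from itertools import combinations

def avg_pairwise_distance(C):
    """Average Hamming distance over ordered pairs (x, x') drawn uniformly
    and independently from C (pairs with x = x' contribute zero)."""
    M = len(C)
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in combinations(C, 2))
    return 2 * total / M**2  # factor 2: count (x, x') and (x', x) separately

def column_count_bound(C):
    """The same expectation computed column by column:
    (2 / M^2) * sum_j S_j (M - S_j), which is at most n/2."""
    M, n = len(C), len(C[0])
    S = [sum(x[j] for x in C) for j in range(n)]  # number of 1's in column j
    return 2 * sum(s * (M - s) for s in S) / M**2

C = [(0, 0, 0, 0), (1, 1, 0, 0), (0, 1, 1, 1), (1, 0, 1, 0)]
# the two views agree exactly (it is an identity, not just a bound)
assert abs(avg_pairwise_distance(C) - column_count_bound(C)) < 1e-12
assert column_count_bound(C) <= len(C[0]) / 2 + 1e-12
```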
The above double counting argument can be generalized to the setting of list decoding. For the $(p, L-1)$-list decoding setup we introduced in Definition 3, the earliest work we are aware of following this idea is the one by Blinovsky [Bli86].
Unlike Theorem 2, Blinovsky did not only show that any $(p, L-1)$-list decodable code has to be small as long as $p > P_{L-1}$. He actually gave an upper bound (which is still essentially the best known) on the $(p, L-1)$-list decoding capacity for any $L$. We sketch his idea below but omit the complicated calculations.
First note that proving upper bounds on $C_{L-1}$ for fixed $p$ is equivalent to proving upper bounds on $p$ for fixed rate $R$. We define three quantities, $r_{\mathrm{LD}}$, $r_{\mathrm{avg}}$ and $r_{\mathrm{DC}}$; all expectations are over uniform selection from the corresponding sets. Let us parse what these quantities are measuring. 1) $r_{\mathrm{LD}}$ is known as the list decoding radius of a given code $C$. It is built from the minimax (Chebyshev radius) expression associated to a list $\mathcal{L}$ of vectors,
$$r_{\mathrm{Cheb}}(\mathcal{L}) := \min_{y} \max_{x \in \mathcal{L}} d_H(x, y),$$
and is precisely (after normalization) the largest allowable $p$ for a $(p, L-1)$-list decodable code of a fixed rate $R$. 2) $r_{\mathrm{avg}}$ is known as the average list decoding radius, and the min-average expression
$$r_{\mathrm{avg}}(\mathcal{L}) := \min_{y} \frac{1}{L} \sum_{x \in \mathcal{L}} d_H(x, y)$$
is the average radius of a list. It is not hard to see that the average radius center of $\mathcal{L}$ is the component-wise majority of the vectors in $\mathcal{L}$, i.e., the minimizer $y^*$ has $\mathrm{MAJ}(x(i) : x \in \mathcal{L})$ as its $i$-th component. Define plurality as
$$\mathrm{PLUR} \colon \mathbb{F}_2^L \to [0, 1], \quad (x_1, \cdots, x_L) \mapsto \frac{1}{L} \left| \{ i \in [L] : x_i = \mathrm{MAJ}(x_1, \cdots, x_L) \} \right|,$$
which is the fraction of the most frequent symbol. Then the average radius of $\mathcal{L}$ can be explicitly written as
$$r_{\mathrm{avg}}(\mathcal{L}) = \sum_{i=1}^{n} \left( 1 - \mathrm{PLUR}(x_1(i), \cdots, x_L(i)) \right).$$
3) $r_{\mathrm{DC}}$ is a further variant of $r_{\mathrm{LD}}$, the ultimate quantity we are looking for. This is the object that Blinovsky was really dealing with. Note that it is in the same spirit as the quantity (7) considered in the double counting argument in the proof of the classical Plotkin bound. Blinovsky used $r_{\mathrm{DC}}$ as a proxy to finally bound $r_{\mathrm{LD}}$. By extracting a constant weight subcode and applying the double counting trick (and using convexity of a certain function), Blinovsky showed the following.
Lemma 11. Let $\lambda \in [0, 1/2]$ and fix $R = 1 - H(\lambda)$. Then $r_{\mathrm{DC}}/n$ is upper bounded by an explicit function of $\lambda$ and $L$.
Clearly, by definition, we have $r_{\mathrm{LD}} \ge r_{\mathrm{avg}}$ and $r_{\mathrm{DC}} \ge r_{\mathrm{avg}}$.
So Lemma 11 automatically holds for $r_{\mathrm{avg}}$. However, a priori the relation between $r_{\mathrm{LD}}$ and $r_{\mathrm{DC}}$ is unclear. Surprisingly, Blinovsky showed that it is "okay" to replace the first and third optimization with averaging, in the following sense.
Lemma 12. For any infinite sequence of codes $C_n$, there exists an infinite sequence of subcodes $C'_n \subseteq C_n$ such that $r_{\mathrm{LD}}(C'_n) = r_{\mathrm{avg}}(C'_n) + o(n)$.
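The plurality formula for the average radius can be checked directly on a small example. The sketch below (a toy instantiation, with an illustrative list of binary vectors) compares the closed-form expression $\sum_i (1 - \mathrm{PLUR}(\cdot))$ against brute-force minimization over all centers.

```python
from collections import Counter
from itertools import product

def plur(symbols):
    """Fraction of positions in the tuple holding the most frequent symbol."""
    return Counter(symbols).most_common(1)[0][1] / len(symbols)

def avg_radius(L):
    """Average radius via the plurality formula: sum_i (1 - PLUR(x_1(i), ..., x_L(i)))."""
    return sum(1 - plur(col) for col in zip(*L))

def avg_radius_bruteforce(L):
    """min over all binary centers y of the average Hamming distance from y to the list."""
    n = len(L[0])
    return min(sum(sum(a != b for a, b in zip(x, y)) for x in L) / len(L)
               for y in product((0, 1), repeat=n))

L_list = [(0, 0, 1, 0), (1, 0, 1, 1), (0, 1, 1, 0)]
assert abs(avg_radius(L_list) - avg_radius_bruteforce(L_list)) < 1e-12
```

The per-coordinate argument behind the formula: for each position, the best choice of the center's symbol is the plurality symbol, incurring average cost $1 - \mathrm{PLUR}$.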
The proof involves an equidistant subcode extraction step using Ramsey theory. Lemma 12 implies that the same bound in Lemma 11 holds for $r_{\mathrm{LD}}$ as well!

C. Cohen-Litsyn-Zémor [CLZ94]

Similar ideas were used to provide upper bounds on the erasure list decoding capacity. A binary code is said to be $(p, L-1)$-erasure list decodable if for any $T \in \binom{[n]}{n(1-p)}$ and any $y \in \mathbb{F}_2^{(1-p)n}$,
$$\left| \left\{ x \in C : x|_T = y \right\} \right| \le L - 1,$$
where $x|_T$ denotes the restriction of $x$ to $T$, i.e., a vector of length $|T|$ consisting only of the components of $x$ indexed by elements of $T$. The erasure list decoding radius $r_{\mathrm{LD,eras}}$ and the $(p, L-1)$-erasure list decoding capacity $C_{L-1,\mathrm{eras}}$ are defined in the same manner. Cohen-Litsyn-Zémor [CLZ94] proved Theorem 13; the idea is essentially again double counting. Here, it turns out that the right object to be counted is the erasure radius of a list $\mathcal{L}$,
$$r_{\mathrm{eras}}(\mathcal{L}) := \left| \left\{ i \in [n] : x(i) \text{ are the same } \forall x \in \mathcal{L} \right\} \right|.$$
Extracting a subcode living on a sphere (followed by shifting out the center to get a constant weight code $C'$) and conducting similar calculations allow the authors to conclude Theorem 13.
Remark 14. The original result in [CLZ94] was stated for the generalized distance, an equivalent object that can be mapped to the erasure list decoding radius via a well-known connection. The version above was presented in Guruswami's PhD thesis [Gur04].

D. Wang-Budkuley-Bogdanov-Jaggi [WBBJ]
As mentioned, our work is a continuation of the prior work [WBBJ], in which a subset of the authors were involved. We refer the reader to the corresponding paragraphs in Sec. I and Sec. III for a review of their work and a comparison with ours.

VI. ORGANIZATION OF THE PAPER
In Sec. I we presented numerical examples that illustrate our results. In Sec. II we motivated the problem and introduced the relevant background in coding theory. Our contributions in this paper were listed in detail in Sec. III. In Sec. IV we reviewed various techniques used in this paper and highlighted our innovations. Prior work that our results build on and push forward was surveyed in Sec. V.
The rest of the paper is organized as follows. We fix our notational conventions in Sec. VII and provide the necessary preliminaries, especially the method of types in information theory, in Sec. VIII. We develop basic notions that will be used throughout the paper in Sec. IX. In particular, general adversarial channels and objects associated to them are introduced in this section. In Sec. X we prove the list decoding capacity theorem for general adversarial channels when $L$ is asymptotically large. Furthermore, we obtain tight list size bounds for most capacity-achieving constant composition codes. In Sec. XII and Sec. XIII we show sufficiency and necessity, respectively, of the criterion we obtain for the existence of exponential-sized $(L-1)$-list decodable codes (where $L$ is an arbitrary universal constant) for general adversarial channels. In Sec. XIV we make two remarks on the converse, which is technically the most challenging piece of our work. In Sec. XV we verify the correctness of our characterization in Sec. XII and Sec. XIII by running it on the problem specialized to a typical coding theory model which has been understood in prior works [Bli86], [ABP18]. In Sec. XVI, utilizing tools developed and facts proved in this paper, we rigorously rederive Blinovsky's [Bli86] results. We obtain more intuitive expressions and demystify his calculations. In Sec. XVII we evaluate bounds on the unique decoding capacity ($L = 2$) in [WBBJ] under a typical coding theory model. We conclude the paper and list several open questions and future directions in Sec. XVIII. Some calculations and background knowledge are deferred to Appendices A, B, C and D.
VII. NOTATION

Conventions. Sets are denoted by capital letters in calligraphic typeface, e.g., $\mathcal{C}$, $\mathcal{I}$, etc. Random variables are denoted by lower case letters in boldface or capital letters in plain typeface, e.g., $\mathbf{m}, \mathbf{x}, \mathbf{s}, U, W$, etc. Their realizations are denoted by the corresponding lower case letters in plain typeface, e.g., $m, x, s, u, w$, etc. Vectors (stochastic or deterministic) of length $n$, where $n$ is the blocklength, are denoted by lower case letters with an underline, e.g., $\underline{x}, \underline{s}, \underline{\mathbf{x}}, \underline{\mathbf{s}}$, etc. The $i$-th entry of a vector $\underline{x} \in \mathcal{X}^n$ is denoted by $\underline{x}(i)$, since we can alternatively think of $\underline{x}$ as a function from $[n]$ to $\mathcal{X}$; the same convention applies to a random vector $\underline{\mathbf{x}}(i)$. Matrices are denoted by capital letters in boldface, e.g., $\mathbf{P}, \boldsymbol{\Sigma}$, etc. Similarly, the $(i, j)$-th entry of a matrix $\mathbf{G} \in \mathcal{X}^{n \times \kappa}$ is denoted by $\mathbf{G}(i, j)$. The letter $\mathbf{I}$ is reserved for the identity matrix. We sometimes write $\mathbf{I}_n$ to explicitly specify that it is an $n \times n$ square identity matrix. Tensors are denoted by capital letters in plain typeface, e.g., $T, P$, etc.

Functions. We use the standard Bachmann-Landau (Big-Oh) notation for asymptotics of functions in positive integers.
For two real-valued functions $f, g$ on the same domain $\Omega$, let $fg$ and $f/g$ denote the functions obtained by multiplying and taking the ratio of the images of $f$ and $g$ point-wise, respectively; that is, for $\omega \in \Omega$, $(fg)(\omega) = f(\omega) g(\omega)$ and $(f/g)(\omega) = f(\omega)/g(\omega)$. In particular, for types or distributions, we can write $\tau_{x,y} = \tau_x \tau_{y|x}$, $\tau_{y|x} = \tau_{x,y}/\tau_x$, or $P_{x,y} = P_x P_{y|x}$, $P_{y|x} = P_{x,y}/P_x$, and so on.
For two real-valued functions $f(n), g(n)$ in positive integers, we say that $f(n)$ asymptotically equals $g(n)$, denoted $f(n) \asymp g(n)$, if $f(n)/g(n) \to 1$ as $n \to \infty$. For instance, $2^{n + \log n} \asymp 2^{n + \log n + 2/n}$, while $2^{n + \log n} \not\asymp 2^n$. We write $f(n) \doteq g(n)$ if $\frac{1}{n} \log \frac{f(n)}{g(n)} \to 0$ as $n \to \infty$.
For any $q \in \mathbb{R}_{>0}$, we write $\log_q(\cdot)$ for the logarithm to the base $q$. In particular, let $\log(\cdot)$ and $\ln(\cdot)$ denote logarithms to the base two and $e$, respectively.

Sets. For any two sets $A$ and $B$ with additive and multiplicative structures, let $A + B$ and $A \cdot B$ denote their Minkowski sum and Minkowski product,
$$A + B := \{ a + b : a \in A, b \in B \}, \quad A \cdot B := \{ a \cdot b : a \in A, b \in B \},$$
respectively. If $A = \{x\}$ is a singleton set, we write $x + B$ and $xB$ for $\{x\} + B$ and $\{x\} \cdot B$.
For any finite set $\mathcal{X}$ and any integer $0 \le k \le |\mathcal{X}|$, we use $\binom{\mathcal{X}}{k}$ to denote the collection of all subsets of $\mathcal{X}$ of size $k$,
$$\binom{\mathcal{X}}{k} := \{ \mathcal{Y} \subseteq \mathcal{X} : |\mathcal{Y}| = k \}.$$
For $M \in \mathbb{Z}_{>0}$, we let $[M]$ denote the set of the first $M$ positive integers $\{1, 2, \cdots, M\}$.
For any $A \subseteq \Omega$, the indicator function of $A$ is defined, for any $x \in \Omega$, as
$$\mathbb{1}_A(x) := \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}$$
At times, we will slightly abuse notation by saying that $\mathbb{1}_A$ is $1$ when event $A$ happens and zero otherwise. Note that $\mathbb{1}_A(\cdot) = \mathbb{1}_{\{\cdot \in A\}}$.

Geometry. For any $x \in \mathbb{F}_q^n$, let $\mathrm{wt}_H(x)$ denote the Hamming weight of $x$, i.e., the number of nonzero entries of $x$,
$$\mathrm{wt}_H(x) := |\{ i \in [n] : x(i) \ne 0 \}|.$$
For any $x, y \in \mathbb{F}_q^n$, let $d_H(x, y)$ denote the Hamming distance between $x$ and $y$, i.e., the number of locations where they differ.
Balls and spheres in $\mathbb{F}_q^n$ centered around a point $x \in \mathbb{F}_q^n$ of radius $r \in \{0, 1, \cdots, n\}$ w.r.t. the Hamming metric are defined as
$$\mathcal{B}_H^n(x, r) := \{ y \in \mathbb{F}_q^n : d_H(x, y) \le r \}, \quad \mathcal{S}_H^n(x, r) := \{ y \in \mathbb{F}_q^n : d_H(x, y) = r \}.$$
We will drop the subscript and superscript for the associated metric and dimension when they are clear from the context.

Probability. For a finite set $\mathcal{X}$, $\Delta(\mathcal{X})$ denotes the probability simplex on $\mathcal{X}$, i.e., the set of all probability distributions supported on $\mathcal{X}$,
$$\Delta(\mathcal{X}) := \Big\{ P \colon \mathcal{X} \to [0, 1] \ \Big| \ \sum_{x \in \mathcal{X}} P(x) = 1 \Big\}.$$
Similarly, $\Delta(\mathcal{X} \times \mathcal{Y})$ denotes the probability simplex on $\mathcal{X} \times \mathcal{Y}$. Let $\Delta(\mathcal{Y}|\mathcal{X})$ denote the set of all conditional distributions,
$$\Delta(\mathcal{Y}|\mathcal{X}) := \{ P_{y|x} \colon P_{y|x}(\cdot|x) \in \Delta(\mathcal{Y}) \text{ for every } x \in \mathcal{X} \}.$$
The general notion for multiple spaces is defined in the same manner. The probability mass function (p.m.f.) of a discrete random variable $\mathbf{x}$ or a random vector $\underline{\mathbf{x}}$ is denoted by $P_{\mathbf{x}}$ or $P_{\underline{\mathbf{x}}}$. We use the shorthand $P_{\mathbf{x}}(x)$ or $P_{\underline{\mathbf{x}}}(\underline{x})$ for the probability that $\mathbf{x}$ or $\underline{\mathbf{x}}$, distributed according to $P_{\mathbf{x}}$ or $P_{\underline{\mathbf{x}}}$, takes a particular value
for some $x \in \mathcal{X}$ or $\underline{x} \in \mathcal{X}^n$. If every entry of $\underline{\mathbf{x}}$ is independently and identically distributed (i.i.d.) according to $P_{\mathbf{x}}$, then we write $\underline{\mathbf{x}} \sim P_{\mathbf{x}}^{\otimes n}$, where $P_{\mathbf{x}}^{\otimes n}$ is the product distribution defined as
$$P_{\underline{\mathbf{x}}}(\underline{x}) = P_{\mathbf{x}}^{\otimes n}(\underline{x}) := \prod_{i=1}^{n} P_{\mathbf{x}}(\underline{x}(i)).$$
Let $\mathrm{Unif}(\Omega)$ denote the uniform distribution over a probability space $\Omega$. For a joint distribution $P_{\mathbf{x},\mathbf{y}} \in \Delta(\mathcal{X} \times \mathcal{Y})$, let $[P_{\mathbf{x},\mathbf{y}}]_{\mathbf{x}} \in \Delta(\mathcal{X})$ denote the marginalization onto the variable $\mathbf{x}$, i.e., for $x \in \mathcal{X}$,
$$[P_{\mathbf{x},\mathbf{y}}]_{\mathbf{x}}(x) = \sum_{y \in \mathcal{Y}} P_{\mathbf{x},\mathbf{y}}(x, y).$$
Sometimes we simply write it as $P_{\mathbf{x}}$ when the notation is not overloaded.

Algebra. Let $\|\cdot\|_p$ denote the standard $\ell^p$-norm. Specifically, for any $x \in \mathbb{R}^n$,
$$\|x\|_p := \Big( \sum_{i=1}^{n} |x(i)|^p \Big)^{1/p}.$$
For brevity, we also write $\|\cdot\|$ for the $\ell^2$-norm. An order-$k$ dimension-$(n_1, \cdots, n_k)$ tensor $T$ is a multidimensional array. It can be thought of as a function on the product space $[n_1] \times \cdots \times [n_k]$ which identifies the value of each of its entries,
where, as usual, we use $T(i_1, \cdots, i_k)$ to denote its $(i_1, \cdots, i_k)$-th entry.
We list below various sets/spaces of matrices and tensors that we are going to use in this paper. Unless otherwise specified, all matrices and tensors are over the real field. ‚ The space of $n \times m$ matrices: $\mathrm{Mat}_{n,m}$. When $n = m$, we write $\mathrm{Mat}_n$ for the space of square matrices of dimension $n$. ‚ The space of order-$k$ dimension-$(n_1, \cdots, n_k)$ tensors: $\mathrm{Ten}_{n_1,\cdots,n_k}^{\otimes k}$. If every dimension of $T$ is the same, i.e., $n_1 = \cdots = n_k = n$, then we write $\mathrm{Ten}_n^{\otimes k}$ for the space of equilateral tensors of order $k$ and dimension $n$. ‚ Definitions of the sets of symmetric (Sym), non-negative (NN), doubly non-negative (DNN), positive semidefinite (PSD), completely positive (CP), copositive (coP), etc. matrices and tensors are deferred to the corresponding sections. Note that $\mathrm{Mat}_{n,m} = \mathrm{Ten}_{n,m}^{\otimes 2}$. When the order of the tensors is $k = 2$, namely matrices, we drop the superscript $\otimes 2$. For a tensor $T \in \mathrm{Ten}_{n_1,\cdots,n_k}^{\otimes k}$, we use $\|T\|_F$ to denote the Frobenius norm of $T$, which is the $\ell^2$ norm when $T$ is vectorized into a length-$n_1 \cdots n_k$ vector.
We use $\|T\|_{\mathrm{sav}}$ to denote the sum-absolute-value norm of $T$, which is the $\ell^1$ norm after vectorization. Similarly, define
$$\|T\|_{\mathrm{mav}} := \max_{(i_1, \cdots, i_k) \in [n_1] \times \cdots \times [n_k]} |T(i_1, \cdots, i_k)|$$
to be the max-absolute-value norm of $T$, which is the $\ell^\infty$ norm when $T$ is viewed as a vector. Note that the Frobenius norm, sum-absolute-value norm and max-absolute-value norm are different from the matrix/tensor 2-norm, 1-norm and $\infty$-norm. However, they do coincide with the corresponding vector norms when the order of the tensor is one.
We endow the matrix/tensor space with an inner product. For tensors $T_1$ and $T_2$, both of order $k$ and dimension $(n_1, \cdots, n_k)$,
$$\langle T_1, T_2 \rangle := \sum_{(i_1, \cdots, i_k) \in [n_1] \times \cdots \times [n_k]} T_1(i_1, \cdots, i_k) \, T_2(i_1, \cdots, i_k).$$
When $T_1, T_2$ are matrices, the above definition agrees with the Frobenius inner product, which is alternatively defined as $\mathrm{Tr}(T_1^\top T_2)$. When $T_1, T_2$ are vectors, this inner product becomes the standard inner product associated to $\mathbb{R}^n$ as a Hilbert space, which is denoted by the same notation without confusion.
Let $S_n$ denote the symmetric group of degree $n$, consisting of the $n!$ permutations of $[n]$. Permutations are typically denoted by Greek letters.

Information theory. We use $H(\cdot)$ to interchangeably denote the binary entropy function or the Shannon entropy; the exact meaning will usually be clear from the context. In particular, for any $p \in [0, 1]$, $H(p)$ denotes the binary entropy
$$H(p) := -p \log p - (1 - p) \log(1 - p).$$
For a distribution $P \in \Delta(\mathcal{X})$ on a finite alphabet $\mathcal{X}$, or a random variable $\mathbf{x} \sim P$ distributed according to $P$, the Shannon entropy of $P$ or $\mathbf{x}$ is defined similarly as
$$H(P) = H(\mathbf{x}) := -\sum_{x \in \mathcal{X}} P(x) \log P(x).$$
For two distributions $P, Q \in \Delta(\mathcal{X})$ on the same alphabet $\mathcal{X}$, the Kullback-Leibler (KL) divergence between them is defined as
$$D(P \| Q) := \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$
If $\mathbf{x}, \mathbf{y}$ are jointly distributed according to $P_{\mathbf{x},\mathbf{y}} \in \Delta(\mathcal{X} \times \mathcal{Y})$, then ‚ their joint entropy is defined as $H(\mathbf{x}, \mathbf{y}) := H(P_{\mathbf{x},\mathbf{y}})$; ‚ their mutual information is defined as $I(\mathbf{x}; \mathbf{y}) := D(P_{\mathbf{x},\mathbf{y}} \| P_{\mathbf{x}} P_{\mathbf{y}})$. If the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$ is $P_{\mathbf{y}|\mathbf{x}} \in \Delta(\mathcal{Y}|\mathcal{X})$, then the conditional entropy of $\mathbf{y}$ given $\mathbf{x}$ is defined as $H(\mathbf{y}|\mathbf{x}) := H(\mathbf{x}, \mathbf{y}) - H(\mathbf{x})$. It is easy to check that the different definitions above for the same quantities are consistent with each other.
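The consistency of these definitions can be sanity-checked numerically. A minimal sketch (distributions and helper names are illustrative, not from the paper) verifying the identity $I(\mathbf{x};\mathbf{y}) = D(P_{\mathbf{x},\mathbf{y}} \| P_{\mathbf{x}} P_{\mathbf{y}}) = H(\mathbf{x}) + H(\mathbf{y}) - H(\mathbf{x},\mathbf{y})$:

```python
from math import log2

def H(P):
    """Shannon entropy (base 2) of a distribution given as a dict symbol -> mass."""
    return -sum(p * log2(p) for p in P.values() if p > 0)

def D(P, Q):
    """KL divergence D(P || Q); infinite if P is not absolutely continuous w.r.t. Q."""
    if any(P[x] > 0 and Q.get(x, 0) == 0 for x in P):
        return float("inf")
    return sum(P[x] * log2(P[x] / Q[x]) for x in P if P[x] > 0)

Pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # a joint distribution
Px = {0: 0.5, 1: 0.5}
Py = {0: 0.5, 1: 0.5}
PxPy = {(a, b): Px[a] * Py[b] for a in Px for b in Py}       # product of marginals
I = D(Pxy, PxPy)   # mutual information as a KL divergence
assert abs(I - (H(Px) + H(Py) - H(Pxy))) < 1e-12
```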
Corollary 16 (Asymptotics of multinomials). For any positive integers $n \ge q$ and any $q$-partition $(n_1, \cdots, n_q)$ of $n$ ($n_1 + \cdots + n_q = n$, $n_i \ge 0$ for every $i$),
$$\binom{n}{n_1, \cdots, n_q} \doteq 2^{n H(P)},$$
where $P \in \Delta([q])$ is the empirical distribution with $P(i) = n_i/n$ for $i \in [q]$. More precisely, we have
$$\frac{1}{\nu(n)} 2^{n H(P)} \le \binom{n}{n_1, \cdots, n_q} \le 2^{n H(P)},$$
where $\nu(n)$ is a polynomial.

Fact 17 (Approximation of binomials). For any positive integers $n \ge k$, $\binom{n}{k} \doteq 2^{n H(k/n)}$.

Without loss of generality, write $\mathcal{X} = \{x_1, \cdots, x_{|\mathcal{X}|}\}$. For $\underline{x} \in \mathcal{X}^n$ and $x \in \mathcal{X}$, let
$$N_x(\underline{x}) := |\{ i \in [n] : \underline{x}(i) = x \}|,$$
which counts the number of occurrences of the symbol $x$ in the vector $\underline{x}$.

Definition 20 (Types). For a length-$n$ vector $\underline{x}$ over a finite alphabet $\mathcal{X}$, the type $\tau_{\underline{x}}$ of $\underline{x}$ is a length-$|\mathcal{X}|$ (empirical) probability vector (or the histogram of $\underline{x}$), i.e., $\tau_{\underline{x}} \in [0, 1]^{|\mathcal{X}|}$ has entries
$$\tau_{\underline{x}}(x) := \frac{N_x(\underline{x})}{n}$$
for any $x \in \mathcal{X}$.
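Computing a type is a one-liner. The sketch below (with an illustrative vector; the exact rational representation via `Fraction` is a convenience, not the paper's convention) instantiates Definition 20:

```python
from collections import Counter
from fractions import Fraction

def type_of(x):
    """Empirical distribution (type) of a length-n vector, with exact rational entries."""
    n = len(x)
    return {a: Fraction(c, n) for a, c in Counter(x).items()}

x = (1, 2, 1, 1, 2, 3)
tx = type_of(x)
# N_1(x) = 3, N_2(x) = 2, N_3(x) = 1 out of n = 6 positions
assert tx == {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
```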
Definition 21 (Joint types and conditional types). The joint type $\tau_{\underline{x},\underline{y}} \in [0, 1]^{|\mathcal{X}| \times |\mathcal{Y}|}$ of two vectors $\underline{x} \in \mathcal{X}^n$ and $\underline{y} \in \mathcal{Y}^n$ is defined as
$$\tau_{\underline{x},\underline{y}}(x, y) := \frac{1}{n} |\{ i \in [n] : \underline{x}(i) = x, \ \underline{y}(i) = y \}|.$$
The conditional type $\tau_{\underline{y}|\underline{x}} \in [0, 1]^{|\mathcal{X}| \times |\mathcal{Y}|}$ of a vector $\underline{y} \in \mathcal{Y}^n$ given another vector $\underline{x} \in \mathcal{X}^n$ is defined as $\tau_{\underline{y}|\underline{x}} := \tau_{\underline{x},\underline{y}} / \tau_{\underline{x}}$.

Remark 22 (Types vs. distributions). Types are empirical distributions of length-$n$ vectors. They can only take rational values, in particular $a/n$ for $a \in \{0, 1, \cdots, n\}$. For a fixed $n$ and finite alphabets, there are only $\mathrm{poly}(n)$ many types. However, there are uncountably many distributions on any finite alphabet, and they form a probability simplex.
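Joint types behave like joint distributions; in particular, marginalizing a joint type recovers the type of each vector. A minimal sketch (vectors are illustrative) of Definition 21:

```python
from collections import Counter
from fractions import Fraction

def joint_type(x, y):
    """Joint type tau_{x,y} of two equal-length vectors, as a dict (a, b) -> mass."""
    assert len(x) == len(y)
    n = len(x)
    return {ab: Fraction(c, n) for ab, c in Counter(zip(x, y)).items()}

x = (1, 2, 1, 1, 2, 3)
y = (0, 0, 1, 0, 1, 1)
txy = joint_type(x, y)
# marginalizing the joint type onto x recovers tau_x
marg = {}
for (a, b), p in txy.items():
    marg[a] = marg.get(a, 0) + p
assert marg == {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
```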
Definition 23 (Set of types). We use $\mathcal{P}^{(n)}(\mathcal{X})$ to denote the set of all possible types of length-$n$ vectors over $\mathcal{X}$.
We define $\mathcal{P}^{(n)}(\mathcal{X} \times \mathcal{Y})$, $\mathcal{P}^{(n)}(\mathcal{Y}|\underline{x})$ and $\mathcal{P}^{(n)}(\mathcal{Y}|\mathcal{X})$ to be 1) the set of all joint types; 2) the set of all conditional types of $\underline{y}$ given a particular $\underline{x}$; 3) the set of all conditional types of $\underline{y}$ given some $\underline{x}$, respectively.
Lemma 24 (Types are dense in distributions). The union of the sets of types over all possible blocklengths is dense in the set of distributions, i.e., $\bigcup_{n=1}^{\infty} \mathcal{P}^{(n)}(\mathcal{X})$ is dense in $\Delta(\mathcal{X})$. This holds true for joint types and conditional types as well.
Lemma 25 (Number of types). When alphabet sizes are constants, the number of types of length-$n$ vectors is polynomial in $n$. To be precise, the number of types of length-$n$ vectors over $\mathcal{X}$ is
$$|\mathcal{P}^{(n)}(\mathcal{X})| = \binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1}. \quad (26)$$
For a vector $\underline{x} \in \mathcal{X}^n$ of type $\tau_{\underline{x}}$, the number of conditional types of length-$n$ vectors over $\mathcal{Y}$ given $\underline{x}$ is $|\mathcal{P}^{(n)}(\mathcal{Y}|\underline{x})|$, and the number of conditional types of $\mathcal{Y}$-valued vectors given some $\mathcal{X}$-valued vector is $|\mathcal{P}^{(n)}(\mathcal{Y}|\mathcal{X})|$. The following elementary bounds from [CK11] are sufficient for our purposes in this paper:
$$|\mathcal{P}^{(n)}(\mathcal{X})| \le (n + 1)^{|\mathcal{X}|}, \quad |\mathcal{P}^{(n)}(\mathcal{Y}|\underline{x})| \le |\mathcal{P}^{(n)}(\mathcal{Y}|\mathcal{X})| \le (n + 1)^{|\mathcal{X}| \cdot |\mathcal{Y}|}.$$
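The exact count (26) and the elementary upper bound can be verified by brute-force enumeration for small parameters (a toy check; the parameters are illustrative):

```python
from itertools import product
from math import comb

def count_types_bruteforce(n, q):
    """Enumerate all length-n vectors over an alphabet of size q and count distinct types,
    where a type is identified with its vector of symbol counts."""
    return len({tuple(v.count(a) for a in range(q)) for v in product(range(q), repeat=n)})

for n, q in [(4, 2), (5, 3), (3, 4)]:
    assert count_types_bruteforce(n, q) == comb(n + q - 1, q - 1)  # stars and bars
    assert comb(n + q - 1, q - 1) <= (n + 1) ** q                  # the [CK11]-style bound
```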
Definition 29 (Type classes). Define the type class $T_{\underline{x}}(\tau_{\underline{x}})$ w.r.t. a type $\tau_{\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{X})$ as
$$T_{\underline{x}}(\tau_{\underline{x}}) := \{ \underline{x}' \in \mathcal{X}^n : \tau_{\underline{x}'} = \tau_{\underline{x}} \}.$$
Joint type classes and conditional type classes are defined in a similar manner. The joint type class $T_{\underline{x},\underline{y}}(\tau_{\underline{x},\underline{y}})$ w.r.t. a joint type $\tau_{\underline{x},\underline{y}} \in \mathcal{P}^{(n)}(\mathcal{X} \times \mathcal{Y})$ is the set of pairs $(\underline{x}', \underline{y}')$ whose joint type is $\tau_{\underline{x},\underline{y}}$. The conditional type class $T_{\underline{y}|\underline{x}}(\tau_{\underline{y}|\underline{x}})$ w.r.t. a conditional type $\tau_{\underline{y}|\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{Y}|\underline{x})$ given a vector $\underline{x} \in \mathcal{X}^n$ is the set of $\underline{y}' \in \mathcal{Y}^n$ whose conditional type given $\underline{x}$ is $\tau_{\underline{y}|\underline{x}}$. The conditional type class $T_{\underline{y}|\underline{x}}(\tau_{\underline{y}|\underline{x}})$ w.r.t. a conditional type $\tau_{\underline{y}|\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{Y}|\mathcal{X})$ given some vector of type $\tau_{\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{X})$ is defined analogously, where in Eqn. (30) $\underline{x}' \in T_{\underline{x}}(\tau_{\underline{x}})$ can be chosen arbitrarily and $\tau_{\underline{y}|\underline{x}'} = \tau_{\underline{y}|\underline{x}}$ in both Eqn. (30) and (31).
Remark 32. We will also write $\tau_{\underline{x}}$, $\tau_{\underline{x},\underline{y}}$, $\tau_{\underline{y}|\underline{x}}$, etc. for generic types taken from the corresponding sets of types even if they do not come from instantiated vectors. For instance, $\tau_{\underline{x}}$ is a type in $\mathcal{P}^{(n)}(\mathcal{X})$ corresponding to any $\underline{x} \in T_{\underline{x}}(\tau_{\underline{x}})$. The particular choice of $\underline{x}$ is not important and will not be specified. This is to explicitly distinguish between types and distributions.
2) For any vector $\underline{x} \in \mathcal{X}^n$ and any conditional type $\tau_{\underline{y}|\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{Y}|\underline{x})$,
$$|T_{\underline{y}|\underline{x}}(\tau_{\underline{y}|\underline{x}})| \doteq 2^{n H(\mathbf{y}|\mathbf{x})},$$
where the conditional entropy is evaluated w.r.t. the joint type $\tau_{\underline{x}} \tau_{\underline{y}|\underline{x}}$.
3) For any conditional type $\tau_{\underline{y}|\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{Y}|\mathcal{X})$,
$$|T_{\underline{y}|\underline{x}}(\tau_{\underline{y}|\underline{x}})| \doteq 2^{n \max_{\tau_{\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{X})} H(\mathbf{y}|\mathbf{x})},$$
where the conditional entropy is evaluated w.r.t. the joint type $\tau_{\underline{x}} \tau_{\underline{y}|\underline{x}}$.
Proof. 1) The number of sequences $\underline{x} \in \mathcal{X}^n$ of type $\tau_{\underline{x}}$ is precisely
$$\binom{n}{\tau_{\underline{x}}(1) n, \cdots, \tau_{\underline{x}}(|\mathcal{X}|) n},$$
and the claim follows from Lemma 15. 2) Given $\underline{x} \in \mathcal{X}^n$, the number of sequences $\underline{y} \in \mathcal{Y}^n$ of conditional type $\tau_{\underline{y}|\underline{x}}$ is precisely
$$\prod_{x \in \mathcal{X}} \binom{\tau_{\underline{x}}(x) n}{\tau_{\underline{y}|\underline{x}}(1|x) \tau_{\underline{x}}(x) n, \cdots, \tau_{\underline{y}|\underline{x}}(|\mathcal{Y}| \,|\, x) \tau_{\underline{x}}(x) n},$$
and the claim follows from Lemma 15.
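The multinomial count and its entropy approximation can be checked numerically. A minimal sketch (the blocklength and type are illustrative) showing that $\log_2 |T_{\underline{x}}(\tau_{\underline{x}})|$ is within a polynomial (in $n$) correction of $n H(\tau_{\underline{x}})$:

```python
from math import comb, log2

def multinomial(n, counts):
    """n! / (n_1! ... n_q!) computed via iterated binomials (exact integer arithmetic)."""
    assert sum(counts) == n
    total, rem = 1, n
    for c in counts:
        total *= comb(rem, c)
        rem -= c
    return total

def entropy_of_counts(n, counts):
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

n, counts = 60, (30, 20, 10)  # tau_x = (1/2, 1/3, 1/6)
size = multinomial(n, counts)  # |T_x(tau_x)|
exponent_gap = n * entropy_of_counts(n, counts) - log2(size)
# multinomial <= 2^{nH}, and the gap is only polynomial in n (here q = 3 symbols)
assert 0 <= exponent_gap <= 3 * log2(n + 1)
```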
The claim follows from Eqn. (26) and the previous claim.
Lemma 34. If $\underline{\mathbf{x}}$ is generated according to the product distribution $P_{\mathbf{x}}^{\otimes n}$, then for any $\underline{x} \in T_{\underline{x}}(P_{\mathbf{x}})$,
$$\Pr[\underline{\mathbf{x}} = \underline{x}] = 2^{-n H(P_{\mathbf{x}})}.$$
Moreover, for any type $\tau_{\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{X})$,
$$\Pr[\underline{\mathbf{x}} \in T_{\underline{x}}(\tau_{\underline{x}})] \doteq 2^{-n D(\tau_{\underline{x}} \| P_{\mathbf{x}})}.$$
Proof. Both claims follow from elementary calculations. For the first one, Eqn. (35) holds because $\tau_{\underline{x}} = P_{\mathbf{x}}$ and hence $N_x(\underline{x})/n = P_{\mathbf{x}}(x)$ for any $x \in \mathcal{X}$. For the second one, Eqn. (36) follows by Corollary 16. (In the argmax, $\underline{x} \in T_{\underline{x}}(\tau_{\underline{x}})$ is arbitrary as well.)
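The first claim of Lemma 34 is an exact identity, not just an asymptotic one, and can be verified directly (the distribution and vector below are illustrative):

```python
from math import log2, isclose

# For P_x a distribution realizable as an n-type and any x of type P_x,
# P_x^{(x) n}(x) = prod_i P_x(x(i)) = 2^{-n H(P_x)} exactly.
P = {0: 0.5, 1: 0.25, 2: 0.25}
n = 8
x = (0, 0, 0, 0, 1, 1, 2, 2)  # the type of x is exactly P
prob = 1.0
for a in x:
    prob *= P[a]              # i.i.d. product probability
H = -sum(p * log2(p) for p in P.values())
assert isclose(prob, 2 ** (-n * H), rel_tol=1e-9)
```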
Lemma 37 (Markov). For any non-negative random variable $\mathbf{X}$ and any positive number $x \in \mathbb{R}_{>0}$,
$$\Pr[\mathbf{X} \ge x] \le \frac{\mathbb{E}[\mathbf{X}]}{x}.$$

Lemma 38 (Chernoff). Let $\mathbf{X}_1, \cdots, \mathbf{X}_n$ be independent (not necessarily identically distributed) $\{0, 1\}$-valued random variables, and let $\mathbf{X} := \sum_{i=1}^{n} \mathbf{X}_i$. Then, for any $\epsilon > 0$,
$$\Pr[\mathbf{X} \ge (1 + \epsilon) \mathbb{E}[\mathbf{X}]] \le e^{-\frac{\epsilon^2}{2 + \epsilon} \mathbb{E}[\mathbf{X}]}.$$

Lemma 39 (Sanov). Let $\mathcal{Q} \subset \Delta(\mathcal{X})$ be a subset of distributions that equals the closure of its interior. Let $\underline{\mathbf{x}} \sim P_{\mathbf{x}}^{\otimes n}$ be a random vector whose components are i.i.d. according to $P_{\mathbf{x}}$. Sanov's theorem determines the first-order exponent of the probability that the vector empirically looks as if it came from some distribution $Q \in \mathcal{Q}$:
$$\Pr[\tau_{\underline{\mathbf{x}}} \in \mathcal{Q}] \doteq 2^{-n \min_{Q \in \mathcal{Q}} D(Q \| P_{\mathbf{x}})}.$$

Remark 40. One can view Sanov's theorem as a particular form of the Chernoff bound. Since the $\underline{\mathbf{x}}(i)$'s are independent, it gives the correct exponent of $\Pr[\tau_{\underline{\mathbf{x}}} \in \mathcal{Q}]$ up to lower order terms, rather than merely a bound.
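The Sanov exponent can be checked numerically in the Bernoulli case, where the tail probability is an exact binomial sum. A toy sketch (the source bias, threshold and blocklength are illustrative): the empirical exponent of $\Pr[\tau_{\underline{\mathbf{x}}} \in \mathcal{Q}]$ should approach $\min_{Q \in \mathcal{Q}} D(Q \| P_{\mathbf{x}})$.

```python
from math import comb, log2

def kl(p, q):
    """Binary KL divergence D(Bern(p) || Bern(q)) in bits."""
    def t(a, b):
        return 0.0 if a == 0 else a * log2(a / b)
    return t(p, q) + t(1 - p, 1 - q)

# Source Bern(0.3); Q = {Bern(q) : q >= 0.6}. The minimizing Q is Bern(0.6).
p, thresh, n = 0.3, 0.6, 400
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k)
           for k in range(int(thresh * n), n + 1))  # exact Pr[type in Q]
empirical_exponent = -log2(prob) / n
# agreement up to lower-order (polynomial) corrections
assert abs(empirical_exponent - kl(thresh, p)) < 0.05
```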
Lemma 42 ([CJ81]). Given arbitrary finite sets $\mathcal{U}$ and $\mathcal{X}$, for every $R > 0$, sufficiently large $n$ and $\tau_{\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{X})$, there are $M = 2^{nR}$ vectors $C = \{\underline{x}_i\}_{1 \le i \le M} \subset T_{\underline{x}}(\tau_{\underline{x}})$ such that for every $\underline{u} \in \mathcal{U}^n$ and every conditional type $\tau_{\underline{x}|\underline{u}} \in \mathcal{P}^{(n)}(\mathcal{X}|\underline{u})$, the intersection $|C \cap T_{\underline{x}|\underline{u}}(\tau_{\underline{x}|\underline{u}})|$ is suitably bounded.

Fact 43 (Binomial identities). For any non-negative integers $n, K \in \mathbb{Z}_{\ge 0}$ and $0 \le k \le n$, we have
$$\binom{n}{k} = \binom{n}{n - k}.$$

We list several basic (in)equalities concerning information measures that we will frequently refer to.
Fact 48 (Information (in)equalities). The following inequalities hold for any random variables/distributions over finite sets.

IX. BASIC DEFINITIONS
Definition 49 (Adversarial channels). An adversarial channel $\mathcal{A} = (\mathcal{X}, \lambda_x, \mathcal{S}, \lambda_s, \mathcal{Y}, W_{y|x,s})$ (Fig. 3) is a sextuple consisting of 1) an input alphabet $\mathcal{X}$; 2) a set of input constraints $\lambda_x \subseteq \mathcal{P}^{(n)}(\mathcal{X})$; 3) a noise alphabet $\mathcal{S}$; 4) a set of noise constraints $\lambda_s \subseteq \mathcal{P}^{(n)}(\mathcal{S})$; 5) an output alphabet $\mathcal{Y}$; 6) a channel law given by a transition probability $W_{y|x,s} \in \Delta(\mathcal{Y}|\mathcal{X} \times \mathcal{S})$.

Remark 50. In this paper, we are only concerned with finite alphabets of constant size independent of the blocklength $n$.
Specifically, ‚ Though the alphabets $\mathcal{X}$, $\mathcal{S}$ and $\mathcal{Y}$ can be arbitrary finite sets, it is without loss of generality to realize them using the first $|\mathcal{X}|$, $|\mathcal{S}|$ and $|\mathcal{Y}|$ positive integers, i.e., $\mathcal{X} = [|\mathcal{X}|]$, $\mathcal{S} = [|\mathcal{S}|]$ and $\mathcal{Y} = [|\mathcal{Y}|]$. ‚ The input and noise constraint sets $\lambda_x$ and $\lambda_s$ are subsets of the sets of types $\mathcal{P}^{(n)}(\mathcal{X})$ and $\mathcal{P}^{(n)}(\mathcal{S})$. In this paper we assume they are convex. Since there are only polynomially many types in total, we can also think of these collections of types as defined by intersections of hyperplanes or halfspaces, that is, as types satisfying a finite number of linear (in the entries of the types) (in)equality constraints. ‚ In this paper, for technical simplicity, we assume that the channel transition distribution has only singleton mass.
That is, for each $x \in \mathcal{X}$, $s \in \mathcal{S}$, $W_{y|x,s}(y|x, s) = 1$ for exactly one $y \in \mathcal{Y}$ and is zero for all other outputs. Such degenerate distributions can alternatively be thought of as deterministic functions, where $y$ is the unique output which is assigned the full probability, $W_{y|x,s}(y|x, s) = 1$. Here we slightly abuse notation and use the same letter $W$ for the channel transition distribution and the channel transition function (when the distribution is degenerate). Moreover, we use $\underline{y} = W(\underline{x}, \underline{s})$ (with the superscript $\otimes n$ dropped) to denote the output of $n$ uses of the channel, or equivalently, the $n$-letter output of the function which acts on $(\underline{x}, \underline{s})$ component by component.
This seems to be a severe restriction (and indeed turns out to be so). Nevertheless, it is still a first and significant step towards understanding general adversarial channels in full generality. The case where $W_{y|x,s}$ is an arbitrary conditional distribution, or equivalently, the function $W$ is non-deterministic, is interesting as well and is left as a future direction. ‚ For notational convenience, let
$$\Lambda_x := \{ \underline{x} \in \mathcal{X}^n : \tau_{\underline{x}} \in \lambda_x \}, \quad \Lambda_s := \{ \underline{s} \in \mathcal{S}^n : \tau_{\underline{s}} \in \lambda_s \}$$
be the sets of codewords and error patterns of admissible types.
Example 51. Our framework covers a large family of channel models, including most of the popular and well-studied ones.
1) The standard bit-flip channels.
(c) $Z$-channels (or multiplier/AND channels). … $\mathcal{Y} = \mathbb{Z}_q$, $y = W(x, s) = x + s$ over the reals. 10) Other more complicated channels, e.g., the one we defined in Sec. I.
Definition 52 (Self-couplings). A joint distribution $P_{x_1,\cdots,x_L} \in \Delta(\mathcal{X}^L)$ is said to be a $(P_x, L)$-self-coupling for some $P_x \in \Delta(\mathcal{X})$ if all of its marginals equal $P_x$, i.e., $[P_{x_1,\cdots,x_L}]_{x_i} = P_x$ for all $i \in [L]$. The set of all $(P_x, L)$-self-couplings is denoted by $J^{\otimes L}(P_x)$.
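Two canonical examples of self-couplings are the "diagonal" coupling (all $L$ coordinates equal) and the product coupling (all coordinates independent). A minimal sketch checking the marginal condition of Definition 52 (the alphabet and distributions are illustrative):

```python
from itertools import product

def is_self_coupling(P, Px, L, alphabet, tol=1e-12):
    """Check that all L marginals of P (a dict over X^L tuples) equal Px."""
    for i in range(L):
        for a in alphabet:
            m = sum(p for xs, p in P.items() if xs[i] == a)
            if abs(m - Px.get(a, 0)) > tol:
                return False
    return True

Px = {0: 0.5, 1: 0.5}
P_diag = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}                 # diagonal coupling
P_ind = {xs: 0.125 for xs in product((0, 1), repeat=3)}   # independent coupling
assert is_self_coupling(P_diag, Px, 3, (0, 1))
assert is_self_coupling(P_ind, Px, 3, (0, 1))
assert not is_self_coupling({(0, 0, 0): 1.0}, Px, 3, (0, 1))
```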
Definition 53 (Codes). In general, a code $C$ is a subset of $\mathcal{X}^n$. A code $C$ for an adversarial channel $\mathcal{A} = (\mathcal{X}, \lambda_x, \mathcal{S}, \lambda_s, \mathcal{Y}, W_{y|x,s})$ is a subset of $\Lambda_x$; here $n$ is called the blocklength. Elements of $C$ are called codewords. The rate $R(C)$ of $C$ is defined as $R(C) = (\log |C|)/n$.
Definition 54 (Constant composition codes). A code $C \subset \mathcal{X}^n$ is said to be $P_x$-constant composition for some $P_x \in \Delta(\mathcal{X})$ if the type of each codeword is $P_x$, i.e., $\tau_{\underline{x}} = P_x$ for every $\underline{x} \in C$.
Lemma 55. For any code $C \subset \mathcal{X}^n$ of rate $R$, there is a constant composition subcode $C' \subseteq C$ of asymptotically the same rate.
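The proof of Lemma 55 is a pigeonhole argument: since there are only $\mathrm{poly}(n)$ types, some type class contains a $1/\mathrm{poly}(n)$ fraction of the code, costing only $O(\log(n)/n)$ in rate. A minimal sketch of the extraction (the toy code below is illustrative):

```python
from collections import Counter, defaultdict
from itertools import product
from math import comb

def constant_composition_subcode(C):
    """Pigeonhole extraction: group codewords by type and keep the largest class."""
    buckets = defaultdict(list)
    for x in C:
        key = tuple(sorted(Counter(x).items()))  # the type of x, as a hashable key
        buckets[key].append(x)
    return max(buckets.values(), key=len)

# toy code: all binary vectors of length 6 (rate 1); the largest constant
# composition subcode is the middle Hamming layer, of size comb(6, 3) = 20
C = list(product((0, 1), repeat=6))
C1 = constant_composition_subcode(C)
assert len(C1) == comb(6, 3)
# every codeword in the subcode has the same type
assert len({tuple(sorted(Counter(x).items())) for x in C1}) == 1
```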
Definition 56 (Confusability of tuples of vectors). A list of $L$ distinct codewords $\underline{x}_1, \cdots, \underline{x}_L \in \mathcal{X}^n$ is said to be $L$-confusable if there are $\underline{y} \in \mathcal{Y}^n$ and $\underline{s}_1, \cdots, \underline{s}_L \in \Lambda_s$ such that $W(\underline{x}_i, \underline{s}_i) = \underline{y}$ for all $i \in [L]$.
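Definition 56 can be checked by brute force for small blocklengths. The sketch below is a toy instantiation, not the paper's general machinery: it assumes the bit-flip channel $W(x, s) = x \oplus s$ with the weight constraint $\mathrm{wt}_H(\underline{s}) \le pn$, so a list is confusable iff some center $\underline{y}$ lies within Hamming radius $pn$ of every listed codeword.

```python
from itertools import product

def confusable(words, n, p):
    """Brute-force L-confusability for the binary bit-flip channel
    W(x, s) = x XOR s with noise constraint wt_H(s) <= p*n (a toy model)."""
    radius = int(p * n)
    for y in product((0, 1), repeat=n):
        # s_i = x_i XOR y is the unique error pattern taking x_i to y
        if all(sum(a != b for a, b in zip(x, y)) <= radius for x in words):
            return True
    return False

n, p = 4, 0.25  # radius 1
assert confusable([(0, 0, 0, 0), (1, 0, 0, 0)], n, p)                 # y = 0000 works
assert confusable([(0, 0, 0, 0), (1, 1, 0, 0), (1, 0, 1, 0)], n, p)   # y = 1000 works
assert not confusable([(0, 0, 0, 0), (1, 1, 1, 1)], n, p)             # too far apart
```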
Definition 57 (Confusability of joint distributions). A $(P_x, L)$-self-coupling $P_{x_1,\cdots,x_L} \in J^{\otimes L}(P_x)$ is said to be $L$-confusable if it has an extension $P_{x_1,\cdots,x_L,s_1,\cdots,s_L,y} \in \Delta(\mathcal{X}^L \times \mathcal{S}^L \times \mathcal{Y})$ such that 1) $[P_{x_1,\cdots,x_L,s_1,\cdots,s_L,y}]_{x_1,\cdots,x_L} = P_{x_1,\cdots,x_L}$; 2) $P_{s_i} \in \lambda_s$ for all $i \in [L]$; 3) $P_{x_i,s_i,y} = P_{x_i} P_{s_i|x_i} W_{y|x_i,s_i}$ for all $i \in [L]$.
Definition 58 (Confusability set). The $(P_x, L)$-confusability set $K^{\otimes L}(P_x)$ of a channel $\mathcal{A} = (\mathcal{X}, \lambda_x, \mathcal{S}, \lambda_s, \mathcal{Y}, W_{y|x,s})$ is defined as
$$K^{\otimes L}(P_x) := \{ P_{x_1,\cdots,x_L} \in J^{\otimes L}(P_x) : P_{x_1,\cdots,x_L} \text{ is } L\text{-confusable} \}.$$
Remark 59. In the above definitions, we overload the notion of confusability for types and distributions.
$$K^{\otimes L}(P_x) = \bigcup_{n=1}^{\infty} \left\{ \tau_{\underline{x}_1,\cdots,\underline{x}_L} : (\underline{x}_1, \cdots, \underline{x}_L) \text{ is } L\text{-confusable}; \ \underline{x}_i \in T_{\underline{x}}(P_x), \ \forall i \in [L] \right\}.$$
Definition 60 (List decodable codes). A code $C \subset \mathcal{X}^n$ is said to be $(L-1)$-list decodable if no size-$L$ list is confusable, i.e., for any $\mathcal{L} \in \binom{C}{L}$, $\mathcal{L}$ is not $L$-confusable.
Definition 61 (Achievable rate and list decoding capacity). A rate $R$ is said to be achievable under $(L-1)$-list decoding if there is an infinite sequence of $(L-1)$-list decodable codes $\{C_i\}_{i \ge 1}$ of blocklengths $n_i \in \mathbb{Z}_{>0}$ (with $n_i \to \infty$) and rate $R(C_i) \ge R$.
The $(L-1)$-list decoding capacity $C_{L-1}$ is defined as the maximal achievable rate.
X. LIST DECODING CAPACITY

Theorem 62 (List decoding capacity). For any adversarial channel $\mathcal{A} = (\mathcal{X}, \lambda_x, \mathcal{S}, \lambda_s, \mathcal{Y}, W)$, let
$$C := \max_{P_x \in \lambda_x} \min_{P_{s|x} \colon P_s \in \lambda_s} I(\mathbf{x}; \mathbf{y}), \quad (63)$$
which can be viewed as a generalized sphere-packing bound. The mutual information is evaluated w.r.t. $P_x P_{s|x} W_{y|x,s}$.
1) (Achievability) For any $\delta > 0$ and sufficiently large $n$, there exists $C$ of rate $C - \delta$ such that it can be $O(1/\delta)$-list decoded. 2) (Converse) Any $C$ of rate $C + \delta$ is at best $2^{\Omega(n\delta)}$-list decodable.
Proof. We follow the idea used in the proof of the list decoding theorem (Theorem 5) under the standard bit-flip model [Sar08], but conduct the calculations in our generalized setting.
1) (Achievability) Let $R = C - \delta$. Fix $P^*_x \in \lambda_x$ to be a maximizer of expression (63). Generate a random code by sampling $2^{nR}$ codewords independently and uniformly from $T_{\underline{x}}(P^*_x)$. We will actually show the following.

Lemma 64. For any $\delta > 0$ and sufficiently large $n$, a random $P^*_x$-constant composition code of rate $R = C - \delta$ as defined above is $(1 + \log|\mathcal{Y}| \cdot \delta^{-1})$-list decodable with probability at least $1 - 2^{-n(1-R)}$.

For every $\underline{y} \in \mathcal{Y}^n$, define the conditional typical set
$$A_{\underline{x}|\underline{y}} := \{ \underline{x} \in T_{\underline{x}}(P^*_x) : \exists \underline{s} \in \Lambda_s, \ \underline{y} = W(\underline{x}, \underline{s}) \}$$
to be the set of all $\underline{x}$ of type $P^*_x$ that can reach $\underline{y}$ via an allowable $\underline{s} \in \Lambda_s$. Note that $A_{\underline{x}|\underline{y}}$ is precisely the list of codewords around $\underline{y}$ whose size we would like to bound. To facilitate the ensuing calculations, we write $A_{\underline{x}|\underline{y}}$ in terms of types and estimate its size. We say that a type $\tau_{\underline{x},\underline{s},\underline{y}} \in \mathcal{P}^{(n)}(\mathcal{X} \times \mathcal{S} \times \mathcal{Y})$ is valid if its marginals and conditionals are consistent with $P^*_x$, $\lambda_s$ and $W_{y|x,s}$. In Eqn. (65) and (66), the conditional entropy is evaluated w.r.t. $[\tau_{\underline{x},\underline{s},\underline{y}}]_{\underline{x},\underline{y}}$ and $[P^*_x \tau_{s|x} W_{y|x,s}]_{x,y}$, respectively. In Eqn. (67), the conditional entropy is evaluated w.r.t. $[P^*_x P_{s|x} W_{y|x,s}]_{x,y}$. This equality holds in the limit as $n$ approaches infinity since types are asymptotically dense in distributions. Note that $A_{\underline{x}|\underline{y}} \subset T_{\underline{x}}(P^*_x)$. The probability $q$ that a random codeword $\underline{\mathbf{x}}$ can result in $\underline{y}$ via some admissible $\underline{s} \in \Lambda_s$ satisfies
$$\frac{1}{n} \log q = \frac{1}{n} \log \Pr[\underline{\mathbf{x}} \in A_{\underline{x}|\underline{y}}],$$
and Eqn. (70) follows by the choice of $P^*_x$. The probability that a large list is clustered around $\underline{y}$ is bounded by a sum whose summands we denote by $S_i$; Eqn. (71) follows since $i \ge L \ge 1$ and Eqn. (72) follows since $1 - 2^{-nC} \ge \frac{1}{2}$ when $n \ge \frac{1}{C}$. The largest summand is the first term. Therefore we can bound the error probability by replacing each term with the first one.
Finally, taking a union bound over all $\underline{y} \in \mathcal{Y}^n$, we know that the probability of list decoding error is $2^{-\Omega(n)}$ if $L > 1 + \log|\mathcal{Y}| \cdot \delta^{-1}$. Specifically, taking $L = 1 + \log|\mathcal{Y}|/\delta$, we have that the list decoding error probability is at most $2^{-n(1 + \delta - C)} = 2^{-n(1 - R)}$, as desired.
2) (Converse) Given any code $C$ of rate $C + \delta$, choose the type $\tau^*_{\underline{x}} \in \mathcal{P}^{(n)}(\mathcal{X})$ such that $|C \cap T_{\underline{x}}(\tau^*_{\underline{x}})|$ is maximized, and let $C' := C \cap T_{\underline{x}}(\tau^*_{\underline{x}})$.
By Lemma 55, $R(C') \asymp R(C)$. For this $\tau^*_{\underline{x}}$, choose a legitimate $\tau^*_{\underline{s}|\underline{x}} \in \lambda_{s|x}$ minimizing $I(\mathbf{x}; \mathbf{y})$, where $I(\mathbf{x}; \mathbf{y})$ is evaluated according to $[\tau^*_{\underline{x}} \tau^*_{\underline{s}|\underline{x}} W_{y|x,s}]_{x,y}$. Now define $\tau^*_{\underline{x},\underline{s},\underline{y}} := \tau^*_{\underline{x}} \tau^*_{\underline{s}|\underline{x}} W_{y|x,s}$, $\tau^*_{\underline{x},\underline{y}} := [\tau^*_{\underline{x},\underline{s},\underline{y}}]_{\underline{x},\underline{y}}$ and $\tau^*_{\underline{y}} := [\tau^*_{\underline{x},\underline{y}}]_{\underline{y}}$. Over the randomness of selecting $\underline{y}$ uniformly from $T_{\underline{y}}(\tau^*_{\underline{y}})$, the average number of codewords in $A_{\underline{x}|\underline{y}}$ is dot-equal to
$$\frac{1}{|T_{\underline{y}}(\tau^*_{\underline{y}})|} \sum_{\underline{x} \in C'} \prod_{x \in \mathcal{X}} \binom{\tau^*_{\underline{x}}(x) n}{\tau^*_{\underline{y}}(1) n \cdot \tau^*_{\underline{x}|\underline{y}}(x|1), \cdots, \tau^*_{\underline{y}}(|\mathcal{Y}|) n \cdot \tau^*_{\underline{x}|\underline{y}}(x||\mathcal{Y}|)}.$$
Eqn. (76) follows by analyzing the sampling procedure from first principles: the product is exactly, for a given $\underline{x} \in C'$, the number of ways to pick $\underline{y}$ such that $\tau_{\underline{x}|\underline{y}} = \tau^*_{\underline{x}|\underline{y}}$. We compute the exponent of the above expectation and recall (63); $\tau^*_{\underline{x}}$ always gives rise to a mutual information no larger than the maximizer in $C$. Therefore, we have shown that there exists at least one $\underline{y} \in \mathcal{Y}^n$ such that the corresponding list around $\underline{y}$ has size at least $2^{n(\delta - o(1))}$.

XI. LIST SIZES OF RANDOM CODES
In this section, we show that, if L has order lower than 1/δ, then the code used in the proof of achievability (part 1) of the list decoding capacity theorem (Theorem 62) is list decodable with vanishingly small probability. Coupled with Theorem 62, this implies that, for the majority (a fraction exponentially close to 1) of random constant composition capacity-achieving (within gap δ) codes, Θ(1/δ) is in fact the correct order of the list size.

Corollary 81. For δ > 0 and sufficiently large n, at least a 1 − 2^{−n(1−R)} − 2^{−nδ + (2/δ) log(1/δ)} fraction of P_x-constant composition codes (P_x as defined in Eqn. (83)) of rate R = C − δ are (L−1)-list decodable, where L = Θ(1/δ) lies within the following range …

Theorem 82. For an adversarial channel A = (X, λ_x, S, λ_s, Y, W_{y|x,s}), take an optimizing input distribution P_x which attains the list decoding capacity C. For any δ > 0 and each sufficiently large blocklength n, sample a random code C of rate R = C − δ whose codewords are selected independently and uniformly from T_x(P_x). Then C is (C/δ − 1)-list decodable with probability at most 2^{−nδ + (2/δ) log(1/δ)}.
The theorem follows from second moment calculations and generalizes similar theorems for list decodability of random error/erasure correction codes over F q [GN13].
Proof. Let M := 2^{nR}. Define the typical set …. Put in the language of types, it can also be written as ….

Lower bounding E[W]. A straightforward calculation lower bounds the expected value of W:

E[W] = Σ_{y∈A_y} Σ_{{m_1,···,m_L}∈([M] choose L)} Pr[{x_{m_1}, ···, x_{m_L}} ⊆ A_{x|y}] = Σ_{y∈A_y} Σ_{{m_1,···,m_L}∈([M] choose L)} Pr[x ∈ A_{x|y}]^L,   (86)

where the event being counted is that the list L is L-confusable w.r.t. y. Next, the variance of W can be upper bounded as follows.

Eqn. (88) follows from the definition of variance, and Eqn. (89) follows from linearity of expectation. Note that I(y¹, L₁) and I(y², L₂) are independent if and only if L₁ ∩ L₂ = ∅. When they are independent, the first expectation factors and the summand vanishes. Inequality (90) follows by dropping the negative term in the summand. In Eqn. (91), we rewrite the summation by randomizing the centers y¹, y² of the lists L₁, L₂; the probability is taken over y¹ and y² chosen uniformly at random from A_y and over the random code sampling procedure. We use E to denote the event that the lists L₁ and L₂ are simultaneously L-confusable w.r.t. y¹ and y², respectively: E := …. It then suffices to bound Pr[E]. To this end, first define the conditional typical set, for x ∈ X^n, …, where τ_{y|x} is computed from τ_{x,s,y} and τ_x via τ_{y|x} = [τ_{x,s,y}]_{x,y}/τ_x. Then define the following events to facilitate bounding Pr[E].

We upper bound Pr[E] by neglecting the fact that the codewords x_i for i ∈ (L₁ ∩ L₂) \ {m} are simultaneously y¹-confusable and y²-confusable; equivalently, we neglect that y¹ and y² should simultaneously belong to A_{y|x_{m′}} for every m′ ∈ L₁ ∩ L₂, not only for the particular m we have chosen,

where m ∈ L₁ ∩ L₂ is any message that appears in both L₁ and L₂. It is easy to verify that E ⊆ E₁ ∩ E₂ ∩ E₃ (see Fig. 5). Note that E₂ and E₃ are independent conditioned on E₁, since L₁ \ {m} and L₂ \ L₁ are disjoint. The probabilities of the above events can be computed precisely.

Pr[E₁] = Pr …, where Eqn. (92) holds because y¹ and y² are independent, and Eqn. (93) follows since y is chosen uniformly from A_y. We now compute the exponent of Pr[E].

Note that the number of pairs of lists L₁ and L₂ with intersection size i is …. Therefore, the variance of W can be bounded as follows:

Var[W] ≤ |A_y|² …, where Eqn. …. Consequently, Pr[C is (L−1)-list decodable] ≤ Var[W]/E[W]² ≤ 2^{−nC + nδL + (2L+1) log L}.
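The last display is an instance of the second-moment method; spelling out the omitted step (with W counting the confusable size-L lists, so that C is (L−1)-list decodable iff W = 0, a reading consistent with the surrounding derivation):

```latex
\Pr\left[\mathcal{C}\text{ is }(L-1)\text{-list decodable}\right]
 = \Pr\left[W = 0\right]
 \le \Pr\left[\,\left|W - \mathbb{E}[W]\right| \ge \mathbb{E}[W]\,\right]
 \le \frac{\operatorname{Var}[W]}{\mathbb{E}[W]^{2}},
```

the last step being Chebyshev's inequality.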

XII. ACHIEVABILITY
In this section, we are going to show, via concrete random code constructions, that as long as some completely positive pP x , Lq-self-coupling of order L lies outside the order-L confusability set of the channel, the pL´1q-list decoding capacity is positive.
Let CP^{⊗L}_{|X|}(P_x) := CP^{⊗L}_{|X|} ∩ J^{⊗L}(P_x).

Theorem 106 (Achievability). For any given general adversarial channel A = (X, λ_x, S, λ_s, Y, W_{y|x,s}), its (L−1)-list decoding capacity is positive if there is a completely positive (P_x, L)-self-coupling P_{x_1,···,x_L} ∈ CP^{⊗L}_{|X|}(P_x) outside K^{⊗L}(P_x) for some P_x ∈ λ_x.
We first state a lemma concerning the rate of a random constant composition code.
Lemma 107 (Constant composition codes). Let C = {x_i}_{i=1}^{2^{nR}} be a random code of rate R in which each codeword is selected independently according to the product distribution P_x^{⊗n}. Let C′ be the P_x-constant composition subcode of C, C′ = C ∩ T_x(P_x). Then …

Proof. The lemma is a simple consequence of concentration of measure (Lemma 38): Pr …, where in Eqn. (197) we note that

A. Low rate codes
Let us proceed gently. We first show that a purely random code, with each entry i.i.d. according to some distribution P_x, is (L−1)-list decodable w.h.p. as long as P_x^{⊗L} is not L-confusable.

Lemma 109. For any general adversarial channel A = (X, λ_x, S, λ_s, Y, W_{y|x,s}), if there exists a legitimate input distribution P_x ∈ λ_x such that P_x^{⊗L} ∉ K^{⊗L}(P_x), then the (L−1)-list decoding capacity of A is positive.

Proof. Let M = 2^{nR} for some rate R to be specified momentarily. Sample a code C = {x_1, ···, x_M} where each x_i is drawn i.i.d. ∼ P_x^{⊗n}. The expected joint type τ_{x_{i_1},···,x_{i_L}} (1 ≤ i_1 < ··· < i_L ≤ M) of any list x_{i_1}, ···, x_{i_L} is P_x^{⊗L}. (See Fig. 6.) Let C′ = C ∩ T_x(P_x) be the P_x-constant composition subcode of C. Let … for some small constant δ > 0. We will show that

Lemma 110. The random P_x-constant composition code C′ as constructed above has rate R = (log e/12)(ρ²/L) − δ and is (L−1)-list decodable with probability at least 1 − 2 exp(−2^{nR}/ν(n)) − 2^{−nδ + L log|X| + 1}.

By Lemma 107,
Pr[E₁] ≤ 2 exp(−2^{nR}/ν(n)).

Fig. 6: Low rate codes from a product distribution. If the product distribution P_x^{⊗L} is strictly separated from K^{⊗L}(P_x), then we can hope for a positive rate achieved by a random code with each entry sampled from P_x. This is because, w.h.p., the joint types of all (ordered) lists are contained in a ‖·‖_mav-ball lying completely outside the confusability set.
Hence the rate R 1 of C 1 is asymptotically equal to R w.h.p.
By the Chernoff bound, …, where Eqn. (116) is by the choice of … and the fact that P_x^{⊗L}(x_1, ···, x_L) ≤ 1 for any (x_1, ···, x_L) ∈ X^L. Taking a union bound over all lists (i_1, ···, i_L) ∈ ([M] choose L), we conclude that C is (L−1)-list decodable with probability at least 1 − 2^{−nδ + L log|X| + 1}, as long as R = (log e/12)(ρ²/L) − δ.

B. Random codes with expurgation
In the previous subsection, we obtained an (L−1)-list decodable code of positive rate without making an effort to optimize the rate. In this subsection, we provide a lower bound on the (L−1)-list decoding capacity. It is achieved by a different code construction (a random code with expurgation). However, we can only show the existence of such codes, rather than showing that they attain the following bound w.h.p.

Lemma 117. The (L−1)-list decoding capacity of a channel A is at least … (118)

Proof. Fix P_x ∈ λ_x to be a maximizer of Eqn. (118). Let M = 2^{nR} for some rate R to be determined. Generate a random code C of size 2M by sampling each entry of the codebook independently from P_x.

For any x ∈ C, by Lemma 34, …. Hence the expected number of codewords of type P_x is 2M/ν(n). For any (x_1, ···, x_L) ∈ (C choose L), by Sanov's theorem (Lemma 39), …. Let P* ∈ K^{⊗L}(P_x) be the extremizer of the above supremum. Hence the expected number of confusable lists is at most (2M choose L)·2^{−nD(P*‖P_x^{⊗L})}, i.e., we want L + nRL − nD(P*‖P_x^{⊗L}) ≤ nR − log ν(n).

That is, R can be taken arbitrarily close to D(P*‖P_x^{⊗L})/(L−1):

R ≤ D(P*‖P_x^{⊗L})/(L−1) − log ν(n)/((L−1)n) − L/((L−1)n).

Fig. 7: Low rate codes from a CP distribution. If there is a CP distribution strictly outside K^{⊗L}(P_x), then we can obtain a positive rate from a random code using time-sharing. The only variation is that we divide the codebook into chunks according to P_u and construct random codes of shorter length for each chunk u using the distribution P_{x|u=u}.

Now, we remove all codewords whose types differ from P_x. We also remove one codeword from each confusable list. In expectation, this process reduces the size of the code by at most 2M − 2M/ν(n) (due to the first expurgation) plus (2M)^L·2^{−nD(P*‖P_x^{⊗L})} ≤ M/ν(n) (due to the second expurgation). After expurgation, we get an (L−1)-list decodable P_x-constant composition code C′ of size at least …. The rate R′ of C′ is asymptotically the same as R.
This finishes the proof.
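The expurgation step can be sketched as follows. Here `is_confusable` is a placeholder for the (channel-dependent) test of whether the joint type of a size-L list lies in the confusability set; the toy predicate used in the demo is purely illustrative.

```python
from itertools import combinations

def expurgate(code, is_confusable, L):
    """Remove one codeword from every confusable size-L list; the
    surviving code then contains no confusable list at all."""
    removed = set()
    for lst in combinations(range(len(code)), L):
        if any(i in removed for i in lst):
            continue  # this list is already broken by an earlier removal
        if is_confusable([code[i] for i in lst]):
            removed.add(lst[-1])  # expurgate one member of the bad list
    return [c for i, c in enumerate(code) if i not in removed]

# Toy run: treat a pair of identical words as "confusable" (L = 2).
pruned = expurgate(["aa", "ab", "aa", "bb"], lambda lst: lst[0] == lst[1], 2)
```

As in the proof, if the expected number of confusable lists is small compared to the code size, removing one codeword per list only negligibly reduces the rate.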

C. Cloud codes
Lemma 119. Suppose there is a (P_x, L)-self-coupling (P_x ∈ λ_x) P_{x_1,···,x_L} ∈ J^{⊗L}(P_x) \ K^{⊗L}(P_x) which can be decomposed as … for some distributions P_u ∈ ∆(U) of finite support |U| and P_{x|u} ∈ ∆(X|U). Then the (L−1)-list decoding capacity of A is positive. See Fig. 7.

Proof. The proof follows from a time-sharing argument combined with the previous low rate code construction (Lemma 109). Fix R, to be determined later. Sample 2^{nR} codewords of C independently from the following distribution. Divide each length-n codeword into |U| chunks 1, ···, |U|. For the u-th (u ∈ U) chunk, sample its P_u(u)n components independently using the distribution P_{x|u=u}. Let P_{u,x} = P_u P_{x|u} and P_x = [P_{u,x}]_x. Let C′ be the set of all codewords in C of type P_x. (See Fig. 8.) Define …

Fig. 8: An example of the cloud code construction in which U = {1, 2, 3}. The codebook is divided into 3 chunks, and the symbols in the i-th chunk are sampled independently from P_{x|u=i} (i = 1, 2, 3).

Note that P_u(u*) > 0 since U is the support of P_u. Let R = P_u(u*)·(log e/12)(ρ²/L) − δ. We will show that

Lemma 120. A random P_x-constant composition cloud code as constructed above has rate R = P_u(u*)·(log e/12)(ρ²/L) − δ and is (L−1)-list decodable with probability at least 1 − 2 exp(−2^{nR}/(12 ∏_{u∈U} ν(P_u(u)n))) − 2^{−nδ + L log|X| + log|U| + 1}.
We write a length-n codeword as the concatenation of |U| chunks, x = (x^{(1)}, ···, x^{(|U|)}).

First, we argue that w.h.p. the code C is almost P_x-constant composition. The expected size of C′ is …, where Eqn. …, and Pr[|C′| ∉ (1 ± 1/2)·E[|C′|]] ≤ 2 exp(−2^{nR}/(12 ∏_{u∈U} ν(P_u(u)n))). Secondly, for any list 1 ≤ i_1 < ··· < i_L ≤ M of distinct ordered messages, … nP_u(u) … (124)

(a) "Below the Plotkin point", a positive (L−1)-list decoding rate is possible. In this case, for some input distribution P_x ∈ λ_x, the slice of P_x-self-coupling CP tensors is not entirely contained in the confusability set K^{⊗L}(P_x).

(b) "Above the Plotkin point", no positive rate for (L−1)-list decoding is achievable. In this case, for every input distribution P_x ∈ λ_x, the slice of P_x-self-coupling CP tensors is entirely contained in the confusability set K^{⊗L}(P_x).

Therefore, the random P_x-constant composition cloud code C′ constructed above has rate R = P_u(u*)·(log e/12)(ρ²/L) − δ and is (L−1)-list decodable with probability at least 1 − 2 exp(−2^{nR}/(12 ∏_{u∈U} ν(P_u(u)n))) − 2^{−nδ + L log|X| + log|U| + 1}, which completes the proof.
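The time-sharing sampler used in the cloud code construction can be sketched as follows; the distributions below are hypothetical toy values, not ones derived from a particular channel.

```python
import random

def sample_cloud_codeword(n, P_u, P_x_given_u, rng):
    """One length-n cloud codeword: for each u in U, the u-th chunk has
    about P_u[u]*n positions, each drawn i.i.d. from P_{x|u=u}."""
    word = []
    for u in sorted(P_u):
        symbols, weights = zip(*sorted(P_x_given_u[u].items()))
        word += rng.choices(symbols, weights=weights, k=round(P_u[u] * n))
    return word

# Toy example: U = {1, 2}, time-sharing 1/4 : 3/4 between two input laws.
rng = random.Random(0)
P_u = {1: 0.25, 2: 0.75}
P_x_given_u = {1: {0: 0.9, 1: 0.1}, 2: {0: 0.5, 1: 0.5}}
x = sample_cloud_codeword(8, P_u, P_x_given_u, rng)
```

Each chunk is an i.i.d. block in its own right, which is what lets the single-chunk analysis of Lemma 109 be applied to the chunk indexed by u*.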
The above lemma readily implies Theorem 106.

XIII. CONVERSE
Let CP^{⊗L}_{|X|}(P_x) := CP^{⊗L}_{|X|} ∩ J^{⊗L}(P_x) and Sym^{⊗L}_{|X|}(P_x) := Sym^{⊗L}_{|X|} ∩ J^{⊗L}(P_x). We have shown in the previous section that if CP^{⊗L}_{|X|}(P_x) ∩ K^{⊗L}(P_x)^c ≠ ∅, then the (L−1)-list decoding capacity is positive. In this section we prove the converse: this condition is also necessary for a positive rate to be possible. Indeed, we will show that

Theorem 126 (Converse). Given a general adversarial channel A = (X, λ_x, S, λ_s, Y, W_{y|x,s}), if for every admissible input distribution P_x ∈ λ_x, CP^{⊗L}_{|X|}(P_x) ⊆ K^{⊗L}(P_x), then the (L−1)-list decoding capacity of A is zero.

The blue dots (in Fig. 10) correspond to joint types of the subcode C′. (Note that they are all non-confusable.) They are clustered within a small ball (w.r.t. the sum-absolute-value norm) centered at some distribution P̂_{x_1,···,x_L}. Since the hypergraph Ramsey number is finite, such a C′ can be taken suitably large.

A. Equicoupled subcode extraction
Definition 127 (Equicoupledness and ζ-equicoupledness). A code C is said to be P_{x_1,···,x_L}-equicoupled if for all ordered lists (x_{i_1}, ···, x_{i_L}) ∈ (C choose L), where 1 ≤ i_1 < ··· < i_L ≤ |C|, τ_{x_{i_1},···,x_{i_L}} = P_{x_1,···,x_L}. A code C is said to be (ζ, P_{x_1,···,x_L})-equicoupled if for all ordered lists (x_{i_1}, ···, x_{i_L}) ∈ (C choose L), where 1 ≤ i_1 < ··· < i_L ≤ |C|, …

Remark 128. The above definition can also be overloaded for sequences of random variables or their joint distributions. We say that a sequence of random variables w_1, ···, w_M, or the joint distribution P_{w_1,···,w_M}, is P_{x_1,···,x_L}-equicoupled (or (ζ, P_{x_1,···,x_L})-equicoupled) if every order-L marginal P_{w_{i_1},···,w_{i_L}} (1 ≤ i_1 < ··· < i_L ≤ M) equals (or is ζ-close to, in ‖·‖_sav) P_{x_1,···,x_L}.

Using the hypergraph Ramsey theorem, we first show that any infinite sequence of codes of positive rate admits suitably large ζ-equicoupled subcodes.

Lemma 129 (Equicoupled subcode extraction). For any infinite sequence of codes {C_i}_{i≥1} of blocklengths n_i and positive rate, where {n_i}_{i≥1} is an infinite increasing integer sequence, for any ζ > 0 and any M ∈ Z_{>0}, there is an N ∈ Z_{>0} such that if |C_i| ≥ N then C_i contains a subcode C′_i satisfying:
‚ |C′_i| ≥ M;
‚ C′_i is (ζ, P_{x_1,···,x_L})-equicoupled for some P_{x_1,···,x_L}.
See Fig. 10.

Again, this lemma is a consequence of the hypergraph Ramsey theorem. Let R^{(m)}_c(n_1, ···, n_c) be the smallest integer n such that the complete m-uniform hypergraph on n vertices, under any c-colouring of its hyperedges, contains a clique of colour j and size n_j for at least one j ∈ [c]. It is known (Lemma 225) that R^{(m)}_c(n_1, ···, n_c) is finite.

Proof of Lemma 129. Recall that we assume CP^{⊗L}_{|X|}(P_x) ∩ K^{⊗L}(P_x)^c = ∅. Let ρ be the gap ρ := inf_{P ∈ CP^{⊗L}_{|X|}(P_x), P′ ∈ J^{⊗L}(P_x)\K^{⊗L}(P_x)} …

Definition 130 (ε-net). For a metric space (X, d), an ε-net N ⊆ X is a subset which is a discrete ε-approximation of X, in the sense that for any x ∈ X there is an x′ ∈ N such that d(x, x′) ≤ ε.

We claim that
Lemma 131 (Bound on the size of an ε-net). There is an ε-net N of J^{⊗L}(P_x)\K^{⊗L}(P_x), equipped with the ℓ1 metric, of size …

Proof. The following construction is by no means optimal, but its size has a finite upper bound, which suffices for our purposes. Indeed, it is enough to take N to be the coordinate-quantization net of J^{⊗L}(P_x)\K^{⊗L}(P_x). Note that for any P ∈ J^{⊗L}(P_x), each entry of P lies in [0, 1]. Take δ := 2ε/|X|^L. Divide [0, 1] into subintervals of length δ (except possibly the last subinterval, which may be shorter than δ). For each entry of P, there are at most 1/δ + 1 subintervals. Quantize each component of P to the midpoint of the subinterval containing it. The set of all representatives whose components take values among the midpoints of the subintervals forms a net N. In total, there are at most (1/δ + 1)^{|X|^L} such representatives. For any P ∈ J^{⊗L}(P_x)\K^{⊗L}(P_x), let Q_N(P) denote the quantization of P using N, i.e., Q_N(P) := argmin …. The quantization error is at most …. We have thus shown that the N constructed above is an ε-net of small cardinality.
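The coordinate-quantization net can be sketched directly. A tensor in J^{⊗L}(P_x) is flattened to a vector in [0,1]^d with d = |X|^L, and (assuming the reconstruction δ = 2ε/d above) the ℓ1 quantization error is at most d·δ/2 = ε.

```python
import math

def quantize(P, eps):
    """Snap each entry of P (a flattened probability tensor in [0,1]^d)
    to the midpoint of its length-delta subinterval of [0,1], where
    delta = 2*eps/d; the total l1 error is then at most d*delta/2 = eps."""
    d = len(P)
    delta = 2 * eps / d
    last = math.ceil(1 / delta) - 1   # index of the last subinterval
    return [(min(int(p / delta), last) + 0.5) * delta for p in P]

P = [0.1, 0.2, 0.3, 0.4]              # a d = 4 distribution, flattened
Q = quantize(P, eps=0.1)
err = sum(abs(p - q) for p, q in zip(P, Q))
```

The net itself is the (finite) set of all midpoint vectors, of cardinality at most (1/δ + 1)^d, matching the count in the lemma.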
We know that the CP cone and the coP cone are dual (Theorem 220) in the space of symmetric tensors. Thus, for any non-CP symmetric tensor P̂ ∈ Sym^{⊗L}_{|X|}(P_x)\CP^{⊗L}_{|X|}(P_x), there must be a witness Q with strictly negative inner product with P̂. The infimum λ := inf … is the absolute value of the smallest such inner product over all symmetric non-CP tensors. We know that λ > 0, since CP^{⊗L}_{|X|}(P_x) is strictly contained in K^{⊗L}(P_x). Let …. Take a ζ-net of (∆(X^L), ℓ1) as constructed in Lemma 131. Such a net has cardinality at most K := (|X|^L/ρ + 1)^{|X|^L}. Build an L-uniform complete hypergraph H = (C, E) on C: the vertices of H are the codewords in C, and for every tuple (x_{i_1}, ···, x_{i_L}) ∈ (C choose L) of distinct codewords (where the indices 1 ≤ i_1 < ··· < i_L ≤ |C| are sorted in ascending order), there is a hyperedge connecting them. In total there are (|C| choose L) hyperedges in E. We now label the hyperedges using distributions in N: each hyperedge (x_{i_1}, ···, x_{i_L}) ∈ E is labelled by the unique element Q_N(τ_{x_{i_1},···,x_{i_L}}) of N. This can be viewed as an edge colouring of H using at most K colours.

By the hypergraph Ramsey theorem (Theorem 225), there is a constant N such that if the size |C| of the hypergraph is at least N, then there is a monochromatic clique C′ ⊆ C (i.e., one in which every hyperedge of the sub-hypergraph has the same colour) of size at least M. Indeed, we can take N to be the hypergraph Ramsey number N = R^{(L)}_K(M, ···, M). By Theorem 226, there is a constant c′ > 0 such that N < t_L(c′·K log K), where t_L(·) is the tower function of height L. Put another way, there exists a subcode C′ ⊆ C of size at least M such that, for some distribution P̂_{x_1,···,x_L} ∈ N, the joint type of every ordered tuple of L distinct codewords in C′ is ζ-close to P̂_{x_1,···,x_L}. That is, for every L = (x_1, ···, x_L) ∈ (C′ choose L), …. This completes the proof of Lemma 129.
Before proceeding with the proof of converse, we first list several corollaries that directly follow from the above lemma. They are concerned with basic properties of pζ, P x1,¨¨¨,xL q-equicoupled codes.
Corollary 134. Any two lists of L (ordered) codewords from C′ have joint types 2ζ-close to each other in sum-absolute-value distance.

Proof. For any L₁ = (x_{i_1}, ···, x_{i_L}) and L₂ = (x_{j_1}, ···, x_{j_L}) in (C′ choose L), ….

Corollary 136. Any two size-ℓ (1 ≤ ℓ ≤ L) lists in C′ have joint types 2ζ-close to each other in sum-absolute-value distance, provided |C′| > 2L.

For a subset B ⊆ [L], we let P_{x_B} denote the marginalization of P_{x_1,···,x_L} onto the random variables indexed by elements of B, [P_{x_1,···,x_L}]_{{x_i : i ∈ B}}.

Corollary 137. For any 1 ≤ ℓ < L and any subsets L′₁, L′₂ ∈ ([L] choose ℓ), P_{x_{L′₁}} and P_{x_{L′₂}} are 3ζ-close to each other in sum-absolute-value distance, provided |C′| > 2L.

Note that …. Similarly, …. By the triangle inequality, ….

Corollary 138. A (ζ, P_{x_1,···,x_L})-equicoupled code C′ is (3ζ, P_{x_1,···,x_ℓ})-equicoupled for any 1 ≤ ℓ ≤ L, as long as …

Proof. For any list of codewords x_{i_1}, ···, x_{i_ℓ}, we can always find a completion of (i_1, ···, i_ℓ) to an L-tuple. Let T denote the set of locations of i_1, ···, i_ℓ in the completion. We know that …. By the previous corollary, ….

We now apply the double counting trick used in the Plotkin-type bound for list decoding: we want to show that if P̂_{x_1,···,x_L} is not completely positive, then any (L−1)-list decodable code cannot be large.
Definition 139 (Symmetry of tensors). A tensor T P Ten bm n is said to be symmetric if its components are invariant under permutation of indices, i.e., for any σ P S m and any pt 1 ,¨¨¨, t m q P rns m , T pt 1 ,¨¨¨, t m q " T`t σp1q ,¨¨¨, t σpmq˘.
The set of dimension-n order-m symmetric tensors is denoted by Sym bm n .

B. Symmetric case
In this subsection, assume that P̂_{x_1,···,x_L} is symmetric as a dimension-|X| order-L tensor. We are going to show:

Lemma 140 (Converse, symmetric case). For a general adversarial channel A = (X, λ_x, S, λ_s, Y, W_{y|x,s}) and an admissible input distribution P_x ∈ λ_x, if CP^{⊗L}_{|X|}(P_x) ⊆ K^{⊗L}(P_x), then any (ζ, P̂_{x_1,···,x_L})-equicoupled (L−1)-list decodable code C′ has size at most …, where P̂_{x_1,···,x_L} ∈ Sym^{⊗L}_{|X|}(P_x)\K^{⊗L}(P_x) is a symmetric, non-confusable joint distribution.

Proof. Since P̂_{x_1,···,x_L} ∈ Sym^{⊗L}_{|X|}(P_x)\CP^{⊗L}_{|X|}, by the duality (Theorem 220) between the CP tensor cone and the coP tensor cone, there is a copositive tensor Q ∈ coP^{⊗L}_{|X|} such that ‖Q‖_F = 1 (by normalization) and ⟨P̂_{x_1,···,x_L}, Q⟩ = −η (141) for some η > 0. Note that, by the definition of λ, η ≥ λ. We will bound E from above and below, and argue that if |C′| is larger than some constant, then the upper bound is strictly negative while the lower bound is non-negative. This contradiction implies that no positive rate is possible for (L−1)-list decoding if P̂_{x_1,···,x_L} is a non-CP symmetric distribution.

Upper bound
Case when i_1, ···, i_L ∈ [|C′|] are not all distinct. For i_1 ≤ ··· ≤ i_L ∈ [|C′|] not all distinct, …. Eqn. (142) is by the Cauchy–Schwarz inequality. Eqn. (143) holds because the q-norm of a vector is non-increasing in q. Eqn. (144) holds because a probability/type vector has 1-norm 1 and Q is normalized to have Frobenius norm 1.

where Eqn. (133) is used. Therefore, …. Overall, …, provided |C′| is sufficiently large. To see this, note that p(|C′|) := |C′|^L − (|C′| choose L)·L! is a polynomial in |C′| of degree L−1, while −(λ/2)(|C′| choose L)·L! is a polynomial in |C′| of degree L. To give an explicit bound on |C′|, note that the RHS of (149) equals …. In the above inequality, to upper bound p(|C′|), we replace each term of p with a monomial having the largest possible coefficient in absolute value and the largest possible degree. To make the RHS negative, we want …. One can easily check that this holds when |C′| > 2(L−1). Moreover, when |C′| > 2^{L+1}·L!/λ, … is satisfied, and hence so is the original inequality (149). Overall, we have Σ_{(i_1,···,i_L)∈[|C′|]^L} …. Though the bound (150) is crude, it is a constant not depending on the blocklength n.

Lower bound. To see equality (151), let P^{(j)}_x be the empirical distribution of the j-th column of C′ viewed as a |C′| × n matrix, i.e., for x ∈ X, P^{(j)}_x(x) := …. The last inequality (152) follows since (P^{(j)}_x)^{⊗L} is a completely positive tensor. The lower bound and the upper bound contradict each other, which completes the proof.
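For readability, the quantity being double counted can be written out explicitly; the identity below is a reconstruction consistent with Eqns. (151)–(152), where P^{(j)}_x is the empirical distribution of the j-th column of C′:

```latex
E := \sum_{(i_1,\cdots,i_L)\in[|\mathcal{C}'|]^L}
      \left\langle \tau_{x_{i_1},\cdots,x_{i_L}},\, Q \right\rangle
   = \frac{|\mathcal{C}'|^L}{n}\sum_{j=1}^{n}
      \left\langle \big(P^{(j)}_x\big)^{\otimes L},\, Q \right\rangle
   \;\ge\; 0,
```

since each ⟨(P^{(j)}_x)^{⊗L}, Q⟩ ≥ 0 by copositivity of Q; meanwhile the upper bound makes the same sum strictly negative once |C′| exceeds the constant in (150), yielding the contradiction.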

C. Asymmetric case
In this subsection, we handle the asymmetric case of the converse.

We will show that
Lemma 155 (Converse, asymmetric case). If P_{x_1,···,x_L} ∈ Ten^{⊗L}_{|X|}(P_x) is asymmetric as a tensor in Ten^{⊗L}_{|X|}(P_x) and has asymmetry α, then for any 0 < ζ < α, any (ζ, P_{x_1,···,x_L})-equicoupled (w.r.t. the max-absolute-value distance) code C′ has size at most … for some absolute constant c > 0.
Lemma 155 is shown by reducing the problem, in a nontrivial way, from general values of L to L " 2 in which case it is known [WBBJ] that such codes cannot be large.
Lemma 156 (Reduction from general L to L = 2). If P_{x_1,···,x_L} ∈ Ten^{⊗L}_{|X|} has asymmetry asymm(P_{x_1,···,x_L}) = α, then among the distributions P_{y_1,z_1}, P_{y_2,z_2}, ···, P_{y_{L−1},z_{L−1}}, there is at least one distribution P_{y_{i*},z_{i*}} (i* ∈ [L−1]) with asymmetry at least …. Here, for i ∈ [L−1], y_i and z_i (1 ≤ i ≤ L−1) are tuples of random variables defined as …, respectively.
Proof. The proof is by contradiction. We will show that if all of {P_{y_i,z_i}}_{1≤i≤L−1} have small asymmetry, then transpositions cannot back-propagate enough asymmetry to produce the asymmetry α of P_{x_1,···,x_L}.

To make this intuition precise, assume, towards a contradiction, that all of the distributions {P_{y_i,z_i}}_{1≤i≤L−1} have asymmetry strictly less than α′ = α/(L choose 2): asymm(P_{y_i,z_i}) < α/(L choose 2) for all i ∈ [L−1].

Note that the set of transpositions {σ_1, ···, σ_{L−1}} forms a generating set of S_L, where …. Any permutation σ ∈ S_L can be written as a product of σ_i's, σ = σ_{i_ℓ} ··· σ_{i_1}, for some positive integer ℓ and transpositions i_j ∈ [L−1], j ∈ [ℓ]. Such a representation, in particular the value of ℓ, is not necessarily unique. Let ℓ(σ) := min{ℓ ∈ Z_{≥0} : σ = σ_{i_ℓ} ··· σ_{i_1} is a representation by transpositions} be the transposition length of σ, i.e., the length of the shortest such representation. Let ℓ* := max_{σ∈S_L} ℓ(σ).

We claim that ℓ* ≤ (L choose 2). To see this, it suffices to bound ℓ(σ) for the worst-case permutation …. The claim follows by noting that σ can be written as …, which contains (L choose 2) transpositions.
Remark 160. A potential confusion may arise from two conflicting conventions: 1) a product is usually written from left to right, i.e., …; 2) a composition of permutations acts on an element like a composition of functions, from right to left, i.e., for σ, π ∈ S_L and i ∈ [L], (σπ)(i) = σ(π(i)).

The product in the (L−1)-st parenthesis (from left to right) moves L in the initial sequence (L, L−1, ···, 1) to the L-th position; the product in the (L−2)-nd parenthesis moves L−1 to the (L−1)-st position; ...; the permutation σ_1 in the 1st parenthesis moves 2 to the 2nd position, and 1 automatically lands in the 1st position. We thus arrive at the target sequence (1, 2, ···, L).
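The claim ℓ* ≤ (L choose 2) can be checked mechanically for small L: the number of adjacent transpositions σ_i = (i, i+1) needed to sort a permutation equals its inversion count, which is maximized, at exactly (L choose 2), by the reversal (L, L−1, ···, 1).

```python
from itertools import permutations

def transposition_length(perm):
    """Inversion count of perm = minimal number of adjacent transpositions
    sigma_i = (i, i+1) whose product sorts it (bubble sort realizes it)."""
    perm = list(perm)
    count = 0
    for _ in range(len(perm)):
        for j in range(len(perm) - 1):
            if perm[j] > perm[j + 1]:
                perm[j], perm[j + 1] = perm[j + 1], perm[j]
                count += 1
    return count

L = 4
worst = max(transposition_length(p) for p in permutations(range(L)))
```

For L = 4 the worst case is 6 = (4 choose 2), attained by the reversal, in agreement with the claim.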
Next, we prove the key lemma of this section, Lemma 155. Note that, according to its statement, Lemma 155 is independent of the channel for which the code C′ is used. Hence we directly prove the random-variable version of this lemma, which concerns fundamental properties of joint distributions: if the joint distribution of a sequence of random variables has all of its size-L marginals ζ-close to some asymmetric distribution, then the sequence cannot be too long. We prove a finite upper bound on the length of the sequence by reducing the general L > 2 case to the L = 2 case; in the L = 2 case, prior work [WBBJ] shows that the claim indeed holds.
We are now ready to prove the restated version of Lemma 155.
Lemma 173 (Converse, asymmetric case, general L). If a joint distribution P_{x_1,···,x_L} ∈ ∆(X^L) has asymmetry asymm(P_{x_1,···,x_L}) = α, and a sequence of M random variables w_1, ···, w_M supported on X satisfies, for all 1 ≤ j_1 < ··· < j_L ≤ M, ‖P_{w_{j_1},···,w_{j_L}} − P_{x_1,···,x_L}‖ ≤ ζ, then M ≤ exp( c/(α/(L choose 2) − ζ) ) + L − 2 for some universal constant c > 0.

Remark 175 (Asymmetric but projectively symmetric tensors). Lemma 156 does not follow from naïvely marginalizing an asymmetric distribution P_{x_1,···,x_L} and hoping that P_{x_i,x_j} is asymmetric for some 1 ≤ i < j ≤ L. Just as there exist asymmetric matrices (self-couplings) with identical row sums and column sums, we should not expect the asymmetry of a tensor to be preserved under projections. We say that a tensor P_{x_1,···,x_L} ∈ Ten^{⊗L}_{|X|} is ℓ-projectively symmetric (1 ≤ ℓ < L) if all of its order-ℓ projections are symmetric, i.e., for any 1 ≤ i_1 < ··· < i_ℓ ≤ L, P_{x_{i_1},···,x_{i_ℓ}} := [P_{x_1,···,x_L}]_{x_{i_1},···,x_{i_ℓ}} ∈ Ten^{⊗ℓ}_{|X|} is symmetric.
One can easily verify the following facts.
3) A symmetric tensor P_{x_1,···,x_L} is also ℓ-projectively symmetric for all 1 ≤ ℓ < L. In particular, it is a self-coupling, i.e., P_{x_i} is the same for all i ∈ [L].

We provide an example showing that the asymmetry of a tensor cannot, in general, be detected from its lower-order projections. That is, there is an asymmetric tensor with every projection of one lower order being symmetric.

We now construct a concrete example. For a dimension-2 order-3 tensor T : [2]³ → R to be symmetric, it must satisfy the following system E₁ of linear equations: t_{112} = t_{121}, t_{121} = t_{211}, t_{212} = t_{122}, t_{122} = t_{221},

where t_{ijk} := T(i, j, k) for i, j, k ∈ [2]. On the other hand, for T to be projectively symmetric, it must satisfy the following system E₂ of linear equations: t_{122} + t_{121} = t_{212} + t_{211}, t_{112} + t_{122} = t_{211} + t_{221}, t_{121} + t_{221} = t_{112} + t_{212}.

Additionally, for T to represent a joint distribution, all entries must be non-negative and sum to one. Note that E₂ is a less constrained system than E₁, which means that we can find a solution to E₂ that does not satisfy E₁.

In general, for any dimension-d order-L tensor, such examples can always be constructed thanks to the gap in degrees of freedom between the homogeneous linear systems E₁ and E₂.

Fig. 13: Construction of C′ by permuting the rows of C = {x₁, x₂, x₃} using σ ∈ S₃ (where S₃ = {id, σ₁, ···, σ₅}) and juxtaposing all σ(C) (6 of them in total).
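A concrete instance is easy to verify numerically; the particular entries below are numbers we chose by hand to satisfy E₂, non-negativity, and the sum-to-one constraint while violating E₁ (they are not taken from the paper).

```python
from itertools import product

# Hypothetical 2x2x2 joint distribution: every order-2 projection is
# symmetric (system E2 holds), yet the tensor itself is asymmetric
# (E1 fails: t_112 = 0.1 != 0.2 = t_121).
t = {(0, 0, 0): 0.0, (0, 0, 1): 0.1, (0, 1, 0): 0.2, (0, 1, 1): 0.2,
     (1, 0, 0): 0.2, (1, 0, 1): 0.2, (1, 1, 0): 0.1, (1, 1, 1): 0.0}

def projection(t, axis):
    """Marginalize the order-3 tensor over one coordinate."""
    m = {}
    for idx, v in t.items():
        key = tuple(x for k, x in enumerate(idx) if k != axis)
        m[key] = m.get(key, 0.0) + v
    return m

assert abs(sum(t.values()) - 1.0) < 1e-12
for axis in range(3):
    m = projection(t, axis)
    assert all(abs(m[(i, j)] - m[(j, i)]) < 1e-12
               for i, j in product((0, 1), repeat=2))
asymmetric = any(abs(t[(i, j, k)] - t[(k, j, i)]) > 1e-12
                 for i, j, k in product((0, 1), repeat=3))
```

All three order-2 marginals are symmetric matrices, yet swapping the first and third indices changes the tensor, so its asymmetry is invisible to the projections.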

A. A cheap converse
If for a general A = (X, λ_x, S, λ_s, W_{y|x,s}), for every P_x ∈ λ_x, the confusability set is a halfspace defined by a single linear constraint, K^{⊗L}(P_x) := {P_{x_1,···,x_L} ∈ J^{⊗L}(P_x) : ⟨P_{x_1,···,x_L}, C⟩ ≤ b}, for some tensor C ∈ Ten^{⊗L}_{|X|} and constant b, then the converse can be significantly simplified. In particular, we do not have to handle the symmetric and asymmetric cases separately. We describe the proof idea below.

Proof. The proof essentially follows from the following observation. For any asymmetric P_{x_1,···,x_L}, given any P_x-constant composition (ζ, P_{x_1,···,x_L})-equicoupled code C = {x_i}_{i=1}^M in X^n of size M, we can construct a code C′ = {x′_i}_{i=1}^M in X^{n·M!} of the same size which is symmetric. Indeed, we can permute the rows of C using σ ∈ S_M and juxtapose all possible (M! of them in total) row-permuted codes σ(C). (See Fig. 13.) The resulting code C′ is actually not only L-wise approximately equicoupled, but M-wise exactly equicoupled! For any L ∈ [M] and any L-sized (not necessarily ordered) subset {i_1, ···, i_L} of [M], the joint type of x′_{i_1}, ···, x′_{i_L} is exactly equal to …, which is symmetric and independent of the choice of the list (i_1, ···, i_L) (hence let us denote it by P̂_{x_1,···,x_L}). In particular, letting L = M, we get that …. To see the above claims, note that if we juxtapose two pairs of codewords (x_1, x_2) and (x′_1, x′_2), we get a pair of longer codewords (x̃_1, x̃_2) := (x_1 ∘ x′_1, x_2 ∘ x′_2) (where ∘ denotes concatenation) with joint type τ_{x̃_1,x̃_2} = (1/2)(τ_{x_1,x_2} + τ_{x′_1,x′_2}). This still holds if two pairs of codewords of different blocklengths are juxtaposed. Say, (x_1, x_2) has blocklength n while (x′_1, x′_2) has blocklength n′.
Then τ_{x̃_1,x̃_2} = (n/(n+n′))·τ_{x_1,x_2} + (n′/(n+n′))·τ_{x′_1,x′_2}. Back to the proof of the converse in this special case: since the confusability set is defined by a single linear constraint, any convex combination of non-confusable joint types is still outside the confusability set; in particular, so is P̂_{x_1,···,x_L}. We hence reduce the problem to the symmetric case, and the rest of the proof is handled by Lemma 140.
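The blocklength-weighted mixing of joint types under juxtaposition is easy to check numerically:

```python
from collections import Counter

def joint_type(x1, x2):
    """Empirical joint distribution of a pair of equal-length words."""
    n = len(x1)
    return {k: v / n for k, v in Counter(zip(x1, x2)).items()}

x1, x2 = "0011", "0101"            # blocklength n = 4
y1, y2 = "001", "111"              # blocklength n' = 3
t = joint_type(x1 + y1, x2 + y2)   # type of the juxtaposed pair

# Mixture (n/(n+n')) * tau_{x1,x2} + (n'/(n+n')) * tau_{y1,y2}:
ta, tb = joint_type(x1, x2), joint_type(y1, y2)
mix = {k: (4 * ta.get(k, 0.0) + 3 * tb.get(k, 0.0)) / 7
       for k in set(ta) | set(tb)}
```

The joint type of the concatenated pair coincides exactly with the blocklength-weighted convex combination, which is what makes P̂ a convex combination of non-confusable types.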

B. Towards a unifying converse
We find it unsatisfying that we have to use drastically different techniques to prove the symmetric and asymmetric parts of the converse. We suspect that both can be proved in a unified way using the duality between CP and coP tensors, which is the source of the contradiction in our current proof of the symmetric case.

Note, however, that the duality is classically stated only in the space of symmetric tensors. Traditionally, CP and coP tensors are defined to be symmetric, and they are dual cones living in the ambient space Sym^{⊗}_n. If we extend the definitions of CP and coP tensors to the set of all (including asymmetric) tensors, then it is not immediately clear whether duality still holds. Indeed, there are pairs of cones which are dual to each other in a certain ambient space but are no longer dual in a larger ambient space. In short, the ambient space with respect to which the dual cone is computed matters a great deal.

We provide evidence that the symmetric and asymmetric parts of the converse can potentially be unified via the Plotkin-type bound, since the duality between CP and coP tensors (the core of the double counting argument) fortunately holds in larger generality.

Duality. We know that CP^{⊗L}_{|X|} and coP^{⊗L}_{|X|} are dual cones in the space Sym^{⊗L}_{|X|} of symmetric tensors. However, P̂_{x_1,···,x_L} (associated to the equicoupled subcode extracted using the hypergraph Ramsey theorem) is not guaranteed to be symmetric. We claim that duality still holds in the space Ten^{⊗L}_{|X|} of all tensors. Hence a copositive witness Q of a non-CP P̂_{x_1,···,x_L} exists even when P̂_{x_1,···,x_L} is asymmetric.
Note that it is important that B is now taken from Ten bL |X | rather than Sym bL |X | . Also recall that coP bL |X | :" ) .
Note that this definition differs from the standard one (Definition 219), and this cone is potentially larger; indeed, it turns out to be strictly larger. The goal is to show (CP^{⊗L}_{|X|})* = coP^{⊗L}_{|X|}. The direction coP^{⊗L}_{|X|} ⊆ (CP^{⊗L}_{|X|})* is immediate: the definitions of CP and coP tensors remain the same while the dual cone is computed w.r.t. a larger space, so the new dual cone is no smaller than the old one, and the inclusion that held in the traditional setting continues to hold. Indeed, take any B ∈ coP^{⊗L}_{|X|}; for any A … This finishes the whole proof.

Remark 178. In general, duality does not necessarily survive in a larger ambient space: computing the dual cone w.r.t. a larger space may result in a larger cone. For instance, the cone PSD_{|X|} is known to be self-dual in Sym_{|X|}, i.e., PSD*_{|X|} = PSD_{|X|}. However, in Mat_{|X|}, PSD*_{|X|} strictly contains PSD_{|X|}. To see this, note that any skew-symmetric matrix B is in PSD*_{|X|}, since for any PSD (hence symmetric) matrix A, ⟨A, B⟩ = 0 ≥ 0; but B is not necessarily PSD.
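The PSD example from Remark 178 can be checked numerically: a skew-symmetric B satisfies ⟨A, B⟩ = 0 for every symmetric (in particular every PSD) A, so B lies in the dual cone computed in the full matrix space, while B is not itself PSD (it is not even symmetric).

```python
import random

def inner(A, B):
    """Frobenius inner product <A, B> = sum_ij A_ij * B_ij."""
    return sum(A[i][j] * B[i][j] for i in range(2) for j in range(2))

B = [[0.0, 1.0], [-1.0, 0.0]]      # skew-symmetric: B^T = -B
rng = random.Random(0)
ok = True
for _ in range(100):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    A = [[x[i] * x[j] for j in range(2)] for i in range(2)]  # rank-1 PSD
    ok = ok and abs(inner(A, B)) < 1e-12                     # <A, B> = 0
```

Since every PSD matrix is a non-negative combination of such rank-1 matrices, ⟨A, B⟩ = 0 ≥ 0 for all PSD A, exactly as claimed in the remark.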
Define, for $\sigma \in S_L$, $\sigma(P_{\mathsf{x}_1,\cdots,\mathsf{x}_L}) := P_{\mathsf{x}_{\sigma(1)},\cdots,\mathsf{x}_{\sigma(L)}}$. Though duality holds over all tensors, symmetric or not, we do not have a full proof of the converse using duality, since we have trouble bounding the term $\langle \sigma(P_{\mathsf{x}_1,\cdots,\mathsf{x}_L}), Q\rangle = \langle P_{\mathsf{x}_1,\cdots,\mathsf{x}_L}, \sigma^{-1}(Q)\rangle$, which does not necessarily equal $\langle P_{\mathsf{x}_1,\cdots,\mathsf{x}_L}, Q\rangle$ for asymmetric $Q$.
We next show that such an asymmetric witness $Q$ does exist, and is sometimes necessary in the sense that some asymmetric (hence non-CP) tensors have no symmetric witness. This means that the dual cone of $\mathrm{coP}$ w.r.t. $\mathrm{Ten}^{\otimes L}_{|\mathcal{X}|}$ (instead of $\mathrm{Sym}^{\otimes L}_{|\mathcal{X}|}$) is strictly larger. Asymmetric distributions without symmetric coP witness. Let $L = 2$. We construct an asymmetric self-coupling $P_{\mathsf{x}_1,\mathsf{x}_2} \in \Delta([3]^2)$ without a symmetric coP witness $Q$ satisfying $\langle P_{\mathsf{x}_1,\mathsf{x}_2}, Q\rangle < 0$. If there were a symmetric coP $Q$ such that $\langle P_{\mathsf{x}_1,\mathsf{x}_2}, Q\rangle < 0$, then $\langle \bar{P}_{\mathsf{x}_1,\mathsf{x}_2}, Q\rangle = \langle P_{\mathsf{x}_1,\mathsf{x}_2}, Q\rangle < 0$ as well, where $\bar{P}_{\mathsf{x}_1,\mathsf{x}_2} := \frac{1}{2}\left(P_{\mathsf{x}_1,\mathsf{x}_2} + P_{\mathsf{x}_2,\mathsf{x}_1}\right)$ is the symmetrized version of $P_{\mathsf{x}_1,\mathsf{x}_2}$.

XV. SANITY CHECKS
Consider the bit-flip model. In this section, we verify the correctness of our characterization of the generalized Plotkin point using the bit-flip model as a running example. For $L = 3, 4$, we will numerically recover Blinovsky's [Bli86] characterization of the Plotkin point $P_{L-1}$ for $(p, L-1)$-list decoding. In particular, $P_2 = 1/4$ and $P_3 = 5/16$.

A. L = 3
The LP can be written in a compact matrix form. Observe that as $p$ increases, the linear system becomes monotonically easier to satisfy. Checked using Mathematica, the above LP is feasible precisely when $p \ge 1/4$ (and hence the distribution is confusable in that regime). Therefore, the $(p, L-1)$-list decoding capacity hits $0$ precisely at $p = 1/4$.
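The $p = 1/4$ threshold can also be sanity-checked with a short computation that avoids the LP altogether. The sketch below (ours, not from [Bli86]) brute-forces all $2^{2^L}$ deterministic ball centers $\mathbf{y}$ as functions of the tuple $(x_1, x_2, x_3)$; an averaging argument over minority bits shows randomized centers cannot do better, so the min-max value is exactly the confusability threshold:

```python
from itertools import product

# Smallest worst-case disagreement fraction max_i Pr[x_i != y(x)] over all
# deterministic centers y: {0,1}^3 -> {0,1}, with (x_1, x_2, x_3) uniform.
L = 3
tuples = list(product([0, 1], repeat=L))           # the 2^L equiprobable inputs
best = min(
    max(sum(t[i] != yv for t, yv in zip(tuples, y)) / len(tuples)
        for i in range(L))
    for y in product([0, 1], repeat=len(tuples))   # all 256 centers
)
print(best)  # 0.25
```

The optimum is attained by the majority vote $\mathbf{y} = \mathrm{MAJ}(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$: each codeword disagrees with the majority on a $1/4$ fraction of positions, so balls of radius $np$ suffice exactly when $p \ge 1/4$.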

B. L " 4
For L " 4, one can obtain a similar LP whose infeasibility is equivalent to H˜" 1{2 1{2  s empty, it boils down to checking the infeasibility of a linear program with 2 L`1 variables and 2 L`1`1`2L`L constraints, 2 L`1 of them for non-negativity of probability mass, 1 of them for probability mass summing up to one, 2 L of them for ensuring that P x1,¨¨¨,xL P J bL pP x q is a pP x , Lq-self-coupling, L of them for the nonconfusability guarantee: P x1,¨¨¨,xL R K bL pP x q. The size of the program (or the number of defining constraints of the corresponding polytope) grows exponentially in L. However, since we are concerned with absolute constant L in this paper, for any given L, the feasibility can be certified in constant time. Observe that, since the LP in the bit-flip setting is so structured, one can write it down explicitly by hand for any given L.

XVI. BLINOVSKY [BLI86] REVISITED
In this section, we fully recover Blinovsky's [Bli86] characterization of the Plotkin point $P_{L-1}$ for $(p, L-1)$-list decoding under the bit-flip model.
Let $\varphi$ be the standard bijection between $\{0,1\}$ and $\{-1,1\}$ given by $\varphi(0) := 1$ and $\varphi(1) := -1$. We identify the type $\tau_{\mathbf{x}} \in \mathcal{P}^{(n)}(\mathbb{F}_2)$ of a binary length-$n$ vector $\mathbf{x} \in \mathbb{F}_2^n$ with a $\{-1,1\}$-valued random variable $\mathsf{x}$ defined by $\Pr[\mathsf{x} = -1] = \frac{\mathrm{wt}_H(\mathbf{x})}{n}$ and $\Pr[\mathsf{x} = 1] = 1 - \frac{\mathrm{wt}_H(\mathbf{x})}{n}$.
Indeed, the distribution $P_{\mathsf{x}} \in \mathcal{P}^{(n)}(\{-1,1\})$ of $\mathsf{x}$ is the type of the image $\varphi(\mathbf{x})$ of $\mathbf{x}$ under $\varphi$.
For a collection of vectors $\mathbf{x}_1,\cdots,\mathbf{x}_k \in \mathbb{F}_2^n$, their joint type is now represented by a sequence of random variables $\mathsf{x}_1,\cdots,\mathsf{x}_k$ with joint distribution $P_{\mathsf{x}_1,\cdots,\mathsf{x}_k}$: for any $x_1,\cdots,x_k \in \{-1,1\}$, $P_{\mathsf{x}_1,\cdots,\mathsf{x}_k}(x_1,\cdots,x_k) = \Pr[\mathsf{x}_1 = x_1,\cdots,\mathsf{x}_k = x_k] = \tau_{\mathbf{x}_1,\cdots,\mathbf{x}_k}(\varphi^{-1}(x_1),\cdots,\varphi^{-1}(x_k))$.
It is easy to check that, for $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{F}_2^n$, $\frac{d_H(\mathbf{x}_1, \mathbf{x}_2)}{n} = \Pr[\mathsf{x}_1 \ne \mathsf{x}_2] = \frac{1 - \mathbb{E}[\mathsf{x}_1 \mathsf{x}_2]}{2}$ (Eqn. (179)).

Let $r := \mathbb{E}_{(\mathsf{x}_1,\cdots,\mathsf{x}_L) \sim \mathrm{Unif}(\{-1,1\}^L)}\left[|\mathsf{x}_1 + \cdots + \mathsf{x}_L|\right]$ be the expected translation distance of a one-dimensional unbiased random walk after $L$ steps, where each $\mathsf{x}_i$ ($1 \le i \le L$) is independent and uniformly distributed on $\{-1,1\}$.
Theorem 181. The Plotkin point $P_{L-1}$ for $(p, L-1)$-list decoding is given by $P_{L-1} = \frac{1}{2}\left(1 - \frac{r}{L}\right)$.
Remark 182. The formula in Theorem 181 agrees with the one by Blinovsky. To see this, we first compute $r$. For odd $L = 2k+1$, where $k \in \mathbb{Z}_{>0}$ is some strictly positive integer, it is easy to see, using the binomial theorem (Fact (47)), that $r = (2k+1)\binom{2k}{k}2^{-2k}$; similarly, for even $L = 2k$, $r = 2k\binom{2k}{k}2^{-2k}$. In both cases $P_{L-1} = \frac{1}{2} - \binom{2k}{k}2^{-2k-1}$, which is exactly Blinovsky's expression.
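Theorem 181 and its agreement with Blinovsky's closed form can be verified in exact rational arithmetic (a quick check of ours; the function name `plotkin_point` is our own):

```python
from fractions import Fraction
from itertools import product
from math import comb

def plotkin_point(L):
    # r = E|x_1 + ... + x_L| for L iid uniform +-1 steps, computed exactly
    r = Fraction(sum(abs(sum(x)) for x in product([-1, 1], repeat=L)), 2 ** L)
    return Fraction(1, 2) * (1 - r / L)          # P_{L-1} = (1 - r/L)/2

print(plotkin_point(3), plotkin_point(4))        # 1/4 5/16
# Blinovsky's closed form: P_{L-1} = 1/2 - C(2k,k) 2^{-2k-1} for L in {2k, 2k+1}
for k in range(1, 5):
    closed = Fraction(1, 2) - Fraction(comb(2 * k, k), 2 ** (2 * k + 1))
    assert plotkin_point(2 * k) == plotkin_point(2 * k + 1) == closed
```

In particular, consecutive even and odd list sizes share the same Plotkin point.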
Now we simplify the formula in Theorem 181.
Proof. We will show that if $p = \frac{1}{2}\left(1 - \frac{r+\eta}{L}\right) < \frac{1}{2}\left(1 - \frac{r}{L}\right)$ for some $\eta > 0$, then the product distribution $\mathrm{Bern}^{\otimes L}(1/2)$ lies outside the corresponding confusability set $\mathcal{K}^{\otimes L}(\mathrm{Bern}(1/2))$. Using the framework developed in this paper, a random code of a suitable positive rate in which each codeword is sampled independently and uniformly from $\mathcal{T}_{\mathbf{x}}(\mathrm{Bern}(1/2))$ is then $(p, L-1)$-list decodable w.h.p.
The proof is by contradiction. If $P_{\mathsf{x}_1,\cdots,\mathsf{x}_L} := \mathrm{Bern}^{\otimes L}(1/2)$ were confusable, then, by the definition (Definition 56) of confusability of tuples, an $L$-tuple of distinct codewords $\mathbf{x}_1,\cdots,\mathbf{x}_L$ of joint type $\tau_{\mathbf{x}_1,\cdots,\mathbf{x}_L} = P_{\mathsf{x}_1,\cdots,\mathsf{x}_L}$ could be covered by a ball of radius $np$ centered around some $\mathbf{y} \in \mathbb{F}_2^n$. Equivalently, by the definition (Definition 57) of confusability of distributions, there would be a refinement $P_{\mathsf{x}_1,\cdots,\mathsf{x}_L,\mathsf{y}} \in \Delta\left(\{-1,1\}^{L+1}\right)$ such that $[P_{\mathsf{x}_1,\cdots,\mathsf{x}_L,\mathsf{y}}]_{\mathsf{x}_1,\cdots,\mathsf{x}_L} = P_{\mathsf{x}_1,\cdots,\mathsf{x}_L}$, and for every $i \in [L]$, $P_{\mathsf{x}_i,\mathsf{y}}(-1,1) + P_{\mathsf{x}_i,\mathsf{y}}(1,-1) \le p$.
This means that, for every $i \in [L]$, $\mathbb{E}[\mathsf{x}_i \mathsf{y}] = 1 - 2\Pr[\mathsf{x}_i \ne \mathsf{y}] \ge 1 - 2p = \frac{r+\eta}{L}$, by the relation (Eqn. (179)) between the Hamming distance of vectors and the correlation of their random variable representations. Hence $\mathbb{E}[(\mathsf{x}_1 + \cdots + \mathsf{x}_L)\mathsf{y}] \ge r + \eta$.
Using the above observation and the fact that $\mathsf{y} \in \{-1,1\}$, we get $r = \mathbb{E}\left[|\mathsf{x}_1+\cdots+\mathsf{x}_L|\right] \ge \mathbb{E}[(\mathsf{x}_1+\cdots+\mathsf{x}_L)\mathsf{y}] \ge r + \eta$ (188), a contradiction.
We next verify the converse: if $p > P_{L-1}$, then no positive rate is possible, i.e., there is no infinite sequence of $(p, L-1)$-list decodable codes of positive rate.
Proof. Our goal is to show that if $p > P_{L-1}$, then $C_{L-1} = 0$. Suppose $p = \frac{1}{2}\left(1 - \frac{r-\eta}{L}\right)$ for a constant $\eta > 0$. We are going to show that no infinite sequence of codes $\mathcal{C}_n$ of positive rate is $(p, L-1)$-list decodable. First, by the argument in the previous section, we can extract a sequence of subcodes $\mathcal{C}'_n \subseteq \mathcal{C}_n$ of positive rate satisfying that, for every tuple of distinct codewords $\mathbf{x}_1,\cdots,\mathbf{x}_L \in \mathcal{C}'_n$ and every $(x_1,\cdots,x_L) \in \mathbb{F}_2^L$, $\left|\tau_{\mathbf{x}_1,\cdots,\mathbf{x}_L}(x_1,\cdots,x_L) - \hat{P}_{\mathsf{x}_1,\cdots,\mathsf{x}_L}(x_1,\cdots,x_L)\right| \le \zeta$ for some symmetric distribution $\hat{P}_{\mathsf{x}_1,\cdots,\mathsf{x}_L} \in \Delta\left(\mathcal{X}^L\right)$ and some positive constant $\zeta > 0$. For the calculations to follow, it suffices to take $\zeta = \frac{\eta}{L(L-1) r 2^{L+2}}$ (192).
To show non-list-decodability of $\mathcal{C}'_n$ (and hence of $\mathcal{C}_n$), we will argue that there is a list $(\mathbf{x}_{i_1},\cdots,\mathbf{x}_{i_L}) \in \binom{\mathcal{C}'_n}{L}$ that can be covered by a ball of radius $np$ centered around $\mathrm{MAJ}(\mathbf{x}_{i_1},\cdots,\mathbf{x}_{i_L})$. The proof is by contradiction: suppose this is not the case, i.e., no list can be covered by the ball centered at its majority. Define, for $(i_1,\cdots,i_L) \in \left[2^{nR}\right]^L$, $Q_{i_1,\cdots,i_L} := (\mathsf{x}_{i_1}+\cdots+\mathsf{x}_{i_L}) \cdot \mathrm{MAJ}(\mathsf{x}_{i_1},\cdots,\mathsf{x}_{i_L}) - r$.
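For intuition behind this choice of $Q$ (a side computation of ours, not part of the proof): if the list entries were i.i.d. uniform, the first term would have mean exactly $r$, since for odd $L$ the majority is the sign of the sum; $\mathbb{E}[Q]$ thus measures the deviation of the code's joint statistics from those of a random code:

```python
from fractions import Fraction
from itertools import product

L = 5

def maj(x):                      # majority bit of an odd number of +-1 values
    return 1 if sum(x) > 0 else -1

# E[(x_1+...+x_L) * MAJ(x_1,...,x_L)] under iid uniform +-1 entries equals
# E|x_1+...+x_L| = r, because (sum) * sign(sum) = |sum|.
e = Fraction(sum(sum(x) * maj(x) for x in product([-1, 1], repeat=L)), 2 ** L)
r = Fraction(sum(abs(sum(x)) for x in product([-1, 1], repeat=L)), 2 ** L)
print(e == r)  # True
```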
We will provide a strictly negative upper bound and a non-negative lower bound on the average $\overline{Q} := \frac{1}{M'^L} \sum_{(i_1,\cdots,i_L) \in [M']^L} \mathbb{E}[Q_{i_1,\cdots,i_L}]$, where $M' := |\mathcal{C}'_n|$, which is a contradiction and finishes the proof.
Thus we have, for any list $\mathbf{x}_1,\cdots,\mathbf{x}_L \in \mathcal{C}'_n$ of distinct codewords, $\mathbb{E}[Q_{i_1,\cdots,i_L}] \le -\eta/2$, where the last inequality (Eqn. (199)) follows from the choice of $\zeta$ (Eqn. (192)). Since the above calculations work for any list of distinct codewords, the same bound $\mathbb{E}[Q_{i_1,\cdots,i_L}] \le -\eta/2$ holds for every $(i_1,\cdots,i_L) \in \binom{[M']}{L}$.
For lists $(i_1,\cdots,i_L) \in [M']^L$ whose indices are not all distinct, we use the trivial bound $\mathbb{E}[Q_{i_1,\cdots,i_L}] \le \mathbb{E}\left[|\mathsf{x}_{i_1}+\cdots+\mathsf{x}_{i_L}|\right] - r \le L - r$.
Overall, we obtain the strictly negative upper bound on $\overline{Q}$: the lists with non-distinct indices constitute a vanishing fraction of $[M']^L$, so their contribution (at most $L - r$ each) is dominated by the $-\eta/2$ bound on the distinct lists, which gives the last inequality (Eqn. (200)). We now turn to the non-negative lower bound on $\overline{Q}$. In the calculations, we use the following definitions and facts. 1) Eqn. (201) follows from the definition of joint types. 2) Eqn. (202) is obtained by rearranging terms.
3) In Eqn. (203), as before, we let, for $j \in [n]$ and $x \in \mathbb{F}_2$, $P^{(j)}_{\mathsf{x}}(x) := \frac{1}{M'}\sum_{i \in [M']} \mathbb{1}\{\mathbf{x}_i(j) = x\}$ denote the empirical distribution of the $j$-th column of $\mathcal{C}'_n$ when viewed as an $M' \times n$ matrix. In expression (204), the $j$-th summand can be viewed as the expected translation distance of a non-lazy one-dimensional random walk after $L$ steps: the walker moves left ($x = 1$) with probability $P^{(j)}_{\mathsf{x}}(1)$ and moves right ($x = 0$) with probability $P^{(j)}_{\mathsf{x}}(0)$. It is not hard to check that the expected translation distance is minimized when the walk is unbiased, i.e., when $P^{(j)}_{\mathsf{x}}(1) = P^{(j)}_{\mathsf{x}}(0) = 1/2$; this is formally justified in Appendix C. Hence, for every $j \in [n]$, $\mathbb{E}\left[|\mathsf{x}_1+\cdots+\mathsf{x}_L|\right] - r \ge 0$, where the $\mathsf{x}_i$'s are i.i.d. according to $P^{(j)}_{\mathsf{x}}$.
Since the above bound is valid for every $j \in [n]$, it remains valid when averaged over $j \sim \mathrm{Unif}([n])$. Hence we have $\overline{Q} \ge 0$, completing the contradiction.

XVII. GV RATE VS. CLOUD RATE
In this section, we are concerned with the question of unique decoding (the special case $L - 1 = 1$) under the bit-flip model.
In [WBBJ], bounds on achievable rates of codes for general adversarial channels are provided. A Gilbert-Varshamov-type expression was obtained using a purely random code construction, and a rate lower bound (which we call the cloud rate) that generalizes the GV-type expression was given by a cloud code construction. We evaluate both bounds under the bit-flip model. We show that the Gilbert-Varshamov-type bound for general adversarial channels indeed coincides with the classic GV bound in this particular setting. We also provide a convex program for evaluating the cloud rate.
We use the probability vector $[P_{\mathsf{x}}(1) \cdots P_{\mathsf{x}}(|\mathcal{X}|)]^\top$ to denote a distribution $P_{\mathsf{x}} \in \Delta(\mathcal{X})$. Taking any input distribution $P_{\mathsf{x}} = \mathrm{Bern}(w) = [1-w\ \ w]^\top$ from $\Delta(\{0,1\})$, we first explicitly compute the basic objects we are concerned with in this paper.
Since CP 2 " DNN 2 , we have CP 2 pwq "CP 2 X J pwq In other words, 0 ă w ă 1 and 0 ă p ă w´w 2 . In this case, Actually, if the above conditions hold, then when 1{3 ď w ă 1, the boundary of Kpw, pq is p and the boundary of CP 2 pwq is w´w 2 . Note that the right boundary " p1´wq 2 w´w 2 w´w 2 w 2  " " 1´w w  b2 of CP 2 pwq is the only distribution in CP 2 pwq of CP-rank-1. GV rate. We first state the GV-type expression given by in [WBBJ].
Lemma 205 (Gilbert-Varshamov rate). For a general adversarial channel $\mathcal{A} = \left\{\mathcal{X}, \lambda_{\mathsf{x}}, \mathcal{S}, \lambda_{\mathsf{s}}, \mathcal{Y}, W_{\mathsf{y}|\mathsf{x},\mathsf{s}}\right\}$, its unique decoding capacity is at least $\max_{P_{\mathsf{x}}} \min_{P_{\mathsf{x}_1,\mathsf{x}_2} \in \mathcal{K}^{\otimes 2}(P_{\mathsf{x}})} I(\mathsf{x}_1; \mathsf{x}_2)$, where the mutual information is calculated using $P_{\mathsf{x}_1,\mathsf{x}_2}$.
We now evaluate the above expression under the bit-flip model. A coupling $P_{\mathsf{x}_1,\mathsf{x}_2}$ of two $\mathrm{Bern}(1/2)$ marginals with $\Pr[\mathsf{x}_1 \ne \mathsf{x}_2] = d \le 2p$ necessarily has $P_{\mathsf{x}_1,\mathsf{x}_2}(0,1) = P_{\mathsf{x}_1,\mathsf{x}_2}(1,0) = d/2$ and $P_{\mathsf{x}_1,\mathsf{x}_2}(0,0) = P_{\mathsf{x}_1,\mathsf{x}_2}(1,1) = (1-d)/2$, so $I(\mathsf{x}_1;\mathsf{x}_2) = 1 - H(d)$ is minimized at $d = 2p$. The resulting rate $1 - H(2p)$ matches the classic GV bound obtained via a greedy volume-packing argument. Cloud rate. We now state the cloud rate expression given by [WBBJ].
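The minimization over couplings can be checked numerically (our sketch; the grid resolution is an arbitrary choice):

```python
from math import log2

def h(q):  # binary entropy
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def gv_rate(p, grid=100000):
    # Uniform marginals force P(0,1) = P(1,0) = d/2, P(0,0) = P(1,1) = (1-d)/2,
    # so I(x1; x2) = 1 - h(d); minimize over d <= 2p on a grid.
    return min(1 - h(d / grid) for d in range(0, int(2 * p * grid) + 1))

p = 0.05
print(abs(gv_rate(p) - (1 - h(2 * p))) < 1e-9)  # True: matches 1 - h(2p)
```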
Lemma 206 (Cloud rate). For a general adversarial channel $\mathcal{A} = \left\{\mathcal{X}, \lambda_{\mathsf{x}}, \mathcal{S}, \lambda_{\mathsf{s}}, \mathcal{Y}, W_{\mathsf{y}|\mathsf{x},\mathsf{s}}\right\}$, its unique decoding capacity is at least the value of a max-min expression over refinements $P_{\mathsf{u},\mathsf{x}_1,\mathsf{x}_2,\mathsf{s}_1,\mathsf{s}_2,\mathsf{y}} \in \Delta\left(\mathcal{U} \times \mathcal{X}^2 \times \mathcal{S}^2 \times \mathcal{Y}\right)$ subject to $P_{\mathsf{s}_1}, P_{\mathsf{s}_2} \in \lambda_{\mathsf{s}}$, $P_{\mathsf{u},\mathsf{x}_1,\mathsf{s}_1,\mathsf{y}} = P_{\mathsf{u},\mathsf{x}} P_{\mathsf{s}_1|\mathsf{u},\mathsf{x}_1} W_{\mathsf{y}|\mathsf{x}_1,\mathsf{s}_1}$ and $P_{\mathsf{u},\mathsf{x}_2,\mathsf{s}_2,\mathsf{y}} = P_{\mathsf{u},\mathsf{x}} P_{\mathsf{s}_2|\mathsf{u},\mathsf{x}_2} W_{\mathsf{y}|\mathsf{x}_2,\mathsf{s}_2}$.
Remark 207. The reason that [WBBJ] has to define a different confusability set $\mathcal{K}_{\mathrm{cloud}}$ when a cloud code is used is that, as part of the code design, the distributions $P_{\mathsf{u}}, P_{\mathsf{u}|\mathsf{x}}$ are revealed to every party, including the adversary; hence he may be able to inject noise patterns that are potentially more malicious than in the case where he does not have such knowledge. We refer the reader to the proof in [WBBJ]. In the bit-flip setting, it is easy to verify that $\mathcal{K}_{\mathrm{cloud}}(P_{\mathsf{u},\mathsf{x}}) = \left\{ P_{\mathsf{u},\mathsf{x}_1,\mathsf{x}_2} \in \Delta\left(\mathcal{U} \times \mathcal{X}^2\right) : P_{\mathsf{u},\mathsf{x}_1} = P_{\mathsf{u},\mathsf{x}_2} = P_{\mathsf{u},\mathsf{x}},\ P_{\mathsf{x}_1,\mathsf{x}_2}(0,1) + P_{\mathsf{x}_1,\mathsf{x}_2}(1,0) \le 2p \right\}$.
4) Given any adversarial channel, when we are "below the Plotkin point" (i.e., there are non-confusable CP distributions), can we construct explicit codes of positive rate? We know that random codes are list decodable w.h.p.

XIX. ACKNOWLEDGEMENT
We thank Andrej Bogdanov, who provided an elegant reduction from general $L$ to $L = 2$ for the proof of the asymmetric case of the converse (Sec. 155) and reconstructed Blinovsky's [Bli86] characterization of $P_{L-1}$ via a conceptually cleaner proof, though he generously declined to co-author this paper. We also thank him for inspiring discussions in the early stages and helpful comments near the end of this work.
Part of this work was done while YZ was visiting the Simons Institute for the Theory of Computing for the Summer Cluster: Error-Correcting Codes and High-Dimensional Expansion.

A. Tensor products
Definition 208 (Tensor product). For two tensors $A \in \mathrm{Ten}^{\otimes m}_n$ and $B \in \mathrm{Ten}^{\otimes \ell}_n$, their tensor product is defined as $A \otimes B := [A(i_1,\cdots,i_m)B(j_1,\cdots,j_\ell)] \in \mathrm{Ten}^{\otimes(m+\ell)}_n$.
The Frobenius norm is defined as $\|A\|_F := \sqrt{\langle A, A\rangle}$.
Definition 210 (Hadamard product). For two tensors $A, B \in \mathrm{Ten}^{\otimes m}_n$, their Hadamard product is defined as $A \circ B := [A(i_1,\cdots,i_m)B(i_1,\cdots,i_m)] \in \mathrm{Ten}^{\otimes m}_n$.
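Definitions 208 and 210 can be made concrete with a dictionary-based toy implementation (a sketch of ours; real applications would use a numerical array library):

```python
from itertools import product

def tensor_product(A, B):
    # (A (x) B)(i_1..i_m, j_1..j_l) = A(i_1..i_m) * B(j_1..j_l)
    return {i + j: A[i] * B[j] for i in A for j in B}

def hadamard(A, B):
    # (A o B)(i_1..i_m) = A(i_1..i_m) * B(i_1..i_m)
    return {i: A[i] * B[i] for i in A}

n = 2
A = {i: 1.0 for i in product(range(n), repeat=2)}  # all-ones order-2 tensor
B = {i: 2.0 for i in product(range(n), repeat=1)}  # order-1 tensor
C = tensor_product(A, B)                           # order-3 tensor
print(len(C), C[(0, 0, 0)])        # 8 2.0
print(hadamard(A, A)[(1, 1)])      # 1.0
```

Tensors are stored as maps from index tuples to entries, so the order of the product is simply the concatenated index length.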

B. Tensor decomposition
Definition 211 (Canonical decomposition). For a tensor $A \in \mathrm{Ten}^{\otimes m}_n$, its canonical decomposition has the form $A = \sum_{j=1}^r \alpha_j\, x_{j,1} \otimes \cdots \otimes x_{j,m}$, where each $x_{j,i} \in \mathcal{S}^{n-1}_2$. The smallest $r$ for which $A$ admits such a decomposition is called the rank of $A$. If $A$ is symmetric, then $A = \sum_{j=1}^r \alpha_j x_j^{\otimes m}$ is an analog of the eigendecomposition of symmetric matrices. The smallest such $r$ is called the symmetric rank of $A$.
Conjecture 212. For $A \in \mathrm{Sym}^{\otimes m}_n$, $\mathrm{rank}(A) = \mathrm{sym\text{-}rank}(A)$. Remark 213. It is known to be true if $\mathrm{rank}(A) \le m$.
Definition 214 (Tucker decomposition). For a tensor $A \in \mathrm{Ten}^{\otimes m}_n$, the Tucker decomposition has the form $A = \sum_{j_1,\cdots,j_m} G(j_1,\cdots,j_m)\, x_{1,j_1} \otimes \cdots \otimes x_{m,j_m}$, where $G$ is known as the core tensor.
It is an analog of the singular value decomposition of matrices.
A tensor $A \in \mathrm{Ten}^{\otimes m}_n$ has $n(m-1)^{n-1}$ eigenvalues. $A$ may have non-real eigenvalues even if $A$ is symmetric. If an eigenvector is real, then the corresponding eigenvalue is also real; such eigenvalues are called H-eigenvalues. They always exist for even-order tensors.

C. Special tensors
Definition 215 (NN tensors). A tensor is said to be non-negative if each of its entries is non-negative. The set of order-$m$ dimension-$n$ non-negative tensors is denoted by $\mathrm{NN}^{\otimes m}_n$. Definition 216 (PSD tensors, PD tensors). For even $m$, $A \in \mathrm{Ten}^{\otimes m}_n$ is positive semidefinite (PSD) if $\langle A, x^{\otimes m}\rangle \ge 0$ for every $x \in \mathbb{R}^n$. $A$ is positive definite (PD) if the above inequality is strict for all $x \ne 0$.
The sets of PSD and PD tensors are denoted by $\mathrm{PSD}^{\otimes m}_n$ and $\mathrm{PD}^{\otimes m}_n$, respectively. Definition 217 (CP tensors, CP tensor rank). A tensor $P \in \mathrm{Ten}^{\otimes m}_n$ is said to be completely positive if, for some $r \ge 1$, there are component-wise non-negative vectors $p_1,\cdots,p_r \in \mathbb{R}^n_{\ge 0}$ such that $P = \sum_{j=1}^r p_j^{\otimes m}$.
The set of CP tensors is denoted by $\mathrm{CP}^{\otimes m}_n$. The least $r$ such that $P$ has a completely positive decomposition is called the CP-rank of $P$. If $\mathrm{span}\{p_1,\cdots,p_r\} = \mathbb{R}^n$, then $P$ is said to be strongly CP.
Fact 218. Verifying if a symmetric non-negative tensor is CP is NP-hard.
Definition 219 (coP tensors). $A \in \mathrm{Sym}^{\otimes m}_n$ is copositive if $\langle A, x^{\otimes m}\rangle \ge 0$ for all $x \in \mathbb{R}^n_{\ge 0}$. The set of copositive tensors is denoted by $\mathrm{coP}^{\otimes m}_n$. Theorem 220 (Duality). $\mathrm{CP}^{\otimes m}_n$ and $\mathrm{coP}^{\otimes m}_n$ are closed convex pointed cones with nonempty interior in $\mathrm{Sym}^{\otimes m}_n$. For $m \ge 2$, $n \ge 1$, they are dual to each other.
Definition 221 (DNN tensors). For even $m$, $A \in \mathrm{Sym}^{\otimes m}_n$ is doubly non-negative (DNN) if $A$ is entry-wise non-negative and $\langle A, x^{\otimes m}\rangle$ is a sum of squares as a polynomial in the components of $x$.

APPENDIX B HYPERGRAPH RAMSEY NUMBERS
Let $R^{(r)}_k(s_1,\cdots,s_k)$ denote the smallest $N$ such that for any $k$-colouring of the hyperedges of the complete $r$-uniform hypergraph on $N$ vertices, there must be a monochromatic clique of size $s_i$ in the $i$-th colour for some $i \in [k]$.
Lemma 224 (Properties of hypergraph Ramsey numbers). 1) For any $i \in [k]$ and $s_j \ge r$ ($j \ne i$), $R^{(r)}_k(s_1,\cdots,s_{i-1}, r, s_{i+1},\cdots,s_k) = R^{(r)}_{k-1}(s_1,\cdots,s_{i-1}, s_{i+1},\cdots,s_k)$. 2) For any $\sigma \in S_k$, $R^{(r)}_k(s_1,\cdots,s_k) = R^{(r)}_k(s_{\sigma(1)},\cdots,s_{\sigma(k)})$. Lemma 225 (Finiteness of hypergraph Ramsey numbers). For any positive integers $r, k, s_1,\cdots,s_k$, the hypergraph Ramsey number $R^{(r)}_k(s_1,\cdots,s_k)$ is finite. In particular, it satisfies the following recursive inequalities, where $t_i$ denotes the tower function, $t_1(x) := x$ and $t_{i+1}(x) := 2^{t_i(x)}$. 2) For $r \ge 3$, there are constants $c, c' > 0$ such that $t_{r-1}(c \cdot s^2) \le R^{(r)}_2(s, s) \le t_r(c' \cdot s)$. 3) For $s > k \ge 2$, there are constants $c, c' > 0$ such that $t_r(c \cdot k) < R^{(r)}_k(s,\cdots,s) < t_r(c' \cdot k \log k)$.
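For the smallest non-trivial case, $R^{(2)}_2(3,3) = 6$ (the classic Ramsey number $R(3,3)$) can be confirmed by exhaustive search (a self-contained sketch of ours):

```python
from itertools import combinations, product

def has_mono_triangle(n, colouring):
    # colouring: one 0/1 value per edge of the complete graph K_n
    col = dict(zip(combinations(range(n), 2), colouring))
    return any(col[(a, b)] == col[(a, c)] == col[(b, c)]
               for a, b, c in combinations(range(n), 3))

def every_colouring_has_mono_triangle(n):
    m = n * (n - 1) // 2
    return all(has_mono_triangle(n, c) for c in product([0, 1], repeat=m))

print(every_colouring_has_mono_triangle(5))  # False (2-colour the 5-cycle)
print(every_colouring_has_mono_triangle(6))  # True, so R^(2)_2(3,3) = 6
```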

APPENDIX C EXPECTED TRANSLATION DISTANCE OF A ONE-DIMENSIONAL RANDOM WALK
Lemma 227. Consider a random walk $\mathsf{x}_1,\cdots,\mathsf{x}_L$ of length $L$, where each $\mathsf{x}_i$ ($1 \le i \le L$) is an independent and identically distributed $\{-1,1\}$-valued random variable satisfying $\Pr[\mathsf{x}_i = 1] = p$ and $\Pr[\mathsf{x}_i = -1] = 1 - p$.
Without loss of generality, assume p ě 1{2. Then the expected translation distance E r|x 1`¨¨¨`xL |s of this random walk after L steps is minimized when p " 1{2.
Proof. Create another walk $\mathsf{x}'_1,\cdots,\mathsf{x}'_L$ with bias $1/2$ that is coupled with $\mathsf{x}_1,\cdots,\mathsf{x}_L$ in the following way: $\Pr[\mathsf{x}'_i = -1 \mid \mathsf{x}_i = -1] = 1$, $\Pr[\mathsf{x}'_i = 1 \mid \mathsf{x}_i = 1] = \frac{1}{2p}$, and $\Pr[\mathsf{x}'_i = -1 \mid \mathsf{x}_i = 1] = 1 - \frac{1}{2p}$. It is easy to see that each $\mathsf{x}'_i$ is uniformly distributed on $\{-1,1\}$ under this coupling.
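Lemma 227 can be checked by direct enumeration over all $2^L$ sample paths (our sketch; the grid over $p$ is an arbitrary choice):

```python
from itertools import product

def expected_distance(L, p):
    # E|x_1 + ... + x_L| with iid steps: Pr[x_i = 1] = p, Pr[x_i = -1] = 1 - p
    total = 0.0
    for x in product([-1, 1], repeat=L):
        prob = 1.0
        for xi in x:
            prob *= p if xi == 1 else 1 - p
        total += prob * abs(sum(x))
    return total

L = 4
vals = [(expected_distance(L, 0.5 + t / 100), t) for t in range(50)]
print(min(vals)[1])  # 0: the minimum over the grid is attained at p = 1/2
```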

APPENDIX D BLINOVSKY [BLI86] VS. ALON-BUKH-POLYANSKIY [ABP18]
In this section we show that, though ostensibly different, the formulas for the Plotkin points for $(p, L-1)$-list decoding given by Blinovsky and by Alon-Bukh-Polyanskiy actually agree with each other. The proof is essentially due to the user Marko Riedel on Mathematics Stack Exchange [Cla19].
For L " 2k or 2k`1 for some positive integer k P Z ą0 , Blinovsky's formula is while Alon-Bukh-Polyanskiy wrote it as We are going to show that Lemma 228. For any k ě 1, Proof. To see the above two expressions are always evaluated to the same value, we first massage the above equation. Multiplying 2 2k`2 on both sides, shifting the summation index and rearranging terms, we have k´1 ÿ i"0`2 i iȋ`1 2 2pk´iq " 2 2k`1´2ˆ2 k k˙.
2) Assume Eqn. (231) holds for a certain $k \ge 1$. We want to show it also holds for $k+1$.
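Identity (231) can also be confirmed directly for small $k$ (a quick check of ours; note that $\binom{2i}{i}/(i+1)$ is the $i$-th Catalan number, so the integer division below is exact):

```python
from math import comb

# sum_{i=0}^{k-1} C(2i,i)/(i+1) * 2^{2(k-i)} = 2^{2k+1} - 2*C(2k,k)
for k in range(1, 20):
    lhs = sum(comb(2 * i, i) // (i + 1) * 4 ** (k - i) for i in range(k))
    rhs = 2 ** (2 * k + 1) - 2 * comb(2 * k, k)
    assert lhs == rhs
print("verified for k = 1..19")
```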