
A Tropical Approach to the Compositional Piecewise Complexity of Words and Compressed Words

Philippe Schnoebelen, Université Paris-Saclay, CNRS, ENS Paris-Saclay, Laboratoire Méthodes Formelles, Gif-sur-Yvette, France; J. Veron, Université Paris-Saclay, CNRS, ENS Paris-Saclay, Laboratoire Méthodes Formelles, Gif-sur-Yvette, France; Isa Vialard, Max Planck Institute for Software Systems, Saarland Informatics Campus, Saarbrücken, Germany
Abstract

We express the piecewise complexity of words using tools and concepts from tropical algebra. This allows us to define a notion of piecewise signature of a word that has size log(n)·m^{O(1)} where m is the alphabet size and n is the length of the word. The piecewise signature of a concatenation can be computed from the signatures of its components, allowing a polynomial-time algorithm for computing the piecewise complexity of SLP-compressed words.

Keywords and phrases:
Tropical semiring, Subwords and subsequences, piecewise complexity, SLP-compressed words
Category:
Track B: Automata, Logic, Semantics, and Theory of Programming
Copyright and License:
© Philippe Schnoebelen, J. Veron, and Isa Vialard; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Formal languages and automata theory
Acknowledgements:
Work started while the third author was at Univ. Paris-Saclay.
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Subwords and piecewise complexity

For two words u, v, we say that u is a subword of v, written u ⊑ v, if u can be obtained by removing some letters (possibly all, possibly none) from v at arbitrary positions. For example, 𝙰𝙱𝙱𝙰 ⊑ 𝙰𝙱𝚁𝙰𝙲𝙰𝙳𝙰𝙱𝚁𝙰. This notion corresponds to subsequences in other areas, and is more general than the usual concept of factors of a word. (Some authors use the terminology "subword" for factors, and then use "scattered subword" for subwords.)

Compared to factors, subwords are not so easy to reason about, and correspondingly they have been much less studied. However, they certainly are a fundamental notion, appearing naturally in many areas of formal languages, algorithmics, data sciences, logic, computational linguistics, genomics, etc.

In language theory, semigroup theory, and logic, an important concept is piecewise-testability, introduced in 1972 by Imre Simon [27]. It corresponds exactly to the first level in the Straubing–Thérien hierarchy of first-order definable languages [6] and has been extended in many directions (infinite words [4], trees [2], arbitrary well-quasi-ordered sets [11]).

In this paper, we are interested in subword-based descriptive complexity of words. In particular, the piecewise-complexity of words and languages has been used in [15] for bounding the complexity of decidable fragments of the logic of subwords (see also [12]). In essence, the piecewise-complexity of a word (or any piecewise-testable object) is the minimum size of the pieces required to specify it. Regarding this measure, a recent algorithmic contribution is [26], providing efficient algorithms for computing the piecewise-complexity of individual words. The algorithms and the reasoning behind them are quite involved and it is not trivial to extend them for, e.g., periodic words [23].

It turns out that many of the computations performed in [26] have a simple description in the framework of tropical algebra, i.e., the (min,+)-semiring. Indeed, some fundamental operations of [26] are just linear transformations in a tropical sense. This realisation is the starting point for this paper.

Our contribution

In a first part, we rephrase the results of [26] in the language and notation of tropical algebra. This setting makes it easier to develop a compositional way of computing piecewise complexity: we associate with any word u a (piecewise) signature σ(u) that has size logarithmic in |u| (and polynomial in the size of the alphabet), and that contains enough information to allow extracting, among other things, the piecewise complexity of u. Furthermore, signatures can be computed compositionally, i.e., σ(uv) is obtained in polynomial time from σ(u) and σ(v).

In a second part we use this apparatus to compute the piecewise complexity of compressed words in time polynomial in the size of their compressed description, solving a problem raised in [25]. Reasoning on (e.g., searching) and handling (e.g., editing) compressed data structures is an important field of algorithmics and compressed words are one of the most fundamental and most fruitful instantiations of the paradigm, see [17]. While our algorithm is a simple derivation of the results from the first part, its complexity analysis is quite involved and is the main technical difficulty in this paper. We expect that the techniques behind piecewise signatures can be useful for other subword-related problems on compressed words.

We hope to convince our readers that our contribution is made more elegant and technically simpler through the use of tropical linear and multilinear algebra. This allows one to recruit familiar notions of vectors, matrices, their products, their monotonicity, thus lightening the brain load required for following through our reasoning. Tropical algebra was introduced in the area of formal languages and automata by Imre Simon himself (in [29], see also [21]) in connection with the star-height problem, opening the way to the algebraic and algorithmic theory of weighted automata, now a very active area [7, 18]. In this respect it is surprising that Simon did not use tropical algebra in his fundamental work on piecewise-testable languages [27, 28] on which we heavily rely and whose bases we rephrase tropically.

Related work

Related to the piecewise-complexity of words is the piecewise-distance δ(u,v) of two words, which has been studied more and is the topic of several recent papers [8, 1, 10]. Indeed the piecewise-complexity of u can be defined as 1 + max{δ(u,v) | v ≠ u} [26], but this characterisation does not provide useful algorithms. The range of related questions includes reconstructing words from their subwords [16], solving word equations [20], or computing edit-distance [20, 5], etc., modulo Simon's congruence.

Outline of the paper

Section 2 recalls the basic concepts, notations, and results used in the paper. Section 3, the main conceptual contribution, shows how to use tropical algebra for computing piecewise complexity compositionally, and introduces piecewise signatures with their associated algorithms. Section 4 applies this to SLP-compressed words. Finally, Section 5 handles the complexity analysis and contains the main technical difficulties. For readability, some proofs are omitted from the main text and can be found in the Appendix.

2 Basic notions

2.1 Subwords and the piecewise complexity of words

We follow [26]. For a word u = a1a2⋯an of length |u| = n and two positions i, j such that 0 ≤ i ≤ j ≤ n, we write u(i,j) for the factor ai+1ai+2⋯aj. Note that u(0,|u|) = u, that u(i1,i2)u(i2,i3) = u(i1,i3), and that |u(i,j)| = j−i. We write u(i) as shorthand for u(i−1,i), i.e., ai, the ith letter of u. Finally, we use u(i,) and u(,j) as convenient notations for u(i,n) and, respectively, u(0,j): they are the ith suffix and the jth prefix of u, with the understanding that this numbering starts with a "0th" prefix, always equal to the empty word ϵ, and that suffixes get shorter and shorter as their index grows while the length of prefixes increases.

Fix an m-letter alphabet Σ = {a1,a2,…,am}. A word u ∈ Σ* is a subword of v ∈ Σ*, written u ⊑ v, if there is a factorization v = v0u1v1u2v2⋯uℓvℓ with factors v0,…,vℓ,u1,…,uℓ ∈ Σ* and such that u = u1⋯uℓ [24]. When furthermore ℓ = 1 then u is also a factor of v. For example, both u1 = 𝙰𝙰 and u2 = 𝙱𝙱 are subwords of v = 𝙰𝙱𝙱𝙰 but only u2 is a factor of v. The longest subword of u is u itself, its shortest subword is the empty word, denoted ϵ. Subwords are the usual concept of subsequences in the special case of words, i.e., finite sequences of letters.

For given m, we write v ∼m v′ when v and v′ have the same subwords of length at most m [27]. For example 𝙰𝙱𝙱𝙰 ∼2 𝙱𝙰𝙱𝙰 while 𝙰𝙱𝙱𝙰 ≁3 𝙱𝙰𝙱𝙰 since, e.g., 𝙰𝙱𝙱 ⊑ 𝙰𝙱𝙱𝙰 and 𝙰𝙱𝙱 ⋠ 𝙱𝙰𝙱𝙰. In situations where u is a subword of only one among two words v and v′, we say that u is a (piecewise) distinguisher (also a separator) between v and v′.

Then v ∼k v′ means that v and v′ have no distinguisher of length k or less, or that their shortest distinguisher (guaranteed to exist when v ≠ v′) has length at least k+1. Each ∼k is an equivalence relation over Σ*, with ∼0 = Σ*×Σ* being the trivial "always true" relation, and u ∼1 v holding if, and only if, u and v use the same subset of letters. Every ∼k has finite index [28, 14] and is a refinement of the previous ∼k−1, i.e., u ∼k v implies u ∼k−1 v, with ⋂k ∼k = Id. One speaks of "Simon's congruence" in view of u ∼k u′ ∧ v ∼k v′ ⟹ uv ∼k u′v′.

The piecewise complexity of a word u, denoted h(u), is the smallest k such that u ∼k v implies u = v [15, 26]. In terms of distinguishers, h(u) = k means that any v ≠ u can be separated from u by a piecewise distinguisher of length at most k. We refer to [15, 26] and the references therein for more motivations and applications of piecewise complexity.

2.2 Computing h(u)

An O(|u|·|Σ|) time algorithm computing h(u) is given in [26]. It is based on a characterisation of h(u) in terms of piecewise-distances. For a word u ∈ Σ* and a letter a ∈ Σ, we follow [27, 28] and define the right and left "side distances":

r(u,a) =def max{ k | u ∼k ua },   ℓ(a,u) =def max{ k | au ∼k u }.

The two notions are mirror of one another: ℓ(a,u) = r(u𝖱,a), where u𝖱 denotes the reverse, or the mirror image, of u.

One can reduce the computation of h to the computation of r and ℓ in view of the following characterisation (see [26, Prop. 3.1]):

h(u) = max{ r(u(,i),a) + ℓ(a,u(i,)) + 1 | 0 ≤ i ≤ |u|, a ∈ Σ }. (1)

In order to compute r(u,a), Schnoebelen and Vialard introduce the following characterisation:

Lemma 2.1 ([26, Lem. 3.8]).

For any word uΣ and any letters a,bΣ, the following hold:

r(ϵ,a) = 0, (2)
r(ub,a) = r(u,a)+1 if a = b, and min(r(u,b)+1, r(u,a)) if a ≠ b. (3)

Following [26], the r-table of u, denoted Ru, is the m-by-(n+1) matrix with entry r(u(,j),ai) in column j (for 0 ≤ j ≤ n) and row i (for 1 ≤ i ≤ m), where we assume some fixed ordering {a1,…,am} of the letters in Σ. There is a similar ℓ-table, denoted Lu, with entries ℓ(ai,u(j,)). The key point is that, by Eq. (1), computing h(u) amounts to finding the maximum value in the array Ru+Lu obtained by summing the two matrices.

Example 2.2.

Here are Ru and Lu for u = 𝙲𝙱𝙱𝙲𝙰𝙲, assuming Σ = {𝙰,𝙱,𝙲} (columns j = 0,…,6):

Ru:  𝙰: 0 0 0 0 0 1 1     Lu:  𝙰: 1 1 1 1 1 0 0
     𝙱: 0 0 1 2 2 1 1          𝙱: 2 2 1 0 0 0 0
     𝙲: 0 1 1 1 2 1 2          𝙲: 2 1 1 2 1 1 0

With Eq. (1), one sees that h(u) = 4, obtained as r(u(,4),𝙲) + ℓ(𝙲,u(4,)) + 1 = 2+1+1 = 4. This maximum is not uniquely attained: r(u(,3),𝙲) + ℓ(𝙲,u(3,)) + 1 = 1+2+1 = 4. Note that 𝙲𝙱𝙱𝙲𝙲𝙰𝙲 ∼3 𝙲𝙱𝙱𝙲𝙰𝙲 = u, so that indeed h(u) is necessarily larger than 3.
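For concreteness, the tables above can be recomputed with a short script implementing the recurrence of Lemma 2.1 and Eq. (1). This is only a sketch (function names are ours), not the optimized O(|u|·|Σ|) algorithm of [26]:

```python
def r_columns(u, alphabet):
    # Columns C_0,...,C_n of the r-table R_u, via Lemma 2.1:
    # r(eps,a) = 0;  r(ub,a) = r(u,a)+1 if a == b, else min(r(u,b)+1, r(u,a))
    col = {a: 0 for a in alphabet}
    cols = [col]
    for b in u:
        col = {a: (col[a] + 1) if a == b else min(col[b] + 1, col[a])
               for a in alphabet}
        cols.append(col)
    return cols

def h(u, alphabet):
    # Eq. (1); the l-table is obtained by mirroring: l(a, u(j,)) = r((u(j,))^R, a)
    R = r_columns(u, alphabet)
    Rrev = r_columns(u[::-1], alphabet)   # r-table of the mirror word
    n = len(u)
    return 1 + max(R[j][a] + Rrev[n - j][a]
                   for j in range(n + 1) for a in alphabet)

print(h("CBBCAC", "ABC"))   # -> 4, as computed above
```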

3 Tropical algebra and piecewise complexity

3.1 The tropical semiring

By tropical algebra we mean the semiring ⟨ℕ∪{+∞}; min, +⟩, with +∞ as zero and 0 as unit, also called the (min,+)-semiring. It is a semiring rather than a ring because there is no additive inverse and no subtraction. In the following we use ∞ as a less cumbersome notation for +∞ (it is common to use ϵ as a light notation for the zero in tropical rings, but we reserve ϵ to denote the empty word), and ℕ∞ as shorthand for ℕ∪{∞} (and also for the whole structure). We refer to [9, 22, 18] for more in-depth introductions and simply observe that, modulo a sign change, (min,+)-semirings are isomorphic to the more familiar (max,+)-semirings one typically encounters in operations research [13], combinatorics [3], economics, etc.

We use the standard linear ordering ≤ on ℕ∞, given by 0 < 1 < 2 < ⋯ < i < i+1 < ⋯ < ∞. Both semiring operations are monotonic w.r.t. ≤: for any x,x′,y,y′ ∈ ℕ∞, x ≤ x′ and y ≤ y′ together imply min(x,y) ≤ min(x′,y′) and x+y ≤ x′+y′. Finally, note that ∞, the zero of ℕ∞, is also its maximal element.

Tropical semirings lead to their own flavour of linear algebra. In ℕ∞, a linear combination of variables 𝒙 = (x1,…,xk) has the form

f(x1,…,xk) = min{ αj + xj | j = 1,…,k },

where the αj's are scalars from ℕ∞. Such an f can be represented by a line vector (α1 α2 ⋯ αk), simply written 𝜶, so that f(𝒙) is the (tropical) dot product 𝜶𝒙.

Since ℕ∞ is only a commutative semiring and not a field, these linear combinations do not form a proper vector space, not even a module, but only a semimodule over ℕ∞. However, and for our limited purposes, the usual notions of vector and matrix sums and products work as expected so the reader should be on familiar ground. The next paragraphs introduce the notations we need and collect the main properties we shall rely on.

We will often use the vector 𝒆 =def (0 0 ⋯ 0), made up of units only (not zeroes) and with dimension m to be inferred from context. We write 𝒆i for the vector (∞ ⋯ ∞ 0 ∞ ⋯ ∞) that has a single unit in position i, surrounded by zeroes. Thus 𝒆i is the ith column of the tropical identity matrix Im and 𝒆 is its diagonal. As expected, (𝒆i)i=1,…,m is a base for ℕ∞^m, the free m-dimensional ℕ∞-semimodule.

In ℕ∞, a linear application is a mapping F: ℕ∞^m → ℕ∞^n of the form F(x1,…,xm) = (f1(x1,…,xm),…,fn(x1,…,xm)) where each fi is a (tropical) linear combination. As in classical linear algebra, F can be represented by an n-by-m matrix MF, where row i contains the scalars defining fi, so that F(𝒙) = MF𝒙.

The ordering ≤ on ℕ∞ leads to a partial ordering ≤ on tropical vectors and matrices. For two vectors 𝜶,𝜷 ∈ ℕ∞^m, we write 𝜶 ≤ 𝜷 when αi ≤ βi for all i = 1,…,m. Similarly, given two tropical matrices A and B of same heights and widths, we write A ≤ B when each value Ai,j in A is dominated by the corresponding Bi,j in B. Then A ≤ B and 𝜶 ≤ 𝜷 together imply A𝜶 ≤ B𝜷, i.e., "linear applications are monotonic". Note that the tropical sums of vectors 𝜶,𝜷 or matrices M,M′ are computed by applying min component-wise and coincide with infimums, or meets, in the lattices (ℕ∞^m, ≤) and (ℕ∞^{m×n}, ≤): for this reason we denote them with 𝜶∧𝜷 and M∧M′ instead of "min(𝜶,𝜷)" or "min(M,M′)" which suggests picking the smaller vector or matrix. We still use min for scalars since, in that case, the two interpretations coincide.
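The handful of tropical operations used in the rest of the paper can be spelled out in a few lines of Python, with float("inf") playing the role of ∞ (a sketch; the function names are ours):

```python
INF = float("inf")          # infinity: the tropical zero, and the maximal element

def tdot(alpha, x):
    """Tropical dot product: min_i (alpha_i + x_i)."""
    return min(a + b for a, b in zip(alpha, x))

def tmatmul(A, B):
    """Tropical matrix product: (AB)[i][j] = min_k (A[i][k] + B[k][j])."""
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# monotonicity: increasing an input can only increase the output
print(tdot((1, 0, 2), (3, 5, 0)))   # min(1+3, 0+5, 2+0) = 2
print(tdot((1, 0, 2), (3, 5, 1)))   # min(4, 5, 3) = 3
```

Note that the tropical identity matrix (0 on the diagonal, ∞ elsewhere) is indeed neutral for tmatmul.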

3.2 A tropical view of the r- and ℓ-tables

The starting point of this paper is the realisation that Lemma 2.1 actually describes the j+1th column of Ru as a tropically linear image of its jth column. For this we associate a tropical matrix with each word over Σ.

Definition 3.1.

The extension matrix associated with a letter aΣ={a1,,am} is the m-by-m matrix, denoted Ma, given by

Ma[i,j] = 1 if a = ai;  0 if i = j and a ≠ ai;  ∞ otherwise. (4)

The extension matrix associated with a word u = u(1)u(2)⋯u(n) over Σ is the (tropical) product Mu(1)Mu(2)⋯Mu(n), written Mu. In particular Mϵ is Im, the identity matrix of order m.
Finally, the right-to-left extension matrix Nu is Mu𝖱, i.e., the extension matrix of the mirror word.

Example 3.2.

Assuming Σ={𝙰,𝙱,𝙲}, the above definition leads to

M𝙰 = (1 1 1 ; ∞ 0 ∞ ; ∞ ∞ 0),  M𝙱 = (0 ∞ ∞ ; 1 1 1 ; ∞ ∞ 0),  M𝙲 = (0 ∞ ∞ ; ∞ 0 ∞ ; 1 1 1),
for letters (rows separated by semicolons), and
M𝙲𝙱𝙱𝙲𝙰𝙲 = (1 1 2 ; 2 2 3 ; 2 2 3),  N𝙲𝙱𝙱𝙲𝙰𝙲 = (1 3 3 ; 1 2 2 ; 2 3 3),

for the word u = 𝙲𝙱𝙱𝙲𝙰𝙲. Note that Mu and Nu have no clear relation (Mu = Mv does not entail Nu = Nv).
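These matrices are easily reproduced programmatically. The following sketch (helper names are ours) builds Ma from Eq. (4) and checks the two products above:

```python
INF = float("inf")   # stands for the infinite entries of the matrices

def ext_matrix(a, alphabet):
    # Eq. (4): the row of letter a is all 1s; other rows have 0 on the diagonal
    m = len(alphabet)
    return [[1 if alphabet[i] == a else (0 if i == j else INF)
             for j in range(m)] for i in range(m)]

def tmatmul(A, B):
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def word_matrix(u, alphabet):
    m = len(alphabet)
    M = [[0 if i == j else INF for j in range(m)] for i in range(m)]  # M_eps = I_m
    for a in u:
        M = tmatmul(M, ext_matrix(a, alphabet))
    return M

Mu = word_matrix("CBBCAC", "ABC")
Nu = word_matrix("CBBCAC"[::-1], "ABC")
print(Mu)   # [[1, 1, 2], [2, 2, 3], [2, 2, 3]]
print(Nu)   # [[1, 3, 3], [1, 2, 2], [2, 3, 3]]
```

As shown next (Lemma 3.3), the column minima of Mu, here (1, 1, 2), give the last column of the r-table of Example 2.2.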

We may now return to the table Ru. Let us write Cu,j for its column number j, recalling that the first column has index 0, and use Cu as shorthand for Cu,|u|, i.e., the last column. When u is understood, we usually just write Cj for Cu,j. Finally, we shall mostly display and use the Cj's as line vectors without bothering to write the transposition sign everywhere.

By definition, Cj is (r(u(,j),a1) r(u(,j),a2) ⋯ r(u(,j),am)). In particular C0 = 𝒆.

Lemma 3.3.

For any word u and any factor u(j,j′) of u,

Cj′ = Cj Mu(j,j′).

In particular, Cu = 𝒆Mu.

Proof.

By induction on j′−j, the length of the factor of u. The base case j′ = j is clear since Mu(j,j), i.e., Mϵ, is the identity matrix. If j′ = j+1 then u(j,j′) is the letter u(j′) = ai for some 1 ≤ i ≤ m. By definition we have that for all 1 ≤ k ≤ m, the kth entry of column j is (Cj)k = r(u(,j),ak) and likewise, the kth entry of column j+1 is (Cj+1)k = r(u(,j+1),ak). Therefore, by Eq. (3),

(Cj+1)k = r(u(,j),ai)+1 = (Cj)i+1 if i = k, and
(Cj+1)k = min(r(u(,j),ai)+1, r(u(,j),ak)) = min((Cj)i+1, (Cj)k) if i ≠ k,

that is, Cj+1 = Cj Mai.

Finally, if j′ > j+1, then for any j″ strictly between j and j′ we can derive

Cj Mu(j,j′) = Cj Mu(j,j″)u(j″,j′) = Cj Mu(j,j″) Mu(j″,j′) = Cj″ Mu(j″,j′) = Cj′,

using Definition 3.1 for the second equality and the induction hypothesis (twice) for the last two.

Mirror results hold for Lu once we introduce the required notations. So let us write Du,j for the jth column of Lu (counting from left and starting with j = 0) with standard abbreviations Dj for Du,j when u is understood, and Du for Du,0. Then for any factor u(j,j′) of u we have Dj = Dj′ Nu(j,j′), and in particular Du = 𝒆Nu.

3.3 Compositional computation of Ru

The key to computing r- and -tables compositionally is to consider Rvu and Luv as functions of Rv and Lv.

Definition 3.4 (f-r-table).

The functional r-table, denoted Fu, associated with a word u ∈ Σ* is an m-by-(n+1) matrix with value

𝜹i,j=defMu(,j)𝒆i (5)

in position (i,j).

In other words, 𝜹i,j is the ith column of the extension matrix for the jth prefix of u. Note that Fu is a table of vectors where Ru was a table of values.

Now Fu can be used to compute Ru via r(u(,j),ai)=𝒆𝜹i,j (Lemma 3.3), but more generally it provides the r(w,ai)’s for any word w that has some u(,j) as a suffix. Indeed, assume w=vu(,j):

Lemma 3.5.

For any words u,v ∈ Σ*, letter ai ∈ Σ, and position 0 ≤ j ≤ |u|:

r(vu(,j),ai)=𝜹i,jCv. (6)
Proof.

On one hand, r(vu(,j),ai) is the i-value in the last column of Rvu(,j), i.e., is Cvu(,j)𝒆i. Now

Cvu(,j)=𝒆MvMu(,j)=CvMu(,j).

We conclude since 𝜹i,j is just Mu(,j)𝒆i.

Mirror notions and results exist for ℓ. The f-ℓ-table associated with u is an m-by-(n+1) matrix, denoted Gu, and having value

𝜸i,j=defNu(j,)𝒆i (7)

in position (i,j), i.e., 𝜸i,j is the ith column of Nu(j,). This ensures

ℓ(ai, u(j,)v) = 𝜸i,jDv. (8)
Example 3.6.

Taking u=𝙲𝙱𝙱𝙲𝙰𝙲 as in Example 2.2, Fu and Gu are

Fu: 𝙰: (0 ∞ ∞) (0 ∞ 1) (0 1 1) (0 1 1) (0 1 1) (1 2 2) (1 2 2)
    𝙱: (∞ 0 ∞) (∞ 0 1) (∞ 1 2) (∞ 2 3) (∞ 2 2) (1 2 2) (1 2 2)
    𝙲: (∞ ∞ 0) (∞ ∞ 1) (∞ 1 1) (∞ 1 1) (∞ 2 2) (1 2 2) (2 3 3)

Gu: 𝙰: (1 1 2) (1 1 2) (1 1 2) (1 ∞ 2) (1 ∞ 2) (0 ∞ 1) (0 ∞ ∞)
    𝙱: (3 2 3) (3 2 3) (2 1 2) (1 0 1) (1 0 1) (∞ 0 1) (∞ 0 ∞)
    𝙲: (3 2 3) (2 1 2) (2 1 2) (2 ∞ 2) (1 ∞ 1) (∞ ∞ 1) (∞ ∞ 0)

(columns j = 0,…,6; the entries 𝜹i,j and 𝜸i,j are the vectors written between parentheses).

3.4 Summing r- and ℓ-tables

In order to compute the sums r(u(,i),a) + ℓ(a,u(i,)) required by Equation (1), we now want to combine the two tables, as done in Example 2.2. Recall, however, that these + are actually products in ℕ∞, and that the vectors 𝜹i,j and 𝜸i,j actually encode the linear maps C ↦ 𝜹i,jC and D ↦ 𝜸i,jD in Eqs. (6) and (8).

Given two tropically linear functions f(𝒙) = 𝜶𝒙 and g(𝒚) = 𝜷𝒚 from ℕ∞^m to ℕ∞, the tropical product fg, defined via (fg)(𝒙,𝒚) =def f(𝒙) + g(𝒚), satisfies

(fg)(𝒙,𝒚) = (𝜶𝒙) + (𝜷𝒚) = (min{αi + xi | 1 ≤ i ≤ m}) + (min{βj + yj | 1 ≤ j ≤ m})
           = min{ αi + βj + xi + yj | 1 ≤ i,j ≤ m }.

It is (tropically) bilinear since (fg)(𝒙∧𝒙′, 𝒚) = (fg)(𝒙,𝒚) ∧ (fg)(𝒙′,𝒚) and, for a scalar α, (fg)(𝒙, α𝒚) = α + (fg)(𝒙,𝒚). (Here α𝒚 is a tropical scaling of a vector, which amounts to increasing every value in 𝒚 by α.) Such mappings are the natural objects of multilinear algebra, see e.g. [19]. Moving to the tensor space ℕ∞^m ⊗ ℕ∞^m, we can see fg, a mapping from ℕ∞^{2m} to ℕ∞, as a tropically linear mapping from ℕ∞^{m²} to ℕ∞, denoted f⊗g, and that associates (𝜶⊗𝜷)(𝒙⊗𝒚) with the tensor 𝒙⊗𝒚.

Adapted to tropical algebra, the elementary tensor 𝜿 = 𝜶⊗𝜷 obtained as outer product of two vectors of size m and n, respectively, is the m-by-n matrix given by κi,j = αi + βj. It is natural to see 𝜿 as a matrix but sometimes, as in the expression (𝜶⊗𝜷)(𝒙⊗𝒚), it is more natural to see the tensors as vectors of size nm. (The size-nm vector is obtained by reading the matrix column by column, see Example 3.9 below.)
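A two-line illustration of the tropical outer product and of the column-by-column flattening (the values are our own toy example):

```python
def outer(alpha, beta):
    # elementary tensor: kappa[i][j] = alpha_i + beta_j
    return [[a + b for b in beta] for a in alpha]

kappa = outer((0, 1), (2, 0))
print(kappa)                                              # [[2, 0], [3, 1]]
flat = [kappa[i][j] for j in range(2) for i in range(2)]  # column by column
print(flat)                                               # [2, 3, 0, 1]
```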

Definition 3.7 (Summing Fu and Gu).

The functional r+ℓ-table associated with a word u ∈ Σ* is the m-by-(n+1) matrix, denoted Hu, having 𝜼i,j =def 𝜹i,j ⊗ 𝜸i,j in position (i,j).

In the above definition, 𝜹i,j and 𝜸i,j are as in the definitions of Fu and Gu. Finally, every 𝜼i,j entry in Hu is an m-by-m tropical tensor, or, equivalently, a tropical vector of size m².

Let us now consider a word w that contains u as a factor, i.e., assume that w is some v1uv2. For a position j in u and a letter ai in Σ, one obtains

r(v1u(,j),ai) + ℓ(ai,u(j,)v2) = 𝜼i,j (Cv1 ⊗ Dv2), (9)

by combining Eqs. (6) and (8) with the definition of 𝜼i,j.

If now one wants to find the maximum value among all r(v1u(,j),ai) + ℓ(ai,u(j,)v2), it is not necessary to consider all possibilities for j = 0,…,|u| and i = 1,…,m. Since Cv1 ⊗ Dv2 does not change with i and j, and since the dot-product in Eq. (9) is monotonic with respect to the component-wise order, we do not need the full Hu table to compute h(u) but only a set of its maximal elements. This leads to:

Definition 3.8 (Piecewise signatures).

Assume a fixed alphabet Σ={a1,,am}.

  • The set of residual tensors associated with a word u ∈ Σ* is the set Eu = {𝜼, 𝜼′, …} of all maximal elements in Hu.

  • The (piecewise) signature of u is the triple σ(u)=def(Eu,Mu,Nu) collecting Eu and the two extension matrices.

Note that Eu is never empty and usually contains several elements since the ordering ≤ on ℕ∞^{m²} is only a partial ordering when m > 1.

Example 3.9.

We continue with u = 𝙲𝙱𝙱𝙲𝙰𝙲. Here Hu contains 3×7 = 21 tensors obtained as the pairwise products 𝜹i,j ⊗ 𝜸i,j of the vectors from Fu and Gu listed earlier, see Example 3.6. It has seven (pairwise incomparable) maximal elements. They are

𝜼𝙰,0 = (0 ∞ ∞) ⊗ (1 1 2),
𝜼𝙱,0 = (∞ 0 ∞) ⊗ (3 2 3),
𝜼𝙲,0 = (∞ ∞ 0) ⊗ (3 2 3),
𝜼𝙲,3 = (∞ 1 1) ⊗ (2 ∞ 2),
𝜼𝙲,5 = (1 2 2) ⊗ (∞ ∞ 1),
𝜼𝙰,6 = (1 2 2) ⊗ (0 ∞ ∞),
𝜼𝙱,6 = (1 2 2) ⊗ (∞ 0 ∞).

Note that 𝜼𝙲,5 coincides with 𝜼𝙲,6, obtained as (2 3 3) ⊗ (∞ ∞ 0): however Eu only contains one copy of each maximal tensor. Similarly one has 𝜼𝙲,0 = 𝜼𝙲,1 and 𝜼𝙲,3 = 𝜼𝙲,4.

The following lemma, needed in Section 4, can be seen as another example.

Lemma 3.10.

For a letter a ∈ Σ where |Σ| = m, Ea contains 2m−1 maximal tensors.

Proof.

Assume a = as. There are 2m vectors 𝜹i,j in Fa, with 1 ≤ i ≤ m and 0 ≤ j ≤ 1. The table Ga is the mirror image, i.e., 𝜸i,j = 𝜹i,1−j. The 2m tensors in Ha are all incomparable except for 𝜹s,0 ⊗ 𝜸s,0 = 𝜹s,1 ⊗ 𝜸s,1, hence the claim.
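The count 2m−1 can be double-checked by brute force on a three-letter alphabet (a sketch; the helper names are ours):

```python
INF = float("inf")

def ext_matrix(a, alphabet):
    m = len(alphabet)
    return [[1 if alphabet[i] == a else (0 if i == j else INF)
             for j in range(m)] for i in range(m)]

def column(M, i):
    return [row[i] for row in M]

def outer(d, g):
    return tuple(dp + gq for dp in d for gq in g)

def maximal(tensors):
    ts = set(tensors)
    return {t for t in ts
            if not any(s != t and all(x <= y for x, y in zip(t, s)) for s in ts)}

alphabet = "ABC"
m = len(alphabet)
I = [[0 if i == j else INF for j in range(m)] for i in range(m)]
counts = {}
for a in alphabet:
    Ma = ext_matrix(a, alphabet)
    # j = 0: delta_i = column i of I,   gamma_i = column i of N_a = M_a
    # j = 1: delta_i = column i of M_a, gamma_i = column i of I
    tensors = [outer(column(I, i), column(Ma, i)) for i in range(m)] + \
              [outer(column(Ma, i), column(I, i)) for i in range(m)]
    counts[a] = len(maximal(tensors))
print(counts)   # {'A': 5, 'B': 5, 'C': 5}: 2m - 1 maximal tensors each
```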

The signature of a word u, and more precisely the tensors in Eu, provides enough information to compute h(u):

Theorem 3.11 (Piecewise complexity from signatures).

For any word u ∈ Σ*,

h(u) = 1 + max{ 𝜼𝒆 | 𝜼 ∈ Eu }. (10)
Proof.

The proof is simple. Starting with Eq. (1), one has

h(u) = max{ 1 + r(u1,ai) + ℓ(ai,u2) | u = u1u2, ai ∈ Σ }
     = 1 + max{ r(u(,j),ai) + ℓ(ai,u(j,)) | 0 ≤ j ≤ |u|, ai ∈ Σ }
     = 1 + max{ 𝜼i,j (Cϵ ⊗ Dϵ) | 0 ≤ j ≤ |u|, ai ∈ Σ } = 1 + max{ 𝜼i,j (𝒆⊗𝒆) | 0 ≤ j ≤ |u|, ai ∈ Σ },
using Eq. (9) with v1,v2 = ϵ,
     = 1 + max{ 𝜼i,j 𝒆 | 0 ≤ j ≤ |u|, ai ∈ Σ },
where it is understood that now 𝒆 denotes a size-m² tensor,
     = 1 + max{ 𝜼𝒆 | 𝜼 ∈ Eu },

since dot-product is monotonic w.r.t. ≤. Note that, for 𝜼 an m²-sized vector, the product 𝜼𝒆 in Eq. (10) is just the minimal entry in 𝜼. However the notation 𝜼𝒆 is quite convenient, and we shall extend the above reasoning with tensors different from 𝒆.

Example 3.12.

Continuing with u = 𝙲𝙱𝙱𝙲𝙰𝙲, the tensor in Eu that has the largest minimal component is 𝜼𝙲,3 (see Example 3.9), yielding 𝜼𝙲,3𝒆 = 3 and h(u) = 4, in agreement with Example 2.2.
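The whole pipeline of this section — extension matrices, the tensors of Hu, the maximal set Eu, and Theorem 3.11 — fits in a short script (a sketch; helper names are ours):

```python
INF = float("inf")

def ext_matrix(a, alphabet):
    m = len(alphabet)
    return [[1 if alphabet[i] == a else (0 if i == j else INF)
             for j in range(m)] for i in range(m)]

def tmatmul(A, B):
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def prefix_matrices(u, alphabet):
    m = len(alphabet)
    M = [[0 if i == j else INF for j in range(m)] for i in range(m)]
    out = [M]
    for a in u:
        M = tmatmul(M, ext_matrix(a, alphabet))
        out.append(M)
    return out

def residual_tensors(u, alphabet):
    """E_u: the maximal tensors of H_u (Definition 3.8)."""
    m, n = len(alphabet), len(u)
    Ms = prefix_matrices(u, alphabet)          # M_{u(,j)}
    Ns = prefix_matrices(u[::-1], alphabet)    # N_{u(j,)} is M of the reversed suffix
    ts = set()
    for j in range(n + 1):
        for i in range(m):
            delta = [row[i] for row in Ms[j]]          # i-th columns
            gamma = [row[i] for row in Ns[n - j]]
            ts.add(tuple(d + g for d in delta for g in gamma))
    return {t for t in ts
            if not any(s != t and all(x <= y for x, y in zip(t, s)) for s in ts)}

E = residual_tensors("CBBCAC", "ABC")
print(len(E))                        # 7 maximal residual tensors, as in Example 3.9
print(1 + max(min(t) for t in E))    # h(u) = 4, by Theorem 3.11
```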

For complexity analysis we shall assume that signatures are simply represented as two m-by-m matrices together with a collection of m²-sized residual tensors. The values contained in these matrices and tensors are usually encoded using a logarithmic number of bits (e.g., they are written in binary).

Theorem 3.13 (Space complexity of piecewise signatures).

The size |σ(u)| of the signature of a word u ∈ Σ* is in log(n)·m^{O(1)} where n = |u| and m = |Σ|.

Proof.

Let us write #Eu for the cardinal of Eu. Now σ(u) contains two matrices and #Eu tensors, each made up of m² scalars. These scalars are bounded by |u| (or are ∞) so each only uses O(log n) space. It is therefore enough to show that #Eu is bounded by a polynomial in m. This is done in Section 5 where we prove #Eu ≤ 2m(2m+1) (see Theorem 5.1).

3.5 Combining piecewise signatures

Since the objects in signatures can accommodate concatenation, it is possible to compute them compositionally. The main result of this section is the following:

Theorem 3.14 (Compositionality).

The signature σ(uv) of the concatenation of two words u,vΣ can be computed from σ(u) and σ(v).
Furthermore the computation of σ(uv) can be done in time polynomial in |σ(u)|+|σ(v)|+|Σ|.

The proof is developed in several steps.

Lemma 3.15.

Let w = uv. For 0 ≤ j ≤ |u|, the tensor (Hw)i,j at position (i,j) in Hw is (Im⊗Nv)(Hu)i,j. Similarly, for 0 ≤ j ≤ |v|, the tensor (Hw)i,|u|+j in Hw is (Mu⊗Im)(Hv)i,j.

The above statement uses tensor products of linear maps. Recall that when f,g: ℕ∞^n → ℕ∞^m are two linear maps, their tensor product f⊗g: ℕ∞^{n²} → ℕ∞^{m²} can be defined via (f⊗g)(∧i 𝜶i⊗𝜷i) = ∧i f(𝜶i)⊗g(𝜷i). When f and g are given by two n-by-m matrices A and B, one can obtain an n²-by-m² matrix, denoted A⊗B, for f⊗g by taking the products of the rows of A and the rows of B (as expected since (𝒆i⊗𝒆j)1≤i,j≤n is a base of ℕ∞^n ⊗ ℕ∞^n).

Proof (of Lemma 3.15).

We only prove the first claim since the second can be obtained directly by mirroring. Assume 0 ≤ j ≤ |u|. By definition, (Hw)i,j is (Fw)i,j ⊗ (Gw)i,j. Now, by Eq. (5),

(Fw)i,j = Mw(,j)𝒆i = Mu(,j)𝒆i = (Fu)i,j,

while, by Eq. (7),

(Gw)i,j = Nw(j,)𝒆i = Nu(j,)v𝒆i = Nv Nu(j,)𝒆i = Nv (Gu)i,j.

Thus

(Hw)i,j = (Fu)i,j ⊗ (Nv (Gu)i,j) = (Im⊗Nv)((Fu)i,j ⊗ (Gu)i,j) = (Im⊗Nv)(Hu)i,j

as claimed.

Lemma 3.16.

Let w = uv. The tensors in Ew are exactly the maximal tensors in E =def {(Im⊗Nv)𝜼 | 𝜼 ∈ Eu} ∪ {(Mu⊗Im)𝜼 | 𝜼 ∈ Ev}.

Proof.

First we claim that Ew ⊆ E. Indeed, consider (Hw)i,j ∈ Ew. We shall assume 0 ≤ j ≤ |u| since the other case is symmetric. By Lemma 3.15, (Hw)i,j is (Im⊗Nv)(Hu)i,j. If (Hu)i,j ∈ Eu we have proved (Hw)i,j ∈ E. If, on the other hand, (Hu)i,j ∉ Eu, then there is some maximal (Hu)i′,j′ ∈ Eu that dominates (Hu)i,j. Since linear applications are monotonic we deduce

(Hw)i,j = (Im⊗Nv)(Hu)i,j ≤ (Im⊗Nv)(Hu)i′,j′ = (Hw)i′,j′.

Thus, and since (Hw)i,j is a maximal element of Hw, necessarily (Hw)i,j and (Hw)i′,j′ coincide, so (Hw)i,j is indeed the image by Im⊗Nv of a tensor from Eu, hence is in E.

Now, Lemma 3.15 entails that E is included in Hw. So Ew = max Hw, Ew ⊆ E, and E ⊆ Hw. We conclude that Ew and max E coincide.

Proof of Theorem 3.14.

With σ(u) = (Eu,Mu,Nu) and σ(v) = (Ev,Mv,Nv), one obtains σ(w) for w = uv as follows:
Mw and Nw are the matrix products MuMv and NvNu;
Ew is obtained by computing E (from Lemma 3.16) and removing the non-maximal elements.
All these operations clearly take polynomial time.
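Lemma 3.16 can be exercised on the running example by splitting 𝙲𝙱𝙱𝙲𝙰𝙲 as u = CBB and v = CAC. The sketch below (helper names are ours) implements the two tensor actions Im⊗Nv and Mu⊗Im and compares the composed Ew against a direct computation:

```python
INF = float("inf")

def ext_matrix(a, ab):
    m = len(ab)
    return [[1 if ab[i] == a else (0 if i == j else INF)
             for j in range(m)] for i in range(m)]

def tmatmul(A, B):
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def identity(m):
    return [[0 if i == j else INF for j in range(m)] for i in range(m)]

def word_matrix(u, ab):
    M = identity(len(ab))
    for a in u:
        M = tmatmul(M, ext_matrix(a, ab))
    return M

def maximal(ts):
    return {t for t in ts
            if not any(s != t and all(x <= y for x, y in zip(t, s)) for s in ts)}

def residual_tensors(u, ab):      # E_u, computed directly from H_u
    m, n = len(ab), len(u)
    Ms, Ns = [identity(m)], [identity(m)]
    for a in u:
        Ms.append(tmatmul(Ms[-1], ext_matrix(a, ab)))
    for a in u[::-1]:
        Ns.append(tmatmul(Ns[-1], ext_matrix(a, ab)))
    ts = {tuple(Ms[j][p][i] + Ns[n - j][q][i] for p in range(m) for q in range(m))
          for j in range(n + 1) for i in range(m)}
    return maximal(ts)

def left_action(B, t, m):    # (I (x) B) t : result[p][q] = min_s B[q][s] + t[p][s]
    return tuple(min(B[q][s] + t[p * m + s] for s in range(m))
                 for p in range(m) for q in range(m))

def right_action(A, t, m):   # (A (x) I) t : result[p][q] = min_r A[p][r] + t[r][q]
    return tuple(min(A[p][r] + t[r * m + q] for r in range(m))
                 for p in range(m) for q in range(m))

ab, m = "ABC", 3
u, v = "CBB", "CAC"
Nv = word_matrix(v[::-1], ab)
Mu = word_matrix(u, ab)
E = {left_action(Nv, t, m) for t in residual_tensors(u, ab)} | \
    {right_action(Mu, t, m) for t in residual_tensors(v, ab)}
Ew = maximal(E)
print(Ew == residual_tensors(u + v, ab))   # True: Lemma 3.16
print(1 + max(min(t) for t in Ew))         # h(CBBCAC) = 4
```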

4 Piecewise complexity of SLP-compressed words

An SLP is an acyclic context-free grammar where each non-terminal has only one production rule, i.e., the grammar is deterministic and generates a single word. When a long word has many repetitions it can often be described by an SLP of smaller size and thus SLPs can be used as a compressed data structure for words. Importantly, most compression schemes used for texts and files are equivalent, via polynomial-time encodings, to the mathematically more elegant SLPs. We refer to [17] for more background and details.

Formally, an SLP X with k rules is an alphabet Σ together with a list N1→ρ1; …; Nk→ρk of production rules where each right-hand side ρi is either a letter a from Σ or a concatenation NjNj′ of two nonterminals with j,j′ < i. The size of an SLP, denoted |X|, is in O(m + k·log k), where m = |Σ|.

In the context of X, the expansion of a nonterminal Ni is a word exp(Ni)Σ defined inductively via:

exp(Ni) =def a if ρi = a, and exp(Nj)exp(Nj′) if ρi = NjNj′.

Finally, the expansion of the SLP itself, written exp(X), is the expansion exp(Nk) of its last nonterminal. Note that the length of exp(X) can reach 2^{k−1}, hence is in 2^{O(|X|)}.
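A minimal SLP expander (the rule encoding and names are ours) illustrates this exponential blow-up:

```python
def expand(rules):
    """rules: dict N -> letter (str) or pair (Nj, Nj2) with Nj, Nj2 < N."""
    cache = {}
    def exp(N):
        if N not in cache:
            rhs = rules[N]
            cache[N] = rhs if isinstance(rhs, str) else exp(rhs[0]) + exp(rhs[1])
        return cache[N]
    return exp(max(rules))          # expansion of the last nonterminal

# k rules: N1 -> a and N_{i+1} -> N_i N_i produce a word of length 2^(k-1)
k = 11
rules = {1: "a"}
for i in range(2, k + 1):
    rules[i] = (i - 1, i - 1)
print(len(expand(rules)))           # 1024 = 2^10
```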

Theorem 4.1 (Piecewise signatures of compressed words).

Given an SLP X, one can compute σ(exp(X)) in time k²·m^{O(1)} where k is the number of rules of X and m is the size of its alphabet.

Proof.

The algorithm computes piecewise signatures σ1,,σk for each non-terminal of X, i.e., with σi=σ(exp(Ni)). There are two cases to handle:

  1. when Ni is given by a rule Ni → a, the signature of a has extension matrices given by Eq. (4) and Ea given by Lemma 3.10;

  2. when Ni is given by a rule Ni → NjNj′, the signature σi is obtained by combining σj and σj′ as shown in Theorem 3.14.

These k steps can each be performed in time k·m^{O(1)} since, by Theorem 3.13, each σi has size bounded by log(|exp(Ni)|)·m^{O(1)}, hence in k·m^{O(1)}. Combining with Theorem 3.11 we obtain:

Corollary 4.2.

The piecewise complexity h(X) of (the word represented by) an SLP X can be computed in time k²·m^{O(1)}.
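Putting everything together, the algorithm fits on one page. The sketch below (all names are ours) computes h over an artificial SLP for 𝙲𝙱𝙱𝙲𝙰𝙲 by propagating signatures, never expanding nonterminals:

```python
INF = float("inf")

def ext_matrix(a, ab):
    m = len(ab)
    return [[1 if ab[i] == a else (0 if i == j else INF)
             for j in range(m)] for i in range(m)]

def identity(m):
    return [[0 if i == j else INF for j in range(m)] for i in range(m)]

def tmatmul(A, B):
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def maximal(ts):
    return frozenset(t for t in ts
                     if not any(s != t and all(x <= y for x, y in zip(t, s))
                                for s in ts))

def letter_sig(a, ab):                  # case 1: Lemma 3.10
    m = len(ab)
    Ma, I = ext_matrix(a, ab), identity(m)
    ts = {tuple(I[p][i] + Ma[q][i] for p in range(m) for q in range(m))
          for i in range(m)} | \
         {tuple(Ma[p][i] + I[q][i] for p in range(m) for q in range(m))
          for i in range(m)}
    return (maximal(ts), Ma, Ma)        # N_a = M_a for a single letter

def concat_sig(su, sv, m):              # case 2: Theorem 3.14 / Lemma 3.16
    (Eu, Mu, Nu), (Ev, Mv, Nv) = su, sv
    left = lambda t: tuple(min(Nv[q][s] + t[p * m + s] for s in range(m))
                           for p in range(m) for q in range(m))   # (I (x) Nv) t
    right = lambda t: tuple(min(Mu[p][r] + t[r * m + q] for r in range(m))
                            for p in range(m) for q in range(m))  # (Mu (x) I) t
    E = {left(t) for t in Eu} | {right(t) for t in Ev}
    return (maximal(E), tmatmul(Mu, Mv), tmatmul(Nv, Nu))

def slp_h(rules, ab):
    m, sig = len(ab), {}
    for N in sorted(rules):
        rhs = rules[N]
        sig[N] = letter_sig(rhs, ab) if isinstance(rhs, str) \
                 else concat_sig(sig[rhs[0]], sig[rhs[1]], m)
    E, _, _ = sig[max(rules)]
    return 1 + max(min(t) for t in E)    # Theorem 3.11

# an SLP with expansions C, B, A, CB, CBB, CA, CAC, CBBCAC
rules = {1: "C", 2: "B", 3: "A", 4: (1, 2), 5: (4, 2), 6: (1, 3),
         7: (6, 1), 8: (5, 7)}
print(slp_h(rules, "ABC"))   # 4
```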

5 The size of piecewise signatures

Fix some alphabet Σ = {a1,…,am}. The goal of this section is to bound the number of tensors, written #Eu, in any Eu set. Such bounds are crucial for the complexity results in Theorems 3.13, 3.14, and 4.1. We first show that #Eu is in O(m²). This is later complemented with Theorem 5.5 where we exhibit a family of witnesses proving an Ω(m²) lower bound.

Theorem 5.1.

For any u ∈ Σ*, #Eu ≤ 2m(2m+1) where m = |Σ|.

The proof of Theorem 5.1 is organized in three main lemmas.
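Before going through the lemmas, the bound can be sanity-checked empirically on random words, reusing the constructions of Section 3 (a sketch; helper names are ours):

```python
import random

INF = float("inf")

def ext_matrix(a, ab):
    m = len(ab)
    return [[1 if ab[i] == a else (0 if i == j else INF)
             for j in range(m)] for i in range(m)]

def tmatmul(A, B):
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def count_Eu(u, ab):
    m, n = len(ab), len(u)
    I = [[0 if i == j else INF for j in range(m)] for i in range(m)]
    Ms, Ns = [I], [I]
    for a in u:
        Ms.append(tmatmul(Ms[-1], ext_matrix(a, ab)))
    for a in u[::-1]:
        Ns.append(tmatmul(Ns[-1], ext_matrix(a, ab)))
    ts = {tuple(Ms[j][p][i] + Ns[n - j][q][i] for p in range(m) for q in range(m))
          for j in range(n + 1) for i in range(m)}
    return len({t for t in ts
                if not any(s != t and all(x <= y for x, y in zip(t, s))
                           for s in ts)})

random.seed(1)
ab = "ABC"                      # m = 3, so the bound is 2m(2m+1) = 42
worst = max(count_Eu("".join(random.choice(ab) for _ in range(25)), ab)
            for _ in range(100))
print(worst <= 42)              # True on this sample, as Theorem 5.1 guarantees
```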

For any 1 ≤ j ≤ m and any u ∈ Σ*, let Mu[,j] =def Mu𝒆j denote the jth column of Mu. The columns of Mu respect some patterns that are easier to analyze if we assume that the letters of Σ are listed in the order they appear in u (with letters not in u listed at the end). This is no loss of generality since a permutation of Σ just amounts to a permutation of the indices in our matrices. Wherever we make that assumption, we say that u and Σ are aligned, or sometimes more succinctly that "u is aligned".

Lemma 5.2.

Assume that u is aligned and only uses letters a1,…,ap with p ≤ m. Then every column Mu[,j] agrees with one of the following two patterns:

  1. If 1 ≤ j ≤ p, then

    Mu[,j] = (n+1 ⋯ n+1 n ⋯ n ∞ ⋯ ∞) = T(n,i,p)

    for some 1 ≤ n and some 0 ≤ i < p, with i entries equal to n+1, followed by p−i entries equal to n and m−p entries equal to ∞. We write T(n,i,p) for this vector.

  2. If p < j ≤ m then

    Mu[,j] = (1 ⋯ 1 ∞ ⋯ ∞ 0 ∞ ⋯ ∞) = U(j,p)

    with p entries equal to 1, a single 0 in position j, and ∞ everywhere else, and we write U(j,p) for this vector.

Proof.

By induction on the length of u. For |u| = 0 we have u = ϵ, p = 0, Mu = Im, so p < j and indeed Mu[,j] = 𝒆j = U(j,0) = U(j,p) for all j = 1,…,m.

Now assume that u = vaq is aligned and that the claim holds for v. First let us see how the columns of Mu are obtained from the columns of Mv: for any j we have Mu[,j] = (Mv Maq)[,j] = Mv (Maq[,j]). With Eq. (4) we see

Maq[,j] = 1𝒆q = 1𝒆j if j = q, and (1𝒆q) ∧ 𝒆j otherwise. (∗)

Note that the terms in (∗) involve a tropical scaling 1𝒆q, which means "increase every value in 𝒆q by 1", while the second term contains a tropical sum "∧", which means "collect mins component-wise". Finally Mu[,j] is 1Mv𝒆q = 1Mv[,q] if j = q, and is (1Mv[,q]) ∧ Mv[,j], simply written ⋆, otherwise.

We consider whether aq occurs in v:

If aq occurs in v then Mv[,q] = T(n,i,p) for some 1 ≤ n, 0 ≤ i < p, hence Mu[,q] = 1T(n,i,p) = T(n+1,i,p). Now for j ≠ q, Mv[,j] is of the form T(n′,i′,p) for some 1 ≤ n′, 0 ≤ i′ < p if j ≤ p, or U(j,p) if j > p. If Mv[,j] = T(n′,i′,p) then Mu[,j] = ⋆ = T(min(n+1,n′),i″,p) with i″ being i or i′. If Mv[,j] = U(j,p) then observe that ⋆ = T(n+1,i,p) ∧ U(j,p) = U(j,p).

Similarly, if aq does not occur in v (in which case q = p by the alignment assumption), then Mv[,q] = U(p,p−1), hence Mu[,q] = 1Mv[,q] = T(1,p−1,p). For j ≠ q, Mv[,j] is of the form T(n′,i′,p−1) for some 1 ≤ n′, 0 ≤ i′ < p−1 if j ≤ p−1, or U(j,p−1) if j > p−1. If Mv[,j] = T(n′,i′,p−1) then ⋆ = T(1,i″,p) with i″ being i′ or p−1; if Mv[,j] = U(j,p−1) then ⋆ = U(j,p).

Finally, all cases yield results that agree with the claim and the proof is complete. When u is not aligned, the patterns above still apply, but modulo a permutation of the lines and columns of Mu. Finally, the patterns also apply to Nu, i.e., to Mu𝖱, but now alignment considers the letters of u in their order of last appearance.

Lemma 5.3.

If all the letters of Σ occur in u, then #Eu ≤ 2m(2m+1).

Proof.

Since Eu collects the maximal elements of Hu, it is in particular an antichain of Hu, and we can bound its size by the size of the largest antichains in Hu (this size is known as the width of Hu, as in Dilworth's theorem). Recall that the tensors in Hu are partially ordered by ≤.

Recall also, from Definition 3.7, that a tensor of Hu is obtained by combining a 𝜹i,j from Fu with a 𝜸i,j from Gu, i.e., building Mv[,i] ⊗ Nw[,i] for a factorization u = vw and a letter ai ∈ Σ. To simplify notations, when we consider a prefix v of u, we assume that the letters in Σ are listed in the order they appear in u. When looking at Nw for w a suffix of u we assume that Σ is listed in order of last appearances in u: making the two different assumptions simultaneously is not a problem since they apply to vectors that will be combined via a tensor product, and the two implicit reindexings can be carried out independently on the two factors.

Now remember that, according to Lemma 5.2, for any 1 ≤ j ≤ m and any prefix v of u, Mv[,j] is given by:

  1. If aj appears in v then Mv[,j]=T(n,i,p) for some 0 ≤ i < p ≤ m.

  2. If aj does not appear in v then Mv[,j]=U(j,p) for some 0 ≤ p < j.

In both cases the letters ap+1 to am do not appear in v.

Similarly, for any 1 ≤ j ≤ m and any suffix w of u, Nw[,j] matches one of the two patterns. Furthermore, for any factorization u=vw, we observe that Mv[,j] and Nw[,j] cannot both match the second pattern since aj has to occur in v or in w.

First let us bound the size of the antichains of

Sp,p′ =def {T(n,i,p)⊗T(n′,i′,p′) | n,n′ ≥ 1, 0 ≤ i < p, 0 ≤ i′ < p′}

with 1 ≤ p,p′ ≤ m fixed. Representing tensors in matrix form, and ignoring the ∞ rows and columns, we obtain matrices of the form

T(n,i,p)⊗T(n′,i′,p′) =

with x=n+n′. Thus two tensors T(n1,i1,p)⊗T(n′1,i′1,p′) and T(n2,i2,p)⊗T(n′2,i′2,p′) compare as follows:

  • if n1+n′1 < n2+n′2 then T(n1,i1,p)⊗T(n′1,i′1,p′) ⊑ T(n2,i2,p)⊗T(n′2,i′2,p′);

  • if n1+n′1 > n2+n′2 then T(n1,i1,p)⊗T(n′1,i′1,p′) ⊒ T(n2,i2,p)⊗T(n′2,i′2,p′);

  • otherwise T(n1,i1,p)⊗T(n′1,i′1,p′) ⊑ T(n2,i2,p)⊗T(n′2,i′2,p′) iff i1 ≤ i2 and i′1 ≤ i′2.

Since 0 ≤ i < p can take at most p distinct values, the longest antichain of Sp,p′ has size at most p. But we can reason symmetrically on i′, so in fine the longest antichain of Sp,p′ has size at most min(p,p′).
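The min(p,p′) bound ultimately rests on a classical fact: in the componentwise (product) order on a p-by-p′ grid of index pairs (i,i′), the longest antichain has exactly min(p,p′) elements. The following brute-force Python check of this fact on small grids models only the product order, not the tensors themselves; the names are ours.

```python
from itertools import product, combinations

def comparable(a, b):
    # product order on pairs: a and b are comparable iff one
    # dominates the other in both coordinates
    return (a[0] <= b[0] and a[1] <= b[1]) or (b[0] <= a[0] and b[1] <= a[1])

def max_antichain(p, pp):
    # size of the longest antichain in the grid {0..p-1} x {0..pp-1}
    pts = list(product(range(p), range(pp)))
    best = 0
    for r in range(1, len(pts) + 1):
        if any(all(not comparable(a, b) for a, b in combinations(sub, 2))
               for sub in combinations(pts, r)):
            best = r
        else:
            break  # no antichain of size r, hence none larger either
    return best
```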

Let us now bound the size of the antichains of

Rp,p′ =def {U(j,p)⊗T(n,i,p′) | p < j ≤ m, n ≥ 1, 0 ≤ i < p′}

with 0 ≤ p < m and 0 < p′ ≤ m. Any two elements of Rp,p′ are comparable if they have identical U(j,p) components. Since p < j ≤ m, there are m−p distinct U(j,p), hence any antichain of Rp,p′ has size at most m−p.

Symmetrically,

Lp,p′ =def {T(n,i,p)⊗U(j,p′) | n ≥ 1, 0 ≤ i < p, p′ < j ≤ m}

with 0 < p ≤ m and 0 ≤ p′ < m, has antichains of size at most m−p′.

Now observe that

Hu ⊆ ⋃(p,p′)∈P (Sp,p′ ∪ Rp,p′ ∪ Lp,p′),

where P is the set of all pairs (p,p′) such that there exists a factorization u=vw with v and w having respectively p and p′ distinct letters. The pairs (p,p′) in P depend on the order of appearance and disappearance of the letters in u, hence they can be ordered so that p is increasing and p′ is decreasing. Thus P has size at most 2m+1. Moreover, min(p,p′)+(m−p)+(m−p′) ≤ 2m. Summing all three bounds over P, we conclude that, as claimed, any antichain in Hu has size bounded by

∑(p,p′)∈P ( min(p,p′)+(m−p)+(m−p′) ) ≤ 2m(2m+1).
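This final summation can be sanity-checked mechanically. The sketch below models P as a monotone staircase from (0,m) to (m,0) — our simplifying assumption, since each step of the factorization can only add a distinct letter to the prefix or remove one from the suffix — and verifies the 2m(2m+1) bound over all such staircases for small m:

```python
from itertools import combinations

def bound_holds(m):
    # enumerate all staircases (p increasing from 0 to m, p' decreasing
    # from m to 0), summing min(p,p') + (m-p) + (m-p') at each point
    for ups in combinations(range(2 * m), m):
        p, pp = 0, m
        total = min(p, pp) + (m - p) + (m - pp)
        for step in range(2 * m):
            if step in ups:
                p += 1
            else:
                pp -= 1
            total += min(p, pp) + (m - p) + (m - pp)
        if total > 2 * m * (2 * m + 1):
            return False
    return True
```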

When u does not use all letters of Σ, we write Γ for the subset of Σ that collects the letters occurring in u. We then have two Eu sets, depending on whether the underlying alphabet is taken to be Γ or Σ; accordingly we write EuΓ and EuΣ (and also MuΓ, etc.)

Lemma 5.4.

Assume |Γ| = m−1. Then #EuΣ ≤ #EuΓ + 2m−1.

(See Section A.1 for the proof). We can now wrap up this subsection.

Proof of Theorem 5.1.

Write m′ for |Γ|. By Lemma 5.3, #EuΓ ≤ 2m′(2m′+1). Invoking Lemma 5.4 m−m′ times we obtain the required

#EuΣ ≤ #EuΓ + (2m′+1) + ⋯ + (2m−1)
 ≤ 2m′(2m′+1) + ∑m′<i≤m (2i−1)
 ≤ 2m(2m+1).
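The chain of inequalities can also be verified numerically; a quick Python check over all m′ ≤ m (our script, simply instantiating the bound of Lemma 5.4 repeatedly):

```python
def wrap_up_bound_holds(m_max=40):
    # check: 2m'(2m'+1) + sum of (2i-1) for m' < i <= m
    # never exceeds 2m(2m+1), for all 0 <= m' <= m <= m_max
    for m in range(1, m_max + 1):
        for mp in range(0, m + 1):
            lhs = 2 * mp * (2 * mp + 1) + sum(2 * i - 1 for i in range(mp + 1, m + 1))
            if lhs > 2 * m * (2 * m + 1):
                return False
    return True
```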

We conclude this section with a matching lower bound:

Theorem 5.5 (#Eu is in Ω(m²)).

For each k=1,2,…, there exists a word uk over a k-letter alphabet such that #Euk ≥ k(k+1).

Describing and analyzing uk is quite involved so the proof has been relegated to Section A.2.

6 Concluding remarks

We developed a tropical algebraic approach to the computation of the r and ℓ side distances that are the basis for the piecewise complexity of words. This allowed us to construct compact piecewise signatures and to combine them elegantly and efficiently. An outcome of this approach is a polynomial-time algorithm for the piecewise complexity of compressed words. The proof that piecewise signatures have at most O(m²) elements, where m is the size of the alphabet, is technically more involved but is made clearer in the tropical framework.

One can certainly derive more results from our tropical approach to piecewise complexity. For example, recall that the subword universality index ι(u) of a word u is the largest L such that u contains all the length-L words over Σ as subwords [1, 5].

Theorem 6.1.

ι(u)=0 iff ∞ occurs in Mu. When ∞ does not occur in Mu, ι(u) is the value of the minimal entry in Mu.

(See appendix for a proof.) This provides a simpler way (compared to [25]) of computing the universality index of SLP-compressed words in polynomial time.

We hope that a tropical viewpoint can similarly benefit related investigations on piecewise distance, canonical representatives modulo Simon’s congruence, piecewise separability, or piecewise-testable languages and their associated automata.

The tropical setting also leads to clearer code. We implemented all the algorithms presented in this paper, in order to analyze examples and derive the results in Section 5, or to test huge SLPs. Once a software library for tropical vectors and matrices is set up, the implementation is just a direct transcription of the algebraic formulae given in the paper.
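To illustrate how small such a library can be, here is a minimal min-plus core in Python — a sketch of the kind of code meant above, not the authors' actual implementation:

```python
import math

INF = math.inf  # absorbing element of the tropical product

def tmat_mul(A, B):
    # tropical matrix product (A ⊙ B)[i][j] = min over k of A[i][k] + B[k][j]
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]
```

With this in place, a formula such as MWk⊙Mvk⊙Ma1 transcribes directly as tmat_mul(tmat_mul(MWk, Mvk), Ma1).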

References

  • [1] L. Barker, P. Fleischmann, K. Harwardt, F. Manea, and D. Nowotka. Scattered factor-universality of words. In Proc. DLT 2020, volume 12086 of Lecture Notes in Computer Science, pages 14–28. Springer, 2020. doi:10.1007/978-3-030-48516-0_2.
  • [2] M. Bojańczyk, L. Segoufin, and H. Straubing. Piecewise testable tree languages. Logical Methods in Comp. Science, 8(3), 2012. doi:10.2168/LMCS-8(3:26)2012.
  • [3] P. Butkovič. Max-algebra: the linear algebra of combinatorics? Linear Algebra and its Applications, 367:313–335, 2003. doi:10.1016/S0024-3795(02)00655-9.
  • [4] O. Carton and M. Pouzet. Simon’s theorem for scattered words. In Proc. DLT 2018, volume 11088 of Lecture Notes in Computer Science, pages 182–193. Springer, 2018. doi:10.1007/978-3-319-98654-8_15.
  • [5] J. D. Day, P. Fleischmann, M. Kosche, T. Koß, F. Manea, and S. Siemer. The edit distance to k-subsequence universality. In Proc. STACS 2021, volume 187 of Leibniz International Proceedings in Informatics, pages 25:1–25:19. Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.STACS.2021.25.
  • [6] V. Diekert, P. Gastin, and M. Kufleitner. A survey on small fragments of first-order logic over finite words. Int. J. Foundations of Computer Science, 19(3):513–548, 2008. doi:10.1142/S0129054108005802.
  • [7] M. Droste, W. Kuich, and H. Vogler, editors. Handbook of Weighted Automata. Monographs in Theoretical Computer Science. Springer, 2009.
  • [8] L. Fleischer and M. Kufleitner. Testing Simon’s congruence. In Proc. MFCS 2018, volume 117 of Leibniz International Proceedings in Informatics, pages 62:1–62:13. Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPIcs.MFCS.2018.62.
  • [9] S. Gaubert and Max Plus. Methods and applications of (max, +) linear algebra. In Proc. STACS ’97, volume 1200 of Lecture Notes in Computer Science, pages 261–282. Springer, 1997. doi:10.1007/BFb0023465.
  • [10] P. Gawrychowski, M. Kosche, T. Koß, F. Manea, and S. Siemer. Efficiently testing Simon’s congruence. In Proc. STACS 2021, volume 187 of Leibniz International Proceedings in Informatics, pages 34:1–34:18. Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.STACS.2021.34.
  • [11] J. Goubault-Larrecq and S. Schmitz. Deciding piecewise testable separability for regular tree languages. In Proc. ICALP 2016, volume 55 of Leibniz International Proceedings in Informatics, pages 97:1–97:15. Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPIcs.ICALP.2016.97.
  • [12] S. Halfon, Ph. Schnoebelen, and G. Zetzsche. Decidability, complexity, and expressiveness of first-order logic over the subword ordering. In Proc. LICS 2017, pages 1–12. IEEE Comp. Soc. Press, 2017. doi:10.1109/LICS.2017.8005141.
  • [13] B. Heidergott, G. J. Olsder, and J. van der Woude. Max Plus at Work. Modelling and Analysis of Synchronized Systems: A Course on Max-Plus Algebra and its Applications. Princeton Series in Applied Mathematics. Princeton University Press, 2006.
  • [14] P. Karandikar, M. Kufleitner, and Ph. Schnoebelen. On the index of Simon’s congruence for piecewise testability. Information Processing Letters, 115(4):515–519, 2015. doi:10.1016/j.ipl.2014.11.008.
  • [15] P. Karandikar and Ph. Schnoebelen. The height of piecewise-testable languages and the complexity of the logic of subwords. Logical Methods in Comp. Science, 15(2), 2019. doi:10.23638/LMCS-15(2:6)2019.
  • [16] V. I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. Journal of Combinatorial Theory, Series A, 93(2):310–332, 2001. doi:10.1006/jcta.2000.3081.
  • [17] M. Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complexity Cryptology, 4(2):241–299, 2012. doi:10.1515/gcc-2012-0016.
  • [18] S. Lombardy and J. Mairesse. Max-plus automata. In Jean-Éric Pin, editor, Handbook of Automata Theory. Volume I: Theoretical Foundations, chapter 5, pages 151–188. EMS Press, 2021. doi:10.4171/Automata-1/5.
  • [19] D. G. Northcott. Multilinear Algebra. Cambridge Univ. Press, 1984.
  • [20] P. P. Pach. Normal forms under Simon’s congruence. Semigroup Forum, 97:251–267, 2018. doi:10.1007/s00233-017-9910-5.
  • [21] J.-É. Pin. The influence of Imre Simon’s work in the theory of automata, languages and semigroups. Semigroup Forum, 98:1–8, 2019. doi:10.1007/s00233-019-09999-8.
  • [22] J.-É. Pin. Tropical semirings. In J. Gunawardena, editor, Idempotency, pages 50–69. Cambridge Univ. Press, 1998. doi:10.1017/CBO9780511662508.004.
  • [23] M. Praveen, Ph. Schnoebelen, J. Veron, and I. Vialard. On the piecewise complexity of words and periodic words. In Proc. SOFSEM 2024, volume 14519 of Lecture Notes in Computer Science, pages 456–470. Springer, 2024. doi:10.1007/978-3-031-52113-3_32.
  • [24] J. Sakarovitch and I. Simon. Subwords. In M. Lothaire, editor, Combinatorics on Words, volume 17 of Encyclopedia of Mathematics and Its Applications, chapter 6, pages 105–142. Cambridge Univ. Press, 1983.
  • [25] Ph. Schnoebelen and J. Veron. On arch factorization and subword universality for words and compressed words. In Proc. WORDS 2023, volume 13899 of Lecture Notes in Computer Science, pages 274–287. Springer, 2023. doi:10.1007/978-3-031-33180-0_21.
  • [26] Ph. Schnoebelen and I. Vialard. On the piecewise complexity of words. Acta Informatica, 62(1), 2025. doi:10.1007/s00236-025-00480-4.
  • [27] I. Simon. Hierarchies of Events with Dot-Depth One. PhD thesis, University of Waterloo, Dept. Applied Analysis and Computer Science, 1972. URL: http://maveric.uwaterloo.ca/reports/1972_Simon_PhD.pdf.
  • [28] I. Simon. Piecewise testable events. In Proc. 2nd GI Conf. on Automata Theory and Formal Languages, volume 33 of Lecture Notes in Computer Science, pages 214–222. Springer, 1975. doi:10.1007/3-540-07407-4_23.
  • [29] I. Simon. Limited subsets of a free monoid. In Proc. FOCS ’78, pages 143–150. IEEE Comp. Soc. Press, 1978. doi:10.1109/SFCS.1978.21.

Appendix A Proofs omitted from main text

A.1 Proof of Lemma 5.4

See 5.4

We first observe that

MuΣ =, NuΣ =.

As a consequence, and writing m′ =def m−1, a 𝜹i,jΣ in FuΣ with i ≤ m′ is 𝜹i,jΓ extended with one ∞. When i=m, 𝜹i,jΣ is (α1 ⋯ αm′ 0) where αs is 1 if as occurs in u(,j), and is ∞ otherwise. The same reasoning applies to a 𝜸i,jΣ in GuΣ, except that now, when i=m, one considers occurrences in u(j,).

Looking now at HuΣ, a tensor 𝜹i,jΣ⊗𝜸i′,jΣ with i,i′ ≤ m′ is just the corresponding 𝜼i,jΓ with 2m−1 extra ∞’s inserted at fixed positions. Therefore, for i,i′<m, the tensors at positions (i,j) and (i′,j) are ordered (or incomparable) in HuΣ exactly as in HuΓ. We also have to consider the new tensors at positions (m,j): they all end with 0, hence cannot dominate any tensor on a row with i<m (these end with ∞). However, they can be incomparable with these and contribute new maximal elements. Now, there can be at most 2m′+1 incomparable tensors on the row i=m since they can be ordered with pairs (p,p′) ∈ P as in the proof of Lemma 5.3.

Finally, the maximal elements of HuΣ are the maximal elements of HuΓ (duly extended) plus at most 2m′+1, i.e., 2m−1, new maximal elements taken from the last (new) row. Hence the claim.

A.2 Proof of Theorem 5.5

See 5.5

We have to define a word uk over a k-letter alphabet and show that #Euk ≥ k(k+1).

Fix Σ={a1,…,ak}, sometimes written {𝙰,𝙱,𝙲,…} in examples. We let Wk =def a1a2⋯ak denote the word that lists the k letters in order. For 0 ≤ j ≤ k we further write Wj,k for the suffix Wk(k−j,k), i.e., the last j letters of Wk. We then define vk and uk via

vk =def W1,k W2,k ⋯ Wk−1,k,  uk =def Wk vk a1 vk𝖱 Wk𝖱.

Observe that uk is a palindrome. As an example, here are vk and uk for k=4:

v4 = W1,4 W2,4 W3,4 = 𝙳·𝙲𝙳·𝙱𝙲𝙳 = 𝙳𝙲𝙳𝙱𝙲𝙳,  u4 = 𝙰𝙱𝙲𝙳 𝙳𝙲𝙳𝙱𝙲𝙳 𝙰 𝙳𝙲𝙱𝙳𝙲𝙳 𝙳𝙲𝙱𝙰 (the five factors being W4, v4, a1, v4𝖱, and W4𝖱).
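The construction is easy to reproduce in a few lines of Python (our transcription; letters are rendered as A, B, C, …), which also confirms that uk is a palindrome:

```python
def W(k):
    # W_k = a1 a2 ... ak, the k letters in order
    return ''.join(chr(ord('A') + i) for i in range(k))

def v(k):
    # v_k = W_{1,k} W_{2,k} ... W_{k-1,k}, where W_{j,k} is the last j letters of W_k
    w = W(k)
    return ''.join(w[k - j:] for j in range(1, k))

def u(k):
    # u_k = W_k  v_k  a1  (v_k reversed)  (W_k reversed)
    return W(k) + v(k) + 'A' + v(k)[::-1] + W(k)[::-1]
```

For k=4 this yields v(4) == 'DCDBCD' and the palindrome u(4) == 'ABCDDCDBCDADCBDCDDCBA'.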

We start by computing the extension matrices associated with various prefixes and suffixes of uk. An easy induction on k provides the following k-by-k matrices for any k ≥ 2:

MWk=NWk𝖱 =, NWk=MWk𝖱 =.

From these, we can deduce the k-by-k extension matrices for the Wn,k where 0 ≤ n ≤ k:

MWn,k=NWn,k𝖱 =, MWn,k𝖱=NWn,k =.

Combining the above matrices, we compute the extension matrix for vk:

Mvk=

with 3 as the maximal value, filling the bottom-right part. A consequence is that MWkvka1, obtained as the product MWk⊙Mvk⊙Ma1, is the k-by-k all-2s matrix.

Let us now compute Nvk and MWkvka1vk𝖱, noting that Wkvka1vk𝖱 is uk(k,)𝖱:

Nvk =, MWkvka1vk𝖱 =.

Finally, for a factorization uk=uk(,n)uk(n,) with n<k, we compute Muk(,n) and Nuk(n,), noting that uk(,n)=Wn and uk(n,)𝖱=Wkvka1vk𝖱 ak ak−1 ⋯ an+1.

Muk(,n) =, Nuk(n,) =.
Lemma A.1.

Huk contains all the Muk(,n)[,i]⊗Nuk(n,)[,i] and the Nuk(n,)[,i]⊗Muk(,n)[,i] for 0 ≤ n < i ≤ k.

Proof.

Indeed this just expresses the 𝜹i,n and 𝜸i,n vectors of Fuk and Guk as columns of the matrices we computed. Note that, in the claim, the products of the first kind are taken from the first k columns of Huk, while the products of the second kind come from its last k columns but are expressed in terms of the first columns of Fuk and Guk, taking advantage of the symmetry in uk.

Looking now at the matrix for Muk(,n) we see that, for n<i, the ith column is U(i,n) in the notation used for Lemma 5.2, while the ith column of Nuk(n,) is T(n+2,k,k).

Lemma A.2.

The set

Sk =def {U(i,n)⊗T(n+2,k,k), T(n+2,k,k)⊗U(i,n) | 0 ≤ n < i ≤ k}

is an antichain in Huk.

Proof (Sketch).

We know that these tensors are in Huk since they are exactly those listed in Lemma A.1, now expressed as T and U patterns. Stated in this form it is easy to see that they are pairwise incomparable, hence form an antichain. We now conclude with the announced result:

Lemma A.3.

#Eukk(k+1).

Proof.

After the above preparations, and since Sk has cardinality k(k+1), it is enough to prove that the tensors in Sk are maximal in Huk. However, we are not able to do precisely that in this extended abstract.

Instead, consider that every U(i,n) with 0 ≤ n < k−1 and n < i ≤ k contains a ∞. If there exists a factorization uk=vw and some i′ ≤ k such that U(i,n)⊗T(n+2,k,k) is strictly dominated by Mv[,i′]⊗Nw[,i′], then necessarily U(i,n) ⊑ Mv[,i′], thus Mv[,i′] contains a ∞ at every position where U(i,n) does. Hence we deduce that either v is a prefix of uk(,n) and i′=i, or v=uk(,n+1)=uk(,n)an+1 with i′=n+1 and i ≠ n+1. However, in these few specific cases, which are easy to list, we can readily check that the full product U(i,n)⊗T(n+2,k,k) is not dominated by Mv[,i′]⊗Nw[,i′]. So all but two tensors in Sk are already seen to be maximal.

There remains the case n=k−1 and i=k. Here it is harder to show that U(k,k−1)⊗T(k+1,k,k), i.e., 𝜼k,k−1, is maximal, since this tensor has no ∞ and thus could possibly be dominated by some 𝜼i,j for any i,j with k ≤ j ≤ |uk|−k, i.e., for a factorization uk=vw where both v and w contain the full alphabet.

However, since 𝜼k,k−1 is incomparable with the rest of Sk, if 𝜼k,k−1 is not maximal, then Huk contains a tensor that dominates 𝜼k,k−1, hence one that necessarily contains no ∞ and thus does not dominate any other tensor from Sk. (The same argument goes for the symmetric tensor on the right-hand side of Huk.) Finally, the announced k(k+1) lower bound on #Euk is established.

A.3 Proof of Theorem 6.1

See 6.1

First, if ∞ occurs in Mu, then u does not contain every letter of the alphabet, hence ι(u)=0.

Otherwise, we consider the arch factorization of u (see [25]). Observe that when u contains exactly one arch and ends with letter ai, the i-th row of Mu is a row of 1s, while the other rows contain only the numbers 1 and 2.

If ι(u)=k then u can be factorized as u1⋯ukv where every uj is an arch and v does not contain every letter of the alphabet. Let ai be the last letter of u1. Then observe that Mu1⋯uk contains only the numbers k and k+1, and its i-th row is a row of k’s. Thus, Mu=Mu1⋯uk⊙Mv contains only the numbers k and k+1, and Mu[i,j]=k for any j such that aj does not appear in v. Since v misses at least one letter, such a j exists, hence the minimal entry of Mu is k=ι(u), as claimed.
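The arch-based argument above translates directly into a greedy procedure for ι(u): scan u, closing an arch each time every letter has been seen. A short Python sketch, assuming the arch factorization of [25]; the function name is ours:

```python
def universality_index(word, alphabet):
    # iota(word): the number of complete arches, i.e. the largest L such
    # that word contains every length-L word over the alphabet as a subword
    full, seen, arches = set(alphabet), set(), 0
    for letter in word:
        seen.add(letter)
        if seen == full:   # current arch is complete
            arches += 1
            seen = set()   # start a new arch
    return arches
```

For example, on the word u4 = ABCDDCDBCDADCBDCDDCBA from Section A.2 this greedy scan closes three arches.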