A Tropical Approach to the Compositional Piecewise Complexity of Words and Compressed Words
Abstract
We express the piecewise complexity of words using tools and concepts from tropical algebra. This allows us to define a notion of piecewise signature of a word, of size polynomial in the alphabet size and logarithmic in the length of the word. The piecewise signature of a concatenation can be computed from the signatures of its components, yielding a polynomial-time algorithm for computing the piecewise complexity of SLP-compressed words.
Keywords and phrases: Tropical semiring, subwords and subsequences, piecewise complexity, SLP-compressed words
Category: Track B: Automata, Logic, Semantics, and Theory of Programming
2012 ACM Subject Classification: Theory of computation → Formal languages and automata theory
Acknowledgements: Work started while the third author was at Univ. Paris-Saclay.
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction
Subwords and piecewise complexity
For two words u and v, we say that u is a subword of v, written u ⊑ v, if u can be obtained by removing some letters (possibly all, possibly none) from v at arbitrary positions. For example, bd ⊑ abcd. This notion corresponds to subsequences in other areas, and is more general than the usual concept of factors of a word. (Some authors use the terminology “subword” for factors, and then use “scattered subword” for subwords.)
Compared to factors, subwords are not so easy to reason about, and correspondingly they have been much less studied. However, they certainly are a fundamental notion, appearing naturally in many areas of formal languages, algorithmics, data sciences, logic, computational linguistics, genomics, etc.
In language theory, semigroup theory, and logic, an important concept is piecewise-testability, introduced in 1972 by Imre Simon [27]. It corresponds exactly to the first level of the Straubing–Thérien hierarchy of first-order definable languages [6] and has been extended in many directions (infinite words [4], trees [2], arbitrary well-quasi-ordered sets [11]).
In this paper, we are interested in subword-based descriptive complexity of words. In particular, the piecewise complexity of words and languages has been used in [15] for bounding the complexity of decidable fragments of the logic of subwords (see also [12]). In essence, the piecewise complexity of a word (or of any piecewise-testable object) is the minimum size of the pieces required to specify it. Regarding this measure, a recent algorithmic contribution is [26], providing efficient algorithms for computing the piecewise complexity of individual words. The algorithms and the reasoning behind them are quite involved and it is not trivial to extend them to, e.g., periodic words [23].
Our contribution
In a first part, we rephrase the results of [26] in the language and notation of tropical algebra. This setting makes it easier to develop a compositional way of computing piecewise complexity: we associate with any word a (piecewise) signature that has size logarithmic in the length of the word (and polynomial in the size of the alphabet), and that contains enough information to allow extracting, among other things, the piecewise complexity of the word. Furthermore, signatures can be computed compositionally, i.e., the signature of a concatenation u·v is obtained in polynomial time from the signatures of u and v.
In a second part we use this apparatus to compute the piecewise complexity of compressed words in time polynomial in the size of their compressed description, solving a problem raised in [25]. Reasoning on (e.g., searching) and handling (e.g., editing) compressed data structures is an important field of algorithmics and compressed words are one of the most fundamental and most fruitful instantiations of the paradigm, see [17]. While our algorithm is a simple derivation of the results from the first part, its complexity analysis is quite involved and is the main technical difficulty in this paper. We expect that the techniques behind piecewise signatures can be useful for other subword-related problems on compressed words.
We hope to convince our readers that our contribution is made more elegant and technically simpler through the use of tropical linear and multilinear algebra. This allows one to recruit familiar notions of vectors, matrices, their products, and their monotonicity, thus lightening the cognitive load required to follow our reasoning. Tropical algebra was introduced in the area of formal languages and automata by Imre Simon himself (in [29], see also [21]) in connection with the star-height problem, opening the way to the algebraic and algorithmic theory of weighted automata, now a very active area [7, 18]. In this respect it is surprising that Simon did not use tropical algebra in his fundamental work on piecewise-testable languages [27, 28], on which we heavily rely and whose bases we rephrase tropically.
Related work
Related to the piecewise complexity of words is the piecewise distance between two words, which has been studied more extensively and is the topic of several recent papers [8, 1, 10]. Indeed the piecewise complexity of a word can be defined in terms of piecewise distances [26], but this characterisation does not provide useful algorithms. The range of related questions includes reconstructing words from their subwords [16], solving word equations [20], or computing edit-distance [20, 5], etc., modulo Simon’s congruence.
Outline of the paper
Section 2 recalls the basic concepts, notations, and results used in the paper. Section 3, the main conceptual contribution, shows how to use tropical algebra for computing piecewise complexity compositionally, and introduces piecewise signatures together with their associated algorithms. Section 4 applies this to SLP-compressed words. Finally, Section 5 handles the complexity analysis and contains the main technical difficulties. For readability, some proofs are omitted from the main text and can be found in the Appendix.
2 Basic notions
2.1 Subwords and the piecewise complexity of words
We follow [26]. For a word u = a₁a₂⋯aₙ of length n and two positions i, j such that 0 ≤ i ≤ j ≤ n, we write u(i, j] for the factor a_{i+1}⋯a_j. Note that u(i, i] is the empty word, that u(0, n] = u, and that u(i, j]·u(j, k] = u(i, k]. We write u(i) as shorthand for u(i−1, i], i.e., a_i, the ith letter of u. Finally, we use u(0, i] and u(i, n] as convenient notations for the ith prefix and, respectively, the ith suffix of u, with the understanding that this numbering starts with a “0th” prefix, always equal to the empty word ε, and that suffixes get shorter and shorter when their index grows while the length of prefixes increases.
Fix a k-letter alphabet A. A word u is a subword of v, written u ⊑ v, if there is a factorization v = v₀a₁v₁⋯aₘvₘ with factors v₀, …, vₘ ∈ A* and such that u = a₁⋯aₘ [24]. When furthermore v₁ = ⋯ = v_{m−1} = ε, then u is also a factor of v. For example, both bc and bd are subwords of abcd but only bc is a factor. The longest subword of u is u itself, its shortest subword is the empty word, denoted ε. Subwords are the usual concept of subsequences in the special case of words, i.e., finite sequences of letters.
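The embedding test behind ⊑ is a simple greedy left-to-right scan; a minimal sketch (the function name is ours):

```python
def is_subword(u: str, v: str) -> bool:
    """Decide u ⊑ v by greedily matching the letters of u inside v."""
    it = iter(v)
    # `c in it` advances the iterator past the first occurrence of c,
    # so each letter of u is matched at the leftmost possible position.
    return all(c in it for c in u)
```

The greedy strategy is correct because matching each letter as early as possible never rules out an embedding of the remaining suffix.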
For a given k ∈ ℕ, we write u ∼ₖ v when u and v have the same subwords of length at most k [27]. For example ab ∼₁ ba while ab ≁₂ ba since, e.g., ab ⊑ ab and ab ⋢ ba. In situations where a word s is a subword of only one among two words u and v, we say that s is a (piecewise) distinguisher (also a separator) between u and v.
Then u ∼ₖ v means that u and v have no distinguisher of length k or less, or that their shortest distinguisher (guaranteed to exist when u ≠ v) has length at least k + 1. Each ∼ₖ is an equivalence relation over A*, with ∼₀ being the trivial “always true” relation, and u ∼₁ v holding if, and only if, u and v use the same subset of letters. Every ∼ₖ has finite index [28, 14] and is a refinement of the previous ∼ₖ₋₁, i.e., ∼ₖ ⊆ ∼ₖ₋₁, with the intersection of all the ∼ₖ’s being the identity relation. One speaks of “Simon’s congruence” in view of the compatibility of each ∼ₖ with concatenation. The piecewise complexity of a word u, written h(u), is the smallest k such that the ∼ₖ-class of u is the singleton {u}.
2.2 Computing h(u)
An efficient algorithm computing h(u) is given in [26]. It is based on a characterisation of h in terms of piecewise distances. For a word u and a letter a, we follow [27, 28] and define the right and left “side distances”:
r(u, a) := max{ k ∈ ℕ : u ∼ₖ ua } ,  ℓ(u, a) := max{ k ∈ ℕ : u ∼ₖ au } .
The two notions are mirror images of one another: ℓ(u, a) = r(ũ, a), where ũ denotes the reverse, or the mirror image, of u.
One can reduce the computation of h(u) to the computation of the r- and ℓ-distances in view of the following characterisation (see [26, Prop. 3.1]):
h(u) = 1 + max_{0 ≤ i ≤ n} max_{a ∈ A} ( r(u(0, i], a) + ℓ(u(i, n], a) ) .  (1)
In order to compute the side distances, Schnoebelen and Vialard introduce the following characterisation:
Lemma 2.1 ([26, Lem. 3.8]).
For any word u and any letters a, b with a ≠ b, the following hold:
r(ua, a) = 1 + min_{c ∈ A} r(u, c) ,  (2)
r(ua, b) = r(u, b) .  (3)
Following [26], the r-table of u, denoted R_u, is the k-by-(n+1) matrix with entry r(u(0, i], a) in column i and row a, where we assume some fixed ordering of the letters in A. There is a similar ℓ-table, denoted L_u, with entries ℓ(u(i, n], a). The key point is that, by Eq. (1), computing h(u) amounts to finding the maximum value in the array obtained by summing the two matrices.
Example 2.2.
Here are R_u and L_u for u = abba, assuming the ordering a < b (columns are indexed i = 0, …, 4):
R_u : row a: 0 1 1 1 2 ;  row b: 0 0 1 2 2 .
L_u : row a: 2 1 1 1 0 ;  row b: 2 2 1 0 0 .
With Eq. (1), one sees that h(u) = 3, obtained as 1 plus the maximum entry of the entrywise sum R_u + L_u, which is 2. This maximum is not uniquely attained: here every entry of the sum equals 2. Note that abba ∼₂ abbba, so that indeed h(u) is necessarily larger than 2.
3 Tropical algebra and piecewise complexity
3.1 The tropical semiring
By tropical algebra we mean the semiring (ℕ ∪ {∞}, min, +), with ∞ as zero and 0 as unit, also called the (min,+)-semiring. It is a semiring rather than a ring because there is no additive inverse and no subtraction. In the following we use ∞ as a less cumbersome notation for +∞ (it is common to use ε as a light notation for the zero in tropical rings but we reserve ε to denote the empty word), and we write 𝕋 as shorthand for ℕ ∪ {∞} (and also for the whole structure). We refer to [9, 22, 18] for more in-depth introductions and simply observe that, modulo a sign change, (min,+)-semirings are isomorphic to the more familiar (max,+)-semirings one typically encounters in operations research [13], combinatorics [3], economics, etc.
We use the standard linear ordering ≤ on 𝕋, given by 0 ≤ 1 ≤ 2 ≤ ⋯ ≤ ∞. Both semiring operations are monotonic w.r.t. ≤: for any x, x′, y, y′ ∈ 𝕋, x ≤ x′ and y ≤ y′ together imply min(x, y) ≤ min(x′, y′) and x + y ≤ x′ + y′. Finally, note that ∞, the zero of 𝕋, is also its maximal element.
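In code, the two tropical operations are just `min` and `+`, with `inf` playing the role of the tropical zero; a minimal sketch (the function names are ours):

```python
INF = float("inf")  # tropical zero: neutral for min, absorbing for +

def t_add(x, y):
    """Tropical addition is minimum."""
    return min(x, y)

def t_mul(x, y):
    """Tropical multiplication is ordinary addition; INF absorbs."""
    return x + y
```

Both operations are monotonic in each argument, which is the property the paper relies on throughout.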
Tropical semirings lead to their own flavour of linear algebra. In 𝕋, a linear combination of variables x₁, …, xₖ has the form
e(x₁, …, xₖ) = min(a₁ + x₁, …, aₖ + xₖ)
where the aᵢ’s are scalars from 𝕋. Such an e can be represented by a line vector (a₁, …, aₖ), simply written a, so that e(x) is the (tropical) dot product a · x.
Since 𝕋 is only a commutative semiring and not a field, these linear combinations do not form a proper vector space, not even a module, but only a semimodule over 𝕋. However, and for our limited purposes, the usual notions of vector and matrix sums and products work as expected, so the reader should be on familiar ground. The next paragraphs introduce the notations we need and collect the main properties we shall rely on.
We will often use the vector 1̄ = (0, …, 0), made up of units only (not zeroes) and with dimension to be inferred from the context. We write e_i for the vector that has a single unit in position i, surrounded by zeroes (i.e., ∞ entries). Thus e_i is the ith column of the tropical identity matrix I and 1̄ is its diagonal. As expected, (e₁, …, eₖ) is a base for 𝕋ᵏ, the free k-dimensional 𝕋-semimodule.
In 𝕋, a linear application f : 𝕋ᵏ → 𝕋ᵐ is a mapping of the form f(x) = (e₁(x), …, eₘ(x)) where each eᵢ is a (tropical) linear combination. As in classical linear algebra, f can be represented by an m-by-k matrix M_f, where row i contains the scalars defining eᵢ, so that f(x) = M_f · x.
The ordering ≤ on 𝕋 leads to a partial ordering on tropical vectors and matrices. For two vectors x, y ∈ 𝕋ᵏ, we write x ≤ y when xᵢ ≤ yᵢ for all i. Similarly, given two tropical matrices M and N of same heights and widths, we write M ≤ N when each value in M is dominated by the corresponding value in N. Then x ≤ y and M ≤ N together imply M·x ≤ N·y, i.e., “linear applications are monotonic”. Note that the tropical sums of vectors or matrices are computed by applying min component-wise and coincide with infimums, or meets, in the lattices (𝕋ᵏ, ≤) and (𝕋^{m×k}, ≤): for this reason we denote them with x ∧ y and M ∧ N instead of “min(x, y)” or “min(M, N)”, which would suggest picking the smaller vector or matrix. We still use min for scalars since, in that case, the two interpretations coincide.
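Tropical matrix–vector and matrix–matrix products replace sums by min and products by +; a minimal sketch with row-major lists of lists (the function names are ours):

```python
INF = float("inf")

def t_matvec(M, v):
    """Tropical product M · v: entry i is min_j (M[i][j] + v[j])."""
    return [min(m + x for m, x in zip(row, v)) for row in M]

def t_matmul(A, B):
    """Tropical product A · B: entry (i, j) is min_k (A[i][k] + B[k][j])."""
    return [[min(a + b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

Monotonicity of both products in each argument follows directly from the monotonicity of min and +.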
3.2 A tropical view of the r- and ℓ-tables
The starting point of this paper is the realisation that Lemma 2.1 actually describes the (i+1)th column of R_u as a tropically linear image of its ith column. For this we associate a tropical matrix with each word over A.
Definition 3.1.
The extension matrix associated with a letter a ∈ A is the k-by-k matrix, denoted E_a, given by
(E_a)_{b,c} = 1 if b = a ;  (E_a)_{b,c} = 0 if b = c ≠ a ;  (E_a)_{b,c} = ∞ otherwise.  (4)
The extension matrix associated with a word u = a₁⋯aₙ over A is the (tropical) product E_{aₙ} · ⋯ · E_{a₁}, written E_u. In particular E_ε is I, the identity matrix of order k. Finally, the right-to-left extension matrix of u is E_ũ, i.e., the extension matrix of the mirror word.
Example 3.2.
Assuming A = {a, b} and listing rows and columns in the order a, b (rows separated by semicolons), the above definition leads to
E_a = ( 1 1 ; ∞ 0 ) and E_b = ( 0 ∞ ; 1 1 )
for the letters, and
E_{ab} = E_b · E_a = ( 1 1 ; 2 1 )
for the word ab. Note that E_u and E_v have no clear relation when u ⊑ v (u ⊑ v does not entail E_u ≤ E_v).
We may now return to the table R_u. Let us write ρ_i for its column number i, recalling that the first column has index 0, and use ρ as shorthand for ρ_n, i.e., the last column. When u is understood, we usually just write ρ_i instead of ρ_i(u). Finally, we shall mostly display and use the ρ’s as line vectors without bothering to write transposition signs everywhere.
By definition, ρ_i is the column of values r(u(0, i], a) for a ∈ A. In particular ρ_0 = 1̄.
Lemma 3.3.
For any word u and any factor u(i, j] of u,
ρ_j = E_{u(i,j]} · ρ_i .
In particular, ρ = E_u · 1̄.
Proof.
By induction on j − i, the length of the factor u(i, j]. The base case j = i is clear since E_{u(i,i]}, i.e., E_ε, is the identity matrix. If j = i + 1 then u(i, j] is the letter a = u(j) for some a ∈ A. By Lemma 2.1, for all b ≠ a, the bth entry of column ρ_j is r(u(0, i], b), and likewise, the ath entry of column ρ_j is 1 + min_c r(u(0, i], c). Therefore ρ_j = E_a · ρ_i. Finally, if j > i + 1, then for any p strictly between i and j we can derive
ρ_j = E_{u(p,j]} · ρ_p = E_{u(p,j]} · ( E_{u(i,p]} · ρ_i ) = E_{u(i,j]} · ρ_i
using the induction hypothesis.
Mirror results hold for L_u once we introduce the required notations. So let us write λ_i for the ith column of L_u (counting from the left and starting with λ_0), with the standard abbreviation λ for λ_0 when u is understood. Then for any factor u(i, j] of u we have λ_i = E_m · λ_j, where m is the mirror of u(i, j], and in particular λ = E_ũ · 1̄.
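As a concrete sketch of this computation, the last column of the r-table can be obtained by chaining extension matrices. We assume here the recurrences r(ua, a) = 1 + min_b r(u, b) and r(ua, b) = r(u, b) for b ≠ a, with r(ε, b) = 0, and the function names are ours:

```python
INF = float("inf")

def ext_matrix(a, alphabet):
    """Extension matrix E_a: row a is all 1s; every other row b has the
    tropical unit 0 on the diagonal and INF elsewhere."""
    return [[1 if b == a else (0 if b == c else INF)
             for c in alphabet] for b in alphabet]

def t_matvec(M, v):
    """Tropical (min,+) matrix-vector product."""
    return [min(m + x for m, x in zip(row, v)) for row in M]

def r_column(u, alphabet):
    """Last column rho of the r-table of u, i.e. E_u applied to 1-bar."""
    rho = [0] * len(alphabet)   # rho_0 = all-units vector: r(eps, b) = 0
    for a in u:                 # rho_i = E_{u(i)} . rho_{i-1}
        rho = t_matvec(ext_matrix(a, alphabet), rho)
    return rho
```

Each step costs one k-by-k tropical matrix–vector product, so the whole column is computed in time O(n·k²).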
3.3 Compositional computation of the r- and ℓ-tables
The key to computing r- and ℓ-tables compositionally is to consider the entries r(w·u(0, i], a) and ℓ(u(i, n]·w, a) as functions of ρ(w) and λ(w).
Definition 3.4 (f-r-table).
The functional r-table, denoted R̂_u, associated with a word u is a k-by-(n+1) matrix with value
R̂_u[a, i] := the ath row of E_{u(0,i]}  (5)
in position (a, i).
In other words, R̂_u[a, i] is the ath row of the extension matrix for the ith prefix of u. Note that R̂_u is a table of vectors where R_u was a table of values.
Now R̂_u can be used to compute ρ(u) via ρ(u)[a] = R̂_u[a, n] · 1̄ (Lemma 3.3), but more generally it provides the r-values for any word that has some u(0, i] as a suffix. Indeed, assume a word of the form w · u(0, i]:
Lemma 3.5.
For any words w, u, any letter a, and any position i ∈ {0, …, |u|}:
r(w · u(0, i], a) = R̂_u[a, i] · ρ(w) .  (6)
Proof.
On one hand, r(w · u(0, i], a) is the a-entry in the last column of R_{w·u(0,i]}, i.e., is ρ(w · u(0, i])[a]. Now ρ(w · u(0, i]) = E_{u(0,i]} · ρ(w) by Lemma 3.3. We conclude since R̂_u[a, i] · ρ(w) is just the ath entry of E_{u(0,i]} · ρ(w).
Mirror notions and results exist for L_u. The f-ℓ-table associated with u is a k-by-(n+1) matrix, denoted L̂_u, having value
L̂_u[a, i] := the ath row of E_m, for m the mirror of u(i, n],  (7)
in position (a, i). This ensures
ℓ(u(i, n] · w, a) = L̂_u[a, i] · λ(w) .  (8)
Example 3.6.
Taking u = abba as in Example 2.2, with letters ordered a, b, the rows of R̂_u and L̂_u are (columns i = 0, …, 4):
R̂_u[a, ·] : (0, ∞), (1, 1), (1, 1), (1, 1), (2, 2) ;  R̂_u[b, ·] : (∞, 0), (∞, 0), (2, 1), (2, 2), (2, 2) ;
L̂_u[a, ·] : (2, 2), (1, 1), (1, 1), (1, 1), (0, ∞) ;  L̂_u[b, ·] : (2, 2), (2, 2), (2, 1), (∞, 0), (∞, 0) .
3.4 Summing the r- and ℓ-tables
In order to compute the sums r(u(0, i], a) + ℓ(u(i, n], a) required by Equation (1), we now want to combine the two tables, as done in Example 2.2. Recall, however, that these sums are actually products in 𝕋, and that the vectors R̂_u[a, i] and L̂_u[a, i] actually encode the linear maps given by Eqs. (6) and (8).
Given two tropically linear functions f and g from 𝕋ᵏ to 𝕋, the tropical product f · g, defined via (f · g)(x, y) := f(x) + g(y), satisfies
(f · g)(x ∧ x′, y) = (f · g)(x, y) ∧ (f · g)(x′, y) .
It is (tropically) bilinear since, furthermore, for a scalar c one has (f · g)(c + x, y) = c + (f · g)(x, y) (here c + x is a tropical scaling of a vector, which amounts to increasing every value in x by c). Such mappings are the natural objects of multilinear algebra, see e.g. [19]. Moving to the tensor space 𝕋ᵏ ⊗ 𝕋ᵏ, we can see f · g, a mapping from 𝕋ᵏ × 𝕋ᵏ to 𝕋, as a tropically linear mapping from 𝕋ᵏ ⊗ 𝕋ᵏ to 𝕋, denoted f ⊗ g, that associates f(x) + g(y) with the tensor x ⊗ y.
Adapted to tropical algebra, the elementary tensor x ⊗ y obtained as the outer product of two vectors of size m and n, respectively, is the m-by-n matrix given by (x ⊗ y)_{i,j} = x_i + y_j. It is natural to see x ⊗ y as a matrix but sometimes, as in the expression (f ⊗ g)(x ⊗ y), it is more natural to see the tensors as vectors of size m·n (the size-mn vector is obtained by reading the matrix column by column, see Example 3.9 below).
Definition 3.7 (Summing R̂_u and L̂_u).
The functional rℓ-table associated with a word u is the k-by-(n+1) matrix, denoted T_u, having R̂_u[a, i] ⊗ L̂_u[a, i] in position (a, i).
In the above definition, R̂_u[a, i] and L̂_u[a, i] are as in the definitions of R̂_u and L̂_u. Finally, every entry in T_u is a k-by-k tropical tensor, or, equivalently, a tropical vector of size k².
Let us now consider a word w that contains u as a factor, i.e., assume that w is some w₁ · u · w₂. For a position i in u and a letter a in A, one obtains
r(w₁ · u(0, i], a) + ℓ(u(i, n] · w₂, a) = T_u[a, i] · ( ρ(w₁) ⊗ λ(w₂) ) .  (9)
If now one wants to find the maximum value among all these sums, it is not necessary to consider all possibilities for a and i. Since ρ(w₁) ⊗ λ(w₂) does not change with a and i, and since the dot-product in Eq. (9) is monotonic with respect to the component-wise order, we do not need the full table T_u but only the set of its maximal elements. This leads to:
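Extracting the maximal elements of a finite set of tensors (stored here as flat tuples) under the componentwise order is a direct filter; a minimal sketch (the function name is ours):

```python
def maximal_elements(tensors):
    """Keep the tensors not strictly dominated componentwise by another one."""
    tensors = set(tensors)  # drop duplicates first
    def dominated(t):
        return any(s != t and all(a >= b for a, b in zip(s, t))
                   for s in tensors)
    return {t for t in tensors if not dominated(t)}
```

The quadratic pairwise filter suffices here since, as shown in Section 5, the sets involved stay small.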
Definition 3.8 (Piecewise signatures).
Assume a fixed k-letter alphabet A.
- The residual tensors associated with a word u form the set, denoted Res_u, of all maximal elements in T_u.
- The (piecewise) signature of u is the triple sig(u) = (E_u, E_ũ, Res_u), collecting Res_u and the two extension matrices.
Note that Res_u is never empty and usually contains several elements since the ordering on tensors is only a partial ordering when k > 1.
Example 3.9.
We continue with u = abba. Here T_u contains the ten tensors obtained as the pairwise products of the vectors from R̂_u and L̂_u listed earlier, see Example 3.6. It has five (pairwise incomparable) maximal elements, written as size-4 vectors read column by column:
(2, ∞, 2, ∞), (2, 2, ∞, ∞), (∞, 2, ∞, 2), (∞, ∞, 2, 2), and (4, 3, 3, 2) .
Note that Res_u only contains one copy of each maximal tensor even when it occurs at several positions in T_u: for instance (∞, 2, ∞, 2), obtained as (∞, 0) ⊗ (2, 2), occurs at positions (b, 0) and (b, 1).
The following lemma, needed in Section 4, can be seen as another example.
Lemma 3.10.
For a letter a of the k-letter alphabet A, Res_a contains k + 1 maximal tensors.
Proof.
Assume a ∈ A. The vectors occurring in R̂_a are the rows of the identity matrix I (in column 0) and the rows of E_a (in column 1). The table L̂_a is the mirror image of R̂_a. The resulting tensors are e_b ⊗ e_b for each letter b ≠ a, together with e_a ⊗ (1̄ + 1) and (1̄ + 1) ⊗ e_a, where 1̄ + 1 is the all-1 vector. Apart from duplicates, these k + 1 tensors are pairwise incomparable, hence the claim.
The signature of a word u, and more precisely the tensors in Res_u, provide enough information to compute h(u):
Theorem 3.11 (Piecewise complexity from signatures).
For any word u,
h(u) = 1 + max_{t ∈ Res_u} ( 1̄ · t ) .  (10)
Proof.
The proof is simple. Starting with Eq. (1), one has
h(u) = 1 + max_{a, i} ( r(u(0, i], a) + ℓ(u(i, n], a) )
= 1 + max_{a, i} T_u[a, i] · (1̄ ⊗ 1̄)  using Eq. (9) with w₁ = w₂ = ε,
= 1 + max_{t ∈ T_u} ( 1̄ · t )  where it is understood that 1̄ now denotes a size-k² tensor,
= 1 + max_{t ∈ Res_u} ( 1̄ · t )  since the dot-product is monotonic w.r.t. ≤.
Note that, for a k²-sized vector t, the product 1̄ · t in Eq. (10) is just the minimal entry in t. However the notation is quite convenient, and we shall extend the above reasoning with tensors different from 1̄.
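The resulting computation of h(u) from the residual tensors is a max–min; a minimal sketch, assuming the shape of Eq. (10) with its unit offset inherited from Eq. (1) (the function name is ours):

```python
def h_from_residuals(residuals):
    """h(u) = 1 + max over residual tensors t of (1-bar . t),
    where the tropical product 1-bar . t is just the minimal entry of t."""
    return 1 + max(min(t) for t in residuals)
```

With the residual set available, this step costs only O(|Res_u| · k²) time.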
Example 3.12.
Continuing with u = abba, every tensor in Res_u has minimal component 2 (see Example 3.9): for instance 1̄ · (4, 3, 3, 2) = 2. This yields max_t (1̄ · t) = 2 and h(u) = 3, in agreement with Example 2.2.
For complexity analysis we shall assume that signatures are simply represented as two k-by-k matrices together with a collection of k²-sized residual tensors. The values contained in these matrices and tensors are usually encoded using a logarithmic number of bits (e.g., they are written in binary).
Theorem 3.13 (Space complexity of piecewise signatures).
The size of the signature sig(u) of a word u is in O(p(k) · log n) for some polynomial p, where n = |u| and k = |A|.
Proof.
Let us write R for the cardinality of Res_u. Now sig(u) contains two matrices and R tensors, each made up of O(k²) scalars. These scalars are bounded by n (or are ∞) so each only uses O(log n) space. It is therefore enough to show that R is bounded by a polynomial in k. This is done in Section 5 where we prove such a bound (see Theorem 5.1).
3.5 Combining piecewise signatures
Since the objects in signatures accommodate concatenation, it is possible to compute them compositionally. The main result of this section is stated as:
Theorem 3.14 (Compositionality).
The signature sig(uv) of the concatenation of two words u and v can be computed from sig(u) and sig(v). Furthermore the computation of sig(uv) can be done in time polynomial in the sizes of sig(u) and sig(v).
The proof is developed in several steps.
Lemma 3.15.
Let w = uv. For 0 ≤ i ≤ |u|, the tensor at position (a, i) in T_w is R̂_u[a, i] ⊗ (L̂_u[a, i] · E_ṽ). Similarly, for |u| ≤ i ≤ |w|, the tensor at position (a, i) in T_w is (R̂_v[a, i − |u|] · E_u) ⊗ L̂_v[a, i − |u|].
The above statement uses tensor products of linear maps. Recall that when f, g are two linear maps, their tensor product f ⊗ g can be defined via (f ⊗ g)(x ⊗ y) := f(x) ⊗ g(y). When f and g are given by two m-by-k matrices M_f and M_g, one obtains an m²-by-k² matrix, denoted M_f ⊗ M_g, for f ⊗ g by taking the tensor products of the rows of M_f with the rows of M_g (as expected since the e_i ⊗ e_j form a base of 𝕋ᵏ ⊗ 𝕋ᵏ).
Proof (of Lemma 3.15).
We only prove the first claim since the second can be obtained directly by mirroring. Assume 0 ≤ i ≤ |u| and w = uv. By definition, T_w[a, i] is R̂_w[a, i] ⊗ L̂_w[a, i]. Now, by Eq. (5),
R̂_w[a, i] = R̂_u[a, i]  since w(0, i] = u(0, i],
while, by Eq. (7), L̂_w[a, i] is the ath row of E_m for m = ṽ · m′, where m′ is the mirror of u(i, |u|]. Since E_{ṽ·m′} = E_{m′} · E_ṽ, this ath row is L̂_u[a, i] · E_ṽ. Thus
T_w[a, i] = R̂_u[a, i] ⊗ ( L̂_u[a, i] · E_ṽ )
as claimed.
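The mixed-product law behind this reasoning, (A ⊗ B) · (x ⊗ y) = (A · x) ⊗ (B · y), holds verbatim in min-plus arithmetic; a small sketch with row-major flattened tensors (all names are ours):

```python
def t_matvec(M, v):
    """Tropical (min,+) matrix-vector product."""
    return [min(m + x for m, x in zip(row, v)) for row in M]

def t_outer(u, v):
    """Elementary tensor u (x) v, flattened: entry (i, j) is u[i] + v[j]."""
    return [a + b for a in u for b in v]

def t_kron(A, B):
    """Tropical Kronecker product: entry ((i,k),(j,l)) is A[i][j] + B[k][l]."""
    return [[A[i][j] + B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]
```

The identity holds because min distributes over + independently in each tensor factor, exactly as in the classical case.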
Lemma 3.16.
Let w = uv, and let φ and ψ denote the maps x ⊗ y ↦ x ⊗ (y · E_ṽ) and x ⊗ y ↦ (x · E_u) ⊗ y from Lemma 3.15. The tensors in Res_w are exactly the maximal tensors in φ(Res_u) ∪ ψ(Res_v).
Proof.
First we claim that every t ∈ Res_w belongs to φ(Res_u) ∪ ψ(Res_v). Indeed, consider t = T_w[a, i] in Res_w. We shall assume i ≤ |u| since the other case is symmetric. By Lemma 3.15, t is φ(T_u[a, i]). If T_u[a, i] ∈ Res_u we have proved t ∈ φ(Res_u). If, on the other hand, T_u[a, i] ∉ Res_u, then there is some maximal t′ ∈ Res_u that strictly dominates T_u[a, i]. Since linear applications are monotonic we deduce
t = φ(T_u[a, i]) ≤ φ(t′) .
Thus, and since t is a maximal element of T_w while φ(t′) occurs in T_w, necessarily t and φ(t′) coincide, so t is indeed the image by φ of a tensor from Res_u.
Now, Lemma 3.15 entails that φ(Res_u) ∪ ψ(Res_v) is included in T_w. So its maximal tensors belong to Res_w. We conclude that Res_w and the maximal tensors of φ(Res_u) ∪ ψ(Res_v) coincide.
Proof of Theorem 3.14.
With sig(u) = (E_u, E_ũ, Res_u) and sig(v) = (E_v, E_ṽ, Res_v)
one obtains
sig(uv) = (E_{uv}, E_{ṽũ}, Res_{uv})
as follows:
– E_{uv} and E_{ṽũ} are the matrix products E_v · E_u and E_ũ · E_ṽ;
– Res_{uv} is obtained by transporting the tensors of Res_u and Res_v as described in Lemma 3.16 and
removing the non-maximal elements.
All these operations are clearly polynomial-time.
4 Piecewise complexity of SLP-compressed words
An SLP is an acyclic context-free grammar where each non-terminal has only one production rule, i.e., the grammar is deterministic and generates a single word. When a long word has many repetitions it can often be described by an SLP of smaller size and thus SLPs can be used as a compressed data structure for words. Importantly, most compression schemes used for texts and files are equivalent, via polynomial-time encodings, to the mathematically more elegant SLPs. We refer to [17] for more background and details.
Formally, an SLP with m rules is an alphabet A together with a list of production rules X₁ → rhs₁, …, Xₘ → rhsₘ where each right-hand side rhsᵢ is either a letter from A or a concatenation Xⱼ Xₗ of two nonterminals with j, l < i. The size of an SLP S, denoted |S|, is in O(m · (log m + log k)), where k = |A|.
In the context of S, the expansion of a nonterminal Xᵢ is a word exp(Xᵢ) defined inductively via:
exp(Xᵢ) := a if rhsᵢ is the letter a ,  exp(Xᵢ) := exp(Xⱼ) · exp(Xₗ) if rhsᵢ = Xⱼ Xₗ .
Finally, the expansion of the SLP itself, written exp(S), is the expansion of its last nonterminal Xₘ. Note that the length of exp(S) can reach 2^{m−1}, hence log |exp(S)| is in O(m).
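An SLP can be represented as a dictionary mapping each nonterminal to a letter or to a pair of earlier nonterminals; expansion and expansion length are then two straightforward inductions (the naming is ours, and only the length computation stays polynomial):

```python
def expansion(rules, start):
    """exp(X): full expansion (exponential in the worst case -- illustration only)."""
    memo = {}
    def go(x):
        if x not in memo:
            r = rules[x]
            memo[x] = r if isinstance(r, str) else go(r[0]) + go(r[1])
        return memo[x]
    return go(start)

def expansion_length(rules, start):
    """|exp(X)| computed with one memoized addition per rule."""
    memo = {}
    def go(x):
        if x not in memo:
            r = rules[x]
            memo[x] = 1 if isinstance(r, str) else go(r[0]) + go(r[1])
        return memo[x]
    return go(start)
```

The signature computation of Theorem 4.1 below follows the same bottom-up traversal, replacing concatenation by the signature combination of Theorem 3.14.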
Theorem 4.1 (Piecewise signatures of compressed words).
Given an SLP S, one can compute sig(exp(S)) in time polynomial in m and k, where m is the number of rules of S and k is the size of its alphabet.
Proof.
The algorithm computes the piecewise signatures sig(exp(Xᵢ)) for all the nonterminals Xᵢ of S, for i = 1, …, m. There are two cases to handle:
1. when Xᵢ is given by a rule Xᵢ → a, the signature of exp(Xᵢ) = a has extension matrices given by Eq. (4) and residual tensors given by Lemma 3.10;
2. when Xᵢ is given by a rule Xᵢ → Xⱼ Xₗ, the signature is obtained by combining sig(exp(Xⱼ)) and sig(exp(Xₗ)) as shown in Theorem 3.14.
Each of these steps can be performed in time polynomial in m and k since, by Theorem 3.13, each sig(exp(Xᵢ)) has size polynomial in k and in log |exp(Xᵢ)|, hence in k and m. Combining with Theorem 3.11 we obtain:
Corollary 4.2.
The piecewise complexity h(exp(S)) of (the word represented by) an SLP S can be computed in time polynomial in |S|.
5 The size of piecewise signatures
Fix some alphabet A with |A| = k. The goal of this section is to bound the number of tensors, written |Res_u|, in any residual set Res_u. Such bounds are crucial for the complexity results in Theorems 3.13, 3.14, and 4.1. We first show that |Res_u| is bounded by a polynomial in k. This is later complemented with Theorem 5.5, where we exhibit a family of witnesses proving a matching lower bound.
Theorem 5.1.
For any word u, |Res_u| is bounded by a polynomial in k = |A|.
The proof of Theorem 5.1 is organized in three main lemmas.
For any word u and any position i, let ρᵢ denote the ith column of R_u. The columns of R_u respect some patterns that are easier to analyze if we assume that the letters of A are listed in the order they first appear in u (with the letters not occurring in u listed at the end). This is no loss of generality since a permutation of A just amounts to a permutation of the indices in our matrices. Wherever we make that assumption, we say that u and A are aligned, or sometimes more succinctly that “u is aligned”.
Lemma 5.2.
Assume that u is aligned and only uses the first m letters of A, with m ≤ k. Then every column ρᵢ agrees with one of two explicit patterns, depending on whether the last letter of the prefix u(0, i] already occurs earlier in it: each pattern is a vector shape determined by a single scalar parameter and a threshold index separating the letters occurring in u(0, i] from the remaining ones.
Proof.
By induction on the length of u. For u = ε we have ρ₀ = 1̄ and the claim holds trivially.
Now assume that ua is aligned and that the claim holds for u. First let us see how the last column of R_{ua} is obtained from the last column of R_u: we have ρ(ua) = E_a · ρ(u). With Eq. (4) we see that the a-entry of the new column involves a tropical scaling, which means “increase every value by 1”, applied after a tropical sum, which means “collect minima componentwise”; every other entry is copied unchanged.
We then consider whether a occurs in u. If a occurs in u, the update applies to a column matching one of the two patterns, and a case inspection shows that the result again matches the first pattern. Similarly, if a does not occur in u (in which case, by the alignment assumption, a is the next unused letter of A), the inspection shows that the result matches the second pattern.
Finally, all cases yield results that agree with the claim and the proof is complete. When u is not aligned, the patterns above still apply, but modulo a permutation of the lines and columns of R_u. Finally, the patterns also apply to L_u, i.e., to R_ũ, but now alignment considers the letters of u in their order of last appearance.
Lemma 5.3.
If all the letters of A occur in u, then |Res_u| is bounded by a polynomial in k.
Proof.
Since Res_u collects the maximal elements of T_u, it is in particular an antichain among the tensors of T_u, and we can bound its size by the size of the largest antichains in this set (also known as its width, as in Dilworth’s Theorem). Recall that the tensors in T_u are partially ordered componentwise.
Recall also, from Definition 3.7, that a tensor of T_u is obtained by combining a vector from R̂_u with a vector from L̂_u, i.e., building R̂_u[a, i] ⊗ L̂_u[a, i] for a factorization u = u(0, i] · u(i, n] and a letter a. To simplify notations, when we consider a prefix of u, we assume that the letters of A are listed in the order they first appear in u. When looking at a suffix of u, we assume that A is listed in order of last appearances in u: making the two different assumptions simultaneously is not a problem since they apply to vectors that will be combined via a tensor product, and the two implicit reindexings can be carried out independently on the two factors.
Now remember that, according to Lemma 5.2, each R̂_u[a, i] matches one of the two patterns:
1. a first pattern when a appears in u(0, i], parameterized by a scalar;
2. a second pattern when a does not appear in u(0, i].
In both cases the letters that do not appear in u(0, i] contribute fixed entries. Similarly, for any a and any suffix u(i, n] of u, L̂_u[a, i] matches one of the two patterns. Furthermore, for any factorization u = u(0, i] · u(i, n], the two factors cannot both match the second pattern since a has to occur in u(0, i] or in u(i, n].
First let us bound the width of the set of tensors whose two factors both match the first pattern, for a fixed factorization type. Representing tensors in matrix form, and ignoring the rows and columns contributed by absent letters, such tensors are determined by the two scalar parameters of their factors, and two of them are incomparable only when their parameters vary in opposite directions; since each parameter takes boundedly many values for a fixed factorization type, the corresponding antichains have size polynomial in k.
Let us now bound the width of the set of tensors whose left factor matches the second pattern (the right factor then matches the first one). Any two such tensors are comparable when they involve the same distinguished letter; since there are at most k distinguished letters, the corresponding antichains have size polynomial in k. Symmetrically, the same bound holds for the tensors whose right factor matches the second pattern.
Now observe that every tensor of T_u falls in one of these three subsets, indexed by pairs counting the distinct letters occurring in the prefix and in the suffix of the corresponding factorization. These pairs depend on the order of appearances and disappearances of letters in u, hence they can be ordered so that the first component is increasing and the second is decreasing, and there are thus at most O(k) relevant pairs. Summing the three bounds, we end up seeing that, as claimed, any antichain among the tensors of T_u has size bounded by a polynomial in k.
When u does not use all the letters of A, we write A_u for the subset of A that collects the letters occurring in u. We then have two residual sets depending on whether we consider the underlying alphabet to be A or A_u. Accordingly we write Res_u^A and Res_u^{A_u} (and also T_u^A, etc.).
Lemma 5.4.
Assume that a ∈ A does not occur in u, and let A′ = A ∖ {a}. Then |Res_u^A| is polynomially bounded in terms of |Res_u^{A′}| and k.
(See Section A.1 for the proof.) We can now wrap up this subsection.
Proof of Theorem 5.1.
Write for . By Lemma 5.3, . Invoking Lemma 5.4 times we obtain the required
We conclude this section with a matching lower bound:
Theorem 5.5 (A matching lower bound).
For each k ∈ ℕ, there exists a word u_k over a k-letter alphabet such that |Res_{u_k}| matches the upper bound of Theorem 5.1 up to a constant factor.
Describing and analyzing the witnesses u_k is quite involved, so the proof has been relegated to Section A.2.
6 Concluding remarks
We developed a tropical algebraic approach to the computation of the r and ℓ side distances that are the basis for the piecewise complexity of words. This allowed us to construct compact piecewise signatures and to combine them elegantly and efficiently. An outcome of this approach is a polynomial-time algorithm for the piecewise complexity of compressed words. The proof that piecewise signatures have at most polynomially many elements, in the size k of the alphabet, is technically more involved but is made clearer in the tropical framework.
One can certainly derive more results from our tropical approach to piecewise complexity. For example, recall that the subword universality index ι(u) of a word u is the largest m such that u contains all the length-m words of A* as subwords [1, 5].
Theorem 6.1.
ι(u) = 0 iff some letter of A does not occur in u. When every letter of A occurs in u, ι(u) is the value of the minimal entry in ρ(u) = E_u · 1̄.
(See the appendix for a proof.) This provides a simpler way (compared to [25]) of computing the universality index of SLP-compressed words in polynomial time.
We hope that a tropical viewpoint can similarly profit related investigations on piecewise distance, canonical representatives modulo Simon’s congruence, piecewise separability, or piecewise-testable languages and associated automata.
The tropical setting also leads to clearer code. We implemented all the algorithms presented in this paper, in order to analyze examples and derive the results in Section 5, or to test huge SLPs. Once a software library for tropical vectors and matrices is set up, the implementation is just a direct transcription of the algebraic formulae given in the paper.
References
- [1] L. Barker, P. Fleischmann, K. Harwardt, F. Manea, and D. Nowotka. Scattered factor-universality of words. In Proc. DLT 2020, volume 12086 of Lecture Notes in Computer Science, pages 14–28. Springer, 2020. doi:10.1007/978-3-030-48516-0_2.
- [2] M. Bojańczyk, L. Segoufin, and H. Straubing. Piecewise testable tree languages. Logical Methods in Comp. Science, 8(3), 2012. doi:10.2168/LMCS-8(3:26)2012.
- [3] P. Butkovič. Max-algebra: the linear algebra of combinatorics? Linear Algebra and its Applications, 367:313–335, 2003. doi:10.1016/S0024-3795(02)00655-9.
- [4] O. Carton and M. Pouzet. Simon’s theorem for scattered words. In Proc. DLT 2018, volume 11088 of Lecture Notes in Computer Science, pages 182–193. Springer, 2018. doi:10.1007/978-3-319-98654-8_15.
- [5] J. D. Day, P. Fleischmann, M. Kosche, T. Koß, F. Manea, and S. Siemer. The edit distance to k-subsequence universality. In Proc. STACS 2021, volume 187 of Leibniz International Proceedings in Informatics, pages 25:1–25:19. Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.STACS.2021.25.
- [6] V. Diekert, P. Gastin, and M. Kufleitner. A survey on small fragments of first-order logic over finite words. Int. J. Foundations of Computer Science, 19(3):513–548, 2008. doi:10.1142/S0129054108005802.
- [7] M. Droste, W. Kuich, and H. Vogler, editors. Handbook of Weighted Automata. Monographs in Theoretical Computer Science. Springer, 2009.
- [8] L. Fleischer and M. Kufleitner. Testing Simon’s congruence. In Proc. MFCS 2018, volume 117 of Leibniz International Proceedings in Informatics, pages 62:1–62:13. Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPIcs.MFCS.2018.62.
- [9] S. Gaubert and Max Plus. Methods and applications of (max, +) linear algebra. In Proc. STACS ’97, volume 1200 of Lecture Notes in Computer Science, pages 261–282. Springer, 1997. doi:10.1007/BFb0023465.
- [10] P. Gawrychowski, M. Kosche, T. Koß, F. Manea, and S. Siemer. Efficiently testing Simon’s congruence. In Proc. STACS 2021, volume 187 of Leibniz International Proceedings in Informatics, pages 34:1–34:18. Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.STACS.2021.34.
- [11] J. Goubault-Larrecq and S. Schmitz. Deciding piecewise testable separability for regular tree languages. In Proc. ICALP 2016, volume 55 of Leibniz International Proceedings in Informatics, pages 97:1–97:15. Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPIcs.ICALP.2016.97.
- [12] S. Halfon, Ph. Schnoebelen, and G. Zetzsche. Decidability, complexity, and expressiveness of first-order logic over the subword ordering. In Proc. LICS 2017, pages 1–12. IEEE Comp. Soc. Press, 2017. doi:10.1109/LICS.2017.8005141.
- [13] B. Heidergott, G. J. Olsder, and J. van der Woude. Max Plus at Work. Modelling and Analysis of Synchronized Systems: A Course on Max-Plus Algebra and its Applications. Princeton Series in Applied Mathematics. Princeton University Press, 2006.
- [14] P. Karandikar, M. Kufleitner, and Ph. Schnoebelen. On the index of Simon’s congruence for piecewise testability. Information Processing Letters, 115(4):515–519, 2015. doi:10.1016/j.ipl.2014.11.008.
- [15] P. Karandikar and Ph. Schnoebelen. The height of piecewise-testable languages and the complexity of the logic of subwords. Logical Methods in Comp. Science, 15(2), 2019. doi:10.23638/LMCS-15(2:6)2019.
- [16] V. I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. Journal of Combinatorial Theory, Series A, 93(2):310–332, 2001. doi:10.1006/jcta.2000.3081.
- [17] M. Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complexity Cryptology, 4(2):241–299, 2012. doi:10.1515/gcc-2012-0016.
- [18] S. Lombardy and J. Mairesse. Max-plus automata. In Jean-Éric Pin, editor, Handbook of Automata Theory. Volume I: Theoretical Foundations, chapter 5, pages 151–188. EMS Press, 2021. doi:10.4171/Automata-1/5.
- [19] D. G. Northcott. Multilinear Algebra. Cambridge Univ. Press, 1984.
- [20] P. P. Pach. Normal forms under Simon’s congruence. Semigroup Forum, 97:251–267, 2018. doi:10.1007/s00233-017-9910-5.
- [21] J.-É. Pin. The influence of Imre Simon’s work in the theory of automata, languages and semigroups. Semigroup Forum, 98:1–8, 2019. doi:10.1007/s00233-019-09999-8.
- [22] J.-É. Pin. Tropical semirings. In J. Gunawardena, editor, Idempotency, pages 50–69. Cambridge Univ. Press, 1998. doi:10.1017/CBO9780511662508.004.
- [23] M. Praveen, Ph. Schnoebelen, J. Veron, and I. Vialard. On the piecewise complexity of words and periodic words. In Proc. SOFSEM 2024, volume 14519 of Lecture Notes in Computer Science, pages 456–470. Springer, 2024. doi:10.1007/978-3-031-52113-3_32.
- [24] J. Sakarovitch and I. Simon. Subwords. In M. Lothaire, editor, Combinatorics on Words, volume 17 of Encyclopedia of Mathematics and Its Applications, chapter 6, pages 105–142. Cambridge Univ. Press, 1983.
- [25] Ph. Schnoebelen and J. Veron. On arch factorization and subword universality for words and compressed words. In Proc. WORDS 2023, volume 13899 of Lecture Notes in Computer Science, pages 274–287. Springer, 2023. doi:10.1007/978-3-031-33180-0_21.
- [26] Ph. Schnoebelen and I. Vialard. On the piecewise complexity of words. Acta Informatica, 62(1), 2025. doi:10.1007/s00236-025-00480-4.
- [27] I. Simon. Hierarchies of Events with Dot-Depth One. PhD thesis, University of Waterloo, Dept. Applied Analysis and Computer Science, 1972. URL: http://maveric.uwaterloo.ca/reports/1972_Simon_PhD.pdf.
- [28] I. Simon. Piecewise testable events. In Proc. 2nd GI Conf. on Automata Theory and Formal Languages, volume 33 of Lecture Notes in Computer Science, pages 214–222. Springer, 1975. doi:10.1007/3-540-07407-4_23.
- [29] I. Simon. Limited subsets of a free monoid. In Proc. FOCS ’78, pages 143–150. IEEE Comp. Soc. Press, 1978. doi:10.1109/SFCS.1978.21.
Appendix A Proofs omitted from main text
A.1 Proof of Lemma 5.4
See 5.4
We first observe that
As a consequence, and when , a in is extended with one . When , is where is if occurs in , and is otherwise. The same reasoning applies to a in , except that now, when , one considers occurrences in .
Looking now at , a tensor with is just the corresponding with extra ’s inserted at fixed positions. Therefore, for , the tensors at positions and are ordered (or incomparable) in exactly as in . We also have to consider the new tensors at positions : they all end with hence cannot dominate any tensor on a line with (these end with ). However they can be incomparable with these and contribute new maximal elements. Now, there can be at most incomparable tensors on the line since they can be ordered with pairs as in the proof of Lemma 5.3.
Finally contains all the maximal elements of (duly extended) and at most , i.e., , new maximal elements taken from the last (new) row. Hence the claim.
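The matrix compositions underlying this proof are tropical matrix products. As a minimal, self-contained sketch (the function name and the toy matrices are illustrative assumptions, and whether (min, +) or (max, +) is the relevant semiring here depends on definitions omitted from this excerpt, so the (min, +) choice below is only for concreteness):

```python
import math

INF = math.inf  # the tropical "zero" of the (min, +) semiring


def trop_mat_mult(A, B):
    """(min, +) product: (A . B)[i][j] = min over k of A[i][k] + B[k][j].

    This mirrors how the extension matrix of a concatenation is obtained
    from the matrices of its two factors in the computations above.
    """
    n, m, p = len(A), len(B), len(B[0])
    return [[min(A[i][k] + B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


# Toy 2x2 example over the tropical semiring:
A = [[0, 2], [INF, 0]]
B = [[0, 1], [3, 0]]
C = trop_mat_mult(A, B)  # -> [[0, 1], [3, 0]]
```

Associativity of this product is what lets signatures of components be composed in any bracketing, which is the key to handling SLP-compressed words.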
A.2 Proof of Theorem 5.5
See 5.5
We have to define a word over a -letter alphabet and show that .
Fix , sometimes written in examples. We let denote the word that lists the letters in order. For we further write for the suffix , i.e., the last letters of . We then define and via
Observe that is a palindrome. As an example, here are and for :
We start by computing the extension matrices associated with various prefixes and suffixes of . An easy induction on provides the following -by- matrices for any :
From these, we can deduce the -by- extension matrices for the where :
Combining the above matrices, we compute the extension matrix for :
with as the maximal value, filling the bottom-right part. A consequence is that , obtained as the product is the -by- all-s matrix.
Let us now compute and , noting that is :
Finally, for a factorization with , we compute and , noting that and .
Lemma A.1.
contains all the and the for .
Proof.
Indeed this just expresses the and vectors of and as columns of the matrices we computed. Note that, in the claim, the products of the first kind are taken from the first columns of , while the products of the second kind come from its last columns but are expressed in terms of the first columns of and , taking advantage of the symmetry in .
Looking now at the matrix for we see that, for , the th column is in the notation used for Lemma 5.2, while the th column of is .
Lemma A.2.
The set
is an antichain in .
Proof (Sketch).
We know that these tensors are in  since they are exactly those listed in Lemma A.1, now expressed as  and  patterns. Stated in this form, it is easy to see that they are pairwise incomparable, hence they form an antichain.
We now conclude with the announced result:
Lemma A.3.
.
Proof.
After the above preparations, and since  has cardinality , it is enough to prove that the tensors in  are maximal in . However, we cannot do precisely that in this extended abstract.
Instead, consider that every  for all , contains a . If there exists a factorization  and some  such that  is strictly dominated by , then necessarily , thus  contains a  at every position where  does. Hence we deduce that either  is a prefix of  and , or  with  and . However, in these few specific cases, which are easy to list, we can readily check that the full product  is not dominated by . So all but two tensors in  are already seen to be maximal.
There remains the case and . Here it is harder to show that , i.e., , is maximal since this tensor has no and thus could possibly be dominated by some for any with , i.e., for a factorization where both and contain the full alphabet.
However, since  is incomparable with the rest of , if  is not maximal, then  contains a tensor that dominates , hence that necessarily contains no  and thus does not dominate any other tensor from . (The same argument goes for the symmetric tensor on the right-hand side of .) Finally, the announced lower bound on the size of  is established.
A.3 Proof of Theorem 6.1
See 6.1
First, if  occurs in , then  does not contain every letter of the alphabet, hence .
In the other cases, we consider the arch factorization of (see [25]). Observe that when contains exactly one arch and ends with letter , the -th row of is a row of , while in the other rows only the numbers and appear.
If , then  can be factorized as  where every  is an arch and  does not contain every letter of the alphabet. Let  be the last letter of . Then observe that  contains only the numbers  and , and its -th row is a row of ’s. Thus,  contains only the numbers  and , and  for any  such that  does not appear in .
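The arch factorization from [25] used in this proof splits a word into arches, i.e., minimal factors each containing every letter of the alphabet, followed by a rest that misses some letter. A minimal greedy sketch, assuming the standard definition (the function name is an illustrative choice):

```python
def arch_factorization(w, alphabet):
    """Greedy arch factorization, as in [25]: split w into arches
    (minimal factors containing every letter of `alphabet`), followed
    by a rest that does not contain every letter of the alphabet."""
    arches, seen, start = [], set(), 0
    for i, a in enumerate(w):
        seen.add(a)
        if seen == set(alphabet):      # current factor just became an arch
            arches.append(w[start:i + 1])
            seen, start = set(), i + 1
    return arches, w[start:]           # (list of arches, rest)


arches, rest = arch_factorization("abcabbca", "abc")
# -> arches == ["abc", "abbc"], rest == "a"
```

Note that each arch ends with a letter occurring exactly once in it (its last letter), which is the letter  singled out in the argument above.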