The Equivalence Problem of E-Pattern Languages with Length Constraints Is Undecidable
Abstract
Patterns are words with terminals and variables. The language of a pattern is the set of words obtained by uniformly substituting all variables with words that contain only terminals. Length constraints restrict valid substitutions of variables by associating the variables of a pattern with a system (or disjunction of systems) of linear diophantine inequalities. Pattern languages with length constraints contain only words in which all variables are substituted to words with lengths that fulfill such a given set of length constraints. We consider membership, inclusion, and equivalence problems for erasing and non-erasing pattern languages with length constraints.
Our main result shows that the erasing equivalence problem - one of the most prominent open problems in the realm of patterns - becomes undecidable if length constraints are allowed in addition to variable equality.
Additionally, it is shown that the terminal-free inclusion problem, a prominent problem which has been shown to be undecidable in the binary case for patterns without any constraints, is also generally undecidable for all larger alphabets in this setting.
Finally, we also show that considering regular constraints, i.e., associating variables also with regular languages as additional restrictions together with length constraints for valid substitutions, results in undecidability of the non-erasing equivalence problem. This sets a first upper bound on constraints to obtain undecidability in this case, as this problem is trivially decidable in the case of no constraints and as it has unknown decidability if only regular or only length constraints are considered.
Keywords and phrases:
Patterns, Pattern Languages, Length Constraints, Regular Constraints, Decidability, Undecidability, Membership, Inclusion, EquivalenceFunding:
Max Wiedenhöft: This work was supported by the DFG project number 437493335.Copyright and License:
![[Uncaptioned image]](x1.png)
2012 ACM Subject Classification:
Theory of computation Formal languages and automata theoryEditors:
Paola Bonizzoni and Veli MäkinenSeries and Publisher:

1 Introduction
A pattern is a finite word consisting only of symbols from a finite set of letters , also called terminals, and from an infinite set of variables such that we have . It is a natural and compact device to define formal languages. From patterns, we obtain words consisting only of terminals using a substitution , a terminal preserving morphism that maps all variables in a pattern to words over the terminal alphabet. The language of a pattern is the set of all words obtainable using arbitrary substitutions.
We differentiate between two kinds of substitutions. In the original definition of patterns and pattern languages introduced by Angluin [1], only words obtained by non-erasing substitutions are considered. Here, all variables are required to be mapped to non-empty words. The resulting languages are called non-erasing (NE) pattern languages. Later, so called erasing-/extended- or just E-pattern languages have been introduced by Shinohara [35]. Here, variables may also be substituted by the empty word . Consider, for example, the pattern . Then, if we map to and to using a substitution , we obtain the word . Considering the E-pattern language of , we could also map to the empty word and obtain any word in the language .
Due to its practical and simple definition, patterns and their corresponding languages occur in numerous areas in computer science and discrete mathematics. These include, for example, unavoidable patterns [18, 23], algorithmic learning theory [1, 6, 36], word equations [23], theory of extended regular expressions with back references [12], or database theory [10, 34].
There are three main decision problems regarding patterns and pattern languages. Those are the membership problem (and its variations [14, 15, 7]), the inclusion problem, and the equivalence problem. All are considered in the erasing (E) and non-erasing (NE) cases. The membership problem determines if a word belongs to the language of a pattern. This problem has been shown to be NP-complete for both, erasing- and non-erasing, pattern languages [1, 18]. The inclusion problem determines whether the language of one pattern is a subset of the language of another pattern. It has been shown to be generally undecidable by Jiang et al. in [19]. Freydenberger and Reidenbach [11] as well as Bremer and Freydenberger [3] improved that result and showed that it is undecidable for all bounded alphabets of size for both erasing and non-erasing pattern languages. The equivalence problem asks whether the languages of two patterns are equal. For NE-pattern languages, this problem is trivially decidable and characterized by the equality of patterns up to a renaming of their variables [1]. The decidability of the erasing case is one of the major open problems in the field [19, 30, 29, 28, 31]. For terminal-free patterns, however, i.e., patterns without any terminal letters, the inclusion problem as well as the equivalence problem for E-pattern languages have been characterized and shown to be NP-complete [19, 5]. In addition to that, in the case of terminal-free NE-pattern languages, Saarela [32] has shown that the inclusion problem is undecidable in the case of a binary underlying alphabet.
Over time, various extensions to patterns and pattern languages have been introduced, either, to obtain additional expressibility due to some practical context or to get closer to an answer for the remaining open problems. Some examples are the bounded scope coincidence degree, patterns with bounded treewidth, -local patterns, or strongly-nested patterns (see [4] and references therein). Koshiba [21] introduced so called typed patterns that restrict substitutions of variables to types, i.e., arbitrary recursive languages. Geilke and Zilles [16] extended this recently to the notion of relational patterns and relational pattern languages. In a similar more recent context and with specific relational constraints such as equal length, besides string equality, Freydenberger discusses in [8], building on [2, 13, 9], so called conjunctive regular path queries (CRPQs) with string equality and with length equality. They can be understood as systems of (relational) patterns with regular constraints (restricting variable substitutions to words of given regular languages) and with equal length constraints between variables.
In [26], a special form of typed patterns has been considered, i.e., patterns with regular constraints (also comparable to singleton sets of H-Systems [13]). Here, like mentioned before, variables may be restricted to arbitrary regular languages and the same variable may occur more than once. It has been shown that this notion suffices to obtain undecidability for both main open problems regarding pattern languages, i.e., the equivalence problem of E-pattern languages and the inclusion problem of terminal-free NE-pattern languages with an alphabet greater or equal to . Another natural extension other than regular constraints is the notion of length constraints. Here, instead of restricting the choice of words for the substitution of variables, length constraints just restrict the lengths of substitution of variables in relation to each other. In the field of word equations, length constraints have been considered as a natural extension for a long time and, e.g., answering the decidability of the question whether word equations with length constraints have a solution, is a long outstanding problem (see, e.g., [22] and the references therein).
In this paper, we consider that natural extension of length constraints on patterns, resulting in the class of patterns called patterns with length constraints. In general, we say that a length constraint is a disjunction of systems of linear (diophantine) inequalities over the variables of . We denote the set of all length constraints by . A pattern with length constraints is a pattern associated with a length constraint. We say that a substitution is -valid if all variables are substituted according to . Now, the language of is defined analogously to pattern languages but restricted to -valid substitutions in the erasing- and non-erasing cases.
We examine erasing (E) and non-erasing (NE) pattern languages with length constraints. It can be shown, following existing results for patterns without additional constraints, that the membership problem for both cases in NP-complete. The inclusion problem is shown to be undecidable in both cases, too, notably even for terminal-free pattern languages, which is a difference to the decidability of the inclusion problem in the erasing case for pattern languages without any constraints, and which answers the case of alphabet sizes greater or equal to for non-erasing patterns without constraints. The main result of this paper is the undecidability of the equivalence problem for erasing pattern languages with length constraints in both cases, terminal-free and general, giving an answer to a problem of which the decidability has been an open problem for a long time in the case of no constraints. The final result shows that regular constraints and length constraints combined suffice to show undecidability of the equivalence problem for non-erasing pattern languages, a problem that is trivially decidable in case of no constraints and still open in the cases of just regular constraints or just length constraints.
We begin by introducing the necessary notation in Section 2 and then continue in Section 3 with an examination of patterns with length constraints and their corresponding languages. Here, we will briefly discuss all results regarding the related decision problems (membership, inclusion, equivalence). Due to the extensiveness of some proofs, see [27] for the full formal versions of the shortened ones. In Section 4, we continue with an examination of patterns with regular and length constraints, also giving a full picture of the related decision problems. Finally, in Section 5, we close this paper with a summary and discussion of the obtained results, the methods that were used, and the open problems that remain.
2 Preliminaries
Let denote the natural numbers. For set . Denote and . The powerset of any set is denoted by . An alphabet is a non-empty finite set whose elements are called letters or terminals. A word is a finite sequence of letters from . Let be the set of all finite words over , thus it is a free monoid with concatenation as operation and the empty word as the neutral element. Set . We call the number of letters in a word length of , denoted by . Therefore, we have . Let denote the set of all words of length (resp. or ). If for some , we call a prefix of , a factor of , and a suffix of and denote the sets of all prefixes, factors, and suffixes of by , , and respectively. For words , let denote the number of distinct occurrences of in as a factor. For , let denote ’s letter for all . For reasons of compactness, we denote by for all with . Set as ’s alphabet.
Now, we introduce the notion of patterns and pattern languages with and without additional constraints (i.e., length constraints, length constraints, and a combination of both). After that, we briefly introduce the machine models used in the proofs of the main results of this paper.
2.1 Patterns and Pattern Languages with Constraints
Let be a countable set of variables and be an alphabet such that . A pattern is then a non-empty, finite word over . A pattern is called terminal-free if it just consists of variables and is, thus, a non-empty finite word over . The set of all patterns over is denoted by . For example, is a pattern over with . For a pattern , let denote the set of variables occurring in . A substitution of is a morphism such that for all and for all . If we have for all , we call a non-erasing substitution for . Otherwise, is an erasing substitution for . The set of all substitutions w.r.t. is denoted by . If is clear from the context, we may write just . Given a pattern , its erasing pattern language and its non-erasing pattern language are defined respectively by
A length constraint is a disjunction of systems of linear diophantine inequalities over variables of . We denote the set of all length constraints by . A pattern with length constraints is a pair where all variables occurring in must occur in . We denote the set of all patterns with length constraints by . For some and , we say that is a -valid substitution if is satisfied when associating each variable in with the value , i.e., the length of the substitution of the variable . Consider the following example.
Example 1.
Let and . Assume we have the length constraint defined by the following system of linear diophantine inequalities:
Then, we know that any -valid substitution cannot have for some with , as this would already imply and . Also, we see that as the second constraint demands a substitution of length at least 1. So, for example we could have or but not or .
We denote the set of all -valid substitutions by . The notion of pattern languages is extended by the following. For any we denote by
the erasing pattern language with length constraints of and by
the non-erasing pattern language with length constraints of .
Similar to length constraints, we can define regular constraints for variables in a pattern. Let be the set of all regular languages. We call a mapping a regular constraint on . If not stated otherwise, we always have . We denote the set of all regular constraints by . For some we define the language of a variable by . If is clear by the context, we omit it and just write . A pattern with regular constraints is a pair . We denote the set of all patterns with regular constraints by . For some and , we say that is a -valid substitution if for all . The set of all -valid substitutions is denoted by . Given some , we analogously define the erasing- and non-erasing pattern languages with regular constraints and over as we did for length constraints.
Combining both, we say that a triple is a pattern with regular and length constraints and denote the set of all patterns with regular and length constraints by . Given some , we say that a substitution is --valid if it is -valid and -valid and denote the set of all --valid substitutions by . Additionally, we analogously define the erasing- and non-erasing pattern languages with regular and length constraints and over as we did in the previous two cases for length constraints and regular constraints.
In the proofs of our main results Theorem 8, Theorem 12, and Theorem 19, we use two different automata models with undecidable emptiness problems to obtain each result. We will use the notion of nondeterministic 2-counter automata without input (see, e.g., [17]), as well as, the notion of a very specific universal Turing machine as it is used in [3]. As mentioned in the introduction, due to their extensive lengths, the proofs of each of the main theorems is only given as an extended sketch in this paper together with the appendix111See [27] for the full formal proofs.. In the main body, only a rough explanation, not relying on a formal definition of the machine types, of each proof idea is provided. Hence, the formal definition of the notion of nondeterministic 2-counter automata without input and the referenced universal Turing machine are found in Appendix A and in Appendix B, respectively. We just mention that refers to the set of encodings of valid computations of some nondeterministic 2-counter machine without input and that refers to the set of encodings of valid computations from some initial configuration over the universal Turing machine .
We continue with our overview of the results regarding pattern languages with length constraints.
3 Results for Pattern Languages with Length Constraints
To begin developing an understanding of the additional expressiveness gained by allowing for length constraints, notice the following observation for patterns with length or with regular constraints which does not necessarily hold for patterns without any constraints.
Lemma 2.
For each pattern with length constraints (and for each pattern with regular constraints ), there exists some adapted set of length constraints (resp., some adapted set of regular constraints ) such that (and resp.).
Proof.
Indeed, given some pattern with length constraints , we can define the length constraint by using all constraints in and additionally, for each , adding the constraint to . Then, .
We obtain the same result for pattern languages with regular constraints by intersecting the language of all variables with , i.e., given any language for some variable that is defined by some regular constraint , we can define a regular constraint that defines the language . Then, given some pattern , we obtain .
The following statement then immediately follows by the previous lemma.
Corollary 3.
Solving any problem for erasing pattern languages with length (or regular) constraints is at least as hard as solving the same problem for non-erasing pattern languages with length constraints.
As shown in [26], considering regular constraints, problem cases over all patterns can be reduced to problems involving terminal-free patterns. The same does not hold in general in the case of length constraints, witnessed by the following proposition.
Proposition 4.
There exists such that no with a terminal-free pattern exists, for which we have that for .
Proof.
Assume and assume w.l.o.g. with . Let such that and is an empty length constraint. So, (). Suppose there exists some and length constraint such that (). W.l.o.g. we continue with the erasing case. The following also holds for the non-erasing case with small changes. Let . Then for some . As , there exists some such that . Then, for and as well as for some such that , and . But then, we also find some with as and are obtained at the same position by the same variable. Then, for some . As but , we know that cannot exist.
With that, we know that we cannot restrict ourselves to only show hardness or undecidability for the general case, as the terminal-free case may result in something else. We begin with the membership problems for both, the erasing- and non-erasing pattern languages. Those have been shown to be NP-complete for patterns without constraints in the terminal-free and general cases (see, e.g.,[1, 18, 33]). Hence, we observe the following for patterns with length constraints.
Corollary 5.
Let and . The decision problem of whether for is NP-complete, even if the considered pattern is terminal-free.
Indeed, NP-hardness is immediately obtained by the NP-hardness of patterns without any constraints. Given some , NP containment follows by the fact that any valid certificate results in a substitution of length at most and that we can check in polynomial time whether some given substitution is -valid. This can be done by checking whether the lengths of the variable substitutions for all satisfy . This concludes the membership problem for all cases.
Another result that can be immediately obtained is the general undecidability of the inclusion problem for pattern languages with length constraints. For pattern languages without any constraints, this problem has been generally shown to be undecidable, first, for unbounded alphabets by Jiang et al. [19] and later on for all bounded alphabets of sizes by Freydenberger and Reidenbach in [11] as well as Bremer and Freydenberger in [3]. So, we have the following.
Theorem 6 ([19, 11, 3]).
Let . In general, for all alphabets with , it is undecidable to answer whether for .
Out of that, in the general case, we immediately obtain the following for pattern languages with length constraints.
Corollary 7.
Let . For all alphabets with it is undecidable to answer whether for .
That leaves us with the terminal-free cases of the inclusion problem as well as the general and terminal-free cases of the equivalence problems for both, erasing and non-erasing pattern languages with length constraints. The first and most significant main result shows that the most prominent open problem for pattern languages, i.e., the equivalence problem for erasing patterns, is undecidable for pattern languages with length constraints, even if we are restricted to terminal-free patterns and use no disjunctions in the length constraints.
Theorem 8.
Let . The problem of deciding whether we have is undecidable for all alphabets with , even if and are restricted to be terminal-free and even if and have no disjunctions.
Proof.
Based on the proofs in, e.g., [19] and [11], we reduce the emptiness problem for nondeterministic 2-counter automata without input to the equivalence problem of erasing pattern languages with length constraints. Due to the length of the formal proof, here, we only give a sketch of the idea and provide the formal construction of the reduction. For a full extension of the proof containing a collection of shown properties and a formal proof of the correctness of the reduction itself, see [27].
Given an automaton , we construct two patterns with length constraints and such that if and only if . The pattern with length constraints can be seen as a pattern which allows for all possible words of a specific format to be obtained, especially also encodings of valid computations of . The pattern with length constraints can be seen as a tester pattern that allows only for words that are not encodings of valid computations of that have this specific structure. By that, if has some valid computation and, hence, an encoding of a valid computation of exists, we have that there exists a word in that is not in , hence .
The previous property is the main argument used in the proofs in [19, 11] to show undecidability of the inclusion problem for erasing pattern languages. In addition to that, using length constraints, and can be constructed in such a way, that we always have . This is the most significant difference to the aforementioned results. Thus, in this case, we have that the equivalence of and decides the emptiness problem for . The latter is undecidable, resulting in the undecidability of the equivalence problem of erasing pattern languages with length constraints. This is similar to the idea of the construction in [26] for patterns with regular constraints, where the same approach but with differently constructed patterns and with another type of constraints has been used. We also mention that length constraints allow for a construction that shows undecidability even in the case of terminal-free patterns. This is interesting, as for terminal-free erasing pattern languages without any constraints, equivalence and inclusion are actually decidable.
Now, we continue with the constructions of and . First, is defined by
for independent variables , and by the following linear system :
-
-
,
Notice that any word in has a suffix of fixed length with the form where and . In comparison to the construction in [11] that made use of a specific prefix and a specific suffix to enforce predicate choice, only this suffix suffices for this purpose in this construction.
Next, we construct . We set the pattern to
for new and independent variables and terminal-free patterns . Assume to be defined as in [11]. For , set
-
,
-
,
for new and independent variables and terminal-free patterns . Assume
-
and for ,
-
,
-
if for , and
-
for all
for new and independent variables . Hence, all variables inside do not occur anywhere else but in these factors. The length constraint is defined by the following and only the following system (no additional constraints will be added later on).
-
1.
-
2.
-
3.
-
4.
for all
-
5.
for all
For all , we say that and are defined just as in [11]. For the case of , we set and by
-
, and
-
for new and independent variables and .
We now give a really rough sketch of the idea on why this construction works. First, the length constraints allow for only one , for , to be substituted to a non-empty word. Any word obtained by can be obtained by the suffix of . Additionally, everything obtained in can then be obtained by in . Everything else can be obtained by the variables and . Hence, we get the most significant property of this proof, i.e., .
Now, for the other direction, we consider two cases. These are, given some substitution , whether or whether . By we know . It can be shown that, if , then we can use to obtain and to obtain . can be obtained in and the same holds analogously for and . Hence, if , then . If on the other hand we have and as well as , we need to find some and , for , to obtain and . By the construction in [11], we know that this only works if is not the encoding of a valid computation of or if either contains or is a unary word over that is short enough (If or , we can either use or in to obtain parts of , or we also have to rely on some and and the same property as before holds). Hence, if and only if there exists some valid computation for , we can find a substitution for which . This concludes the sketch for the binary case of the proof of Theorem 8. See Appendix C for a sketch on how to extend this to arbitrary fixed alphabets. As mentioned before, see [27] for a full formal proof of this result.
Due to the construction in the proof of Theorem 8, the undecidability of the inclusion problem for terminal-free erasing pattern languages immediately follows.
Corollary 9.
Let for terminal-free patterns . The problem of answering whether we have is undecidable for all alphabets with .
Proof.
Suppose it was decidable. Then, we could decide whether and whether and by that decide if we have . That is a contradiction to Theorem 8.
The second main result of the paper, and the last one regarding patterns with only length constraints, shows that the inclusion problem for terminal-free non-erasing pattern languages with length constraints is also undecidable for all alphabets with . In [32], Saarela has shown this problem to be undecidable for patterns without any constraints in the binary case.
Theorem 10 ([32]).
Let be two terminal-free patterns. Answering whether is undecidable for alphabets with .
Hence, the undecidability of the inclusion problem for pattern languages with length constraints follows immediately in this case.
Corollary 11.
Let for terminal-free patterns . Answering whether we have is undecidable for alphabets with .
In the case of length constraints, a general extension to all alphabet sizes is possible. We obtain the following.
Theorem 12.
Let for terminal-free patterns . Answering whether we have is undecidable for alphabets with , even if and have no disjunctions.
Proof.
Again, we only give a sketch of the idea and the formal construction used in this proof. The binary case follows by Corollary 11, i.e., the result by Saarela [32]. The general proof idea is based on the construction done by Freydenberger and Bremer [3] for the binary case and significantly adapted in its extension to arbitrary alphabets. As the extension to larger alphabets is strongly based on the construction for the binary case, we give the construction for the binary case here and sketch out the extension to arbitrary alphabets in Appendix D. As for Theorem 8, due to its extensive length, the full formal proof can be found in [27].
The idea is relatively similar to the idea for the proof of Theorem 8 (or the proof in [3]) in that is utilizes a pattern in which all words with a specific structure can be obtained, in particular, special encodings of valid computations of the specific universal Turing machine over any input , as defined in Appendix B, and a pattern from which we can obtain all words that are in but these encodings of valid computations. As this result only shows undecidability of the inclusion problem, we do not have the property and only use the case distinction whether to decide the emptiness of .
We continue with the formal construction used to reduce the emptiness problem of the universal Turing machine to the inclusion problem of terminal-free non-erasing pattern languages with length constraints in the binary case. After hat, we give a sketch of the idea on why this construction works. For the binary case, we essentially take slight adaptations of the constructed patterns from [3] but swap out any occurrence of the letter with a new variable and any occurrence of the letter with a new variable (also similar, to some extend, to the idea in [32]). In that sense, the patterns look very similar to the constructed patterns in [3], but do not contain any terminals. With length constraints, we can restrict the obtainable words in such a way that the inclusion property decides the emptiness of .
First, we construct , as the construction of is based on it. We define by
for new and independent variables and terminal-free patterns . Notice that is a number given by the construction in [3]. For all we say that
for terminal-free patterns that are defined later. The length constraints are defined by the system
-
, , , and
-
for all
Let be a homomorphism defined by , , and for all . Then we say that for all we have and for and being the patterns used in the construction in [3]. The specific construction of each such pattern is not relevant here and for details we refer to [3]. What is important is that we use the same patterns as in [3] but map them to terminal-free patterns that have the variable instead of each occurrence of a terminal symbol and the variable instead of each occurrence of .
For the missing case of , we define
for new and independent variables and being the encoding of the initial configuration of as in [3]. This will be used to obtain all substitutions where we have in . Now, we construct . is defined by
for some patterns and , and defined as in , and with defined by , , and for all other . We define and as and where and are the patterns given in [3, Section 4.4] for the second case.
The most important fact we take from there is that for some new and independent variable and some encoded initial configuration . By that, we see that starts with which is used later on in this adaptation of the proof. The length constraints are defined by the equations and . Notice that the in has no length constraint. With that, we can now sketch out why this construction works in the binary case.
Notice, if for some we have , then we immediately obtain essentially the same construction as in [3] and can rely on their proof that shows that these patterns can be used to decide the emptiness of . That leaves us with the case . But then, we notice that starts with a unary word of length at least that has the form . So, we can use to obtain and use the rest of to obtain the surrounding factors just as in [3]. So, if , then we always have . Hence, by the result from [3], there only exists some for which we have if and only if , concluding this reduction for the binary case.
To extend this construction to arbitrary alphabets, we can introduce for each additional letter a new variable in the patterns and , resulting in variables for w.l.o.g. . With length constraints, we can restrict the length of each substitution on these variables to . The construction can be extended in such a way that if any two variables of are substituted by the same letter, then there always exists some which can be used to obtain anything obtained in . In a second extension we can ensure that we can also always find some if any of the letters obtained by appears later on somewhere in or . By that, we essentially can only obtain words not in , if only the letters obtained in and are used in and . This restricts us to the same cases as in [3] and allows us to, again, rely on their arguments. As mentioned before, the construction of the extension to arbitrary alphabets is outlined in Appendix D. The full formal proof is given in [27].
Remark 13.
Whether the equivalence problem of non-erasing pattern languages with length constraints is decidable, is left as an open question with no specific conjecture so far. Interestingly, also in the case of regular constraints, no definite answer for the decidability of this problem could be given, yet (cf. [26]). A summary of the current results can be found in Table 1 at the end of the final discussion.
4 Result for Pattern Languages with Regular and Length Constraints
As there are still open problems for pattern languages with regular (or length) constraints, i.e., the equivalence problem for non-erasing pattern languages with regular (or length constraints), finding an answer to this problem for a larger class of languages using both constraint-types is well motivated. Indeed, considering pattern languages with regular and length constraints suffices to obtain the undecidability of that problem. That is noticeable, since this problem is trivially decidable for patterns without any constraints. We begin with mentioning all results for decision problems on pattern languages with regular and length constraints that we can immediately obtain from pervious results.
For the membership, we immediately obtain NP-completeness for all cases, i.e., erasing- and non-erasing as well as terminal-free and general.
Corollary 14.
Let and . The decision problem of whether for is NP-complete, even if the considered pattern is terminal-free.
Indeed, NP-hardness follows, as before, from the NP-hardness of pattern languages without any constraints. NP-containment follows from the fact NP-containment can be checked in the cases of pattern languages with regular constraints as well as pattern languages with length constraints and that no additional properties have to be checked.
As this class of pattern languages utilizes regular constraints, we immediately obtain the property mentioned before, and shown in [26], that we can always find a terminal-free pattern with regular constraints that expresses the same language as a general pattern with regular constraints.
Proposition 15 ([26]).
Let be some pattern with regular constraints. Then, there exists some terminal-free pattern and a regular constraint such that for .
Using the same logic as used to show that property, we immediately obtain the same for pattern languages with regular and length constraints, as, given some pattern with regular and length constraints , one can introduce for each terminal letter a new variable and set while not changing any length constraints.
Corollary 16.
Let be some pattern with regular- and length constraints. Then, there exists some terminal-free pattern , a regular constraint , and a length constraint such that for .
Additionally, as the inclusion problem is undecidable for all cases (E, NE, terminal-free, general) for pattern languages with regular or with length constraints, the same follows immediately for pattern languages with regular and length constraints.
Corollary 17.
Let . It is undecidable to answer whether for for all alphabets with , even if and are restricted to be terminal-free.
The same follows for the equivalence problem of general and terminal-free erasing pattern languages with regular- and length constraints, as also here we have the undecidability for both cases, regular constraints and length constraints, respectively. We obtain the following.
Corollary 18.
Let . It is undecidable to answer whether for all alphabets with , even if and are restricted to be terminal-free.
For all results obtained so far, no disjunction of systems of linear inequalities have had to be applied. For the equivalence problem of non-erasing pattern languages with regular and length constraints, making use of disjunctions in length constraints in addition to regular constraints allows us to obtain undecidability in this case. The following theorem concludes the third and final main result of this paper.
Theorem 19.
Let . It is undecidable to answer whether for all alphabets with , even if and are restricted to be terminal-free.
Proof.
As for the proofs of Theorem 8 and Theorem 12, due to its length, the full version of this proof can be found in [27]. Here, we will only give a rough sketch of the construction and the idea behind it.
The general idea of the proof works similar to the proof of Theorem 8 and the proof of the main result in [26], but needs significant adaptations to work for non-erasing pattern languages. Similar to the proof of Theorem 8, we reduce the emptiness problem for nondeterministic 2-counter automata to the equivalence problem of non-erasing pattern languages with regular and length constraints. So, given some nondeterministic 2-counter automaton without input , we construct two patterns with regular and length constraints and such that if and only if . In contrast to the proofs of Theorem 8 and Theorem 12, this time, the length constraints use a disjunction of systems of linear diophantine inequalities while has no regular or length constraints at all. Again, serves as a pattern to obtain any word with a specific structure and serves as a pattern to obtain words with that specific structure that fulfill given properties. These properties are, again, defined by sub-patterns in that are called predicates. The regular and length constraints and can be set in such a way that always a single predicate has to be chosen, while the variables outside of these predicates have to be substituted in a very fixed way. This results in words that have equal prefixes and suffixes in comparison to all words obtained from , while the middle part is then checked with the predicates. These predicates are constructed similar to the ones in [26], just with slight adaptations, as now we have to use non-erasing substitutions instead of erasing ones. In the end, it is shown, that the predicates cover all words where the middle part is not an encoding of a valid computation in . We continue with the general construction.
As we need to define , we will begin with the construction of . We define by
for new and independent variables to for some to be specified later. For each , we define for and being some terminal-free pattern, also to be defined later. For each with we assume by the following construction that , i.e., each variable occurring in one does not occur anywhere else in the pattern. Also, for any word obtained by , we assume that it is either or for some by the construction for this proof. Before giving the constraints and , we continue with the construction of . We define by
for a new and independent variable , , and a word that is obtained by the following morphism. Define by , , and for all . Then, we say that . The constraints and are empty, hence . We proceed with the definition of the constraints and . For each , we define the language of the variable by
Thus, notice, that each may only be substituted to one of words: Either , a specific suffix of or a specific prefix of . Additionally, for all variables in we add the constraint that , i.e., no variable can be substituted to just the letter . Clearly, any language can be intersected with another regular language to obtain that constraint. The specific regular and length constraints for each will be introduced later in the proof. For the length constraints , we construct a disjunction of systems of linear (diophantine) inequalities. Assume w.l.o.g. that for each we have for some . Then, we define for each , for each of the resulting the system defined by:
-
-
-
for all
-
-
for all and for
Additionally, for some , there are additional length constraints on the specific variables in occurring in . These are defined properly in the formal proof in [27] and do not interfere with the constraint that may be substituted to in certain cases. We then define by . This constraint results in that exactly two subsequent and are substituted by words longer than the length of while all others have exactly length . The other constraints result in that all variables occurring in some pattern with have to be substituted by length and that at least one variable occurring in has to be substituted to length . These constraints serve selecting one to obtain substitutions of as seen in the following parts.
As in the proof of Theorem 8, first, we have . Due to the regular and length constraints, it can be shown that each word has the form for some . Hence, by setting , we always find some such that . On the other hand, due to the regular and length constraints, it can also be shown that the part of must be obtained by some , , while the prefix is obtained by and the suffix is obtained by . Hence, for some , to find some such that , there must exist some such that , i.e., . We can construct in such a way that this is only possible, when is not an encoding of a valid computation of . The specific construction of all ’s is similar to the ones given in [26]. Due to space constraints, see [27] for more details on the construction for this proof.
Following both arguments, we see that there exists some such that if and only if there is some valid computation for , i.e., , concluding this result. As mentioned before, see [27] for a full formal proof of this result.
The construction of the proof for Theorem 19 uses patterns that consist partly of terminal-symbols. Using Corollary 16, however, we know that we can always find terminal-free patterns that produce the exact same languages. Hence, the result also follows immediately for terminal-free patterns. This concludes all results regarding decision problems for pattern languages with regular- and length constraints. Notice that by these results all main decision problems (i.e., membership, inclusion, and equivalence) for this class of pattern languages have an answer (summarized in Table 1 at the end of the final discussion).
5 Further Discussion
In this paper, we extended the research regarding decision problems on pattern languages. To get closer to an answer for the main open questions, e.g., the equivalence problem for erasing pattern languages or the larger alphabet cases of the terminal-free inclusion problem for non-erasing pattern languages, extending the research from [26], we now introduced length constraints as an alternative approach. With the results provided in this paper, we see that nearly all problems regarding pattern langauges with length constraints, except membership, are indeed undecidable (in particular the open cases for patterns without constraints). As mentioned at the end of Section 3, similar to [26], the decidability of the equivalence problem for non-erasing pattern languages with length constraints remains an open problem.
By the help of some core ideas from [19, 11, 3, 32], adapted constructions using length (and regular) constraints could be constructed to obtain these results. Regarding our first main result, Theorem 8, similar to [26], the main novelty for the proof is found in the way the constructed checker pattern does not produce too much anymore, while still being able to handle everything from in a controlled manner. Length constraints helped to reduce the overhead in to enable the approaches from the inclusion problem proof to be used for the equivalence problem. Now, due to such a construction being shown to actually work for two different constraint types, e.g., regular and length constraints respectively, finding a way to adapt the existing proofs for pattern languages without any constraints in such a way that the same can be achieved here, poses a final challenge. If this can actually be done, the main open problem for pattern languages would be settled. So far, we do not know how likely such a construction is to be found. In addition, the use of length constraints also helped to achieve undecidability for decidable problems regarding patterns without constraints, i.e., the terminal-free inclusion and equivalence problems for erasing pattern languages. Hence, the expressiveness obtained using these constraints and, thus, needed to obtain undecidability of the equivalence problem for erasing pattern languages using the approach posed in this paper, might actually be unfeasible. Regarding the terminal-free inclusion problem in the non-erasing case, due to the undecidability of the binary case having already been shown for patterns without any constraints [32], it is not unlikely that undecidability for larger alphabets can be achieved. Using the approach proposed in the current paper, we would need to somehow be able to restrict substitutions of some variables to single letters and distinct symbols in one case, and enable certain wildcards in any other case, similar to [32], where a property of binary words is used to flip a switch, determining where a word obtained from has to be obtained in . So far, this remains open.
Regarding the equivalence problem for non-erasing pattern languages for patterns with length constraints, we observe that this problem must be harder than the trivially decidable variant for non-erasing pattern languages without any constraints. This results from the fact that there are several NP-hard problems that are expressible solely by unions of linear diophantine equations. Some of which are, e.g., the subset sum problem, the knapsack problem, or integer programming [20]. Just to give a rough idea, we can easily embed, say, a positive integer subset sum instance with , for with , and to an instance of the equivalence problem for non-erasing pattern languages with length constraints by setting the pattern , for variables , defining by setting , for , and setting . Then, we set , for a variable , with being defined by . Then, only has words of length and only contains some words if actually has a solution, as otherwise we cannot find a valid substitution for . If it contains words, they have length . So, both languages are equal if and only if there exists a solution for . Here, disjunctions allowed to embed the choice of setting any to either length or length . Hence, we obtain the following.
Proposition 20.
Let and let be any alphabet with at least . Deciding whether we have is NP-hard, even when we are restricted to terminal-free patterns and .
Notice that the previous result depended on the fact that we allow disjunctions in the given length constraints. The following result shows that at least in the case of unary alphabets, where we basically restrict the problem to a number setting, we can obtain NP-hardness without using disjunctions by a reduction from the the problem. Due to length restrictions, the proof of Proposition 21 can be found in Appendix E.
Proposition 21.
Let where and each use no disjunctions. Deciding whether we have is NP-hard for alphabets with , even when we are restricted to terminal-free patterns and .
For all other problems, we obtained an answer and notice that the most prominent open problems, i.e., the equivalence problem for erasing pattern languages and the inclusion problem for terminal-free non-erasing pattern languages for alphabets with are indeed undecidable for patterns with length constraints. For patterns with regular and length constraints, we have even seen the undecidability of the equivalence problem of non-erasing pattern languages, concluding all open problems there. A final overview of the current state of results for decision problems on patterns over various constraints can be found in Table 1. We propose the following open question for which we have no definite conjecture so far.
Question 1.
Given , is it generally decidable to answer whether ?
No Constraints | Len-Constraints | Reg-Constraints | RegLen-Constraints | |||||
---|---|---|---|---|---|---|---|---|
Gen. | T.F. | Gen. | T.F. | Gen. | T.F. | Gen. | T.F. | |
E () | NPC | NPC | NPC | NPC | NPC | NPC | NPC | NPC |
E () | UD | NPC | UD | UD | UD | UD | UD | UD |
E () | Open | NPC | UD | UD | UD | UD | UD | UD |
NE () | NPC | NPC | NPC | NPC | NPC | NPC | NPC | NPC |
NE () | UD | UD222Undecidable in the binary case by [32], open for larger alphabets. | UD | UD | UD | UD | UD | UD |
NE () | P | P | Open3 | Open333Gen. and T.F. NP-hard as described above. An upper bound is unknown (may as well be undecidable). | Open4 | Open444Depending on regular languages representation, at least PSPACE-hard [26]. | UD | UD |
References
- [1] Dana Angluin. Finding patterns common to a set of strings. J. Comput. Syst. Sci., 21(1):46–62, 1980. doi:10.1016/0022-0000(80)90041-0.
- [2] Pablo Barceló, Leonid Libkin, Anthony W. Lin, and Peter T. Wood. Expressive languages for path queries over graph-structured data. ACM Trans. Database Syst., 37(4), December 2012. doi:10.1145/2389241.2389250.
- [3] Joachim Bremer and Dominik D. Freydenberger. Inclusion problems for patterns with a bounded number of variables. Information and Computation, 220-221:15–43, 2012. doi:10.1016/J.IC.2012.10.003.
- [4] Joel D. Day, Pamela Fleischmann, Florin Manea, and Dirk Nowotka. Local Patterns. In Satya Lokam and R. Ramanujam, editors, FSTTCS 2017, volume 93 of LIPIcs, pages 24:1–24:14, Dagstuhl, Germany, 2018. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPICS.FSTTCS.2017.24.
- [5] Andrzej Ehrenfeucht and Grzegorz Rozenberg. Finding a homomorphism between two words is NP-complete. Inf. Process. Lett., 9(2):86–88, 1979. doi:10.1016/0020-0190(79)90135-2.
- [6] Henning Fernau, Florin Manea, Robert Mercaş, and Markus L. Schmid. Revisiting Shinohara’s algorithm for computing descriptive patterns. TCS, 733:44–54, 2018. Special Issue on Learning Theory and Complexity. doi:10.1016/J.TCS.2018.04.035.
- [7] Pamela Fleischmann, Sungmin Kim, Tore Koß, Florin Manea, Dirk Nowotka, Stefan Siemer, and Max Wiedenhöft. Matching patterns with variables under Simon’s congruence. In Olivier Bournez, Enrico Formenti, and Igor Potapov, editors, Reachability Problems, pages 155–170, Cham, 2023. Springer Nature Switzerland. doi:10.1007/978-3-031-45286-4_12.
- [8] Dominik D. Freydenberger. A logic for document spanners. Theory of Computing Systems, 63(7):1679–1754, September 2018. doi:10.1007/S00224-018-9874-1.
- [9] Dominik D. Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. Theory of Computing Systems, 62(4):854–898, May 2017. doi:10.1007/S00224-017-9770-0.
- [10] Dominik D. Freydenberger and Liat Peterfreund. The theory of concatenation over finite models. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, ICALP 2021, Proceedings, volume 198 of LIPIcs, pages 130:1–130:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.130.
- [11] Dominik D. Freydenberger and D. Reidenbach. Bad news on decision problems for patterns. Information and Computation, 208(1):83–96, January 2010. doi:10.1016/J.IC.2009.04.002.
- [12] Dominik D. Freydenberger and Markus L. Schmid. Deterministic regular expressions with back-references. Journal of Computer and System Sciences, 105:1–39, 2019. doi:10.1016/J.JCSS.2019.04.001.
- [13] Dominik D. Freydenberger and Nicole Schweikardt. Expressiveness and static analysis of extended conjunctive regular path queries. Journal of Computer and System Sciences, 79(6):892–909, 2013. JCSS Foundations of Data Management. doi:10.1016/J.JCSS.2013.01.008.
- [14] Paweł Gawrychowski, Florin Manea, and Stefan Siemer. Matching Patterns with Variables Under Hamming Distance. In Filippo Bonchi and Simon J. Puglisi, editors, MFCS 2021, volume 202 of Leibniz International Proceedings in Informatics (LIPIcs), pages 48:1–48:24, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPICS.MFCS.2021.48.
- [15] Paweł Gawrychowski, Florin Manea, and Stefan Siemer. Matching patterns with variables under edit distance. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval, pages 275–289, Cham, 2022. Springer International Publishing.
- [16] Michael Geilke and Sandra Zilles. Learning relational patterns. In Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, pages 84–98, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. doi:10.1007/978-3-642-24412-4_10.
- [17] Oscar H. Ibarra. Reversal-bounded multicounter machines and their decision problems. J. ACM, 25(1):116–133, January 1978. doi:10.1145/322047.322058.
- [18] Tao Jiang, Efim Kinber, Arto Salomaa, Kai Salomaa, and Sheng Yu. Pattern languages with and without erasing. International Journal of Computer Mathematics, 50(3-4):147–163, 1994.
- [19] Tao Jiang, Arto Salomaa, Kai Salomaa, and Sheng Yu. Decision problems for patterns. Journal of Computer and System Sciences, 50(1):53–63, February 1995. doi:10.1006/JCSS.1995.1006.
- [20] Richard M. Karp. Reducibility among Combinatorial Problems, pages 85–103. Springer US, Boston, MA, 1972. doi:10.1007/978-1-4684-2001-2_9.
- [21] Takeshi Koshiba. Typed pattern languages and their learnability. In Paul Vitányi, editor, Computational Learning Theory, pages 367–379, Berlin, Heidelberg, 1995. Springer Berlin Heidelberg. doi:10.1007/3-540-59119-2_192.
- [22] Anthony Widjaja Lin and Rupak Majumdar. Quadratic word equations with length constraints, counter systems, and presburger arithmetic with divisibility. In Automated Technology for Verification and Analysis, 2018.
- [23] M. Lothaire. Combinatorics on Words. Cambridge Mathematical Library. Cambridge University Press, 2 edition, 1997.
- [24] Marvin L. Minsky. Recursive unsolvability of Post’s problem of "tag" and other topics in theory of Turing machines. Annals of Mathematics, 74(3):437–455, 1961.
- [25] Turlough Neary and Damien Woods. Four small universal turing machines. Fundam. Inf., 91(1):123–144, January 2009. doi:10.3233/FI-2009-0036.
- [26] Dirk Nowotka and Max Wiedenhöft. The equivalence problem of E-pattern languages with regular constraints is undecidable. In Szilárd Zsolt Fazekas, editor, Implementation and Application of Automata, pages 276–288, Cham, 2024. Springer Nature Switzerland. doi:10.1007/978-3-031-71112-1_20.
- [27] Dirk Nowotka and Max Wiedenhöft. The equivalence problem of E-pattern languages with length constraints is undecidable, 2025. doi:10.48550/arXiv.2411.06904.
- [28] Enno Ohlebusch and Esko Ukkonen. On the equivalence problem for E-pattern languages. In MFCS 1996, pages 457–468. Springer Berlin Heidelberg, 1996. doi:10.1007/3-540-61550-4_170.
- [29] Daniel Reidenbach. On the equivalence problem for E-pattern languages over small alphabets. In DLT, pages 368–380. Springer Berlin Heidelberg, 2004. doi:10.1007/978-3-540-30550-7_31.
- [30] Daniel Reidenbach. On the learnability of E-pattern languages over small alphabets. In Learning Theory, pages 140–154. Springer Berlin Heidelberg, 2004. doi:10.1007/978-3-540-27819-1_10.
- [31] Daniel Reidenbach. An examination of Ohlebusch and Ukkonen’s conjecture on the equivalence problem for E-pattern languages. J. Autom. Lang. Comb., 12(3):407–426, January 2007. doi:10.25596/JALC-2007-407.
- [32] Aleksi Saarela. Hardness Results for Constant-Free Pattern Languages and Word Equations. In Artur Czumaj, Anuj Dawar, and Emanuela Merelli, editors, 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020), volume 168 of Leibniz International Proceedings in Informatics (LIPIcs), pages 140:1–140:15, Dagstuhl, Germany, 2020. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPICS.ICALP.2020.140.
- [33] Markus L. Schmid. On the membership problem for pattern languages and related topics. PhD thesis, Loughborough University, January 2012. URL: https://repository.lboro.ac.uk/articles/thesis/On_the_membership_problem_for_pattern_languages_and_related_topics/9407606.
- [34] Markus L. Schmid and Nicole Schweikardt. Document spanners - A brief overview of concepts, results, and recent developments. In PODS ’22: International Conference on Management of Data, pages 139–150. ACM, 2022. doi:10.1145/3517804.3526069.
- [35] Takeshi Shinohara. Polynomial time inference of extended regular pattern languages, pages 115–127. Springer Berlin Heidelberg, 1983.
- [36] Takeshi Shinohara and Setsuo Arikawa. Pattern inference, pages 259–291. Springer Berlin Heidelberg, 1995. doi:10.1007/3-540-60217-8_13.
Appendix A Definition of Nondeterministic 2-Counter Automata without Input
A nondeterministic 2-counter automaton without input (see e.g. [17]) is a 4-tuple which consists of a set of states , a transition function , an initial state , and a set of accepting states . A configuration of is defined as a triple in which indicates the current state and and indicate the contents of the first and second counter. We define the relation on by as follows. For two configurations and we say that if and only if there exist and such that
-
1.
if then , otherwise if , then , for ,
-
2.
for ,
-
3.
, and
-
4.
we assume if then for .
Essentially, the machine checks in every state whether the counters equal and then changes the value of each counter by at most one per transition before entering a new state. A computation is a sequence of configurations. An accepting computation of is a sequence with , for all , and for some with .
We encode configurations of by assuming for some and defining a function by
for some numbers . The values for these numbers depend on the construction of the respective proofs and are not specified here. Encodings of this kind are used to prove Theorem 8 and Theorem 19. This is extended to encodings of computations by defining for every and every sequence
For some nondeterministic 2-counter automaton without input , define the set of encodings of accepting computations
and let . The emptiness problem for deterministic 2-counter-automata is undecidable (cf. e.g. [17, 24]), thus it is also undecidable whether a nondeterministic 2-counter automaton without input has an accepting computation [11, 19]. That the emptiness problem for universal Turing machines is undecidable is a known fact.
Appendix B Definition of the Universal Turing Machine
Here, we define the universal Turing machine used in the referenced proof from [3] and referred to in the proof sketch of Theorem 12. Let be the universal Turing machine wih 2 symbols and 15 states as described by Neary and Woods [25]. This machine hat the states and the tape alphabet . The transition function is given in Table 2.
We follow with the definition of encodings of computations of as depicted in [3]. The following conventions are needed to discuss configurations of . The tape content of any configuration of is characterized by two infinite sequences and over . The sequence describes the left side of the tape, the sequence starting at the head position of (including) and extending to the left. Analogously, describes the right side of the tape, the sequence starting directly after the head position and extending to the right. A configuration of is a triple consisting of a state , a left side of the tape and a right side of the tape .
Let be a function defined by , , and the extension to to infinite sequences over by . As in each configuration of only a finite number of cells consist of no blank symbol , is always finite and well-defined. Notice that we can always obtain the symbol that is closest to the head by (the symbol at the head position in the case of and the symbol right of the head position in the case of ). By multiplying or dividing the encoding by , each side can be lengthened or shortened, respectively. The encoding of configurations of indirectly referred to in this paper is defined by
for every configuration . Recall that as . A computation on is a finite sequence of configurations of . It is valid if ( being some initial configuration), is a halting configuration, and is a valid successor configuration of , for , as defined by . In [3], the notion is adopted that any possible configuration where both tape sides have a finite value under is a valid successor configuration of a halting configuration. The encoding of computations of is given analogously to the definition in the case of nondeterministic 2-counter automata without input, i.e., for some computation , we have
Finally, also analogous to nondeterministic 2-counter automata without input, let
Appendix C Extension to Larger Alphabets in Theorem 8
To extend the proof idea for Theorem 8 to arbitrary larger alphabets with , we fix w.l.o.g. (hence, ). We can construct adapted patterns with length constraints and in the following manner. The idea of this adapted construction is also similar to [11], but adapted to this setting and for terminal-free patterns. First, we define by
Notice, that only the parts containing the ’s is changed. We construct the adapted length constraints in the following manner:
-
, , ,
-
for all
An immediate observation that can be made, as its done in [11], is the following:
Observation 22 ([11]).
Let , , and . Consider the pattern
Then, for each homomorphism with we have that for all .
Now, we explain the changes necessary to create and . First, for each with , we change its form from to
Now, instead of adding as before, we add a series of many pairs of patterns and , for , that cover the cases that two different variables are substituted equally in . In each of these newly defined pair of patterns, we would set and just to single, new, and independent variables to obtain any substitution and . The number of those additional pairs of patterns rises significantly for alphabets of larger size, hence we only give an example here.
Consider the case that . To handle this case, we would add a predicate , for some , such that
and as well as , where each new variable is new and independent from other parts of the pattern. In particular, notice, that in this example we have no occurrence of the variable and instead four occurrences of . The idea is similar to the previous construction of in the binary case to handle the case of .
We observe, that if, for some substitution , any two different variables with the base name are substituted equally, then there exists one pair of patterns and that have the same variables equalled out, hence resulting in the existence of some for which .
Additionally, we add many pairs of patterns and , for that handle the occurrence of any of the letters to in or . These predicates work exactly as in [11], just with the same addition as before that and determine the letters that need to be used for the encodings of valid computations and for the substitutions of . All other arguments stay the same. Hence, we conclude the sketch of the extension to larger alphabets for Theorem 8.
Appendix D Extension to Larger Alphabets in Theorem 12
To extend the proof idea for Theorem 12 to arbitrary larger alphabets with , we fix w.l.o.g. (hence, ). The basic idea stays the same, but the adaptations to the constructed patterns and are more substantial.
For each new letter , we introduce a new variable that may be substituted by only a single letter. Similar to the extension to arbitrary alphabets in Theorem 12, the patterns and are adapted in such a way that two additional cases are considered for. First, if some substitution replaces two distinct variables in to the same letter, i.e., for , then we should always find a substitution such that . Next, we make sure that only the letters and may appear in and . Otherwise, we should also always find some substitution such that .
We continue with the specific adapted constructions. First, each is redefined to use the structure Notice that we just add each variable in in front of each and , for . Additionally, we redefine by adding in front of it. The adapted length constraints of look almost identical, but with the addition of constraining the length of all variables in to and changing the sum of to correspond to the new number of ’s that is explained in the following. In particular, we have , , , for all , and for all .
To define the adaptation of , we need to extend the morphism to take into account the newly defined variables. Simply, for all variables with , is defined just as in the binary case. For all other , we define . Now, we add the prefix to both, and . Finally, we also add the prefix to the whole of . We add to the length constraints that we also have , for all . We continue with the construction of all to cover the additional two emerging cases. Due to space constraints, all formal proofs regarding the correctness of this extension to the reduction can be found in [27].
First, for each , we can add two , i.e., and , which cover the case that any letter obtained by occurs somewhere later on in or , for . For , we define and just by
for new and independent variables . We can define analogously, just by swapping the role of and .
Finally, we need to cover the case that , for some . Assume these variables to be and , for some . Then, we just define , for ( being the total number of combinations), by setting und to
So, if and only if the same symbol is used twice in the prefix , then we can use these to find some such that .
By the previous two constructions, all additional cases that emerge due to the larger alphabet are captured by these additional , so they cannot result in substitutions with . Hence, only analogous cases to the binary construction can result in words that are not in , concluding the sketch of the extension to larger alphabets.
Appendix E NP-hardness of the Unary Case for the Equivalence Problem for Non-Erasing Pattern Languages with Length Constraints
This section is dedicated to show Proposition 21. W.l.o.g. assume . We proceed with a reduction from . Assume w.l.o.g. to be a set of boolean variables and let be the set of their negations. Let for with be a formula in conjunctive normal form over such that for each clause with we have w.l.o.g. with for . We define a function that maps each boolean variable to a pattern variable by
for new and independent pattern variables and . Clearly, one of the two cases is always fulfilled for each boolean variable . We proceed by defining the pattern by
for new and independent variables with . Notice that this pattern is terminal-free. The length constraints are defined by the following system:
-
for all
-
for all
-
for all
-
Notice, that the final constraint results in that may only have the word in it, as it restricts the length of all symbols in together. Now, we set the second pattern with length constraints by
for a new and independent variable and define the length constraint by . Hence, .
To proof the reduction, first, assume there exists a satisfying assignment of variables for . Let be a substitution such that and if and only if and . Otherwise, set and . As is satisfying , we know that for all . Hence, we can always set and obtain for all . Hence, we must be -valid and we have . As this is the only word we may obtain for , we have .
For the other direction, assume . So, there exists some such that . By , we know for all . As must be the case, for each , there exists some such that , hence by , we must have , resulting in . We set an assignment of variables by and if and only if and , otherwise set and . As , we know such an assignment is a valid assignment for . As for each clause for we find a variable that is set to , we know that is satisfied by , concluding the reduction.