Abstract 1 Introduction 2 Preliminaries 3 Overview 4 Removing Two-Sided Non-Flat Variables 5 𝚪-Expansion and Prefix/Suffix Trees 6 Underapproximating Non-Flat Variables 7 Decision Procedure 8 Future Work References

Negated String Containment Is Decidable

Vojtěch Havlena ORCID Brno University of Technology, Czech Republic Michal Hečko ORCID Brno University of Technology, Czech Republic Lukáš Holík ORCID Aalborg University, Denmark
Brno University of Technology, Czech Republic
Ondřej Lengál ORCID Brno University of Technology, Czech Republic
Abstract

We provide a positive answer to a long-standing open question of the decidability of the not-contains string predicate. Not-contains is practically relevant, for instance in symbolic execution of string manipulating programs. Particularly, we show that the predicate ¬Contains(x1xn,y1ym), where x1xn and y1ym are sequences of string variables constrained by regular languages, is decidable. Decidability of a not-contains predicate combined with chain-free word equations and regular membership constraints follows.

Keywords and phrases:
not-contains, string constraints, word combinatorics, primitive word
Copyright and License:
[Uncaptioned image] © Vojtěch Havlena, Michal Hečko, Lukáš Holík, and Ondřej Lengál; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Regular languages
; Theory of computation Automated reasoning ; Theory of computation Logic and verification
Related Version:
Technical Report: http://arxiv.org/abs/2506.22061 [24]
Acknowledgements:
We thank the anonymous reviewers for careful reading of the paper and their suggestions that greatly improved its quality.
Funding:
This work was supported by the Czech Ministry of Education, Youth and Sports ERC.CZ project LL1908, the Czech Science Foundation project 25-18318S, and the FIT BUT internal project FIT-S-23-8151. The work of Michal Hečko, a Brno Ph.D. Talent Scholarship [Uncaptioned image] Holder, is funded by the Brno City Municipality.
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak

1 Introduction

String constraints have been recently intensely studied in relation to their applications in analysis of string manipulation in programs, e.g., in the analyses of security of web applications or cloud resource access policies [43]. Apart from a plethora of practical solvers, e.g., cvc[26, 27, 7, 28, 42, 38, 41], Z3 [23, 11, 33], Ostrich [29, 13, 16, 14, 15], Z3-Noodler [18, 17, 12], Trau [4, 2, 1] Z3Str/2/3/4/3RE [10, 9, 8], Woorpje [20], and nfa2sat [32], the theoretical landscape of string constraints has been intensely studied too. The seminal work of Makanin [35], establishing decidability of word equations, was followed by the work of Plandowski [39] (and later Jeż’s work on recompression) that placed the problem in PSpace. A number of relatively recent works study extensions of string constraints with constraints over string lengths, transducer-defined relational constraints, string-integer conversions, extensions of regular properties, replace-all, etc. As the extended string constraints are in general undecidable, these works focus on finding practically relevant decidable fragments such as the straight-line [16, 29, 13, 15, 14] and chain-free [4, 17] fragments, quadratic equations [36], and others (e.g., [5, 21]).

The most essential constraints, from the practical perspective, are considered to be word equations, regular membership constraints, length constraints, and also ¬Contains, as argued, e.g., in [44], and as can also be seen in benchmarks, for instance, in [45, 3]. While the three former types of constraints are intensely studied, ¬Contains was studied only little. Yet, it is important as well as theoretically interesting: besides the occurrence in existing benchmarks, its importance follows also from its ability to capture other highly practical types of constraints. E.g., the indexOf(x,y) function should return the position of the first occurrence of y in x. It can be converted to the word equation x=p.y.s after which the returned value equals |p|. To ensure that y is indeed the first occurrence in x, there should be no occurrence of y in p.y where y is the prefix of y without the last symbol, i.e., y.z=y for zΣ. This can be expressed as ¬Contains(y,p.y) (e.g., Z3 solves indexOf in this way [23]).

As mentioned above, the problem is also interesting from the theoretical perspective. Although the positive version, Contains, can be easily encoded using word equations, the negation is difficult. Its precise conversion to word equations would require universal quantification, which is undecidable for word equations in general [22]. The most systematic attempts at solving ¬Contains have been made in [3, 19]. In [3], the authors extend the flattening underapproximating framework behind the solver Trau [2, 1] and give a precise solution for ¬Contains if all involved string variables are constrained by flat languages (a flat language here stands for a finite union of concatenations of iterations of words) and, moreover, if no string variable appears multiple times, thus avoiding most of the difficulty of the problem. Our recent work [19], on top of which we build here, proceeds in a similar direction and removes the restriction of [3] on multiple occurrences of variables, but still requires all languages to be flat, which is a quite severe restriction. Practical heuristics used in solvers generally solve only easy cases and quickly fail on more complex ones, cf. [19], and do not give any guarantees. E.g., cvc5 translates ¬Contains into a universally quantified disequality [40], which is in turn handled by cvc5’s incomplete quantifier instantiation [37].

In this paper, we show decidability of a much more general kind of ¬Contains than [19, 3], namely of the form ¬Contains(𝒩,)Φ where 𝒩eedle and aystack are string terms (sequences of symbols and variables) and Φ constrains variables by any regular language. The constraint is satisfied by an assignment to string variables respecting Φ under which 𝒩 is not a factor (i.e., a continuous subword) of  (i.e., if 𝒩eedle cannot be found in aystack).

Our solution of the problem leads relatively deep into word combinatorics and automata theory. We rely on the result in [19] giving a decision procedure for a flat-language version of the problem. The work [19] uses an automata-based construction inspired by deciding functional equivalence of streaming string transducers [6]. Using a variation on automata Parikh images, it transforms the problem into an equisatisfiable Presburger arithmetic formula (which is decidable). The general case with variables restricted by arbitrary regular languages, the subject of this paper, is solved by a reduction to this flat-language fragment. The core idea of our proof is that we can always find fresh primitive words in non-flat languages that can be repeated an arbitrary number of times. The result of such a repetition is a word that can share with other variables only subwords of a bounded size, assuming all words assigned to variables are sufficiently long. The reduction technically requires a dive into combinatorics on words and results on primitive words [31, 35, 34, 30], which are closely related to flat languages. Our techniques shares traits with the work of Karhumäki et al. [25], which constructs long primitive words to show that disjunctions of word equations can be encoded into a single equation. First, for variables with non-flat languages occurring on both sides of the constraint, we show that we can replace each of them with a single fresh symbol. This is because non-flat languages allow us to choose a sufficiently complex word for the variable x that can be matched only with the value of x on the other side (𝒩 is the other side of  and vice versa). For variables with non-flat languages that appear only in , we show that after enumerating all possible assignments for them up to a certain bound, their languages can be underapproximated by flat languages while preserving satisfiability.

2 Preliminaries

Numbers.

We use for natural numbers (including zero). For m,n, their greatest common divisor is denoted as 𝐠𝐜𝐝(m,n) and their least common multiple is denoted as 𝐥𝐜𝐦(m,n).

Words.

An alphabet Σ is a finite non-empty set of symbols. Let Σ be fixed for the rest of the paper. A (finite) word w over Σ is a sequence of symbols w=a1an from Σ, where n is the length of w, denoted as |w|. The empty word of the length 0 is denoted by ϵ and a concatenation of two words u and v is denoted as uv (or shortly uv). An iteration of a word w is defined as w0ϵ and wi+1wiw for i0. The set of all words over Σ is denoted as Σ. A primitive word cannot be written as vi for any v and i>1, and we will use Greek letters α,β,γ, from the beginning of the alphabet to denote primitive words. We denote the set of all primitive words Prim. A word u is a factor (i.e., a continuous subword) of every word vuv. Given two words pus and pus, we say that the factors u and u have an overlap of size k if |{|p|+1,,|p|+|u|}{|p|+1,,|p|+|u|}|=k. The overlap of u and u in the words pus and pus contains a conflict if there is a position i with |p|i<|pu| and |p|i<|pu| such that the words pus and pus contain a different letter at position i.

Languages.

language  over Σ is a subset of Σ. We will sometimes abuse notation and, given a word wΣ, use w to also denote the language {w}. For two languages 1 and 2, we use 12 (or just 12) for their concatenation {uvu1,v2}. A bounded iteration of a language is defined as 0{ϵ} and i+1i for i0. The (unbounded) iteration is i0i. For a word w we use Pref(w) (Suf(w)) to denote the set of prefixes (suffixes) of w and F(w) to denote the set of all factors of w. We lift the definitions to languages as usual. A language Σ is flat iff it can be expressed as a finite union

=i=1Nwi,1wi,2wi,3wi,4wi,5wi,i1wi,i (1)

where every wi,j s.t. 1in,1ji is a word over Σ, else it is non-flat. Flatness of  can be characterised by the absence of the so-called “butterfly loops”:

Fact 1.

A regular language Σ is non-flat iff p{u,v}s for some p,s,u,vΣ with u,vw for any word wΣ.

Automata.

A (nondeterministic finite) automaton (NFA) over Σ is a tuple 𝒜=(Q,Δ,I,F) where Q is a set of states, Δ is a set of transitions of the form qar with q,rQ and aΣ, IQ is the set of initial states, and FQ is the set of final states. A run of 𝒜 over a word w=a1an from state q0 to state qn is a sequence of transitions q0a1q1, q1a2q2, , qn1anqn from Δ. The empty sequence is a run with q0=qn over ϵ. We denote by q0𝒜wqn that 𝒜 has such a run, from where we drop the subscript 𝒜 if it is clear from the context. The run is accepting if q0I and qnF, and the language of 𝒜 is (𝒜){wΣqwr,qI,rF}. Languages accepted by NFAs are called regular. 𝒜 is a deterministic finite automaton (DFA) if |I|=1 and for every symbol aΣ and every pair of transitions q1ar1 and q2ar2 in Δ it holds that if q1=q2 then r1=r2.

The ¬𝐂𝐨𝐧𝐭𝐚𝐢𝐧𝐬 constraint.

Let 𝕏 be a set of (string) variables. A term is a word t(𝕏Σ) over variables and symbols. A ¬Contains constraint is a formula φ¬Contains(𝒩,)Φ, where 𝒩 and (for 𝒩eedle and aystack; φ holds if we cannot find 𝒩 within ) are terms and Φx𝕏x(x) associates every variable x with a regular language x. An assignment is a function σ:𝕏Σ, i.e., it assigns strings to variables. We use σ{x1w1,,xnwn} to denote the assignment obtained from σ by substituting the values of variables x1,,xn to w1,,wn respectively. We lift σ to terms so that for aΣ, we let σ(a)a, and for terms t,t, we let σ(tt)σ(t)σ(t). We then say that σ satisfies φ, written σφ, if σ(x)x for every x𝕏 and σ() cannot be written as uσ(𝒩)v for any u,vΣ, i.e., σ(𝒩) is not a factor of σ(). We call a variable z two-sided if it occurs in both 𝒩 and . Moreover, we use 𝕏𝐹𝑙𝑎𝑡 to denote the set of variables x occurring in φ s.t. x is a flat language.

Given a term t, a variable x𝕏, and a term ts, we use t[x/ts] to denote the term obtained by substituting every occurrence of the variable x in t by the term ts. Moreover, we use 𝑉𝑎𝑟𝑠(t) to denote the set of variables with at least one occurrence in the term t.

Theorem 2 ([19, Theorem 7.5]).

Satisfiability of the ¬Contains constraint is NP-hard.

2.1 Normalization

A variable z is flat (non-flat) if the language z associated with z is flat (non-flat), respectively, and finite if its corresponding language is finite. Moreover, a variable is called decomposed if its language can be represented by a DFA having a single initial, single final state, and containing exactly one nontrivial maximal strongly connected component (SCC) and no other SCCs. We say that φ is normalized if it contains an occurrence of at least one variable, does not contain any finite variable, and all of its variables are decomposed. Any ¬Contains constraint can be transformed into a disjunction of normalized constraints, as shown by the following lemma.

Lemma 3.

Let φ¬Contains(𝒩,)Φ. Then φ can be transformed to an equisatisfiable disjunction 1in¬Contains(𝒩i,i)Φi of normalized constraints or the formula 𝑡𝑟𝑢𝑒.

Due to the previous lemma, in the rest of the paper we will focus on solving a single normalized ¬Contains constraint.

In the paper, we will also make use of the following result showing decidability of ¬Contains with only flat variables.

Lemma 4 ([19]).

Satisfiability of ¬Contains(𝒩,)Φ where x is flat for any x𝕏 is decidable in NExpTime.

Proof sketch.

We can reduce φ into an equisatisfiable Presburger arithmetic formula ψ based on Parikh images of runs of the NFAs for the variables in φ. Decidability of φ follows from decidability of Presburger arithmetic. See [19] for details.

The crucial fact that Lemma 4 depends on is that there is a one-to-one mapping between runs in NFAs of flat languages and their Parikh images; this mapping fundamentally breaks for non-flat languages so one cannot directly extend this technique to the non-flat case.

2.2 Lemmas in Our Toolbox

We introduce fundamental lemmas from the area of combinatorics on words that will be used throughout the rest of the paper. The following lemma will be useful to guarantee the existence of conflicts (i.e., non-matching positions) in sufficiently large overlaps of two words αM and βN for some primitive words α,βΣ and large constants M,N. Intuitively, we will control the choice of α and β, and, thus, guarantee that α and β cannot be powers of the same word, essentially applying the contraposition of the following lemma.

Lemma 5.

Let αΣ be a primitive word, and let p and s be two words such that α=ps. Then the word sp is primitive.

Proof.

Assume that βk=sp for some k2. Then we have s=βlu and p=vβm for uPref(β), vSuf(β) such that β=uv and l+m+1=k. Thus, we have

α=vβmβlu=v(uv)m(uv)lu=(vu)l+m+1 (2)

and so the word α is not primitive, a contradiction.

Lemma 6 ([31, Proposition 1.2.1 (Fine and Wilf)]).

Let x and y be two words. If the words xk and yl, for any k,l share a common prefix of the length at least |x|+|y|𝐠𝐜𝐝(|x|,|y|), then x and y are powers of the same word.

Using Lemmas 5 and 6, we provide the following corollary that shows existence of conflicts between arbitrary overlaps of repetitions of primitive words of a sufficient size.

Corollary 7.

Let u=αM and v=βN be two words where α,βPrim, with |α||β| and M,N. Then any overlap between u and v of the size at least |α|+|β|𝐠𝐜𝐝(|α|,|β|) contains a conflict.

A natural approach to showing that an assignment σ satisfies φ is to show that σ() cannot be written as σ()=pσ(𝒩)s for any choice of words p and s. Therefore, one would have to consider all prefixes p, infixes u, and corresponding suffixes s with |u|=|σ(𝒩)| and show that σ()=pus implies uσ(𝒩). Note that the choice of the prefix pPref(σ()) uniquely determines u and s, and, therefore, we can only refer to different prefixes when showing σφ. The following lemma reduces the number of prefixes we have to consider if we have information about primitive words that are factors of σ(𝒩) and σ().

Lemma 8 ([31, Proposition 12.1.3]).

Let αΣ be a primitive word, and let α2=xαy for some words x,yΣ. Then either x=ϵ or y=ϵ, but not both.

We will use the next lemma as a recipe for constructing words wzz for non-flat z such that wz has as a factor a primitive word that is sufficiently long for our proofs.

Lemma 9 ([34]).

Let xK=yLzM such that x,y, and z are string variables and K,L and M are integers such that K,L,M2. Then any solution of the equation has the form x=αk, y=αl, and z=αm for some word α and numbers k,l,m.

We provide the following corollary to give insight into how we use Lemma 9 to construct factors that are primitive words of a suitable length.

Corollary 10.

Given two words u and v such that for any word w it holds that u,vw, we have that any word α=uLvM for L,M2 is primitive.

Proof.

By contradiction. Assume that α is not primitive, i.e., α=tK=uLvM for some t and K,L,M2. Applying Lemma 9, we see that u=wl and z=wm for some w, which contradicts the assumptions of the corrolary.

2.3 Easy Fragments

Before we establish our main result giving the decidability of the hardest fragment of ¬Contains, we first describe what we consider easy fragments and how to deal with them. We assume a normalized ¬Contains(𝒩,)Φ constraint.

  1. 1.

    The formula is solvable by length abstraction. This fragment contains formulae that can be solved easily by making the 𝒩eedle longer than the aystack. Suppose 𝒩=t1tm and =s1sn where every ti and sj is either a string variable x𝕏 or a symbol aΣ. We can then create a Presburger arithmetic formula φ over length variables {xx𝕏} such that φ1imi>1jnjΨ. In the formula, i and j are either 1 (if ti,sjΣ) or the length variable x (if ti,sj=x), and Ψ is a formula constraining the possible values for the length variables (obtained, e.g., using the Parikh images of the variables’ languages). If φ is satisfiable, so is the original ¬Contains.

  2. 2.

    All variables are flat. In this case, we can use Lemma 4.

3 Overview

We now move to our main result: deciding a hard instance of φ¬Contains(𝒩,)Φ. We can classify normalized ¬Contains constraints (cf. Section 2.1) that do not fall in the fragments of Section 2.3 based on the occurrences of non-flat variables as follows:

  1. 1.

    constraints where a non-flat variable x occurs both in 𝒩 and  and

  2. 2.

    constraints where all (and at least one) non-flat variables occur only in .

Note that the above not included cases of (a) all variables being flat and (b) a non-flat variable being only in 𝒩 are covered in Section 2.3. In particular, if there is a variable x that only occurs in 𝒩, then x is infinite due to our normalization. Therefore, such a constraint can be solved by making 𝒩 longer than .

We distinguish the classes (1) and (2) above since for (1), the string substituted for some occurrence of x in σ() may overlap with the string for an occurrence of x in σ(𝒩). We deal with the class (1) by substituting two-sided non-flat variables x with fresh symbols. In Section 4, we show that if there is a model σ of the resulting ¬Contains, we can obtain a model σ of the original constraint φ from σ by assigning σ(x) to a long-enough word that ensures a mismatch for every overlap of σ(x) in σ() and σ(x) in σ(𝒩). By doing this, we reduce (1) to either (2) or ¬Contains over flat variables (potentially with no variables at all).

For deciding the class (2), given in detail in Section 6, we construct an equisatisfiable formula that uses flat underapproximations of languages associated with the remaining (as some might have been removed at step (1)) non-flat variables present in . Our result is based on the observation that long words in a non-flat language may have a richer structure compared to long words one can construct using flat languages. Therefore, it is unlikely that a flat variable z should have a large conflict-free overlap with a non-flat variable x in an assignment that assigns these two variables sufficiently long words. In particular, we prove that the original language of x can be underapproximated by a flat language while preserving equisatisfiability. After this step, the resulting constraint can be decided using Lemma 4.

4 Removing Two-Sided Non-Flat Variables

In this section, we will show how to transform a normalized constraint ¬Contains(𝒩,)Φ with an occurrence of a two-sided non-flat variable x into a constraint without occurrences of the variable x. The resulting constraint after removing all two-sided non-flat variables can then be solved either by reduction to Presburger arithmetic (Lemma 4; if no non-flat variables remain in the constraint) or by the procedure in Section 6 (if there are still non-flat variables left in ). The main result of this section is the following theorem.

Theorem 11.

Let φ¬Contains(𝒩,)Φ be a constraint over the alphabet Σ and let z𝑉𝑎𝑟𝑠(𝒩)𝑉𝑎𝑟𝑠() be a non-flat variable. Then the formula φ#¬Contains(𝒩[z/#],[z/#])Φ with #Σ𝕏 is equisatisfiable to φ.

The proof of the theorem is given below. It is based on the observation that assigning long words to two-sided variables necessarily causes some occurrences of the same variable to overlap. Since these variables are non-flat, we can construct long words with a rich internal structure that will guarantee that any sufficiently long overlap necessarily contains a conflict.

Before we give the proof, let us formally introduce the concept of words that do not allow conflict-free overlaps of two occurrences of the same word larger than a certain bound.

Definition 12 (-aligned word).

Let w be a word and . We say that w is -aligned if for all pΣ such that 1|p||w|, w is not a prefix of pw.

Intuitively, w is -aligned if it cannot overlap with itself on a prefix/suffix of the length larger than or equal to  (except |w|). For example, the word w=abaa is 2-aligned since for no non-empty word p of the length at most |w|2=2 it holds that w is a prefix of pw. On the other hand, w=abaa is not 1-aligned since for p=aba of the length 3, it holds that w is a prefix of pw=abaabaa.

4.1 Proof of Theorem 11

If φ is satisfiable then so is φ#. To see this, take a model σφ and replace the assignment of z to #, producing σ. Then, there will be conflicts of # and some non-# symbol when checking whether σ is a model of φ#. Alternatively, it might be possible to align σ(𝒩) with σ() in a manner such that every # in σ(𝒩) matches some # in σ(). In such a case, if σ fails to be a model of φ# we reach a contradiction with σ being a model φ.

For the other direction, assume that φ# is satisfiable, which means there is a model σ of φ#. Next, we will show how to construct a word wz s.t. σ=σ{zwz} is a model of φ.

Focusing on variable z, we can write the two sides of the ¬Contains constraint as =0z,11n1z,nn and 𝒩=𝒩0z𝒩,1𝒩1𝒩m1z𝒩,m𝒩m where 𝒩i,j(𝕏Σ) for each i and j assuming 𝕏=𝕏{z}. Moreover, we write the subscript zS,k to distinguish k-th occurrence of z in S{,𝒩}. As z is non-flat, we have that p{u,v}sz for some words p,u,v, and s where u and v are not a power of the same word (Fact 1).

The core of our proof is based on the following observation. Since σ is a model of φ#, the word σ(𝒩[z/#]) is not a factor of σ([z/#]). Therefore, given any sufficiently long word wzz, if the extension σ=σ{zwz} fails to be a model, then there must be at least one occurrence of the word σ(z) in σ(𝒩) partially overlapping with an occurrence of the word σ(z) in σ(), as shown in the picture below. Thus, if we construct a word wzz that cannot partially overlap with itself, we get σ that is a model of φ.

Let α=u2ukv2 and β=u2vlv2 be two words where k=𝐥𝐜𝐦(|v|,|u|)/|u| and l=𝐥𝐜𝐦(|v|,|u|)/|v|. By invoking Corollary 10, we see that both α and β are primitive.

Note that we also have |α|=|β| (because |α|=2|u|+k|u|+2|v|, |β|=2|u|+l|v|+2|v| and from the definition of l and k we have k|u|=l|v|). We now use these two primitive words α and β to construct wz. Let γαrβrαrβrα2rβ2r and let wzz be the word wzpγs where r2 is the smallest number satisfying r|α|>M+|p|+|s| with M=max{|σ(i)|,|σ(𝒩j)|:1in,1jm}. We set σ=σ{zwz}. Let us now give two lemmas establishing the properties of wz’s infix γ.

We constructed the infix γ of wz in a way so that it prevents conflict-free overlaps with itself as shown by the following lemma.

Lemma 13.

The word γ is (r+1)|α|-aligned.

The full proof of Lemma 13 can be found in [24], but the core of the argument lies in observing that in any overlap of γ with itself of size at least (r+1)|α|, there is a factor α2 having an overlap with α of size |α|, or, similarly for β. Therefore, one can apply Lemma 8 and limit overlaps that must be considered.

The following lemma shows that long overlaps between two occurrences of γ are unavoidable when γ has a sufficient length. The ai in the lemma is used just to position the overlap within σ() and σ(𝒩).

Lemma 14.

For each 0i|σ()||σ(𝒩)|, every occurrence of γ in σ() has an overlap with some occurrence of γ in aiσ(𝒩) of size at least (r+1)|α|, where a is any symbol in Σ.

Proof.

Let us assume an arbitrary occurrence of γ in σ(). Since each i and i+1 are separated by z (the same goes for 𝒩i and 𝒩i+1), it suffices to consider only the pessimistic case, which is when an occurrence of γ in 𝒩 matches the longest σ(i) with p and s on both sides. The situation is schematically depicted below.

In the figure, o1 and o2 denote the overlaps on both sides. We show that the size of at least one overlap o1 and o2 is greater than (r+1)|α| by expressing the length |γ| using its definition and the schematic above:

r(4|β|+4|α|) =|o1|+|s|+|σ(i)|+|p|+|o2| def. of γ
7r|α|+r|α| |o1|+r|α|+|o2| since |α|=|β| and def. of r (3)
7r|α| |o1|+|o2|

We have that |o1|+|o2|7r|α|, and, thus, at least one of |o1| and |o2| is bigger than 3r|α|. Since r2, we have 3r|α|(r+1)|α| and hence γ has an overlap of the required size.

It remains to show that σ is a model of φ. For the sake of contradiction, assume that σ is not a model, meaning that σ(𝒩) is a factor of σ(). From Lemmas 13 and 14 we have that each occurrence of γ in σ(𝒩) is perfectly aligned with some γ in σ(), which also means that wz’s are perfectly aligned. Furthermore, we have that wz’s in σ(𝒩) are aligned with consecutive wz’s in σ(), i.e., any σ(z𝒩,i) is aligned with some σ(z,i+k) for some 0knm. If this were not the case and we had σ(z𝒩,1) overlapping with σ(z,1+k) while there were some σ(z𝒩,i) matching with σ(z,i+k+l) for l1, there would have to be some σ(𝒩j) with 1j<i with |σ(𝒩j)|>|σ(z)|, which is a contradiction with wz being longer than any |σ(𝒩j)| by construction. Hence, for σ′′=σ{z#}, σ′′(𝒩) is also a factor of σ′′(), which is a contradiction to σ being a model of φ#. Therefore, Theorem 11 holds.

5 𝚪-Expansion and Prefix/Suffix Trees

At this point, we are left with a normalized ¬Contains(𝒩,)Φ constraint where all variables in 𝒩 are flat. If all variables in  are also flat, we can use Lemma 4 and obtain the result. In the rest of the paper, we will deal with the case when contains at least one non-flat variable. Before we give the proof in Section 6, in this section, we introduce two concepts that will be used later: Γ-expansion on non-flat variables and prefix/suffix trees.

5.1 𝚪-Expansions on Non-Flat Variables

Intuitively, non-flat languages have words with a rich internal structure compared to flat languages. To illustrate, let x be a flat variable with the language x=α for some word α and let z be a non-flat variable with the language z. Furthermore, let wxx and wzz be two sufficiently long words. We inspect the case when wx and wz share some long common factor u. Since wxα, we have u=sαkp for some sSuf(α), pPref(α), and k. As z is non-flat, the run of 𝒜z corresponding to the word wz passes through states at which one can make a choice of which transition to take next. Since u is long, we have to make a lot of “right” choices during the run of 𝒜z in order for achieve the common factor u, highlighting the difference between the complexity of x and z, and suggesting that there is a way to pick wz to prevent long overlaps with flat variables occurring in 𝒩.

Guided by this intuition, we introduce a tool called Γz-expansion of a non-flat variable z. Given a prefix pPref(z) and a suffix sSuf(z), the Γz-expansion of (p,s) is the word Γz(p,s)=pwsz for a particular w such that only a prefix or a suffix of a bounded length can have long overlaps with (sufficiently long) words that belong to a flat language. This tool will play an important role in our proofs. Loosely speaking, if we start with a model σ and we try to find an alternative model σ=σ{zΓz(p,s)}, then the possible reasons why σ fails to be a model are narrowed down to the choice of p and s.

In order to define the Γz-expansion, we first need some auxiliary definitions. First, as a resulting of our normalization, the map Base:𝕏𝐹𝑙𝑎𝑡2Σ maps any flat variable x to a singleton containing the primitive word α that forms the basis of x, i.e., Base(x){α} such that x=(αk) for some k. We lift the definition of Base to a set X of flat variables as Base(X)xXBase(x), and to a string s(Σ𝕏𝐹𝑙𝑎𝑡) as Base(s)Base(𝑉𝑎𝑟𝑠(s)𝕏𝐹𝑙𝑎𝑡).

Second, given a variable z𝕏 with 𝒜z=(Q,Σ,Δ,I,F), we define the function conz:Q×QΣ to give the lexicographically smallest word conz(q,s)w such that q𝒜ws. Having auxiliary definitions in place, we are ready to define Γ-expansion in the context of the formula φ=¬Contains(𝒩,)Φ with 𝒩(𝕏𝐹𝑙𝑎𝑡Σ).

Definition 15 (Γ-expansion).

Let z𝑉𝑎𝑟𝑠() be a decomposed non-flat variable and 𝒜z=(Q,Σ,Δ,I,F) be a DFA s.t. (𝒜z)=z. Moreover, let qu|vQ be a state such that qu|vuqu|v and qu|vvqu|v with u,vw for any word w. Furthermore, let pPref(z) be some prefix and qp be a state such that q0pqp for some q0I. Similarly, let sSuf(z) be a suffix and qs be a state such that qssqf for some qfF.

Let γzu2+kv2 for a minimal k such that γz>|α| for any αBase(𝒩). Given K, we define the ΓzK-expansion of (p,s) to be the word ΓzK(p,s)pcon(qp,qu|v)γzKcon(qu|v,qs)s.

Intuitively, Γz-expansion takes a prefix p and finds the shortest word con(qp,qu|v) that takes the automaton to the state qu|v in which we have the freedom to read the words u and v in any suitable sequence. We loop through qu|v in a specific manner so that the resulting factor γz is primitive thanks to Corollary 10. The situation with the suffix is symmetric. Note that in the following section, we use prefix/suffix variants of the Γ-expansion defined as ΓPref(z)Kpcon(qp,qu|v)γzK and ΓSuf(z)KγzKcon(qu|v,qs)s.

The following lemma shows that Γz-expansion can be seen almost as introducing a fresh symbol # for the infix w connecting p and s into a word pwsz. Intuitively, if we have an assignment σ that assigns sufficiently long words to all flat variables, then we can find K such that any large overlap between σ(𝒩) and σ(z)=ΓzK(p,s) contains a conflict. We use MLit to be the length of the longest literal in φ, MQ to be the number of states of the largest DFA specifying the language of some variable x𝕏, and Mαmax{|α|:αBase(𝒩){γz}}.

Lemma 16.

Let z𝕏 be a decomposed non-flat variable, pPref(z), and sSuf(z). Further, let K be such that K|γz|4Mα+2MLit and let σ be an assignment with (i) |σ(x)|2Mαfor any flat variable x and (ii) σ(z)=ΓzK(p,s). Every overlap between σ(z) and σ(𝒩) of the size at least max(|p|,|s|)+MQ+2MLit+2Mα contains a conflict.

Proof sketch..

It suffices to observe that if the overlap of size at least N necessarily contains an overlap between γzK and σ(x)=αl for some flat variable x with αBase(x). The existence of a conflict follows from Corollary 7. The full proof can be found in [24].

Next, we show that Γz-expansion can be used to facilitate modularity in our proofs, allowing us to search for a suitable prefix p and a suitable suffix s separately. Searching for p and s separately requires subtle modifications to φ, resulting in us searching for p and s in the context of the modified formulae φPref and φSuf, respectively. If we find models of φPref and φSuf of a particular form, we compose them into a model of φ.

Let z𝑉𝑎𝑟𝑠() be a non-flat variable, and let zPref and zSuf be two fresh variables with their languages restricted to zPrefPref(z) and zSufSuf(z). Let φPref and φSuf be formulae defined as φPrefφ[z/zPref#] and φSufφ[z/#zSuf] where # is a fresh alphabet symbol. Further, let σPrefφPref and σSufφSuf be two models such that:

  1. 1.

    σPref and σSuf agree on the values of variables different than zPref and zSuf,

  2. 2.

    |σPref(x)|>2Mα|σSuf(x)|>2Mα for any flat variable x𝕏𝐹𝑙𝑎𝑡,

  3. 3.

    σPref(zPref)=pγzK and σSuf(zSuf)=γzLs such that n|γz|4Mα+2MLit for n{K,L}.

Lemma 17.

The assignment σPref{zpγzKs} is a model of φ where K=min(K,L).

Proof sketch..

The idea behind the proof is that if σ⊧̸φ, then σ(z) would need to have a large conflict-free overlap with σ(x) for some flat variable x𝕏. Applying Corollary 7, we reach a contradiction. The full proof of Lemma 17 can be found in [24].

5.2 Prefix (Suffix) Enumeration through Prefix (Suffix) Trees

Having defined ΓK-expansion that acts similarly to inserting a fresh symbol # between a chosen pPref(z) and a suffix sSuf(z) of a non-flat variable z𝑉𝑎𝑟𝑠(), we can start enumerating prefixes pPref(z) (or suffixes) up to a certain bound, while searching for a model. We introduce the concept of prefix (suffix) trees that play a major role in our proofs. Below, we give only the definition of a prefix tree; a suffix tree is defined symmetrically.

Definition 18 (Choice state).

Let 𝒜=(Q,Δ,I,F) be a DFA. We say that a state qQ is a choice state if |{(q,a,r)Δ:aΣ,rQ}|>1. We write C(𝒜) to denote the set of all choice states of 𝒜.

Definition 19 (Prefix tree).

Let z𝕏 be a variable with its language z given by a DFA 𝒜z=(Qz,Δz,{q0},Fz). We define z’s prefix tree Tz=(Vz,Ez,rz,stz,𝒲z) as an (infinite finitely-branching) tree with vertices Vz rooted in rzVz such that

  • stz:VzQz is a function that labels non-root vertices of Tz with 𝒜z’s choice states, i.e., stz(v)C(𝒜z) for any vrz and st(rz)=q0,

  • EzVz×Σ+×Vz is a set of labelled edges such that (v,a1an,v)Ez iff there is a run stz(v)a1q1a2anstz(v) in 𝒜z where for all 0<i<n it holds that qiC(𝒜).

  • 𝒲z:Vz×VzΣ is a function that maps any two vertices connected by an edge to the label on the edge, i.e., 𝒲z(v,v)=w iff there exists an edge (v,w,v)Ez and is undefined otherwise.

Intuitively, vertices of the tree are labeled by 𝒜z’s choice states C(𝒜z), i.e., states in which we can choose between multiple outgoing transitions along different alphabet symbols. Vertices s and s are connected by an edge in Tz if st(s) is reachable from st(s) without passing through any choice state.

A path π in Tz is a sequence of vertices π=s0sn where (si,wi,si+1)Ez for any 0i<n. We lift the definition of 𝒲 to paths as 𝒲(s0sn)𝒲((s0,s1))𝒲((sn1,sn)).

Definition 20 (Dead-end vertex of a prefix tree).

Let φ=¬Contains(𝒩,)Φ and Tz=(V,E,v0,st,𝒲) be the prefix tree for z𝕏, and let σ:(𝕏{z})Σ be a partial assignment. A vertex vnV is called a dead end in Tz w.r.t. σ if σ⊧̸φ[z/z#] where σσ{z𝒲(v0vn)} for v0vn being the (single) path between v0 and vn in Tz.

Intuitively, dead-end vertices (and all vertices that are below them in the prefix tree) are not interesting for obtaining a ¬Contains model. Consider, e.g., φ¬Contains(abx,xz)Φ with x=(ab)+ and z=(a{b,c}c). We have φ[z/z#]=¬Contains(𝒩,)=¬Contains(abx,xz#) and, thus, the vertex vVz corresponding to the prefix abca is a dead end in Tz w.r.t. σ={xab} since σ(𝒩)=abab is a factor of σ()=ababca#.

Definition 21 (H-reaching path).

Let π=v0vn be a path in a prefix tree Tz=(V,E,v0,st,𝒲) and H. We say that π is H-reaching if |𝒲(v0vn)|H|𝒲(v0vn1)|.

In our proof, we explore all prefixes of words in a language up to a certain bound H. As we have a prefix tree with edges labelled with words of (possibly) different lengths, stating that we have explored all prefixes of the length precisely H is problematic. Hence, the concept of H-reaching paths is a relaxation allowing paths (prefixes) to slightly vary in length.

6 Underapproximating Non-Flat Variables

In this section, we give the main lemma allowing to underapproximate the language of non-flat variables with a flat language. Throughout this section we use three constants λFlat,λQ,λκ with the following semantics:

  • The constant λQ is the length of prefixes (suffixes) of non-flat variables that we will enumerate in our proofs, searching a model that is shorter w.r.t. some non-flat variable.

  • The constant λFlat is the minimal size of words assigned to flat variables occurring in 𝒩.

  • λκ is used as the value of the parameter K in every application of ΓPref(z)K or ΓSuf(z)K.

First, let us define parameters of ¬Contains(𝒩,) that we use to define the above bounds. Let MLit be the length of the longest string literal in 𝒩 and , let MQ be the largest number of states of a DFA associated with some variable. Furthermore, let Mα be the length of the longest word in the set Wα(𝒩)Wγ where Wγ is the set of the primitive words γz used to define the Γz-expansion for every non-flat variable z𝑉𝑎𝑟𝑠().

Since λFlat and λκ depend on the value of λQ, we start by fixing λQ2MαMQ+MLit. Intuitively, for any non-flat variable z𝕏, we set up λQ in a way so that if we consider all prefixes in Tz up to the length MαMQ, then Tz will contain paths through any state qQz since 𝒜z is a single SCC. After extending these paths up to the length λQ, we can guarantee that Tz will contain all words read from any state q of the length at least Mα. Considering all possible words of the length |β| for some βBase(𝒩) readable from a state will be crucial later, as we will show that there can be only a few such words if we fail to find an alternative model σσ{zwz} s.t. |σ(z)|<|σ(z)|, assuming the existence of a model σ.

The remaining bounds λFlat and λκ are defined as λFlatλQ+4Mα+MQ and λκλFlat+2Mα+2MLit. Ignoring some technical details and due to reasons that will be revealed shortly, we need λFlat to be slightly longer than λQ, so that when we later construct σσ{zp} for some particular prefix |p|λQ+MQ, we can establish some of the string that precedes an occurrence of z in σ|𝕏{z}() in the case σ fails to be a model. Finally, λκ is set up so that together with λFlat they allow Lemma 17 to be applied, where λQ and λFlat play the role of K0 and N0, respectively.

We remark that the exact values of λQ, λFlat, and λκ are not important when reading the proof for the first time. It is sufficient to note that λQ<λFlat<λκ, and that the difference in sizes between these bounds is sufficiently large.

6.1 Overcoming the Infinite by Equivalence with a Finite Index

Our procedure to decide φ=¬Contains(𝒩,) containing non-flat variables in originates in enumeration of partial assignments η:𝑉𝑎𝑟𝑠(𝒩)Σ, since it is easy to find suitable values for non-flat variables when 𝒩 is a literal due to us fixing values of all variables in 𝒩. The problem is that there is an infinite number of such assignments. Our key observation allowing us prove that we can underapproximate non-flat languages using flat ones is that the precise values of flat variables occurring in 𝒩 does not matter as long as these variables have assigned sufficiently long words. In general, however, a model σφ might assign long words only to a subset of flat variables. Therefore, in our decision procedure, we first guess the set X of flat variables that are assigned words shorter than λFlat. Since there are finitely many such words, we have a finite number of possible choices τ:XΣ of values these variables can attain. We enumerate all possible valuations τ, and for every such a valuation τ we produce a new constraint φτ in which we replace every short variable xX by the word τ(x). The regular constraints restricting the remaining flat variables are modified to permit only words longer than λFlat, allowing us to assume in our proofs that flat variables in φτ have assigned sufficiently long words.

Let us formalize our observation that the precise length of words assigned to flat variables does not matter as long as they are sufficiently long. Let η:𝑉𝑎𝑟𝑠(𝒩)Σ be a partial assignment. We define the set X<λ as X<λ(η){x𝑉𝑎𝑟𝑠(𝒩):|η(x)|<λ}. Given a constant λ, we say that two partial assignments η and ϑ are λ-equivalent denoted by ηλϑ iff ηλϑ𝑑𝑒𝑓η|X<λ(η)=ϑ|X<λ(ϑ).

Clearly, λ has a finite index and if there exists a model σ of φ, then its restriction σ|𝑉𝑎𝑟𝑠(𝒩) will fall into one of the equivalence classes induced by λ. Setting λ=λFlat, we inspect all equivalence classes, checking whether any of them contains a model. Given a representative η of an equivalence class, we replace all variables x𝑉𝑎𝑟𝑠(𝒩) with η(x) if |η(x)|<λFlat, producing a new constraint φη. Furthermore, we need to include the fact that the remaining variables in 𝒩 have assigned long words. Therefore, the languages of all variables y𝑉𝑎𝑟𝑠(𝒩) such that |η(y)|λFlat will have their their languages restricted to a new language y=y{|w|λFlatwΣ} in φη. The resulting constraint φη is clearly equisatisfiable to φ with models restricted to be λFlat-equivalent to η.

Some of these instances can be decided without any additional work. In particular, if we have an assignment η such that X<λFlat(η)=𝑉𝑎𝑟𝑠(𝒩), i.e., all variables occurring in 𝒩 are short, we fix values of all variables in 𝒩, and, thus, the 𝒩eedle of φη is a word. The remaining instances with X<λFlat(τ)𝑉𝑎𝑟𝑠(𝒩) contain at least one occurrence of a (long) variable in τ(𝒩), and, thus, their decidability requires investigation.

6.2 Inspecting the Structure of Non-flat Variables in the Presence of Long Flat Variables

Throughout this section, we fix φ be a ¬Contains instance resulting from the previous section, i.e., φφη=¬Contains(𝒩,Φ,η for some equivalence class representative η such that 𝑉𝑎𝑟𝑠(𝒩). We start by stating the key theorem for our decidability result.

Theorem 22.

Let z be a (decomposed) non-flat variable present in . There is a flat language zFlatz s.t. if there exists a model σφ, then there exists a model σφ[z/z#] s.t. σσ{zwz} for some word wzzFlat.

Before presenting quite technical lemmas that allowed us to obtain the result, let us derive some intuition on why the theorem holds. Assume that we have a model σ of φ and we we pick some long prefix p and a long suffix s for the variable z, and we glue them together using Γz-expansion to produce a word wΓzλκ(p,s) and an altered assignment σ{zw}. The core of the theorem lies in analyzing the situation when σ fails to be a model. By symmetry, we focus on the case when our choice of the prefix p is problematic. We have two possibilities.

  • There is a short prefix of p due to which σ fails to be a model. We address this by systematically exploring the prefix tree of z up to a certain bound, marking the vertices that correspond to such prefixes as dead ends.

  • Our choice of p does not cause σ to immediately fail to be a model, however, by applying Γz-expansion we introduce an infix due to which σ⊧̸φ. Since we assume that φ results from a previous section, we know that all flat variables have assigned a long word, i.e., σ(𝒩) contains long factors of the form αk for some α which forms the basis of a flat variable x𝑉𝑎𝑟𝑠(𝒩) and k. Γz-expansion glues together a prefix and a suffix using a word γzK where |γz||α| for any base α. Therefore, we know that only a limited part of the infix introduced by Γz-expansion is problematic, otherwise we would have a long overlap between γzK and some factor αk of σ(𝒩). Thus, p contains a long factor αk for some k1 and a primitive α word α that forms the base of a flat variable present in 𝒩. We carefully analyze the effect of such a factor on the structure of 𝒜z.

We now provide an overview of lemmas that lead to Theorem 22. Since these lemmas are quite technical, we accompany them with intuition and only sketch their proofs. Full proofs can be found in [24]. To simplify the presentation, we focus primarily on attempting to find a suitable prefix of a non-flat variable, and hence, our results are formulated in the context of a modified formula that contains a fresh alphabet symbol #. Since the situation is symmetric for suffixes, we can use the properties of Γz-expansion (Lemma 17) and glue together a suitable prefix and a suitable suffix to produce an altered model.

We start with a technical lemma used frequently in our proofs. The lemma shows that if we know that α is a factor of , and we know that a part of in the proximity of the factor α is incompatible with α, then we can show that a large number of overlaps between σ(𝒩) and σ() must contain a conflict if 𝒩 contains a large factor of the form αN.

Lemma 23.

Let α and γ be two primitive words, such that |α||γ|. Let =tαuγλκ and 𝒩=vαNw where t,u,v and w are (possibly empty) words such that N|α|>λFlat and u<λFlat2max(|α|,|γ|). If 1. the prefix p of of the size |p|=|t|+|α|+|w| does not contain the word 𝒩, 2. αPref(v0)and v0Pref(α), then 𝒩 is not a factor of .

Proof sketch..

Since 𝒩 contains a long factor αN and contains at least one α, we can apply Lemma 8 to rule out a lot of overlaps that might be conflict-free. All of the remaining overlaps contain a conflict thanks to Condition 2. The full proof is available in [24].

Lemma 24.

Let x𝕏 be a flat variable with Base(x)={α}, and let z𝑉𝑎𝑟𝑠() be a (decomposed) non-flat variable. Let φ φ¬Contains(𝒩,)=¬Contains(𝒩xαMpW,) be formula with pα being a prefix of α and W=a0an being a non-empty word such that the word pa0 is not a prefix of α.

If there exists a model σ with σ(z) being of the form σ(z)=sαkpWV for some word V, k1, and a suffix s of α, then σ{zΓPref(z)λκ(sαkpW)} is a model of φ[z/z#].

Intuitively, the rightmost variable in 𝒩 is the flat variable x with Base(x)={α}. To the right of x, there is a literal with the prefix αMp that resembles the flat language x. Moreover, σ(z) also starts with a prefix sαkp resembling x, followed by W. Thus, the prefix sαkpW of the word σ(z) mimics the suffix of the right-hand side σ(𝒩). Hence, if we look solely on the prefix of σ(z) and the suffix of σ(𝒩), there are no obvious conflicts.

However, σφ, and, therefore, there must be a conflict outside of z when considering the above alignment. The rest of the proof can be found in [24].

Next, we derive a lemma formalizing that we can restrict languages of non-flat variables to flat ones, producing an equisatisfiable instance. Stating a symmetric lemma for suffixes, and applying Lemma 17 would give us the entire proof of Theorem 22.

Lemma 25.

Let x be the rightmost flat variable with Base(x)={α}, and let z be a non-flat variable occurring in . Let φ be a formula φ¬Contains(𝒩,)Φ=¬Contains(𝒩xαMpW,)Φ where 𝒩(𝕏Σ), M0, pα is a prefix of α, and W=a0an is a non-empty word such that pa0 is not a prefix of α. There exists a flat language zFlatz s.t. if there is a model σφ, then σ{zwz}φ[z/z#] for some word wzzFlat.

The intuition behind Lemma 25 is the same as the intuition behind Theorem 22. We note that the lemma requires the word W to be non-empty. The case for when W=ε has a similar, but simpler proof.

Proof sketch.

We assume the existence of a model σφ, and we systematically explore the prefix tree Tz of the variable z in a breadth-first fashion up to the bound λQ, searching for the word wz. When inspecting any prefix u, we check whether σ{zu} is a model of φ[z/z#], and if not we mark the last vertex corresponding to u as a dead end and we do not explore it further. At the end of the exploration, we inspect the set 𝒫λQ of all λQ-reaching paths in Tz. If there are no such paths, we know that |σ(z)|<λQ, and hence, wz can be found in the finite (flat) language {wz|w|<λQ}. Alternatively, we check for every path π in 𝒫λQ whether σ{zΓPref(z)λκ(uπ)}φ[z/z#] where uπ is the prefix corresponding to π. Since, the number of possible λQ-reaching paths is finite, we have that wz can be found in the flat language {ΓPref(z)λκ(pπ)π𝒫λQ} in the case that ΓPref(z)-expansion of some uπ leads to a model of φ[z/z#].

Next, it might be that none of the paths in 𝒫λQ can be ΓPref(z)-expanded into a model. Let x be the rightmost variable in 𝒩 with Base(x)={α}. Recall that none of the paths in 𝒫λQ contain a dead-end, and that all variables in 𝒩 are flat, and, thus, they have assigned long words. Combined with the properties of ΓPref(z)-expansion, we know the reason why σσ{zΓPref(z)λκ(pπ)} fails to be a model, i.e., we almost accurately know the position of σ(𝒩) in σ(). We show that all paths in 𝒫λQ share the same prefix of the form sαMp for some large k and sSuf(α) and pPref(α). Since z is non-flat, and λQ is larger than the number of states of 𝒜z, we have opportunities to diverge from the shared prefix sαMp in Tz. We show that diverging must immediately lead to a dead-end vertex, and in such a case σ(z) has a prefix sαMpW. Hence, we apply Lemma 23 and obtain that σ{zwz,M} is a model of φ[z/z#] where wz,M=ΓPref(z)λκ(sαMpW). Note that wz,M depends on an unknown integer M, however, the language containing all wz,Ms for every possible choice of k is flat. Alternatively, not-diverging from the path implies that σ(z)sαp for some pPref(α), which is again a flat language.

A careful analysis of the proof of Lemma 25 reveals that the lemma, and, therefore, Theorem 22, is not effective in a sense that one cannot directly construct zFlat. However, we can obtain a decision procedure at the cost of a producing larger flat language. We construct Tz and all paths up to the bound λQ without having σ available, losing the ability to mark dead-end vertices. In the resulting flat language zFlat we include all words shorter than λQ, and ΓPref(z)-expansions of all λQ-reaching paths. The remaining parts of zFlat that correspond to the situation when no λQ-reaching paths can be ΓPref(z)-expanded into a model can be computed from 𝒜z without requiring access to the original model σ. For details, we refer the reader to the full proof of Lemma 25 in [24].

7 Decision Procedure

Finally, we summarize the approach described in previous sections into a decision procedure for ¬Contains. The (nondeterministic) algorithm is shown in Algorithm 1. In the algorithm, for a negated containment φ and a (partial) assignment σ, we use σ(φ) to denote the ¬Contains predicate obtained from φ replacing variables whose assignment is defined with the corresponding assignment.

Algorithm 1 Decision Procedure for ¬Contains.

The set xPref (xSuf) contains prefixes (suffixes) of words from z that might be used to find an alternative model. The 𝗀𝗅𝗎𝖾(xPref,xSuf) procedure glues together prefixes and suffixes, resulting in a language consisting of entire words (not just prefixes or suffixes) from z. The procedure partitions the language xPref into xPref=PwPinc such that Pinc consists of words resulting from an application of ΓPref(z)-expansion. Intuitively, the words in Pw are words from x whereas Pinc are only prefixes that need to be completed into full words from x by concatenating suitable suffixes. We decompose xSuf in the same way into xSuf=SwSinc. The procedure then returns x=PwSw{pγλκs(pγλκ,γλκs)Pinc×Sinc}.

Theorem 26 (Soundness).

If Algorithm 1 terminates with an assignment σ, then σφ.

Proof.

Follows from the fact that xx for every non-flat variable x found in .

Theorem 27 (Completeness).

If φ is satisfiable, then Algorithm 1 terminates with an assignment σ such that σφ. Otherwise Algorithm 1 terminates with the answer 𝖴𝖭𝖲𝖠𝖳.

Proof.

Correctness for two-sided non-flat variables follows from Theorem 11. For remaining non-flat variables, correctness follows from Lemma 25. Finally, correctness of the 𝗀𝗅𝗎𝖾 procedure follows from Lemma 17.

Theorem 28.

A constraint ¬Contains(𝒩,)Φ is decidable in ExpSpace.

Proof sketch..

Decidability follows from the analysis of the decision procedure in Algorithm 1 given above. As for ExpSpace membership, first notice that languages of non-flat variables occurring only in are replaced with flat languages of polynomial size due to the bounds λQ and λκ. Algorithm 1 then uses Lemma 4, bringing the complexity of the procedure to NExpTime. However, obtaining a full model that includes all two-sided non-flat variables brings the procedure to ExpSpace as the length of the word to assigned to a two-sided non-flat variable doubles with each such a variable (cf. Theorem 11).

7.1 Chain-Free Word Equations with ¬𝐂𝐨𝐧𝐭𝐚𝐢𝐧𝐬

After establishing the decidability of a single ¬Contains predicate, we immediately obtain decidability of string fragments that permit the so-called monadic decomposition [46, 16], i.e., expressing the set of solutions as a finite disjunction of regular membership constraints x𝕏xx. These include fragments such as the straight-line fragment [29] or the more expressive chain-free fragment of word equations [4] (note that [4] considers also other predicates). We can therefore easily establish the following theorem.

Theorem 29.

Formula W¬Contains(𝒩,)Φ where W is a conjunction of chain-free word equations is decidable.

8 Future Work

This paper shows that chain-free word equations with regular constraints and a single instance of the ¬Contains predicate are decidable. There are several possible future work directions. First, we wish to investigate the fragment where the number of ¬Contains constraints is not limited to a single one. Another direction is examining combinations of ¬Contains with other predicates, such as length constraints or disequalities. The technique for combining these from [19] based on the reduction of the constraints to reasoning over Parikh images of finite automata is not directly applicable here. Also, the resulting complexity of our procedure is ExpSpace, however, we have hints that the problem might in fact be solvable in NP.

References

  • [1] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Yu-Fang Chen, Bui Phi Diep, Lukáš Holík, Ahmed Rezine, and Philipp Rümmer. Flatten and conquer: A framework for efficient analysis of string constraints. In Albert Cohen and Martin T. Vechev, editors, Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, pages 602–617. ACM, 2017. doi:10.1145/3062341.3062384.
  • [2] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Yu-Fang Chen, Bui Phi Diep, Lukáš Holík, Ahmed Rezine, and Philipp Rümmer. Trau: SMT solver for string constraints. In Nikolaj S. Bjørner and Arie Gurfinkel, editors, 2018 Formal Methods in Computer Aided Design, FMCAD 2018, Austin, TX, USA, October 30 - November 2, 2018, pages 1–5. IEEE, 2018. doi:10.23919/FMCAD.2018.8602997.
  • [3] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Yu-Fang Chen, Bui Phi Diep, Lukáš Holík, Denghang Hu, Wei-Lun Tsai, Zhillin Wu, and Di-De Yen. Solving not-substring constraint with flat abstraction. In Programming Languages and Systems, pages 305–320, Cham, 2021. Springer International Publishing. doi:10.1007/978-3-030-89051-3_17.
  • [4] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bui Phi Diep, Lukáš Holík, and Petr Janků. Chain-free string constraints. In Yu-Fang Chen, Chih-Hong Cheng, and Javier Esparza, editors, Automated Technology for Verification and Analysis - 17th International Symposium, ATVA 2019, Taipei, Taiwan, October 28-31, 2019, Proceedings, volume 11781 of Lecture Notes in Computer Science, pages 277–293. Springer, 2019. doi:10.1007/978-3-030-31784-3_16.
  • [5] C. Aiswarya, Soumodev Mal, and Prakash Saivasan. On the satisfiability of context-free string constraints with subword-ordering. In Proceedings of the 37th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS ’22, New York, NY, USA, 2022. Association for Computing Machinery. doi:10.1145/3531130.3533329.
  • [6] Rajeev Alur and Pavol Černý. Streaming transducers for algorithmic verification of single-pass list-processing programs. SIGPLAN Not., 46(1):599–610, January 2011. doi:10.1145/1925844.1926454.
  • [7] Clark W. Barrett, Cesare Tinelli, Morgan Deters, Tianyi Liang, Andrew Reynolds, and Nestan Tsiskaridze. Efficient solving of string constraints for security analysis. In William L. Scherlis and David Brumley, editors, Proceedings of the Symposium and Bootcamp on the Science of Security, Pittsburgh, PA, USA, April 19-21, 2016, pages 4–6. ACM, 2016. doi:10.1145/2898375.2898393.
  • [8] Murphy Berzish, Joel D. Day, Vijay Ganesh, Mitja Kulczynski, Florin Manea, Federico Mora, and Dirk Nowotka. Towards more efficient methods for solving regular-expression heavy string constraints. Theor. Comput. Sci., 943:50–72, 2023. doi:10.1016/j.tcs.2022.12.009.
  • [9] Murphy Berzish, Mitja Kulczynski, Federico Mora, Florin Manea, Joel D. Day, Dirk Nowotka, and Vijay Ganesh. An SMT solver for regular expressions and linear arithmetic over string length. In Alexandra Silva and K. Rustan M. Leino, editors, Computer Aided Verification - 33rd International Conference, CAV 2021, Virtual Event, July 20-23, 2021, Proceedings, Part II, volume 12760 of Lecture Notes in Computer Science, pages 289–312. Springer, 2021. doi:10.1007/978-3-030-81688-9_14.
  • [10] Berzish, Murphy. Z3str4: A solver for theories over strings. PhD thesis, University of Waterloo, 2021. URL: http://hdl.handle.net/10012/17102.
  • [11] Nikolaj S. Bjørner, Nikolai Tillmann, and Andrei Voronkov. Path feasibility analysis for string-manipulating programs. In Stefan Kowalewski and Anna Philippou, editors, Tools and Algorithms for the Construction and Analysis of Systems, 15th International Conference, TACAS 2009, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009, York, UK, March 22-29, 2009. Proceedings, volume 5505 of Lecture Notes in Computer Science, pages 307–321. Springer, 2009. doi:10.1007/978-3-642-00768-2_27.
  • [12] František Blahoudek, Yu-Fang Chen, David Chocholatý, Vojtěch Havlena, Lukáš Holík, Ondřej Lengál, and Juraj Síč. Word equations in synergy with regular constraints. In Marsha Chechik, Joost-Pieter Katoen, and Martin Leucker, editors, Formal Methods - 25th International Symposium, FM 2023, Lübeck, Germany, March 6-10, 2023, Proceedings, volume 14000 of Lecture Notes in Computer Science, pages 403–423. Springer, 2023. doi:10.1007/978-3-031-27481-7_23.
  • [13] Taolue Chen, Yan Chen, Matthew Hague, Anthony W. Lin, and Zhilin Wu. What is decidable about string constraints with the replaceall function. Proc. ACM Program. Lang., 2(POPL):3:1–3:29, 2018. doi:10.1145/3158091.
  • [14] Taolue Chen, Alejandro Flores-Lamas, Matthew Hague, Zhilei Han, Denghang Hu, Shuanglong Kan, Anthony W. Lin, Philipp Rümmer, and Zhilin Wu. Solving string constraints with regex-dependent functions through transducers with priorities and variables. Proc. ACM Program. Lang., 6(POPL):1–31, 2022. doi:10.1145/3498707.
  • [15] Taolue Chen, Matthew Hague, Jinlong He, Denghang Hu, Anthony Widjaja Lin, Philipp Rümmer, and Zhilin Wu. A decision procedure for path feasibility of string manipulating programs with integer data type. In Dang Van Hung and Oleg Sokolsky, editors, Automated Technology for Verification and Analysis - 18th International Symposium, ATVA 2020, Hanoi, Vietnam, October 19-23, 2020, Proceedings, volume 12302 of Lecture Notes in Computer Science, pages 325–342. Springer, 2020. doi:10.1007/978-3-030-59152-6_18.
  • [16] Taolue Chen, Matthew Hague, Anthony W. Lin, Philipp Rümmer, and Zhilin Wu. Decision procedures for path feasibility of string-manipulating programs with complex operations. Proc. ACM Program. Lang., 3(POPL):49:1–49:30, 2019. doi:10.1145/3290362.
  • [17] Yu-Fang Chen, David Chocholatý, Vojtěch Havlena, Lukáš Holík, Ondřej Lengál, and Juraj Síč. Solving string constraints with lengths by stabilization. Proceedings of the ACM on Programming Languages, 7(OOPSLA2):2112–2141, 2023. doi:10.1145/3622872.
  • [18] Yu-Fang Chen, David Chocholatý, Vojtěch Havlena, Lukáš Holík, Ondřej Lengál, and Juraj Síč. Z3-Noodler: An automata-based string solver. In Bernd Finkbeiner and Laura Kovács, editors, Tools and Algorithms for the Construction and Analysis of Systems - 30th International Conference, TACAS 2024, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2024, Luxembourg City, Luxembourg, April 6-11, 2024, Proceedings, Part I, volume 14570 of Lecture Notes in Computer Science, pages 24–33. Springer, 2024. doi:10.1007/978-3-031-57246-3_2.
  • [19] Yu-Fang Chen, Vojtěch Havlena, Michal Hečko, Lukáš Holík, and Ondřej Lengál. A uniform framework for handling position constraints in string solving. Proc. ACM Program. Lang., 9(PLDI), 2025. doi:10.1145/3729273.
  • [20] Joel D. Day, Thorsten Ehlers, Mitja Kulczynski, Florin Manea, Dirk Nowotka, and Danny Bøgsted Poulsen. On solving word equations using SAT. In RP’19, volume 11674 of LNCS, pages 93–106. Springer, 2019. doi:10.1007/978-3-030-30806-3_8.
  • [21] Joel D. Day, Vijay Ganesh, Nathan Grewal, and Florin Manea. On the expressive power of string constraints. Proc. ACM Program. Lang., 7(POPL), January 2023. doi:10.1145/3571203.
  • [22] Joel D. Day, Vijay Ganesh, Paul He, Florin Manea, and Dirk Nowotka. The satisfiability of extended word equations: The boundary between decidability and undecidability. CoRR, abs/1802.00523, 2018. arXiv:1802.00523.
  • [23] Leonardo Mendonça de Moura and Nikolaj S. Bjørner. Z3: An efficient SMT solver. In C. R. Ramakrishnan and Jakob Rehof, editors, Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008. Proceedings, volume 4963 of Lecture Notes in Computer Science, pages 337–340. Springer, 2008. doi:10.1007/978-3-540-78800-3_24.
  • [24] Vojtěch Havlena, Michal Hečko, Lukáš Holík, and Ondřej Lengál. Negated string containment is decidable (technical report). CoRR, abs/2506.22061, 2025. arXiv:2506.22061.
  • [25] Juhani Karhumäki, Filippo Mignosi, and Wojciech Plandowski. The expressibility of languages and relations by word equations. J. ACM, 47(3):483–505, 2000. doi:10.1145/337244.337255.
  • [26] Tianyi Liang, Andrew Reynolds, Cesare Tinelli, Clark W. Barrett, and Morgan Deters. A DPLL(T) theory solver for a theory of strings and regular expressions. In Armin Biere and Roderick Bloem, editors, Computer Aided Verification - 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014. Proceedings, volume 8559 of Lecture Notes in Computer Science, pages 646–662. Springer, 2014. doi:10.1007/978-3-319-08867-9_43.
  • [27] Tianyi Liang, Andrew Reynolds, Nestan Tsiskaridze, Cesare Tinelli, Clark W. Barrett, and Morgan Deters. An efficient SMT solver for string constraints. Formal Methods Syst. Des., 48(3):206–234, 2016. doi:10.1007/s10703-016-0247-6.
  • [28] Tianyi Liang, Nestan Tsiskaridze, Andrew Reynolds, Cesare Tinelli, and Clark W. Barrett. A decision procedure for regular membership and length constraints over unbounded strings. In Carsten Lutz and Silvio Ranise, editors, Frontiers of Combining Systems - 10th International Symposium, FroCoS 2015, Wroclaw, Poland, September 21-24, 2015. Proceedings, volume 9322 of Lecture Notes in Computer Science, pages 135–150. Springer, 2015. doi:10.1007/978-3-319-24246-0_9.
  • [29] Anthony Widjaja Lin and Pablo Barceló. String solving with word equations and transducers: Towards a logic for analysing mutation XSS. In Rastislav Bodík and Rupak Majumdar, editors, Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, pages 123–136. ACM, 2016. doi:10.1145/2837614.2837641.
  • [30] M. Lothaire, editor. Combinatorics on Words. Cambridge Mathematical Library. Cambridge University Press, 2 edition, 1997.
  • [31] M. Lothaire. Algebraic Combinatorics on Words. Cambridge University Press, 2002.
  • [32] Kevin Lotz, Amit Goel, Bruno Dutertre, Benjamin Kiesl-Reiter, Soonho Kong, Rupak Majumdar, and Dirk Nowotka. Solving string constraints using sat. In Constantin Enea and Akash Lal, editors, Computer Aided Verification, pages 187–208, Cham, 2023. Springer Nature Switzerland. doi:10.1007/978-3-031-37703-7_9.
  • [33] Zhengyang Lu, Stefan Siemer, Piyush Jha, Joel D. Day, Florin Manea, and Vijay Ganesh. Layered and staged Monte Carlo tree search for SMT strategy synthesis. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 1907–1915. ijcai.org, 2024. URL: https://www.ijcai.org/proceedings/2024/211.
  • [34] R. C. Lyndon and M. P. Schützenberger. The equation aM=bNcP in a free group. Michigan Mathematical Journal, 9(4):289–298, 1962. doi:10.1307/mmj/1028998766.
  • [35] G S Makanin. The problem of solvability of equations in a free semigroup. Mathematics of the USSR-Sbornik, 32(2):129, February 1977. doi:10.1070/SM1977v032n02ABEH002376.
  • [36] Jakob Nielsen. Die isomorphismen der allgemeinen, unendlichen gruppe mit zwei erzeugenden. Mathematische Annalen, 78(1):385–397, 1917.
  • [37] Aina Niemetz, Mathias Preiner, Andrew Reynolds, Clark W. Barrett, and Cesare Tinelli. Syntax-guided quantifier instantiation. In Jan Friso Groote and Kim Guldstrand Larsen, editors, Tools and Algorithms for the Construction and Analysis of Systems - 27th International Conference, TACAS 2021, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021, Luxembourg City, Luxembourg, March 27 - April 1, 2021, Proceedings, Part II, volume 12652 of Lecture Notes in Computer Science, pages 145–163. Springer, 2021. doi:10.1007/978-3-030-72013-1_8.
  • [38] Andres Nötzli, Andrew Reynolds, Haniel Barbosa, Clark W. Barrett, and Cesare Tinelli. Even faster conflicts and lazier reductions for string solvers. In Sharon Shoham and Yakir Vizel, editors, Computer Aided Verification - 34th International Conference, CAV 2022, Haifa, Israel, August 7-10, 2022, Proceedings, Part II, volume 13372 of Lecture Notes in Computer Science, pages 205–226. Springer, 2022. doi:10.1007/978-3-031-13188-2_11.
  • [39] Wojciech Plandowski. Satisfiability of word equations with constants is in PSPACE. In 40th Annual Symposium on Foundations of Computer Science, FOCS ’99, 17-18 October, 1999, New York, NY, USA, pages 495–500. IEEE Computer Society, 1999. doi:10.1109/SFFCS.1999.814622.
  • [40] Andrew Reynolds, Andres Nötzli, Clark W. Barrett, and Cesare Tinelli. High-level abstractions for simplifying extended string constraints in SMT. In Isil Dillig and Serdar Tasiran, editors, Computer Aided Verification - 31st International Conference, CAV 2019, New York City, NY, USA, July 15-18, 2019, Proceedings, Part II, volume 11562 of Lecture Notes in Computer Science, pages 23–42. Springer, 2019. doi:10.1007/978-3-030-25543-5_2.
  • [41] Andrew Reynolds, Andres Nötzli, Clark W. Barrett, and Cesare Tinelli. Reductions for strings and regular expressions revisited. In 2020 Formal Methods in Computer Aided Design, FMCAD 2020, Haifa, Israel, September 21-24, 2020, pages 225–235. IEEE, 2020. doi:10.34727/2020/isbn.978-3-85448-042-6_30.
  • [42] Andrew Reynolds, Maverick Woo, Clark W. Barrett, David Brumley, Tianyi Liang, and Cesare Tinelli. Scaling up DPLL(T) string solvers using context-dependent simplification. In Rupak Majumdar and Viktor Kuncak, editors, Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part II, volume 10427 of Lecture Notes in Computer Science, pages 453–474. Springer, 2017. doi:10.1007/978-3-319-63390-9_24.
  • [43] Neha Rungta. A billion SMT queries a day (invited paper). In Sharon Shoham and Yakir Vizel, editors, Computer Aided Verification - 34th International Conference, CAV 2022, Haifa, Israel, August 7-10, 2022, Proceedings, Part I, volume 13371 of Lecture Notes in Computer Science, pages 3–18. Springer, 2022. doi:10.1007/978-3-031-13185-1_1.
  • [44] Prateek Saxena, Devdatta Akhawe, Steve Hanna, Feng Mao, Stephen McCamant, and Dawn Song. A symbolic execution framework for JavaScript. In 31st IEEE Symposium on Security and Privacy, S&P 2010, 16-19 May 2010, Berleley/Oakland, California, USA, pages 513–528. IEEE Computer Society, 2010. doi:10.1109/SP.2010.38.
  • [45] Trauc string constraints benchmark collection, 2020. URL: https://github.com/plfm-iis/trauc_benchmarks.
  • [46] Margus Veanes, Nikolaj S. Bjørner, Lev Nachmanson, and Sergey Bereg. Monadic decomposition. J. ACM, 64(2):14:1–14:28, 2017. doi:10.1145/3040488.