The Trichotomy of Regular Property Testing
Abstract
Property testing is concerned with the design of algorithms making a sublinear number of queries to distinguish whether the input satisfies a given property or is far from having this property. A seminal paper of Alon, Krivelevich, Newman, and Szegedy in 2001 introduced property testing of formal languages: the goal is to determine whether an input word belongs to a given language, or is far from any word in that language. They constructed the first property testing algorithm for the class of all regular languages. This opened a line of work with improved complexity results and applications to streaming algorithms. In this work, we show a trichotomy result: the class of regular languages can be divided into three classes, each associated with an optimal query complexity. Our analysis yields effective characterizations for all three classes using so-called minimal blocking sequences, reasoning directly and combinatorially on automata.
Keywords and phrases: property testing, regular languages
Category: Track B: Automata, Logic, Semantics, and Theory of Programming
Funding: Gabriel Bathie: Partially funded by the grant ANR-20-CE48-0001 from the French National Research Agency.
2012 ACM Subject Classification: Theory of computation → Regular languages
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction
Property testing was introduced by Goldreich, Goldwasser and Ron [18] in 1998: it is the study of randomized approximate decision procedures that must distinguish objects that have a given property from those that are far from having it. Because of this relaxation on the specification, the field focuses on very efficient decision procedures, typically with sublinear (or even constant) running time – in particular, the algorithm does not even have the time to read the whole input.
In a seminal paper, Alon et al. [4] introduced property testing of formal languages: given a language of finite words, the goal is to determine whether an input word belongs to the language or is ε-far from it (informally, a word u is ε-far from a language L if even by changing an ε-fraction of the letters of u, we cannot obtain a word in L; see Section 2 for a formal definition), where ε is the precision parameter. The model assumes random access to the input word: a query specifies a position in the word and asks for the letter at that position, and the query complexity of the algorithm is the worst-case number of queries it makes to the input. Alon et al. [4] showed a surprising result: under the Hamming distance, all regular languages are testable with O(log^3(1/ε)/ε) queries, where the O notation hides constants that depend on the language, but, crucially, not on the length of the input word. In that paper, they also identified the class of trivial regular languages, those for which the answer is always yes or always no for sufficiently large n, e.g. finite languages or the set of words starting with an a, and showed that testing membership in a non-trivial regular language requires Ω(1/ε) queries.
The results of Alon et al. [4] leave a multiplicative gap of log^3(1/ε) between the best upper and lower bounds. We set out to improve our understanding of property testing of regular languages by closing this gap. Bathie and Starikovskaya obtained in 2021 [8] the first improvement over the result of Alon et al. [4] in more than 20 years:
Fact 1.1 (From [8, Theorem 5]).
Under the edit distance, every regular language can be tested with O(log(1/ε)/ε) queries.
Testers under the edit distance are weaker than testers under the Hamming distance, hence this result does not exactly improve the result of Alon et al. [4]. We overcome this shortcoming later in this article: Theorem 4.15 extends the above result to the case of the Hamming distance.
Bathie and Starikovskaya [8] also showed that this upper bound is tight, in the sense that there is a regular language for which this complexity cannot be further improved, thereby closing the query complexity gap.
Fact 1.2 (From [8, Theorem 15]).
There is a regular language with query complexity Ω(log(1/ε)/ε) under the edit distance (note that, as opposed to testers, lower bounds for the edit distance are stronger than lower bounds for the Hamming distance), for all small enough ε.
Furthermore, it is easy to find specific non-trivial regular languages for which there is an algorithm using only O(1/ε) queries, e.g. over the alphabet , or .
Hence, these results combined with those of Alon et al. [4] show that there exist trivial languages (that require 0 queries for large enough n), easy languages (with query complexity Θ(1/ε)) and hard languages (with query complexity Θ(log(1/ε)/ε)). This raises the question of whether there exist languages with a different query complexity, or if every regular language is either trivial, easy or hard. It also raises the question of giving a characterization of the languages that belong to each class, inspired by the recent success of exact characterizations of the complexity of sliding window recognition [15] and dynamic membership [6] for regular languages.
In this article, we answer both questions: we show a trichotomy of the complexity of testing regular languages under the Hamming distance (we consider one-sided error testers, also called testers with perfect completeness; see the definitions below), showing that there are only the three aforementioned complexity classes (trivial, easy and hard); we give a characterization of all three classes using a combinatorial object called blocking sequences; and we show that this characterization can be decided in polynomial space (and that it is complete for PSPACE). This trichotomy theorem closes a line of work on improving query complexity for property testers and identifying easier subclasses of regular languages.
1.1 Related work
A very active branch of property testing focuses on graph properties: for instance, one can test whether a given graph appears as a subgraph [2] or as an induced subgraph [3], and more generally every monotone graph property can be tested with one-sided error [5]. Other families of objects heavily studied under this algorithmic paradigm include probability distributions [23, 10], also under privacy constraints [1], numerical functions [9, 26], and programs [12, 11]. We refer to the book of Goldreich [17] for an overview of the field of property testing.
Testing formal languages.
Building upon the seminal work of Alon et al. [4], Magniez and de Rougemont [21] gave a tester using queries for regular languages under the edit distance with moves, and François et al. [14] gave a tester using queries for the case of the weighted edit distance. Alon et al. [4] also show that context-free languages cannot be tested with a constant number of queries, and subsequent work has considered testing specific context-free languages such as the Dyck languages [24, 13] or regular tree languages [21]. Property testing of formal languages has been investigated in other settings: Ganardi et al. [16] studied the question of testing regular languages in the so-called “sliding window model”, while others considered property testing for subclasses of context-free languages in the streaming model: visibly pushdown languages [14], Dyck languages [19, 20, 22], and others [7]. A recent application of property testing of regular languages was to detect race conditions in execution traces of distributed systems [27].
1.2 Main result and overview of the paper
We start with a high-level presentation of the approach, main result, and key ideas. In this section we assume familiarity with standard notions such as finite automata; we will detail notations in Section 2.
Let us start with the notion of a property tester for a language L: the goal is to determine whether an input word u belongs to the language L, or whether it is ε-far from it. We say that a word u of length n is ε-far from L with respect to a metric d over words if all words v ∈ L satisfy d(u, v) > εn, written d(u, L) > εn. Throughout this work and unless explicitly stated otherwise, we will consider the case where d is the Hamming distance, defined for two words u and v as the number of positions at which they differ if they have the same length, and as +∞ otherwise. In that case, d(u, L) > εn means that one cannot change a proportion ε of the letters in u to obtain a word in L. We assume random access to the input word: a query specifies a position in the word and asks for the letter in this position. An ε-property tester (or for short, simply a tester) for a language L is a randomized algorithm T that, given an input word u of length n, always answers “yes” if u ∈ L and answers “no” with probability bounded away from 0 when u is ε-far from L. As in previous works on this topic, we measure the complexity of a tester by its query complexity: the maximum number of queries that T makes on an input of length n, as a function of n and ε, in the worst case over all words of length n and all possible random choices.
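To make the query model concrete, the following minimal sketch (in Python; all names and the sampling rate are ours and purely illustrative, not the algorithms of this paper) shows a query-counting oracle and the general shape of a one-sided tester built on top of it.

```python
import random

class WordOracle:
    """Random access to an input word; records which positions were queried."""
    def __init__(self, word):
        self._word = word
        self.queried = set()

    def __len__(self):
        return len(self._word)

    def query(self, i):
        """Return the letter at position i (0-indexed) and record the query."""
        self.queried.add(i)
        return self._word[i]

def one_sided_tester(oracle, eps, find_witness):
    """Generic one-sided tester: reject only if an explicit witness of
    non-membership is found among the sampled positions, so words of the
    language are always accepted (perfect completeness)."""
    n = len(oracle)
    if n == 0:
        return "yes"
    num_samples = max(1, int(1 / eps))          # illustrative sample size
    positions = sorted(random.randrange(n) for _ in range(num_samples))
    sample = {i: oracle.query(i) for i in positions}
    return "no" if find_witness(sample, n) else "yes"
```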
We can now formally define the classes of trivial, easy and hard regular languages.
Definition 1.3 (Hard, easy and trivial languages).
Let L be a regular language. We say that:
- L is hard if the optimal query complexity of a property tester for L is Θ(log(1/ε)/ε).
- L is easy if the optimal query complexity of a property tester for L is Θ(1/ε).
- L is trivial if there exists ε₀ > 0 such that for all positive ε ≤ ε₀, there is an ε-property tester and some n₀ such that the tester makes 0 queries on words of length at least n₀.
Remark 1.4.
If L is finite, then it is trivial: since there is a bound ℓ on the lengths of its words, a tester can reject all words of length greater than ℓ without querying them. For that reason, we only consider infinite languages in the rest of the article.
Our characterization of those three classes uses the notion of blocking sequences of a language L. Intuitively, they are sequences of words such that finding those words as factors of a word w proves that w is not in L. We also define a partial order on them, which gives us a notion of minimal blocking sequence.
Theorem 1.5.
Let L be an infinite regular language recognized by an NFA A, and consider the set of minimal blocking sequences of A. The complexity of testing L is characterized by this set as follows:
1. L is trivial if and only if the set is empty;
2. L is easy if and only if the set is finite and nonempty;
3. L is hard if and only if the set is infinite.
In the case where L is recognised by a strongly connected automaton, blocking sequences can be replaced by blocking factors. A blocking factor is a single word that is not a factor of any word in L.
Section 2 defines the necessary terms and notations. The rest of the paper is structured as follows. In Sections 3 and 4, we delimit the set of hard languages, that is, the ones that require Ω(log(1/ε)/ε) queries. More precisely, Section 3 focuses on the subcase of languages defined by strongly connected automata. First, we combine the ideas of Alon et al. [4] with those that Bathie and Starikovskaya presented in [8] to obtain a property tester that uses O(log(1/ε)/ε) queries for any language with a strongly connected automaton, under the Hamming distance. Second, we show that if the language of a strongly connected automaton has infinitely many blocking factors, then it requires Ω(log(1/ε)/ε) queries. This result generalizes the result of Bathie and Starikovskaya [8], which was for a single language, to all regular languages with infinitely many minimal blocking factors. We use Yao’s minimax principle [28]: this involves constructing a hard distribution over inputs, and showing that deterministic property testing algorithms cannot distinguish between positive and negative instances against this distribution.
In Section 4, we extend those results to all automata. The interplay with the previous section is different for the upper and the lower bound. For the upper bound of O(log(1/ε)/ε) queries, we use a natural but technical extension of the proof in the strongly connected case. Note that this result is an improvement over the result of Bathie and Starikovskaya [8], which works under the edit distance, since testers for the Hamming distance are also testers for the edit distance. For the lower bound of Ω(log(1/ε)/ε) queries for languages with infinitely many minimal blocking sequences, we reduce to the strongly connected case. The main difficulty is that it is not enough to consider strongly connected components in isolation: there exist finite automata that contain a strongly connected component that induces a hard language, yet the language of the whole automaton is easy. We solve this difficulty by carefully defining the notion of minimality for a blocking sequence.
Section 5 completes the trichotomy by characterising the easy and trivial languages. We show that languages of automata with finitely many blocking sequences can be tested with O(1/ε) queries. We also prove that if an automaton has at least one blocking sequence, then its language requires Ω(1/ε) queries to be tested, by showing that our notion of trivial language coincides with the one given by Alon et al. [4]. By contrast, we show that automata without blocking sequences recognize trivial languages.
Once we have the trichotomy, it is natural to ask whether it is effective: given an automaton , can we determine if its language is trivial, easy or hard? The answer is yes, and we show in Section 6 that all three decision problems are PSPACE-complete, even for strongly connected automata.
2 Preliminaries
Words and automata.
We write Σ* (resp. Σ⁺) for the set of finite words (resp. non-empty words) over the alphabet Σ. The length of a word w is denoted |w|, its i-th letter is denoted w[i], and the empty word is the word of length 0. Given indices i and j, define w[i..j] as the word w[i]w[i+1]⋯w[j] if i ≤ j and the empty word otherwise; further, w[i..] denotes the word w[i..|w|]. A word u is a factor (resp. prefix, suffix) of w if there exist indices i ≤ j such that u = w[i..j] (a prefix if additionally i = 1, a suffix if j = |w|). We write u ⊑ w when u is a factor of w. Furthermore, if u is a factor of w and u ≠ w, we say that u is a proper factor of w.
A nondeterministic finite automaton (NFA) is a transition system defined by a tuple A = (Q, Σ, δ, q₀, F), with Q a finite set of states, Σ a finite alphabet, δ : Q × Σ → 2^Q the transition function, q₀ ∈ Q the initial state and F ⊆ Q the set of final states. The semantics is as usual [25]. When there is a path from a state q to a state q′ in A, we say that q′ is reachable from q and that q is co-reachable from q′. In this work, we assume w.l.o.g. that all NFAs are trim, i.e., every state is reachable from the initial state and co-reachable from some final state.
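The trimming assumption is easy to enforce; the following sketch (our own, over an ad-hoc dictionary-based NFA representation) keeps exactly the states that are reachable from the initial state and co-reachable from a final state.

```python
from collections import deque

def trim(states, delta, q0, finals):
    """delta: dict mapping (state, letter) -> set of successor states.
    Returns the states that are reachable from q0 and co-reachable from finals."""
    succ = {q: set() for q in states}
    pred = {q: set() for q in states}
    for (p, _letter), targets in delta.items():
        for q in targets:
            succ[p].add(q)
            pred[q].add(p)

    def closure(sources, edges):
        seen, todo = set(sources), deque(sources)
        while todo:
            p = todo.popleft()
            for q in edges[p]:
                if q not in seen:
                    seen.add(q)
                    todo.append(q)
        return seen

    reachable = closure({q0}, succ)          # forward reachability from q0
    co_reachable = closure(set(finals), pred)  # backward reachability from finals
    return reachable & co_reachable
```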
Property testing.
Definition 2.1.
Let L be a language, let u be a word of length n, let ε > 0 be a precision parameter and let d be a metric. We say that the word u is ε-far from L w.r.t. d if d(u, L) > εn, where d(u, L) = min { d(u, v) : v ∈ L }.
We assume random access to the input word: a query specifies a position in the word and asks for the letter in this position.
Throughout this work and unless explicitly stated otherwise, we will consider the case where d is the Hamming distance, defined for two words u and v as the number of positions at which they differ if they have the same length, and as +∞ otherwise. In that case, d(u, L) > εn means that one cannot change an ε-fraction of the letters in u to obtain a word in L.
An ε-property tester (or simply a tester) for a language L is a randomized algorithm that, given an input word u, always answers “yes” if u ∈ L and answers “no” with probability bounded away from 0 when u is ε-far from L.
Definition 2.2.
A property tester T for the language L with precision ε is a randomized algorithm that, for any input u of length n, given random access to u, satisfies the following properties:
- if u ∈ L, then T accepts with probability 1;
- if u is ε-far from L, then T rejects with probability bounded away from 0 by a constant.
The query complexity of T is a function of n and ε that counts the maximum number of queries that T makes over all inputs of length n and over all possible random choices.
We measure the complexity of a tester by its query complexity. Let us emphasize that throughout this article we focus on so-called “testers with perfect completeness”, or “one-sided error”: if a word is in the language, the tester answers positively (with probability 1). In particular, our characterization applies to this class. Because they are based on the notion of blocking factors that we will discuss below, all known testers for regular languages [4, 21, 14, 8] have perfect completeness.
In this paper, we assume that the automaton that describes the tested language is fixed, and not part of the input. Therefore, we consider its number of states as a constant.
Graphs and periodicity.
We now recall tools introduced by Alon et al. [4] to deal with periodicity in finite automata.
Let G = (V, E) with E ⊆ V × V be a directed graph. A strongly connected component (or SCC) of G is a maximal set of vertices that are all reachable from each other. It is trivial if it contains a single state with no self-loop on it. We say that G is strongly connected if its entire set of vertices is an SCC.
The period of a non-trivial strongly connected graph G is the greatest common divisor of the lengths of the cycles in G. Following the work of Alon et al. [4], we will use the following property of directed graphs.
Fact 2.3 (From [4, Lemma 2.3]).
Let G = (V, E) be a non-empty, non-trivial, strongly connected graph with finite period g. Then there exist a partition V = V_0 ∪ … ∪ V_{g−1} and a reachability constant m that does not exceed 3|V|² such that:
1. For every 0 ≤ i, j < g and every u ∈ V_i, v ∈ V_j, the length of any directed path from u to v in G is equal to j − i modulo g.
2. For every 0 ≤ i, j < g, every u ∈ V_i, v ∈ V_j and every integer r ≥ m, if r ≡ j − i (mod g), then there exists a directed path from u to v in G of length r.
The sets V_i are the periodicity classes of G. In what follows, we will slightly abuse notation and use V_i even when i ≥ g to mean V_{i mod g}.
An automaton A defines an underlying graph G_A = (Q, E) where E = {(q, q′) : q′ ∈ δ(q, a) for some letter a}. In what follows, we naturally extend the notions defined above to finite automata through this graph G_A. Note that the numbering of the periodicity classes is defined up to a shift mod g: we can thus always assume that V_0 is the class that contains the initial state q₀. The period of A is the period of G_A.
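For intuition, the period and the periodicity classes of a strongly connected graph can be computed with the standard gcd-of-level-differences method sketched below (this is an illustration, not the construction of [4]); it uses the fact that the period is the gcd of level(u) + 1 − level(v) over all edges (u, v), for any BFS labeling from a fixed vertex.

```python
from collections import deque
from math import gcd

def period_and_classes(vertices, edges):
    """edges: dict vertex -> iterable of successors; the graph is assumed
    strongly connected and non-trivial (it contains at least one cycle)."""
    root = next(iter(vertices))
    level = {root: 0}
    todo = deque([root])
    while todo:                          # BFS levels from the root
        u = todo.popleft()
        for v in edges[u]:
            if v not in level:
                level[v] = level[u] + 1
                todo.append(v)
    g = 0
    for u in vertices:                   # gcd of level differences along edges
        for v in edges[u]:
            g = gcd(g, level[u] + 1 - level[v])
    classes = [set() for _ in range(g)]
    for v in vertices:                   # the class of v is level(v) mod g
        classes[level[v] % g].add(v)
    return g, classes

# Example: a directed 4-cycle has period 4 and four singleton classes.
V = {0, 1, 2, 3}
E = {0: [1], 1: [2], 2: [3], 3: [0]}
print(period_and_classes(V, E))          # (4, [{0}, {1}, {2}, {3}])
```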
Positional words and positional languages.
Consider the language . The word can appear as a factor of a word if occurs at an even position (e.g. position 0, 2, etc.) in . However, if occurs at an odd position in , then cannot be in . Therefore, can be used to witness that is not in , but only if we find it at an odd position. This example leads us to introducing -positional words, which additionally encode information about the index of each letter modulo an integer .
More generally, we will associate to each regular language a period p, and working with p-positional words will allow us to define blocking factors in a position-dependent way without explicitly considering the index at which the factor occurs.
Definition 2.4 (Positional words).
Let p be a positive integer. A p-positional word is a word over the alphabet Σ × {0, …, p − 1} in which the second components of consecutive letters increase by one modulo p, i.e., a word of the form (a₁, i mod p)(a₂, (i+1) mod p)⋯(a_n, (i+n−1) mod p) for some letters a₁, …, a_n and some non-negative integer i. If i = 0, we call it the positional word associated with a₁⋯a_n.
With this definition, if and we consider the -positional word , the factor appears at position in and is mapped to the factor . In this case, even when taking factors of , we still retain the (congruence classes of the) indices in the original word .
Any strongly connected finite automaton A can naturally be extended into an automaton over p-positional words with p·|Q| states. It suffices to keep track in the states of the current state of A and the number of letters read modulo p.
We call the language recognized by this extended automaton the positional language of A. This definition is motivated by the following property:
Property 2.5.
For any word w, we have w ∈ L(A) if and only if the positional word associated with w belongs to the positional language of A.
Positional words make it easier to manipulate factors with positional information, hence we phrase our property testing results in terms of positional languages. Notice that a property tester for the positional language of A immediately gives a property tester for L(A), as one can simulate queries to the positional word with queries to w by simply pairing the index of the query modulo p with its result.
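This simulation is straightforward; the sketch below (names are ours) builds the positional word of a given word and answers one positional query with one query to the underlying word.

```python
def positional_word(w, p, start=0):
    """The p-positional word associated with w, starting at index `start`:
    each letter is paired with its position modulo p."""
    return [(w[i], (start + i) % p) for i in range(len(w))]

def positional_query(oracle_query, p, i):
    """Answer a query to position i of the positional word using a single
    query oracle_query(i) to the underlying word."""
    return (oracle_query(i), i % p)

# Example with p = 2:
w = "abba"
print(positional_word(w, 2))                    # [('a', 0), ('b', 1), ('b', 0), ('a', 1)]
print(positional_query(lambda i: w[i], 2, 3))   # ('a', 1)
```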
3 Hard Languages for Strongly Connected NFAs
Before considering the case of arbitrary NFAs, we first study the case of strongly connected NFAs, which are NFAs such that for any pair of states (q, q′), there exists a word labeling a path from q to q′. We will later generalize the results of this section to all NFAs.
We show that the query complexity of the language of such an NFA A can be characterized by the cardinality of the set of minimal blocking factors of A, which are factor-minimal p-positional words that witness the fact that a word does not belong to L(A). In this section, we consider a fixed NFA A and simply use “positional words” to refer to p-positional words, where p is the period of A.
Definition 3.1 (Blocking factors).
Let A be a strongly connected NFA. A positional word u is a blocking factor of A if it is not a factor of any word of the positional language of A.
Further, we say that u is a minimal blocking factor of A if no proper factor of u is a blocking factor of A. We consider the set of all minimal blocking factors of A.
Intuitively and in terms of automata, the positional word u is blocking for A if no path of A can read u at positions consistent with its positional information. (This property is formally established later in Lemma 3.3.) The main result of this section is the following:
Theorem 3.2.
Let L be an infinite language recognised by a strongly connected NFA A. If the set of minimal blocking factors of A is infinite, then L is hard, i.e., it has query complexity Θ(log(1/ε)/ε).
This result gives both an upper bound of O(log(1/ε)/ε) and a lower bound of Ω(log(1/ε)/ε) on the query complexity of a tester for L: we prove the upper bound in Section 3.2 and the lower bound in Section 3.3.
3.1 Positional words, blocking factors and strongly connected NFAs
We formalize the intuition we gave earlier about blocking factors. In Section 3.2, we highlight the connection between property testing and blocking factors in strongly connected NFAs.
Lemma 3.3.
A positional word u is a blocking factor for A iff for all states q, q′, there is no path from q to q′ that reads u at positions consistent with its positional information.
The above claim allows us to interchangeably use the statements “w is ε-far from L(A)” and “the positional word of w is ε-far from the positional language of A”.
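Lemma 3.3 suggests a direct way to decide whether a positional word is blocking: run it simultaneously from every state allowed by its starting position and check whether some run survives. The sketch below assumes that the periodicity classes of Fact 2.3 have been computed (e.g. as in the earlier sketch) and numbered so that V_0 contains the initial state; the representation of δ is ad hoc.

```python
def is_blocking(delta, classes, positional_factor):
    """delta: dict (state, letter) -> set of states of a strongly connected NFA;
    classes: list of periodicity classes V_0, ..., V_{p-1} (cf. Fact 2.3);
    positional_factor: non-empty list of (letter, position mod p) pairs.
    Returns True iff no position-consistent path of the automaton reads the
    factor, i.e. (by Lemma 3.3) it is a blocking factor."""
    p = len(classes)
    j0 = positional_factor[0][1]
    # A run reading a letter at position j must be in a state of class V_{j mod p}.
    current = set(classes[j0 % p])
    for (letter, _pos) in positional_factor:
        current = {r for q in current for r in delta.get((q, letter), ())}
        if not current:
            return True          # every run died: the factor is blocking
    return False
```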
3.2 An efficient property tester for strongly connected NFAs
In this section, we show that for any strongly connected NFA A, there exists an ε-property tester for L(A) that uses O(log(1/ε)/ε) queries.
Theorem 3.4.
Let A be a strongly connected NFA. For any ε > 0, there exists an ε-property tester for L(A) that uses O(log(1/ε)/ε) queries.
Our proof is similar to the one given in [8], with one notable technical improvement: we use a new method for sampling factors in , which greatly simplifies the correctness analysis.
3.2.1 An efficient sampling algorithm
We first introduce a sampling algorithm (Algorithm 1) that uses few queries and has a large probability of finding at least one factor from a large set of disjoint “special” factors. Using this algorithm on a large set of disjoint blocking factors gives us an efficient property tester for strongly connected NFAs. We will re-use this sampling procedure later in the case of general NFAs (Theorem 4.15).
The procedure is fairly simple: the algorithm samples factors of various lengths in w at random. On the other hand, the correctness of the tester is far from trivial. The lengths and the number of factors of each length are chosen so that the number of queries is minimized and the probability of finding a “special” factor is maximized, regardless of how these factors are distributed in w. (In what follows, the “special” factors are blocking factors.)
Claim 3.5.
A call to the Sampler function (Algorithm 1) makes O(log(1/ε)/ε) queries to w.
Lemma 3.6.
Let w be a word of length n, and consider a set F containing at least k disjoint factors of w, each of length at most ℓ. A call to the function Sampler (Algorithm 1) returns a set of factors of w such that there exists a word of F that is a factor of some word of the returned set, with probability at least 2/3.
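The exact lengths, sample counts and success probability of Algorithm 1 are tuned in the paper; the sketch below only illustrates the general shape of the scheme (factors of geometrically increasing lengths, fewer samples for longer factors), with an arbitrary constant c in place of the actual parameters of Lemma 3.6.

```python
import math
import random

def sampler(query, n, k, max_len, c=4):
    """Sample factors of the word (accessed via `query`, 0-indexed) of lengths
    1, 2, 4, ..., up to roughly twice max_len.  For length 2^j we draw about
    c*n/(k*2^j) random starting positions, so that if the word contains at least
    k disjoint "special" factors of length <= max_len, one of them is likely to
    be covered by a sampled factor.  The constant c is illustrative only."""
    factors = []
    j = 0
    while 2 ** j <= 2 * max_len:
        length = 2 ** j
        num = math.ceil(c * n / (k * length)) if k > 0 else 0
        for _ in range(num):
            start = random.randrange(max(1, n - length + 1))
            factors.append([(query(i), i) for i in range(start, min(n, start + length))])
        j += 1
    return factors
```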
3.2.2 The tester
The algorithm for Theorem 3.4 is given in Algorithm 2.
We now show that Algorithm 2 is a property tester for L(A) that uses O(log(1/ε)/ε) queries. In what follows, we use n to denote the length of the input word and m to denote the number of states of A.
Claim 3.7.
The tester given in Algorithm 2 makes O(log(1/ε)/ε) queries to w.
Alon et al. [4, Lemma 2.6] first noticed that if a word w is ε-far from L(A), then it contains short factors that witness the fact that w is not in L(A). We start by translating the lemma of Alon et al. on “short witnesses” to the framework of blocking factors. More precisely, we show that if w is ε-far from L(A), then its positional word contains many disjoint (i.e. non-overlapping) blocking factors.
Lemma 3.8.
Let ε > 0, let w be a word of length n and assume that L(A) contains at least one word of length n. If w is ε-far from L(A), then the positional word of w contains Ω(εn) disjoint blocking factors (the hidden constant depends only on A).
Next, we show that if w is ε-far from L(A), then its positional word contains Ω(εn) disjoint blocking factors, each of length O(1/ε).
Lemma 3.9.
Let ε > 0, let w be a word of length n and assume that L(A) contains at least one word of length n. If w is ε-far from L(A), then the positional word of w contains Ω(εn) disjoint blocking factors of length O(1/ε) each.
Proof.
Let w be a word and A an automaton satisfying the above hypotheses. By Lemma 3.8, the positional word of w contains Ω(εn) disjoint blocking factors. As these factors are disjoint, at most half of them can have length greater than c/ε for a suitable constant c, as the sum of their lengths cannot exceed n.
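For intuition, disjoint blocking factors can be found by a greedy left-to-right scan, as in the following sketch (it reuses the representation of the earlier sketches and merely illustrates the statement of Lemmas 3.8 and 3.9, not their proofs).

```python
def count_disjoint_blocking_factors(positional_word, delta, classes):
    """Greedy scan: simulate the NFA from every allowed state and, whenever
    every run dies, record one blocking factor and restart just after it.
    The factors found this way are blocking and pairwise disjoint."""
    p = len(classes)
    count, start = 0, 0
    n = len(positional_word)
    while start < n:
        current = set(classes[positional_word[start][1] % p])
        i = start
        while i < n and current:
            letter, _pos = positional_word[i]
            current = {r for q in current for r in delta.get((q, letter), ())}
            i += 1
        if current:          # reached the end without dying: stop
            break
        count += 1           # positional_word[start:i] is a blocking factor
        start = i            # restart right after it, so factors are disjoint
    return count
```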
Proof of Theorem 3.4.
First, note that if , cannot contain a blocking factor for , hence Algorithm 2 always accepts . Next, if is empty or if , the tester has the same output as , hence it is correct.
In the remaining case, w is long enough and ε-far from L(A), hence Lemma 3.9 gives us a large set of short blocking factors in the positional word of w: this is exactly what the Sampler function needs to find at least one factor containing a blocking factor with probability at least 2/3. More precisely, by Lemma 3.9, the positional word of w contains Ω(εn) disjoint blocking factors of length O(1/ε), hence the conditions of Lemma 3.6 are satisfied.
As a factor containing a blocking factor is also a blocking factor, the set computed on line 10 of Algorithm 2 contains at least one blocking factor with probability at least 2/3, and Algorithm 2 satisfies Definition 2.2.
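Putting the pieces together, a tester for a strongly connected NFA has roughly the following shape (a sketch that reuses the hypothetical sampler, is_blocking and periodicity-class helpers from the previous sketches; the thresholds and constants are illustrative and differ from those of Algorithm 2).

```python
def test_strongly_connected(query, n, delta, classes, q0, finals, eps):
    """One-sided tester sketch for the language of a strongly connected NFA.
    It rejects only when a sampled factor is blocking, so words of the
    language are always accepted (perfect completeness)."""
    p = len(classes)
    if n <= 4 / eps:                         # short input: read it entirely and
        w = "".join(query(i) for i in range(n))   # decide membership exactly
        return "yes" if accepts(delta, q0, finals, w) else "no"
    k = max(1, int(eps * n / 2))             # assumed bound on disjoint blocking factors
    max_len = max(1, int(4 / eps))           # assumed bound on their lengths (cf. Lemma 3.9)
    for factor in sampler(query, n, k, max_len):
        positional = [(a, i % p) for (a, i) in factor]
        if positional and is_blocking(delta, classes, positional):
            return "no"                      # explicit witness of non-membership
    return "yes"

def accepts(delta, q0, finals, w):
    """Exact NFA membership check by subset simulation (used only on short inputs)."""
    current = {q0}
    for a in w:
        current = {r for q in current for r in delta.get((q, a), ())}
    return bool(current & set(finals))
```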
3.3 Lower bound from infinitely many minimal blocking factors
We now show that languages with infinitely many minimal blocking factors are hard, i.e. any tester for such a language requires Ω(log(1/ε)/ε) queries.
Let us first give an example that will motivate our construction. Consider the parity language Parity, consisting of the words over the alphabet {0, 1} that contain an even number of 1’s. Distinguishing Parity from its complement requires Ω(n) queries, as changing the letter at a single position can change membership in Parity. However, Parity is trivial to test, as any word is at distance at most 1 from Parity, for the same reason. Now, consider the language consisting of words over an alphabet extended with a separator letter, such that between each separator and the next one, the word read is in Parity. Intuitively, this language encodes multiple instances of Parity, hence we can construct words ε-far from it, and each instance is hard to recognize for property testers, hence the whole language is. In [8, Theorem 15], Bathie and Starikovskaya proved a lower bound of Ω(log(1/ε)/ε) on the query complexity of any property tester for this language, matching the upper bound in the same paper.
The minimal blocking factors of this language include all words of the form “a separator, then a word with an odd number of 1’s, then a separator”: there are infinitely many such words. This is no coincidence: we show that this lower bound can be lifted to any language with infinitely many minimal blocking factors.
Theorem 3.10.
Let A be a strongly connected NFA. If the set of minimal blocking factors of A is infinite, then there exists a constant ε₀ > 0 such that for any ε ≤ ε₀, any ε-property tester for L(A) uses Ω(log(1/ε)/ε) queries.
The proof of this result is a full generalization of the lower bound against the “repeated Parity” example given above.
Our proof is based on (a consequence of) Yao’s Minimax Principle [28]: if there is a distribution D over inputs such that any deterministic algorithm that makes at most q queries errs on D with probability at least δ, then any randomized algorithm making at most q queries errs with probability at least δ on some input.
To prove Theorem 3.10, we first exhibit such a distribution D for L(A). We take the following steps:
1. we show that with high probability, an input sampled w.r.t. D is either in L(A) or ε-far from L(A) (Lemma 3.11);
2. we show that with high probability, any deterministic tester that makes fewer than c·log(1/ε)/ε queries (for a suitable constant c) cannot distinguish whether the instance is positive or ε-far, hence it errs with large probability;
3. we combine the above two results to prove Theorem 3.10 via Yao’s Minimax principle.
Constructing a Hard Distribution
To sample an input w.r.t. D, we first sample a uniformly random bit b: if b = 0, we construct a word that belongs to L(A), and if b = 1, the word will be far from L(A) with high probability. The input word is then conceptually divided into disjoint intervals. For the i-th interval, we sample a random variable X_i: this random variable describes the content of the interval. If X_i = 0, we fill the interval with non-blocking factors of a given length, and if X_i = 1, we instead fill the interval with blocking factors of the same length, chosen to be very similar to the non-blocking factors, making it hard to distinguish between the two cases with few queries. By carefully choosing the distribution of the X_i, we can ensure that when b = 1, the resulting instance is ε-far from L(A) with high probability.
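The sketch below illustrates the shape of such a distribution; the actual block contents, block length and distribution of the X_i used in the paper are more delicate, and every parameter below (including the Bernoulli parameter q and the choice of blocks) is an illustrative assumption.

```python
import random

def sample_hard_input(m, block_len, good_block, bad_blocks, q=0.5):
    """Sample (b, w): if b = 0, every interval receives the non-blocking block
    (assumed to be chosen so that the resulting word is in the language); if
    b = 1, each interval independently receives, with probability q, a blocking
    block that differs from the non-blocking one in few positions, which is
    intended to make w far from the language with high probability while keeping
    the two cases hard to tell apart with few queries."""
    b = random.randint(0, 1)
    blocks = []
    for _ in range(m):
        x = 0 if b == 0 else (1 if random.random() < q else 0)
        blocks.append(good_block if x == 0 else random.choice(bad_blocks))
    assert all(len(blk) == block_len for blk in blocks)
    return b, "".join(blocks)
```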
Lemma 3.11.
Conditioned on b = 1, the probability of the event “w is ε-far from L(A)” goes to 1 as n goes to infinity.
Corollary 3.12.
For large enough n, conditioned on b = 1, the input is ε-far from L(A) with probability at least a positive constant.
Intuitively, our distribution is hard to test because positive and negative instances are very similar. Therefore, a tester with few queries will likely not be able to tell them apart: the perfect completeness constraint forces the tester to accept in that case. Below, we establish this result formally.
Lemma 3.13.
Let be a deterministic tester with perfect completeness (i.e. one sided error, always accepts ) and let denote the number of queries that it makes in the -th interval. Conditioned on the event , the probability that accepts is .
Next, we show that if a tester makes few queries, then the event has large probability.
Lemma 3.14.
Let be a deterministic tester, let denote the number of queries that it makes in the -th interval, and assume that makes at most queries, i.e. . The probability of the event is at least .
We are now ready to prove Theorem 3.10.
Proof of Theorem 3.10.
We want to show that any tester with perfect completeness for L(A) requires Ω(log(1/ε)/ε) queries, by showing that any tester with fewer queries errs with constant probability. We show that any deterministic algorithm with perfect completeness that makes fewer than c·log(1/ε)/ε queries errs on D with constant probability, and conclude using Yao’s Minimax principle.
Consider such an algorithm . The probability that makes an error on is lower-bounded by the probability that is -far from and accepts, which in turn is larger than the probability of . By Corollary 3.12, we have , and by Lemma 3.14, is at least . Therefore, we have
This concludes the proof of Theorem 3.10, and consequently of Theorem 3.2.
4 Characterisation of Hard Languages for All NFAs
In this section we extend the results of the previous section to all finite automata. This extension is based on a generalization of blocking factors: we introduce blocking sequences, which are sequences of factors that witness the fact that we cannot take any path through the strongly connected components of the automaton. For the lower bound, we define a suitable partial order on blocking sequences, which extends the factor relation on words to those sequences, and allows us to define minimal blocking sequences.
4.1 Blocking sequences
4.1.1 Examples motivating blocking sequences
Before presenting the technical part of the proof, let us go through two examples, which motivate the notions that we introduce and illustrate some of the main difficulties.
Example 4.1.
Consider the automaton depicted in Figure 1: it recognizes the language of words in which all ’s appear before the first , over the alphabet .
The set of minimal blocking factors of is infinite: it is the language . Yet, is easy to test: we sample letters at random, answer “no” if the sample contains a occurring after a , and “yes” otherwise. To prove that this yields a property tester, we rely on the following property:
Property 4.2.
If is -far from , then it can be decomposed into where contains letters and contains letters .
The pair of factors is an example of blocking sequence: a word that contains an occurrence of the first followed by an occurrence of the second cannot be in L. We can also show that a word ε-far from L must contain many disjoint blocking sequences – this property (Lemma 4.17) underpins the algorithm for general regular languages.
What this example shows is that blocking factors are not enough: we need to consider sequences of factors, yielding the notion of blocking sequences. Intuitively, a blocking sequence for A is a sequence of (positional) words such that if each word of the sequence appears in w, with the occurrences ordered as in the sequence, then w is not in L(A) (this is not quite the definition, but it conveys the right intuition). While the language of Figure 1 has infinitely many minimal blocking factors, it has a single minimal blocking sequence.
Notice that the (unique) blocking sequence can be visualized in Figure 1: it is composed of the red letters that label transitions between the different SCCs. This is no coincidence: in many simple cases, blocking sequences are exactly sequences that contain one blocking factor for each SCC. This fact could lead one to believe that the set of minimal blocking sequences is exactly the set of sequences of minimal blocking factors, one for each SCC. In particular, this would imply that as soon as one SCC has infinitely many minimal blocking factors, the language of the whole automaton is hard to test. We show in the next example that this is not always the case, because SCCs might share minimal blocking factors.
Example 4.3.
Consider the automaton in Figure 2: it has two SCCs and a sink state. The minimal blocking factors of the first SCC are given by , and for the second SCC. This automaton is easy to test: intuitively, a word that is -far from this language has to contain many ’s, as otherwise we can make it accepted by deleting all ’s, thanks to the second SCC. However, is also a blocking factor of the first SCC, therefore, as soon as we find two ’s in the word, we know that it is not in .
The crucial facts here are that the set of minimal blocking factors of the second SCC is finite and it is a subset of : the infinite nature of is made irrelevant because any word far from the language contains many ’s. Therefore, has a single minimal blocking sequence, .
4.1.2 Portals and SCC-paths
Intuitively, blocking sequences are sequences of blocking factors of successive strongly connected components. To formalize this intuition, we use portals, which describe how a run in the automaton interacts with a strongly connected component, and SCC-paths, that describe a succession of portals.
In what follows, we fix an NFA A. We assume w.l.o.g. that A has a unique final state q_f. Let S be the set of SCCs of A. We define a partial order relation ≤ on S by C ≤ C′ if and only if C′ is reachable from C. We write < for its strict part. These relations are naturally extended to states through their SCCs: if q ∈ C and q′ ∈ C′, then q ≤ q′ if and only if C ≤ C′.
We define Z as the least common multiple of the lengths of all simple cycles of A. Given a number b, we say that a state q is b-reachable from a state q′ if there is a path from q′ to q of length congruent to b modulo Z. In what follows, we use “positional words” for Z-positional words, with this value of Z.
Definition 4.4 (Portal).
A portal is a 4-tuple (q, i, q′, j) such that q and q′ are in the same SCC. It describes the first and last states visited by a path in an SCC, and the positions (modulo Z) at which it first and last visits that SCC.
The positional language of a portal (q, i, q′, j) is the set of positional words that label a path from q to q′ staying in their SCC, starting at a position congruent to i and ending at a position congruent to j modulo Z.
Portals were already defined by Alon et al. [4], in a slightly different way. Our definition will allow us to express blocking sequences more naturally.
Definition 4.5.
A positional word u is blocking for a portal P if it is not a factor of any word of the positional language of P. In other words, there is no path within the SCC of P that starts in its entry state, ends in its exit state, has positions consistent with those of the portal, and reads u along the way.
The above definition matches the definition of blocking factors for strongly connected automata. This is no coincidence: we show in the next lemma that the language of a portal has a strongly connected automaton.
Lemma 4.6.
Let A be an automaton and P a portal of A. There is a strongly connected NFA with at most Z·|Q| states that recognizes the positional language of P.
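The construction behind Lemma 4.6 can be sketched as follows (our own rendering, with ad-hoc names): restrict the automaton to the SCC of the portal, take the product with a counter of positions modulo Z, and make the entry and exit configurations initial and final; trimming the result to its reachable and co-reachable part then yields a strongly connected automaton.

```python
def portal_automaton(delta, scc, Z, portal):
    """portal = (p, i, q, j): entry state p at position i (mod Z), exit state q
    at position j (mod Z), with p and q in the SCC `scc`.
    Returns an NFA over positional letters (a, pos) intended to recognize the
    positional language of the portal (trimming omitted for brevity)."""
    p, i, q, j = portal
    states = {(s, t) for s in scc for t in range(Z)}
    new_delta = {}
    for (s, a), targets in delta.items():
        if s not in scc:
            continue
        for t in range(Z):
            # read letter a at position t, move to position t+1 mod Z
            new_delta.setdefault(((s, t), (a, t)), set()).update(
                (r, (t + 1) % Z) for r in targets if r in scc)
    initial = (p, i % Z)
    finals = {(q, j % Z)}
    return states, new_delta, initial, finals
```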
Portals describe the behavior of a run inside a single strongly connected component of the automaton. Next, we introduce SCC-paths, which describe the interaction of a run with multiple SCCs and between two successive SCCs.
Definition 4.7 (SCC-path).
An SCC-path of A is a sequence of portals linked by single-letter transitions: for every two consecutive portals, there is a transition of A from the exit state of the former to the entry state of the latter, the entry position of the latter is the exit position of the former plus one modulo Z, and the corresponding SCCs are strictly increasing for the order ≤.
Intuitively, an SCC-path is a description of the states and positions at which a path through the automaton enters and leaves each SCC.
Definition 4.8.
The language of an SCC-path is the set of positional words obtained by concatenating, in order, a word from the positional language of each of its portals, interleaved with the single letters labeling the transitions between consecutive portals.
We say that an SCC-path is accepting if its first portal starts in the initial state q₀ at position 0, its last portal ends in the final state q_f, and its language is non-empty.
Lemma 4.9.
The positional language of A is the union of the languages of the accepting SCC-paths of A.
As a consequence, the distance between a word and the (positional) language of A is equal to the minimum of the distances between that word and the languages of the accepting SCC-paths of A.
Corollary 4.10.
For any positional word u, the distance from u to the positional language of A is the minimum, over the accepting SCC-paths π of A, of the distance from u to the language of π.
Decomposing the positional language of A as a union of SCC-paths allows us to use them as an intermediate step to define blocking sequences. We earlier defined blocking factors for portals: we now generalize this definition to blocking sequences for SCC-paths, to finally define blocking sequences of automata.
Definition 4.11 ((Strongly) Blocking Sequences for SCC-paths).
We say that a sequence of positional factors (f₁, …, f_k) is blocking for an SCC-path π = (P₁, …, P_m) if there is a non-decreasing sequence of indices ℓ₁ ≤ … ≤ ℓ_m such that for every t, the factor f_{ℓ_t} is blocking for the portal P_t.
Furthermore, if there is an increasing sequence of indices ℓ₁ < … < ℓ_m with the same property, then the sequence is said to be strongly blocking for π.
Note that, crucially, in the definition of blocking sequences, consecutive indices ℓ_t and ℓ_{t+1} can be equal, i.e. a single factor of the sequence may be blocking for multiple consecutive SCCs in the SCC-path. This choice is motivated by Example 4.3, where the language is easy because consecutive SCCs share blocking factors.
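Under the reading of Definition 4.11 used above (an order-preserving assignment of factors of the sequence to the portals of the path, with repetitions allowed unless the sequence is required to be strongly blocking), blocking-ness for an SCC-path can be checked greedily, as in the sketch below; the helper blocks(factor, portal), deciding whether a positional factor is blocking for a portal (e.g. via the automaton of Lemma 4.6), is assumed.

```python
def is_blocking_for_path(sequence, portals, blocks, strong=False):
    """Greedy check: scan the portals in order and assign to each one the
    earliest usable factor that blocks it.  With strong=True, a factor may be
    used for at most one portal (strongly blocking)."""
    t = 0                                   # index just after the last used factor
    for portal in portals:
        # in the non-strong case the previously used factor may be reused
        start = t if strong else max(0, t - 1)
        found = None
        for idx in range(start, len(sequence)):
            if blocks(sequence[idx], portal):
                found = idx
                break
        if found is None:
            return False
        t = found + 1
    return True
```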
We say that two occurrences of blocking sequences in a word are disjoint if the occurrences of their factors appear on disjoint sets of positions in .
In the strongly connected case, we had the property that if contains an occurrence of a factor blocking for , then is not in the language of . The following lemma gives an extension of this result to strongly blocking sequences and the language of an SCC-path.
Lemma 4.12.
Let π be an SCC-path. If a positional word u contains a strongly blocking sequence for π, then u does not belong to the language of π.
We can now define sequences that are blocking for an automaton: they are sequences that are blocking for every accepting SCC-path of the automaton.
Definition 4.13 (Blocking sequence for ).
Let f be a sequence of positional words. We say that f is blocking for A if it is blocking for all accepting SCC-paths of A.
As an example, observe that the sequences and are both blocking for the automaton displayed in Figure 3 (see Example 4.14).
Example 4.14.
Consider the automaton displayed in Figure 3. The of the lengths of its simple cycles is . This automaton has six accepting SCC-paths, for instance
The language of the portal is . A blocking sequence for this SCC-path is , which is in fact blocking for all of the SCC-paths.
On the other hand, is not blocking for , as is not a blocking factor for the portal . It is, however, a blocking sequence for . This is because if we enter the SCC through , a factor can only appear after an even number of steps, while if we enter through , it can only appear after an odd number of steps.
4.2 An efficient property tester
In this section, we show that for any regular language L and any small enough ε, there is an ε-property tester for L that uses O(log(1/ε)/ε) queries.
Theorem 4.15.
For any NFA A and any small enough ε, there exists an ε-property tester for L(A) that uses O(log(1/ε)/ε) queries.
As mentioned in the overview, this result supersedes the one given by Bathie and Starikovskaya [8]: while both testers use the same number of queries, the tester in [8] works under the edit distance, while that of Theorem 4.15 is designed for the Hamming distance. As the edit distance never exceeds the Hamming distance, the set of words that are ε-far with respect to the former is contained in the set of words that are ε-far for the latter. Therefore, an ε-tester for the Hamming distance is also an ε-tester for the edit distance.
The property tester behind Theorem 4.15 uses the property tester for strongly connected NFAs as a subroutine, and its correctness is based on an extension of Lemma 3.9 to blocking sequences. We show that we can reduce property testing of to a search for blocking sequences in the word, in the following sense:
- If w contains a strongly blocking sequence for each of the SCC-paths of A, then it is not in the language and we can answer no (Corollary 4.16).
- If w is ε-far from the language, then for each accepting SCC-path π of A, w is far from the language of π and contains many disjoint strongly blocking sequences for π (Lemma 4.17), hence random sampling is likely to find at least one of them, and we reject with constant probability.
Corollary 4.16.
If a positional word u contains a strongly blocking sequence for each SCC-path of A, then u is not in the positional language of A.
Proof.
This follows from Lemma 4.9. The next lemma expresses a partial converse to Corollary 4.16 and generalizes Lemma 3.8 from the strongly connected case: if a word is far from the language, then it contains many strongly blocking sequences for any SCC-path.
Lemma 4.17.
Let be an SCC-path, let , and let be a positional word of length such that is finite. There is a constant such that if and is -far from , then can be partitioned into such that for every , contains at least disjoint blocking factors for , each of length at most .
Lemma 4.18.
Let and let be a positional word of length . If contains a word of length and is -far from , then contains disjoint blocking sequences for .
4.3 Lower bound
In order to characterize hard languages for all automata, we define a partial order on sequences of positional factors. It is an extension of the factor partial order on blocking factors. It will let us define minimal blocking sequences, which we use to characterize the complexity of testing a language.
Definition 4.19 (Minimal blocking sequence).
Let and be sequences of positional words. We have if there exists a sequence of indices such that is a factor of for all .
A blocking sequence of (resp. ) is minimal if it is a minimal element of among blocking sequences of (resp. ). The set of minimal blocking sequences of (resp. ) is written (resp. ).
Remark 4.20.
If and is a blocking sequence for an SCC-path then is also a blocking sequence for .
To prove a lower bound on the number of queries necessary to test L(A) when the set of minimal blocking sequences of A is infinite, we present a reduction to the strongly connected case. Under the assumption that A has infinitely many minimal blocking sequences, we exhibit a portal P of A with infinitely many minimal blocking factors and “isolate it” by constructing two sequences of positional factors α and β such that for every positional word u, the sequence obtained by inserting u between α and β is blocking for A if and only if u is a blocking factor of P. Then we reduce the problem of testing the language of this portal to the problem of testing the language of A.
To define “isolating P” formally, we define the left (and right) effect of a sequence on an SCC-path. Informally, the left effect of a sequence on an SCC-path is related to the index of the first portal in the path where a run can be after reading the sequence, because all previous portals have been blocked. The right effect represents the same in reverse, starting from the end of the run. More formally, the left effect of a sequence on an SCC-path (P₁, …, P_m) is the largest index t such that the sequence is blocking for (P₁, …, P_t) (0 if there is no such t). Similarly, the right effect of a sequence on the path is the smallest index t such that the sequence is blocking for (P_t, …, P_m) (m + 1 if there is no such t).
Remark 4.21.
A sequence is blocking for an SCC-path (P₁, …, P_m) if and only if its left effect is m, if and only if its right effect is 1.
Also, given two sequences f and g, the concatenated sequence f·g is blocking for the SCC-path if and only if the left effect of f is at least the right effect of g minus one.
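With the same assumed helper, the left and right effects can be computed by testing prefixes and suffixes of the SCC-path, as sketched below; the default values 0 and m + 1 follow the convention described above.

```python
def left_effect(sequence, portals, blocks):
    """Largest t such that the sequence is blocking for the first t portals
    (0 if there is no such t)."""
    best = 0
    for t in range(1, len(portals) + 1):
        if is_blocking_for_path(sequence, portals[:t], blocks):
            best = t
    return best

def right_effect(sequence, portals, blocks):
    """Smallest t (portals numbered from 1) such that the sequence is blocking
    for portals t, ..., m (m + 1 if there is no such t)."""
    m = len(portals)
    for t in range(1, m + 1):
        if is_blocking_for_path(sequence, portals[t - 1:], blocks):
            return t
    return m + 1
```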
For the next lemma we define a partial order on portals: a portal P is below a portal P′ if all blocking factors of P are also blocking factors of P′. We also use the reverse relation, the associated equivalence relation, and the complement relation.
Additionally, given an SCC-path and two sequences of positional words , we say that the portal survives in if .
Definition 4.22.
Let P be a portal and α and β sequences of positional words.
We define three properties that those objects may have:
- P1) the pair (α, β) is not blocking for A;
- P2) P has infinitely many minimal blocking factors;
- P3) for any accepting SCC-path π in A, every portal in π which survives (α, β) is equivalent to P.
Lemma 4.23.
If A has infinitely many minimal blocking sequences, then there exist a portal P and sequences α and β satisfying properties P1, P2 and P3.
Lemma 4.24.
If there exist a portal P and sequences α, β satisfying properties P1, P2 and P3, then L(A) is hard.
Proposition 4.25.
If A has infinitely many minimal blocking sequences, then L(A) is hard.
Proof.
This is an immediate consequence of Lemmas 4.23 and 4.24.
5 Trivial and Easy languages
5.1 Upper bound for easy languages
We first establish that the language of an automaton with finitely many minimal blocking sequences is easy (or trivial) to test, using an algorithm similar to the one of Theorem 4.15.
Lemma 5.1.
Let be an NFA with a finite number of minimal blocking sequences, let be an SCC-path of , let , and let be a positional word of length such that is finite. There are constants such that if and is -far from , then can be partitioned into such that for every , contains at least disjoint blocking factors for , each of length at most .
Proposition 5.2.
If A has finitely many minimal blocking sequences, then there is an ε-tester for L(A) that uses O(1/ε) queries.
Proof.
We use the same algorithm as for Theorem 4.15, except that we use the factors given by Lemma 5.1; therefore, in the call to the Sampler function (Algorithm 1), the upper bound on the length of the factors is a constant (depending only on A) instead of O(1/ε). In that case, the query complexity becomes O(1/ε).
This already gives us a clear dichotomy: all languages either require Ω(log(1/ε)/ε) queries to be tested, or can be tested with O(1/ε) queries.
5.2 Separation between trivial and easy languages
It remains to show that languages that can be tested with O(1/ε) queries have query complexity either Θ(1/ε), or 0 for large enough n. Our proof uses the class of trivial regular languages identified by Alon et al. [4], which we revisit next.
An example of a trivial language is the set L of words containing at least one a, over the alphabet {a, b}. For any word w, replacing any letter by an a yields a word in L, hence d(w, L) ≤ 1. Therefore, for n ≥ 1/ε, no word of length n is ε-far from L, and the trivial property tester that answers “yes” without sampling any letter is correct.
Alon et al. [4] define non-trivial languages as follows.
Definition 5.3 (From [4, Definition 3.1]).
A language L is non-trivial if there exists a constant ε > 0 such that, for infinitely many values of n, the set of words of L of length n is non-empty, and there exists a word of length n that is ε-far from L.
It is easy to see that if a language is trivial in the above sense (i.e. not non-trivial), then for large enough input length n, the answer to testing membership in L only depends on n, and the algorithm does not need to query the input. Alon et al. [4, Property 2] show that if a language is non-trivial, then testing it requires Ω(1/ε) queries for small enough ε.
To obtain our characterization of trivial languages, we show that the set of minimal blocking sequences of A is non-empty if and only if L(A) is non-trivial (in the above sense). It follows that if this set is empty, then testing L(A) requires 0 queries for large enough n. Furthermore, by the result of Alon et al. [4], if it is non-empty, then testing L(A) requires Ω(1/ε) queries.
Recall that we focus on infinite languages, since we know that all finite ones are trivial (Remark 1.4).
Lemma 5.4.
The set of minimal blocking sequences of A is empty if and only if L(A) is trivial.
6 Hardness of classifying
In the previous sections, we have shown that testing some regular languages (easy ones) requires fewer queries than testing others (hard ones). Therefore, given the task of testing a word for membership in L(A), it is natural to first try to determine if the language of A is easy, and if this is the case, run the appropriate ε-tester, which uses fewer queries. In this section, we investigate the computational complexity of checking which class of the trichotomy the language of a given automaton belongs to. We formalize this question as the following decision problems:
Problem 6.1 (Triviality problem).
Given a finite automaton A, is L(A) trivial?
Problem 6.2 (Easiness problem).
Given a finite automaton A, is L(A) easy?
Problem 6.3 (Hardness problem).
Given a finite automaton A, is L(A) hard?
In these problems, the automaton A is the input and is no longer fixed. We show that our combinatorial characterization based on minimal blocking sequences is effective, in the sense that all three problems are decidable. However, it does not lead to efficient algorithms, as these problems are PSPACE-complete.
Theorem 6.4.
The triviality and easiness problems are both PSPACE-complete, even for strongly connected NFAs.
7 Conclusion
We presented an effective classification of regular languages into three classes, each associated with an optimal query complexity for property testing. We thus close a line of research aiming to determine the optimal query complexity of testing regular languages. All our results are with respect to the Hamming distance. We conjecture that they can be adapted to the edit distance. We use non-deterministic automata to represent regular languages. A natural open question is the complexity of classifying languages represented by deterministic automata.
References
- [1] Maryam Aliakbarpour, Ilias Diakonikolas, and Ronitt Rubinfeld. Differentially private identity and equivalence testing of discrete distributions. In International Conference on Machine Learning, ICML, 2018. URL: https://proceedings.mlr.press/v80/aliakbarpour18a.html.
- [2] Noga Alon, Richard A. Duke, Hanno Lefmann, Vojtech Rödl, and Raphael Yuster. The algorithmic aspects of the regularity lemma. Journal of Algorithms, 16(1):80–109, 1994. doi:10.1006/JAGM.1994.1005.
- [3] Noga Alon, Eldar Fischer, Michael Krivelevich, and Mario Szegedy. Efficient testing of large graphs. Combinatorica, 20(4):451–476, 2000. doi:10.1007/s004930070001.
- [4] Noga Alon, Michael Krivelevich, Ilan Newman, and Mario Szegedy. Regular languages are testable with a constant number of queries. SIAM Journal on Computing, 30(6):1842–1862, 2001. doi:10.1109/SFFCS.1999.814641.
- [5] Noga Alon and Asaf Shapira. Every monotone graph property is testable. SIAM Journal of Computing, 38(2):505–522, 2008. doi:10.1137/050633445.
- [6] Antoine Amarilli, Louis Jachiet, and Charles Paperman. Dynamic membership for regular languages. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 116:1–116:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.116.
- [7] Ajesh Babu, Nutan Limaye, Jaikumar Radhakrishnan, and Girish Varma. Streaming algorithms for language recognition problems. Theoretical Computer Science, 494:13–23, 2013. doi:10.1016/J.TCS.2012.12.028.
- [8] Gabriel Bathie and Tatiana Starikovskaya. Property testing of regular languages with applications to streaming property testing of visibly pushdown languages. In International Colloquium on Automata, Languages, and Programming, ICALP, 2021. doi:10.4230/LIPIcs.ICALP.2021.119.
- [9] Manuel Blum, Michael Luby, and Ronitt Rubinfeld. Self-testing/correcting with applications to numerical problems. Journal of Computer and System Sciences, 47(3):549–595, 1993. doi:10.1016/0022-0000(93)90044-W.
- [10] Ilias Diakonikolas and Daniel M Kane. A new approach for testing properties of discrete distributions. In IEEE Symposium on Foundations of Computer Science, FOCS, pages 685–694. IEEE, 2016. doi:10.1109/FOCS.2016.78.
- [11] Yevgeniy Dodis, Oded Goldreich, Eric Lehman, Sofya Raskhodnikova, Dana Ron, and Alex Samorodnitsky. Improved testing algorithms for monotonicity. In International Workshop on Randomization and Approximation Techniques in Computer Science, RANDOM, pages 97–108. Springer, 1999. doi:10.1007/978-3-540-48413-4_10.
- [12] Funda Ergün, Sampath Kannan, S.Ravi Kumar, Ronitt Rubinfeld, and Mahesh Viswanathan. Spot-checkers. Journal of Computer and System Sciences, 60(3):717–751, 2000. doi:10.1006/jcss.1999.1692.
- [13] Eldar Fischer, Frédéric Magniez, and Tatiana Starikovskaya. Improved bounds for testing Dyck languages. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1529–1544. SIAM, 2018. doi:10.1137/1.9781611975031.100.
- [14] Nathanaël François, Frédéric Magniez, Michel de Rougemont, and Olivier Serre. Streaming property testing of visibly pushdown languages. In Proceedings of the 24th Annual European Symposium on Algorithms, volume 57 of LIPIcs, pages 43:1–43:17, 2016. doi:10.4230/LIPIcs.ESA.2016.43.
- [15] Moses Ganardi, Danny Hucke, and Markus Lohrey. Randomized sliding window algorithms for regular languages. In Proceedings of the 45th International Colloquium on Automata, Languages, and Programming, volume 107 of LIPIcs, pages 127:1–127:13, 2018. doi:10.4230/LIPIcs.ICALP.2018.127.
- [16] Moses Ganardi, Danny Hucke, Markus Lohrey, and Tatiana Starikovskaya. Sliding Window Property Testing for Regular Languages. In Pinyan Lu and Guochuan Zhang, editors, 30th International Symposium on Algorithms and Computation (ISAAC 2019), volume 149 of Leibniz International Proceedings in Informatics (LIPIcs), pages 6:1–6:13, Dagstuhl, Germany, 2019. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ISAAC.2019.6.
- [17] Oded Goldreich. Introduction to property testing. Cambridge University Press, 2017. doi:10.1017/9781108135252.
- [18] Oded Goldreich, Shari Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, July 1998. doi:10.1145/285055.285060.
- [19] Rahul Jain and Ashwin Nayak. The space complexity of recognizing well-parenthesized expressions in the streaming model: The index function revisited. IEEE Transactions on Information Theory, 60(10):6646–6668, October 2014. doi:10.1109/TIT.2014.2339859.
- [20] Andreas Krebs, Nutan Limaye, and Srikanth Srinivasan. Streaming algorithms for recognizing nearly well-parenthesized expressions. In Proc. of MFCS 2011, volume 6907 of LNCS, pages 412–423. Springer, 2011. doi:10.1007/978-3-642-22993-0_38.
- [21] Frédéric Magniez and Michel de Rougemont. Property testing of regular tree languages. Algorithmica, 49(2):127–146, 2007. doi:10.1007/s00453-007-9028-3.
- [22] Frédéric Magniez, Claire Mathieu, and Ashwin Nayak. Recognizing well-parenthesized expressions in the streaming model. SIAM J. Comput., 43(6):1880–1905, 2014. doi:10.1137/130926122.
- [23] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008. doi:10.1109/TIT.2008.928987.
- [24] Michal Parnas, Dana Ron, and Ronitt Rubinfeld. Testing membership in parenthesis languages. Random Struct. Algorithms, 22(1):98–138, 2003. doi:10.1002/rsa.10067.
- [25] Jean-Éric Pin, editor. Handbook of Automata Theory. European Mathematical Society Publishing House, Zürich, Switzerland, 2021. doi:10.4171/AUTOMATA.
- [26] Ronitt Rubinfeld and Madhu Sudan. Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252–271, 1996. doi:10.1137/S0097539793255151.
- [27] Mosaad Al Thokair, Minjian Zhang, Umang Mathur, and Mahesh Viswanathan. Dynamic race detection with O(1) samples. Proc. ACM Program. Lang., 7(POPL):1308–1337, 2023. doi:10.1145/3571238.
- [28] Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In Symposium on Foundations of Computer Science, SFCS, pages 222–227. IEEE, 1977. doi:10.1109/SFCS.1977.24.