The Parameterized Complexity of Learning Monadic Second-Order Logic
Abstract
Within the model-theoretic framework for supervised learning introduced by Grohe and Turán (TOCS 2004), we study the parameterized complexity of learning concepts definable in monadic second-order logic (MSO). We show that the problem of learning an MSO-definable concept from a training sequence of labeled examples is fixed-parameter tractable on graphs of bounded clique-width, and that it is hard for the parameterized complexity class para-NP on general graphs.
It turns out that an important distinction to be made is between 1-dimensional and higher-dimensional concepts, where the instances of a 1-dimensional concept are single vertices of a graph, while the instances of a higher-dimensional concept are tuples of vertices. For the higher-dimensional case, we give a learning algorithm that is fixed-parameter tractable in the size of the graph, but not in the size of the training sequence, and we give a hardness result showing that this is optimal. By comparison, in the 1-dimensional case, we obtain an algorithm that is fixed-parameter tractable in both.
Keywords and phrases:
monadic second-order definable concept learning, agnostic probably approximately correct learning, parameterized complexity, clique-width, fixed-parameter tractable, Boolean classification, supervised learning, monadic second-order logic
Funding:
Steffen van Bergerem: This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 431183758.
2012 ACM Subject Classification:
Theory of computation → Logic; Theory of computation → Complexity theory and logic; Theory of computation → Fixed parameter tractability; Computing methodologies → Logical and relational learning; Computing methodologies → Supervised learning
1 Introduction
We study abstract machine-learning problems in a logical framework with a declarative view on learning, where the (logical) specification of concepts is separated from the choice of specific machine-learning models and algorithms (such as neural networks). Here we are concerned with the computational complexity of learning problems in this logical learning framework, that is, the descriptive complexity of learning [8].
Specifically, we consider Boolean classification problems that can be specified in monadic second-order logic (MSO). The input elements for the classification task come from a set $X$, the instance space. A classifier on $X$ is a function $h \colon X \to \{0, 1\}$. Given a training sequence $T$ of labeled examples $(x, \lambda) \in X \times \{0, 1\}$, we want to find a classifier, called a hypothesis, that explains the labels given in $T$ and that can also be used to predict the labels of elements from $X$ not given as examples. In the logical setting, the instance space $X$ is a set of tuples from a (relational) structure, called the background structure, and classifiers are described by formulas of some logic, in our case MSO, using parameters from the background structure. This model-theoretic learning framework was introduced by Grohe and Turán [33] and further studied in [30, 32, 31, 7, 9, 11, 8].
We study these problems within the following well-known settings from computational learning theory. In the consistent-learning model, the examples are assumed to be generated using an unknown classifier, the target concept, from a known concept class. The task is to find a hypothesis that is consistent with the training sequence $T$, i.e. a function $h$ such that $h(x) = \lambda$ for all $(x, \lambda) \in T$. In Haussler’s model of agnostic probably approximately correct (PAC) learning [35], a generalization of Valiant’s PAC learning model [50], an (unknown) probability distribution on $X \times \{0, 1\}$ is assumed, and training examples are drawn independently from this distribution. The goal is to find a hypothesis that generalizes well, i.e. one is interested in algorithms that return with high probability a hypothesis with a small expected error on new instances drawn from the same distribution. For more background on PAC learning, we refer to [37, 41, 46]. In both settings, we require our algorithms to return a hypothesis from a predefined hypothesis class.
Our Contributions
In this paper, we study the parameterized complexity of the consistent-learning problem MSO-Consistent-Learn and the PAC-learning problem MSO-PAC-Learn. In both problems, we are given a graph $G$ (the background structure) and a sequence $T$ of labeled training examples of the form $(\bar v, \lambda)$, where $\bar v$ is a $d$-tuple of vertices from $G$ and $\lambda \in \{0, 1\}$. The goal is to find a hypothesis of the form $h_{\varphi, \bar w}$ for an MSO formula $\varphi(\bar x; \bar y)$ and a tuple $\bar w$ of vertices of $G$, where $h_{\varphi, \bar w}(\bar v) = 1$ if $G \models \varphi(\bar v; \bar w)$ and $h_{\varphi, \bar w}(\bar v) = 0$ otherwise. For MSO-Consistent-Learn, this hypothesis should be consistent with the given training examples. For MSO-PAC-Learn, the hypothesis should generalize well. We restrict the complexity of allowed hypotheses by giving a bound $q$ on the quantifier rank of $\varphi$ and a bound $\ell$ on the length of $\bar w$. Both $q$ and $\ell$, as well as the dimension $d$ of the problem, that is, the length of the tuples to classify, are part of the parameterization of the problems. A detailed description of MSO-Consistent-Learn is given in Section 3. The problem MSO-PAC-Learn is formally introduced in Section 5.
Example 1.1.
Assume we are given the graph $G$ depicted in Figure 1, a training sequence $T$ of positively and negatively labeled vertices of $G$, and the bounds $d = 1$, $\ell = 1$, and a quantifier-rank bound $q$. Note that $d = 1$ indicates that the instances are vertices of the input graph $G$. Furthermore, $\ell = 1$ indicates that the specification may involve one vertex of the input graph as a parameter. Finally, $q$ indicates that the formula specifying the hypothesis must have quantifier rank at most $q$.
Our choice of a hypothesis consistent with $T$ says that “there is a bipartite partition of the graph such that all positive instances are on the same side as the parameter vertex and all negative examples are on the other side.” This hypothesis can be formally specified in MSO as $h_{\varphi, w}$ for an MSO formula $\varphi(x; y)$ and a parameter vertex $w$ chosen among the positive examples.
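One way to write such a hypothesis in MSO is the following sketch (our own choice of variable names and formula shape, not necessarily the exact formula intended in the example):

```latex
% phi(x; y) asks for a set X inducing a proper bipartition such that the instance x
% lies on the same side as the parameter vertex y:
\[
\varphi(x; y) \;=\; \exists X \Bigl(
    \forall u\,\forall v\,\bigl(E(u,v) \rightarrow \neg\bigl(X(u) \leftrightarrow X(v)\bigr)\bigr)
    \;\wedge\; \bigl(X(x) \leftrightarrow X(y)\bigr)
\Bigr).
\]
```

With $y$ instantiated by a positive example vertex $w$, the hypothesis $h_{\varphi, w}$ labels a vertex positively exactly if it can be placed on the same side as $w$ in some bipartition, matching the informal description above.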
For the 1-dimensional case of MSO-Consistent-Learn, called 1D-MSO-Consistent-Learn, [31, 30] gave algorithms that are sublinear in the background structure after a linear-time pre-processing stage for the case that the background structure is a string or a tree. This directly implies that 1D-MSO-Consistent-Learn can be solved in time $f(q, \ell) \cdot n$ for some function $f$, that is, in fixed-parameter linear time, if the background structure is a string or a tree. Here $n$ is the size of the background structure, and $q$ and $\ell$ are the parameters of the learning problem described above. We generalize these results to labeled graphs of bounded clique-width. Graphs of clique-width $k$ can be described by a $k$-expression, that is, an expression in a certain graph grammar that only uses $k$ labels (see Section 2.1 for details). In our algorithmic results for graphs of bounded clique-width, we always assume that the graphs are given in the form of a $k$-expression. We treat $k$ as just another parameter of our algorithms. By the results of Oum and Seymour [44], we can always compute a $2^{O(k)}$-expression for a graph of clique-width $k$ by a fixed-parameter tractable algorithm.
Theorem 1.2.
Let $\mathcal{C}$ be a class of labeled graphs of bounded clique-width. Then 1D-MSO-Consistent-Learn is fixed-parameter linear on $\mathcal{C}$.
Since graphs of bounded tree-width also have bounded clique-width, our result directly implies fixed-parameter linearity on graph classes of bounded tree-width.
Our proof for Theorem 1.2 relies on the model-checking techniques due to Courcelle, Makowsky, and Rotics for graph classes of bounded clique-width [23]. To make use of them, we encode the training examples into the graph as new labels. While this construction works for dimension 1, it fails for higher dimensions if there are too many examples to encode.
As far as we are aware, all previous results for learning MSO formulas are restricted to the one-dimensional case of the problem. We give the first results for dimensions $d \geq 2$, presenting two different approaches that yield tractability results in higher dimensions.
As we discuss in Section 5, for the PAC-learning problem MSO-PAC-Learn in higher dimensions, we can restrict the number of examples to consider to a constant. In this way, we obtain fixed-parameter tractability results for learning MSO-definable concepts in higher dimensions, similar to results for first-order logic on nowhere dense classes [9, 8].
Theorem 1.3.
Let $\mathcal{C}$ be a class of labeled graphs of bounded clique-width. Then MSO-PAC-Learn is fixed-parameter linear on $\mathcal{C}$.
In the second approach to higher-dimensional tractability, and as the main result of this paper, we show in Section 6 that a consistent hypothesis can be learned on graphs of bounded clique-width with a quadratic running time in terms of the size of the graph.
Theorem 1.4.
There is a function $f$ such that, for a $\Lambda$-labeled graph of clique-width at most $k$ and a training sequence of size $s$, the problem MSO-Consistent-Learn can be solved in time $f(d, \ell, q, k, |\Lambda|, s) \cdot n^{2}$, where $n$ is the size of the graph.
While this is not strictly a fixed-parameter tractability result, since we usually do not consider the length $s$ of the training sequence to be part of the parameterization, we show in Section 7 that this bound is optimal. Technically, this result is much more challenging than Theorems 1.3 and 1.2. While we still use an overall dynamic-programming strategy that involves computing MSO types, here we need to consider MSO types over sequences of tuples. The number of such sequence types is not constantly bounded, but exponential in the length of the sequence. The core of our argument is to prove that the number of relevant types can be polynomially bounded. This fundamentally distinguishes our approach from typical MSO/automata arguments, where types are from a bounded set (and they correspond to the states of a finite automaton).
Lastly, we study MSO-Consistent-Learn on arbitrary classes of labeled graphs. Analogously to the hardness of learning FO-definable concepts and the relation to the FO-model-checking problem discussed in [9], we are interested specifically in the relation of MSO-Consistent-Learn to the MSO-model-checking problem MSO-Mc. We show that MSO-Mc can already be reduced to the 1-dimensional case of MSO-Consistent-Learn, even with a training sequence of size two. This yields the following hardness result that we prove in Section 4.
Theorem 1.5.
1D-MSO-Consistent-Learn is para-NP-hard under fpt Turing reductions.
Related Work
The model-theoretic learning framework studied in this paper was introduced in [33]. There, the authors give information-theoretic learnability results for hypothesis classes that can be defined using first-order and monadic second-order logic on restricted classes of structures.
Algorithmic aspects of the framework were first studied in [32], where it was proved that concepts definable in first-order logic can be learned in time polynomial in the degree of the background structure and the number of labeled examples the algorithm receives as input, independently of the size of the background structure. This was generalized to first-order logic with counting [7] and with weight aggregation [11]. On structures of polylogarithmic degree, the results yield learning algorithms running in time sublinear in the size of the background structure. It was shown in [31, 7] that sublinear-time learning is no longer possible if the degree is unrestricted. To address this issue, in [31], it was proposed to introduce a preprocessing phase where, before seeing any labeled examples, the background structure is converted to a data structure that supports sublinear-time learning later. This model was applied to monadic second-order logic on strings [31] and trees [30].
The parameterized complexity of learning first-order logic was first studied in [9]. Via a reduction from the model-checking problem, the authors show that, on arbitrary relational structures, learning hypotheses definable in FO is AW[$*$]-hard. In contrast to this, they show that the problem is fixed-parameter tractable on nowhere dense graph classes. This result has been extended to nowhere dense structure classes in [8]. Although not stated as fpt results, the results in [31, 30] yield fixed-parameter tractability for learning MSO-definable concepts on strings and trees if the problem is restricted to the 1-dimensional case where the tuples to classify are single vertices.
The logical learning framework is related to, but different from the framework of inductive logic programming (see, e. g., [21, 42, 43]), which may be viewed as the classical logic-learning framework. In the database literature, there are various approaches to learning queries from examples [6, 5, 34, 36, 38, 13, 49, 1, 2, 14, 48, 17]. Many of these are concerned with active learning scenarios, whereas we are in a statistical learning setting. Moreover, most of the results are concerned with conjunctive queries or queries outside the relational database model, whereas we focus on monadic second-order logic. Another related subject in the database literature is the problem of learning schema mappings from examples [3, 15, 18, 19, 29]. In formal verification, related logical learning frameworks [20, 25, 28, 40, 52] have been studied as well. In algorithmic learning theory, related works study the parameterized complexity of several learning problems [4, 39] including, quite recently, learning propositional CNF and DNF formulas and learning solutions to graph problems in the PAC setting [16].
2 Preliminaries
We let $\mathbb{N}$ denote the set of non-negative integers. For $m, n \in \mathbb{N}$, we let $[m, n] := \{m, m+1, \ldots, n\}$ and $[n] := [1, n]$. For a set $S$, we let $2^{S}$ denote the set of all subsets of $S$.
2.1 Clique-Width
In this paper, graphs are always undirected and simple (no loops or parallel edges); we view them as $\{E\}$-structures for a binary relation symbol $E$, and we denote the set of vertices of a graph $G$ by $V(G)$. A label set $\Lambda$ is a set of unary relation symbols, and a $\Lambda$-graph or $\Lambda$-labeled graph is the expansion of a graph to the vocabulary $\{E\} \cup \Lambda$. A labeled graph is a $\Lambda$-graph for some label set $\Lambda$.
In the following, we define expressions to represent labeled graphs. A base graph is a labeled graph of order 1, that is, a single (possibly labeled) vertex. For every base graph $B$, we introduce a base expression that represents $B$. Moreover, we have the following operations.
- Disjoint union: For disjoint $\Lambda$-graphs $G_1$ and $G_2$, we define $G_1 \oplus G_2$ to be the union of $G_1$ and $G_2$. If $G_1$ and $G_2$ are not disjoint, then $G_1 \oplus G_2$ is undefined.
- Adding edges: For a $\Lambda$-graph $G$ and unary relation symbols $P, Q \in \Lambda$ with $P \neq Q$, we let $\eta_{P,Q}(G)$ be the $\Lambda$-graph obtained from $G$ by adding an edge between every pair of distinct vertices $v \in P(G)$ and $w \in Q(G)$. That is, $E\bigl(\eta_{P,Q}(G)\bigr) = E(G) \cup \{\, \{v, w\} : v \in P(G),\, w \in Q(G),\, v \neq w \,\}$.
- Relabeling: For a $\Lambda$-graph $G$ and unary relation symbols $P, Q \in \Lambda$ with $P \neq Q$, we let $\rho_{P \to Q}(G)$ be the $\Lambda$-graph obtained from $G$ by relabeling all vertices in $P$ by $Q$, that is, $V\bigl(\rho_{P \to Q}(G)\bigr) = V(G)$, $E\bigl(\rho_{P \to Q}(G)\bigr) = E(G)$, $P\bigl(\rho_{P \to Q}(G)\bigr) = \emptyset$, $Q\bigl(\rho_{P \to Q}(G)\bigr) = Q(G) \cup P(G)$, and $R\bigl(\rho_{P \to Q}(G)\bigr) = R(G)$ for all $R \in \Lambda \setminus \{P, Q\}$.
- Deleting labels: For a $\Lambda$-graph $G$ and a unary relation symbol $P \in \Lambda$, we let $\delta_{P}(G)$ be the restriction of $G$ to the vocabulary $(\{E\} \cup \Lambda) \setminus \{P\}$, that is, the $(\Lambda \setminus \{P\})$-graph obtained from $G$ by removing the relation $P$.
We also introduce a modification of the disjoint-union operator, namely the ordered-disjoint-union operator $\overrightarrow{\oplus}$, which is used in Section 6 to simplify notation.
- Ordered disjoint union: To introduce this operator, we need two distinguished unary relation symbols $P_{1}$ and $P_{2}$. For disjoint $\Lambda$-graphs $G_1$ and $G_2$, where we assume $P_{1}, P_{2} \notin \Lambda$, we let $G_1 \overrightarrow{\oplus} G_2$ be the $(\Lambda \cup \{P_{1}, P_{2}\})$-expansion of the disjoint union $G_1 \oplus G_2$ with $P_{1} = V(G_1)$ and $P_{2} = V(G_2)$. By deleting the relations $P_{1}$ and $P_{2}$ immediately after introducing them in an ordered disjoint union, we can simulate a standard disjoint union by an ordered disjoint union, that is, $G_1 \oplus G_2 = \delta_{P_{2}}\bigl(\delta_{P_{1}}(G_1 \overrightarrow{\oplus} G_2)\bigr)$.
A $\Lambda$-expression is a term formed from base expressions whose label set is a subset of $\Lambda$, using the unary operators $\eta_{P,Q}$, $\rho_{P \to Q}$, $\delta_{P}$ for $P, Q \in \Lambda$ with $P \neq Q$, and the binary operators $\oplus$ and $\overrightarrow{\oplus}$. We require $\Lambda$-expressions to be well-formed, that is, all base expressions represent base graphs with mutually distinct vertices, and the label sets fit the operators.
Every $\Lambda$-expression $\Xi$ describes a $\Lambda'$-graph $G_{\Xi}$ for some $\Lambda' \subseteq \Lambda$. Note that there is a one-to-one correspondence between the base expressions in $\Xi$ and the vertices of $G_{\Xi}$. Actually, we may simply identify the vertices of $G_{\Xi}$ with the base expressions in $\Xi$. We let $V(\Xi)$ be the set of these base expressions. We may then view an expression $\Xi$ as a tree where $V(\Xi)$ is the set of leaves of this tree. We let $|\Xi|$ be the number of nodes of the tree. We have $|V(\Xi)| \leq |\Xi|$. In general, we cannot bound $|\Xi|$ in terms of $|V(\Xi)|$, but for every $\Lambda$-expression $\Xi$, we can find a $\Lambda$-expression $\Xi'$ such that $G_{\Xi'} = G_{\Xi}$ and $|\Xi'| \leq g(|\Lambda|) \cdot |V(\Xi)|$ for some function $g$.
Each subexpression $\Xi'$ of $\Xi$ describes a labeled graph $G_{\Xi'}$ on the subset $V(\Xi') \subseteq V(\Xi)$ consisting of all base expressions in $\Xi'$. Note that, in general, $G_{\Xi'}$ is not a subgraph of $G_{\Xi}$.
For $k \geq 1$, a $k$-expression is a $\Lambda$-expression for a label set $\Lambda$ of size $k$. It is easy to see that every labeled graph of order $n$ with label set $\Lambda$ is described by an $(n + |\Lambda|)$-expression. The clique-width $\operatorname{cw}(G)$ of a (labeled) graph $G$ is the least $k$ such that $G$ is described by a $k$-expression.
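As a small illustration of these definitions (our own example, using the operator notation introduced above, where $P(v)$ denotes the base expression for the single vertex $v$ carrying label $P$):

```latex
% The path P_3 with vertices v_1 - v_2 - v_3 equals the star K_{1,2} and is described
% by the following 2-expression over the label set {P, Q}: label the centre v_2 with P
% and the two leaves with Q, take disjoint unions, and add all edges between
% P-labelled and Q-labelled vertices.
\[
\Xi \;=\; \eta_{P,Q}\Bigl( P(v_2) \,\oplus\, \bigl( Q(v_1) \oplus Q(v_3) \bigr) \Bigr),
\qquad \text{hence } \operatorname{cw}(P_3) \le 2.
\]
```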
We remark that our notion of clique-width differs slightly from the one given by Courcelle and Olariu [24], since we allow vertices to have multiple labels, and we also allow the deletion of labels. Thus, our definition is similar to the definition of multi-clique-width [27]. However, for our algorithmic results, the definitions are equivalent, since the two notions of clique-width are bounded in terms of each other for every (labeled) graph.
Lemma 2.1 ([44]).
For a graph $G$ with $n$ vertices and clique-width $k$, there is an algorithm that outputs a $k'$-expression for $G$ where $k' \leq 2^{3k+2} - 1$. The algorithm has a running time of $\mathcal{O}(n^{9} \log n)$.
2.2 Monadic Second-Order Logic
We consider monadic second-order (MSO) logic, which is a fragment of second-order logic where we only quantify over unary relations (sets). In MSO, we consider two kinds of free variables, which we call set variables (uppercase $X, Y, \ldots$) and individual variables (lowercase $x, y, \ldots$). The quantifier rank of a formula is the nesting depth of its quantifiers.
Let $\tau$ be a relational vocabulary and $q \in \mathbb{N}$. By $\mathrm{MSO}_q[\tau]$, we denote the set of all MSO formulas of quantifier rank at most $q$ using only relation symbols in $\tau$, and we let $\mathrm{MSO}[\tau] := \bigcup_{q \in \mathbb{N}} \mathrm{MSO}_q[\tau]$. By $\mathrm{MSO}_q[\tau, \bar x, \bar X]$, we denote the set of all formulas in $\mathrm{MSO}_q[\tau]$ with free individual variables in $\bar x$ and free set variables in $\bar X$. In particular, $\mathrm{MSO}_q[\tau, \emptyset, \emptyset]$ denotes the set of sentences. Moreover, it will be convenient to separate the free individual variables into instance variables (typically $x_1, x_2, \ldots$) and parameter variables (typically $y_1, y_2, \ldots$). For this, we let $\mathrm{MSO}_q[\tau, \bar x, \bar y, \bar X]$ denote the set of all formulas with free instance variables in $\bar x$, free parameter variables in $\bar y$, and free set variables in $\bar X$. Furthermore, we write $\varphi(\bar x; \bar y; \bar X)$ to denote that the formula $\varphi$ has its free instance variables among the entries of $\bar x$, its free parameter variables among the entries of $\bar y$, and its free set variables among the entries of $\bar X$.
We normalize formulas such that, for all $q$, $\tau$, and tuples of free variables, the set of normalized formulas in $\mathrm{MSO}_q[\tau, \bar x, \bar y, \bar X]$ is finite, and there is an algorithm that, given an arbitrary formula in $\mathrm{MSO}[\tau]$, decides if the formula is normalized, and if not, computes an equivalent normalized formula. In the following, we assume that all formulas are normalized.
In this paper, all structures we consider will be labeled graphs, that is, $\Lambda$-graphs for some label set $\Lambda$. In notations such as $\mathrm{MSO}_q[\tau]$, it will be convenient to write $\Lambda$ instead of $\tau$ if $\tau = \{E\} \cup \Lambda$. For a $\Lambda$-labeled graph $G$ and a tuple $\bar v \in V(G)^{m}$, the $q$-type of $\bar v$ in $G$ is the set of all formulas $\varphi(\bar x) \in \mathrm{MSO}_q[\Lambda, \bar x]$ with $|\bar x| = m$ such that $G \models \varphi(\bar v)$.
2.3 VC Dimension
For $d, \ell \geq 1$, a formula $\varphi(\bar x; \bar y) \in \mathrm{MSO}[\Lambda]$ with $|\bar x| = d$ and $|\bar y| = \ell$, a $\Lambda$-labeled graph $G$, and a tuple $\bar w \in V(G)^{\ell}$, we let $\varphi(G; \bar w) := \{\, \bar v \in V(G)^{d} : G \models \varphi(\bar v; \bar w) \,\}$.
For a set $S \subseteq V(G)^{d}$, we let $\varphi(S) := \{\, S \cap \varphi(G; \bar w) : \bar w \in V(G)^{\ell} \,\}$.
We say that $S$ is shattered by $\varphi$ if $\varphi(S) = 2^{S}$. The VC dimension $\operatorname{VC}(\varphi, G)$ of $\varphi$ on $G$ is the maximum $m$ such that there is a set $S \subseteq V(G)^{d}$ of cardinality $m$ that is shattered by $\varphi$. In this paper, we are only interested in finite graphs, but for infinite $G$, we let $\operatorname{VC}(\varphi, G) := \infty$ if the maximum does not exist. For a class $\mathcal{C}$ of $\Lambda$-labeled graphs, the VC dimension of $\varphi$ over $\mathcal{C}$, $\operatorname{VC}(\varphi, \mathcal{C})$, is the least $m$ such that $\operatorname{VC}(\varphi, G) \leq m$ for all $G \in \mathcal{C}$ if such an $m$ exists, and $\infty$ otherwise.
Lemma 2.2 ([33, Theorem 17]).
There is a function $g$ such that the following holds. Let $\Lambda$ be a label set, let $\mathcal{C}_{k}$ be the class of all $\Lambda$-graphs of clique-width at most $k$, and let $\varphi(\bar x; \bar y) \in \mathrm{MSO}_q[\Lambda]$. Then $\operatorname{VC}(\varphi, G) \leq g(|\Lambda|, k, q, |\bar x|, |\bar y|)$ for all $G \in \mathcal{C}_{k}$.
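Without a restriction such as bounded clique-width, no such bound exists; the following standard construction (our illustration, not part of the lemma) shows this even for a single first-order formula:

```latex
% For phi(x; y) = E(x, y), the sets phi(G; w) are exactly vertex neighbourhoods. In the
% bipartite graph with parts A = {a_1, ..., a_c} and B = { b_S : S \subseteq A }, where
% b_S is adjacent precisely to the vertices in S, the neighbourhoods of the vertices b_S
% realize every subset of A, so A is shattered and VC(phi, G) >= c.
\[
\varphi(x; y) = E(x, y), \qquad
\bigl\{\, \varphi(G; b_S) \cap A \;:\; S \subseteq A \,\bigr\} \;=\; 2^{A}.
\]
```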
2.4 Parameterized Complexity
A parameterization is a function $\kappa$ mapping the input $x$ of a problem to a natural number $\kappa(x)$. An algorithm $A$ is an fpt algorithm with respect to $\kappa$ if there is a computable function $f$ and a polynomial $p$ such that for every input $x$ the running time of $A$ is at most $f(\kappa(x)) \cdot p(|x|)$.
A parameterized problem is a tuple $(P, \kappa)$. We say $(P, \kappa)$ or $P$ is fixed-parameter tractable if there is an fpt algorithm with respect to $\kappa$ for $P$, and we say $P$ is fixed-parameter linear if the polynomial in the running time of the fpt algorithm is linear. We say $(P, \kappa) \in \text{para-NP}$ if there is a nondeterministic fpt algorithm with respect to $\kappa$ for $P$. If the parameterization is clear from the context, then we omit it.
For two parameterized problems $(P_1, \kappa_1)$ and $(P_2, \kappa_2)$, an fpt Turing reduction from $(P_1, \kappa_1)$ to $(P_2, \kappa_2)$ is an algorithm $A$ with oracle access to $P_2$ such that $A$ decides $P_1$, $A$ is an fpt algorithm with respect to $\kappa_1$, and there is a computable function $g$ such that, on input $x$, $\kappa_2(y) \leq g(\kappa_1(x))$ for all oracle queries with oracle input $y$.
For additional background on parameterized complexity, we refer to [26].
3 Tractability for One-Dimensional Training Data on Well-Behaved Classes
We start by formalizing the parameterized version of the problem MSO-Consistent-Learn described in the introduction. For a training sequence $T$, a graph $G$, and a hypothesis $h$, we say $h$ is consistent with $T$ on $G$ if for every positive example $(\bar v, 1) \in T$, we have $h(\bar v) = 1$, and for every negative example $(\bar v, 0) \in T$, we have $h(\bar v) = 0$.
MSO-Consistent-Learn
Instance: $\Lambda$-labeled graph $G$, numbers $d, \ell, q \in \mathbb{N}$, training sequence $T \in \bigl(V(G)^{d} \times \{0, 1\}\bigr)^{*}$
Parameter: $d + \ell + q$
Problem: Return a hypothesis $h_{\varphi, \bar w}$ consisting of
a formula $\varphi(\bar x; \bar y) \in \mathrm{MSO}_{q}[\Lambda]$ with $|\bar x| = d$ and $|\bar y| = \ell$, and
a parameter setting $\bar w \in V(G)^{\ell}$,
such that $h_{\varphi, \bar w}$ is consistent with the training sequence $T$ on $G$, if such a hypothesis exists. Reject if there is no consistent hypothesis.
The problem 1D-MSO-Consistent-Learn refers to the 1-dimensional version of the problem MSO-Consistent-Learn where the arity of the training examples is $d = 1$. The tractability results for 1D-MSO-Consistent-Learn are significantly more straightforward than those for the higher-dimensional problem. This is due to the fact that the full training sequence can be encoded into the graph by only adding two new labels, and a parameter setting can be encoded with $\ell$ more new labels.
As discussed in Section 2.2, there is a function $f$ such that the number of normalized formulas $\varphi(x; \bar y) \in \mathrm{MSO}_{q}[\Lambda]$ with $|\bar y| = \ell$ is at most $f(q, \ell, |\Lambda|)$. Therefore, to solve 1D-MSO-Consistent-Learn, we can iterate over all such formulas $\varphi$ and focus on finding a parameter setting $\bar w$ such that $h_{\varphi, \bar w}$ is consistent with $T$ on $G$. Moreover, if the model-checking problem on a graph class with additional labels is tractable, then finding a consistent parameter setting is tractable as well by performing model checking on the graph with the encoded training sequence.
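As a concrete illustration of this reduction (a sketch; the label names are our own), if the positive and negative example vertices are encoded by fresh unary relations $P_{+}$ and $P_{-}$, then, for a fixed candidate formula $\varphi(x; \bar y)$, the existence of a consistent parameter setting is expressed by the MSO sentence

```latex
\[
\xi_{\varphi} \;=\; \exists y_1 \cdots \exists y_\ell\;
  \forall x\, \Bigl( \bigl( P_{+}(x) \rightarrow \varphi(x;\,y_1,\ldots,y_\ell) \bigr)
  \wedge \bigl( P_{-}(x) \rightarrow \neg \varphi(x;\,y_1,\ldots,y_\ell) \bigr) \Bigr),
\]
```

so a single model-checking call per candidate formula decides whether a consistent hypothesis with that formula exists.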
Lemma 3.1.
Let $\mathcal{C}$ be a class of labeled graphs, let $\mathcal{C}^{+}$ be the class of all extensions of graphs from $\mathcal{C}$ by additional labels, and suppose that the MSO-model-checking problem on $\mathcal{C}^{+}$ can be solved in time $f(|\xi|) \cdot T(n)$ for a function $f$ and a non-decreasing function $T$, where $\xi$ is the MSO sentence and $n$ is the size of the labeled graph given as input. Then there is a function $g$ such that 1D-MSO-Consistent-Learn can be solved on $\mathcal{C}$ in time $g(q, \ell, |\Lambda|) \cdot n \cdot T(n)$.
The formal proof of this result can be found in the full version [10]. If the input graph is given in the form of a $k$-expression, then the MSO model-checking problem is fixed-parameter linear on classes of bounded clique-width [23]. Therefore, Lemma 3.1 yields an fpt algorithm for 1D-MSO-Consistent-Learn on classes of bounded clique-width, with a running time that is quadratic in the size of the graph.
Theorem 1.2 improves this bound for classes of graphs of bounded clique-width even further, showing that the problem 1D-MSO-Consistent-Learn can be solved in time linear in the size of the graph. This can be done by again encoding the training sequence into the graph, but then extracting a consistent parameter setting directly, following techniques similar to the ones used by Courcelle and Seese for the corresponding model-checking problem on graphs of bounded clique-width. The full proof of Theorem 1.2 can be found in [10].
Since graphs of tree-width $t$ have clique-width at most $3 \cdot 2^{t-1}$ [22], Theorem 1.2 implies that for classes of graphs of bounded tree-width, 1D-MSO-Consistent-Learn is fixed-parameter linear as well. Moreover, although all background structures we consider in this paper are labeled graphs, we remark that the result for classes of bounded tree-width also holds on arbitrary relational structures and a corresponding version of 1D-MSO-Consistent-Learn.
4 Hardness for One-Dimensional Training Data
Previously, we restricted the input graph of the MSO-learning problem to certain well-behaved classes. Now, we consider the problem MSO-Consistent-Learn without any restrictions. Van Bergerem, Grohe, and Ritzert showed in [9] that there is a close relation between first-order model checking (FO-Mc) and learning first-order formulas. The fpt-reduction in [9] from model checking to learning yields hardness of learning first-order formulas on classes of structures that are not nowhere dense. It is simple to show (and not surprising) that MSO-Consistent-Learn is at least as hard as FO-Mc. The more interesting question is whether MSO-Consistent-Learn is at least as hard as the model-checking problem for MSO sentences (MSO-Mc), which is defined as follows.
MSO-Mc
Instance: $\Lambda$-labeled graph $G$, MSO sentence $\xi \in \mathrm{MSO}[\Lambda]$
Parameter: $|\xi|$
Problem: Decide whether $G \models \xi$ holds.
We give a positive answer, which even holds for the MSO-learning problem with only one-dimensional training data where we restrict the training sequence to contain at most two training examples.
Lemma 4.1.
The model-checking problem MSO-Mc is fpt Turing reducible to 1D-MSO-Consistent-Learn where we restrict the training sequence given as input to have length at most 2.
MSO-Mc is para-NP-hard under fpt Turing reductions as, even for some fixed sentence $\xi$, the corresponding model-checking problem can be NP-hard (for example, for a formula defining 3-Colorability; see [26] for details). Hence, Lemma 4.1 proves Theorem 1.5. We give a proof sketch for Lemma 4.1. The full proof can be found in [10].
Proof sketch of Lemma 4.1.
We describe an fpt algorithm solving MSO-Mc using access to a 1D-MSO-Consistent-Learn oracle. Let $G$ be a $\Lambda$-labeled graph, and let $\xi$ be an MSO sentence. We decide whether $G \models \xi$ holds recursively by decomposing the input formula. While handling negation and Boolean connectives is easy, the crucial part of the computation is handling quantification. Thus, we assume that $\xi = \exists x\, \psi$ or $\xi = \exists X\, \psi$ for some MSO formula $\psi$. For both types of quantifiers, we use the 1D-MSO-Consistent-Learn oracle to identify a small set of candidate vertices or sets such that $\psi$ is satisfied by some vertex or set if and only if it is satisfied by one of the identified candidates. Then, since the number of candidates will only depend on $\xi$ and $\Lambda$, we can check recursively whether $\psi$ holds for any of them and thereby decide MSO-Mc with an fpt algorithm.
More specifically, using the 1D-MSO-Consistent-Learn oracle, we partition the vertices and sets based on their $q$-type, and we only check one candidate for each class of the partition. Intuitively, for every pair of vertices, we call the oracle with one vertex as a positive example and the other vertex as a negative example. The oracle returns a hypothesis if and only if the types of the two vertices differ. When partitioning the sets of vertices, we encode the two sets to check into the graph before calling the oracle. Moreover, instead of calling the oracle for every pair of sets (which would lead to a running time that is exponential in the size of the graph), we group the sets based on their size, start by finding a small family of candidate sets of size one, and use these candidates iteratively to find a small family of sets with one additional vertex. With this technique, the number of oracle calls is only quadratic in the size of the graph.
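The following is a hedged sketch of the candidate-selection step for vertices (the interface is our assumption, not the authors' implementation): `separable(u, v)` stands for an oracle call with the two-element training sequence $((u, 1), (v, 0))$, presumably with no parameter variables allowed in the queried hypotheses, so that it succeeds exactly when the two vertices have different types.

```python
def candidate_vertices(vertices, separable):
    """Return one representative per class of the 'not separable in either direction'
    equivalence; under the assumption above, these classes are exactly the classes of
    vertices with equal type, so one candidate per class suffices."""
    reps = []
    for v in vertices:
        # v starts a new class iff it can be told apart from every representative so far
        if all(separable(v, r) or separable(r, v) for r in reps):
            reps.append(v)
        # otherwise v is equivalent to some representative and can be discarded
    return reps
```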
5 PAC Learning in Higher Dimensions
So far, we considered the consistent-learning setting, where the goal is to return a hypothesis that is consistent with the given examples. In this section, we study the MSO-learning problem in the agnostic PAC-learning setting. There, for an instance space $X$, we assume an (unknown) probability distribution $\mathcal{D}$ on $X \times \{0, 1\}$. The learner’s goal is to find a hypothesis $h \colon X \to \{0, 1\}$, using an oracle to draw training examples randomly from $\mathcal{D}$, such that $h$ (approximately) minimizes the generalization error $\operatorname{err}_{\mathcal{D}}(h) := \Pr_{(x, \lambda) \sim \mathcal{D}}\bigl(h(x) \neq \lambda\bigr)$.
For every $\Lambda$-labeled graph $G$ and $d, \ell, q \in \mathbb{N}$, let $\mathcal{H}_{d, \ell, q}(G)$ be the hypothesis class consisting of all functions $h_{\varphi, \bar w}$ with $\varphi(\bar x; \bar y) \in \mathrm{MSO}_{q}[\Lambda]$, $|\bar x| = d$, $|\bar y| = \ell$, and $\bar w \in V(G)^{\ell}$.
Formally, we define the MSO PAC-learning problem as follows.
MSO-PAC-Learn
Instance: $\Lambda$-labeled graph $G$,
numbers $d, \ell, q \in \mathbb{N}$ and $\varepsilon, \delta \in (0, 1)$,
oracle access to a probability distribution $\mathcal{D}$ on $V(G)^{d} \times \{0, 1\}$
Parameter: $d + \ell + q + \lceil 1/\varepsilon \rceil + \lceil 1/\delta \rceil$
Problem: Return a hypothesis $h \in \mathcal{H}_{d, \ell, q}(G)$ such that,
with probability of at least $1 - \delta$ over the choice of examples drawn
i.i.d. from $\mathcal{D}$, it holds that $\operatorname{err}_{\mathcal{D}}(h) \leq \min_{h^{*} \in \mathcal{H}_{d, \ell, q}(G)} \operatorname{err}_{\mathcal{D}}(h^{*}) + \varepsilon$.
The remainder of this section is dedicated to the proof of Theorem 1.3, that is, we want to show that MSO-PAC-Learn is fixed-parameter linear on classes of bounded clique-width when the input graph is given as a $k$-expression. To solve the problem algorithmically, we can follow the Empirical Risk Minimization (ERM) rule [51, 46], that is, our algorithm should minimize the training error (or empirical risk) $\operatorname{err}_{T}(h) := \frac{1}{m} \bigl|\{\, i \in [m] : h(\bar v_i) \neq \lambda_i \,\}\bigr|$
on the training sequence $T = \bigl((\bar v_1, \lambda_1), \ldots, (\bar v_m, \lambda_m)\bigr)$ of queried examples. Roughly speaking, an algorithm can solve MSO-PAC-Learn by querying a sufficient number of examples and then following the ERM rule. To bound the number of needed examples, we combine a fundamental result of statistical learning [12, 46], which bounds the number of needed examples in terms of the VC dimension of a hypothesis class, with Lemma 2.2, a result due to Grohe and Turán [33], which bounds the VC dimension of MSO-definable hypothesis classes on graphs of bounded clique-width. Together, they imply the following result. See the full version [10] for details.
Lemma 5.1.
There is a computable function $s_{0}$ such that any algorithm that proceeds as follows solves the problem MSO-PAC-Learn. Given a $\Lambda$-labeled graph $G$ of clique-width at most $k$, numbers $d, \ell, q \in \mathbb{N}$ and $\varepsilon, \delta \in (0, 1)$, and oracle access to a probability distribution $\mathcal{D}$ on $V(G)^{d} \times \{0, 1\}$, the algorithm queries at least $s_{0}(d, \ell, q, k, |\Lambda|, \lceil 1/\varepsilon \rceil, \lceil 1/\delta \rceil)$ many examples from $\mathcal{D}$ and then follows the ERM rule.
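For intuition only, a standard agnostic-PAC sample-size calculation of the kind that underlies Lemma 5.1 (the constant and the exact shape of the bound are illustrative assumptions, not the paper's function $s_{0}$):

```python
import math

def sample_size(vc_dim: int, eps: float, delta: float, c: float = 4.0) -> int:
    """Standard-form agnostic PAC bound: m >= c/eps^2 * (vc_dim + log(1/delta))
    examples suffice for ERM to be (eps, delta)-accurate; c is an unspecified constant."""
    return math.ceil(c / eps ** 2 * (vc_dim + math.log(1.0 / delta)))

# Example: a hypothesis class of VC dimension 10, accuracy 0.1, confidence 0.95.
print(sample_size(10, eps=0.1, delta=0.05))  # roughly 5200 examples
```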
Using this lemma, we can now give a proof sketch for Theorem 1.3, showing that MSO-PAC-Learn is fixed-parameter linear on classes of bounded clique-width if the input graph is given as a $k$-expression, even for dimensions $d \geq 2$. The full proof can be found in [10].
Proof sketch of Theorem 1.3.
Let $G$ be a $\Lambda$-labeled graph, let $d, \ell, q \in \mathbb{N}$ and $\varepsilon, \delta \in (0, 1)$, and assume we are given oracle access to a probability distribution $\mathcal{D}$ on $V(G)^{d} \times \{0, 1\}$. Moreover, let $k \in \mathbb{N}$ and let $\Xi$ be a $k$-expression given as input that describes $G$.
Let $m := s_{0}(d, \ell, q, k, |\Lambda|, \lceil 1/\varepsilon \rceil, \lceil 1/\delta \rceil)$, where $s_{0}$ is the function from Lemma 5.1. We sample $m$ examples from $\mathcal{D}$ and call the resulting sequence of training examples $T$. Then, for every subsequence $T'$ of $T$, we make use of the techniques in Section 3 (adapted to higher-dimensional training data) and compute a hypothesis $h_{T'}$ that is consistent with $T'$ if such a hypothesis exists. This can be computed by an fpt-algorithm with parameters $d$, $\ell$, $q$, and $k$. Finally, from all subsequences with a consistent hypothesis, we choose a subsequence $T^{*}$ of maximum length and return $h_{T^{*}}$. Note that this procedure minimizes the training error on the training sequence $T$. Hence, the procedure follows the ERM rule, and, by Lemma 5.1, it solves MSO-PAC-Learn. Since the number of subsequences to check can be bounded by $2^{m}$, all in all, the described procedure is an fpt-algorithm that solves MSO-PAC-Learn.
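A minimal sketch of this ERM step (the consistency check is abstracted as a callable; in the proof it is realized by the fpt consistency algorithm discussed above):

```python
from itertools import combinations

def erm_by_subsequences(examples, has_consistent_hypothesis):
    """Return a maximum-length subsequence of `examples` that admits a consistent
    hypothesis; any hypothesis consistent with it minimizes the empirical risk on the
    full sequence. `has_consistent_hypothesis` abstracts the check from Section 3."""
    for size in range(len(examples), -1, -1):            # try the longest subsequences first
        for subseq in combinations(examples, size):      # at most 2^|examples| checks overall
            if has_consistent_hypothesis(list(subseq)):
                return list(subseq)
    return []
```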
6 Consistent Learning in Higher Dimensions
In this section, we consider the problem MSO-Consistent-Learn with dimension $d \geq 2$ on labeled graphs of bounded clique-width. A brute-force attempt yields a solution in time $f(d, \ell, q, k) \cdot s \cdot n^{\ell + 1}$, where $s$ is the length of the training sequence, for some function $f$. This is achieved by iterating over all formulas, then iterating over all parameter assignments, and then performing model checking for each training example. We assume that the graph is considerably larger in scale than the sequence of training examples. Therefore, Theorem 1.4 significantly improves the running time to $f(d, \ell, q, k, |\Lambda|, s) \cdot n^{2}$. While Theorem 1.4 is not a fixed-parameter tractability result in the classical sense, we show that this is optimal in Section 7. The present section is dedicated to the proof of Theorem 1.4.
Until now, we have viewed the training sequence as a sequence of tuples $(\bar v_i, \lambda_i)$. In the following, it is useful to split the training sequence into two parts, a sequence of vertex tuples and a function which assigns the corresponding label to each tuple. Let $G$ be a $\Lambda$-labeled graph, $s \in \mathbb{N}$, $\bar V = (\bar v_1, \ldots, \bar v_s) \in \bigl(V(G)^{d}\bigr)^{s}$, $\mathcal{L} \colon [s] \to \{0, 1\}$, and $\varphi(\bar x; \bar y) \in \mathrm{MSO}_{q}[\Lambda]$. We call $(\bar V, \mathcal{L})$ $\varphi$-consistent in $G$ if there is a parameter setting $\bar w \in V(G)^{\ell}$ such that, for all $i \in [s]$, $G \models \varphi(\bar v_i; \bar w)$ if and only if $\mathcal{L}(i) = 1$. We say that $\bar w$ is a $\varphi$-witness for $(\bar V, \mathcal{L})$. This notation allows us to state the main technical ingredient for the proof of Theorem 1.4 as follows.
Theorem 6.1.
There is a computable function $f$ and an algorithm that, given a $\Lambda$-graph $G$ of clique-width $k$ in the form of a $k$-expression, a sequence $\bar V \in \bigl(V(G)^{d}\bigr)^{s}$, a function $\mathcal{L} \colon [s] \to \{0, 1\}$, and a formula $\varphi(\bar x; \bar y) \in \mathrm{MSO}_{q}[\Lambda]$, decides if $(\bar V, \mathcal{L})$ is $\varphi$-consistent in $G$ in time $f(d, \ell, q, k, |\Lambda|, s) \cdot n$, where $n$ is the size of the graph.
Using Theorem 6.1, we can now prove Theorem 1.4.
Proof of Theorem 1.4.
Given a $\Lambda$-labeled graph $G$ and a training sequence $T = \bigl((\bar v_1, \lambda_1), \ldots, (\bar v_s, \lambda_s)\bigr)$, we let $\bar V := (\bar v_1, \ldots, \bar v_s)$ and $\mathcal{L}(i) := \lambda_i$ for all $i \in [s]$. We iterate over all normalized formulas $\varphi(\bar x; \bar y) \in \mathrm{MSO}_{q}[\Lambda]$ with $|\bar x| = d$ and $|\bar y| = \ell$ and use Theorem 6.1 to check whether $(\bar V, \mathcal{L})$ is $\varphi$-consistent in $G$. If there is no $\varphi$ such that $(\bar V, \mathcal{L})$ is $\varphi$-consistent, then we reject the input. Otherwise, let $\varphi$ be such that $(\bar V, \mathcal{L})$ is $\varphi$-consistent. We compute a $\varphi$-witness $\bar w$ following the same construction as in the proof of Lemma 3.1. That is, using a fresh label, we encode the parameter choice of a single variable into the graph, and then we check whether consistency still holds for the corresponding formula that enforces this parameter choice. In total, we perform up to $\ell \cdot n$ such consistency checks to compute a $\varphi$-witness $\bar w$. The consistent formula $\varphi$ together with the $\varphi$-witness $\bar w$ can then be returned as a hypothesis $h_{\varphi, \bar w}$ that is consistent with $T$ on $G$ and is therefore a solution to MSO-Consistent-Learn.
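A hedged sketch of the witness-extraction loop just described (interface assumed; `consistent_with_prefix` stands for the consistency check of Theorem 6.1 applied to the formula that enforces the already-fixed parameter choices via fresh labels):

```python
def extract_witness(vertices, ell, consistent_with_prefix):
    """Fix the ell parameter vertices one at a time. `consistent_with_prefix(prefix)`
    decides whether the partial parameter setting `prefix` extends to a full
    consistent one; this needs at most ell * |vertices| consistency checks."""
    prefix = []
    for _ in range(ell):
        for v in vertices:
            if consistent_with_prefix(prefix + [v]):
                prefix.append(v)
                break
        else:
            return None  # cannot happen if the formula was verified to be consistent
    return prefix
```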
The remainder of this section is dedicated to the proof of Theorem 6.1. We start by introducing the formal definitions for types and sets of types over sequences of elements.
6.1 Type Definitions
Let $G$ be a $\Lambda$-labeled graph and $\bar v \in V(G)^{m}$. Recall that the $q$-type of $\bar v$ in $G$ is the set of all formulas $\varphi(\bar x) \in \mathrm{MSO}_{q}[\Lambda, \bar x]$ with $|\bar x| = m$ such that $G \models \varphi(\bar v)$. A $q$-type (in $m$ variables) is a set $p$ of normalized formulas in $\mathrm{MSO}_{q}[\Lambda, \bar x]$ such that, for each such formula $\varphi$, either $\varphi \in p$ or $\neg \varphi \in p$. We denote the set of all $q$-types by $T_{q}(\Lambda, m)$. Note that $T_{q}(\Lambda, m)$ is finite, since we only consider normalized formulas. For a type $p \in T_{q}(\Lambda, m)$ and a tuple $\bar v \in V(G)^{m}$, we write $G \models p(\bar v)$ if $G \models \varphi(\bar v)$ for all $\varphi \in p$. Observe that, for every $\bar v \in V(G)^{m}$, there is exactly one type $p \in T_{q}(\Lambda, m)$ with $G \models p(\bar v)$, namely the $q$-type of $\bar v$ in $G$. We say that a type $p$ is realizable if there is some $\Lambda$-labeled graph $G$ and tuple $\bar v$ such that $G \models p(\bar v)$. We are not particularly interested in types that are not realizable, but it is undecidable if a type is realizable, whereas the sets $T_{q}(\Lambda, m)$ are decidable. (More precisely, there is an algorithm that, given $q$, $\Lambda$, $m$, and a set $p$ of formulas, decides if $p \in T_{q}(\Lambda, m)$.) For a $q$-type $p$ and a $\Lambda$-labeled graph $G$, we let $p(G) := \{\, \bar v \in V(G)^{m} : G \models p(\bar v) \,\}$.
If $p(G) \neq \emptyset$, we say that $p$ is realizable in $G$.
As for formulas, we split the variables for types into two parts, so we consider $q$-types $p(\bar x; \bar y)$ with instance variables $\bar x$ and parameter variables $\bar y$, and we denote the set of all these types with $|\bar x| = m$ and $|\bar y| = \ell$ by $T_{q}(\Lambda, m, \ell)$. For a $\Lambda$-labeled graph $G$ and tuples $\bar v \in V(G)^{m}$, $\bar w \in V(G)^{\ell}$, we often think of the $q$-type of $(\bar v, \bar w)$ in $G$ as the $q$-type of $\bar v$ over $\bar w$ in $G$. Moreover, we let $p(G; \bar w) := \{\, \bar v \in V(G)^{m} : G \models p(\bar v; \bar w) \,\}$.
If $p(G; \bar w) \neq \emptyset$, we say that $p$ is realizable over $\bar w$ in $G$.
For a vector $\bar m = (m_1, \ldots, m_s)$ and a set $S$, we let $S^{\bar m} := S^{m_1} \times \cdots \times S^{m_s}$ be the set of all sequences of tuples $\bar V = (\bar v_1, \ldots, \bar v_s)$ with $\bar v_i \in S^{m_i}$. Let $G$ be a labeled graph, $\bar V \in V(G)^{\bar m}$ for some $\bar m$, and $\bar w \in V(G)^{\ell}$. We define the $q$-type of $\bar w$ over $\bar V$ in $G$ to be the tuple $(p_1, \ldots, p_s)$, where $p_i$ is the $q$-type of $\bar v_i$ over $\bar w$ in $G$.
Again, we need an “abstract” notion of type over a sequence. A $q$-type for a tuple of arities $\bar m = (m_1, \ldots, m_s)$ is an element of $T_{q}(\Lambda, m_1, \ell) \times \cdots \times T_{q}(\Lambda, m_s, \ell)$.
Let $\bar p = (p_1, \ldots, p_s)$ be such a type. For a labeled graph $G$, a sequence $\bar V = (\bar v_1, \ldots, \bar v_s) \in V(G)^{\bar m}$, and a tuple $\bar w \in V(G)^{\ell}$, we write $G \models \bar p(\bar V; \bar w)$ if $G \models p_i(\bar v_i; \bar w)$ for all $i \in [s]$. Note that the $q$-type of $\bar w$ over $\bar V$ in $G$ is the unique $\bar p$ with $G \models \bar p(\bar V; \bar w)$. For a type $\bar p$, a $\Lambda$-labeled graph $G$, and a sequence $\bar V \in V(G)^{\bar m}$, we let $\bar p(G; \bar V) := \{\, \bar w \in V(G)^{\ell} : G \models \bar p(\bar V; \bar w) \,\}$.
If $\bar p(G; \bar V) \neq \emptyset$, we say that $\bar p$ is realizable over $\bar V$ in $G$.
6.2 Computing the Realizable Types
For the proof of Theorem 6.1, we use the following result that allows us to compute the realizable types of an expression.
Lemma 6.2.
There is a computable function $f$ and an algorithm that, given $q, \ell \in \mathbb{N}$, a vector $\bar m = (m_1, \ldots, m_s)$ with $m_i \leq d$ for all $i \in [s]$, a $k$-expression $\Xi$ describing the graph $G := G_{\Xi}$, and a sequence $\bar V \in V(G)^{\bar m}$, computes the set of all $\bar p \in T_{q}(\Lambda, m_1, \ell) \times \cdots \times T_{q}(\Lambda, m_s, \ell)$ that are realizable over $\bar V$ in $G$ in time $f(d, \ell, q, k, |\Lambda|, s) \cdot |\Xi|$.
Before proving this result, we first show that it implies Theorem 6.1.
Proof of Theorem 6.1.
We assume that the input graph $G$ is given in the form of a $k$-expression $\Xi$. To check whether $(\bar V, \mathcal{L})$ is $\varphi$-consistent in $G$, we compute the set of all $\bar p$ that are realizable over $\bar V$ in $G$, using Lemma 6.2. Then, for each realizable $\bar p = (p_1, \ldots, p_s)$, we check if $\varphi \in p_i$ exactly for those $i \in [s]$ with $\mathcal{L}(i) = 1$. If we find such a $\bar p$, then $(\bar V, \mathcal{L})$ is $\varphi$-consistent; otherwise it is not.
It remains to prove Lemma 6.2. In a bottom-up algorithm, starting at the leaves of $\Xi$, we compute the set of all tuples of $q$-types that are realizable over $\bar V$ in $G$. This is possible because the realizable tuples of types of an expression can be computed from the tuples of types of its subexpressions. We formally prove this for the operators $\eta_{P, Q}$, $\rho_{P \to Q}$, $\delta_{P}$, and $\overrightarrow{\oplus}$ in the full version [10].
The difficulty with this approach is that we are talking about $s$-tuples of types. In general, the number of such tuples is exponential in $s$, and hence the size of the set we aim to compute could be exponentially large. Fortunately, this does not happen in graphs of bounded clique-width. By Lemma 2.2, we can bound the VC dimension of an MSO formula over classes of graphs of bounded clique-width. Further, we show in Lemma 6.3 that this suffices to give a polynomial bound for the number of realizable tuples.
Lemma 6.3.
Let $q, \ell, c \in \mathbb{N}$, and let $G$ be a $\Lambda$-labeled graph such that $\operatorname{VC}(\varphi, G) \leq c$ for all formulas $\varphi(\bar x; \bar y) \in \mathrm{MSO}_{q}[\Lambda]$ with $|\bar y| = \ell$. Let $\bar V \in V(G)^{\bar m}$ for some vector $\bar m = (m_1, \ldots, m_s)$ with $m_i \leq d$ for all $i \in [s]$. Then the number of types in $T_{q}(\Lambda, m_1, \ell) \times \cdots \times T_{q}(\Lambda, m_s, \ell)$ that are realizable over $\bar V$ in $G$ is bounded by a polynomial in $s$ whose degree and coefficients depend only on $q$, $\ell$, $d$, $|\Lambda|$, and $c$.
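For orientation (our aside, not necessarily the argument of the full proof [10]): polynomial bounds of this shape are familiar from the Sauer–Shelah lemma [45, 47], which controls the number of traces of a set family of bounded VC dimension:

```latex
% Sauer–Shelah: if a set family F \subseteq 2^U has VC dimension at most c,
% then its trace on any finite S \subseteq U has size polynomial in |S|:
\[
\bigl|\{\, F \cap S : F \in \mathcal{F} \,\}\bigr|
  \;\le\; \sum_{i=0}^{c} \binom{|S|}{i} \;=\; \mathcal{O}\bigl(|S|^{c}\bigr).
\]
```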
Proof of Lemma 6.2.
As argued in Section 2, we may assume that $\Xi$ only contains ordered-disjoint-union operators and no plain disjoint-union operators.
For every subexpression $\Xi'$ of $\Xi$, we let $\Lambda_{\Xi'}$ be the set of labels of $\Xi'$, that is, the set of unary relation symbols $P$ such that $G_{\Xi'}$ is a $\Lambda_{\Xi'}$-graph. Moreover, for every subexpression $\Xi'$ and every $i \in [s]$, we let $\bar v_i^{\Xi'}$ be the restriction of $\bar v_i$ to the vertices in $V(\Xi')$, and we let $\bar V^{\Xi'} := (\bar v_1^{\Xi'}, \ldots, \bar v_s^{\Xi'})$.
We inductively construct, for every subexpression $\Xi'$ of $\Xi$, the set $R_{\Xi'}$ of all types that are realizable over $\bar V^{\Xi'}$ in $G_{\Xi'}$.
- Case 1: $\Xi'$ is a base expression.
In this case, $G_{\Xi'}$ is a single-vertex graph, and we can construct $R_{\Xi'}$ by brute force in time bounded by a suitable (computable) function of $d$, $\ell$, $q$, $s$, and $|\Lambda|$: we iterate over all normalized formulas with the respective free variables and evaluate them on the single-vertex graph $G_{\Xi'}$.
- Case 2: $\Xi' = \eta_{P, Q}(\Xi'')$ for a subexpression $\Xi''$ and labels $P, Q$.
Note that $V(\Xi') = V(\Xi'')$ and hence $\bar V^{\Xi'} = \bar V^{\Xi''}$. As we show in the full version [10], there is a computable mapping such that, for every parameter tuple, its type over $\bar V^{\Xi'}$ in $G_{\Xi'}$ is obtained by applying the mapping componentwise to its type over $\bar V^{\Xi''}$ in $G_{\Xi''}$. Moreover, for every realizable type of $G_{\Xi'}$, we guarantee that there is at most one type of $G_{\Xi''}$ that is mapped to it. To compute the set $R_{\Xi'}$, we step through all $\bar p'' \in R_{\Xi''}$. For each such $\bar p''$ and all $i \in [s]$, we compute the unique image of its $i$-th component, if it exists. If, for some $i$, no such image exists, we move on to the next $\bar p''$. Otherwise, we add the resulting tuple to $R_{\Xi'}$.
- Case 3: $\Xi' = \rho_{P \to Q}(\Xi'')$ for a subexpression $\Xi''$ and labels $P, Q$.
Analogous to Case 2, again based on results from the full version [10].
- Case 4: $\Xi' = \delta_{P}(\Xi'')$ for a subexpression $\Xi''$ and a label $P$.
Analogous to Case 2, based on results from [10].
- Case 5: $\Xi' = \Xi_1 \overrightarrow{\oplus} \Xi_2$ for subexpressions $\Xi_1$ and $\Xi_2$.
In this case, every parameter tuple over $G_{\Xi'}$ splits into the subtuple of its entries in $V(\Xi_1)$ and the subtuple of its entries in $V(\Xi_2)$. As we show in the full version [10], there is a computable mapping such that the type of a parameter tuple over $\bar V^{\Xi'}$ in $G_{\Xi'}$ is obtained by applying this mapping componentwise to the pair consisting of the types of its two subtuples over $\bar V^{\Xi_1}$ in $G_{\Xi_1}$ and over $\bar V^{\Xi_2}$ in $G_{\Xi_2}$. Then, as we also show in [10], a type is realizable over $\bar V^{\Xi'}$ in $G_{\Xi'}$ if and only if it arises in this way from a pair of types that are realizable in the two parts. To compute $R_{\Xi'}$, we iterate over all pairs $(\bar p_1, \bar p_2) \in R_{\Xi_1} \times R_{\Xi_2}$. For each such pair and all $i \in [s]$, we compute the combined type of the $i$-th components, if it exists; if, for some $i$, it does not exist, we move on to the next pair. Otherwise, we add the resulting tuple to $R_{\Xi'}$.
This completes the description of our algorithm. To analyze the running time, let $N$ be the maximum of $|R_{\Xi'}|$,
where $\Xi'$ ranges over all subexpressions of $\Xi$. By Lemmas 2.2 and 6.3, $N$ is bounded by a polynomial in $s$ whose degree and coefficients are bounded by a computable function of $d$, $\ell$, $q$, $k$, and $|\Lambda|$.
The running time of each step of the construction can be bounded in terms of $N$, $s$, and the parameters by a suitable computable function. We need to make $|\Xi|$ such steps, one for each node of the expression tree. Thus, overall, we obtain the desired running time.
7 Hardness of Checking Consistency in Higher Dimensions
The following result shows, under the assumption $\mathrm{FPT} \neq \mathrm{W}[1]$, that Theorem 6.1 cannot be improved to an fpt-result.
Theorem 7.1.
There is a $q \in \mathbb{N}$ such that the following parameterized problem is W[1]-hard.
Instance: graph $G$ of clique-width at most 3, sequence $\bar V \in \bigl(V(G)^{d}\bigr)^{s}$, function $\mathcal{L} \colon [s] \to \{0, 1\}$, formula $\varphi(\bar x; \bar y)$ of quantifier rank at most $q$
Parameter: $\ell = |\bar y|$
Problem: Decide if $(\bar V, \mathcal{L})$ is $\varphi$-consistent in $G$.
Proof sketch.
We prove this by a reduction from the W[1]-complete weighted satisfiability problem for Boolean formulas in 2-conjunctive normal form [26]. The weight of an assignment to a set of Boolean variables is the number of variables set to 1.
WSat(2-CNF)
Instance: Boolean formula $\gamma$ in 2-CNF, $t \in \mathbb{N}$
Parameter: $t$
Problem: Decide if $\gamma$ has a satisfying assignment of weight $t$.
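A tiny example of the weight parameter (our illustration, not taken from the construction below):

```latex
% The 2-CNF formula gamma = (x_1) AND (x_2 OR x_3) has a satisfying assignment of
% weight 2 (set x_1 = x_2 = 1, x_3 = 0) but none of weight 1: setting only x_1
% violates the second clause, and setting only x_2 or only x_3 violates the first.
\[
\gamma \;=\; (x_1) \wedge (x_2 \vee x_3), \qquad t = 2.
\]
```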
Given a 2-CNF formula $\gamma$ in the variables $x_1, \ldots, x_m$ and $t \in \mathbb{N}$, we construct an instance of the consistency problem from Theorem 7.1 where the graph $G$ is a forest that encodes $\gamma$. Moreover, $\bar V$, $\mathcal{L}$, and $\varphi$ are chosen to verify that a $\varphi$-witness for $(\bar V, \mathcal{L})$ encodes exactly $t$ variables among $x_1, \ldots, x_m$ that can be set to 1 to satisfy $\gamma$. Hence, there is a $\varphi$-witness for $(\bar V, \mathcal{L})$ if and only if $\gamma$ has a satisfying assignment of weight $t$. Details on the construction can be found in the full version [10].
8 Conclusion
Just like model checking and the associated counting and enumeration problems, the learning problem we study here is a natural algorithmic problem for logics on finite structures. All these problems are related, but each has its own challenges requiring different techniques. Where model checking and enumeration are motivated by automated verification and database systems, we view our work as part of a descriptive complexity theory of machine learning [8].
The first problem we studied is 1D-MSO-Consistent-Learn, where the instances to classify consist of single vertices, and we extended the previous fixed-parameter tractability results for strings and trees [31, 30] to (labeled) graphs of bounded clique-width. Moreover, on general graphs, we showed that the problem is hard for the complexity class para-NP.
For MSO-learning problems in higher dimensions, we presented two different approaches that yield tractability results on graphs of bounded clique-width. For the agnostic PAC-learning problem MSO-PAC-Learn, we described a fixed-parameter tractable learning algorithm. Furthermore, in the consistent-learning setting for higher dimensions, we gave an algorithm that solves the learning problem and is fixed-parameter tractable in the size of the input graph. However, the algorithm is not fixed-parameter tractable in the size of the training sequence, and we showed that this is optimal.
In the learning problems considered so far, hypotheses are built using MSO formulas and tuples of vertices as parameters. We think that the algorithms presented in this paper could also be extended to hypothesis classes that allow tuples of sets as parameters. Finally, utilizing the full power of MSO, one could also consider a learning problem where, instead of classifying tuples of vertices, we are interested in classifying sets of vertices. That is, for a graph $G$, we are given labeled subsets of $V(G)$ and want to find a hypothesis that is consistent with the given examples. It is easy to see that the techniques used in our hardness result also apply to this modified problem, proving that it is para-NP-hard. However, it remains open whether our tractability results could also be lifted to this version of the problem.
References
- [1] Azza Abouzied, Dana Angluin, Christos H. Papadimitriou, Joseph M. Hellerstein, and Avi Silberschatz. Learning and verifying quantified boolean queries by example. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA, June 22–27, 2013, pages 49–60. ACM, 2013. doi:10.1145/2463664.2465220.
- [2] Howard Aizenstein, Tibor Hegedüs, Lisa Hellerstein, and Leonard Pitt. Complexity theoretic hardness results for query learning. Comput. Complex., 7(1):19–53, 1998. doi:10.1007/PL00001593.
- [3] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, 2011. doi:10.1145/2043652.2043656.
- [4] Vikraman Arvind, Johannes Köbler, and Wolfgang Lindner. Parameterized learnability of k-juntas and related problems. In Algorithmic Learning Theory, 18th International Conference, ALT 2007, Sendai, Japan, October 1–4, 2007, volume 4754 of Lecture Notes in Computer Science, pages 120–134. Springer, 2007. doi:10.1007/978-3-540-75225-7_13.
- [5] Pablo Barceló, Alexander Baumgartner, Víctor Dalmau, and Benny Kimelfeld. Regularizing conjunctive features for classification. J. Comput. Syst. Sci., 119:97–124, 2021. doi:10.1016/j.jcss.2021.01.003.
- [6] Pablo Barceló and Miguel Romero. The complexity of reverse engineering problems for conjunctive queries. In 20th International Conference on Database Theory, ICDT 2017, Venice, Italy, March 21–24, 2017, volume 68 of LIPIcs, pages 7:1–7:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPIcs.ICDT.2017.7.
- [7] Steffen van Bergerem. Learning concepts definable in first-order logic with counting. In 34th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2019, Vancouver, BC, Canada, June 24–27, 2019, pages 1–13. IEEE, 2019. doi:10.1109/LICS.2019.8785811.
- [8] Steffen van Bergerem. Descriptive Complexity of Learning. PhD thesis, RWTH Aachen University, Germany, 2023. doi:10.18154/RWTH-2023-02554.
- [9] Steffen van Bergerem, Martin Grohe, and Martin Ritzert. On the parameterized complexity of learning first-order logic. In PODS 2022: International Conference on Management of Data, Philadelphia, PA, USA, June 12–17, 2022, pages 337–346. ACM, 2022. doi:10.1145/3517804.3524151.
- [10] Steffen van Bergerem, Martin Grohe, and Nina Runde. The parameterized complexity of learning monadic second-order logic, 2023. doi:10.48550/arXiv.2309.10489.
- [11] Steffen van Bergerem and Nicole Schweikardt. Learning concepts described by weight aggregation logic. In 29th EACSL Annual Conference on Computer Science Logic, CSL 2021, Ljubljana, Slovenia (Virtual Conference), January 25–28, 2021, volume 183 of LIPIcs, pages 10:1–10:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.CSL.2021.10.
- [12] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929–965, October 1989. doi:10.1145/76359.76371.
- [13] Angela Bonifati, Radu Ciucanu, and Aurélien Lemay. Learning path queries on graph databases. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23–27, 2015, pages 109–120. OpenProceedings.org, 2015. doi:10.5441/002/edbt.2015.11.
- [14] Angela Bonifati, Radu Ciucanu, and Slawek Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016. doi:10.1145/2818637.
- [15] Angela Bonifati, Ugo Comignani, Emmanuel Coquery, and Romuald Thion. Interactive mapping specification with exemplar tuples. ACM Trans. Database Syst., 44(3):10:1–10:44, 2019. doi:10.1145/3321485.
- [16] Cornelius Brand, Robert Ganian, and Kirill Simonov. A parameterized theory of PAC learning. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7–14, 2023, pages 6834–6841. AAAI Press, 2023. doi:10.1609/aaai.v37i6.25837.
- [17] Balder ten Cate and Víctor Dalmau. Conjunctive queries: Unique characterizations and exact learnability. In 24th International Conference on Database Theory, ICDT 2021, Nicosia, Cyprus, March 23–26, 2021, volume 186 of LIPIcs, pages 9:1–9:24. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ICDT.2021.9.
- [18] Balder ten Cate, Víctor Dalmau, and Phokion G. Kolaitis. Learning schema mappings. ACM Trans. Database Syst., 38(4):28:1–28:31, 2013. doi:10.1145/2539032.2539035.
- [19] Balder ten Cate, Phokion G. Kolaitis, Kun Qian, and Wang-Chiew Tan. Active learning of GAV schema mappings. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10–15, 2018, pages 355–368. ACM, 2018. doi:10.1145/3196959.3196974.
- [20] Adrien Champion, Tomoya Chiba, Naoki Kobayashi, and Ryosuke Sato. ICE-based refinement type discovery for higher-order functional programs. J. Autom. Reason., 64(7):1393–1418, 2020. doi:10.1007/s10817-020-09571-y.
- [21] William W. Cohen and C. David Page Jr. Polynomial learnability and inductive logic programming: Methods and results. New Gener. Comput., 13(3&4):369–409, December 1995. doi:10.1007/BF03037231.
- [22] Derek G. Corneil and Udi Rotics. On the relationship between clique-width and treewidth. SIAM J. Comput., 34(4):825–847, 2005. doi:10.1137/S0097539701385351.
- [23] Bruno Courcelle, Johann A. Makowsky, and Udi Rotics. Linear time solvable optimization problems on graphs of bounded clique-width. Theory of Computing Systems, 33(2):125–150, 2000. doi:10.1007/s002249910009.
- [24] Bruno Courcelle and Stephan Olariu. Upper bounds to the clique width of graphs. Discret. Appl. Math., 101(1-3):77–114, 2000. doi:10.1016/S0166-218X(99)00184-5.
- [25] P. Ezudheen, Daniel Neider, Deepak D’Souza, Pranav Garg, and P. Madhusudan. Horn-ICE learning for synthesizing invariants and contracts. Proc. ACM Program. Lang., 2(OOPSLA):131:1–131:25, 2018. doi:10.1145/3276501.
- [26] Jörg Flum and Martin Grohe. Parameterized complexity theory. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2006. doi:10.1007/3-540-29953-X.
- [27] Martin Fürer. Multi-clique-width. In 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, Berkeley, CA, USA, January 9–11, 2017, volume 67 of LIPIcs, pages 14:1–14:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ITCS.2017.14.
- [28] Pranav Garg, Christof Löding, P. Madhusudan, and Daniel Neider. ICE: A robust framework for learning invariants. In Computer Aided Verification - 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18–22, 2014, volume 8559 of Lecture Notes in Computer Science, pages 69–87. Springer, 2014. doi:10.1007/978-3-319-08867-9_5.
- [29] Georg Gottlob and Pierre Senellart. Schema mapping discovery from data instances. J. ACM, 57(2):6:1–6:37, 2010. doi:10.1145/1667053.1667055.
- [30] Émilie Grienenberger and Martin Ritzert. Learning definable hypotheses on trees. In 22nd International Conference on Database Theory, ICDT 2019, Lisbon, Portugal, March 26–28, 2019, volume 127 of LIPIcs, pages 24:1–24:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICDT.2019.24.
- [31] Martin Grohe, Christof Löding, and Martin Ritzert. Learning MSO-definable hypotheses on strings. In International Conference on Algorithmic Learning Theory, ALT 2017, Kyoto University, Kyoto, Japan, October 15–17, 2017, volume 76 of Proceedings of Machine Learning Research, pages 434–451. PMLR, October 2017. ISSN: 2640-3498. URL: https://proceedings.mlr.press/v76/grohe17a.html.
- [32] Martin Grohe and Martin Ritzert. Learning first-order definable concepts over structures of small degree. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2017, Reykjavík, Iceland, June 20–23, 2017, pages 1–12. IEEE, June 2017. doi:10.1109/LICS.2017.8005080.
- [33] Martin Grohe and György Turán. Learnability and definability in trees and similar structures. Theory Comput. Syst., 37(1):193–220, January 2004. doi:10.1007/s00224-003-1112-8.
- [34] David Haussler. Learning conjunctive concepts in structural domains. Mach. Learn., 4:7–40, 1989. doi:10.1007/BF00114802.
- [35] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 100(1):78–150, 1992. doi:10.1016/0890-5401(92)90010-D.
- [36] Kouichi Hirata. On the hardness of learning acyclic conjunctive queries. In Algorithmic Learning Theory, 11th International Conference, ALT 2000, Sydney, Australia, December 11–13, 2000, volume 1968 of Lecture Notes in Computer Science, pages 238–251. Springer, 2000. doi:10.1007/3-540-40992-0_18.
- [37] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. URL: https://mitpress.mit.edu/books/introduction-computational-learning-theory.
- [38] Benny Kimelfeld and Christopher Ré. A relational framework for classifier engineering. ACM Trans. Database Syst., 43(3):11:1–11:36, 2018. doi:10.1145/3268931.
- [39] Yuanzhi Li and Yingyu Liang. Learning mixtures of linear regressions with nearly optimal complexity. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, July 6–9, 2018, volume 75 of Proceedings of Machine Learning Research, pages 1125–1144. PMLR, 2018. URL: http://proceedings.mlr.press/v75/li18b.html.
- [40] Christof Löding, P. Madhusudan, and Daniel Neider. Abstract learning frameworks for synthesis. In Tools and Algorithms for the Construction and Analysis of Systems - 22nd International Conference, TACAS 2016, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2016, Eindhoven, The Netherlands, April 2–8, 2016, volume 9636 of Lecture Notes in Computer Science, pages 167–185. Springer, 2016. doi:10.1007/978-3-662-49674-9_10.
- [41] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, 2nd edition, 2018.
- [42] Stephen H. Muggleton. Inductive logic programming. New Gener. Comput., 8(4):295–318, February 1991. doi:10.1007/BF03037089.
- [43] Stephen H. Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. J. Log. Program., 19/20:629–679, 1994. doi:10.1016/0743-1066(94)90035-3.
- [44] Sang-il Oum and Paul D. Seymour. Approximating clique-width and branch-width. J. Comb. Theory, Ser. B, 96(4):514–528, 2006. doi:10.1016/j.jctb.2005.10.006.
- [45] Norbert Sauer. On the density of families of sets. J. Comb. Theory, Ser. A, 13(1):145–147, 1972. doi:10.1016/0097-3165(72)90019-2.
- [46] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, Cambridge, 2014. doi:10.1017/CBO9781107298019.
- [47] Saharon Shelah. A combinatorial problem: stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972. doi:10.2140/pjm.1972.41.247.
- [48] Robert H. Sloan, Balázs Szörényi, and György Turán. Learning Boolean functions with queries. In Boolean Models and Methods in Mathematics, Computer Science, and Engineering, pages 221–256. Cambridge University Press, 2010. doi:10.1017/cbo9780511780448.010.
- [49] Slawek Staworko and Piotr Wieczorek. Learning twig and path queries. In 15th International Conference on Database Theory, ICDT 2012, Berlin, Germany, March 26–29, 2012, pages 140–154. ACM, 2012. doi:10.1145/2274576.2274592.
- [50] Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984. doi:10.1145/1968.1972.
- [51] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems 4, [NIPS Conference, Denver, Colorado, USA, December 2–5, 1991], pages 831–838, 1991. URL: https://proceedings.neurips.cc/paper_files/paper/1991/file/ff4d5fbbafdf976cfdc032e3bde78de5-Paper.pdf.
- [52] He Zhu, Stephen Magill, and Suresh Jagannathan. A data-driven CHC solver. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18–22, 2018, pages 707–721. ACM, 2018. doi:10.1145/3192366.3192416.