Learning Aggregate Queries Defined by
First-Order Logic with Counting
Abstract
In the logical framework introduced by Grohe and Turán (TOCS 2004) for Boolean classification problems, the instances to classify are tuples from a logical structure, and Boolean classifiers are described by parametric models based on logical formulas. This is a specific scenario for supervised passive learning, where classifiers are to be learned from labelled examples. Existing results in this scenario focus on Boolean classification. This paper presents learnability results beyond Boolean classification. We focus on multiclass classification problems where the task is to assign input tuples to arbitrary integers. To represent such integer-valued classifiers, we use aggregate queries specified by an extension of first-order logic with counting terms called FOC₁.
Our main result shows the following: given a database of polylogarithmic degree, we can build, in quasi-linear time, an index structure that makes it possible to learn FOC₁-definable integer-valued classifiers in time polylogarithmic in the size of the database and polynomial in the number of training examples.
Keywords and phrases: Supervised learning, multiclass classification problems, counting logic
2012 ACM Subject Classification: Theory of computation → Database theory; Theory of computation → Logic; Theory of computation → Finite Model Theory; Computing methodologies → Logical and relational learning; Computing methodologies → Supervised learning
Funding: This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 541000908.
Editors: Sudeepa Roy and Ahmet Kara

1 Introduction
We study the complexity of learning aggregate queries from examples. This is a classification problem of the following form. The elements that are to be classified come from a set X, the instance space. For a given set Y of labels, a Y-valued classifier on X is a function c : X → Y. We are given a training set T of labelled examples (x, λ) ∈ X × Y, i. e., λ is the label assigned to the instance x. The goal is to find a classifier, called a hypothesis, that can be used to predict the label of elements from X, including those not given in T. The term Boolean classification problem refers to the case where Y consists of two elements (often, Y = {0, 1}). We use the term multiclass classification problem to refer to cases where Y may be arbitrarily large. In machine learning, these problems fall into the category of supervised learning tasks: we want to learn a function from given input-output pairs. In contrast to this, in unsupervised learning (e. g. clustering), the goal is to learn patterns from unlabelled data [48].
We focus on learning problems related to the framework introduced by Grohe and Turán [35]. There, the instance space is a set of tuples from a logical structure (that is sometimes called the background structure), and the classifiers are Boolean and are described using parametric models based on logical formulas. In this paper, we extend the framework to multiclass classification problems where the classifiers are integer-valued, i. e., Y = ℤ. In the framework that we consider, the background structure is a relational database D, and the instance space is the set A^k of all k-tuples of elements from the active domain A of D (also called the universe of D). Here, k is a fixed positive integer. One fixes a parameter length ℓ (a fixed non-negative integer). A classifier is specified by a pair (t, w̄), where w̄ is an ℓ-tuple of elements in A, and t(x̄, ȳ) is a counting term in the first-order logic with counting FOC₁ [34] with free variables x̄ = (x₁, …, x_k) and ȳ = (y₁, …, y_ℓ). This pair represents the classifier that assigns to each k-tuple v̄ ∈ A^k the integer that is obtained by evaluating the counting term t in the database D while interpreting the variables x̄ with the elements v̄ and the variables ȳ with the "parameters" w̄. We will write ⟦t(x̄, w̄)⟧^D to denote this classifier A^k → ℤ.
Given a training set T ⊆ A^k × ℤ, we want to find a pair (t, w̄) such that the classifier ⟦t(x̄, w̄)⟧^D is consistent with T, i. e., it satisfies ⟦t(x̄, w̄)⟧^D(v̄) = λ for every (v̄, λ) ∈ T.
Example 1.1.
Let D be a relational database whose active domain contains authors and publications, where the binary relation Author contains all pairs (a, p) such that a is an author of the publication p, and the binary relation Citation contains all pairs (p, p′) such that the publication p cites the publication p′.
Suppose we are given a training set T that consists of a few pairs (a, n) where a is an author and n is the total number of citations of the publications of a. A reasonable classifier for this setting would be a mapping that assigns to every author present in the database the total number of citations of their publications. In our setting, this can be represented as follows. We let k := 1 and ℓ := 0. Since ℓ = 0, the "parameter" w̄ is fixed to be the empty tuple (). Since k = 1, we use a counting term t(x) with a single free variable x (that will be assigned with authors present in the database). We choose the counting term
t(x) := #(y, z). (Author(x, y) ∧ Citation(z, y)).
Evaluating ⟦t(x)⟧^D(a) for an author a yields the number of pairs (p, p′) such that a is an author of the publication p, and p′ is a publication that cites p. This is precisely the total number of citations of publications authored by a. Hence, ⟦t(x)⟧^D is the desired classifier.
Example 1.2.
Suppose we have a database that maintains a list of all cakes colleagues brought to work. We model this as a relational database D whose active domain contains persons, IDs of cakes, and types of cake. The binary relation Brought contains all pairs (a, c) where a is a person that brought the cake with ID c, and the binary relation Type contains all pairs (c, s) where c is the ID of a cake of type s (e. g., "chocolate cake", "strawberry cake", "carrot cake", etc). Suppose we want to find a classifier that predicts the popularity of colleagues. For this, via a survey, we gather examples (a, n) where a is a person and n is the popularity of the person, and we call the resulting set of labelled examples T. We choose k := 1 and ℓ := 1, so we want to find a classifier that uses a single parameter. According to our own experience at work, it seems conceivable that the following classifier ⟦t(x, w)⟧^D is consistent with T: the parameter w is "chocolate cake" and t(x, y) is the counting term
t(x, y) := #z. Brought(x, z) + #z. (Brought(x, z) ∧ Type(z, y)).
Note that ⟦t(x, w)⟧^D(a) counts the number of cakes brought by person a, where cakes of type w are counted twice, and the variable y will always be assigned the value of the parameter w.
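To make the semantics of such counting terms concrete, here is a small Python sketch (our own toy illustration, not part of the paper's formal development) that evaluates a term of the shape used in Example 1.2 over hand-made Brought and Type relations; all names and data values are invented for illustration.

```python
# Toy illustration: evaluate a counting term of the shape
#   t(x, y) = #z.Brought(x, z) + #z.(Brought(x, z) AND Type(z, y))
# over two small binary relations, stored as sets of pairs.

brought = {("alice", "c1"), ("alice", "c2"), ("bob", "c3")}
cake_type = {("c1", "chocolate cake"), ("c2", "carrot cake"), ("c3", "chocolate cake")}
cake_ids = {cid for _, cid in brought}

def popularity(person: str, favourite: str) -> int:
    """Counts the cakes brought by `person`; cakes of type `favourite` count twice."""
    total = sum(1 for cid in cake_ids if (person, cid) in brought)
    bonus = sum(1 for cid in cake_ids
                if (person, cid) in brought and (cid, favourite) in cake_type)
    return total + bonus

print(popularity("alice", "chocolate cake"))  # 2 cakes, one of them chocolate -> 3
print(popularity("bob", "chocolate cake"))    # 1 cake, chocolate -> 2
```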
In many application scenarios, the same database is used multiple times with different training sets to learn different classifiers. Thus, we consider a setting in which we are first only given the database, without any training examples. In a precomputation step, we allow gathering information that will be helpful for solving future learning tasks. This precomputation step can be viewed as building an index structure that is designed in order to support solving multiple learning tasks.
In the actual learning phase, we are repeatedly given training sets of labelled examples, and our task is to output a hypothesis that is consistent with the corresponding training set. For this learning phase, it would be desirable to have algorithms that run efficiently even if the database is too large to fit into the main memory. To achieve this, we are interested in algorithms that require only local access to the database, i. e., instead of having random access to the database, a learning algorithm should initially start with the elements given in the training set; subsequently, it may only retrieve the neighbours of elements it already holds in memory. By utilising the memory hierarchy, such local access can be achieved efficiently even in cases where random access is too prohibitive. In the context of learning (concerning Boolean classification problems), this local-access model has been introduced by Grohe and Ritzert [33].
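The following Python sketch illustrates the kind of interface the local-access model provides; the class and function names are ours and merely indicate the two request types described above, together with a breadth-first routine that collects a ball of bounded radius using only such requests.

```python
class LocalAccessOracle:
    """Illustrative interface for the local-access model: the learner starts from
    the elements in the training set and expands only along Gaifman-graph edges."""

    def __init__(self, adjacency):
        # adjacency: dict mapping each element to the set of its Gaifman neighbours
        self._adj = adjacency

    def is_edge(self, a, b) -> bool:
        """Answer "Is {a, b} an edge of the Gaifman graph?" in constant time."""
        return b in self._adj.get(a, set())

    def neighbours(self, a) -> list:
        """Return all neighbours of a, in time linear in their number."""
        return list(self._adj.get(a, set()))

def ball(oracle: LocalAccessOracle, start, radius: int) -> set:
    """Collect the radius-r ball around `start` using only local access (BFS)."""
    seen, frontier = {start}, {start}
    for _ in range(radius):
        frontier = {b for a in frontier for b in oracle.neighbours(a)} - seen
        seen |= frontier
    return seen
```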
Our contribution
Our main result is an algorithm that builds the index structure in time linear in the size and polynomial in the degree of the database. Afterwards, upon input of concrete training sets, classifiers definable in FOC₁ can be learned in time polynomial in the degree of the database and polynomial in the number of examples given in the training set. Moreover, the classifiers returned by our algorithm can be evaluated in time polynomial in the degree of the database. Furthermore, our algorithms for finding a classifier and for evaluating this classifier do not require random access to the database but only rely on the local-access model.
For databases of polylogarithmic degree (i. e., of degree at most (log n)^c, where c is a constant and n is the size of the database), our main result implies that the index structure can be built in quasi-linear time (i. e., time n · (log n)^{O(1)}); afterwards, FOC₁-definable integer-valued classifiers can be learned in time polylogarithmic (so, in particular, sublinear) in the size of the database and polynomial in the number of training examples.
Previous results in the framework of Grohe and Turán for Boolean classification problems relied on the fact that it suffices to check a constant number of queries while limiting the search space for the parameters to a neighbourhood of a certain radius [33, 7, 10]. For our setting of multiclass classification with aggregate queries, however, this does not hold any more. Hence, a priori, it is not clear that sublinear-time learning algorithms are possible for the multiclass case at all. The main technical challenge towards our learnability result was to find an approach that keeps the number of queries to check small (i. e., polynomial in the degree of the database), while still being able to limit the search space for the parameters to a small neighbourhood around the given training tuples.
Organisation
We provide the necessary definitions concerning FOC₁ in Section 2, and we formally introduce the learning problem that we consider in Section 3. The precise statement of our main result is given in Theorem 3.1. Our proof makes heavy use of the locality properties of FOC₁ shown in [34], including a decomposition of FOC₁-formulas into local formulas. These properties are used in Section 4 to provide our main technical tool for the proof of the main result. Section 5 concludes the paper with a summary and an outlook on future work. In the remainder of this introduction, we give an overview of related work.
Related work
The first-order logic with counting FOC was introduced in [42] and further studied in [34, 7]. This logic extends first-order logic (FO) by the ability to formulate counting terms that evaluate to integers, and by numerical predicates that allow comparing the results of counting terms. It was shown in [42] that the model-checking problem for FOC is fixed-parameter tractable on classes of structures of bounded degree. From [34] it is known that the fixed-parameter tractability of FOC cannot be generalised to even very simple classes of structures of unbounded degree such as unranked trees (under a reasonable assumption in parameterised complexity). However, [34] identified a fragment called FOC₁ for which model-checking of formulas and evaluation of counting terms are fixed-parameter tractable on all nowhere dense classes of structures. The present paper uses counting terms of FOC₁ to represent integer-valued classifiers.
The learning framework we consider has been introduced for Boolean classification problems in [35], which provides information-theoretic learnability results for classes of classifiers that can be specified using FO- and MSO-formulas on restricted classes of structures, such as the class of planar graphs or classes of graphs of bounded degree. Algorithmic aspects of the framework, including the running time of a learning algorithm, were first studied in [33]. The paper showed that Boolean classifiers definable in FO can be learned in sublinear time on structures of polylogarithmic degree. Analogous results have been obtained for MSO on strings [32] and on trees [29], which included a precomputation step to allow for efficient repeated learning. The paper [9] studied the parameterised complexity of the Boolean classification problem and showed that on arbitrary relational structures, learning hypotheses definable in FO is hard for the parameterised complexity class (i. e., subject to a plausible complexity-theoretic assumption, it is not fixed-parameter tractable). The paper also showed that the problem is fixed-parameter tractable if the structures come from a nowhere dense class.
For Boolean classifiers definable in the extension FOCN of FO with counting quantifiers and numerical predicates, [7] obtained a sublinear-time learning algorithm for structures of bounded degree, i. e., classes of structures where the degree is bounded by a constant. Recently, [8] lifted this result to structures of tiny degree, i. e., classes of structures of degree up to (log log n)^c for some constant c, where n is the size of the structure. The paper [10] considered a notion of weighted structures, which extend ordinary relational structures by assigning weights, i. e. elements from particular rings or abelian groups, to tuples present in the structure. It introduced the expressive logic FOWA, which extends FO by means of aggregating weights and formulating both formulas (that evaluate to "true" or "false") and terms (that "aggregate" weights and evaluate to values in the associated ring or abelian group). For the fragment FOWA₁ (that still extends FO), the paper showed that Boolean classifiers definable by FOWA₁-formulas over weighted background structures of polylogarithmic degree can be learned in sublinear time after quasi-linear-time preprocessing. This lifts the results obtained in [33] for FO to the substantially more expressive logic FOWA₁. As the logic FOC₁ can be embedded in FOWA₁, it follows from [10] that Boolean classifiers definable by FOC₁-formulas over background structures of polylogarithmic degree can be learned in sublinear time after quasi-linear-time preprocessing. The main result of the present paper can be viewed as a generalisation of this to integer-valued classification problems.
The algorithmic results obtained so far within the framework introduced in [35] all focus on Boolean classification problems. However, many application scenarios require multiclass classification (cf. [21, 14, 36]). In the database systems literature, multiclass classifiers are typically described by aggregate queries [51, 52, 53, 54, 45]. In this paper, aggregate queries are represented by the counting terms of FOC₁.
Closely related to the framework we consider is the framework of inductive logic programming (ILP) [46, 47, 39, 19, 20]. Both frameworks deal with a passive supervised learning setting, where the learning algorithms are given labelled examples. These examples are labelled according to some target concept, and the algorithms should return a hypothesis that approximately matches this target concept. One of the main differences between both frameworks is that we represent the background knowledge by a relational database, whereas in ILP, it is represented in a background theory, i. e., a set of formulas. Related logical learning frameworks have also been studied in formal verification [27, 43, 23, 56, 18].
In the database literature, various approaches to learning queries from examples have been studied, both in passive (such as ours) and active learning settings. In passive learning settings, results often focus on conjunctive queries [37, 38, 6, 40, 5, 55] or consider queries outside the relational database model [50, 11], while we focus on FOC₁, an extension of full first-order logic. In the active learning setting introduced by Angluin [4], learning algorithms are allowed to actively query an oracle. Results in this setting [2, 49, 1, 11, 12, 15] again consider various types of queries. Another related subject in the database literature is the problem of learning schema mappings from examples [13, 28, 3, 16, 17].
2 Preliminaries
This section fixes the basic notation used throughout the paper, and it provides the precise syntax and semantics of first-order logic with counting (the latter is taken almost verbatim from [34]). We let ℤ, ℕ, and ℕ≥1 denote the sets of integers, non-negative integers, and positive integers, respectively. For m, n ∈ ℕ with m ≤ n, we let [m, n] := {m, m+1, …, n} and [n] := [1, n]. For a k-tuple v̄ = (v₁, …, v_k) we write |v̄| to denote its arity (i. e., its length) k. By () we denote the empty tuple, i. e., the tuple of arity 0.
Signatures and structures
A signature σ is a finite set of relation symbols. Every relation symbol R has a fixed arity ar(R) ∈ ℕ. Let σ be a signature. A σ-structure (or σ-database) D consists of a finite set A, called the universe of D, and a relation R^D ⊆ A^{ar(R)} for every R ∈ σ. Note that signatures may contain relation symbols of arity 0. There are only two 0-ary relations over a set A, namely ∅ and {()}, which we interpret as "false" and "true", respectively. We define the size of a σ-structure D as ‖D‖ := |A| + Σ_{R ∈ σ} ar(R) · |R^D|.
Let σ and σ′ be signatures with σ ⊆ σ′. A σ′-expansion of a σ-structure D is a σ′-structure D′ which has the same universe as D and which satisfies R^{D′} = R^D for all R ∈ σ.
A substructure of a σ-structure D with universe A is a σ-structure D′ with universe A′ ⊆ A that satisfies R^{D′} ⊆ R^D for every R ∈ σ. For a set B ⊆ A, the induced substructure D[B] of D on B is the σ-structure with universe B, where R^{D[B]} := R^D ∩ B^{ar(R)} for every R ∈ σ.
Gaifman graph, degree, distances, and neighbourhoods
In this paper, when speaking of graphs, we mean undirected simple graphs.
The Gaifman graph G_D of a σ-structure D with universe A is the graph with vertex set A whose edge set contains exactly those edges {a, b} where a, b ∈ A, a ≠ b, and there exist an R ∈ σ and a tuple (a₁, …, a_{ar(R)}) ∈ R^D such that a, b ∈ {a₁, …, a_{ar(R)}}. The degree of a σ-structure D is defined as the degree of its Gaifman graph G_D (i. e., the maximum number of neighbours of a vertex of G_D).
The distance dist^D(a, b) between two elements a, b ∈ A is the minimal number of edges of a path from a to b in G_D; if no such path exists, we let dist^D(a, b) := ∞. For a tuple ā = (a₁, …, a_k) ∈ A^k and an element b ∈ A, we let dist^D(ā, b) := min_{i ∈ [k]} dist^D(a_i, b).
Consider a k-tuple ā ∈ A^k for some k ∈ ℕ≥1. For every r ∈ ℕ, the ball of radius r (or r-ball) of ā in D is the set B_r^D(ā) := {b ∈ A : dist^D(ā, b) ≤ r}. The neighbourhood of radius r (or r-neighbourhood) of ā in D is the induced substructure N_r^D(ā) := D[B_r^D(ā)].
First-order logic with counting
Let vars be a fixed, countably infinite set of variables. A σ-interpretation I = (D, β) consists of a σ-structure D and an assignment β : vars → A, where A denotes the universe of D. For k ∈ ℕ≥1, distinct variables y₁, …, y_k, and elements a₁, …, a_k ∈ A, we write I[a₁/y₁, …, a_k/y_k] for the interpretation (D, β′), where β′ is the assignment with β′(y_i) = a_i for every i ∈ [k] and β′(z) = β(z) for all z ∈ vars \ {y₁, …, y_k}.
Next, we provide the syntax and semantics of the logic FOC₁ [34]. This logic allows formulating numerical statements based on counting terms and numerical predicates.
For the remainder of this paper, fix a triple (𝐏, ar, ⟦·⟧), called a numerical predicate collection, where 𝐏 is a finite set of predicate names, ar assigns an arity ar(P) ∈ ℕ≥1 to each P ∈ 𝐏, and ⟦·⟧ assigns a semantics ⟦P⟧ ⊆ ℤ^{ar(P)} to each P ∈ 𝐏. Examples of numerical predicates are the equality predicate P_= with ⟦P_=⟧ = {(m, m) : m ∈ ℤ}, the comparison predicate P_≤ with ⟦P_≤⟧ = {(m, n) ∈ ℤ² : m ≤ n}, or the prime number predicate P_prime with ⟦P_prime⟧ the set of all primes. When analysing the running time of algorithms, we will assume that machines have access to oracles for evaluating the numerical predicates. That is, when given a P ∈ 𝐏 and a tuple (m₁, …, m_{ar(P)}) of integers, the oracle takes time O(1) to answer whether (m₁, …, m_{ar(P)}) ∈ ⟦P⟧.
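As an illustration of how a numerical predicate collection can be thought of operationally, the following Python sketch (our own, with invented names) maps each predicate name to its arity and to an oracle deciding membership of an integer tuple in its semantics.

```python
def _is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

# A numerical predicate collection: name -> (arity, membership oracle).
NUMERICAL_PREDICATES = {
    "P_eq":    (2, lambda m, n: m == n),     # equality on integers
    "P_leq":   (2, lambda m, n: m <= n),     # comparison
    "P_prime": (1, _is_prime),               # prime numbers
}

def holds(pred: str, *args: int) -> bool:
    """Stand-in for the constant-time oracle: decide whether the tuple of
    integers belongs to the semantics of the named predicate."""
    arity, oracle = NUMERICAL_PREDICATES[pred]
    assert len(args) == arity
    return oracle(*args)
```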
Let σ be a signature. The set of formulas and counting terms (for short: terms) of FOC₁[σ] is built according to the following rules.
- (1) x₁ = x₂ and R(x₁, …, x_k) are formulas (in particular, if ar(R) = 0, then R() is a formula), where R ∈ σ, k := ar(R), and x₁, x₂, …, x_k are variables. We let free(x₁ = x₂) := {x₁, x₂} and free(R(x₁, …, x_k)) := {x₁, …, x_k}.
- (2) If φ and ψ are formulas, then ¬φ and (φ ∨ ψ) are also formulas. We let free(¬φ) := free(φ) and free((φ ∨ ψ)) := free(φ) ∪ free(ψ).
- (3) If φ is a formula and y ∈ vars, then ∃y φ is a formula. We let free(∃y φ) := free(φ) \ {y}.
- (4) If φ is a formula and ȳ = (y₁, …, y_s) is a tuple of pairwise distinct variables, for s ∈ ℕ≥1, then #ȳ.φ is a counting term. We let free(#ȳ.φ) := free(φ) \ {y₁, …, y_s}.
- (5) Every integer i ∈ ℤ is a counting term. We let free(i) := ∅.
- (6) If t₁ and t₂ are counting terms, then (t₁ + t₂) and (t₁ · t₂) are also counting terms. We let free((t₁ + t₂)) := free((t₁ · t₂)) := free(t₁) ∪ free(t₂).
- (7) If P ∈ 𝐏, m := ar(P), and t₁, …, t_m are counting terms with |free(t₁) ∪ ⋯ ∪ free(t_m)| ≤ 1, then P(t₁, …, t_m) is a formula. We let free(P(t₁, …, t_m)) := free(t₁) ∪ ⋯ ∪ free(t_m).
First-order logic FO is the fragment of FOC₁ built by using only the rules (1)–(3). We write (φ ∧ ψ) and ∀y φ as shorthands for ¬(¬φ ∨ ¬ψ) and ¬∃y ¬φ. For counting terms t₁ and t₂, we write (t₁ − t₂) as a shorthand for (t₁ + ((−1) · t₂)).
By FOC₁, we denote the union of all FOC₁[σ] for arbitrary signatures σ. This applies analogously to FO. For a σ-interpretation I = (D, β), the semantics ⟦ξ⟧^I of expressions ξ built by the rules (1)–(3) are defined in the same way as for FO (with ⟦φ⟧^I := 1 if I satisfies φ, and ⟦φ⟧^I := 0 otherwise); the semantics of the remaining rules are as follows. Let I = (D, β) be a σ-interpretation, and let A denote the universe of D.
- (4) ⟦#ȳ.φ⟧^I := |{ (a₁, …, a_s) ∈ A^s : ⟦φ⟧^{I[a₁/y₁, …, a_s/y_s]} = 1 }|, where ȳ = (y₁, …, y_s) and s := |ȳ|.
- (5) ⟦i⟧^I := i, for every integer i ∈ ℤ.
- (6) ⟦(t₁ + t₂)⟧^I := ⟦t₁⟧^I + ⟦t₂⟧^I and ⟦(t₁ · t₂)⟧^I := ⟦t₁⟧^I · ⟦t₂⟧^I.
- (7) ⟦P(t₁, …, t_m)⟧^I := 1 if (⟦t₁⟧^I, …, ⟦t_m⟧^I) ∈ ⟦P⟧, and ⟦P(t₁, …, t_m)⟧^I := 0 otherwise.
Examples of counting terms in FOC₁ and their intuitive meaning can be found in Examples 1.1 and 1.2.
An expression is a formula or a counting term. For an expression ξ and a tuple x̄ = (x₁, …, x_k) of pairwise distinct variables, we write ξ(x̄) to indicate that free(ξ) ⊆ {x₁, …, x_k}. A sentence is a formula without free variables. A ground term is a counting term without free variables.
For a formula φ and a σ-interpretation I, we write I ⊨ φ to indicate that ⟦φ⟧^I = 1. Likewise, I ⊭ φ indicates that ⟦φ⟧^I = 0. For a formula φ(x̄), a σ-structure D with universe A, and a tuple v̄ = (v₁, …, v_k) ∈ A^k, we write D ⊨ φ[v̄] to indicate that (D, β) ⊨ φ for all assignments β with β(x_i) = v_i for all i ∈ [k]. Similarly, for a counting term t(x̄), we write ⟦t⟧^D[v̄] for the integer ⟦t⟧^{(D, β)}. In case that φ is a sentence and t is a ground term, we shortly write D ⊨ φ instead of D ⊨ φ[()], and we write ⟦t⟧^D instead of ⟦t⟧^D[()].
The size ‖ξ‖ of an FOC₁-expression ξ is defined as the length of the string representation of ξ, where integers and variables are considered as having length 1. For s, q ∈ ℕ, we consider the class of all FOC₁[σ]-expressions of size at most s and with at most q free variables.
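As a small worked example of the semantics just defined (our own toy instance, written in the notation reconstructed above), consider a structure D with a binary edge relation E and evaluate a counting term (rule (4)) and a numerical formula (rule (7)), each with a single free variable x:

```latex
% Toy instance: universe A = {a, b, c} and E^D = {(a,b), (a,c)}.
\[
  \llbracket \#y.\,E(x,y) \rrbracket^{(D,\;x \mapsto a)}
  \;=\; \bigl|\{\, v \in A \;:\; (a,v) \in E^{D} \,\}\bigr| \;=\; 2 ,
\]
\[
  (D,\;x \mapsto a) \;\models\; P_{\leq}\bigl(2,\ \#y.\,E(x,y)\bigr)
  \qquad\text{since}\qquad (2,2) \in \llbracket P_{\leq} \rrbracket .
\]
```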
Hypotheses
Let D be a σ-structure with universe A, let t(x̄, ȳ) be an FOC₁[σ]-term, let k := |x̄|, let ℓ := |ȳ|, and let w̄ ∈ A^ℓ. The hypothesis represented in D by the counting term t and the parameter tuple w̄ is defined as the mapping h : A^k → ℤ such that h(v̄) := ⟦t⟧^D[v̄, w̄] for every v̄ ∈ A^k. That is, evaluating the counting term t in D while assigning the variables x̄ to the elements v̄ and assigning the variables ȳ to the elements w̄ yields the integer h(v̄). Henceforth, we will write ⟦t(x̄, w̄)⟧^D to denote this function h. For a set T ⊆ A^k × ℤ, we say that ⟦t(x̄, w̄)⟧^D is consistent with T if and only if ⟦t(x̄, w̄)⟧^D(v̄) = λ for all (v̄, λ) ∈ T.
3 Learning FOC₁-Definable Aggregate Queries
Let σ and σ′ be signatures with σ ⊆ σ′, fix two numbers k ∈ ℕ≥1 and ℓ ∈ ℕ, and let Δ and Δ′ be two sets of terms with Δ ⊆ FOC₁[σ] and Δ′ ⊆ FOC₁[σ′]. We study the following problem.
Precomputation: Given a σ-structure D with universe A, compute a σ′-expansion D′ of D and a lookup table whose size is independent of D.
Input: Training set T ⊆ A^k × ℤ.
Task: Return a term t(x̄, ȳ) ∈ Δ′ and a tuple w̄ ∈ A^ℓ such that the hypothesis ⟦t(x̄, w̄)⟧^{D′} is consistent with T. The algorithm may reject if there is no term t ∈ Δ and no tuple w̄ ∈ A^ℓ such that the hypothesis ⟦t(x̄, w̄)⟧^D is consistent with T.
As described in the introduction, the precomputation phase can be viewed as building an index structure for the given database D. Afterwards, this index structure can be used each time we receive as input a new training set T. The "learning phase" is what happens after receiving such a set T; the desired output is a hypothesis that is consistent with T. The main contribution of this paper is an efficient solution for this problem. Before presenting the exact statement of our result (Theorem 3.1), recall from Section 1 the discussion on the benefits of local-access algorithms. In the learning phase (i. e., when receiving a training set), the algorithm we provide does not require random access to the database. Instead, it only needs local access, i. e., it only accesses the neighbours of elements that it already holds in memory, initially starting with the elements given in the training set. Here, "neighbours" refers to neighbours in the Gaifman graph G_D of the database D. Formally, local access means that the algorithm can access an oracle that answers requests of the form "Is {a, b} an edge of G_D?" in constant time and requests of the form "Return a list of all neighbours of a in G_D" in time linear in the number of neighbours of a. As our machine model, we use a random-access machine (RAM) model, and we consider running times under the uniform-cost measure. This allows us to store an element of the database in a single memory cell and access it in a single computation step.
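To summarise the two phases of this setting at the level of an interface, the following Python skeleton (names and types are ours, not taken from the paper) shows what an implementation would have to provide: a one-off precomputation over the database and a learning routine that is called once per training set and returns a consistent hypothesis or rejects.

```python
from typing import Any, Optional, Sequence, Tuple

InstanceTuple = Tuple[Any, ...]          # a k-tuple over the active domain
Example = Tuple[InstanceTuple, int]      # a labelled example (v, lambda)

class AggregateQueryLearner:
    """Skeleton of the two-phase setting (names are ours, not the paper's):
    one precomputation pass over the database, then repeated learning calls
    that use only local access and the precomputed lookup table."""

    def precompute(self, database: Any) -> None:
        """Build the expanded structure and the lookup table (the index)."""
        raise NotImplementedError

    def learn(self, training_set: Sequence[Example]) -> Optional[Tuple[str, InstanceTuple]]:
        """Return a pair (term, parameter tuple) whose hypothesis is consistent
        with the training set, or None to signal "reject"."""
        raise NotImplementedError
```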
The main result of this paper is the following theorem.
Theorem 3.1.
Let σ be a signature, let k ∈ ℕ≥1 and ℓ ∈ ℕ, let s ∈ ℕ, let S ⊆ ℤ be a finite set of integers, and let Δ be the set of all FOC₁[σ]-terms of size at most s, with free variables among x₁, …, x_k, y₁, …, y_ℓ, that only use integers from S.
There is an extension σ′ of σ with relation symbols of arity at most 1, and there is a number s′ ∈ ℕ such that, for the set Δ′ of all FOC₁[σ′]-terms of size at most s′ with free variables among x₁, …, x_k, y₁, …, y_ℓ, there is an algorithm that solves the above learning problem as follows.
For a σ-structure D of size n and degree at most d, the precomputation to compute the σ′-expansion D′ of D and the associated lookup table takes time d^{O(1)} · n. Afterwards, upon input of a training set T of size m, the algorithm uses only local access to D′, access to the lookup table, and time polynomial in m and d to return either a hypothesis consistent with T or the answer "reject". Furthermore, the returned hypothesis can be evaluated in time polynomial in d, using only local access to D′ and access to the lookup table.
In particular, this implies that when D comes from a class of σ-structures of polylogarithmic degree, then the precomputation takes time quasi-linear in n (i. e., n · (log n)^{O(1)}), a hypothesis is found within time polynomial in |T| and polylogarithmic in n, and it can be evaluated in time polylogarithmic in n.
We remark that the algorithm given in the proof of Theorem 3.1 is not meant to be implemented and used in practice. Instead, Theorem 3.1 serves as a fundamental result that shows the theoretical (and asymptotic) feasibility of learning aggregate queries definable in FOC₁. This is in line with previous work on the descriptive complexity of learning [35, 33, 32, 29, 7, 10, 9, 8] and closely related work on so-called algorithmic meta-theorems (see, e. g., [30, 41]).
Before turning to the proof of Theorem 3.1, let us first briefly discuss the situation. In the assumption of the theorem, we require the set S of integers occurring in terms of Δ to be finite. However, note that we can still represent other integers in Δ by using rule (6), that is, addition and multiplication. For example, we could let S be the set of powers of 10 up to some bound, and we could obtain the numbers in between with rather few additions and multiplications. Moreover, requiring S to be finite does not limit the use of counting terms of the form #ȳ.φ, which may evaluate to arbitrary integers based on the given database. Meanwhile, we could let S contain (large) constants that are specific to the field the data is coming from, which allows using them even if the bound s on the size of the terms in Δ is small.
Since S is finite and the size of terms in Δ is bounded by s, up to equivalence, the set Δ is finite. Thus, when given a σ-structure D with universe A and a training set T ⊆ A^k × ℤ, in order to find a hypothesis that is consistent with T, we could proceed as follows. Loop through all terms t in Δ. For each such term t, loop through all tuples w̄ ∈ A^ℓ, and check whether ⟦t(x̄, w̄)⟧^D(v̄) = λ for every (v̄, λ) ∈ T. If so, stop and output t and w̄, indicating that ⟦t(x̄, w̄)⟧^D is a hypothesis consistent with T. If no such hypothesis is found, then stop and output "reject". This obviously solves the learning problem. However, the time taken to loop through all the tuples w̄ ∈ A^ℓ is polynomial in |A|, and it may not suffice to start with the vertices given in the training set and repeatedly iterate over the neighbours of already found vertices. Note that Theorem 3.1 yields a much more efficient algorithm, which runs in time polynomial in the size of the training set and the degree of the structure. This can be substantially faster than being polynomial in the size of the structure. We achieve this by moving over to a larger signature σ′ and a suitable σ′-expansion D′ of D. This is done in such a way that afterwards, we only have to loop through those tuples w̄ that are close to the tuples that actually occur in the training set T. Meanwhile, the number of terms to check is not constant any more, but it depends on D. However, as we discuss in the proof of Theorem 3.1, this number is polynomial in the degree of D, which yields the desired running time. The exact details are provided by the following Lemma 3.2; this lemma is the main technical tool that allows us to prove Theorem 3.1.
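The naive procedure just described can be written down directly; the following Python sketch is a transcription of that loop, where `evaluate(term, v, w)` is an assumed black box returning the integer value of a term on the instance tuple v with parameters w. Its inner loop ranges over all of A^ℓ, which is exactly the cost that Theorem 3.1 avoids.

```python
from itertools import product

def brute_force_learn(terms, universe, ell, training_set, evaluate):
    """Naive consistency search over all candidate terms and ALL parameter tuples.
    `evaluate(term, v, w)` is assumed to compute the value of `term` on instance
    tuple `v` with parameter tuple `w` in the given database."""
    for term in terms:                               # finitely many, up to equivalence
        for w in product(universe, repeat=ell):      # |A|^ell candidates -- the bottleneck
            if all(evaluate(term, v, w) == label for v, label in training_set):
                return term, w                       # consistent hypothesis found
    return None                                      # "reject"
```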
For formulating the lemma, we need the following notation. In a structure D with universe A and of degree at most d, for every a ∈ A and every radius r ∈ ℕ, we have |B_r^D(a)| ≤ Σ_{i=0}^{r} d^i. In particular, for a fixed radius r, |B_r^D(a)| is polynomial in d. For all r ∈ ℕ, it is straightforward to construct an FO[σ]-formula dist_{≤r}(x, y) such that for every σ-structure D and all a, b ∈ A, we have D ⊨ dist_{≤r}[a, b] if and only if dist^D(a, b) ≤ r. To improve readability, we write dist(x, y) ≤ r instead of dist_{≤r}(x, y), and dist(x, y) > r instead of ¬dist_{≤r}(x, y).
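One way to justify the stated bound on ball sizes (a short derivation under the definitions above; the exact constants used in the paper may differ):

```latex
% In a structure of degree at most d, an element has at most d neighbours, so at
% most d^i elements lie at distance exactly i from a fixed element a.  Hence
\[
  \bigl|B_r^{D}(a)\bigr| \;\le\; \sum_{i=0}^{r} d^{\,i} \;\le\; (r+1)\cdot d^{\,r}
  \qquad\text{for } d \ge 1,
\]
% which is polynomial in d for every fixed radius r.
```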
Let r ∈ ℕ. An FOC₁[σ]-formula φ(x̄) with free variables x̄ = (x₁, …, x_k) is r-local (around x̄) if for every σ-structure D with universe A and every tuple v̄ ∈ A^k, we have D ⊨ φ[v̄] if and only if N_r^D(v̄) ⊨ φ[v̄]. A formula is local if it is r-local for some r ∈ ℕ. This notion of local formulas is identical to the one in [25]. It is very similar to the notion of local formulas by Gaifman [26], although we use a semantic notion instead of Gaifman's syntactic notion.
For every r ∈ ℕ, every graph G with vertex set [s] (for some s ∈ ℕ≥1), and every tuple ȳ = (y₁, …, y_s) of pairwise distinct variables, we consider the formula
ρ_G^r(ȳ) := ⋀_{{i,j} ∈ E(G)} dist(y_i, y_j) ≤ 2r+1 ∧ ⋀_{{i,j} ∉ E(G), i ≠ j} dist(y_i, y_j) > 2r+1.
Note that D ⊨ ρ_G^r[w̄] means that the connected components of the r-neighbourhood N_r^D(w̄) correspond to the connected components of G. Clearly, the formula ρ_G^r(ȳ) is local around its free variables ȳ.
For two sets with , for , and two tuples , we write if for all with . Note that this notion is not symmetric: for tuples with , the tuple may still have an entry for some while , so .
Our main technical ingredient for the proof of Theorem 3.1 is the following lemma. Note that in the final statement of the lemma, the graphs are connected and the formulas are local – both conditions are absolutely necessary for our proof of Theorem 3.1.
Lemma 3.2.
Let σ be a signature, let k ∈ ℕ≥1, let ℓ ∈ ℕ, let t(x̄, ȳ) be an FOC₁[σ]-term with |x̄| = k and |ȳ| = ℓ, and let S be the set of integers that occur in t.
There is an extension σ′ of σ with relation symbols of arity at most 1, and there are numbers r, s′ ∈ ℕ such that, for every σ-structure D of size n and degree at most d, we can compute a σ′-expansion D′ of D in time d^{O(1)} · n such that the following is true, where A denotes the universe of D. For all m ∈ ℕ≥1, for all tuples v̄₁, …, v̄_m ∈ A^k, and for all w̄ ∈ A^ℓ, there is an FOC₁[σ′]-term t′(x̄, ȳ) of size at most s′ such that for all w̄′ ∈ A^ℓ that agree with w̄ on all entries contained in B_r^D((v̄₁, …, v̄_m)), we have ⟦t′(x̄, w̄′)⟧^{D′}(v̄_j) = ⟦t(x̄, w̄)⟧^D(v̄_j) for all j ∈ [m].
Furthermore, t′ is a combination via addition and multiplication of integers in S, of integers whose absolute value is bounded by a fixed polynomial in d, and of counting terms of the form #ȳ′.(ψ ∧ ρ_G^r(ȳ′)), where ψ is an r-local formula in FOC₁[σ′] and G is a connected graph.
We present the proof of Lemma 3.2 in Section 4. Intuitively, the lemma says that we can translate a term into a term over an extended signature such that the new term only needs those parameters that are close to the examples. By using this lemma, we can prove Theorem 3.1 as follows.
Proof of Theorem 3.1.
Let be chosen according to the assumption of the theorem.
Note that, up to logical equivalence, there are only finitely many terms in Δ. Thus, w. l. o. g. we assume that Δ is finite and that all terms in Δ only use variables from a fixed finite set. For each term t in Δ, we apply Lemma 3.2 to obtain an extension σ_t of σ and numbers r_t, s_t ∈ ℕ. W. l. o. g. we assume that the sets σ_t \ σ are pairwise disjoint.
Let σ′ := σ ∪ ⋃_{t ∈ Δ} σ_t, let s′ := max_{t ∈ Δ} s_t, and let Δ′ be the set of all FOC₁[σ′]-terms of size at most s′ that only use variables from the fixed finite set. Furthermore, let r := max_{t ∈ Δ} r_t.
Upon input of a σ-structure D of size n and degree at most d, for each t ∈ Δ we use Lemma 3.2 to compute a σ_t-expansion D_t of D in time d^{O(1)} · n. We let D′ be the σ′-expansion of D obtained by combining all the structures D_t for t ∈ Δ.
In addition, we also compute a lookup table that stores the value ⟦τ⟧^{D′} for every ground term τ that occurs in a term in Δ′ and is of the form #ȳ′.(ψ ∧ ρ_G^{r′}(ȳ′)), where ψ is an r′-local formula in FOC₁[σ′] (for some r′ ≤ r) and G is a connected graph.
Claim 3.3.
The lookup table can be computed in time d^{O(1)} · n.
Proof.
First note that the number of entries in the lookup table is constant and does not depend on D, because the terms in Δ′ have size at most s′ and only use variables from a fixed set. Thus, it suffices to show that every single entry of the lookup table can be computed in time d^{O(1)} · n.
Let τ be a ground term that occurs in a term in Δ′ and is of the form #ȳ′.(ψ ∧ ρ_G^{r′}(ȳ′)), where ψ is an r′-local formula in FOC₁[σ′] (for some r′ ≤ r) and G is a connected graph. Then ⟦τ⟧^{D′} = |{ b̄ ∈ A^s : D′ ⊨ (ψ ∧ ρ_G^{r′})[b̄] }|, where s := |ȳ′|.
Since G is a connected graph and ρ_G^{r′} forces the entries of every satisfying tuple to be pairwise close, there is a radius r″ (depending only on r′ and s) such that every tuple b̄ = (b₁, …, b_s) with D′ ⊨ (ψ ∧ ρ_G^{r′})[b̄] satisfies b₂, …, b_s ∈ B_{r″}^{D′}(b₁). Hence, to compute ⟦τ⟧^{D′}, we can simply initialise the corresponding entry in the lookup table with 0, iterate over all a ∈ A and all b̄ ∈ {a} × (B_{r″}^{D′}(a))^{s−1}, and increase the entry in the lookup table by 1 if and only if D′ ⊨ (ψ ∧ ρ_G^{r′})[b̄] holds. For every fixed a ∈ A, we iterate over at most |B_{r″}^{D′}(a)|^{s−1} tuples b̄, which is polynomial in d. Moreover, since ψ and ρ_G^{r′} are local, for each such tuple b̄, it holds that D′ ⊨ (ψ ∧ ρ_G^{r′})[b̄] if and only if the corresponding bounded-radius neighbourhood of b̄ satisfies (ψ ∧ ρ_G^{r′})[b̄]. The latter can be checked by building the neighbourhood structure around b̄ in time polynomial in d, and then evaluating the formula (of constant size) on the neighbourhood structure by a brute-force algorithm in time polynomial in d.
All in all, we use n iterations of the outer loop, and every iteration takes time polynomial in d, so the overall running time is d^{O(1)} · n.
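The computation described in this proof can be sketched as follows in Python; `candidate_tuples`, `formula`, and `holds_locally` are hypothetical helpers standing in for the enumeration of tuples inside a bounded-radius ball and for the brute-force local model check, so the sketch only mirrors the counting strategy of Claim 3.3 under those assumptions.

```python
def precompute_lookup_table(ground_terms, universe, ball, holds_locally):
    """Sketch of the counting strategy in Claim 3.3 (helpers are assumed):
    every relevant ground term counts tuples whose entries, by connectivity,
    all lie in a bounded-radius ball around the first entry."""
    table = {}
    for term in ground_terms:                       # constantly many entries
        count = 0
        for a in universe:                          # n iterations of the outer loop
            for tup in term.candidate_tuples(a, ball):
                # `candidate_tuples` is assumed to enumerate the tuples whose
                # first entry is a and whose remaining entries come from a
                # bounded-radius ball around a (polynomially many in the degree).
                if holds_locally(term.formula, tup):
                    count += 1
        table[term] = count
    return table
```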
Now let us assume that we receive an arbitrary training set T ⊆ A^k × ℤ of size m := |T|. Let Δ″ be the set of all those terms t′ in Δ′ that satisfy the following: t′ is a combination via addition and multiplication of integers in S, of integers whose absolute value is bounded by a fixed polynomial in d, and of counting terms of the form #ȳ′.(ψ ∧ ρ_G^{r′}(ȳ′)), where ψ is an r′-local formula in FOC₁[σ′] (for some r′ ≤ r) and G is a connected graph.
Claim 3.4.
Δ″ is of size polynomial in d. Furthermore, for every t′ ∈ Δ″, upon input of v̄ ∈ A^k and w̄ ∈ A^ℓ, we can compute ⟦t′(x̄, w̄)⟧^{D′}(v̄) in time polynomial in d, with only local access to D′ and access to the precomputed lookup table.
Proof.
Since the terms in Δ″ have length at most s′, the variables come from a fixed finite set, the signature σ′ has constant size, and the integers that may occur in a term in Δ″ come from a set of size polynomial in d, the total number of terms in Δ″ is also polynomial in d.
For the evaluation of a term in Δ″, we first consider counting terms of the form #ȳ′.(ψ ∧ ρ_G^{r′}(ȳ′)), where ψ is an r′-local formula in FOC₁[σ′] (for some r′ ≤ r), G is a connected graph, and the counting term has at most one free variable.
In case that the counting term has no free variable, it is a ground term. Hence, we can simply use the precomputed lookup table to get its value in time O(1).
In case that the counting term has exactly one free variable, let a denote the element of A assigned to this variable (an entry of v̄ or of w̄). Since G is connected and ψ ∧ ρ_G^{r′} is local, every tuple that contributes to the count is contained in a ball of bounded radius around a, so the value of the counting term can be obtained by iterating only over tuples from this ball.
Furthermore, ψ and ρ_G^{r′} are local formulas, so their satisfaction can be checked within the corresponding bounded-radius neighbourhood. Since the size of this neighbourhood is polynomial in d, the evaluation can be performed by evaluating a formula of constant size on a structure of size polynomial in d for a number of assignments that is polynomial in d. The evaluation of the formula can be done by building the neighbourhood structure around a in time polynomial in d, and then evaluating the formula on the neighbourhood structure by a brute-force algorithm in time polynomial in d.
Now let t′ be an arbitrary term in Δ″. Then, for all v̄ ∈ A^k and w̄ ∈ A^ℓ, ⟦t′(x̄, w̄)⟧^{D′}(v̄) can be evaluated in time polynomial in d with only local access to D′ and access to the precomputed lookup table, by first evaluating every counting term in t′ as described above and then combining the results and the integers occurring in t′ via a constant number of additions and multiplications.
To find a hypothesis that is consistent with T, we loop through all terms t′ in Δ″. For each such term t′, we loop through all tuples w̄ ∈ B^ℓ for B := ⋃_{(v̄, λ) ∈ T} B_r^{D′}(v̄). We check if ⟦t′(x̄, w̄)⟧^{D′}(v̄) = λ for all (v̄, λ) ∈ T. If so, we stop and output t′ and w̄, indicating that ⟦t′(x̄, w̄)⟧^{D′} is a hypothesis consistent with T. Otherwise, we stop and output "reject".
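The learning phase just described can be summarised by the following Python sketch; `ball(a, r)` and `evaluate_local(term, v, w)` are assumed helpers for collecting a radius-r ball via local access and for the local evaluation of Claim 3.4, so this is only an outline of the loop structure, not the paper's exact procedure.

```python
from itertools import product

def learn_with_index(candidate_terms, training_set, ell, radius, ball, evaluate_local):
    """Outline of the learning phase: parameters are searched only among elements
    that are close to the training tuples (helpers `ball` and `evaluate_local`
    are assumed)."""
    # Local search space: union of the radius-r balls around all training entries.
    B = set()
    for v, _ in training_set:
        for a in v:
            B |= ball(a, radius)
    for term in candidate_terms:                     # polynomially many in the degree
        for w in product(B, repeat=ell):             # only "nearby" parameter tuples
            if all(evaluate_local(term, v, w) == label for v, label in training_set):
                return term, w                       # consistent hypothesis
    return None                                      # "reject"
```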
Claim 3.5.
This algorithm uses only local access to D′ and access to the precomputed lookup table, and it terminates within time polynomial in m and d. If it outputs a hypothesis, this hypothesis is consistent with T. If it outputs "reject", there is no term t ∈ Δ and tuple w̄ ∈ A^ℓ such that ⟦t(x̄, w̄)⟧^D is consistent with T.
Proof.
The set B contains at most m · k · max_{a ∈ A} |B_r^{D′}(a)| elements, which is polynomial in m and d. By Claim 3.4, Δ″ is of size polynomial in d. Thus, in total, we iterate over a number of hypotheses that is polynomial in m and d. For every hypothesis, consisting of a term t′ ∈ Δ″ and a tuple w̄ ∈ B^ℓ, by Claim 3.4, we can check in time polynomial in m and d whether ⟦t′(x̄, w̄)⟧^{D′}(v̄) = λ for all (v̄, λ) ∈ T. This check uses only local access to D′ and access to the precomputed lookup table. This proves the first statement of the claim. The second statement of the claim is obvious.
To prove the third statement of the claim, let us assume that there exists a term t ∈ Δ and a tuple w̄ ∈ A^ℓ such that ⟦t(x̄, w̄)⟧^D is consistent with T. For this particular choice of t and w̄, and for the tuples v̄₁, …, v̄_m given in T, Lemma 3.2 yields an FOC₁[σ_t]-term t′(x̄, ȳ) such that for all w̄′ ∈ A^ℓ that agree with w̄ on all entries contained in B_{r_t}^D((v̄₁, …, v̄_m)), we have
⟦t′(x̄, w̄′)⟧^{D′}(v̄_j) = ⟦t(x̄, w̄)⟧^D(v̄_j) for all j ∈ [m].
Note that B ≠ ∅. Hence, our algorithm will consider at least one w̄′ ∈ B^ℓ that agrees with w̄ on all entries contained in B_{r_t}^D((v̄₁, …, v̄_m)). Let us fix such a w̄′. All that remains to be done is to show that t′ ∈ Δ″ – this will imply that our algorithm will eventually consider t′ and w̄′ and then stop and output a term and a tuple, indicating a hypothesis that is consistent with T.
By our choice of Δ′ and r, we know that t′ ∈ Δ′ and r_t ≤ r. Thus, ⟦t′(x̄, w̄′)⟧^{D′}(v̄) = λ for all (v̄, λ) ∈ T. From Lemma 3.2, we know that t′ is a combination via addition and multiplication of integers in S, of integers whose absolute value is bounded by a fixed polynomial in d, and of counting terms of the form #ȳ′.(ψ ∧ ρ_G^{r′}(ȳ′)), where ψ is a local formula in FOC₁[σ′] and G is a connected graph. By our particular choice of Δ″ and since t′ ∈ Δ′, we obtain that t′ ∈ Δ″. This completes the proof of Claim 3.5.
In summary, the proof of Theorem 3.1 is complete.
4 Proof of Lemma 3.2
The proof of Lemma 3.2 heavily relies on the following localisation theorem for FOC₁. This theorem is implicit in [34]; here we present it in a way analogous to [10, Theorem 4.7].
Theorem 4.1 (Localisation Theorem for FOC₁, [34]).
Let k ∈ ℕ, and let σ be a signature. For every FOC₁[σ]-formula φ(x̄) with |x̄| = k, there is an extension σ′ of σ with relation symbols of arity at most 1, and an FOC₁[σ′]-formula φ′(x̄) that is a Boolean combination of local formulas and statements of the form R() where R ∈ σ′ has arity 0, for which the following holds. There is an algorithm that, upon input of a σ-structure D of size n and degree at most d, computes in time d^{O(1)} · n a σ′-expansion D′ of D such that for all v̄ ∈ A^k (where A denotes the universe of D), we have D ⊨ φ[v̄] if and only if D′ ⊨ φ′[v̄].
For the proof of Lemma 3.2, let t and S be chosen according to the assumption of the lemma. In particular, t(x̄, ȳ) is an FOC₁[σ]-term, and S is the set of integers that occur in t.
We first note that it suffices to prove the statement of the lemma for the particular case where is of the form for some -formula . Assume for now that the statement of the lemma holds for all terms of this form, and consider an arbitrary -term that is not of this form. Then, is a combination via addition and multiplication of integers from and of counting terms of the form for some -formula . Let be the set of all these counting terms occurring in . We assume that the statement of the lemma holds for each . Hence, for each , we obtain an extension of with relation symbols of arity and numbers . W.l.o.g. we assume that the sets are pairwise disjoint. Let , , and .
Upon input of a -structure of size and degree , for each , the statement of the lemma for enables us to compute a -expansion of in time . We let be the -expansion of obtained by combining all the structures for all counting terms . Let denote the universe of .
When given and , the statement of the lemma for yields an -term for each . We choose to be the term obtained from by replacing every occurrence of a counting term by the corresponding counting term . Clearly has length at most . It is not difficult to verify that has the desired properties stated in Lemma 3.2.
All that remains to be done is to prove the statement of the lemma for the particular case where is of the form for some -formula . Given such a term , let . Using Theorem 4.1, we obtain an extension of with relation symbols of arity and an -formula that is a Boolean combination of local formulas and statements of the form , where has arity , for which the following holds. Given a -structure of size and degree and with universe , we can compute a -expansion of in time such that for all , , and , we have . Thus, for , we have for all , .
From the particular shape of , we obtain that there exists a number such that is -local, i. e., for all , , , we have .
Let be the set of undirected graphs with vertex set . For every , consider the formula and the term . It is not difficult to verify that for all , , we have
To further decompose the terms , we use techniques similar to the ones used in [34] to decompose terms into so-called connected local terms. With these, we obtain the following technical lemma. The statement of this lemma as well as its proof are highly non-trivial. It depends on a careful analysis of the connected components of undirected graphs with nodes. Note that in the final statement of the lemma, the graphs are connected and the formulas are local – both conditions are also stated in Lemma 3.2 and are absolutely necessary for our proof of Theorem 3.1.
Lemma 4.2.
Let be a signature and let with . Let be an -local -formula with , , and . Let be the set of all undirected graphs with vertex set . Let and let
There is a number such that, for every -structure of degree and with universe , for all , for all , and for all , there exists an -term such that the following is true for .
For all with , we have
Furthermore, is a combination via addition and multiplication of integers with and of counting terms of the form , where , for a connected graph and an -local formula .
Before proving Lemma 4.2, we first finish the proof of Lemma 3.2. We apply Lemma 4.2 to the term for all . For each this yields a number . We choose and .
For any -structure , we let be the -expansion obtained as described above (by applying Theorem 4.1 to the formula ). When given tuples and , we obtain from Lemma 3.2 an -term , for every . This term is a combination via addition and multiplication of integers with and of counting terms of the form , where , for a connected graph and an -local formula . We choose
and let . Note that has length and that . It is not difficult to verify that the term has the desired properties stated in Lemma 3.2.
Proof of Lemma 4.2.
We prove the statement by induction on the number of connected components of .
Consider the induction base , i. e., the graph is connected. We choose to be the length of the term . For choosing the term , we distinguish between 4 cases.
- Case 1:
-
. In this case, we set .
- Case 2:
-
and . In this case, we set for the particular number . Since is connected, all elements in a tuple with have distance at most from . Therefore, .
- Case 3:
-
and . Since is connected, for all and all , it holds that Hence, for all we have . Therefore, we can choose .
- Case 4:
-
and . In this case, is the only tuple that satisfies . Hence, we can choose .
This completes the induction base.
We now turn to the induction step, where . We assume that the statement of the lemma holds for all graphs with fewer than connected components. Let be the set of all vertices of that are in the same connected component as the vertex . Let . For each , let be the induced subgraph of with vertex set . Clearly, is the disjoint union of the graphs and , and has only one connected component, and has connected components. For , let be the tuple of variables of such that their index is contained in , let be the tuple of variables of such that their index is contained in , and let be the tuple of variables of such that their index is contained in .
By using the Feferman-Vaught Theorem (cf., [24, 44, 10]), we obtain a decomposition of w.r.t. . I. e., is a finite non-empty set of pairs of -local -formulas of the form such that for all , , , we have there exists a pair such that and . Here, are defined analogously to for . Moreover, the pairs in are mutually exclusive, i. e., for all , , , there is at most one pair such that and .
Let be the set of graphs with vertex set and and . Note that such graphs have strictly fewer connected components than .
For every pair , we set
Since the pairs are mutually exclusive, for all , , we have Furthermore, it is not difficult to verify that
where are defined analogously to for .
Using the induction hypothesis, we apply the statement of the lemma to , , and to every summand of , and we call the resulting terms , , and .
We set and choose .
5 Conclusion
We have studied the complexity of learning aggregate queries from examples. For this, we have extended the framework for Boolean classification problems by Grohe and Turán [35] to multiclass classification problems with integer-valued classifiers. In our setting, such classifiers are represented by a pair (t, w̄), where t is a counting term in FOC₁ and w̄ is a tuple of elements in the universe of the given database D.
Our main result shows that we can build a suitable index structure on the given database D during a precomputation step whose running time is linear in the size and polynomial in the degree of D. Afterwards, by utilising this index structure, whenever receiving as input a new training set T, a classifier definable in FOC₁ can be learned in time polynomial in the degree of D and polynomial in the number of examples given in the training set. The classifiers returned by our algorithm can be evaluated efficiently, in time polynomial in the degree of D. Moreover, after having built the index structure, all our algorithms require only local access to the database.
It seems conceivable that our results can be generalised to the more expressive logic FOWA₁ that operates on weighted structures [10], since the locality results obtained for this logic in [10] are similar to the ones obtained for FOC₁ in [34] and used here.
For the Boolean classification problem based on FO, [9] shows that the learning problem is fixed-parameter tractable on nowhere dense classes. This result heavily relies on the fixed-parameter tractability of the evaluation problem for FO on nowhere dense classes [31]. From [34] it is known that on nowhere dense classes also the evaluation problem for formulas and terms of FOC₁ is fixed-parameter tractable. We therefore think that FOC₁ is a good candidate for a logic with a fixed-parameter tractable integer-valued classification problem on classes of sparse structures.
In this paper, the task in the learning problem is to find an integer-valued hypothesis that is consistent with the training set. Other publications on the considered framework for Boolean classification problems also study settings in which one wants to find a hypothesis that generalises well [33, 32, 29, 7, 10, 9, 8]. More specifically, they study probably approximately correct (PAC) learning tasks in which the training examples are considered as being generated from a (fixed, but unknown) probability distribution. The task is to find a hypothesis that has a small expected error on new examples generated from the same distribution. While PAC-learning algorithms for Boolean classification problems are typically based on the empirical risk minimisation (ERM) principle [48], results in algorithmic learning theory for multiclass classification [14, 21, 36] are based on different principles and even show that the ERM principle in general does not suffice for PAC learning with an unbounded number of classes [22]. Therefore, we expect it to be quite challenging to obtain PAC-learning results for our framework of multiclass classification problems.
We plan to work on the raised questions in future work.
References
- [1] Azza Abouzied, Dana Angluin, Christos H. Papadimitriou, Joseph M. Hellerstein, and Avi Silberschatz. Learning and verifying quantified Boolean queries by example. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, pages 49–60. ACM, 2013. doi:10.1145/2463664.2465220.
- [2] Howard Aizenstein, Tibor Hegedüs, Lisa Hellerstein, and Leonard Pitt. Complexity theoretic hardness results for query learning. Comput. Complex., 7(1):19–53, 1998. doi:10.1007/PL00001593.
- [3] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, 2011. doi:10.1145/2043652.2043656.
- [4] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1987. doi:10.1007/BF00116828.
- [5] Pablo Barceló, Alexander Baumgartner, Victor Dalmau, and Benny Kimelfeld. Regularizing conjunctive features for classification. J. Comput. Syst. Sci., 119:97–124, 2021. doi:10.1016/j.jcss.2021.01.003.
- [6] Pablo Barceló and Miguel Romero. The complexity of reverse engineering problems for conjunctive queries. In 20th International Conference on Database Theory, ICDT 2017, volume 68 of LIPIcs, pages 7:1–7:17, 2017. doi:10.4230/LIPIcs.ICDT.2017.7.
- [7] Steffen van Bergerem. Learning concepts definable in first-order logic with counting. In 34th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2019, pages 1–13. IEEE, 2019. doi:10.1109/LICS.2019.8785811.
- [8] Steffen van Bergerem. Descriptive Complexity of Learning. PhD thesis, RWTH Aachen University, Germany, 2023. doi:10.18154/RWTH-2023-02554.
- [9] Steffen van Bergerem, Martin Grohe, and Martin Ritzert. On the parameterized complexity of learning first-order logic. In PODS 2022: International Conference on Management of Data, pages 337–346. ACM, 2022. doi:10.1145/3517804.3524151.
- [10] Steffen van Bergerem and Nicole Schweikardt. Learning concepts described by weight aggregation logic. In 29th EACSL Annual Conference on Computer Science Logic, CSL 2021, Ljubljana, Slovenia (Virtual Conference), January 25-28, 2021, volume 183 of LIPIcs, pages 10:1–10:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.CSL.2021.10.
- [11] Angela Bonifati, Radu Ciucanu, and Aurélien Lemay. Learning path queries on graph databases. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, pages 109–120. OpenProceedings.org, 2015. doi:10.5441/002/edbt.2015.11.
- [12] Angela Bonifati, Radu Ciucanu, and Slawek Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016. doi:10.1145/2818637.
- [13] Angela Bonifati, Ugo Comignani, Emmanuel Coquery, and Romuald Thion. Interactive mapping specification with exemplar tuples. ACM Trans. Database Syst., 44(3):10:1–10:44, 2019. doi:10.1145/3321485.
- [14] Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, pages 943–955. IEEE, 2022. doi:10.1109/FOCS54457.2022.00093.
- [15] Balder ten Cate and Victor Dalmau. Conjunctive queries: Unique characterizations and exact learnability. In 24th International Conference on Database Theory, ICDT 2021, volume 186 of LIPIcs, pages 9:1–9:24, 2021. doi:10.4230/LIPIcs.ICDT.2021.9.
- [16] Balder ten Cate, Victor Dalmau, and Phokion G. Kolaitis. Learning schema mappings. ACM Trans. Database Syst., 38(4):28:1–28:31, 2013. doi:10.1145/2539032.2539035.
- [17] Balder ten Cate, Phokion G. Kolaitis, Kun Qian, and Wang-Chiew Tan. Active learning of GAV schema mappings. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2018, pages 355–368. ACM, 2018. doi:10.1145/3196959.3196974.
- [18] Adrien Champion, Tomoya Chiba, Naoki Kobayashi, and Ryosuke Sato. ICE-based refinement type discovery for higher-order functional programs. J. Autom. Reason., 64(7):1393–1418, 2020. doi:10.1007/s10817-020-09571-y.
- [19] William W. Cohen and C. David Page Jr. Polynomial learnability and inductive logic programming: Methods and results. New Gener. Comput., 13(3&4):369–409, 1995. doi:10.1007/BF03037231.
- [20] Andrew Cropper, Sebastijan Dumancic, Richard Evans, and Stephen H. Muggleton. Inductive logic programming at 30. Mach. Learn., 111(1):147–172, 2022. doi:10.1007/s10994-021-06089-1.
- [21] Amit Daniely, Sivan Sabato, and Shai Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, pages 494–502, 2012. URL: https://proceedings.neurips.cc/paper/2012/hash/19f3cd308f1455b3fa09a282e0d496f4-Abstract.html.
- [22] Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Maria-Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 287–316. JMLR.org, 2014. URL: http://proceedings.mlr.press/v35/daniely14b.html.
- [23] P. Ezudheen, Daniel Neider, Deepak D'Souza, Pranav Garg, and P. Madhusudan. Horn-ICE learning for synthesizing invariants and contracts. Proc. ACM Program. Lang., 2(OOPSLA):131:1–131:25, 2018. doi:10.1145/3276501.
- [24] Solomon Feferman and Robert L. Vaught. The first-order properties of products of algebraic systems. Fundamenta Mathematicae, 47:57–103, 1959.
- [25] Jörg Flum and Martin Grohe. Parameterized Complexity Theory. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2006. doi:10.1007/3-540-29953-X.
- [26] Haim Gaifman. On local and non-local properties. In Jacques Stern, editor, Proceedings of the Herbrand Symposium, volume 107 of Studies in Logic and the Foundations of Mathematics, pages 105–135. North-Holland, 1982. doi:10.1016/S0049-237X(08)71879-2.
- [27] Pranav Garg, Christof Löding, P. Madhusudan, and Daniel Neider. ICE: A robust framework for learning invariants. In Computer Aided Verification - 26th International Conference, CAV 2014, volume 8559 of Lecture Notes in Computer Science, pages 69–87. Springer, 2014. doi:10.1007/978-3-319-08867-9_5.
- [28] Georg Gottlob and Pierre Senellart. Schema mapping discovery from data instances. J. ACM, 57(2):6:1–6:37, 2010. doi:10.1145/1667053.1667055.
- [29] Emilie Grienenberger and Martin Ritzert. Learning definable hypotheses on trees. In 22nd International Conference on Database Theory, ICDT 2019, pages 24:1–24:18, 2019. doi:10.4230/LIPIcs.ICDT.2019.24.
- [30] Martin Grohe. Logic, graphs, and algorithms. In Jörg Flum, Erich Grädel, and Thomas Wilke, editors, Logic and Automata: History and Perspectives [in Honor of Wolfgang Thomas], volume 2 of Texts in Logic and Games, pages 357–422. Amsterdam University Press, 2008.
- [31] Martin Grohe, Stephan Kreutzer, and Sebastian Siebertz. Deciding first-order properties of nowhere dense graphs. J. ACM, 64(3):17:1–17:32, 2017. doi:10.1145/3051095.
- [32] Martin Grohe, Christof Löding, and Martin Ritzert. Learning MSO-definable hypotheses on strings. In International Conference on Algorithmic Learning Theory, ALT 2017, pages 434–451, 2017. URL: http://proceedings.mlr.press/v76/grohe17a.html.
- [33] Martin Grohe and Martin Ritzert. Learning first-order definable concepts over structures of small degree. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2017, pages 1–12, 2017. doi:10.1109/LICS.2017.8005080.
- [34] Martin Grohe and Nicole Schweikardt. First-order query evaluation with cardinality conditions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2018, pages 253–266, 2018. doi:10.1145/3196959.3196970.
- [35] Martin Grohe and György Turán. Learnability and definability in trees and similar structures. Theory Comput. Syst., 37(1):193–220, 2004. doi:10.1007/s00224-003-1112-8.
- [36] Steve Hanneke, Shay Moran, and Qian Zhang. Universal rates for multiclass learning. In The 36th Annual Conference on Learning Theory, COLT 2023, volume 195 of Proceedings of Machine Learning Research, pages 5615–5681. PMLR, 2023. URL: https://proceedings.mlr.press/v195/hanneke23a.html.
- [37] David Haussler. Learning conjunctive concepts in structural domains. Mach. Learn., 4:7–40, 1989. doi:10.1007/BF00114802.
- [38] Kouichi Hirata. On the hardness of learning acyclic conjunctive queries. In Algorithmic Learning Theory, 11th International Conference, ALT 2000, volume 1968 of Lecture Notes in Computer Science, pages 238–251. Springer, 2000. doi:10.1007/3-540-40992-0_18.
- [39] Jörg-Uwe Kietz and Saso Dzeroski. Inductive logic programming and learnability. SIGART Bull., 5(1):22–32, 1994. doi:10.1145/181668.181674.
- [40] Benny Kimelfeld and Christopher Ré. A relational framework for classifier engineering. ACM Trans. Database Syst., 43(3):11:1–11:36, 2018. doi:10.1145/3268931.
- [41] Stephan Kreutzer. Algorithmic meta-theorems. In Javier Esparza, Christian Michaux, and Charles Steinhorn, editors, Finite and Algorithmic Model Theory, volume 379 of London Mathematical Society Lecture Note Series, pages 177–270. Cambridge University Press, 2011.
- [42] Dietrich Kuske and Nicole Schweikardt. First-order logic with counting. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2017, pages 1–12. IEEE Computer Society, 2017. doi:10.1109/LICS.2017.8005133.
- [43] Christof Löding, P. Madhusudan, and Daniel Neider. Abstract learning frameworks for synthesis. In Tools and Algorithms for the Construction and Analysis of Systems – 22nd International Conference, TACAS 2016, volume 9636 of Lecture Notes in Computer Science, pages 167–185. Springer, 2016. doi:10.1007/978-3-662-49674-9_10.
- [44] Johann A. Makowsky. Algorithmic uses of the Feferman-Vaught theorem. Ann. Pure Appl. Log., 126(1-3):159–213, 2004. doi:10.1016/j.apal.2003.11.002.
- [45] Denis Mayr Lima Martins. Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities. Inf. Syst., 83:89–100, 2019. doi:10.1016/j.is.2019.03.002.
- [46] Stephen Muggleton. Inductive logic programming. New Gener. Comput., 8(4):295–318, 1991. doi:10.1007/BF03037089.
- [47] Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. J. Log. Program., 19/20:629–679, 1994. doi:10.1016/0743-1066(94)90035-3.
- [48] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014. doi:10.1017/CBO9781107298019.
- [49] Robert H. Sloan, Balázs Szörényi, and György Turán. Learning Boolean functions with queries. In Yves Crama and Peter L. Hammer, editors, Boolean Models and Methods in Mathematics, Computer Science, and Engineering, pages 221–256. Cambridge University Press, 2010. doi:10.1017/cbo9780511780448.010.
- [50] Slawek Staworko and Piotr Wieczorek. Learning twig and path queries. In 15th International Conference on Database Theory, ICDT 2012, pages 140–154. ACM, 2012. doi:10.1145/2274576.2274592.
- [51] Wei Chit Tan, Meihui Zhang, Hazem Elmeleegy, and Divesh Srivastava. Reverse engineering aggregation queries. Proc. VLDB Endow., 10(11):1394–1405, 2017. doi:10.14778/3137628.3137648.
- [52] Wei Chit Tan, Meihui Zhang, Hazem Elmeleegy, and Divesh Srivastava. REGAL+: reverse engineering SPJA queries. Proc. VLDB Endow., 11(12):1982–1985, 2018. doi:10.14778/3229863.3236240.
- [53] Quoc Trung Tran, Chee Yong Chan, and Srinivasan Parthasarathy. Query reverse engineering. VLDB J., 23(5):721–746, 2014. doi:10.1007/s00778-013-0349-3.
- [54] Chenglong Wang, Alvin Cheung, and Rastislav Bodik. Synthesizing highly expressive SQL queries from input-output examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, pages 452–466. ACM, 2017. doi:10.1145/3062341.3062365.
- [55] Yaacov Y. Weiss and Sara Cohen. Reverse engineering SPJ-queries from examples. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, pages 151–166. ACM, 2017. doi:10.1145/3034786.3056112.
- [56] He Zhu, Stephen Magill, and Suresh Jagannathan. A data-driven CHC solver. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 707–721. ACM, 2018. doi:10.1145/3192366.3192416.