Abstract 1 Introduction 2 Preliminaries 3 Learning FOC1-Definable Aggregate Queries 4 Proof of Lemma 3.2 5 Conclusion References

Learning Aggregate Queries Defined by
First-Order Logic with Counting

Steffen van Bergerem ORCID Humboldt-Universität zu Berlin, Germany Nicole Schweikardt ORCID Humboldt-Universität zu Berlin, Germany
Abstract

In the logical framework introduced by Grohe and Turán (TOCS 2004) for Boolean classification problems, the instances to classify are tuples from a logical structure, and Boolean classifiers are described by parametric models based on logical formulas. This is a specific scenario for supervised passive learning, where classifiers should be learned based on labelled examples. Existing results in this scenario focus on Boolean classification. This paper presents learnability results beyond Boolean classification. We focus on multiclass classification problems where the task is to assign input tuples to arbitrary integers. To represent such integer-valued classifiers, we use aggregate queries specified by an extension of first-order logic with counting terms called FOC1.

Our main result shows the following: given a database of polylogarithmic degree, within quasi-linear time, we can build an index structure that makes it possible to learn FOC1-definable integer-valued classifiers in time polylogarithmic in the size of the database and polynomial in the number of training examples.

Keywords and phrases:
Supervised learning, multiclass classification problems, counting logic
Copyright and License:
[Uncaptioned image] © Steffen van Bergerem and Nicole Schweikardt; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Database theory
; Theory of computation Logic ; Theory of computation Finite Model Theory ; Computing methodologies Logical and relational learning ; Computing methodologies Supervised learning
Funding:
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 541000908 (gefördert durch die Deutsche Forschungsgemeinschaft (DFG) – Projektnummer 541000908).
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

We study the complexity of learning aggregate queries from examples. This is a classification problem of the following form. The elements that are to be classified come from a set X, the instance space. For a given set V, a V-valued classifier on X is a function c:XV. We are given a training set S of labelled examples (x,λ)X×V, i. e., λ is the label assigned to the instance x. The goal is to find a classifier, called a hypothesis, that can be used to predict the label of elements from X, including those not given in S. The term Boolean classification problem refers to the case where |V|=2 (often, V is {1,0}). We use the term multiclass classification problem to refer to cases where V may be arbitrarily large. In machine learning, these problems fall into the category of supervised learning tasks: we want to learn a function from given input-output pairs. In contrast to this, in unsupervised learning (e. g. clustering), the goal is to learn patterns from unlabelled data [48].

We focus on learning problems related to the framework introduced by Grohe and Turán [35]. There, the instance space X is a set of tuples from a logical structure (that is sometimes called the background structure), and the classifiers are Boolean and are described using parametric models based on logical formulas. In this paper, we extend the framework to multiclass classification problems where the classifiers are integer-valued, i. e., V=. In the framework that we consider, the background structure is a relational database 𝒜, and the instance space X is the set Ak of all k-tuples of elements from the active domain A of 𝒜 (also called the universe of 𝒜). Here, k is a fixed positive integer. One fixes a parameter length (a fixed non-negative integer). A classifier is specified by a pair p=(t,w¯), where w¯=(w1,,w) is an -tuple of elements in A, and t is a counting term in the first-order logic with counting FOC1 [34] with free variables x1,,xk,y1,,y. This pair p represents the classifier cp:X that assigns to each k-tuple a¯=(a1,,ak)X the integer i that is obtained by evaluating the counting term t in the database 𝒜 while interpreting the variables x1,,xk with the elements a1,,ak and the variables y1,,y with the “parameters” w1,,w. We will write ht,w¯𝒜 to denote this classifier cp.

Given a training set SAk×, we want to find a pair p=(t,w¯) such that the classifier ht,w¯𝒜 is consistent with S, i. e., it satisfies ht,w¯𝒜(a¯)=i for every (a¯,i)S.

Example 1.1.

Let 𝒜 be a relational database where the active domain A contains authors and publications, the binary relation Author contains all pairs (a,p) where a is an author of the publication p, and the binary relation Citation contains all pairs (p1,p2) where the publication p1 cites the publication p2.

Suppose we are given a training set S that consists of a few pairs (a,i) where a is an author and i is the total number of citations of the publications of a. A reasonable classifier for this setting would be a mapping c:A that assigns to every author a present in the database the total number i of citations of their publications. In our setting, this can be represented as follows. We let k=1 and =0. Since =0, the “parameter” w is fixed to be the empty tuple (). Since k=1, we use a counting term with a single free variable x (that will be assigned with authors present in the database). We choose the counting term

t(x)#(z1,z2).(Author(x,z1)Citation(z2,z1)).

Evaluating t(x) for an author x yields the number of tuples (z1,z2) such that x is an author of publication z1, and z2 is a publication that cites z1. This is precisely the total number of citations of publications authored by x. Hence, ht,()𝒜 is the desired classifier c.

Example 1.2.

Suppose we have a database that maintains a list of all cakes colleagues brought to work. We model this as a relational database 𝒜 whose active domain A contains persons, IDs of cakes, and types of cake. The binary relation Brought contains all pairs (p,c) where p is a person that brought the cake with ID c, and the binary relation Type contains all pairs (c,τ) where c is the ID of a cake of type τ (e. g., “chocolate cake”, “strawberry cake”, “carrot cake”, etc). Suppose we want to find a classifier that predicts the popularity of colleagues. For this, via a survey, we gather examples (p,i)A× where p is a person and i is the popularity of the person, and we call the resulting set of labelled examples S. We choose k==1, so we want to find a classifier that uses a single parameter. According to our own experience at work, it seems conceivable that the following classifier ht,w𝒜 is consistent with S: the parameter w is “chocolate cake” and t is the counting term

t(x,y)#(z).(Brought(x,z)¬Type(z,y))+ 2#(z).(Brought(x,z)Type(z,y)).

Note that t counts the number of cakes brought by person x, where cakes of type y are counted twice, and the variable y will always be assigned the value of the parameter w.

In many application scenarios, the same database is used multiple times with different training sets to learn different classifiers. Thus, we consider a setting in which we are first only given the database, without any training examples. In a precomputation step, we allow gathering information that will be helpful for solving future learning tasks. This precomputation step can be viewed as building an index structure that is designed in order to support solving multiple learning tasks.

In the actual learning phase, we are repeatedly given training sets of labelled examples, and our task is to output a hypothesis that is consistent with the corresponding training set. For this learning phase, it would be desirable to have algorithms that run efficiently even if the database is too large to fit into the main memory. To achieve this, we are interested in algorithms that require only local access to the database, i. e., instead of having random access to the database, a learning algorithm should initially start with the elements given in the training set; subsequently, it may only retrieve the neighbours of elements it already holds in memory. By utilising the memory hierarchy, such local access can be achieved efficiently even in cases where random access is too prohibitive. In the context of learning (concerning Boolean classification problems), this local-access model has been introduced by Grohe and Ritzert [33].

Our contribution

Our main result is an algorithm that builds the index structure in time linear in the size and polynomial in the degree of the database. Afterwards, upon input of concrete training sets, classifiers definable in FOC1 can be learned in time polynomial in the degree of the database and polynomial in the number of examples given in the training set. Moreover, the classifiers returned by our algorithm can be evaluated in time polynomial in the degree of the database. Furthermore, our algorithms for finding a classifier and for evaluating this classifier do not require random access to the database but only rely on the local-access model.

For databases of polylogarithmic degree (i. e., of degree up to (logn)c where c is a constant and n is the size of the database), our main result implies that the index structure can be built in quasi-linear time (i. e., time n(logn)c); afterwards, FOC1-definable integer-valued classifiers can be learned in time polylogarithmic (so, in particular, sublinear) in the size of the database and polynomial in the number of training examples.

Previous results in the framework of Grohe and Turán for Boolean classification problems relied on the fact that it suffices to check a constant number of queries while limiting the search space for the parameters to a neighbourhood of a certain radius [33, 7, 10]. For our setting of multiclass classification with aggregate queries, however, this does not hold any more. Hence, a priori, it is not clear that sublinear-time learning algorithms are possible for the multiclass case at all. The main technical challenge towards our learnability result was to find an approach that keeps the number of queries to check small (i. e., polynomial in the degree of the database), while still being able to limit the search space for the parameters to a small neighbourhood around the given training tuples.

Organisation

We provide the necessary definitions concerning FOC1 in Section 2, and we formally introduce the learning problem that we consider in Section 3. The precise statement of our main result is given in Theorem 3.1. Our proof makes heavy use of the locality properties of the logic FOC1 shown in [34], including a decomposition of FOC1-formulas into local formulas. These properties are used in Section 4 to provide our main technical tool for the proof of the main result. Section 5 concludes the paper with a summary and an outlook on future work. In the remainder of this introduction, we give an overview of related work.

Related work

The first-order logic with counting FOC was introduced in [42] and further studied in [34, 7]. This logic extends first-order logic (FO) by the ability to formulate counting terms that evaluate to integers, and by numerical predicates that allow to compare results of counting terms. It was shown in [42] that the model-checking problem for FOC is fixed-parameter tractable on classes of structures of bounded degree. From [34] it is known that the fixed-parameter tractability of FOC cannot be generalised to even very simple classes of structures of unbounded degree such as unranked trees (under a reasonable assumption in parameterised complexity). However, [34] identified a fragment called FOC1 for which model-checking of formulas and evaluation of counting terms are fixed-parameter tractable on all nowhere dense classes of structures. The present paper uses counting terms of FOC1 to represent integer-valued classifiers.

The learning framework we consider has been introduced for Boolean classification problems in [35], which provides information-theoretic learnability results for classes of classifiers that can be specified using FO- and MSO-formulas on restricted classes of structures, such as the class of planar graphs or classes of graphs of bounded degree. Algorithmic aspects of the framework, including the running time of a learning algorithm, were first studied in [33]. The paper showed that Boolean classifiers definable in FO can be learned in sublinear time on structures of polylogarithmic degree. Analogous results have been obtained for MSO on strings [32] and on trees [29], which included a precomputation step to allow for efficient repeated learning. The paper [9] studied the parameterised complexity of the Boolean classification problem and showed that on arbitrary relational structures, learning hypotheses definable in FO is hard for the parameterised complexity class AW[] (i. e., subject to a plausible complexity-theoretic assumption, it is not fixed-parameter tractable). The paper also showed that the problem is fixed-parameter tractable if the structures come from a nowhere dense class.

For Boolean classifiers definable in the extension FOCN of FO with counting quantifiers and numerical predicates, [7] obtained a sublinear-time learning algorithm for structures of bounded degree, i. e., classes of structures where the degree is bounded by a constant. Recently, [8] lifted this result to structures of tiny degree, i. e., classes of structures of degree up to (loglogn)c for some constant c, where n is the size of the structure. The paper [10] considered a notion of weighted structures, which extend ordinary relational structures by assigning weights, i. e. elements from particular rings or abelian groups, to tuples present in the structure. It introduced the expressive logic FOWA, which extends FO by means of aggregating weights and formulating both formulas (that evaluate to “true” or “false”) and terms (that “aggregate” weights and evaluate to values in the associated ring or abelian group). For the fragment FOWA1 (that still extends FO), the paper showed that Boolean classifiers definable by FOWA1-formulas over weighted background structures of polylogarithmic degree can be learned in sublinear time after quasi-linear-time preprocessing. This lifts the results obtained in [33] for FO to the substantially more expressive logic FOWA1. As the logic FOC1 can be embedded in FOWA1, it follows from [10] that Boolean classifiers definable by FOC1-formulas over background structures of polylogarithmic degree can be learned in sublinear time after quasi-linear-time preprocessing. The main result of the present paper can be viewed as a generalisation of this to integer-valued classification problems.

The algorithmic results obtained so far within the framework introduced in [35] all focus on Boolean classification problems. However, many application scenarios require multiclass classification (cf. [21, 14, 36]). In the database systems literature, multiclass classifiers typically are described by aggregate queries [51, 52, 53, 54, 45]. In this paper, aggregate queries are represented by the counting terms of FOC1.

Closely related to the framework we consider is the framework of inductive logic programming (ILP) [46, 47, 39, 19, 20]. Both frameworks deal with a passive supervised learning setting, where the learning algorithms are given labelled examples. These examples are labelled according to some target concept, and the algorithms should return a hypothesis that approximately matches this target concept. One of the main differences between both frameworks is that we represent the background knowledge by a relational database, whereas in ILP, it is represented in a background theory, i. e., a set of formulas. Related logical learning frameworks have also been studied in formal verification [27, 43, 23, 56, 18].

In the database literature, various approaches to learning queries from examples have been studied, both in passive (such as ours) and active learning settings. In passive learning settings, results often focus on conjunctive queries [37, 38, 6, 40, 5, 55] or consider queries outside the relational database model [50, 11], while we focus on FOC1, an extension of full first-order logic. In the active learning setting introduced by Angluin [4], learning algorithms are allowed to actively query an oracle. Results in this setting [2, 49, 1, 11, 12, 15] again consider various types of queries. Another related subject in the database literature is the problem of learning schema mappings from examples [13, 28, 3, 16, 17].

2 Preliminaries

This section fixes the basic notation used throughout the paper, and it provides the precise syntax and semantics of first-order logic with counting FOC1 (the latter is taken almost verbatim from [34]). We let , , and 1 denote the sets of integers, non-negative integers, and positive integers, respectively. For m,n, we let [m,n]{:mn} and [n][1,n]. For a k-tuple x¯=(x1,,xk) we write |x¯| to denote its arity k. By () we denote the empty tuple, i. e., the tuple of arity 0.

Signatures and structures

A signature is a finite set of relation symbols. Every relation symbol R has a fixed arity ar(R). Let σ be a signature. A σ-structure (or σ-database) 𝒜 consists of a finite set A, called the universe of 𝒜, and a relation R(𝒜)Aar(R) for every Rσ. Note that signatures may contain relation symbols of arity 0. There are only two 0-ary relations over a set A, namely and {()}, which we interpret as “false” and “true”, respectively. We define the size |𝒜| of a σ-structure 𝒜 as |𝒜||A|.

Let σ be a signature with σσ. A σ-expansion of a σ-structure 𝒜 is a σ-structure 𝒜 which has the same universe as 𝒜 and which satisfies R(𝒜)=R(𝒜) for all Rσ.

A substructure of a σ-structure 𝒜 is a σ-structure with universe BA that satisfies R()R(𝒜) for every Rσ. For a set XA, the induced substructure of 𝒜 on X is the σ-structure 𝒜[X] with universe X, where R(𝒜[X])=R(𝒜)Xar(R) for every Rσ.

Gaifman graph, degree, distances, and neighbourhoods

In this paper, when speaking of graphs, we mean undirected simple graphs.

The Gaifman graph G𝒜 of a σ-structure 𝒜 with universe A is the graph with vertex set V(G𝒜)=A whose edge set E(G𝒜) contains exactly those edges {v,w} where v,wA, vw, and there exists an Rσ and a tuple (a1,,aar(R))R(𝒜) such that v,w{a1,,aar(R)}. The degree of a σ-structure 𝒜 is defined as the degree of the Gaifman graph G𝒜 (i. e., the maximum number of neighbours of a vertex of G𝒜).

The distance dist𝒜(a,b) between two elements a,bA is the minimal number of edges of a path from a to b in G𝒜; if no such path exists, we let dist𝒜(a,b). For a tuple a¯=(a1,,ak)Ak and an element bA, we let dist𝒜(a¯,b)mini[k]dist𝒜(ai,b).

Consider a k-tuple a¯=(a1,,ak)Ak for some k1. For every r0, the ball of radius r (or r-ball) of a¯ in 𝒜 is the set Nr𝒜(a¯){bA:dist𝒜(a¯,b)r}. The neighbourhood of radius r (or r-neighbourhood) of a¯ in 𝒜 is the induced substructure 𝒩r𝒜(a¯)𝒜[Nr𝒜(a¯)].

First-order logic with counting FOC𝟏

Let vars be a fixed, countably infinite set of variables. A σ-interpretation =(𝒜,β) consists of a σ-structure 𝒜 and an assignment β:varsA, where A denotes the universe of 𝒜. For k and k distinct variables x1,,xkvars and elements a1,,akA, we write a1,,akx1,,xk for the interpretation (𝒜,βa1,,akx1,,xk), where βa1,,akx1,,xk is the assignment β with β(xi)=ai for every i[k] and β(z)=β(z) for all zvars{x1,,xk}.

Next, we provide the syntax and semantics of the logic FOC1 [34]. This logic allows formulating numerical statements based on counting terms and numerical predicates.

For the remainder of this paper, fix a triple (,ar,.), called a numerical predicate collection, where is a finite set of predicate names, ar assigns an arity ar(𝖯)1 to each 𝖯, and . assigns a semantics 𝖯ar(𝖯) to each 𝖯. Examples of numerical predicates are the equality predicate 𝖯= with 𝖯=={(i,i):i}, the comparison predicate 𝖯 with 𝖯={(i,j):i,j,ij}, or the prime number predicate 𝖯prime with 𝖯prime={i:i is a prime number}. When analysing the running time of algorithms, we will assume that machines have access to oracles for evaluating the numerical predicates. That is, when given a 𝖯 and a tuple (i1,,iar(𝖯)) of integers, the oracle takes time O(1) to answer if (i1,,iar(𝖯)𝖯.

Let σ be a signature. The set of formulas and counting terms (for short: terms) of FOC1[σ] is built according to the following rules.

  1. (1)

    x1=x2 and R(x1,,xar(R)) are formulas,111in particular, if ar(R)=0 then R() is a formula where Rσ, and x1,x2,,xar(R) are variables. We let free(x1=x2){x1,x2} and free(R(x1,,xar(R))){x1,,xar(R)}.

  2. (2)

    If φ and ψ are formulas, then ¬φ and (φψ) are also formulas. We let free(¬φ)free(φ) and free((φψ))free(φ)free(ψ).

  3. (3)

    If φ is a formula and xvars, then xφ is a formula. We let free(xφ)free(φ){x}.

  4. (4)

    If φ is a formula and x¯=(x1,,xk) is a tuple of k pairwise distinct variables, for k0, then #x¯.φ is a counting term. We let free(#(x1,,xk).φ)free(φ){x1,,xk}.

  5. (5)

    Every integer i is a counting term. We let free(i)=.

  6. (6)

    If t1 and t2 are counting terms, then (t1+t2) and (t1t2) are also counting terms.

    We let free((t1+t2))free((t1t2))free(t1)free(t2).

  7. (7)

    If 𝖯, m=ar(𝖯), and t1,,tm are counting terms with |i=1mfree(ti)|1, then 𝖯(t1,,tm) is a formula. We let free(𝖯(t1,,tm))i=1mfree(ti).

First-order logic FO[σ] is the fragment of FOC1[σ] built by using only the rules (1)–(3). We write (φψ) and xφ as shorthands for ¬(¬φ¬ψ) and ¬x¬φ. For counting terms t1 and t2, we write (t1t2) as a shorthand for (t1+((1)t2)).

By FOC1, we denote the union of all FOC1[σ] for arbitrary signatures σ. This applies analogously to FO. For FOC1 the semantics for the rules (1)–(3) are defined in the same way as for FO; the semantics of the remaining rules are as follows. Let =(𝒜,β) be a σ-interpretation, and let A denote the universe of 𝒜.

  1. (4)

    #x¯.φ=|{(a1,,ak)Ak:φa1,,akx1,,xk=1}|, where x¯=(x1,,xk).

  2. (5)

    i=i, for i.

  3. (6)

    (t1+t2)=t1+t2 and (t1t2)=t1t2.

  4. (7)

    𝖯(t1,,tm)=1 if (t1,,tmI)𝖯, and 𝖯(t1,,tm)=0 otherwise.

Examples of counting terms in FOC1 and their intuitive meaning can be found in Examples 1.1 and 1.2.

An expression is a formula or a counting term. For an expression ξ, we write ξ(z1,,zk) to indicate that free(ξ){z1,,zk}. A sentence is a formula without free variables. A ground term is a counting term without free variables.

For a formula φ and a σ-interpretation , we write φ to indicate that φ=1. Likewise, ⊧̸φ indicates that φ=0. For a formula φ(x1,,xk), a σ-structure 𝒜, and a tuple a¯=(a1,,ak)Ak, we write 𝒜φ[a¯] to indicate that (𝒜,β)φ for all assignments β with β(xi)=ai for all i[k]. Similarly, for a counting term t(x1,,xk), we write t𝒜[a¯] for the integer t. In case that φ is a sentence and t is a ground term, we shortly write 𝒜φ instead of 𝒜φ[()], and we write t𝒜 instead of t𝒜[()].

The size |ξ| of an FOC1-expression ξ is defined as the length of the string representation of ξ, where integers and variables are considered as having length 1. For m,q, we write FOC1[σ,m,q] to denote the set of all FOC1[σ]-expressions of size at most q and with at most m free variables.

Hypotheses

Let 𝒜 be a σ-structure with universe A, let t(x¯,y¯) be an FOC1[σ]-term, let k|x¯|1, let |y¯|0, and let w¯A. The hypothesis represented in 𝒜 by the counting term t(x¯,y¯) and the parameter w¯ is defined as the mapping f:Ak such that f(a¯)=t𝒜[a¯,w¯] for every a¯Ak. That is, evaluating the counting term t in 𝒜 while assigning the variables x¯ to the elements a¯ and assigning the variables y¯ to the elements w¯ yields the integer f(a¯). Henceforth, we will write ht,w¯𝒜 to denote this function f. For a set SAk×, we say that ht,w¯𝒜 is consistent with S if and only if ht,w¯𝒜(a¯)=i for all (a¯,i)S.

3 Learning FOC1-Definable Aggregate Queries

Let σ,σ be signatures with σσ, fix two numbers k1, , and let TFOC1[σ],TFOC1[σ] be two sets of terms t(x¯,y¯) with |x¯|=k and |y¯|=. We study the following problem.

Learn-with-Precomputation(k,,T,T)

Precomputation: Given a σ-structure 𝒜 with universe A, compute a σ-expansion 𝒜 of 𝒜 and a lookup table whose size is independent of 𝒜.

Input: Training set SAk×.

Task: Return a term tT and a tuple w¯A such that the hypothesis ht,w¯𝒜 is consistent with S. The algorithm may reject if there is no term tT and tuple w¯A such that the hypothesis ht,w¯𝒜 is consistent with S.

As described in the introduction, the precomputation phase can be viewed as building an index structure for the given database 𝒜. Afterwards, this index structure can be used each time that we receive as input a new training set S. The “learning phase” is what happens after receiving such a set S; the desired output is a hypothesis that is consistent with S. The main contribution of this paper is an efficient solution for the problem Learn-with-Precomputation(k,,T,T). Before presenting the exact statement of our result (Theorem 3.1), recall from Section 1 the discussion on the benefits of local-access algorithms. In the learning phase (i. e., when receiving a training set), the algorithm we provide does not require random access to the database. Instead, it only needs local access, i.e., it only accesses the neighbours of elements that it already holds in memory, initially starting with the elements given in the training set. Here, “neighbours” refers to neighbours in the Gaifman graph G𝒜 of the database 𝒜. Formally, local access means that the algorithm can access an oracle that answers requests of the form “Is v¯R(𝒜)?” in constant time and requests of the form “Return a list of all neighbours of v in 𝒜” in time linear in the number of neighbours of v. As our machine model, we use a random-access machine (RAM) model, and we consider running times under the uniform-cost measure. This allows us to store an element of the database in a single memory cell and access it in a single computation step.

The main result of this paper is the following theorem.

Theorem 3.1.

Let σ be a signature, let k1, let ,q, let I be a finite set of integers, and let T be the set of all FOC1[σ,k+,q]-terms that only use integers from I.

There is an extension σ of σ with relation symbols of arity 1, and there is a number q such that, for the set T of all FOC1[σ,k+,q]-terms, there is an algorithm that solves the problem Learn-with-Precomputation(k,,T,T) as follows.

For a σ-structure 𝒜 of size n and degree d, the precomputation to compute the σ-expansion 𝒜 of 𝒜 and the associated lookup table takes time d𝒪(1)n. Afterwards, upon input of a training set S of size s, the algorithm uses only local access to 𝒜, access to the lookup table, and time (s+d)𝒪(1) to return either a hypothesis consistent with S or the answer “reject”. Furthermore, the returned hypothesis can be evaluated in time d𝒪(1), using only local access to 𝒜 and access to the lookup table.

In particular, this implies that when 𝒜 comes from a class of σ-structures of polylogarithmic degree, then the precomputation takes time quasi-linear in n (i. e., n(logn)O(1)), a hypothesis is found within time polynomial in s and polylogarithmic in n, and it can be evaluated in time polylogarithmic in n.

We remark that the algorithm given in the proof of Theorem 3.1 is not meant to be implemented and used in practice. Instead, Theorem 3.1 serves as a fundamental result that shows the theoretic (and asymptotic) feasibility of learning aggregate queries definable in FOC1. This is in line with previous work on the descriptive complexity of learning [35, 33, 32, 29, 7, 10, 9, 8] and closely related work on so-called algorithmic meta theorems (see, e. g., [30, 41]).

Before turning to the proof of Theorem 3.1, let us first briefly discuss the situation. In the assumption of the theorem, we require the set I of integers occurring in terms of T to be finite. However, note that we can still represent other integers in T by using rule (6) for FOC1, that is, addition and multiplication. For example, we could let I be the set of powers of 10 up to some bound, and we could obtain the numbers in between with rather few additions and multiplications. Moreover, requiring I to be finite does not limit the use of counting terms of the form #x¯.φ, which may evaluate to arbitrary integers based on the given database. Meanwhile, we could let I contain (large) constants that are specific to the field the data is coming from, which allows using them even if the bound q on the size of the terms in T is small.

Since I is finite and the size of terms in T is bounded by q, up to equivalence, the set T is finite. Thus, when given a σ-structure 𝒜 with universe A and a training set SAk×, in order to find a hypothesis that is consistent with S, we could proceed as follows. Loop through all terms t(x¯,y¯) in T. For each such term t, loop through all tuples w¯A, and check whether t𝒜[v¯,w¯]=λ for every (v¯,λ)S. If so, stop and output t and w¯, indicating that ht,w¯𝒜 is a hypothesis consistent with S. If no such hypothesis is found, then stop and output “reject”. This obviously solves the learning problem. However, the time taken to loop through all the w¯A is polynomial in |A|, and it may not suffice to start with the vertices given in the training set and repeatedly iterate over the neighbours of already found vertices. Note that Theorem 3.1 yields a much more efficient algorithm, which runs in time polynomial in the size of the training set and the degree of the structure. This can be substantially faster than being polynomial in the size of the structure. We achieve this by moving over to a larger signature σ and a suitable σ-expansion 𝒜 of 𝒜. This is done in such a way that afterwards, we only have to loop through those tuples w¯ that are close to the tuples v¯Ak that actually occur in the training set S. Meanwhile, the number of terms to check is not constant any more, but it depends on 𝒜. However, as we discuss in the proof of Theorem 3.1, this number is polynomial in the degree of 𝒜, which yields the desired running time. The exact details are provided by the following Lemma 3.2; this lemma is the main technical tool that allows us to prove Theorem 3.1.

For formulating the lemma, we need the following notation. In a structure 𝒜 with universe A and of degree at most d, for every vA and any radius r, we have |Nr𝒜(v)|νd(r)1+d0i<r(d1)i. Note that ν0(r)=1, ν1(r)2, ν2(r)2r+1, and νd(r)dr+1 for all r and d3. In particular, for a fixed radius r, νd(r) is polynomial in d. For all r, it is straightforward to construct an FO[σ]-formula distrσ(x,y) such that for every σ-structure 𝒜 and all v,wA, we have 𝒜distrσ[v,w] if and only if dist𝒜(v,w)r. To improve readability, we write distσ(x,y)r instead of distrσ(x,y), and distσ(x,y)>r instead of ¬distrσ(x,y).

Let r. An FOC1[σ]-formula φ(x¯) with free variables x¯=(x1,,xk) is r-local (around x¯) if for every σ-structure 𝒜 with universe A and every tuple v¯=(v1,,vk)Ak, we have 𝒜φ[v¯]𝒩r𝒜(v¯)φ[v¯]. A formula is local if it is r-local for some r. This notion of local formulas is identical with the one in [25]. It is very similar to the notion of local formulas by Gaifman [26], although we use a semantic notion instead of Gaifman’s syntactic notion.

For every k1, every graph G with vertex set [k], and every tuple x¯=(x1,,xk) of k pairwise distinct variables, we consider the formula

δG,rσ(x¯){i,j}E(G)distσ(xi,xj)r{i,j}E(G)distσ(xi,xj)>r.

Note that 𝒜δG,2r+1σ[v¯] means that the connected components of the r-neighbourhood 𝒩r𝒜(v¯) correspond to the connected components of G. Clearly, the formula δG,2r+1σ(x¯) is r-local around its free variables x¯.

For two sets A,N with NA, for k, and two tuples w¯,w¯Ak, we write w¯Nw¯ if wi=wi for all i[k] with wiN. Note that this notion is not symmetric: for tuples w¯,w¯ with w¯Nw¯, the tuple w¯ may still have an entry wiN for some i[k] while wiN, so wiwi.

Our main technical ingredient for the proof of Theorem 3.1 is the following lemma. Note that in the final statement of the lemma, the graphs H are connected and the formulas ψ are local – both conditions are absolutely necessary for our proof of Theorem 3.1.

Lemma 3.2.

Let σ be a signature, let k1, let ,q, let t(x¯,y¯) be an FOC1[σ,k+,q]-term, and let It be the set of integers that occur in t.

There is an extension σt of σ with relation symbols of arity 1, and there are numbers qt,rt such that, for every σ-structure 𝒜 of size n and degree d, we can compute a σt-expansion 𝒜t of 𝒜 in time d𝒪(1)n such that the following is true, where A denotes the universe of 𝒜. For all s1, for all v¯1,,v¯sAk, and for all w¯A, there is an FOC1[σt,k+,qt]-term t(x¯,y¯) such that for all w¯A with w¯Ntw¯ for NtN(2rt+1)(+q)𝒜(v¯1,,v¯s), we have t𝒜[v¯i,w¯]=(t)𝒜[v¯i,w¯] for all i[s].

Furthermore, t is a combination via addition and multiplication of integers in It, of integers i with 1i(νd((2rt+1)q))q, and of counting terms of the form #z¯.(ψδH,2rt+1σt) where ψ is an rt-local formula in FO[σt] and H is a connected graph.

We present the proof of Lemma 3.2 in Section 4. Intuitively, the lemma says that we can translate a term t into a term t over an extended signature such that the new term only needs those parameters that are close to the examples. By using this lemma, we can prove Theorem 3.1 as follows.

Proof of Theorem 3.1.

Let σ,k,,q,I,T be chosen according to the assumption of the theorem.

Note that, up to logical equivalence, there are only finitely many terms in T. Thus, w.l.o.g. we assume that T is finite and that all terms in T only use x1,,xk,y1,,y,z1,,zq as variables. For each term t(x¯,y¯) in T, we apply Lemma 3.2 to obtain an extension σtσ and numbers qt,rt. W.l.o.g. we assume that the sets (σtσ)tT are pairwise disjoint.

Let σtTσt, let qmaxtTqt, and let T be the set of all FOC1[σ,k+,q]-terms using only x1,,xk,y1,,y,z1,,zq as variables. Furthermore, let rmaxtTrt.

Upon input of a σ-structure 𝒜 of size n and degree d, for each tT we use Lemma 3.2 to compute a σt-expansion 𝒜t of 𝒜 in time d𝒪(1)n. We let 𝒜 be the σ-expansion of 𝒜 obtained by combining all the structures 𝒜t for tT.

In addition, we also compute a lookup table that stores the value g𝒜 for every ground term g that occurs in a term in T and is of the form #z¯.(ψδH,2rt+1σ), where ψ is an rt-local formula in FO[σ] (for some tT) and H is a connected graph.

Claim 3.3.

The lookup table can be computed in time d𝒪(1)n.

Proof.

First note that the number of entries in the lookup table is constant and does not depend on 𝒜, because the terms in T have size at most q and only use variables from a fixed set. Thus, it suffices to show that every single entry of the lookup table can be computed in time d𝒪(1)n.

Let g be a ground term that occurs in a term in T and is of the form #z¯.(ψδH,2rt+1σ), where ψ is an rt-local formula in FO[σ] (for some tT) and H is a connected graph. Then m|z¯|q. Let r~(2r+1)m.

Since H is a connected graph and (2rt+1)m(2r+1)m=r~, we have v2,,vmNr~𝒜(v1) for all v¯(v1,,vm)Am with 𝒜δH,2rt+1σ[v¯]. Hence, to compute g𝒜, we can simply initialise the corresponding entry in the lookup table with 0, iterate over all v1A and all v2,,vmNr~𝒜(v1), and increase the entry in the lookup table by 1 if and only if 𝒜(ψδH,2rt+1σ)[v¯] holds. For every fixed v1A, we iterate over at most νd(r~)m1d𝒪(1) tuples (v2,,vm). Moreover, since ψ and δH,2rt+1σ are rt-local, for each tuple v¯=(v1,v2,,vm), it holds that 𝒜(ψδH,2rt+1σ)[v¯] if and only if 𝒩rt𝒜(v¯)(ψδH,2rt+1σ)[v¯]. The latter can be checked by building the neighbourhood structure around v¯ in time polynomial in d, and then evaluating the formula (of constant size) on the neighbourhood structure by a brute-force algorithm in time polynomial in d.

All in all, we use d𝒪(1)n iterations, and every iteration takes time d𝒪(1), so the overall running time is d𝒪(1)n.

Now let us assume that we receive an arbitrary training set S={(v¯1,λ1),,(v¯s,λs)}Ak× of size s. Let T be the set of all those terms t(x¯,y¯) in T that satisfy the following: t is a combination via addition and multiplication of integers in I, of integers i with 1i(νd((2r+1)q))q, and of counting terms of the form #z¯.(ψδH,2rt+1σ), where ψ is an rt-local formula in FO[σ] (for some tT) and H is a connected graph.

Claim 3.4.

T is of size d𝒪(1). Furthermore, for every t(x¯,y¯)T, upon input of a¯Ak and b¯A, we can compute (t)𝒜[a¯,b¯] in time d𝒪(1) with only local access to 𝒜 and access to the precomputed lookup table.

Proof.

Since the terms in T have length at most q, the variables come from a fixed finite set, the signature σ has constant size, and the integers that may occur in a term in T come from a set of size polynomial in d, the total number of terms in T is also polynomial in d.

For the evaluation of a term t in T, we first consider counting terms t(x¯,y¯) of the form #z¯.(ψδH,2rt+1σ), where k|x¯|k, |y¯|, m|z¯|q, ψ is an rt-local formula in FO[σ] (for some tT), and H is a connected graph with V(H)=[k++m].

In case that k+=0, the term t is a ground term. Hence, we can simply use the precomputed lookup table to get the value (t)𝒜 in time O(1).

In case that k+>0, let a¯Ak and b¯A. Since H is connected and rtr, for all c¯=(c1,,cm) with 𝒜δH,2rt+1σ[a¯,b¯,c¯], we have c1,,cmN(2r+1)m𝒜(a¯,b¯).

Thus, for r~(2r+1)m(2r+1)q, we have

(t)𝒜[a¯,b¯] =|{c¯Am:𝒜(ψδH,2rt+1σ)[a¯,b¯,c¯]}|
=|{c¯(Nr~𝒜(a¯,b¯))m:𝒜(ψδH,2rt+1σ)[a¯,b¯,c¯]}|.

Furthermore, ψ and δH,2rt+1σ are rt-local formulas, so 𝒜(ψδ2rt+1,Hσ)[a¯,b¯,c¯] if and only if 𝒩rt𝒜(a¯,b¯,c¯)(ψδ2rt+1,Hσ)[a¯,b¯,c¯]. Since the size of the neighbourhood is polynomial in d, the evaluation of (t)𝒜[a¯,b¯] can be performed by evaluating an FO[σ]-formula of constant size on a structure of size polynomial in d for a number of assignments that is polynomial in d. The evaluation of the formula can be done by building the neighbourhood structure around a¯,b¯ in time polynomial in d, and then evaluating the formula on the neighbourhood structure by a brute-force algorithm in time polynomial in d.

Now let t be an arbitrary term in T. Then, for all a¯Ak, b¯A, (t)𝒜[a¯,b¯] can be evaluated in time polynomial in d with only local access to 𝒜 and access to the precomputed lookup table by first evaluating every counting term in t as described above and then combining the results and the integers occurring in t via a constant number of additions and multiplications.

To find a hypothesis that is consistent with S, we loop through all terms t(x¯,y¯) in T. For each such term t, we loop through all tuples w¯N for NN(2r+1)(+q)𝒜(v¯1,,v¯s). We check if (t)𝒜[v¯i,w¯]=λi for all i[s]. If so, we stop and output t and w¯, indicating that ht,w¯𝒜 is a hypothesis consistent with S. Otherwise, we stop and output “reject”.

Claim 3.5.

This algorithm uses only local access to 𝒜 and access to the precomputed lookup table, and it terminates within time (s+d)𝒪(1). If it outputs a hypothesis, this hypothesis is consistent with S. If it outputs “reject”, there is no term tT and tuple w¯A such that ht,w¯𝒜 is consistent with S.

Proof.

The set N contains at most skνd((2r+1)(+q)) elements, which is polynomial in s and d. By 3.4, T is of size d𝒪(1). Thus, in total, we iterate over (s+d)𝒪(1) hypotheses. For every hypothesis consisting of a term tT and a tuple w¯A, by 3.4, we can check in time d𝒪(1)s if (t)𝒜[v¯i,w¯]=λi for all i[s]. This check uses only local access to 𝒜 and access to the precomputed lookup table. This proves the first statement of the claim. The second statement of the claim is obvious.

To prove the third statement of the claim, let us assume that there exists a term tT and a tuple w¯A such that ht,w¯𝒜 is consistent with S. For this particular choice of t and w¯, and for the given tuples v¯1,,v¯s, Lemma 3.2 yields an FOC1[σt,k+,qt]-term t(x¯,y¯) such that for all w¯A with w¯Ntw¯ for NtN(2rt+1)(+q)𝒜(v¯1,,v¯s), we have

t𝒜[v¯i,w¯]=(t)𝒜[v¯i,w¯]for all i[s].

Note that NtN. Hence, our algorithm will consider at least one w¯N such that w¯Ntw¯. Let us fix such a w¯. All that remains to be done is to show that tT – this will imply that our algorithm will eventually consider t,w¯ and then stop and output t and w¯, indicating that ht,w¯𝒜 is a hypothesis consistent with S.

By our choice of σ and q, we know that σσt and qqt. Thus, tT. From Lemma 3.2, we know that t is a combination via addition and multiplication of integers in I, of integers i with 1i(νd((2rt+1)q))q, and of counting terms of the form #z¯.(ψδH,2rt+1σ), where ψ is an rt-local formula in FO[σt] and H is a connected graph. By our particular choice of T and since rrt, we obtain that tT. This completes the proof of Claim 3.5.

In summary, the proof of Theorem 3.1 is complete.

4 Proof of Lemma 3.2

The proof of Lemma 3.2 heavily relies on the following localisation theorem for FOC1. This theorem is implicit in [34]; here we present it in a way analogous to [10, Theorem 4.7].

Theorem 4.1 (Localisation Theorem for FOC1, [34]).

Let k, and let σ be a signature. For every FOC1[σ]-formula φ(x1,,xk), there is an extension σφ of σ with relation symbols of arity 1, and an FO[σφ]-formula φ(x1,,xk) that is a Boolean combination of local formulas and statements of the form R() where Rσφ has arity 0, for which the following holds. There is an algorithm that, upon input of a σ-structure 𝒜 of size n and degree d, computes in time d𝒪(1)n a σφ-expansion 𝒜φ of 𝒜 such that for all v¯Ak (where A denotes the universe of 𝒜), we have 𝒜φφ[v¯] 𝒜φ[v¯].

For the proof of Lemma 3.2, let σ,k,,q,t(x¯,y¯),It be chosen according to the assumption of the lemma. In particular, t(x¯,y¯) is an FOC1[σ,k+,q]-term, and It is the set of integers that occur in t.

We first note that it suffices to prove the statement of the lemma for the particular case where t(x¯,y¯) is of the form #z¯.φ(x¯,y¯,z¯) for some FOC1[σ]-formula φ. Assume for now that the statement of the lemma holds for all terms of this form, and consider an arbitrary FOC1[σ,k+,q]-term t(x¯,y¯) that is not of this form. Then, t is a combination via addition and multiplication of integers from It and of counting terms u of the form #z¯.φ(x¯,y¯,z¯) for some FOC1[σ]-formula φ. Let U be the set of all these counting terms u occurring in t. We assume that the statement of the lemma holds for each uU. Hence, for each uU, we obtain an extension σu of σ with relation symbols of arity 1 and numbers qu,ru. W.l.o.g. we assume that the sets (σuσ)uU are pairwise disjoint. Let σtuUσu, rtmaxuUru, and qtqmaxuUqu.

Upon input of a σ-structure 𝒜 of size n and degree d, for each uU, the statement of the lemma for u enables us to compute a σu-expansion 𝒜u of 𝒜 in time d𝒪(1)n. We let 𝒜t be the σt-expansion of 𝒜 obtained by combining all the structures 𝒜u for all counting terms uU. Let A denote the universe of 𝒜.

When given v¯1,,v¯sAk and w¯A, the statement of the lemma for u yields an FOC1[σu,k+,qu]-term u for each uU. We choose t to be the term obtained from t by replacing every occurrence of a counting term uU by the corresponding counting term u. Clearly t has length at most qt. It is not difficult to verify that t has the desired properties stated in Lemma 3.2.

All that remains to be done is to prove the statement of the lemma for the particular case where t(x¯,y¯) is of the form #z¯.φ(x¯,y¯,z¯) for some FOC1[σ]-formula φ. Given such a term t, let m|z¯|. Using Theorem 4.1, we obtain an extension σσφ of σ with relation symbols of arity 1 and an FO[σ]-formula φ(x¯,y¯,z¯) that is a Boolean combination of local formulas and statements of the form R(), where Rσ has arity 0, for which the following holds. Given a σ-structure 𝒜 of size n and degree d and with universe A, we can compute a σ-expansion 𝒜𝒜φ of 𝒜 in time d𝒪(1)n such that for all a¯Ak, b¯A, and c¯Am, we have 𝒜φ[a¯,b¯,c¯] 𝒜φ[a¯,b¯,c¯]. Thus, for t~(x¯,y¯)#z¯.φ(x¯,y¯,z¯), we have t~𝒜[a¯,b¯]=t𝒜[a¯,b¯] for all a¯Ak, b¯A.

From the particular shape of φ, we obtain that there exists a number r such that φ is r-local, i. e., for all a¯Ak, b¯A, c¯Am, we have 𝒜φ[a¯,b¯,c¯] 𝒩r𝒜(a¯,b¯,c¯)φ[a¯,b¯,c¯].

Let 𝒢 be the set of undirected graphs with vertex set [k++m]. For every G𝒢, consider the formula φG(x¯,y¯,z¯)φ(x¯,y¯,z¯)δG,2r+1σ(x¯,y¯,z¯) and the term uG(x¯,y¯)#z¯.φG(x¯,y¯,z¯). It is not difficult to verify that for all a¯Ak, b¯A, we have

t𝒜[a¯,b¯]=t~𝒜[a¯,b¯]=G𝒢(uG)𝒜[a¯,b¯].

To further decompose the terms uG, we use techniques similar to the ones used in [34] to decompose terms into so-called connected local terms. With these, we obtain the following technical lemma. The statement of this lemma as well as its proof are highly non-trivial. It depends on a careful analysis of the connected components of undirected graphs with k++m nodes. Note that in the final statement of the lemma, the graphs H are connected and the formulas ψ are local – both conditions are also stated in Lemma 3.2 and are absolutely necessary for our proof of Theorem 3.1.

Lemma 4.2.

Let σ be a signature and let r,k,,m with k++m1. Let φ(x¯,y¯,z¯) be an r-local FO[σ]-formula with |x¯|=k, |y¯|=, and |z¯|=m. Let 𝒢 be the set of all undirected graphs with vertex set [k++m]. Let G𝒢 and let

uG(x¯,y¯)#z¯.(φ(x¯,y¯,z¯)δG,2r+1σ(x¯,y¯,z¯)).

There is a number qG such that, for every σ-structure 𝒜 of degree d and with universe A, for all s1, for all v¯1,,v¯sAk, and for all w¯A, there exists an FOC1[σ,k+,qG]-term uG(x¯,y¯) such that the following is true for NN(2r+1)(+m)𝒜(v¯1,,v¯s).

For all w¯A with w¯Nw¯, we have

(uG)𝒜[v¯i,w¯]=(uG)𝒜[v¯i,w¯]for all i[s].

Furthermore, uG is a combination via addition and multiplication of integers i with 1i(νd((2r+1)m))m and of counting terms of the form #z¯.(ψδH,2r+1σ), where |z¯||z¯|, for a connected graph H and an r-local formula ψ.

Before proving Lemma 4.2, we first finish the proof of Lemma 3.2. We apply Lemma 4.2 to the term uG for all G𝒢. For each G𝒢 this yields a number qG. We choose rtr and σtσ.

For any σ-structure 𝒜, we let 𝒜 be the σ-expansion obtained as described above (by applying Theorem 4.1 to the formula φ). When given tuples v¯1,,v¯sAk and w¯A, we obtain from Lemma 3.2 an FOC1[σ,k+,qG]-term uG(x¯,y¯), for every G𝒢. This term uG is a combination via addition and multiplication of integers i with 1i(νd((2r+1)m))m and of counting terms of the form #z¯.(ψδH,2r+1σ), where |z¯||z¯|, for a connected graph H and an r-local formula ψ. We choose

t(x¯,y¯)G𝒢uG(x¯,y¯)

and let qtG𝒢(qG+3). Note that t has length qt and that mq. It is not difficult to verify that the term t has the desired properties stated in Lemma 3.2.

All that remains to be done to complete the proof of Lemma 3.2 is to prove Lemma 4.2.

Proof of Lemma 4.2.

We prove the statement by induction on the number c of connected components of G.

Consider the induction base c=1, i. e., the graph G is connected. We choose qG to be the length of the term uG. For choosing the term uG, we distinguish between 4 cases.

Case 1:

k==0. In this case, we set uGuG.

Case 2:

k=0 and >0. In this case, we set uGi for the particular number i(uG)𝒜[w¯]. Since G is connected, all elements in a tuple c¯Am with 𝒜δG,2r+1σ[w¯,c¯] have distance at most (2r+1)m from w¯. Therefore, 0i(νd((2r+1)m))m.

Case 3:

k>0 and w¯N. Since G is connected, for all i[s] and all c¯Am, it holds that 𝒜⊧̸δG,2r+1σ[v¯i,w¯,c¯]. Hence, for all i[s] we have (uG)𝒜[v¯i,w¯]=0. Therefore, we can choose uG0.

Case 4:

k>0 and w¯N. In this case, w¯ is the only tuple w¯ that satisfies w¯Nw¯. Hence, we can choose uGuG.

This completes the induction base.

We now turn to the induction step, where c>1. We assume that the statement of the lemma holds for all graphs G with fewer than c connected components. Let C1 be the set of all vertices of G that are in the same connected component as the vertex 1. Let C2V(G)C1. For each j{1,2}, let G[Cj] be the induced subgraph of G with vertex set Cj. Clearly, G is the disjoint union of the graphs G[C1] and G[C2], and G[C1] has only one connected component, and G[C2] has c1 connected components. For j{1,2}, let x¯j be the tuple of variables of x¯ such that their index is contained in Cj, let y¯j be the tuple of variables of y¯ such that their index +k is contained in Cj, and let z¯j be the tuple of variables of z¯ such that their index +k+ is contained in Cj.

By using the Feferman-Vaught Theorem (cf., [24, 44, 10]), we obtain a decomposition Δ of φ(x¯,y¯,z¯) w.r.t. (x¯1y¯1z¯1;x¯2y¯2z¯2). I. e., Δ is a finite non-empty set of pairs of r-local FO[σ]-formulas of the form (α(x¯1,y¯1,z¯1),β(x¯2,y¯2,z¯2)) such that for all a¯Ak, b¯A, c¯Am, we have 𝒜φ[a¯,b¯,c¯] there exists a pair (α,β)Δ such that 𝒜α[a¯1,b¯1,c¯1] and 𝒜β[a¯2,b¯2,c¯2]. Here, a¯j,b¯j,c¯j are defined analogously to x¯j,y¯j,z¯j for j{1,2}. Moreover, the pairs in Δ are mutually exclusive, i. e., for all a¯Ak, b¯A, c¯Am, there is at most one pair (α,β)Δ such that 𝒜α[a¯1,b¯1,c¯1] and 𝒜β[a¯2,b¯2,c¯2].

Let 𝒢¬G be the set of graphs HG with vertex set V(H)=[k++m] and G[C1]=H[C1] and G[C2]=H[C2]. Note that such graphs H have strictly fewer connected components than G.

For every pair (α,β)Δ, we set

tα,β(x¯,y¯) #z¯.(α(x¯1,y¯1,z¯1)β(x¯2,y¯2,z¯2)δG,2r+1σ(x¯,y¯,z¯)),
tα,1(x¯1,y¯1) #z¯1.(α(x¯1,y¯1,z¯1)δG[C1],2r+1σ(x¯1,y¯1,z¯1))),
tβ,2(x¯2,y¯2) #z¯2.(β(x¯2,y¯2,z¯2)δG[C2],2r+1σ(x¯2,y¯2,z¯2))),
tα,β,¬G(x¯,y¯) H𝒢¬G#z¯.(α(x¯1,y¯1,z¯1)β(x¯2,y¯2,z¯2)δH,2r+1σ(x¯,y¯,z¯)).

Since the pairs (α,β)Δ are mutually exclusive, for all a¯Ak, b¯A, we have (uG)𝒜[a¯,b¯]=(α,β)Δtα,β𝒜[a¯,b¯]. Furthermore, it is not difficult to verify that

tα,β𝒜[a¯,b¯]=tα,1𝒜[a¯1,b¯1]tβ,2𝒜[a¯2,b¯2]tα,β,¬G𝒜[a¯,b¯],

where a¯j,b¯j are defined analogously to x¯j,y¯j for j{1,2}.

Using the induction hypothesis, we apply the statement of the lemma to tα,1, tβ,2, and to every summand of tα,β,¬G, and we call the resulting terms tα,1, tβ,2, and tα,β,¬G.

We set tα,β(x¯,y¯)tα,1(x¯1,y¯1)tβ,2(x¯2,y¯2)tα,β,¬G(x¯,y¯) and choose uG(x¯,y¯)(α,β)Δtα,β(x¯,y¯).

It is not difficult to verify that the term uG has the desired properties. Moreover, the size of uG can be bounded by a number qG that only depends on uG (but not on 𝒜). This completes the proof of Lemma 4.2. Hence, also the proof of Lemma 3.2 now is complete.

5 Conclusion

We have studied the complexity of learning aggregate queries from examples. For this, we have extended the framework for Boolean classification problems by Grohe and Turán [35] to multiclass classification problems with integer-valued classifiers. In our setting, such classifiers are represented by a pair (t,w¯) where t(x¯,y¯) is a counting term in FOC1 and w¯ is a tuple of elements in the universe A of the given database 𝒜.

Our main result shows that we can build a suitable index structure on the given database 𝒜 during a precomputation step whose running time is linear in the size and polynomial in the degree of 𝒜. Afterwards, by utilising this index structure, whenever receiving as input a new training set S, a classifier definable in FOC1 can be learned in time polynomial in the degree of 𝒜 and polynomial in the number of examples given in the training set. The classifiers returned by our algorithm can be evaluated efficiently, in time polynomial in the degree of 𝒜. Moreover, after having built the index structure, all our algorithms require only local access to the database.

It seems conceivable that our results can be generalised to the more expressive logic FOWA1 that operates on weighted structures [10], since the locality results obtained for this logic in [10] are similar to the ones obtained for FOC1 in [34] and used here.

For the Boolean classification problem based on FO, [9] shows that the learning problem is fixed-parameter tractable on nowhere dense classes. This result heavily relies on the fixed-parameter tractability of the evaluation problem for FO on nowhere dense classes [31]. From [34] it is known that on nowhere dense classes also the evaluation problem for formulas and terms of FOC1 is fixed-parameter tractable. We therefore think that FOC1 is a good candidate for a logic with a fixed-parameter tractable integer-valued classification problem on classes of sparse structures.

In this paper, the task in the learning problem is to find an integer-valued hypothesis that is consistent with the training set. Other publications on the considered framework for Boolean classification problems also study settings in which one wants to find a hypothesis that generalises well [33, 32, 29, 7, 10, 9, 8]. More specifically, they study probably approximately correct (PAC) learning tasks in which the training examples are considered as being generated from a (fixed, but unknown) probability distribution. The task is to find a hypothesis that has a small expected error on new examples generated from the same distribution. While PAC-learning algorithms for Boolean classification problems are typically based on the empirical risk minimisation (ERM) principle [48], results in algorithmic learning theory for multiclass classification [14, 21, 36] are based on different principles and even show that the ERM principle in general does not suffice for PAC learning with an unbounded number of classes [22]. Therefore, we expect it to be quite challenging to obtain PAC-learning results for our framework of multiclass classification problems.

We plan to work on the raised questions in future work.

References

  • [1] Azza Abouzied, Dana Angluin, Christos H. Papadimitriou, Joseph M. Hellerstein, and Avi Silberschatz. Learning and verifying quantified Boolean queries by example. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, pages 49–60. ACM, 2013. doi:10.1145/2463664.2465220.
  • [2] Howard Aizenstein, Tibor Hegedüs, Lisa Hellerstein, and Leonard Pitt. Complexity theoretic hardness results for query learning. Comput. Complex., 7(1):19–53, 1998. doi:10.1007/PL00001593.
  • [3] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, 2011. doi:10.1145/2043652.2043656.
  • [4] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1987. doi:10.1007/BF00116828.
  • [5] Pablo Barceló, Alexander Baumgartner, Victor Dalmau, and Benny Kimelfeld. Regularizing conjunctive features for classification. J. Comput. Syst. Sci., 119:97–124, 2021. doi:10.1016/j.jcss.2021.01.003.
  • [6] Pablo Barceló and Miguel Romero. The complexity of reverse engineering problems for conjunctive queries. In 20th International Conference on Database Theory, ICDT 2017, volume 68 of LIPIcs, pages 7:1–7:17, 2017. doi:10.4230/LIPIcs.ICDT.2017.7.
  • [7] Steffen van Bergerem. Learning concepts definable in first-order logic with counting. In 34th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2019, pages 1–13. IEEE, 2019. doi:10.1109/LICS.2019.8785811.
  • [8] Steffen van Bergerem. Descriptive Complexity of Learning. PhD thesis, RWTH Aachen University, Germany, 2023. doi:10.18154/RWTH-2023-02554.
  • [9] Steffen van Bergerem, Martin Grohe, and Martin Ritzert. On the parameterized complexity of learning first-order logic. In PODS 2022: International Conference on Management of Data, pages 337–346. ACM, 2022. doi:10.1145/3517804.3524151.
  • [10] Steffen van Bergerem and Nicole Schweikardt. Learning concepts described by weight aggregation logic. In 29th EACSL Annual Conference on Computer Science Logic, CSL 2021, Ljubljana, Slovenia (Virtual Conference), January 25-28, 2021, volume 183 of LIPIcs, pages 10:1–10:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.CSL.2021.10.
  • [11] Angela Bonifati, Radu Ciucanu, and Aurélien Lemay. Learning path queries on graph databases. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, pages 109–120. OpenProceedings.org, 2015. doi:10.5441/002/edbt.2015.11.
  • [12] Angela Bonifati, Radu Ciucanu, and Slawek Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016. doi:10.1145/2818637.
  • [13] Angela Bonifati, Ugo Comignani, Emmanuel Coquery, and Romuald Thion. Interactive mapping specification with exemplar tuples. ACM Trans. Database Syst., 44(3):10:1–10:44, 2019. doi:10.1145/3321485.
  • [14] Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, pages 943–955. IEEE, 2022. doi:10.1109/FOCS54457.2022.00093.
  • [15] Balder ten Cate and Victor Dalmau. Conjunctive queries: Unique characterizations and exact learnability. In 24th International Conference on Database Theory, ICDT 2021, volume 186 of LIPIcs, pages 9:1–9:24, 2021. doi:10.4230/LIPIcs.ICDT.2021.9.
  • [16] Balder ten Cate, Victor Dalmau, and Phokion G. Kolaitis. Learning schema mappings. ACM Trans. Database Syst., 38(4):28:1–28:31, 2013. doi:10.1145/2539032.2539035.
  • [17] Balder ten Cate, Phokion G. Kolaitis, Kun Qian, and Wang-Chiew Tan. Active learning of GAV schema mappings. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2018, pages 355–368. ACM, 2018. doi:10.1145/3196959.3196974.
  • [18] Adrien Champion, Tomoya Chiba, Naoki Kobayashi, and Ryosuke Sato. ICE-based refinement type discovery for higher-order functional programs. J. Autom. Reason., 64(7):1393–1418, 2020. doi:10.1007/s10817-020-09571-y.
  • [19] William W. Cohen and C. David Page Jr. Polynomial learnability and inductive logic programming: Methods and results. New Gener. Comput., 13(3&4):369–409, 1995. doi:10.1007/BF03037231.
  • [20] Andrew Cropper, Sebastijan Dumancic, Richard Evans, and Stephen H. Muggleton. Inductive logic programming at 30. Mach. Learn., 111(1):147–172, 2022. doi:10.1007/s10994-021-06089-1.
  • [21] Amit Daniely, Sivan Sabato, and Shai Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, pages 494–502, 2012. URL: https://proceedings.neurips.cc/paper/2012/hash/19f3cd308f1455b3fa09a282e0d496f4-Abstract.html.
  • [22] Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Maria-Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 287–316. JMLR.org, 2014. URL: http://proceedings.mlr.press/v35/daniely14b.html.
  • [23] P. Ezudheen, Daniel Neider, Deepak D’Souza, Pranav Garg, and P. Madhusudan. Horn-ICE learning for synthesizing invariants and contracts. Proc. ACM Program. Lang., Volume 2( Issue OOPSLA):131:1–131:25, 2018. doi:10.1145/3276501.
  • [24] Solomon Feferman and Robert L. Vaught. The first-order properties of products of algebraic systems. Fundamenta Mathematicae, 47:57–103, 1959.
  • [25] Jörg Flum and Martin Grohe. Parameterized Complexity Theory. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2006. doi:10.1007/3-540-29953-X.
  • [26] Haim Gaifman. On local and non-local properties. In Jacques Stern, editor, Proceedings of the Herbrand Symposium, volume 107 of Studies in Logic and the Foundations of Mathematics, pages 105–135. North-Holland, 1982. doi:10.1016/S0049-237X(08)71879-2.
  • [27] Pranav Garg, Christof Löding, P. Madhusudan, and Daniel Neider. ICE: A robust framework for learning invariants. In Computer Aided Verification - 26th International Conference, CAV 2014, volume 8559 of Lecture Notes in Computer Science, pages 69–87. Springer, 2014. doi:10.1007/978-3-319-08867-9_5.
  • [28] Georg Gottlob and Pierre Senellart. Schema mapping discovery from data instances. J. ACM, 57(2):6:1–6:37, 2010. doi:10.1145/1667053.1667055.
  • [29] Emilie Grienenberger and Martin Ritzert. Learning definable hypotheses on trees. In 22nd International Conference on Database Theory, ICDT 2019, pages 24:1–24:18, 2019. doi:10.4230/LIPIcs.ICDT.2019.24.
  • [30] Martin Grohe. Logic, graphs, and algorithms. In Jörg Flum, Erich Grädel, and Thomas Wilke, editors, Logic and Automata: History and Perspectives [in Honor of Wolfgang Thomas], volume 2 of Texts in Logic and Games, pages 357–422. Amsterdam University Press, 2008.
  • [31] Martin Grohe, Stephan Kreutzer, and Sebastian Siebertz. Deciding first-order properties of nowhere dense graphs. J. ACM, 64(3):17:1–17:32, 2017. doi:10.1145/3051095.
  • [32] Martin Grohe, Christof Löding, and Martin Ritzert. Learning MSO-definable hypotheses on strings. In International Conference on Algorithmic Learning Theory, ALT 2017, pages 434–451, 2017. URL: http://proceedings.mlr.press/v76/grohe17a.html.
  • [33] Martin Grohe and Martin Ritzert. Learning first-order definable concepts over structures of small degree. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2017, pages 1–12, 2017. doi:10.1109/LICS.2017.8005080.
  • [34] Martin Grohe and Nicole Schweikardt. First-order query evaluation with cardinality conditions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2018, pages 253–266, 2018. doi:10.1145/3196959.3196970.
  • [35] Martin Grohe and György Turán. Learnability and definability in trees and similar structures. Theory Comput. Syst., 37(1):193–220, 2004. doi:10.1007/s00224-003-1112-8.
  • [36] Steve Hanneke, Shay Moran, and Qian Zhang. Universal rates for multiclass learning. In The 36th Annual Conference on Learning Theory, COLT 2023, volume 195 of Proceedings of Machine Learning Research, pages 5615–5681. PMLR, 2023. URL: https://proceedings.mlr.press/v195/hanneke23a.html.
  • [37] David Haussler. Learning conjunctive concepts in structural domains. Mach. Learn., 4:7–40, 1989. doi:10.1007/BF00114802.
  • [38] Kouichi Hirata. On the hardness of learning acyclic conjunctive queries. In Algorithmic Learning Theory, 11th International Conference, ALT 2000, volume 1968 of Lecture Notes in Computer Science, pages 238–251. Springer, 2000. doi:10.1007/3-540-40992-0_18.
  • [39] Jörg-Uwe Kietz and Saso Dzeroski. Inductive logic programming and learnability. SIGART Bull., 5(1):22–32, 1994. doi:10.1145/181668.181674.
  • [40] Benny Kimelfeld and Christopher Ré. A relational framework for classifier engineering. ACM Trans. Database Syst., 43(3):11:1–11:36, 2018. doi:10.1145/3268931.
  • [41] Stephan Kreutzer. Algorithmic meta-theorems. In Javier Esparza, Christian Michaux, and Charles Steinhorn, editors, Finite and Algorithmic Model Theory, volume 379 of London Mathematical Society Lecture Note Series, pages 177–270. Cambridge University Press, 2011.
  • [42] Dietrich Kuske and Nicole Schweikardt. First-order logic with counting. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2017, pages 1–12. IEEE Computer Society, 2017. doi:10.1109/LICS.2017.8005133.
  • [43] Christof Löding, P. Madhusudan, and Daniel Neider. Abstract learning frameworks for synthesis. In Tools and Algorithms for the Construction and Analysis of Systems – 22nd International Conference, TACAS 2016, volume 9636 of Lecture Notes in Computer Science, pages 167–185. Springer, 2016. doi:10.1007/978-3-662-49674-9_10.
  • [44] Johann A. Makowsky. Algorithmic uses of the Feferman-Vaught theorem. Ann. Pure Appl. Log., 126(1-3):159–213, 2004. doi:10.1016/j.apal.2003.11.002.
  • [45] Denis Mayr Lima Martins. Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities. Inf. Syst., 83:89–100, 2019. doi:10.1016/j.is.2019.03.002.
  • [46] Stephen Muggleton. Inductive logic programming. New Gener. Comput., 8(4):295–318, 1991. doi:10.1007/BF03037089.
  • [47] Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. J. Log. Program., 19/20:629–679, 1994. doi:10.1016/0743-1066(94)90035-3.
  • [48] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014. doi:10.1017/CBO9781107298019.
  • [49] Robert H. Sloan, Balázs Szörényi, and Gyö rgy Turán. Learning Boolean functions with queries. In Yves Crama and Peter L. Hammer, editors, Boolean Models and Methods in Mathematics, Computer Science, and Engineering, pages 221–256. Cambridge University Press, 2010. doi:10.1017/cbo9780511780448.010.
  • [50] Slawek Staworko and Piotr Wieczorek. Learning twig and path queries. In 15th International Conference on Database Theory, ICDT 2012, pages 140–154. ACM, 2012. doi:10.1145/2274576.2274592.
  • [51] Wei Chit Tan, Meihui Zhang, Hazem Elmeleegy, and Divesh Srivastava. Reverse engineering aggregation queries. Proc. VLDB Endow., 10(11):1394–1405, 2017. doi:10.14778/3137628.3137648.
  • [52] Wei Chit Tan, Meihui Zhang, Hazem Elmeleegy, and Divesh Srivastava. REGAL+: reverse engineering SPJA queries. Proc. VLDB Endow., 11(12):1982–1985, 2018. doi:10.14778/3229863.3236240.
  • [53] Quoc Trung Tran, Chee Yong Chan, and Srinivasan Parthasarathy. Query reverse engineering. VLDB J., 23(5):721–746, 2014. doi:10.1007/s00778-013-0349-3.
  • [54] Chenglong Wang, Alvin Cheung, and Rastislav Bodik. Synthesizing highly expressive SQL queries from input-output examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, pages 452–466. ACM, 2017. doi:10.1145/3062341.3062365.
  • [55] Yaacov Y. Weiss and Sara Cohen. Reverse engineering SPJ-queries from examples. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, pages 151–166. ACM, 2017. doi:10.1145/3034786.3056112.
  • [56] He Zhu, Stephen Magill, and Suresh Jagannathan. A data-driven CHC solver. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 707–721. ACM, 2018. doi:10.1145/3192366.3192416.