Abstract 1 Introduction 2 Graph databases and Cypher patterns: an abstraction 3 Cypher Vs. RPQs 4 Conclusions References

Database Theory in Action:
Cypher, GQL, and Regular Path Queries

Amélie Gheerbrant ORCID Université Paris Cité, CNRS, IRIF, France Leonid Libkin ORCID RelationalAI and IRIF, CNRS, Paris, France
University of Edinburgh, UK
Liat Peterfreund ORCID The Hebrew University of Jerusalem, Israel Alexandra Rogova ORCID Université Paris Cité, CNRS, IRIF, France
Abstract

Cypher has so far been the most commonly used query language for property graphs, and served as the foundation of the recently standardized graph query language GQL. In designing the features of GQL, the standards committee addressed the perceived limitations of Cypher. One such limitation is the inability of Cypher, as originally designed, to express all regular path queries (RPQs). Despite this claim having been stated many times as a folklore result, we could not find any proof of it. In this note we formalize the core of Cypher’s pattern matching and formally prove that indeed it falls short of all RPQs, justifying the inclusion of new pattern matching features in GQL.

Keywords and phrases:
Regular path queries, Cypher, GQL, inexpressibility
Copyright and License:
[Uncaptioned image] © Amélie Gheerbrant, Leonid Libkin, Liat Peterfreund, and Alexandra Rogova; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Information systems Graph-based database models
Funding:
We acknowledge support from: VeriGraph project, ANR-21-CE48-0015 (L. Libkin); Israel Science Foundation 2355/24 (L. Peterfreund); NCN grant 2018/30/E/ST6/00042 (A. Rogova).
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

One of the key goals of the design of the new standard Graph Query Language (ISO/IEC 39075:2024 Information technology – Database languages – GQL, www.iso.org/standard/76120.html) was to overcome Cypher’s perceived limitation with respect to regular path queries (RPQs). One finds statements that Cypher falls short of the full power of RPQs in many sources and surveys, e.g. [5, 2, 1]. There is a strong intuition behind this statement: in Cypher, the use of Kleene star is limited to edge labels. Essentially, one can say that there is a path of edges labeled between two nodes n and n, but it seems that one cannot say that there is a path n=n0n1n2nk1nk=n so that each part of this path between ni and ni+1 for 0i<k, satisfies a pattern π more complex than a labeled edge.

This observation led to the substantial extension of GQL’s pattern matching, which also coincide with available pattern matching in SQL/PGQ, a newly standardized extension of SQL with mechanisms for property graph querying (see their informal description in [3]). Namely, a bounded (between n and m times for n<m) or unbounded (at least n times) repetition can be applied to an arbitrary pattern π (currently such patterns cannot themselves contain repetitions, but a language opportunity111 In SQL and GQL standards, this means the committee plans to revisit a specific feature. exists to remove this restriction).

This substantial extension of the language was based on a belief that arbitrary regular properties of paths cannot be expressed in Cypher. Our goal is to justify this decision by providing a proof of the widely believed – but hitherto unproven – statement.

2 Graph databases and Cypher patterns: an abstraction

Cypher operates on property graphs: these are labeled graphs, potentially with multiple edges between two nodes, where both edges and nodes carry properties given as key/value pairs. The latter play no role in comparison with RPQs which only refer to labels; thus we use a simple model of graph databases where properties are disregarded.

Graph databases.

Assume pairwise disjoint countable sets 𝒩 of node ids, of edge ids and of labels. A graph database is a tuple G=N,E,λ,σ,τ where

  • N𝒩 is a finite set of node ids used in G;

  • E is a finite set of directed edge ids used in G;

  • λ:NE is a labeling function that associates with every id a label from ;

  • σ,τ:EN define source and target of an edge.

In real Cypher, λ can assign sets of labels to a node, and in GQL it can assign sets of labels to edges as well, but since the separating example does not depend on these features (which can also be modeled by considering 2 as the new set of labels), we do not use them here.

A path in G is an alternating sequence n0e1n1e2eknk, for k0, of nodes and edges that starts and ends with a node and so that each edge ei connects ni1 to ni for ik. That is, either ei is a forward edge with σ(ei)=ni1 and τ(ei)=ni, or a backward edgewith σ(ei)=ni and τ(ei)=ni1. If k=0, the path consists of a single node n0. We explicitly write out paths as n0,e1,n1,,ek,nk. Two paths p=n0,e0,,nk and p=n0,e0,,nj concatenate, if nk=n0, in which case their concatenation pp is defined as n0,e0,,nk,e0,,nj.

Cypher Patterns.

We refine an abstraction of patterns of GQL from [4], omitting the GQL specific features, and adding Cypher patterns matching repeated edges of a given label. Let 𝒱 be a countably infinite set of variables. Patterns are defined by

π:=(x:)(x)y:(z)(x)y:(z)(x)(z)(x)(z)π1π2π1+π2πθ,x,y,z𝒱

Let 𝒱(π) be the set of all variables mentioned in π. The syntactic conditions are:

  • conditions in πθ are given by θ,θ:=(x=x)θθθθ¬θ; all variables mentioned in θ must occur in 𝒱(π).

  • π1+π2 is only defined when 𝒱(π1)=𝒱(π2).

In Cypher variables and labels in edge/node patterns are optional, but we include them for simplicity. This does not affect the separating example in which we assume one fixed label for all nodes/edges; the use of variables does not affect expressibility of RPQs. Also, Cypher’s repetitions of labeled edges of the form n..m (between n and m) or n.. (at least n), or ..m (at most m) are all expressible with concatenation, union, and Kleene star.

Semantics of Patterns.

The semantics π of a path pattern π, with respect to a graph database G, is a set of pairs (p,μ) where p is a path and μ is a mapping 𝒱(π)NE as defined in Fig. 1. Two partial mappings μ,μ:𝒱NE are joinable if μ(x)=μ(x) for each shared x. Their join μμ is then unambiguously defined as the mapping that coincides with μ(x) for x in the domain of μ and μ(x) for x in the domain of μ. In the figure, we omit the standard conditions for μθ for Boolean connectives.

(x):={(n,{xn})nN}𝑥:={(n1,e,n2,{xe})|eE,σ(e)=n1,τ(e)=n2}𝑥:={(n2,e,n1,{xe})|eE,σ(e)=n1,τ(e)=n2}(x):(z):={(n1,e1,n2,,ek1,nk,{xn1,znk})e1,,ek1E,σ(ei)=ni,τ(ei)=ni+1,λ(ei)= for all i<k}(x):(z):={(n1,e1,n2,,ek1,nk,{xnk,zn1})e1,,ek1E,σ(ei)=ni+1,τ(ei)=ni,λ(ei)= for all i<k}π1+π2:=π1π2π1π2:={(p1p2,μ1μ2)|(p1,μ1)π1,(p2,μ2)π2,μ1,μ2 are joinable and p1,p2 concatenate}πθ:={(p,μ)πμθ}where μx=y iff μ(x)=μ(y)

Figure 1: Semantics of Cypher patterns with respect to G=N,E,λ,σ,τ.

3 Cypher Vs. RPQs

Recall that an RPQ is a regular expression q over the labels . The result of q in G, written q(G), is the set of pairs of nodes (n0,nk) such that there is a path n0,e0,n1,e1,,ek1,nk with the word λ(e0)λ(e1)λ(ek1) being in the regular language of q.

A Cypher pattern π with two designated variables xs,xt𝒱(π) is said to express an RPQ q in G if q(G)={(μ(xs),μ(xt))(p,μ)π}.

Theorem 1.

Cypher patterns, as defined above, cannot express all RPQs. In particular they cannot express the pattern testing for an even-length path of edges labeled .

In other words, the regular path query () cannot be expressed in Cypher.

Proof.

Consider graphs Gn with N={v1,,vn} and E={e1,,en1} so that σ(ei)=vi and τ(ei)=vi+1 for i<n (i.e., directed paths), with each edge labelled . We can therefore assume that all edge labels used in patterns are (if not, such a pattern is not matched, and thus the entire subpattern in which it occurred cannot be matched, up to + in the parse tree, and thus can be removed). We can also further assume that no variable is used as both a node variable and an edge variable (as this would falsify the pattern), nor any explicit equality between such variables is used in conditional patterns.

We now represent such graphs Gn as first-order structures Sn in the vocabulary R,R with the universe N, and relations interpreted as follows:

  • R={(vi,vi+1) 1i<n} is the edge relation;

  • R={(vi,vj) 1ijn} is the reflexive transitive closure of R.

We next show how patterns are translated into first-order formulae over this vocabulary. We use R for convenience, as it is definable from R which is isomorphic to a linear order on {1,,n}. We will then easily obtain the inexpressibility results since FO cannot define even cardinality of linear orders.

For the translation, with each pattern π we associate two new variables xπs,xπt (intuitively, to be witnessed by the endpoints of patterns), and with each edge variable z used in a pattern we associate two first-order variables zs,zt to be used in FO formulas (for source and target of edges). Then a pattern π with node variables y1,,ym and edge variables z1,,zk (all distinct) is translated into an FO formula απ(xπs,xπt,y1,,ym,z1s,z1t,,zks,zkt). The condition on the translation is that for a path p=u0,f0,u1,,fr1,ur we have

(p,μ)πGnSnαπ(u0,ur,μ(y1),,μ(ym),σ(μ(z1)),τ(μ(z1)),,σ(μ(zk)),τ(μ(zk))) (1)

Now suppose the pattern () is definable in Cypher over graphs Gn by a pattern π as above. Then β(xπs,xπt):=y1,,ym,z1s,,zktαπ is true for vi,vj iff the path between them is of even length and therefore the sentence γ:=s,t(¬sR(s,s)¬tR(t,t)β(s,t)) states the path from v1 to vn is of even length, which is impossible.

Next, to conclude the proof, we present the translation, recursively.

  • If π=(y) then απ(xπs,xπt,y):=xπs=xπtxπt=y.

  • If π=(y1)z:(y2) then απ(xπs,xπt,y1,y2,zs,zt):=xπs=y1xπt=y2zs=y1zt=y2R(y1,y2).

  • If π=(y1)z:(y2) then απ(xπs,xπt,y1,y2,zs,zt):=xπs=y2xπt=y1zs=y2zt=y1R(y2,y1).

  • If π=(y1):(y2) then απ(xπs,xπt,y1,y2):=xπs=y1xπt=y2R(y1,y2)

  • if π=(y1):(y2) then απ(xπs,xπt,y1,y2):=xπs=y2xπt=y1R(y2,y1).

  • If π=π1π2 with π1,π2 translated as απ1(xπ1s,xπ1t,v¯1) and απ2(xπ2s,xπ2t,v¯2) respectively (where v¯i list variables in those formulae corresponding to node and edge variables in patterns), then απ(xπs,xπt,v¯1,v¯2) is defined as

    xπ1s,xπ1t,xπ2s,xπ2t(απ1(xπ1s,xπ1t,v¯1)απ2(xπ2s,xπ2t,v¯2)xπs=xπ1sxπt=xπ2txπ1t=xπ2s)

    where in v¯1,v¯2 repeated variables are mentioned only once.

  • If π=π1+π2 with π1,π2 translated as απ1(xπ1s,xπ1t,v¯) and απ2(xπ2s,xπ2t,v¯) (note that variables must be the same as the schemas of π1 and π2 coincide), then απ(xπs,xπt,v¯):=απ1(xπs,xπt,v¯)απ2(xπs,xπt,v¯).

  • If π=π1θ then απ(xπs,xπt,v¯):=απ1(xπs,xπt,v¯)θ where θ is obtained from θ by the following transformations:

    • each condition yi=yj stays;

    • each condition zi=zj is replaced by zis=zjszit=zjt;

    • these are propagated through the Boolean connectives.

It is straightforward to verify that these translations satisfy (1), completing the proof.

The proof shows that on graphs Gn, Cypher patterns fall far short of RPQs. The latter can express every regular property of languages in , or in other words test if n belongs to a set which is a finite union of arithmetic progressions. For Cypher patterns, on the other hand, the first-order definability of a pattern π in the theory of order implies the existence of the threshold t such that either (v1,vn) is selected by π for all n>t, or (v1,vn) is not selected by π for all n>t.

4 Conclusions

With formal models of GQL, SQL/PGQ, and Cypher finally available, this note is an example of how research in database theory can affect the design of new languages. When it comes to graph languages, industrial developments are far ahead of academic research, creating opportunities for the academic community to develop tools to evaluate decisions already made and lay a solid foundation for new language features in upcoming editions of the standards.

References

  • [1] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoč. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5):68:1–68:40, 2017. doi:10.1145/3104031.
  • [2] Angela Bonifati, George H. L. Fletcher, Hannes Voigt, and Nikolay Yakovets. Querying Graphs. Morgan & Claypool Publishers, 2018. doi:10.2200/S00873ED1V01Y201808DTM051.
  • [3] Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Hannes Voigt, Oskar van Rest, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. Graph pattern matching in GQL and SQL/PGQ. In SIGMOD, pages 1–12. ACM, 2022. arXiv:2112.06217.
  • [4] Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, and Domagoj Vrgoc. GPC: A pattern calculus for property graphs. In Floris Geerts, Hung Q. Ngo, and Stavros Sintos, editors, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023, pages 241–250. ACM, 2023. doi:10.1145/3584372.3588662.
  • [5] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data, pages 1433–1445, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3183713.3190657.