Database Theory in Action:
Cypher, GQL, and Regular Path Queries
Abstract
Cypher has so far been the most commonly used query language for property graphs, and served as the foundation of the recently standardized graph query language GQL. In designing the features of GQL, the standards committee addressed the perceived limitations of Cypher. One such limitation is the inability of Cypher, as originally designed, to express all regular path queries (RPQs). Despite this claim having been stated many times as a folklore result, we could not find any proof of it. In this note we formalize the core of Cypher’s pattern matching and formally prove that indeed it falls short of all RPQs, justifying the inclusion of new pattern matching features in GQL.
Keywords and phrases:
Regular path queries, Cypher, GQL, inexpressibilityCopyright and License:
![[Uncaptioned image]](x1.png)
2012 ACM Subject Classification:
Information systems Graph-based database modelsFunding:
We acknowledge support from: VeriGraph project, ANR-21-CE48-0015 (L. Libkin); Israel Science Foundation 2355/24 (L. Peterfreund); NCN grant 2018/30/E/ST6/00042 (A. Rogova).Editors:
Sudeepa Roy and Ahmet KaraSeries and Publisher:

1 Introduction
One of the key goals of the design of the new standard Graph Query Language (ISO/IEC 39075:2024 Information technology – Database languages – GQL, www.iso.org/standard/76120.html) was to overcome Cypher’s perceived limitation with respect to regular path queries (RPQs). One finds statements that Cypher falls short of the full power of RPQs in many sources and surveys, e.g. [5, 2, 1]. There is a strong intuition behind this statement: in Cypher, the use of Kleene star is limited to edge labels. Essentially, one can say that there is a path of edges labeled between two nodes and , but it seems that one cannot say that there is a path so that each part of this path between and for , satisfies a pattern more complex than a labeled edge.
This observation led to the substantial extension of GQL’s pattern matching, which also coincide with available pattern matching in SQL/PGQ, a newly standardized extension of SQL with mechanisms for property graph querying (see their informal description in [3]). Namely, a bounded (between and times for ) or unbounded (at least times) repetition can be applied to an arbitrary pattern (currently such patterns cannot themselves contain repetitions, but a language opportunity111 In SQL and GQL standards, this means the committee plans to revisit a specific feature. exists to remove this restriction).
This substantial extension of the language was based on a belief that arbitrary regular properties of paths cannot be expressed in Cypher. Our goal is to justify this decision by providing a proof of the widely believed – but hitherto unproven – statement.
2 Graph databases and Cypher patterns: an abstraction
Cypher operates on property graphs: these are labeled graphs, potentially with multiple edges between two nodes, where both edges and nodes carry properties given as key/value pairs. The latter play no role in comparison with RPQs which only refer to labels; thus we use a simple model of graph databases where properties are disregarded.
Graph databases.
Assume pairwise disjoint countable sets of node ids, of edge ids and of labels. A graph database is a tuple where
-
is a finite set of node ids used in ;
-
is a finite set of directed edge ids used in ;
-
is a labeling function that associates with every id a label from ;
-
define source and target of an edge.
In real Cypher, can assign sets of labels to a node, and in GQL it can assign sets of labels to edges as well, but since the separating example does not depend on these features (which can also be modeled by considering as the new set of labels), we do not use them here.
A path in is an alternating sequence , for , of nodes and edges that starts and ends with a node and so that each edge connects to for . That is, either is a forward edge with and , or a backward edgewith and . If , the path consists of a single node . We explicitly write out paths as . Two paths and concatenate, if , in which case their concatenation is defined as .
Cypher Patterns.
We refine an abstraction of patterns of GQL from [4], omitting the GQL specific features, and adding Cypher patterns matching repeated edges of a given label. Let be a countably infinite set of variables. Patterns are defined by
Let be the set of all variables mentioned in . The syntactic conditions are:
-
conditions in are given by ; all variables mentioned in must occur in .
-
is only defined when .
In Cypher variables and labels in edge/node patterns are optional, but we include them for simplicity. This does not affect the separating example in which we assume one fixed label for all nodes/edges; the use of variables does not affect expressibility of RPQs. Also, Cypher’s repetitions of labeled edges of the form (between and ) or (at least ), or (at most ) are all expressible with concatenation, union, and Kleene star.
Semantics of Patterns.
The semantics of a path pattern , with respect to a graph database , is a set of pairs where is a path and is a mapping as defined in Fig. 1. Two partial mappings are joinable if for each shared . Their join is then unambiguously defined as the mapping that coincides with for in the domain of and for in the domain of . In the figure, we omit the standard conditions for for Boolean connectives.
3 Cypher Vs. RPQs
Recall that an RPQ is a regular expression over the labels . The result of in , written , is the set of pairs of nodes such that there is a path with the word being in the regular language of .
A Cypher pattern with two designated variables is said to express an RPQ in if .
Theorem 1.
Cypher patterns, as defined above, cannot express all RPQs. In particular they cannot express the pattern testing for an even-length path of edges labeled .
In other words, the regular path query cannot be expressed in Cypher.
Proof.
Consider graphs with and so that and for (i.e., directed paths), with each edge labelled . We can therefore assume that all edge labels used in patterns are (if not, such a pattern is not matched, and thus the entire subpattern in which it occurred cannot be matched, up to in the parse tree, and thus can be removed). We can also further assume that no variable is used as both a node variable and an edge variable (as this would falsify the pattern), nor any explicit equality between such variables is used in conditional patterns.
We now represent such graphs as first-order structures in the vocabulary with the universe , and relations interpreted as follows:
-
is the edge relation;
-
is the reflexive transitive closure of .
We next show how patterns are translated into first-order formulae over this vocabulary. We use for convenience, as it is definable from which is isomorphic to a linear order on . We will then easily obtain the inexpressibility results since FO cannot define even cardinality of linear orders.
For the translation, with each pattern we associate two new variables (intuitively, to be witnessed by the endpoints of patterns), and with each edge variable used in a pattern we associate two first-order variables to be used in FO formulas (for source and target of edges). Then a pattern with node variables and edge variables (all distinct) is translated into an FO formula The condition on the translation is that for a path we have
(1) |
Now suppose the pattern is definable in Cypher over graphs by a pattern as above. Then is true for iff the path between them is of even length and therefore the sentence states the path from to is of even length, which is impossible.
Next, to conclude the proof, we present the translation, recursively.
-
If then .
-
If then .
-
If then .
-
If then
-
if then .
-
If with translated as and respectively (where list variables in those formulae corresponding to node and edge variables in patterns), then is defined as
where in repeated variables are mentioned only once.
-
If with translated as and (note that variables must be the same as the schemas of and coincide), then .
-
If then where is obtained from by the following transformations:
-
–
each condition stays;
-
–
each condition is replaced by ;
-
–
these are propagated through the Boolean connectives.
-
–
It is straightforward to verify that these translations satisfy (1), completing the proof.
The proof shows that on graphs , Cypher patterns fall far short of RPQs. The latter can express every regular property of languages in , or in other words test if belongs to a set which is a finite union of arithmetic progressions. For Cypher patterns, on the other hand, the first-order definability of a pattern in the theory of order implies the existence of the threshold such that either is selected by for all , or is not selected by for all .
4 Conclusions
With formal models of GQL, SQL/PGQ, and Cypher finally available, this note is an example of how research in database theory can affect the design of new languages. When it comes to graph languages, industrial developments are far ahead of academic research, creating opportunities for the academic community to develop tools to evaluate decisions already made and lay a solid foundation for new language features in upcoming editions of the standards.
References
- [1] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoč. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5):68:1–68:40, 2017. doi:10.1145/3104031.
- [2] Angela Bonifati, George H. L. Fletcher, Hannes Voigt, and Nikolay Yakovets. Querying Graphs. Morgan & Claypool Publishers, 2018. doi:10.2200/S00873ED1V01Y201808DTM051.
- [3] Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Hannes Voigt, Oskar van Rest, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. Graph pattern matching in GQL and SQL/PGQ. In SIGMOD, pages 1–12. ACM, 2022. arXiv:2112.06217.
- [4] Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, and Domagoj Vrgoc. GPC: A pattern calculus for property graphs. In Floris Geerts, Hung Q. Ngo, and Stavros Sintos, editors, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023, pages 241–250. ACM, 2023. doi:10.1145/3584372.3588662.
- [5] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data, pages 1433–1445, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3183713.3190657.