
An Efficient Heuristic for Graph Edit Distance

Xiaoyang Chen ORCID Department of Computer Science, Xidian University, Xi’an, China
Yujia Wang ORCID Department of Computer Science, Xidian University, Xi’an, China
Hongwei Huo (corresponding author) ORCID Department of Computer Science, Xidian University, Xi’an, China
Jeffrey Scott Vitter (corresponding author) ORCID Department of Computer Science, Tulane University, New Orleans, LA, USA; The University of Mississippi, MS, USA
Abstract

The graph edit distance (GED) is a flexible distance measure widely used in many applications. Existing GED computation methods are usually based upon a tree-based search algorithm that explores all possible vertex (or edge) mappings between the two compared graphs. During this process, various GED lower bounds are adopted as heuristic estimations to accelerate the tree-based search. For the first time, we analyze the relationship among three state-of-the-art GED lower bounds: the label edit distance (LED), the Hausdorff edit distance (HED), and the branch edit distance (BED). Specifically, we demonstrate that BED(G,Q) ≥ HED(G,Q) and BED(G,Q) ≥ LED(G,Q) for any two undirected graphs G and Q. Furthermore, based upon BED we propose an efficient heuristic BED+ for improving the tree-based search algorithm. Extensive experiments on real and synthetic datasets confirm that BED+ achieves smaller deviations and larger solvable ratios than LED, HED, and BED when they are employed as heuristic estimations. The source code is available online.

Keywords and phrases:
Graph edit distance, Label edit distance, Hausdorff edit distance, Branch edit distance, Tree-based search, Heuristics
Category:
Research
Copyright and License:
[Uncaptioned image] © Xiaoyang Chen, Yujia Wang, Hongwei Huo, and Jeffrey Scott Vitter; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Information systems Query optimization
Supplementary Material:
Software  (Source Code): https://github.com/Hongweihuo-Lab/Heur-GED [11]
Funding:
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62272358.
Editors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter

1 Introduction

Graphs are frequently used to represent a wide variety of objects, such as networks, maps, handwriting, molecular compounds, and protein structures. The process of evaluating the similarity of two graphs is referred to as error-tolerant graph matching, which aims to find a correspondence between their vertices. In this paper, we focus upon the similarity measure graph edit distance (GED), because it can be applied to all types of graphs and can precisely capture structural differences between the compared graphs. The GED of two graphs is defined as the minimum cost of transforming one graph into the other through a sequence of edit operations (inserting, deleting, and substituting vertices or edges). An edit cost is assigned to each edit operation to measure its strength; these costs can be obtained by incorporating domain-specific knowledge or by learning from a set of sample graphs [14].

However, computing the GED is an NP-hard problem [30], and its computation is usually based upon the tree-based search algorithm. This search tree enumerates all possible mappings between the vertices (or edges) of two compared graphs, where the inner nodes denote partial mappings and the leaf nodes denote complete mappings. Most existing GED computation methods employ different search paradigms to traverse this search tree to seek the optimal mapping that induces the GED. Riesen et al. [25, 26] proposed the standard method, A-GED, based upon the best-first search paradigm. It needs to store numerous inner nodes, resulting in high memory consumption. To overcome this bottleneck, Abu-Aisheh et al. [3] proposed a depth-first search based algorithm, DF-GED, whose memory requirement increases linearly with the number of vertices of the graphs. On the other hand, Chen et al. [9] introduced a method for the GED computation based upon beam-stack search [28], achieving a flexible tradeoff between memory consumption and the time overhead of backtracking in the depth-first search. Chang et al. [8] developed a unified framework that can be instantiated into either a best-first search approach or a depth-first search approach. Gouda et al. [17] proposed a novel edge-mapping based approach, CSI_GED, which also employs the depth-first search paradigm. CSI_GED works only for the uniform cost model, and Blumenthal et al. [4, 6] generalized it to cover the non-uniform cost model. Kim [19] developed an efficient GED computation algorithm using isomorphic vertices [9]. Liu et al. [22] explored a learning-based method for the approximate GED computation. Piao et al. [23] proposed a deep-learning method for the GED computation. It is worth mentioning that many researchers have proposed various indexing techniques [31, 8, 10] to accelerate graph similarity searches under the GED metric. They use the above GED computation methods as the final phase to verify the candidate graphs that satisfy the GED constraint.

In the tree-based search algorithm, a heuristic estimation is usually adopted to prune the useless search space to accelerate the search process. In order to ensure that the optimal mapping is not erroneously pruned, this heuristic function must be admissible; namely, it estimates the cost of a tree node that is less than or equal to the real cost. In the previous works A-GED and DF-GED, they adopted the label edit distance (LED) as the heuristic, which calculates the minimum substitution cost of vertices and edges of two compared graphs. After that, Fischer et al. [15, 16] proposed the Hausdorff edit distance (HED) as a heuristic estimation. HED, based upon Hausdorff matching [29], performs a bidirectional matching between two graphs and allows multiple assignments between their vertices. Recently, Blumenthal et al. [5] proposed another effective GED lower bound, branch edit distance (BED), which also can be adopted as a heuristic estimation.

As observed in other studies [7, 15, 27], the higher the heuristic estimates the cost, the better the tree-based search algorithm performs. The following question naturally arises: Which of these three state-of-the-art GED lower bounds (namely, LED, HED, or BED) is more effective? In this paper, we first analyze the relationship among these three lower bounds and then propose an effective heuristic estimation. Our contributions are summarized as follows:

  1. (1)

    We analyze the relationship among LED, HED, and BED for the first time, and we derive that BED(G,Q) ≥ HED(G,Q) and BED(G,Q) ≥ LED(G,Q) for any two undirected graphs G and Q.

  2. (2)

    We propose an efficient heuristic estimation BED+ based upon BED, and demonstrate that BED+ is still admissible.

  3. (3)

    We conduct extensive experiments to confirm BED+’s effectiveness on the real and synthetic datasets. The source code is available online [11].

The rest of this paper is organized as follows. In Section 2, we give the definition of the graph edit distance and revisit three state-of-the-art GED lower bounds. In Section 3, we theoretically analyze the relationship between LED, HED and BED. In Section 4, we propose the heuristic function BED+ for improving the GED computation. In Section 5, we report the experimental results. Finally, we summarize this paper in Section 6.

2 Graph edit distance

In this paper, we consider undirected, labeled graphs without multi-edges or self-loops. A labeled graph is a triplet G = (VG, EG, L), where VG is the set of vertices, EG is the set of edges, L: VG ∪ EG → Σ is a labeling function that assigns a label to each vertex and edge, and Σ is a set of labels. Also, we use a special symbol ε to denote a dummy vertex or a dummy edge.

Given two graphs G and Q, six edit operations [18, 25, 21, 5] can be used to transform G to Q (or vice versa): inserting or deleting a vertex or an edge, and substituting the label of a vertex or an edge. We denote the label substitution (or simply substitution) of vertices u∈VG and v∈VQ by (u→v), the deletion of u by (u→ε), and the insertion of v by (ε→v). For the three edit operations on edges, we use similar notation.

An edit path 𝒫 = ⟨p1, p2, …, pk⟩ between G and Q is a sequence of edit operations that transforms G to Q, such as G = G0 →^{p1} ⋯ → Gi →^{pi+1} Gi+1 → ⋯ →^{pk} Gk = Q, where graph Gi+1 is obtained by performing the edit operation pi+1 on graph Gi, for 0 ≤ i ≤ k−1. During this transformation, each edit operation pi is assigned a penalty cost c(pi) to reflect how strongly it changes a graph. Note that the cost of editing two dummy vertices (or edges) is 0; that is, c(ε→ε) = 0. Thus, 𝒫’s edit cost is defined as Σ_{i=1}^{k} c(pi). We define the graph edit distance as follows:

Definition 1.

Given two graphs G and Q, the graph edit distance between them, denoted by 𝑔𝑒𝑑(G,Q), is defined as the minimum cost of transforming G to Q, namely,

ged(G,Q) = min_{𝒫∈Υ(G,Q)} Σ_{pi∈𝒫} c(pi)   (1)

where Υ(G,Q) is the set of all edit paths between G and Q, and c(pi) is edit operation pi’s cost.
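To make Definition 1 concrete, the following sketch computes the exact GED of two tiny graphs by exhausting all bijections between the expanded vertex sets. It assumes the uniform cost model (every non-trivial edit costs 1) and a dict-based graph representation; both are illustrative choices of ours, not prescribed by the paper:

```python
from itertools import permutations

def ged_uniform(V1, E1, V2, E2):
    """Exact GED under the uniform cost model, by brute force over all
    bijections between the expanded vertex sets; feasible only for tiny
    graphs.  V1, V2: dict vertex -> label; E1, E2: dict frozenset({u, v}) -> label."""
    left = list(V1) + [None] * len(V2)            # None plays the role of epsilon
    best = float("inf")
    for perm in permutations(list(V2) + [None] * len(V1)):
        phi = {u: v for u, v in zip(left, perm) if u is not None}
        cost = 0
        for u, v in phi.items():                  # vertex deletions / substitutions
            cost += 1 if v is None else int(V1[u] != V2[v])
        cost += sum(1 for v in V2 if v not in phi.values())   # vertex insertions
        covered = set()
        for e, lab in E1.items():                 # edge deletions / substitutions
            u, v = tuple(e)
            if phi[u] is None or phi[v] is None:
                cost += 1                         # an endpoint was deleted
                continue
            img = frozenset({phi[u], phi[v]})
            if img in E2:
                covered.add(img)
                cost += int(lab != E2[img])
            else:
                cost += 1                         # image pair is a non-edge in Q
        cost += sum(1 for e in E2 if e not in covered)        # edge insertions
        best = min(best, cost)
    return best
```

For a triangle versus a path over the same three identically labeled vertices, the sketch returns 1 (one edge deletion), matching the intuition behind Definition 1.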

Hereafter, for ease of presentation, we denote by VGε = VG ∪ {ε, …, ε} (with |VQ| copies of ε) and VQε = VQ ∪ {ε, …, ε} (with |VG| copies of ε) the expanded sets of VG and VQ, respectively, so that VGε and VQε have the same number of vertices. Similarly, EGε = EG ∪ {ε, …, ε} (with |EQ| copies of ε) and EQε = EQ ∪ {ε, …, ε} (with |EG| copies of ε) denote the expanded sets of EG and EQ, respectively.

2.1 State-of-the-art GED lower bounds

Below we introduce three state-of-the-art GED lower bounds, which can be used as heuristic estimations in the tree-based search algorithm to compute GED. Each of the methods gives a lower bound on GED because the operations are done in sets that do not have to be consistent with one another. For example, in the first method LED described below, the edit operations on the vertex labels can be done independently of the edit operations on the edge labels, and thus they may not be globally consistent.

Label Edit Distance.

Riesen et al. [27, 26] proposed the label edit distance (LED), which is the minimum cost of substituting vertices and edges of two graphs.

Definition 2 (Label edit distance).

Given two graphs G and Q, the label edit distance between them is defined as LED(G,Q) = λV(G,Q) + λE(G,Q), where λV(G,Q) = min_{ϕ:VGε→VQε} Σ_{u∈VGε} c(u→ϕ(u)) is the minimum cost of substituting vertices of G and Q, and ϕ is a bijection from VGε to VQε; and λE(G,Q) = min_{φ:EGε→EQε} Σ_{e(u,u′)∈EGε} c(e(u,u′)→φ(e(u,u′))) is the minimum cost of substituting edges of G and Q, and φ is a bijection from EGε to EQε.
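As a sketch of Definition 2: LED decouples into two independent minimum-cost assignment problems, one over the expanded vertex sets and one over the expanded edge sets. The toy code below solves each assignment by brute force over permutations and assumes uniform costs and a hypothetical dict-based graph representation; a real implementation would use the Hungarian algorithm (O(n³)) instead:

```python
from itertools import permutations

def min_assignment(xs, ys, cost):
    """Minimum-cost bijection between the expanded sets (None = epsilon),
    by brute force; production code would use the Hungarian algorithm."""
    left = list(xs) + [None] * len(ys)
    right = list(ys) + [None] * len(xs)
    return min(sum(cost(a, b) for a, b in zip(left, p))
               for p in permutations(right))

def led_uniform(V1, E1, V2, E2):
    """LED(G, Q) = lambda_V(G, Q) + lambda_E(G, Q) under uniform costs."""
    def vcost(u, v):
        if u is None and v is None:
            return 0                              # epsilon -> epsilon is free
        if u is None or v is None:
            return 1                              # deletion / insertion
        return int(V1[u] != V2[v])                # substitution
    def ecost(e1, e2):
        if e1 is None and e2 is None:
            return 0
        if e1 is None or e2 is None:
            return 1
        return int(E1[e1] != E2[e2])
    return (min_assignment(list(V1), list(V2), vcost)
            + min_assignment(list(E1), list(E2), ecost))
```

Because vertices and edges are matched independently, the two optima need not come from a single consistent mapping, which is exactly why LED only lower-bounds GED.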

Hausdorff Edit Distance.

Inspired by the Hausdorff distance [29] between two finite sets, Fischer et al. [16] proposed the Hausdorff edit distance (HED) between two graphs G and Q. The key ideas of HED are to perform a bidirectional matching between G and Q and to allow multiple assignments between their vertices.

Definition 3 (Hausdorff edit distance).

Given two graphs G and Q, their Hausdorff edit distance is defined as HED(G,Q) = Σ_{u∈VG} min_{v∈VQ∪{ε}} fH(u,v) + Σ_{v∈VQ} min_{u∈VG∪{ε}} fH(u,v), where fH(u,v) is the Hausdorff cost of matching vertex u to vertex v.

The Hausdorff vertex matching cost fH(u,v) considers not only the two vertices uVG and vVQ but also their neighboring edges.

Definition 4 (Neighboring edges).

Given graph G and a vertex u∈VG, the neighboring edges Nu of u are defined as Nu = {e(u,u′) : u′∈VG ∧ e(u,u′)∈EG}.

We define fH(u,v) as

fH(u,v) =
  c(u→ε) + (1/2) Σ_{e1∈Nu} c(e1→ε)   if v = ε;
  c(ε→v) + (1/2) Σ_{e2∈Nv} c(ε→e2)   if u = ε;
  (c(u→v) + HED(Nu,Nv)/2) / 2        otherwise.   (2)

Similarly to Definition 3, the Hausdorff edit distance 𝐻𝐸𝐷(Nu,Nv) between Nu and Nv is defined as

HED(Nu,Nv) = Σ_{e1∈Nu} min_{e2∈Nv∪{ε}} fH(e1,e2) + Σ_{e2∈Nv} min_{e1∈Nu∪{ε}} fH(e1,e2)   (3)

where fH(e1,e2) is the cost of matching two edges such that

fH(e1,e2) =
  c(e1→ε)      if e2 = ε;
  c(ε→e2)      if e1 = ε;
  c(e1→e2)/2   otherwise.   (4)
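A direct transcription of equations (2)–(4) might look as follows; it assumes the uniform cost model (all non-trivial edits cost 1) and a dict-based graph representation, both illustrative choices rather than the paper’s general setting:

```python
def hed_uniform(V1, E1, V2, E2):
    """Hausdorff edit distance (Definition 3) with uniform costs.
    V1, V2: dict vertex -> label; E1, E2: dict frozenset({u, v}) -> label."""
    def nbr(u, E):                       # labels of u's neighboring edges (Def. 4)
        return [lab for e, lab in E.items() if u in e]
    def fe(l1, l2):                      # edge matching cost, eq. (4)
        if l1 is None or l2 is None:     # deletion or insertion
            return 1.0
        return (l1 != l2) / 2            # substitution cost is halved
    def hed_edges(N1, N2):               # HED between neighboring-edge sets, eq. (3)
        return (sum(min([fe(l1, l2) for l2 in N2] + [fe(l1, None)]) for l1 in N1)
              + sum(min([fe(l1, l2) for l1 in N1] + [fe(None, l2)]) for l2 in N2))
    def fh(u, v):                        # Hausdorff vertex matching cost, eq. (2)
        if v is None:
            return 1 + len(nbr(u, E1)) / 2
        if u is None:
            return 1 + len(nbr(v, E2)) / 2
        sub = int(V1[u] != V2[v])
        return (sub + hed_edges(nbr(u, E1), nbr(v, E2)) / 2) / 2
    return (sum(min([fh(u, v) for v in V2] + [fh(u, None)]) for u in V1)
          + sum(min([fh(u, v) for u in V1] + [fh(None, v)]) for v in V2))
```

Because each vertex of G independently picks its cheapest partner (and vice versa), multiple vertices may share one partner; this is what makes HED cheap to compute (quadratic time) but often looser than LED and BED.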
Branch Edit Distance.

Blumenthal et al. [5] recently proposed the branch edit distance (BED), which computes the minimum cost of editing branch structures of two graphs.

Definition 5 (Branch structure).

The branch structure of vertex u in graph G is defined as Bu=(u,Nu), where Nu is the set of neighboring edges of u.

Given two branch structures Bu and Bv, the minimum cost of editing Bu into Bv is defined as

fB(u,v) = c(u→v) + (1/2) min_{ϱ:Nuε→Nvε} Σ_{e(u,u′)∈Nuε} c(e(u,u′)→ϱ(e(u,u′))),   (5)

where Nuε = Nu ∪ {ε, …, ε} (with |Nv| copies of ε) and Nvε = Nv ∪ {ε, …, ε} (with |Nu| copies of ε) are the expanded sets of Nu and Nv, respectively, and ϱ is a bijection from Nuε to Nvε.

Definition 6 (Branch edit distance).

Given two graphs G and Q, the branch edit distance between them is defined as BED(G,Q) = min_{ρ:VGε→VQε} Σ_{u∈VGε} fB(u,ρ(u)), where ρ is a bijection from VGε to VQε, and fB(·,·) is defined in (5).
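Definition 6 can be read as two nested assignment problems: an outer one over the expanded vertex sets with the branch cost fB of equation (5), each evaluation of which solves an inner assignment over neighboring edges. The sketch below again assumes uniform costs and solves both assignments by brute force; the paper’s complexity presumes a Hungarian-style solver instead:

```python
from itertools import permutations

def min_assign(xs, ys, cost):
    """Brute-force minimum-cost bijection between expanded sets (None = epsilon)."""
    left = list(xs) + [None] * len(ys)
    right = list(ys) + [None] * len(xs)
    return min(sum(cost(a, b) for a, b in zip(left, p))
               for p in permutations(right))

def bed_uniform(V1, E1, V2, E2):
    """Branch edit distance (Definition 6) with uniform costs."""
    def nbr(u, E):
        return [] if u is None else [lab for e, lab in E.items() if u in e]
    def ec(l1, l2):                       # edge edit cost
        if l1 is None and l2 is None:
            return 0
        if l1 is None or l2 is None:
            return 1
        return int(l1 != l2)
    def fb(u, v):                         # branch structure edit cost, eq. (5)
        if u is None and v is None:
            return 0.0
        vc = 1 if (u is None or v is None) else int(V1[u] != V2[v])
        return vc + min_assign(nbr(u, E1), nbr(v, E2), ec) / 2
    return min_assign(list(V1), list(V2), fb)
```

On the triangle-versus-path example the three optimal branch costs are 0, 1/2, and 1/2, so BED = 1; here BED matches the exact GED, while HED on the same pair evaluates to 0.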

3 Tightness analysis

In this section, we analyze the tightness of the three GED lower bounds: LED, HED, and BED. Specifically, we will prove that BED is the strongest of all; that is, for any two undirected graphs G and Q, we have BED(G,Q) ≥ LED(G,Q) and BED(G,Q) ≥ HED(G,Q).

3.1 Relation of LED and BED

Theorem 7.

Given two graphs G and Q, we have BED(G,Q) ≥ LED(G,Q).

Proof.

For ease of proof, we insert dummy vertices and edges into G to make it a complete graph with (|VG|+|VQ|) vertices. Similarly, we transform Q into a complete graph that also has (|VG|+|VQ|) vertices. Then, we can simplify (5) as

fB(u,v) = c(u→v) + (1/2) min_{ϱ:Nuε→Nvε} Σ_{e(u,u′)∈Nuε} c(e(u,u′)→ϱ(e(u,u′)))
        = c(u→v) + (1/2) min_{ζ:VG\{u}→VQ\{v}} Σ_{u′∈VG\{u}} c(e(u,u′)→e(v,ζ(u′)))
        = c(u→v) + (1/2) Σ_{u′∈VG\{u}} c(e(u,u′)→e(v,ζmin^{u,v}(u′)))

where ζmin^{u,v} is the bijection from VG\{u} to VQ\{v} for which fB(u,v) achieves its minimum value. Thus, we have

BED(G,Q) = min_{ρ:VG→VQ} Σ_{u∈VG} fB(u,ρ(u))
 = Σ_{u∈VG} { c(u→ρmin(u)) + (1/2) Σ_{u′∈VG\{u}} c(e(u,u′)→e(ρmin(u), ζmin^{u,ρmin(u)}(u′))) }
 = Σ_{u∈VG} c(u→ρmin(u)) + (1/2) Σ_{u∈VG} Σ_{u′∈VG\{u}} c(e(u,u′)→e(ρmin(u), ζmin^{u,ρmin(u)}(u′)))
 = Σ_{u∈VG} c(u→ρmin(u)) + Σ_{e(u,u′)∈EG} c(e(u,u′)→ξ(e(u,u′)))
 ≥ min_{ϕ:VG→VQ} Σ_{u∈VG} c(u→ϕ(u)) + min_{φ:EG→EQ} Σ_{e(u,u′)∈EG} c(e(u,u′)→φ(e(u,u′)))
 = λV(G,Q) + λE(G,Q) = LED(G,Q)

where ρmin is the bijection from VG to VQ for which BED(G,Q) achieves its minimum value, and ξ is the bijection from EG to EQ satisfying ξ(e(u,u′)) = e(ρmin(u), ζmin^{u,ρmin(u)}(u′)) for u∈VG and u′∈VG\{u}.

3.2 Relation of HED and BED

Lemma 8.

Given two vertices u∈VGε and v∈VQε, we have

fH(u,v) ≤ fB(u,v)         if u = ε or v = ε;
fH(u,v) ≤ (1/2) fB(u,v)   otherwise,

where fH(u,v) and fB(u,v) are defined in (2) and (5), respectively.

The proof of Lemma 8 is in Appendix A.

Theorem 9.

Given two graphs G and Q, we have BED(G,Q) ≥ HED(G,Q).

Proof.

By fH(u,v)’s definition in (2), we know that when u = ε, min_{v∈VQ∪{ε}} fH(ε,v) = fH(ε,ε) = 0; similarly, when v = ε, min_{u∈VG∪{ε}} fH(u,ε) = fH(ε,ε) = 0. We can rewrite HED(G,Q) as

HED(G,Q) = Σ_{u∈VG} min_{v∈VQ∪{ε}} fH(u,v) + Σ_{v∈VQ} min_{u∈VG∪{ε}} fH(u,v)   (6)
         = Σ_{u∈VGε} min_{v∈VQε} fH(u,v) + Σ_{v∈VQε} min_{u∈VGε} fH(u,v)
         = Σ_{u∈VGε} fH(u,π1(u)) + Σ_{v∈VQε} fH(π2(v),v)
         = Σ_{u∈VGε} { fH(u,π1(u)) + fH(π2(ρmin(u)),ρmin(u)) },

where π1 is a mapping from VGε to VQε satisfying π1(u) = argmin_{v∈VQε} fH(u,v); π2 is a mapping from VQε to VGε satisfying π2(v) = argmin_{u∈VGε} fH(u,v); and ρmin is the bijection from VGε to VQε for which BED(G,Q) achieves its minimum value. We know that

BED(G,Q) = Σ_{u∈VGε} fB(u,ρmin(u)).   (7)

By (6) and (7), we can complete the proof by showing that

fH(u,π1(u)) + fH(π2(ρmin(u)),ρmin(u)) ≤ fB(u,ρmin(u)).

We do so by considering the following four exhaustive cases:

Case I.

When u = ε and ρmin(u) = ε, then fH(u,π1(u)) + fH(π2(ρmin(u)),ρmin(u)) ≤ fH(ε,ε) + fH(ε,ε) = fB(ε,ε) = 0, by the definitions of π1, π2, and ρmin.

Case II.

When u ≠ ε and ρmin(u) = ε, then fH(u,π1(u)) + fH(π2(ρmin(u)),ρmin(u)) ≤ fH(u,ε) + fH(ε,ε) = fH(u,ε), by the definitions of π1, π2, and ρmin. By Lemma 8, we know that fH(u,ε) ≤ fB(u,ε) = fB(u,ρmin(u)).

Case III.

When u = ε and ρmin(u) ≠ ε, the analysis is similar to that of Case II.

Case IV.

When u ≠ ε and ρmin(u) ≠ ε, then we have fH(u,π1(u)) ≤ fH(u,ρmin(u)) and fH(π2(ρmin(u)),ρmin(u)) ≤ fH(u,ρmin(u)), by the definitions of π1, π2, and ρmin. By Lemma 8, we know that fH(u,ρmin(u)) ≤ (1/2) fB(u,ρmin(u)). Thus, we have fH(u,π1(u)) + fH(π2(ρmin(u)),ρmin(u)) ≤ 2 × (1/2) fB(u,ρmin(u)) = fB(u,ρmin(u)).

4 Tree-based search algorithm

The previous section showed that BED achieves the tightest GED lower bound. In this section, based upon BED we propose an efficient heuristic estimation to improve the tree-based search algorithm [2, 3] for the GED computation.

4.1 Search tree

Computing the GED of graphs G and Q is typically based upon a tree-based search procedure that explores all possible graph mappings from G to Q. Starting from a dummy root node, root = ∅, we logically create the search tree layer by layer by iteratively generating successors using BasicGenSuccr [9]. This search space can be organized as an ordered search tree, where the inner nodes denote partial graph mappings and the leaf nodes denote complete graph mappings; the tree is created dynamically at runtime by linking each generated successor to the currently considered node. For more details, please refer to Section 2 of [9].

4.2 Heuristic cost estimation

For a node r in the search tree, let h(r) be the estimated cost from r to its descendant leaf nodes; to be admissible, this estimate must be less than or equal to the real cost. Based upon BED, we introduce how to estimate h(r) in the tree-based search algorithm.

4.2.1 Heuristic function

Consider an inner node r = {(u1→vj1), …, (uℓ→vjℓ)}, where vjk is uk’s mapped vertex, for 1 ≤ k ≤ ℓ. We divide G into two subgraphs Gr1 and Gr2, where Gr1 is the mapped part of G such that VGr1 = {u1, …, uℓ} and EGr1 = {e(u,v) : u,v∈VGr1 ∧ e(u,v)∈EG}, and Gr2 is the unmapped part such that VGr2 = VG\VGr1 and EGr2 = {e(u,v) : u,v∈VGr2 ∧ e(u,v)∈EG}. We obtain Qr1 and Qr2 similarly.
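For illustration, this split can be sketched as follows (the dict-based representation and function names are our own, not taken from the paper). Note that edges crossing the cut between the mapped and unmapped parts belong to neither subgraph:

```python
def split_graph(V, E, mapped):
    """Split G into the mapped part G_r^1 (vertices in the partial mapping r)
    and the unmapped part G_r^2.  V: dict vertex -> label; E: dict
    frozenset({u, v}) -> label; mapped: set of already-mapped vertices."""
    V1 = {u: lab for u, lab in V.items() if u in mapped}
    V2 = {u: lab for u, lab in V.items() if u not in mapped}
    E1 = {e: lab for e, lab in E.items() if e <= mapped}           # both ends mapped
    E2 = {e: lab for e, lab in E.items() if not (e & mapped)}      # both ends unmapped
    return (V1, E1), (V2, E2)
```

For a path a–b–c with a and b already mapped, the edge e(b,c) crosses the cut and ends up in neither subgraph; covering the cost of such cross edges is exactly the purpose of the improved heuristic.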

Clearly, the lower bound BED(Gr2,Qr2) can be used to estimate r’s cost. However, BED(Gr2,Qr2) does not cover the potential edit cost of the edges between Gr1 (resp., Qr1) and Gr2 (resp., Qr2). Recently, [8, 9] proposed two different methods to cover this potential cost; nevertheless, these two methods work only for the uniform cost function (i.e., one for which the cost of each edit operation is 1). We extend the method in [8] to support any cost function.

Definition 10.

Given vertices u∈VGr2 and v∈VQr2, we define the cost of matching u to v as fB+(u,v) = fB(u,v) + Σ_{u′∈VGr1} c(e(u,u′)→e(v,v′)), where v′ is the mapped vertex of the already processed vertex u′, and fB(u,v) is the minimum cost of transforming Bu to Bv, which we defined in (5).

When there is no edge between u and u′, we set e(u,u′) = ε, and similarly for e(v,v′). Based upon fB+(u,v), we define the improved lower bound BED+ as

BED+(Gr2,Qr2) = min_{ρ:VGr2ε→VQr2ε} Σ_{u∈VGr2ε} fB+(u,ρ(u))   (8)
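A sketch of the correction term in Definition 10: for every already-mapped vertex u′ with image v′, we add the cost of editing the cross edge e(u,u′) into e(v,v′), with absent edges treated as ε. The branch cost fB is passed in as a callable, and uniform costs are assumed for the cross-edge term (an illustrative choice of ours, not the paper’s general cost model):

```python
def fb_plus(u, v, partial, E1, E2, fb):
    """fB+(u, v) = fB(u, v) + sum over mapped u' of c(e(u, u') -> e(v, v')),
    where v' = partial[u'] (Definition 10).  Missing edges act as epsilon.
    E1, E2: dict frozenset({x, y}) -> label; fb: the branch cost of eq. (5)."""
    extra = 0
    for up, vp in partial.items():
        l1 = E1.get(frozenset({u, up}))       # label of e(u, u'), or None (= epsilon)
        l2 = E2.get(frozenset({v, vp}))       # label of e(v, v'), or None
        if l1 is None and l2 is None:
            continue                          # epsilon -> epsilon costs 0
        extra += 1 if (l1 is None or l2 is None) else int(l1 != l2)
    return fb(u, v) + extra
```

Since the correction term is nonnegative, fB+(u,v) ≥ fB(u,v), which is all that Theorem 11 needs.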
Theorem 11.

Given a node r in the GED search tree of graphs G and Q, we have BED+(Gr2,Qr2) ≥ BED(Gr2,Qr2), where Gr2 and Qr2 are the unmapped subgraphs of G and Q, respectively, and BED+(·,·) and BED(·,·) are defined in (8) and Definition 6, respectively.

Proof.

This theorem follows immediately, since fB+(u,v) ≥ fB(u,v) for all u∈VGr2 and v∈VQr2.

Theorem 12.

Given a descendant leaf node s of r, the heuristic estimation h(r) = BED+(Gr2,Qr2) is admissible; that is, h(r) ≤ g(s) − g(r), where g(·) gives the incurred cost from the root node to the currently considered node.

Proof.

For ease of proof, we insert dummy vertices and edges into G to transform it to a complete graph with (|VG|+|VQ|) vertices. Similarly, we transform Q to a complete graph that also has (|VG|+|VQ|) vertices.

Consider an internal node r = {(u1→vj1), …, (uℓ→vjℓ)} in the search tree, where vjk is the mapped vertex of uk, for 1 ≤ k ≤ ℓ. For ease of presentation, hereafter we use r(uk) to denote uk’s mapped vertex, i.e., vjk = r(uk). Given a descendant leaf node s of r (i.e., s is a complete vertex mapping from G to Q), the incurred cost of s is

g(s) = Σ_{u∈VG} c(u→s(u)) + (1/2) Σ_{u∈VG} Σ_{u′∈VG\{u}} c(e(u,u′)→e(s(u),s(u′)))   (9)
     = Σ_{u∈VG} c(u→s(u)) + Σ_{{u,u′}⊆VG} c(e(u,u′)→e(s(u),s(u′)))

As we know, r induces an edit path transforming Gr1 to Qr1, where Gr1 and Qr1 are the already mapped subgraphs of G and Q, respectively, with VGr1 = {u1, …, uℓ} and VQr1 = {r(u1), …, r(uℓ)}. According to (9), we know that

g(r) = Σ_{u∈VGr1} c(u→r(u)) + Σ_{{u,u′}⊆VGr1} c(e(u,u′)→e(r(u),r(u′)))

Let ω = s\r be the partial mapping that contains the vertex mapping pairs belonging to s but not to r; namely, ω = {(u→s(u)) : u∈VG\VGr1}. We can obtain that

g(s) − g(r) = Σ_{u∈VG} c(u→s(u)) + Σ_{{u,u′}⊆VG} c(e(u,u′)→e(s(u),s(u′)))
  − { Σ_{u∈VGr1} c(u→r(u)) + Σ_{{u,u′}⊆VGr1} c(e(u,u′)→e(r(u),r(u′))) }
 = Σ_{u∈VGr2} c(u→ω(u)) + Σ_{{u,u′}⊆VGr2} c(e(u,u′)→e(ω(u),ω(u′)))
  + Σ_{u∈VGr2} Σ_{u′∈VGr1} c(e(u,u′)→e(ω(u),r(u′)))
 = Σ_{u∈VGr2} { c(u→ω(u)) + (1/2) Σ_{u′∈VGr2\{u}} c(e(u,u′)→e(ω(u),ω(u′)))
  + Σ_{u′∈VGr1} c(e(u,u′)→e(ω(u),r(u′))) }
 ≥ Σ_{u∈VGr2} fB+(u,ω(u)) ≥ min_{ρ:VGr2ε→VQr2ε} Σ_{u∈VGr2ε} fB+(u,ρ(u)) = BED+(Gr2,Qr2) = h(r),

where VGr2 = VG\VGr1 and VQr2 = VQ\VQr1. The second equality holds because, when VG is partitioned into the two disjoint sets VGr1 and VGr2, the unordered vertex pairs of VG are partitioned into the pairs within VGr1, the pairs within VGr2, and the cross pairs in VGr1 × VGr2.

We give an example in Appendix B of computing three GED lower bounds: LED, HED and BED. The same optimization that produces BED+ from BED can be applied to LED and HED to achieve enhanced heuristics LED+ and HED+, but we do not include them in this paper.

4.3 Algorithm

In this section, we show how to incorporate the heuristic estimation BED+ into the anytime GED computation algorithm [2]. We consider the anytime algorithm because it is flexible: given more running time, it outputs tighter and tighter GED upper bounds, until it eventually reaches the exact GED value.

Algorithm 1 gives the anytime algorithm for computing the GED, where tmax is the user-defined maximum running time. We perform a depth-first search over the GED search tree of G and Q to find better and better GED upper bounds until the running time #t reaches tmax. To accomplish this, we first employ the BP algorithm [25] to quickly compute an initial GED upper bound ub; then, we use a stack 𝒮 to carry out the depth-first search. Each time we pop a node q from 𝒮. If q is a leaf node, then we have found a better solution. Otherwise, we call procedure BasicGenSuccr [9] (see Appendix C) to generate q’s successors and then insert them into 𝒮. During this process, we adopt the branch-and-bound strategy to prune the useless space: for each successor r, if g(r) + h(r) ≥ ub, we can safely prune it, where h(r) = BED+(Gr2,Qr2) is defined in (8).

Algorithm 1 Anytime-based GED computation.
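The control flow of Algorithm 1 can be sketched generically as follows; `expand`, `is_leaf`, `g`, and `h` stand in for the paper’s components (BasicGenSuccr, the mapping-completeness test, the incurred cost, and BED+), and the initial upper bound would come from BP [25]. The names and the callback decomposition are ours, for illustration only:

```python
import time

def anytime_ged(initial_ub, expand, is_leaf, g, h, t_max):
    """Depth-first branch-and-bound over the mapping search tree.  Keeps
    tightening the upper bound `best` until the search space is exhausted
    (exact GED) or the time budget t_max (seconds) runs out."""
    best = initial_ub
    stack = [()]                              # root = empty partial mapping
    deadline = time.monotonic() + t_max
    while stack and time.monotonic() < deadline:
        node = stack.pop()
        if is_leaf(node):
            best = min(best, g(node))         # complete mapping: better solution
            continue
        for succ in expand(node):
            if g(succ) + h(succ) < best:      # prune when g + h >= best (admissible h)
                stack.append(succ)
    return best
```

Because h is admissible, pruning never discards the optimal mapping, so the returned value equals the exact GED whenever the loop terminates with an empty stack.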

5 Experiments

5.1 Datasets and settings

Datasets.

We chose four real datasets (GREC, MUTA, PRO, and CMU) and one synthetic dataset (SYN) for the experiments. The datasets GREC, MUTA, and PRO were taken from the IAM Graph Database Repository [24]; the CMU dataset can be found at the CMU website [13]; and the SYN dataset was generated by the synthetic graph generator GraphGen [12]. Following the same procedure as in [2, 21], we selected subsets of GREC, MUTA, and PRO as the tested datasets, where each subset consists of graphs that have the same number of vertices. Specifically, the subsets of GREC contain 5, 10, 15, and 20 vertices, respectively; the subsets of MUTA contain 10, 20, …, 70 vertices, respectively; the subsets of PRO contain 20, 30, and 40 vertices, respectively; and each subset consists of 10 graphs.

Table 1 summarizes the characteristic and applied cost function of each dataset. ED and SED are short for Euclidean distance and string edit distance functions, respectively. cv is the cost of inserting/deleting a vertex; ce is the cost of inserting/deleting an edge; cvs and ces are the costs of substituting a vertex and an edge, respectively. In addition, we introduce a parameter α to control whether edit operations on vertices or edges are more important.

Settings.

We conducted all the experiments on a HP Z800 PC running the Ubuntu 12.04 LTS operating system and equipped with a 2.67GHz CPU and 24 GB of memory. We implemented the algorithm in C++, using -O3 to compile and run it.

Table 1: Summary of characteristics of datasets and cost functions used.
Dataset #Graphs |V| |E| vertex labels edge labels cv ce α cvs ces
GREC 40 12.5 17.5 (x, y) coord. Line type 90 15 0.5 Ext. ED Dirac
MUTA 70 40 41.5 Chem. symbol Valence 11 1.1 0.25 Dirac Dirac
PRO 30 30 58.6 Type/AA-seq. Type/length 11 1 0.75 Ext. SED Dirac
CMU 111 30 79.1 None Distance 0.5 0 L1 norm
SYN 100 14.5 20 Symbol Symbol 0.3 0.5 0.75 Dirac Dirac

5.2 Evaluation metrics

We use two metrics to evaluate algorithm performance: deviation (dev) [1] and solvable ratio (sr) [9]. The metric dev measures the deviation generated by an algorithm. Formally, given two graphs G and Q, their deviation can be computed as deviation(G,Q) = |dis(G,Q) − R(G,Q)| / R(G,Q), where dis(G,Q) is the (approximate) GED value produced by the algorithm, and R(G,Q) is the best GED value produced in all the experiments on the graph database repository in [1]. Based upon the pairwise comparison model, the deviation on the dataset 𝒢 can be computed as

dev = (1 / (|𝒢|×|𝒢|)) Σ_{G∈𝒢} Σ_{Q∈𝒢} deviation(G,Q)   (10)

The metric sr measures how often the exact GED value is obtained within the maximum running time threshold tmax. Formally, let solve(G,Q) indicate whether an algorithm outputs the exact GED of G and Q within tmax time; that is, if the algorithm requires less than tmax time to output the GED, then solve(G,Q) = 1; otherwise, solve(G,Q) = 0. The solvable ratio (sr) on the dataset 𝒢 can be computed as

sr = (1 / (|𝒢|×|𝒢|)) Σ_{G∈𝒢} Σ_{Q∈𝒢} solve(G,Q)   (11)

Obviously, a smaller dev and a larger sr reflect better performance of an algorithm.
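For concreteness, the two metrics could be computed over a list of per-pair results as follows; the tuple layout and the use of the recorded runtime as the solvability test are our assumptions, not specified by the paper:

```python
def dataset_metrics(results, t_max):
    """results: one (dis, ref, runtime) tuple per ordered graph pair (G, Q),
    where dis is the algorithm's GED value, ref the best known value R(G, Q),
    and runtime the time needed to prove optimality.  Returns (dev, sr) as in
    equations (10) and (11)."""
    n = len(results)
    dev = sum(abs(dis - ref) / ref for dis, ref, _ in results) / n
    sr = sum(1 for _, _, t in results if t < t_max) / n
    return dev, sr
```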

5.3 Experimental results

As described earlier in this paper, we first analyzed the relation of three GED lower bounds (i.e., LED, HED and BED). Then based upon BED we proposed an efficient heuristic estimation BED+. Thus, it is necessary to evaluate the contribution of these lower bounds to the GED computation.

5.3.1 Tightness of LED, HED and BED

In this section, we evaluate the tightness of three GED lower bounds LED, HED and BED as well as their running time. Table 2 shows the obtained results, where the abbreviation “ms” represents milliseconds.

As shown in Table 2, BED achieves the smallest dev, which means that BED is closest to the exact GED value. This result is consistent with the analysis in Section 3, i.e., BED(G,Q) ≥ HED(G,Q) and BED(G,Q) ≥ LED(G,Q) for any two graphs G and Q. We also find that in most cases LED performs better than HED; the reason is that HED allows multiple assignments between vertices of G and Q and greedily selects matched vertices with the lowest cost.

We also list the running time of each method in Table 2. It can be seen from this table that HED runs faster than LED and LED runs faster than BED. The reason is that HED runs in quadratic time, while both LED and BED run in cubic time. LED independently considers the cost of substituting vertices and edges and ignores the structures, thus it has a better running time than BED.

Table 2: Deviation (%) and running time (ms) of LED, HED, and BED.
Datasets LED HED BED
𝑑𝑒𝑣 𝑡𝑖𝑚𝑒 𝑑𝑒𝑣 𝑡𝑖𝑚𝑒 𝑑𝑒𝑣 𝑡𝑖𝑚𝑒
GREC 4.41 0.38 17.45 0.29 3.54 0.52
MUTA 12.54 1.07 30.13 0.47 11.49 2.72
PRO 4.61 3.75 21.31 2.84 3.25 6.07
CMU 61.56 9.53 57.6 3.77 25.1 13.2
SYN 67.41 0.28 91.92 0.14 46.61 0.45

5.3.2 Effect of heuristic

Observing that BED produces a tighter lower bound than LED and HED, we propose BED+ as a heuristic estimation to improve the GED computation. For comparison, we adopted LED, HED, BED, and BED+ as the heuristic estimations, respectively, and fixed the running time tmax = 10^4 ms. Table 3 lists the obtained deviation dev and solvable ratio sr.

Table 3: Deviation (%) and solvable ratio (%) of LED, HED, BED, and BED+.
Datasets LED HED BED BED+
𝑑𝑒𝑣 𝑠𝑟 𝑑𝑒𝑣 𝑠𝑟 𝑑𝑒𝑣 𝑠𝑟 𝑑𝑒𝑣 𝑠𝑟
GREC 0.36 69.87 0.56 54.38 0.22 67.88 0.01 90
MUTA 5.56 3.47 4.85 3.27 4.49 3.51 2.6 22.33
PRO 1.27 4 0.71 3.56 0.68 4.22 0.06 4.22
CMU 109.89 19.06 49.18 19.06 52.83 31.16 2.97 75.79
SYN 8.79 8.4 9.48 4.18 7.96 7.48 0.17 94.34

From Table 3, we know that using BED+ as a heuristic can produce the smallest 𝑑𝑒𝑣. This is due to the fact that BED+ produces a higher estimated bound. For the solvable ratio 𝑠𝑟, we also find that BED+ achieves the best performance.

We also varied the running time tmax from 10^1 ms to 10^5 ms in order to evaluate the above heuristic estimations under different running times. From Figure 1, as the running time tmax increases, all four heuristic estimations obtain lower and lower deviation dev as well as higher and higher solvable ratio sr. Also, we find that in most cases BED+ achieves the best dev and sr under both small (e.g., 10^2 ms) and large (e.g., 10^5 ms) running times. Compared with the widely used heuristic estimation LED, using BED+ decreases dev by 72.2%, 34.1%, 48.1%, 39.4%, and 52.1% on average on the GREC, MUTA, PRO, CMU, and SYN datasets, respectively, and increases sr by 54.3%, 293.2%, 3.4%, 113.1%, and 702.8% on average on the same five datasets. Thus, we conclude that using BED+ as a heuristic estimation can greatly improve the GED computation.

Figure 1: The dev (top two rows) and sr (bottom two rows) under different running times.

6 Conclusion and future works

In this paper, we analyze the relationship among three state-of-the-art GED lower bounds that are widely used as heuristic estimations in the tree-based search algorithm for the GED computation. Specifically, we demonstrate that BED(G,Q) ≥ LED(G,Q) and BED(G,Q) ≥ HED(G,Q) for any two undirected graphs G and Q. Furthermore, based upon BED we propose an efficient heuristic estimation BED+ and demonstrate that BED+ still estimates a cost that is not greater than the real cost. Experimental results on four real and one synthetic datasets confirm that BED+ achieves the best performance under both small and large running times.

When calculating the heuristic estimation 𝐵𝐸𝐷+(Gr2,Qr2), we first compute the transformation cost (i.e., fB(,)) of two compared branch structures. In fact, the transformation cost of these two branch structures may have been calculated many times in the previous traversal of the search tree. Future work will consider how to build a suitable index structure to maintain the transformation cost of these traversed branch structures in order to accelerate the computation of 𝐵𝐸𝐷+(Gr2,Qr2).

References

  • [1] Z. Abu-Aisheh, R. Raveaux, and J. Y. Ramel. A graph database repository and performance evaluation metrics for graph edit distance. In GbRPR, pages 138–147, 2015.
  • [2] Z. Abu-Aisheh, R. Raveaux, and J. Y. Ramel. Anytime graph matching. Pattern Recogn Lett., 84:215–224, 2016. doi:10.1016/J.PATREC.2016.10.004.
  • [3] Z. Abu-Aisheh, R. Raveaux, J. Y. Ramel, and P. Martineau. An exact graph edit distance algorithm for solving pattern recognition problems. In ICPRAM, pages 271–278, 2015.
  • [4] D. B. Blumenthal and J. Gamper. Exact computation of graph edit distance for uniform and non-uniform metric edit costs. In GbRPR, pages 211–221, 2017.
  • [5] D. B. Blumenthal and J. Gamper. Improved lower bounds for graph edit distance. IEEE Trans. Knowl Data Eng., 30(3):503–516, 2018. doi:10.1109/TKDE.2017.2772243.
  • [6] D. B. Blumenthal and J. Gamper. On the exact computation of the graph edit distance. Pattern Recogn Lett., 134:46–57, 2020. doi:10.1016/J.PATREC.2018.05.002.
  • [7] B. Bonet and H. Geffner. Planning as heuristic search. Artif. Intell., 129(1-2):5–33, 2001. doi:10.1016/S0004-3702(01)00108-4.
  • [8] L. Chang, X. Feng, X. Lin, L. Qin, and W. Zhang. Efficient graph edit distance computation and verification via anchor-aware lower bound estimation. CoRR, 2017. arXiv:1709.06810.
  • [9] X. Chen, H. Huo, J. Huan, and J. S. Vitter. An efficient algorithm for graph edit distance computation. Knowl.-Based Syst., 163:762–775, 2019. doi:10.1016/J.KNOSYS.2018.10.002.
  • [10] X. Chen, H. Huo, J. Huan, J. S. Vitter, W. Zheng, and L. Zou. MSQ-Index: A succinct index for fast graph similarity search. IEEE Trans. Knowl Data Eng., 33(6):2654–2668, 2021. doi:10.1109/TKDE.2019.2954527.
  • [11] X. Chen, Y. Wang, H. Huo, and J. S. Vitter. An efficient heuristic for graph edit distance [source code], June 2019. URL: https://github.com/Hongweihuo-Lab/Heur-GED.
  • [12] James Cheng, Yiping Ke, and Wilfred Ng. GraphGen — a synthetic graph data generator. URL: https://cse.hkust.edu.hk/graphgen/.
  • [13] CMU house and hotel datasets. URL: https://github.com/dbblumenthal/gedlib/blob/master/data/datasets/CMU-GED.
  • [14] X. Cortés and F. Serratosa. Learning graph-matching edit-costs based on the optimality of the oracle’s node correspondences. Pattern Recogn Lett., 56:22–29, 2015. doi:10.1016/J.PATREC.2015.01.009.
  • [15] A. Fischer, R. Plamondon, Y. Savaria, K. Riesen, and H. Bunke. A Hausdorff heuristic for efficient computation of graph edit distance. Structural, Syntactic, and Statistical Pattern Recognition, LNCS 8621:83–92, 2014.
  • [16] A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke. Approximation of graph edit distance based on Hausdorff matching. Pattern Recogn., 48(2):331–343, 2015. doi:10.1016/J.PATCOG.2014.07.015.
  • [17] K. Gouda and M. Hassaan. CSI_GED: An efficient approach for graph edit similarity computation. In ICDE, pages 256–275, 2016.
  • [18] D. Justice and A. Hero. A binary linear programming formulation of the graph edit distance. IEEE Trans. Pattern Anal Mach Intell., 28(8):1200–1214, 2006. doi:10.1109/TPAMI.2006.152.
  • [19] J. Kim. Efficient graph edit distance computation using isomorphic vertices. Pattern Recogn Lett., 168:71–78, 2023. doi:10.1016/J.PATREC.2023.03.002.
  • [20] H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955. doi:10.1002/nav.3800020109.
  • [21] J. Lerouge, Z. Abu-Aisheh, R. Raveaux, P. Héroux, and S. Adam. New binary linear programming formulation to compute the graph edit distance. Pattern Recogn., 72:254–265, 2017. doi:10.1016/J.PATCOG.2017.07.029.
  • [22] J. Liu, M. Zhou, S. Ma, and L. Pan. MATA*: Combining learnable node matching with A* algorithm for approximate graph edit distance computation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), pages 1503–1512, 2023.
  • [23] Chengzhi Piao, Tingyang Xu, Xiangguo Sun, Yu Rong, Kangfei Zhao, and Hong Cheng. Computing graph edit distance via neural graph matching. Proceedings of the VLDB Endowment, 16(8):1817–1829, 2023. doi:10.14778/3594512.3594514.
  • [24] K. Riesen and H. Bunke. IAM graph database repository for graph based pattern recognition and machine learning. Structural, Syntactic, and Statistical Pattern Recognition, pages 287–297, 2008.
  • [25] K. Riesen and H. Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image Vision Comput., 27(7):950–959, 2009. doi:10.1016/J.IMAVIS.2008.04.004.
  • [26] K. Riesen, S. Emmenegger, and H. Bunke. A novel software toolkit for graph edit distance computation. In GbRPR, pages 142–151, 2013.
  • [27] K. Riesen, S. Fankhauser, and H. Bunke. Speeding up graph edit distance computation with a bipartite heuristic. In MLG, pages 21–24, 2007.
  • [28] S. Russell and P. Norvig. Artificial Intelligence: a Modern Approach (2nd ed.). Prentice-Hall, New Jersey, USA, 2002.
  • [29] O. Schütze, X. Esquivel, A. Lara, and C. A. C. Carlos. Using the averaged Hausdorff distance as a performance measure in evolutionary multiobjective optimization. IEEE Trans. Evol. Comput., 16(4):504–522, 2012. doi:10.1109/TEVC.2011.2161872.
  • [30] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou. Comparing stars: On approximating graph edit distance. PVLDB, 2(1):25–36, 2009. doi:10.14778/1687627.1687631.
  • [31] W. Zheng, L. Zou, X. Lian, D. Wang, and D. Zhao. Efficient graph similarity search over large graph databases. IEEE Trans. Knowl Data Eng., 27(4):964–978, 2015. doi:10.1109/TKDE.2014.2349924.

Appendix A Proof of Lemma 8

Proof.

We prove this lemma by considering the following two cases:

Case I.

when u=ε or v=ε. We first discuss the case v=ε. It is easy to verify that fH(u,ε) = fB(u,ε) = c(u→ε) + ½ Σ_{e∈Nu} c(e→ε). Similarly, when u=ε, we also have fH(u,v) = fB(u,v). Thus, when u=ε or v=ε, the lemma follows.

Case II.

when u≠ε and v≠ε. Then, we know that

fB(u,v) = c(u→v) + ½ MLS(Nu,Nv), (12)
fH(u,v) = ½ {c(u→v) + ½ HED(Nu,Nv)}, (13)

where MLS(Nu,Nv) = min_{ϱ: Nu^ε→Nv^ε} Σ_{e∈Nu^ε} c(e→ϱ(e)), and ϱ ranges over bijections from Nu^ε to Nv^ε.

In order to prove fH(u,v) ≤ ½ fB(u,v), it suffices from (12) and (13) to prove

HED(Nu,Nv) ≤ MLS(Nu,Nv),

which we do as follows:

  1. (i)

    Rewriting MLS(Nu,Nv) and HED(Nu,Nv):

    MLS(Nu,Nv) = Σ_{e∈Nu^ε} c(e→ϱmin(e)) = Σ_{e∈Nu^ε} c(e→y),

    where ϱmin is the bijection from Nu^ε to Nv^ε at which MLS(Nu,Nv) attains its minimum value, and y = ϱmin(e) ∈ Nv^ε is the edge to which e is mapped under ϱmin, for e ∈ Nu^ε;

    HED(Nu,Nv) = Σ_{e∈Nu^ε} {fH(e,χ1(e)) + fH(χ2(y),y)},

    where χ1 is the mapping from Nu^ε to Nv^ε satisfying χ1(e) = argmin_{e′∈Nv^ε} fH(e,e′), for e ∈ Nu^ε; and χ2 is the mapping from Nv^ε to Nu^ε satisfying χ2(y) = argmin_{e′∈Nu^ε} fH(e′,y).

  2. (ii)

    Proving fH(e,χ1(e)) + fH(χ2(y),y) ≤ c(e→y): According to the definitions of χ1 and χ2, we have fH(e,χ1(e)) ≤ min{fH(e,y), fH(e,ε)} and fH(χ2(y),y) ≤ min{fH(e,y), fH(ε,y)}. We discuss the following cases (a)–(d):

    1. (a)

      When e=ε and y=ε, then fH(e,χ1(e)) + fH(χ2(y),y) ≤ fH(ε,ε) + fH(ε,ε) = c(ε→ε) = 0;

    2. (b)

      When e≠ε and y=ε, then fH(e,χ1(e)) + fH(χ2(y),y) ≤ fH(e,ε) + fH(ε,ε) = c(e→ε);

    3. (c)

      When e=ε and y≠ε, the analysis is similar to that of case (b);

    4. (d)

      When e≠ε and y≠ε, then fH(e,χ1(e)) + fH(χ2(y),y) ≤ fH(e,y) + fH(e,y) = 2 fH(e,y) = 2 × ½ c(e→y) = c(e→y).

  3. (iii)

    Combining both (i) and (ii), we have

    HED(Nu,Nv) = Σ_{e∈Nu^ε} {fH(e,χ1(e)) + fH(χ2(y),y)} ≤ Σ_{e∈Nu^ε} c(e→y) = MLS(Nu,Nv).

Therefore, fH(u,v) ≤ ½ fB(u,v) also holds when u≠ε and v≠ε. This completes the proof.

Appendix B Examples of computing LED, HED and BED

In this section, we give an example of calculating the three GED lower bounds LED, HED, and BED.

Figure 2: Graphs G (left) and Q (right).

Figure 2 shows two graphs G and Q, where “A”, “B” and “C” denote vertex labels, and “a” and “b” denote edge labels. Consider the cost function c satisfying: (i) the cost of each vertex edit operation is 2, that is, c(u→v)=2 when two vertices u∈VG^ε and v∈VQ^ε have different labels, and c(u→v)=0 otherwise; (ii) the cost of each edge edit operation is 1, that is, c(e1→e2)=1 when two edges e1∈EG^ε and e2∈EQ^ε have different labels, and c(e1→e2)=0 otherwise. Based upon this cost function c, we show below how to compute LED(G,Q), HED(G,Q) and BED(G,Q) for the graphs in Figure 2.

(1) Computing LED(G,Q)

In LED(G,Q) (see Definition 2 in the main text), we need to compute the minimum substitution costs of the vertices and edges of G and Q, i.e., λV(G,Q) and λE(G,Q). For λV(G,Q) = min_{ϕ: VG^ε→VQ^ε} Σ_{u∈VG^ε} c(u→ϕ(u)), we seek a bijection ϕ from VG^ε to VQ^ε that minimizes the linear sum λV(G,Q); this is the well-studied linear sum assignment problem (LSAP) and can be solved by the Hungarian algorithm [20] through the following two steps:

  1. (1)

    Construct the vertex substitution cost matrix W^V, such that W^V_{u,v} = c(u→v) is the cost of substituting vertex u∈VG with vertex v∈VQ, W^V_{u,ε} = c(u→ε) is the cost of deleting u, and W^V_{ε,v} = c(ε→v) is the cost of inserting v. In this example, W^V is computed from the vertex labels in Figure 2.

  2. (2)

    Find the optimal assignment ϕmin that minimizes the linear sum on W^V. In this example, we find that ϕmin = {(u1→v1), (u2→v2), (u3→v3), (u4→v4), (ε→ε)} is the optimal assignment, and then obtain λV(G,Q) = W^V_{u1,v1} + W^V_{u2,v2} + W^V_{u3,v3} + W^V_{u4,v4} + W^V_{ε,ε} = 2.

Similar to the above process, we can compute the edge substitution cost matrix W^E.

With the Hungarian algorithm, we find that the optimal assignment on W^E is φmin = {(e(u1,u2)→ε), (e(u1,u3)→e(v1,v4)), (e(u2,u4)→e(v2,v4)), (e(u3,u4)→e(v3,v4)), (ε→ε)}. Then, λE(G,Q) = W^E_{e(u1,u2),ε} + W^E_{e(u1,u3),e(v1,v4)} + W^E_{e(u2,u4),e(v2,v4)} + W^E_{e(u3,u4),e(v3,v4)} + W^E_{ε,ε} = 2. Combining λV(G,Q) and λE(G,Q), we have LED(G,Q) = λV(G,Q) + λE(G,Q) = 2 + 2 = 4.
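The two-step LSAP procedure above can be sketched in Python. Since the labels of Figure 2 appear only graphically, the vertex labels below are hypothetical, and a brute-force solver stands in for the Hungarian algorithm [20]; with these assumed labels, a single substitution mismatch yields λV = 2, as in the example.

```python
import itertools

def lsap_with_eps(sub, dele, ins):
    # Minimum-cost bijection between VG ∪ {ε,...} and VQ ∪ {ε,...}:
    # pad each side with ε slots and brute-force all assignments.
    # (The Hungarian algorithm [20] solves the same problem in O(n^3).)
    n, m = len(dele), len(ins)
    best = float("inf")
    for perm in itertools.permutations(range(n + m)):
        total = 0.0
        for i, j in enumerate(perm):
            if i < n and j < m:        # substitution u -> v
                total += sub[i][j]
            elif i < n:                # deletion u -> ε
                total += dele[i]
            elif j < m:                # insertion ε -> v
                total += ins[j]        # (ε -> ε pairs cost 0)
        best = min(best, total)
    return best

# Hypothetical vertex labels standing in for Figure 2:
labels_G = ["A", "A", "B", "C"]
labels_Q = ["A", "B", "B", "C"]
WV = [[0 if a == b else 2 for b in labels_Q] for a in labels_G]
lam_V = lsap_with_eps(WV, [2] * len(labels_G), [2] * len(labels_Q))  # -> 2
```

Deleting and re-inserting a vertex costs 2 + 2 = 4 here, so the solver correctly prefers the single mismatched substitution of cost 2.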

(2) Computing HED(G,Q)

According to the definition of HED(G,Q) (see Definition 3 in the main text), we need to calculate the Hausdorff matching cost fH(u,v) between vertices u∈VG^ε and v∈VQ^ε, and then perform a bidirectional matching between G and Q. When matching from G to Q, we greedily seek the minimum matching cost min_{v∈VQ^ε} fH(u,v) for each vertex u; the sum of these minimum costs is the matching cost from G to Q, i.e., Σ_{u∈VG} min_{v∈VQ^ε} fH(u,v). Similarly, the matching cost from Q to G is Σ_{v∈VQ} min_{u∈VG^ε} fH(u,v). Finally, HED(G,Q) is the sum of these two matching costs. We can summarize the computation of HED(G,Q) in two steps:

  1. (1)

    Construct the Hausdorff matching cost matrix W^H, such that W^H_{u,v} = fH(u,v) is the Hausdorff cost of matching vertex u∈VG to vertex v∈VQ, W^H_{u,ε} = fH(u,ε) is the Hausdorff cost of deleting u, and W^H_{ε,v} = fH(ε,v) is the Hausdorff cost of inserting v, where fH(·,·) is defined in (2) in the main text. In this example, W^H is computed from the graphs in Figure 2.

  2. (2)

    Based upon W^H, compute Σ_{u∈VG} min_{v∈VQ^ε} W^H_{u,v} and Σ_{v∈VQ} min_{u∈VG^ε} W^H_{u,v}. In this example, we obtain HED(G,Q) = Σ_{u∈VG} min_{v∈VQ^ε} W^H_{u,v} + Σ_{v∈VQ} min_{u∈VG^ε} W^H_{u,v} = (W^H_{u1,v1} + W^H_{u2,v1} + W^H_{u3,v1} + W^H_{u4,v4}) + (W^H_{u2,v1} + W^H_{u2,v2} + W^H_{u3,v3} + W^H_{u4,v4}) = (1.375 + 0.125 + 0.125 + 0) + (0.125 + 0.125 + 0.125 + 0) = 2.

Note that when calculating W^H_{u,v} (i.e., fH(u,v)), we need to calculate HED(Nu,Nv) (see Equation (3) in the main text), where Nu and Nv are the sets of edges incident to u and v, respectively. The computation of HED(Nu,Nv) is similar to the above process of computing HED(G,Q), so we omit the detailed calculation here.
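The second step reduces to summing row and column minima of W^H. The Python sketch below is our illustration; the matrix entries are placeholders chosen only to reproduce the row and column minima quoted above, not the actual W^H of Figure 2.

```python
def hed_from_matrix(WH):
    # WH is (n+1) x (m+1): WH[i][j] = fH(ui, vj), with the last column
    # holding deletion costs fH(u, ε) and the last row insertion costs
    # fH(ε, v). Each real vertex greedily takes the cheapest entry on its
    # own side; no bijection constraint is enforced.
    n, m = len(WH) - 1, len(WH[0]) - 1
    fwd = sum(min(WH[i]) for i in range(n))                           # G -> Q
    bwd = sum(min(WH[i][j] for i in range(n + 1)) for j in range(m))  # Q -> G
    return fwd + bwd

# Illustrative entries (rows u1..u4, ε; columns v1..v4, ε):
WH = [
    [1.375, 2.0,   2.0,   2.0, 2.5],
    [0.125, 0.125, 2.0,   2.0, 2.5],
    [0.125, 2.0,   0.125, 2.0, 2.5],
    [2.0,   2.0,   2.0,   0.0, 2.5],
    [2.5,   2.5,   2.5,   2.5, 0.0],
]
```

For this matrix, the forward pass contributes 1.375 + 0.125 + 0.125 + 0 and the backward pass 0.125 + 0.125 + 0.125 + 0, giving a total of 2, matching the example.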

(3) Computing BED(G,Q)

The process of calculating BED(G,Q) is similar to that of calculating λV(G,Q): we again seek a bijection ρ from VG^ε to VQ^ε that minimizes the linear sum Σ_{u∈VG^ε} fB(u,ρ(u)). The computation contains two steps:

  1. (1)

    Construct the branch matching cost matrix W^B, such that W^B_{u,v} = fB(u,v) is the branch cost of matching vertex u∈VG to vertex v∈VQ, W^B_{u,ε} = fB(u,ε) is the branch cost of deleting u, and W^B_{ε,v} = fB(ε,v) is the branch cost of inserting v, where fB(·,·) is defined in (5) in the main text. In this example, W^B is computed from the graphs in Figure 2.

  2. (2)

    Find the optimal assignment ρmin that minimizes the linear sum on W^B. In this example, we find that ρmin = {(u1→v1), (u2→v2), (u3→v3), (u4→v4), (ε→ε)} is the optimal assignment, and then obtain BED(G,Q) = W^B_{u1,v1} + W^B_{u2,v2} + W^B_{u3,v3} + W^B_{u4,v4} + W^B_{ε,ε} = 4.5.

Note that when calculating W^B_{u,v} (i.e., fB(u,v)), we need to calculate the minimum edge substitution cost between Nu and Nv, which is similar to the process of calculating λE(·,·); thus, we omit the detailed computation here.
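For the uniform costs of this example (each vertex edit costs 2, each edge edit costs 1), the branch cost fB(u,v) = c(u→v) + ½·MLS(Nu,Nv) admits a closed form: the optimal edge assignment keeps every common edge label for free and pays one edit for each remaining edge, so MLS(Nu,Nv) = max(|Nu|,|Nv|) minus the multiset intersection size. The sketch below is our illustration; the label arguments are hypothetical.

```python
from collections import Counter

def branch_cost(lu, Nu, lv, Nv, cv=2, ce=1):
    # fB(u,v) = c(u -> v) + 0.5 * MLS(Nu, Nv). Under uniform edge costs,
    # MLS(Nu, Nv) = ce * (max(|Nu|, |Nv|) - |Nu ∩ Nv|), where |Nu ∩ Nv|
    # is the multiset intersection of the incident edge labels.
    common = sum((Counter(Nu) & Counter(Nv)).values())
    mls = ce * (max(len(Nu), len(Nv)) - common)
    return (0 if lu == lv else cv) + 0.5 * mls

# Equal vertex labels, edge labels {a,a} vs {a,b}: one edge edit, halved.
branch_cost("A", ["a", "a"], "A", ["a", "b"])  # -> 0.5
```

Filling W^B then amounts to evaluating branch_cost for every vertex pair (plus the ε row and column) and solving the LSAP on the result, exactly as in step (2).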

For the graphs G and Q in Figure 2, we finally obtain LED(G,Q) = 4, HED(G,Q) = 2 and BED(G,Q) = 4.5. Clearly, BED(G,Q) ≥ LED(G,Q) and BED(G,Q) ≥ HED(G,Q).

Appendix C Successor generation

We discuss how to generate successors of each node in the GED search tree with Algorithm 2.

Consider an inner node r = {(u1→vj1), …, (uℓ→vjℓ)}, where vjk is the mapped vertex of uk in the GED search tree, for 1 ≤ k ≤ ℓ. BasicGenSuccr generates all possible successors of r. First, we compute the sets of unmapped vertices in G and Q, i.e., VG^r = VG \ {u1, …, uℓ} and VQ^r = VQ \ {vj1, …, vjℓ}. If |VG^r| > 0, then we select a vertex z from VQ^r ∪ {ε} as the mapped vertex of uℓ+1, and consequently obtain a successor child of r such that child = r ∪ {(uℓ+1→z)}. Otherwise, all the vertices of G have been processed, and we trivially obtain a leaf node s = r ∪ ⋃_{z∈VQ^r} {(ε→z)}.

Algorithm 2 BasicGenSuccr(r).
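A minimal Python sketch of the successor generation just described (our reconstruction from the prose above, not the paper's Algorithm 2 verbatim; ε is modeled as None, and a node is a list of vertex-mapping pairs):

```python
def basic_gen_succr(r, VG, VQ):
    # r: current node as a list of (u, v) pairs; None plays the role of ε.
    mapped_G = {u for u, _ in r if u is not None}
    mapped_Q = {v for _, v in r if v is not None}
    VG_r = [u for u in VG if u not in mapped_G]   # unmapped vertices of G
    VQ_r = [v for v in VQ if v not in mapped_Q]   # unmapped vertices of Q
    if VG_r:
        u_next = VG_r[0]                          # next vertex u_{ℓ+1} of G
        # One child per candidate mapped vertex in VQ_r ∪ {ε}.
        return [r + [(u_next, z)] for z in VQ_r + [None]]
    # All vertices of G processed: one leaf inserting the remaining vertices.
    return [r + [(None, z) for z in VQ_r]]
```

For example, with VG = ["u1", "u2"] and VQ = ["v1"], the root node [] has the two successors [("u1", "v1")] and [("u1", None)], i.e., u1 is either substituted with v1 or deleted.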