An Efficient Heuristic for Graph Edit Distance

Chen, Xiaoyang; Wang, Yujia; Huo, Hongwei; Vitter, Jeffrey Scott

doi:10.4230/OASIcs.Grossi.1

Xiaoyang Chen

Department of Computer Science, Xidian University, Xi’an, China

Yujia Wang

Department of Computer Science, Xidian University, Xi’an, China

Hongwei Huo (corresponding author)

Department of Computer Science, Xidian University, Xi’an, China

Jeffrey Scott Vitter (corresponding author)

Department of Computer Science, Tulane University, New Orleans, LA, USA
The University of Mississippi, MS, USA

Abstract

The graph edit distance (GED) is a flexible distance measure widely used in many applications. Existing GED computation methods are usually based upon the tree-based search algorithm that explores all possible vertex (or edge) mappings between two compared graphs. During this process, various GED lower bounds are adopted as heuristic estimations to accelerate the tree-based search algorithm. For the first time, we analyze the relationship among three state-of-the-art GED lower bounds, label edit distance (LED), Hausdorff edit distance (HED), and branch edit distance (BED). Specifically, we demonstrate that $\mathit{BED}(G,Q)\geq\mathit{HED}(G,Q)$ and $\mathit{BED}(G,Q)\geq\mathit{LED}(G,Q)$ for any two undirected graphs $G$ and $Q$ . Furthermore, for BED we propose an efficient heuristic BED⁺ for improving the tree-based search algorithm. Extensive experiments on real and synthetic datasets confirm that BED⁺ achieves smaller deviation and larger solvable ratios than LED, HED and BED when they are employed as heuristic estimations. The source code is available online.

Keywords and phrases:

Graph edit distance, Label edit distance, Hausdorff edit distance, Branch edit distance, Tree-based search, Heuristics

Category:

Research

2012 ACM Subject Classification:

Information systems $\rightarrow$ Query optimization

Supplementary Material:

Software (Source Code): https://github.com/Hongweihuo-Lab/Heur-GED [11]

Funding:

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62272358.

DOI:

10.4230/OASIcs.Grossi.2025.1

Event:

From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday

Editors:

Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter

Series and Publisher:

Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

Graphs are frequently used to represent a wide variety of objects, such as networks, maps, handwriting, molecular compounds, and protein structures. The process of evaluating the similarity of two graphs is referred to as error-tolerant graph matching, which aims to find a correspondence between their vertices. In this paper, we focus upon the graph edit distance (GED) as the similarity measure because it can be applied to all types of graphs and can precisely capture structural differences between the compared graphs. The GED of two graphs is defined as the minimum cost of transforming one graph into another through a sequence of edit operations (inserting, deleting and substituting vertices or edges). An edit cost is assigned to each edit operation to measure its strength; the costs can be obtained from domain-specific knowledge or learned from a set of sample graphs [14].

However, computing the GED is an NP-hard problem [30], and exact methods are usually based upon a tree-based search algorithm. The search tree enumerates all possible mappings between vertices (or edges) of two compared graphs, where the inner nodes denote partial mappings and the leaf nodes denote complete mappings. Most existing GED computation methods employ different search paradigms to traverse this search tree and seek the optimal mapping that induces the GED. Riesen et al. [25, 26] proposed the standard method, A^⋆-GED, based upon the best-first search paradigm. It needs to store numerous inner nodes, resulting in high memory consumption. To overcome this bottleneck, Abu-Aisheh et al. [3] proposed a depth-first search based algorithm, DF-GED, whose memory requirement increases linearly with the number of vertices of graphs. On the other hand, Chen et al. [9] introduced a method for the GED computation based upon beam-stack search [28], achieving a flexible tradeoff between memory consumption and the time overhead of backtracking in the depth-first search. Chang et al. [8] developed a unified framework that can be instantiated into either a best-first search approach or a depth-first search approach. Gouda et al. [17] proposed a novel edge-mapping based approach, CSI_GED, which also employs the depth-first search paradigm. CSI_GED works only for the uniform cost model, and Blumenthal et al. [4, 6] generalized it to cover the non-uniform cost model. Kim [19] developed an efficient GED computation algorithm using isomorphic vertices [9]. Liu et al. [22] explored a learning-based method for the approximate GED computation. Piao et al. [23] proposed a deep learning method for the GED computation. It is worth mentioning that many researchers have proposed various indexing techniques [31, 8, 10] to accelerate graph similarity searches under the GED metric. They use the above GED computation methods as the final phase to verify the candidate graphs that satisfy the GED constraint.

In the tree-based search algorithm, a heuristic estimation is usually adopted to prune the useless search space and thereby accelerate the search process. To ensure that the optimal mapping is not erroneously pruned, this heuristic function must be admissible; namely, its estimate of the remaining cost of a tree node must be less than or equal to the real cost. The previous works A^⋆-GED and DF-GED adopted the label edit distance (LED) as the heuristic, which calculates the minimum substitution cost of vertices and edges of two compared graphs. After that, Fischer et al. [15, 16] proposed the Hausdorff edit distance (HED) as a heuristic estimation. HED, based upon Hausdorff matching [29], performs a bidirectional matching between two graphs and allows multiple assignments between their vertices. Recently, Blumenthal et al. [5] proposed another effective GED lower bound, branch edit distance (BED), which also can be adopted as a heuristic estimation.

As observed in other studies [7, 15, 27], the higher the heuristic estimates the cost, the better the tree-based search algorithm performs. The following question naturally arises: Which of these three state-of-the-art GED lower bounds (namely, LED, HED, or BED) is more effective? In this paper, we first analyze the relationship among these three lower bounds and then propose an effective heuristic estimation. Our contributions are summarized as follows:

(1) We analyze the relationship among LED, HED and BED for the first time, and we derive that $\mathit{BED}(G,Q)\geq\mathit{HED}(G,Q)$ and $\mathit{BED}(G,Q)\geq\mathit{LED}(G,Q)$ for any two undirected graphs $G$ and $Q$.

(2) We propose an efficient heuristic estimation BED⁺ based upon BED, and demonstrate that BED⁺ is still admissible.

(3) We conduct extensive experiments to confirm BED⁺’s effectiveness on the real and synthetic datasets. The source code is available online [11].

The rest of this paper is organized as follows. In Section 2, we give the definition of the graph edit distance and revisit three state-of-the-art GED lower bounds. In Section 3, we theoretically analyze the relationship between LED, HED and BED. In Section 4, we propose the heuristic function BED⁺ for improving the GED computation. In Section 5, we report the experimental results. Finally, we summarize this paper in Section 6.

2 Graph edit distance

In this paper, we consider undirected, labeled graphs without multi-edges or self-loops. A labeled graph is a triplet $G=(V_{G},E_{G},L)$ , where $V_{G}$ is the set of vertices, $E_{G}$ is the set of edges, $L:V_{G}\cup E_{G}\to\Sigma$ is a labeling function that assigns a label to a vertex or an edge, and $\Sigma$ is a set of labels. Also, we use a special symbol $\varepsilon$ to denote a dummy vertex or a dummy edge.

Given two graphs $G$ and $Q$ , six edit operations [18, 25, 21, 5] can be used to transform $G$ to $Q$ (or vice versa): inserting or deleting a vertex or an edge, and substituting the label of a vertex or an edge. We denote the label substitution (or simply substitution) of vertices $u\in V_{G}$ and $v\in V_{Q}$ by $(u\to v)$ , the deletion of $u$ by $(u\to\varepsilon)$ , and the insertion of $v$ by $(\varepsilon\to v)$ . For the three edit operations on edges, we use similar notation.

An edit path $\mathcal{P}=\langle p_{1},p_{2},\dots,p_{k}\rangle$ between $G$ and $Q$ is a sequence of edit operations that transforms $G$ to $Q$, such that $G=G^{0}\xrightarrow{p_{1}}\dots G^{i}\xrightarrow{p_{i+1}}G^{i+1}\dots\xrightarrow{p_{k}}G^{k}=Q$, where graph $G^{i+1}$ is obtained by performing the edit operation $p_{i+1}$ on graph $G^{i}$, for $0\leq i\leq k-1$. During this transformation, each edit operation $p_{i}$ is assigned a penalty cost $c(p_{i})$ that reflects how strongly it changes a graph. Note that the cost of editing two dummy vertices (or edges) is 0; that is, $c(\varepsilon\to\varepsilon)=0$. The edit cost of $\mathcal{P}$ is defined as $\sum_{i=1}^{k}c(p_{i})$. We define the graph edit distance as follows:

Definition 1.

Given two graphs $G$ and $Q$ , the graph edit distance between them, denoted by $\mathit{ged}(G,Q)$ , is defined as the minimum cost of transforming $G$ to $Q$ , namely,

$$\mathit{ged}(G,Q)=\min\nolimits_{\mathcal{P}\in\Upsilon(G,Q)}\sum\nolimits_{p_{i}\in\mathcal{P}}c(p_{i}) \tag{1}$$

where $\Upsilon(G,Q)$ is the set of all edit paths between $G$ and $Q$ , and $c(p_{i})$ is edit operation $p_{i}$ ’s cost.
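To make Definition 1 concrete, the following minimal sketch computes the GED of two tiny graphs by brute force. It assumes the uniform cost model (every insertion, deletion, or label-changing substitution costs 1), under which it suffices to minimize the induced edit cost over all bijections between the vertex sets padded with dummy vertices $\varepsilon$, rather than enumerating edit paths. The dictionary-based graph encoding and the names `c`, `edge_label`, and `ged_brute_force` are illustrative assumptions, not part of the paper's implementation.

```python
from itertools import combinations, permutations

def c(a, b):
    """Uniform edit cost between two labels; None plays the role of epsilon."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1                      # insertion or deletion
    return 0 if a == b else 1         # label substitution

def edge_label(E, u, v):
    """Label of the edge {u, v}; None if it is absent or an endpoint is a dummy."""
    if u is None or v is None:
        return None
    return E.get(frozenset((u, v)))

def ged_brute_force(VG, EG, VQ, EQ):
    """Minimum induced edit cost over all bijections of the padded vertex sets."""
    gs = list(VG) + [None] * len(VQ)  # V_G padded with |V_Q| dummies
    qs = list(VQ) + [None] * len(VG)  # V_Q padded with |V_G| dummies
    best = float("inf")
    for rho in permutations(qs):
        # vertex operations
        total = sum(c(VG.get(u), VQ.get(v)) for u, v in zip(gs, rho))
        # edge operations induced by the vertex mapping
        for i, k in combinations(range(len(gs)), 2):
            total += c(edge_label(EG, gs[i], gs[k]), edge_label(EQ, rho[i], rho[k]))
        best = min(best, total)
    return best

# G: path A - B - C with edge labels x, y;  Q: a single edge A - B labeled x.
VG = {1: "A", 2: "B", 3: "C"}
EG = {frozenset((1, 2)): "x", frozenset((2, 3)): "y"}
VQ = {1: "A", 2: "B"}
EQ = {frozenset((1, 2)): "x"}
print(ged_brute_force(VG, EG, VQ, EQ))  # 2: delete vertex C and edge y
```

The enumeration is factorial in $|V_{G}|+|V_{Q}|$, so this sketch is only usable on toy inputs; it is meant as a reference point for the lower bounds discussed next.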

Hereafter, for ease of presentation, we denote $V_{G}^{\varepsilon}=V_{G}\cup\overbrace{\{\varepsilon,\dots,\varepsilon\}}^{|V_{Q}|}$ and $V_{Q}^{\varepsilon}=V_{Q}\cup\overbrace{\{\varepsilon,\dots,\varepsilon\}}^{|V_{G}|}$ as the expanded sets of $V_{G}$ and $V_{Q}$, respectively, so that $V_{G}^{\varepsilon}$ and $V_{Q}^{\varepsilon}$ have the same number of vertices. Similarly, $E_{G}^{\varepsilon}=E_{G}\cup\overbrace{\{\varepsilon,\dots,\varepsilon\}}^{|E_{Q}|}$ and $E_{Q}^{\varepsilon}=E_{Q}\cup\overbrace{\{\varepsilon,\dots,\varepsilon\}}^{|E_{G}|}$ denote the expanded sets of $E_{G}$ and $E_{Q}$, respectively.

2.1 State-of-the-art GED lower bounds

Below we introduce three state-of-the-art GED lower bounds, which can be used as heuristic estimations in the tree-based search algorithm to compute GED. Each of the methods gives a lower bound on GED because the operations are done in sets that do not have to be consistent with one another. For example, in the first method LED described below, the edit operations on the vertex labels can be done independently of the edit operations on the edge labels, and thus they may not be globally consistent.

Label Edit Distance.

Riesen et al. [27, 26] proposed the label edit distance (LED), which is the minimum cost of substituting vertices and edges of two graphs.

Definition 2 (Label edit distance).

Given two graphs $G$ and $Q$, the label edit distance between them is defined as $\mathit{LED}(G,Q)=\lambda_{V}(G,Q)+\lambda_{E}(G,Q)$, where $\lambda_{V}(G,Q)=\min\nolimits_{\phi:V_{G}^{\varepsilon}\to V_{Q}^{\varepsilon}}\sum\nolimits_{u\in V_{G}^{\varepsilon}}c(u\to\phi(u))$ is the minimum cost of substituting vertices of $G$ and $Q$, and $\phi$ is a bijection from $V_{G}^{\varepsilon}$ to $V_{Q}^{\varepsilon}$; and $\lambda_{E}(G,Q)=\min\nolimits_{\varphi:E_{G}^{\varepsilon}\to E_{Q}^{\varepsilon}}\sum\nolimits_{e(u,u^{\prime})\in E_{G}^{\varepsilon}}c(e(u,u^{\prime})\to\varphi(e(u,u^{\prime})))$ is the minimum cost of substituting edges of $G$ and $Q$, and $\varphi$ is a bijection from $E_{G}^{\varepsilon}$ to $E_{Q}^{\varepsilon}$.
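Definition 2 can be sketched in a few lines. The code below is an illustrative sketch under the uniform cost model; it replaces a proper assignment solver by brute-force enumeration, so it is only suitable for toy inputs, and the names `c`, `min_assignment`, and `led` are assumptions made for this example.

```python
from itertools import permutations

def c(a, b):
    """Uniform edit cost between two labels; None plays the role of epsilon."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1                      # insertion or deletion
    return 0 if a == b else 1         # label substitution

def min_assignment(A, B):
    """Cheapest bijection between A and B after padding each side with dummies."""
    a = list(A) + [None] * len(B)
    b = list(B) + [None] * len(A)
    return min(sum(c(x, y) for x, y in zip(a, p)) for p in permutations(b))

def led(V1, E1, V2, E2):
    """LED = lambda_V + lambda_E, each computed on labels alone."""
    lam_v = min_assignment(V1.values(), V2.values())
    lam_e = min_assignment(E1.values(), E2.values())
    return lam_v + lam_e

# G: path with vertex labels A, A, B;  Q: path with vertex labels A, B, A.
G = ({1: "A", 2: "A", 3: "B"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
Q = ({1: "A", 2: "B", 3: "A"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
print(led(*G, *Q))  # 0
```

The two paths are not isomorphic as labeled graphs, yet their LED is 0 because the vertex and edge assignments are optimized independently; this is the loss of global consistency mentioned at the start of this subsection.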

Hausdorff Edit Distance.

Inspired by the Hausdorff distance [29] between two finite sets, Fischer et al. [16] proposed the Hausdorff edit distance (HED) between two graphs $G$ and $Q$ . The key ideas of HED are to perform a bidirectional matching between $G$ and $Q$ and to allow multiple assignments between their vertices.

Definition 3 (Hausdorff edit distance).

Given two graphs $G$ and $Q$, their Hausdorff edit distance is defined as $\mathit{HED}(G,Q)=\sum_{u\in V_{G}}\min_{v\in V_{Q}\cup\{\varepsilon\}}f_{H}(u,v)+\sum_{v\in V_{Q}}\min_{u\in V_{G}\cup\{\varepsilon\}}f_{H}(u,v)$, where $f_{H}(u,v)$ is the Hausdorff cost of matching vertex $u$ to vertex $v$.

The Hausdorff vertex matching cost $f_{H}(u,v)$ considers not only the two vertices $u\in V_{G}$ and $v\in V_{Q}$ but also their neighboring edges.

Definition 4 (Neighboring edges).

Given graph $G$ and a vertex $u\in V_{G}$ , the neighboring edges $N_{u}$ of $u$ are defined as $N_{u}=\{e(u,u^{\prime}):u^{\prime}\in V_{G}\land e(u,u^{\prime})\in E_{G}\}$ .

We define $f_{H}(u,v)$ as

$$f_{H}(u,v)=\begin{cases}c(u\to\varepsilon)+\sum_{e_{1}\in N_{u}}\frac{c(e_{1}\to\varepsilon)}{2}&\text{if}\ v=\varepsilon;\\[2pt] c(\varepsilon\to v)+\sum_{e_{2}\in N_{v}}\frac{c(\varepsilon\to e_{2})}{2}&\text{if}\ u=\varepsilon;\\[2pt] \frac{c(u\to v)+\frac{\mathit{HED}(N_{u},N_{v})}{2}}{2}&\text{otherwise.}\end{cases} \tag{2}$$

Similarly to Definition 3, the Hausdorff edit distance $\mathit{HED}(N_{u},N_{v})$ between $N_{u}$ and $N_{v}$ is defined as

$$\mathit{HED}(N_{u},N_{v})=\sum_{e_{1}\in N_{u}}\min_{e_{2}\in N_{v}\cup\{\varepsilon\}}f_{H}(e_{1},e_{2})+\sum_{e_{2}\in N_{v}}\min_{e_{1}\in N_{u}\cup\{\varepsilon\}}f_{H}(e_{1},e_{2}) \tag{3}$$

where $f_{H}(e_{1},e_{2})$ is the cost of matching two edges such that

$$f_{H}(e_{1},e_{2})=\begin{cases}c(e_{1}\to\varepsilon)&\text{if}\ e_{2}=\varepsilon;\\[2pt] c(\varepsilon\to e_{2})&\text{if}\ e_{1}=\varepsilon;\\[2pt] \frac{c(e_{1}\to e_{2})}{2}&\text{otherwise.}\end{cases} \tag{4}$$
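Definition 3 together with (2), (3) and (4) can be sketched as follows. This is an illustrative sketch under the uniform cost model; the graph encoding (a pair of a vertex-label dictionary and an edge-label dictionary keyed by `frozenset`) and all function names are assumptions made for this example.

```python
def c(a, b):
    """Uniform edit cost between two labels; None plays the role of epsilon."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1                      # insertion or deletion
    return 0 if a == b else 1         # label substitution

def neighbors(E, u):
    """Labels of the edges incident to u (the set N_u of Definition 4)."""
    return [lab for e, lab in E.items() if u in e]

def f_H_edge(e1, e2):
    """Edge matching cost of Eq. (4), evaluated on edge labels."""
    if e2 is None:
        return c(e1, None)
    if e1 is None:
        return c(None, e2)
    return c(e1, e2) / 2

def hed_edges(Nu, Nv):
    """Eq. (3): each edge independently picks its cheapest partner or epsilon."""
    left = sum(min([f_H_edge(e1, e2) for e2 in Nv] + [f_H_edge(e1, None)]) for e1 in Nu)
    right = sum(min([f_H_edge(e1, e2) for e1 in Nu] + [f_H_edge(None, e2)]) for e2 in Nv)
    return left + right

def f_H(G, u, Q, v):
    """Vertex matching cost of Eq. (2); at most one of u, v is None (epsilon)."""
    (VG, EG), (VQ, EQ) = G, Q
    if v is None:
        return c(VG[u], None) + sum(c(l, None) for l in neighbors(EG, u)) / 2
    if u is None:
        return c(None, VQ[v]) + sum(c(None, l) for l in neighbors(EQ, v)) / 2
    return (c(VG[u], VQ[v]) + hed_edges(neighbors(EG, u), neighbors(EQ, v)) / 2) / 2

def hed(G, Q):
    """Definition 3: sum of the best matches in both directions."""
    (VG, _), (VQ, _) = G, Q
    left = sum(min(f_H(G, u, Q, v) for v in list(VQ) + [None]) for u in VG)
    right = sum(min(f_H(G, u, Q, v) for u in list(VG) + [None]) for v in VQ)
    return left + right

# G: path A - B - C with edge labels x, y;  Q: a single edge A - B labeled x.
G = ({1: "A", 2: "B", 3: "C"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "y"})
Q = ({1: "A", 2: "B"}, {frozenset((1, 2)): "x"})
print(hed(G, Q))  # 1.0
```

On this pair the GED is 2 (delete vertex C and edge y), so HED is a valid but loose lower bound here: each vertex chooses its partner independently, and several vertices may choose the same one.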

Branch Edit Distance.

Blumenthal et al. [5] recently proposed the branch edit distance (BED), which computes the minimum cost of editing branch structures of two graphs.

Definition 5 (Branch structure).

The branch structure of vertex $u$ in graph $G$ is defined as $B_{u}=(u,N_{u})$ , where $N_{u}$ is the set of neighboring edges of $u$ .

Given two branch structures $B_{u}$ and $B_{v}$ , the minimum cost of editing $B_{u}$ into $B_{v}$ is defined as

$$f_{B}(u,v)=c(u\to v)+\frac{1}{2}\min_{\varrho:N_{u}^{\varepsilon}\to N_{v}^{\varepsilon}}\sum_{e(u,u^{\prime})\in N_{u}^{\varepsilon}}c(e(u,u^{\prime})\to\varrho(e(u,u^{\prime}))), \tag{5}$$

where $N_{u}^{\varepsilon}=N_{u}\cup\overbrace{\{\varepsilon,\dots,\varepsilon\}}^{|N_{v}|}$ and $N_{v}^{\varepsilon}=N_{v}\cup\overbrace{\{\varepsilon,\dots,\varepsilon\}}^{|N_{u}|}$ are expanded sets of $N_{u}$ and $N_{v}$, respectively, and $\varrho$ is a bijection from $N_{u}^{\varepsilon}$ to $N_{v}^{\varepsilon}$.

Definition 6 (Branch edit distance).

Given two graphs $G$ and $Q$, the branch edit distance between them is defined as $\mathit{BED}(G,Q)=\min_{\rho:V_{G}^{\varepsilon}\to V_{Q}^{\varepsilon}}\sum_{u\in V_{G}^{\varepsilon}}f_{B}(u,\rho(u))$, where $\rho$ is a bijection from $V_{G}^{\varepsilon}$ to $V_{Q}^{\varepsilon}$, and $f_{B}(\cdot,\cdot)$ is defined in (5).
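Definitions 5 and 6 amount to two nested assignment problems: an inner one over the neighboring edges in (5) and an outer one over the vertices. The sketch below illustrates this under the uniform cost model. It uses brute-force enumeration as a stand-in for a polynomial-time assignment solver, and the names `min_assignment`, `neighbors`, and `bed` are assumptions made for this example.

```python
from itertools import permutations

def c(a, b):
    """Uniform edit cost between two labels; None plays the role of epsilon."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1                      # insertion or deletion
    return 0 if a == b else 1         # label substitution

def min_assignment(A, B, cost):
    """Cheapest bijection between A and B after padding each side with dummies."""
    a = list(A) + [None] * len(B)
    b = list(B) + [None] * len(A)
    m = [[cost(x, y) for y in b] for x in a]      # cost matrix, computed once
    return min(sum(m[i][j] for i, j in enumerate(p))
               for p in permutations(range(len(b))))

def neighbors(E, u):
    """Labels of the edges incident to u; empty for the dummy vertex None."""
    return [lab for e, lab in E.items() if u in e]

def bed(G, Q):
    """Definition 6: optimal assignment of branch structures."""
    (VG, EG), (VQ, EQ) = G, Q

    def f_B(u, v):
        # Eq. (5): vertex cost plus half the optimal neighboring-edge assignment
        edges = min_assignment(neighbors(EG, u), neighbors(EQ, v), c)
        return c(VG.get(u), VQ.get(v)) + 0.5 * edges

    return min_assignment(VG, VQ, f_B)

# G: path with vertex labels A, A, B;  Q: path with vertex labels A, B, A.
G = ({1: "A", 2: "A", 3: "B"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
Q = ({1: "A", 2: "B", 3: "A"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
print(bed(G, Q))  # 1.0
```

Unlike the independent label assignments of LED, the branch structures tie each vertex to its incident edges, so the degree mismatch between equally labeled vertices is charged half an edge operation on each side.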

3 Tightness analysis

In this section, we analyze the tightness of the three GED lower bounds: LED, HED and BED. Specifically, we will prove that BED is the strongest of all; that is, for any two undirected graphs $G$ and $Q$ , we have $\mathit{BED}(G,Q)\geq\mathit{LED}(G,Q)$ and $\mathit{BED}(G,Q)\geq\mathit{HED}(G,Q)$ .
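As a small hand-worked illustration of these inequalities (an example constructed here under the uniform cost model, not taken from the paper), let $G$ be the path $1-2-3$ with vertex labels A, A, B, and let $Q$ be the path $1-2-3$ with vertex labels A, B, A, where every edge carries the same label. The label multisets of the two graphs coincide, so LED vanishes. Every vertex has a same-label partner and Hausdorff matching allows several vertices to share a partner, so the degree mismatch goes unnoticed and HED vanishes as well. BED pairs vertex 2 of $G$ (degree 2) with vertex 3 of $Q$ (degree 1), and vertex 3 of $G$ with vertex 2 of $Q$, charging half an edge operation for each pair:

```latex
\begin{aligned}
\mathit{LED}(G,Q) &= \lambda_{V}(G,Q)+\lambda_{E}(G,Q) = 0+0 = 0,\\
\mathit{HED}(G,Q) &= 0,\\
\mathit{BED}(G,Q) &= f_{B}(1,1)+f_{B}(2,3)+f_{B}(3,2) = 0+\tfrac{1}{2}+\tfrac{1}{2} = 1,\\
\mathit{ged}(G,Q) &= 2 .
\end{aligned}
```

Here $\mathit{ged}(G,Q)=2$ because no single edit operation transforms $G$ into $Q$, whereas relabeling vertices 2 and 3 does. The values satisfy $\mathit{LED}=\mathit{HED}<\mathit{BED}\leq\mathit{ged}$, in line with the inequalities proved in this section.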

3.1 Relation of LED and BED

Theorem 7.

Given two graphs $G$ and $Q$ , we have $\mathit{BED}(G,Q)\geq\mathit{LED}(G,Q)$ .

Proof.

For ease of proof, we insert dummy vertices and edges into $G$ to make it become a complete graph with $(|V_{G}|+|V_{Q}|)$ vertices. Similarly, we transform $Q$ into a complete graph that also has $(|V_{G}|+|V_{Q}|)$ vertices. Then, we can simplify (5) as

$$\begin{aligned}f_{B}(u,v)&=c(u\to v)+\frac{1}{2}\min_{\varrho:N_{u}^{\varepsilon}\to N_{v}^{\varepsilon}}\sum_{e(u,u^{\prime})\in N_{u}^{\varepsilon}}c(e(u,u^{\prime})\to\varrho(e(u,u^{\prime})))\\ &=c(u\to v)+\frac{1}{2}\min_{\zeta:V_{G}\backslash\{u\}\to V_{Q}\backslash\{v\}}\sum_{u^{\prime}\in V_{G}\backslash\{u\}}c(e(u,u^{\prime})\to e(v,\zeta(u^{\prime})))\\ &=c(u\to v)+\frac{1}{2}\sum_{u^{\prime}\in V_{G}\backslash\{u\}}c(e(u,u^{\prime})\to e(v,\zeta_{\min}^{u,v}(u^{\prime})))\end{aligned}$$

where $\zeta_{\min}^{u,v}$ is the bijection from $V_{G}\backslash\{u\}$ to $V_{Q}\backslash\{v\}$ for which $f_{B}(u,v)$ achieves the minimum value. Thus, we have

$$\begin{aligned}\mathit{BED}(G,Q)&=\min_{\rho:V_{G}\to V_{Q}}\sum_{u\in V_{G}}f_{B}(u,\rho(u))\\ &=\sum_{u\in V_{G}}\bigg\{c(u\to\rho_{\min}(u))+\frac{1}{2}\sum_{u^{\prime}\in V_{G}\backslash\{u\}}c(e(u,u^{\prime})\to e(\rho_{\min}(u),\zeta_{\min}^{u,\rho_{\min}(u)}(u^{\prime})))\bigg\}\\ &=\sum_{u\in V_{G}}c(u\to\rho_{\min}(u))+\frac{1}{2}\sum_{u\in V_{G}}\sum_{u^{\prime}\in V_{G}\backslash\{u\}}c(e(u,u^{\prime})\to e(\rho_{\min}(u),\zeta_{\min}^{u,\rho_{\min}(u)}(u^{\prime})))\\ &=\sum_{u\in V_{G}}c(u\to\rho_{\min}(u))+\sum_{e(u,u^{\prime})\in E_{G}}c(e(u,u^{\prime})\to\xi(e(u,u^{\prime})))\\ &\geq\min_{\phi:V_{G}\to V_{Q}}\sum_{u\in V_{G}}c(u\to\phi(u))+\min_{\varphi:E_{G}\to E_{Q}}\sum_{e(u,u^{\prime})\in E_{G}}c(e(u,u^{\prime})\to\varphi(e(u,u^{\prime})))\\ &=\lambda_{V}(G,Q)+\lambda_{E}(G,Q)=\mathit{LED}(G,Q)\end{aligned}$$

where $\rho_{\min}$ is the bijection from $V_{G}$ to $V_{Q}$ for which $\mathit{BED}(G,Q)$ achieves the minimum value, and $\xi$ is the bijection from $E_{G}$ to $E_{Q}$ satisfying $e(\rho_{\min}(u),\zeta_{\min}^{u,\rho_{\min}(u)}(u^{\prime}))=\xi(e(u,u^{\prime}))$ for all $u\in V_{G}$ and $u^{\prime}\in V_{G}\backslash\{u\}$. $\hfill\blacktriangleleft$

3.2 Relation of HED and BED

Lemma 8.

Given two vertices $u\in V_{G}^{\varepsilon}$ and $v\in V_{Q}^{\varepsilon}$ , then we have

$$f_{H}(u,v)\leq\begin{cases}f_{B}(u,v)&\text{if}\ u=\varepsilon\ \text{or}\ v=\varepsilon;\\[3pt] \frac{1}{2}f_{B}(u,v)&\text{otherwise,}\end{cases}$$

where $f_{H}(u,v)$ and $f_{B}(u,v)$ are defined in (2) and (5), respectively.

The proof of Lemma 8 is in Appendix A.

Theorem 9.

Given two graphs $G$ and $Q$ , we have $\mathit{BED}(G,Q)\geq\mathit{HED}(G,Q)$ .

Proof.

By the definition of $f_{H}(u,v)$ in (2), we know that when $u=\varepsilon$, $\min_{v\in V_{Q}\cup\{\varepsilon\}}f_{H}(\varepsilon,v)=f_{H}(\varepsilon,\varepsilon)=0$; and similarly when $v=\varepsilon$, $\min_{u\in V_{G}\cup\{\varepsilon\}}f_{H}(u,\varepsilon)=f_{H}(\varepsilon,\varepsilon)=0$. We can rewrite $\mathit{HED}(G,Q)$ as

$$\begin{aligned}\mathit{HED}(G,Q)&=\sum_{u\in V_{G}}\min_{v\in V_{Q}\cup\{\varepsilon\}}f_{H}(u,v)+\sum_{v\in V_{Q}}\min_{u\in V_{G}\cup\{\varepsilon\}}f_{H}(u,v)\\ &=\sum_{u\in V_{G}^{\varepsilon}}\min_{v\in V_{Q}^{\varepsilon}}f_{H}(u,v)+\sum_{v\in V_{Q}^{\varepsilon}}\min_{u\in V_{G}^{\varepsilon}}f_{H}(u,v)\\ &=\sum_{u\in V_{G}^{\varepsilon}}f_{H}(u,\pi_{1}(u))+\sum_{v\in V_{Q}^{\varepsilon}}f_{H}(\pi_{2}(v),v)\\ &=\sum_{u\in V_{G}^{\varepsilon}}\bigg\{f_{H}(u,\pi_{1}(u))+f_{H}\bigl(\pi_{2}(\rho_{\min}(u)),\rho_{\min}(u)\bigr)\bigg\},\end{aligned} \tag{6}$$

where $\pi_{1}$ is a mapping from $V_{G}^{\varepsilon}$ to $V_{Q}^{\varepsilon}$ satisfying $\pi_{1}(u)=\arg\min_{v\in V_{Q}^{\varepsilon}}f_{H}(u,v)$; $\pi_{2}$ is a mapping from $V_{Q}^{\varepsilon}$ to $V_{G}^{\varepsilon}$ satisfying $\pi_{2}(v)=\arg\min_{u\in V_{G}^{\varepsilon}}f_{H}(u,v)$; and $\rho_{\min}$ is the bijection from $V_{G}^{\varepsilon}$ to $V_{Q}^{\varepsilon}$ for which $\mathit{BED}(G,Q)$ achieves the minimum value. We know that

$$\mathit{BED}(G,Q)=\sum_{u\in V_{G}^{\varepsilon}}f_{B}(u,\rho_{\min}(u)). \tag{7}$$

By (6) and (7), we can complete the proof by showing that

$$f_{H}(u,\pi_{1}(u))+f_{H}\bigl(\pi_{2}(\rho_{\min}(u)),\rho_{\min}(u)\bigr)\leq f_{B}(u,\rho_{\min}(u)).$$

We do so by considering the following four exhaustive cases:

Case I: When $u=\varepsilon$ and $\rho_{\min}(u)=\varepsilon$, then $f_{H}(u,\pi_{1}(u))+f_{H}(\pi_{2}(\rho_{\min}(u)),\rho_{\min}(u))\leq f_{H}(\varepsilon,\varepsilon)+f_{H}(\varepsilon,\varepsilon)=f_{B}(\varepsilon,\varepsilon)=0$, by the definitions of $\pi_{1}$, $\pi_{2}$, and $\rho_{\min}$.
Case II: When $u\neq\varepsilon$ and $\rho_{\min}(u)=\varepsilon$, then $f_{H}(u,\pi_{1}(u))+f_{H}(\pi_{2}(\rho_{\min}(u)),\rho_{\min}(u))\leq f_{H}(u,\varepsilon)+f_{H}(\varepsilon,\varepsilon)=f_{H}(u,\varepsilon)$, by the definitions of $\pi_{1}$, $\pi_{2}$, and $\rho_{\min}$. By Lemma 8, we know that $f_{H}(u,\varepsilon)\leq f_{B}(u,\varepsilon)=f_{B}(u,\rho_{\min}(u))$.
Case III: When $u=\varepsilon$ and $\rho_{\min}(u)\neq\varepsilon$, the analysis is similar to that of Case II.
Case IV: When $u\neq\varepsilon$ and $\rho_{\min}(u)\neq\varepsilon$, then we have $f_{H}(u,\pi_{1}(u))\leq f_{H}(u,\rho_{\min}(u))$ and $f_{H}(\pi_{2}(\rho_{\min}(u)),\rho_{\min}(u))\leq f_{H}(u,\rho_{\min}(u))$, by the definitions of $\pi_{1}$, $\pi_{2}$, and $\rho_{\min}$. By Lemma 8, we know that $f_{H}(u,\rho_{\min}(u))\leq\frac{1}{2}f_{B}(u,\rho_{\min}(u))$. Thus, we have $f_{H}(u,\pi_{1}(u))+f_{H}\bigl(\pi_{2}(\rho_{\min}(u)),\rho_{\min}(u)\bigr)\leq 2\times\frac{1}{2}f_{B}(u,\rho_{\min}(u))=f_{B}(u,\rho_{\min}(u))$.

$\hfill\blacktriangleleft$

4 Tree-based search algorithm

The previous section showed that BED is the tightest of the three GED lower bounds. In this section, we propose an efficient heuristic estimation based upon BED to improve the tree-based search algorithm [2, 3] for the GED computation.

4.1 Search tree

Computing the GED of graphs $G$ and $Q$ is typically based upon a tree-based search procedure that explores all possible graph mappings from $G$ to $Q$ . Starting from a dummy node, root = $\emptyset$ , we logically create the search tree layer by layer by iteratively generating successors using BasicGenSuccr [9]. This search space can be organized as an ordered search tree, where the inner nodes denote partial graph mappings and the leaf nodes denote complete graph mappings. Such a search tree is created dynamically at runtime by iteratively generating successors linked by edges to the currently considered node. For more details, please refer to Section 2 in the reference [9].

4.2 Heuristic cost estimation

For a node $r$ in the search tree, let $h(r)$ be an estimate of the cost from $r$ to a descendant leaf node that never exceeds the real cost. Based upon BED, we introduce how to estimate $h(r)$ in the tree-based search algorithm.

4.2.1 Heuristic function

Consider an inner node $r=\{(u_{1}\to v_{j_{1}}),\dots,(u_{\ell}\to v_{j_{\ell}})\}$, where $v_{j_{k}}$ is the mapped vertex of $u_{k}$, for $1\leq k\leq\ell$. We divide $G$ into two subgraphs $G_{r}^{1}$ and $G_{r}^{2}$, where $G_{r}^{1}$ is the mapped part of $G$ such that $V_{G_{r}^{1}}=\{u_{1},\dots,u_{\ell}\}$ and $E_{G_{r}^{1}}=\{e(u,v):u,v\in V_{G_{r}^{1}}\land e(u,v)\in E_{G}\}$, and $G_{r}^{2}$ is the unmapped part such that $V_{G_{r}^{2}}=V_{G}\backslash V_{G_{r}^{1}}$ and $E_{G_{r}^{2}}=\{e(u,v):u,v\in V_{G_{r}^{2}}\land e(u,v)\in E_{G}\}$. We obtain $Q_{r}^{1}$ and $Q_{r}^{2}$ similarly.

Clearly, the lower bound $\mathit{BED}(G_{r}^{2},Q_{r}^{2})$ can be used to estimate the cost of $r$. However, $\mathit{BED}(G_{r}^{2},Q_{r}^{2})$ does not cover the potential edit cost on the edges between $G_{r}^{1}$ (resp., $Q_{r}^{1}$) and $G_{r}^{2}$ (resp., $Q_{r}^{2}$). Recently, [8, 9] proposed two different methods to cover this potential cost; nevertheless, these two methods work only for the uniform cost function (i.e., for which the cost of each edit operation is 1). We extend the method in [8] to support any cost function.

Definition 10.

Given vertices $u\in G_{r}^{2}$ and $v\in Q_{r}^{2}$, we define the cost of matching $u$ to $v$ as $f_{B}^{+}(u,v)=f_{B}(u,v)+\sum_{u^{\prime}\in V_{G_{r}^{1}}}c(e(u,u^{\prime})\to e(v,v^{\prime}))$, where $v^{\prime}$ is the mapped vertex of the already processed vertex $u^{\prime}$, and $f_{B}(u,v)$ is the minimum cost of transforming $B_{u}$ to $B_{v}$, which we defined in (5).

When there is no edge between $u$ and $u^{\prime}$, we set $e(u,u^{\prime})=\varepsilon$, and similarly for $e(v,v^{\prime})$. Based upon $f_{B}^{+}(u,v)$, we define the improved lower bound BED⁺ as

$$\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})=\min_{\rho:V_{G_{r}^{2}}^{\varepsilon}\to V_{Q_{r}^{2}}^{\varepsilon}}\sum_{u\in V_{G_{r}^{2}}^{\varepsilon}}f_{B}^{+}(u,\rho(u)) \tag{8}$$

Theorem 11.

Given a node $r$ in the GED tree of graphs $G$ and $Q$, we have $\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})\geq\mathit{BED}(G_{r}^{2},Q_{r}^{2})$, where $G_{r}^{2}$ and $Q_{r}^{2}$ are the unmapped subgraphs of $G$ and $Q$, respectively, and $\mathit{BED}^{+}(\cdot,\cdot)$ and $\mathit{BED}(\cdot,\cdot)$ are defined in (8) and Definition 6, respectively.

Proof.

The theorem follows directly, since edit costs are nonnegative and hence $f_{B}^{+}(u,v)\geq f_{B}(u,v)$ for all $u\in V_{G_{r}^{2}}$ and $v\in V_{Q_{r}^{2}}$. $\hfill\blacktriangleleft$
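The effect of the extra term in Definition 10 can be seen on a small instance. The sketch below is illustrative only: it assumes the uniform cost model, takes the branch structures inside the unmapped subgraphs $G_{r}^{2}$ and $Q_{r}^{2}$, and uses brute-force enumeration in place of an assignment solver. The function name `bed_lower_bound` and its `plus` flag are hypothetical; with `plus=False` the function returns $\mathit{BED}(G_{r}^{2},Q_{r}^{2})$, and with `plus=True` it returns $\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})$.

```python
from itertools import permutations

def c(a, b):
    """Uniform edit cost between two labels; None plays the role of epsilon."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1                      # insertion or deletion
    return 0 if a == b else 1         # label substitution

def edge(E, a, b):
    """Label of edge {a, b}; None if absent or if an endpoint is a dummy."""
    if a is None or b is None:
        return None
    return E.get(frozenset((a, b)))

def min_assignment(A, B, cost):
    """Cheapest bijection between A and B after padding each side with dummies."""
    a = list(A) + [None] * len(B)
    b = list(B) + [None] * len(A)
    m = [[cost(x, y) for y in b] for x in a]
    return min(sum(m[i][j] for i, j in enumerate(p))
               for p in permutations(range(len(b))))

def bed_lower_bound(G, Q, r, plus=True):
    """BED (plus=False) or BED+ (plus=True) on the parts left unmapped by r."""
    (VG, EG), (VQ, EQ) = G, Q
    G2 = [u for u in VG if u not in r]                 # unmapped vertices of G
    Q2 = [v for v in VQ if v not in set(r.values())]   # unmapped vertices of Q

    def branch(E, u, rest):
        # labels of the edges from u to the other unmapped vertices
        labels = (edge(E, u, w) for w in rest if w != u)
        return [lab for lab in labels if lab is not None]

    def f(u, v):
        cost = c(VG.get(u), VQ.get(v))
        cost += 0.5 * min_assignment(branch(EG, u, G2), branch(EQ, v, Q2), c)
        if plus:
            # extra term of Definition 10: edges towards the already mapped part
            cost += sum(c(edge(EG, u, u1), edge(EQ, v, v1)) for u1, v1 in r.items())
        return cost

    return min_assignment(G2, Q2, f)

# G: path with vertex labels A, A, B;  Q: path with vertex labels A, B, A.
G = ({1: "A", 2: "A", 3: "B"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
Q = ({1: "A", 2: "B", 3: "A"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
r = {1: 1}   # partial mapping: vertex 1 of G is mapped to vertex 1 of Q
print(bed_lower_bound(G, Q, r, plus=False), bed_lower_bound(G, Q, r))  # 0.0 2.0
```

For this partial mapping the unmapped subgraphs are identical up to relabeling, so BED sees no cost, whereas BED⁺ accounts for the edges that connect them to the mapped vertex and reaches 2, which equals the cheapest completion cost of this node.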

Theorem 12.

Given a descendant leaf node $s$ of $r$ , the heuristic estimation $h(r)=\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})$ is admissible; that is, $h(r)\leq g(s)-g(r)$ , where $g(\cdot)$ gives the incurred cost from the root node to the currently considered node.

Proof.

For ease of proof, we insert dummy vertices and edges into $G$ to transform it to a complete graph with $(|V_{G}|+|V_{Q}|)$ vertices. Similarly, we transform $Q$ to a complete graph that also has $(|V_{G}|+|V_{Q}|)$ vertices.

Consider an internal node $r=\{(u_{1}\to v_{j_{1}}),\dots,(u_{\ell}\to v_{j_{\ell}})\}$ in the search tree, where $v_{j_{k}}$ is the mapped vertex of $u_{k}$, for $1\leq k\leq\ell$. For ease of presentation, hereafter we use $r(u_{k})$ to denote the mapped vertex of $u_{k}$, i.e., $v_{j_{k}}=r(u_{k})$. Given a descendant leaf node $s$ of $r$ (i.e., $s$ is a complete vertex mapping from $G$ to $Q$), the incurred cost of $s$ is

$$\begin{aligned}g(s)&=\sum_{u\in V_{G}}c(u\to s(u))+\frac{1}{2}\sum_{u\in V_{G}}\sum_{u^{\prime}\in V_{G}\backslash\{u\}}c(e(u,u^{\prime})\to e(s(u),s(u^{\prime})))\\ &=\sum_{u\in V_{G}}c(u\to s(u))+\sum_{e(u,u^{\prime})\in\binom{V_{G}}{2}}c(e(u,u^{\prime})\to e(s(u),s(u^{\prime})))\end{aligned} \tag{9}$$

As we know, $r$ induces an edit path transforming $G_{r}^{1}$ to $Q_{r}^{1}$ , where $G_{r}^{1}$ and $Q_{r}^{1}$ are the already mapped subgraphs of $G$ and $Q$ , respectively, and $V_{G_{r}^{1}}=\{u_{1},\dots,u_{\ell}\}$ and $V_{Q_{r}^{1}}=\{r(u_{1}),\dots,r(u_{\ell})\}$ . According to (9), we know that

$$g(r)=\sum_{u\in V_{G_{r}^{1}}}c(u\to r(u))+\sum_{e(u,u^{\prime})\in\binom{V_{G_{r}^{1}}}{2}}c(e(u,u^{\prime})\to e(r(u),r(u^{\prime})))$$

Let $\omega=s\backslash r$ be the partial mapping that contains the vertex mapping pairs belonging to $s$ but not to $r$; namely, $\omega=\{(u\to s(u)):u\in V_{G}\backslash V_{G_{r}^{1}}\}$. We obtain

$$\begin{aligned}g(s)-g(r)&=\sum_{u\in V_{G}}c(u\to s(u))+\sum_{e(u,u^{\prime})\in\binom{V_{G}}{2}}c(e(u,u^{\prime})\to e(s(u),s(u^{\prime})))\\ &\qquad-\biggl\{\sum_{u\in V_{G_{r}^{1}}}c(u\to r(u))+\sum_{e(u,u^{\prime})\in\binom{V_{G_{r}^{1}}}{2}}c(e(u,u^{\prime})\to e(r(u),r(u^{\prime})))\biggr\}\\ &=\sum_{u\in V_{G_{r}^{2}}}c(u\to\omega(u))+\sum_{e(u,u^{\prime})\in\binom{V_{G_{r}^{2}}}{2}}c(e(u,u^{\prime})\to e(\omega(u),\omega(u^{\prime})))\\ &\qquad+\sum_{u\in V_{G_{r}^{2}}}\sum_{u^{\prime}\in V_{G_{r}^{1}}}c(e(u,u^{\prime})\to e(\omega(u),r(u^{\prime})))\\ &=\sum_{u\in V_{G_{r}^{2}}}\biggl\{c(u\to\omega(u))+\frac{1}{2}\sum_{u^{\prime}\in V_{G_{r}^{2}}\backslash\{u\}}c(e(u,u^{\prime})\to e(\omega(u),\omega(u^{\prime})))\\ &\qquad+\sum_{u^{\prime}\in V_{G_{r}^{1}}}c(e(u,u^{\prime})\to e(\omega(u),r(u^{\prime})))\biggr\}\\ &=\sum_{u\in V_{G_{r}^{2}}}f_{B}^{+}(u,\omega(u))\geq\min_{\rho:V_{G_{r}^{2}}\to V_{Q_{r}^{2}}}\sum_{u\in V_{G_{r}^{2}}}f_{B}^{+}(u,\rho(u))=\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})=h(r),\end{aligned}$$

where $V_{G_{r}^{2}}=V_{G}\backslash V_{G_{r}^{1}}$ and $V_{Q_{r}^{2}}=V_{Q}\backslash V_{Q_{r}^{1}}$. The second equality is due to $\binom{V_{G}}{2}=\binom{V_{G_{r}^{1}}}{2}\cup\binom{V_{G_{r}^{2}}}{2}\cup(V_{G_{r}^{1}}\times V_{G_{r}^{2}})$ when $V_{G}$ is partitioned into two disjoint sets $V_{G_{r}^{1}}$ and $V_{G_{r}^{2}}$. $\hfill\blacktriangleleft$

We give an example in Appendix B of computing the three GED lower bounds: LED, HED and BED. The same optimization that produces BED⁺ from BED can be applied to LED and HED to achieve enhanced heuristics LED⁺ and HED⁺, but we do not include them in this paper.

4.3 Algorithm

In this section, we show how to incorporate the heuristic estimation $\mathit{BED}^{+}$ into the anytime-based GED computation algorithm [2]. We consider the anytime-based algorithm because it is flexible: given more running time, it outputs progressively tighter GED upper bounds until it reaches the exact GED value.

Algorithm 1 gives the anytime-based algorithm for computing the GED, where $t_{\max}$ is the user-defined maximum running time. We perform a depth-first search over the GED search tree of $G$ and $Q$ to find better and better GED upper bounds until the running time $\#t$ reaches $t_{\max}$. To accomplish this, we first employ the BP [25] algorithm to quickly compute an initial GED upper bound $\mathit{ub}$; then, we use a stack $\mathcal{S}$ to carry out the depth-first search. Each time we pop a node $q$ from $\mathcal{S}$. If $q$ is a leaf node, then we have found a better solution and update $\mathit{ub}$. Otherwise, we call procedure BasicGenSuccr [9] (see Appendix C) to generate the successors of $q$ and then insert them into $\mathcal{S}$. During this process, we adopt the branch-and-bound strategy to prune the useless space: for each successor $r$, if $g(r)+h(r)\geq\mathit{ub}$, we can safely prune it, where $h(r)=\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})$ is defined in (8).

Algorithm 1 Anytime-based GED computation.
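The procedure described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes the uniform cost model, replaces BP by the trivial upper bound "delete all of $G$ and insert all of $Q$", replaces BasicGenSuccr by a plain successor generation that maps the next vertex of $G$ to every unused vertex of $Q$ or to $\varepsilon$, and solves assignments by brute force. All function names are hypothetical.

```python
import time
from itertools import permutations

def c(a, b):
    """Uniform edit cost between two labels; None plays the role of epsilon."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1
    return 0 if a == b else 1

def edge(E, a, b):
    """Label of edge {a, b}; None if absent or if an endpoint is a dummy."""
    return None if a is None or b is None else E.get(frozenset((a, b)))

def min_assignment(A, B, cost):
    """Brute-force stand-in for an assignment solver (toy inputs only)."""
    a = list(A) + [None] * len(B)
    b = list(B) + [None] * len(A)
    m = [[cost(x, y) for y in b] for x in a]
    return min(sum(m[i][j] for i, j in enumerate(p))
               for p in permutations(range(len(b))))

def g_cost(G, Q, r):
    """Cost incurred by the partial mapping r: mapped vertices and edges among them."""
    (VG, EG), (VQ, EQ) = G, Q
    pairs = list(r.items())
    total = sum(c(VG[u], VQ.get(v)) for u, v in pairs)
    for i, (u, v) in enumerate(pairs):
        total += sum(c(edge(EG, u, u1), edge(EQ, v, v1)) for u1, v1 in pairs[:i])
    return total

def bed_plus(G, Q, r):
    """h(r) = BED+ on the unmapped parts, following Definition 10 and Eq. (8)."""
    (VG, EG), (VQ, EQ) = G, Q
    G2 = [u for u in VG if u not in r]
    Q2 = [v for v in VQ if v not in set(r.values())]
    def branch(E, u, rest):
        labels = (edge(E, u, w) for w in rest if w != u)
        return [lab for lab in labels if lab is not None]
    def f_plus(u, v):
        inner = min_assignment(branch(EG, u, G2), branch(EQ, v, Q2), c)
        cross = sum(c(edge(EG, u, u1), edge(EQ, v, v1)) for u1, v1 in r.items())
        return c(VG.get(u), VQ.get(v)) + 0.5 * inner + cross
    return min_assignment(G2, Q2, f_plus)

def anytime_ged(G, Q, t_max=1.0):
    """Depth-first branch and bound; returns the best upper bound found in t_max seconds."""
    (VG, EG), (VQ, EQ) = G, Q
    order = list(VG)                                   # fixed processing order of V_G
    ub = len(VG) + len(EG) + len(VQ) + len(EQ)         # trivial bound (assumption, not BP)
    stack = [(bed_plus(G, Q, {}), {})]                 # root: empty mapping
    start = time.monotonic()
    while stack and time.monotonic() - start < t_max:
        f, q = stack.pop()
        if f >= ub:
            continue                                   # ub improved since q was pushed
        if len(q) == len(order):                       # leaf: here g + h is the exact cost
            ub = f
            continue
        u = order[len(q)]
        used = set(q.values())
        succ = []
        for v in [w for w in VQ if w not in used] + [None]:
            r = {**q, u: v}
            f_r = g_cost(G, Q, r) + bed_plus(G, Q, r)
            if f_r < ub:                               # branch-and-bound pruning
                succ.append((f_r, r))
        succ.sort(key=lambda t: t[0], reverse=True)
        stack.extend(succ)                             # cheapest successor is popped first
    return ub

G = ({1: "A", 2: "A", 3: "B"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
Q = ({1: "A", 2: "B", 3: "A"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
print(anytime_ged(G, Q, t_max=10.0))  # 2.0
```

At a leaf all vertices of $G$ are mapped, so the remaining work is to insert the unmapped vertices of $Q$ and their edges; in this sketch that cost is exactly what `bed_plus` returns, which is why the leaf test can use $g+h$ directly. If the time budget runs out early, the returned value is still a valid upper bound.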

5 Experiments

5.1 Datasets and settings

Datasets.

We chose four real datasets (GREC, MUTA, PRO, and CMU) and one synthetic dataset (SYN) for the experiments. The datasets GREC, MUTA, and PRO were taken from the IAM Graph Database Repository [24]; the CMU dataset can be found at the CMU website [13]; and the SYN dataset was generated by the synthetic graph generator GraphGen [12]. Following the same procedure as in [2, 21], we selected subsets of GREC, MUTA, and PRO as the tested datasets, where each subset consists of graphs that have the same number of vertices. Specifically, the subsets of GREC contain 5, 10, 15, and 20 vertices, respectively; the subsets of MUTA contain 10, 20, $\dots$, 70 vertices, respectively; the subsets of PRO contain 20, 30, and 40 vertices, respectively; and each subset consists of 10 graphs.

Table 1 summarizes the characteristics and the applied cost function of each dataset. ED and SED are short for the Euclidean distance and string edit distance functions, respectively. $c_{v}$ is the cost of inserting/deleting a vertex; $c_{e}$ is the cost of inserting/deleting an edge; $c_{vs}$ and $c_{es}$ are the costs of substituting a vertex and an edge, respectively. In addition, we introduce a parameter $\alpha$ to control whether edit operations on vertices or edges are more important.

Settings.

We conducted all the experiments on an HP Z800 PC running the Ubuntu 12.04 LTS operating system and equipped with a 2.67 GHz CPU and 24 GB of memory. We implemented the algorithm in C++ and compiled it with the -O3 optimization flag.

Table 1: Summary of characteristics of datasets and cost functions used.

Dataset	#Graphs	avg. $|V|$	avg. $|E|$	vertex labels	edge labels	$c_{v}$	$c_{e}$	$\alpha$	$c_{vs}$	$c_{es}$
GREC	40	12.5	17.5	(x, y) coord.	Line type	90	15	0.5	Ext. ED	Dirac
MUTA	70	40	41.5	Chem. symbol	Valence	11	1.1	0.25	Dirac	Dirac
PRO	30	30	58.6	Type/AA-seq.	Type/length	11	1	0.75	Ext. SED	Dirac
CMU	111	30	79.1	None	Distance	$\infty$	$-$	0.5	0	L1 norm
SYN	100	14.5	20	Symbol	Symbol	0.3	0.5	0.75	Dirac	Dirac

5.2 Evaluation metrics

We use two metrics to evaluate algorithm performance: deviation ($\mathit{dev}$) [1] and solvable ratio ($\mathit{sr}$) [9]. The metric $\mathit{dev}$ measures the relative error of the GED value produced by an algorithm. Formally, given two graphs $G$ and $Q$, the deviation of the two graphs is computed as $\mathit{deviation}(G,Q)=|\mathit{dis}(G,Q)-R(G,Q)|/R(G,Q)$, where $\mathit{dis}(G,Q)$ is the (approximate) GED value produced by the algorithm, and $R(G,Q)$ is the best GED value produced in all the experiments done on the graph database repository in [1]. Based upon the pairwise comparison model, the deviation on the dataset $\mathcal{G}$ is computed as

$$\mathit{dev}=\frac{1}{|\mathcal{G}|\times|\mathcal{G}|}\sum\nolimits_{G\in\mathcal{G}}\sum\nolimits_{Q\in\mathcal{G}}\mathit{deviation}(G,Q) \tag{10}$$
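As a small illustration of (10), the following Python sketch averages the pairwise deviation over a dataset. The matrix layout (`dis[i][j]` and `ref[i][j]` for the pair of graphs $i$ and $j$) and the function name are assumptions made here for illustration. The text does not say how pairs with $R(G,Q)=0$ (for instance $G=Q$) are treated; this sketch assumes they contribute zero deviation.

```python
def dataset_dev(dis, ref):
    """Average relative deviation over all ordered pairs, following (10)."""
    n = len(ref)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if ref[i][j] == 0:
                # Assumption of this sketch: pairs with a zero reference value
                # are counted as zero deviation to avoid dividing by zero.
                continue
            total += abs(dis[i][j] - ref[i][j]) / ref[i][j]
    return total / (n * n)

ref = [[0, 4], [4, 0]]   # hypothetical reference GED values R(G, Q)
dis = [[0, 5], [6, 0]]   # hypothetical values produced by an algorithm
print(dataset_dev(dis, ref))
```

Here the two off-diagonal pairs deviate by 0.25 and 0.5, so the average over the four ordered pairs is 0.1875, i.e. 18.75% when reported as a percentage as in the tables.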

The metric $\mathit{sr}$ measures how often the exact GED value is obtained within the maximum running time threshold $t_{\max}$. Formally, let $\mathit{solve}(G,Q)$ indicate whether an algorithm outputs the exact GED of $G$ and $Q$ within $t_{\max}$ time; in other words, $\mathit{solve}(G,Q)=1$ if the algorithm requires less than $t_{\max}$ time to output the GED, and $\mathit{solve}(G,Q)=0$ otherwise. The solvable ratio ($\mathit{sr}$) on the dataset $\mathcal{G}$ is computed as

$$\mathit{sr}=\frac{1}{|\mathcal{G}|\times|\mathcal{G}|}\sum\nolimits_{G\in\mathcal{G}}\sum\nolimits_{Q\in\mathcal{G}}\mathit{solve}(G,Q) \tag{11}$$
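The solvable ratio in (11) can be sketched in the same way. The representation is an assumption of this example: `times_ms[i][j]` holds the time in milliseconds at which the exact GED of pair $(i,j)$ was output, or `None` if the algorithm never produced it.

```python
def solvable_ratio(times_ms, t_max_ms):
    """Fraction of ordered pairs solved exactly in less than t_max, following (11)."""
    n = len(times_ms)
    solved = sum(
        1
        for row in times_ms
        for t in row
        if t is not None and t < t_max_ms   # the indicator of an exactly solved pair
    )
    return solved / (n * n)

times_ms = [[1, 50], [None, 2]]   # hypothetical measurements
print(solvable_ratio(times_ms, t_max_ms=10))
```

With a 10 ms budget, two of the four pairs are solved in time, giving a ratio of 0.5.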

A smaller $\mathit{dev}$ and a larger $\mathit{sr}$ reflect better performance of an algorithm.

5.3 Experimental results

As described earlier in this paper, we first analyzed the relation of three GED lower bounds (i.e., LED, HED and BED). Then based upon BED we proposed an efficient heuristic estimation BED⁺. Thus, it is necessary to evaluate the contribution of these lower bounds to the GED computation.

5.3.1 Tightness of LED, HED and BED

In this section, we evaluate the tightness of three GED lower bounds LED, HED and BED as well as their running time. Table 2 shows the obtained results, where the abbreviation “ms” represents milliseconds.

As shown in Table 2, BED achieves the smallest $\mathit{dev}$ on every dataset, which means that BED is closest to the exact GED value. This result is consistent with the analysis in Section 3, i.e., $\mathit{BED}(G,Q)\geq\mathit{HED}(G,Q)$ and $\mathit{BED}(G,Q)\geq\mathit{LED}(G,Q)$ for any two graphs $G$ and $Q$. We also find that LED performs better than HED in most cases (all datasets except CMU); the reason is that HED allows multiple assignments between vertices of $G$ and $Q$ and greedily selects matched vertices with the lowest cost.

We also list the running time of each method in Table 2. HED runs faster than LED, and LED runs faster than BED. The reason is that HED runs in quadratic time, while both LED and BED run in cubic time. LED considers the costs of substituting vertices and edges independently and ignores the graph structure, and thus it runs faster than BED.

Table 2: Deviation (%) and running time (ms) of LED, HED, and BED.

Datasets	LED		HED		BED
	$\mathit{dev}$	$\mathit{time}$	$\mathit{dev}$	$\mathit{time}$	$\mathit{dev}$	$\mathit{time}$
GREC	4.41	0.38	17.45	0.29	3.54	0.52
MUTA	12.54	1.07	30.13	0.47	11.49	2.72
PRO	4.61	3.75	21.31	2.84	3.25	6.07
CMU	61.56	9.53	57.6	3.77	25.1	13.2
SYN	67.41	0.28	91.92	0.14	46.61	0.45

5.3.2 Effect of heuristic

Observing that BED produces a tighter lower bound than LED and HED, we propose BED⁺ as a heuristic estimation to improve the GED computation. For the comparison, we adopted LED, HED, BED, and BED⁺ as the heuristic estimations, respectively, and fixed the running time $t_{\max}=10^{4}$ ms. Table 3 lists the obtained deviation $\mathit{dev}$ and solvable ratio $\mathit{sr}$.

Table 3: Deviation (%) and solvable ratio (%) of LED, HED, BED, and BED⁺.

Datasets	LED		HED		BED		BED⁺
	$\mathit{dev}$	$\mathit{sr}$	$\mathit{dev}$	$\mathit{sr}$	$\mathit{dev}$	$\mathit{sr}$	$\mathit{dev}$	$\mathit{sr}$
GREC	0.36	69.87	0.56	54.38	0.22	67.88	0.01	90
MUTA	5.56	3.47	4.85	3.27	4.49	3.51	2.6	22.33
PRO	1.27	4	0.71	3.56	0.68	4.22	0.06	4.22
CMU	109.89	19.06	49.18	19.06	52.83	31.16	2.97	75.79
SYN	8.79	8.4	9.48	4.18	7.96	7.48	0.17	94.34

Table 3 shows that using BED⁺ as a heuristic produces the smallest $\mathit{dev}$ on all five datasets. This is because BED⁺ produces a higher estimated bound, which prunes more of the search space. For the solvable ratio $\mathit{sr}$, BED⁺ also achieves the best performance on four datasets and ties with BED on PRO (4.22%).

We also varied the running time $t_{\max}$ from $10^{1}$ ms to $10^{5}$ ms in order to evaluate the above heuristic estimations under different running times. As shown in Figure 1, as $t_{\max}$ increases, all four heuristic estimations yield a progressively lower deviation $\mathit{dev}$ and a progressively higher solvable ratio $\mathit{sr}$. Also, we find in most cases that BED⁺ achieves the best $\mathit{dev}$ and $\mathit{sr}$ under both small (e.g., $10^{2}$ ms) and large (e.g., $10^{5}$ ms) running times. Compared with the widely used heuristic estimation LED, using BED⁺ decreases the $\mathit{dev}$ by 72.2%, 34.1%, 48.1%, 39.4%, and 52.1% on average on the GREC, MUTA, PRO, CMU, and SYN datasets, respectively, and increases the $\mathit{sr}$ by 54.3%, 293.2%, 3.4%, 113.1%, and 702.8% on average on the same five datasets, respectively. Thus, we conclude that using BED⁺ as a heuristic estimation can greatly improve the GED computation.

Figure 1: The $\mathit{dev}$ (top two rows) and $\mathit{sr}$ (bottom two rows) under different running times.

6 Conclusion and future works

In this paper, we analyze the relationship among three state-of-the-art GED lower bounds that are widely used as heuristic estimations in the tree-based search algorithm for the GED computation. Specifically, we demonstrate that $\mathit{BED}(G,Q)\geq\mathit{LED}(G,Q)$ and $\mathit{BED}(G,Q)\geq\mathit{HED}(G,Q)$ for any two undirected graphs $G$ and $Q$. Furthermore, based upon BED we propose an efficient heuristic estimation BED⁺ and demonstrate that BED⁺ still estimates a cost that is not greater than the real cost. Experimental results on four real datasets and one synthetic dataset confirm that BED⁺ achieves the best performance under both small and large running times.

When calculating the heuristic estimation $\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})$ , we first compute the transformation cost (i.e., $\mathit{f}_{B}(\cdot,\cdot)$ ) of two compared branch structures. In fact, the transformation cost of these two branch structures may have been calculated many times in the previous traversal of the search tree. Future work will consider how to build a suitable index structure to maintain the transformation cost of these traversed branch structures in order to accelerate the computation of $\mathit{BED}^{+}(G_{r}^{2},Q_{r}^{2})$ .

References

[1] Z. Abu-Aisheh, R. Raveaux, and J. Y. Ramel. A graph database repository and performance evaluation metrics for graph edit distance. In GbRPR, pages 138–147, 2015.
[2] Z. Abu-Aisheh, R. Raveaux, and J. Y. Ramel. Anytime graph matching. Pattern Recogn Lett., 84:215–224, 2016. doi:10.1016/J.PATREC.2016.10.004.
[3] Z. Abu-Aisheh, R. Raveaux, J. Y. Ramel, and P. Martineau. An exact graph edit distance algorithm for solving pattern recognition problems. In ICPRAM, pages 271–278, 2015.
[4] D. B. Blumenthal and J. Gamper. Exact computation of graph edit distance for uniform and non-uniform metric edit costs. In GbRPR, pages 211–221, 2017.
[5] D. B. Blumenthal and J. Gamper. Improved lower bounds for graph edit distance. IEEE Trans. Knowl Data Eng., 30(3):503–516, 2018. doi:10.1109/TKDE.2017.2772243.
[6] D. B. Blumenthal and J. Gamper. On the exact computation of the graph edit distance. Pattern Recogn Lett., 134:46–57, 2020. doi:10.1016/J.PATREC.2018.05.002.
[7] B. Bonet and H. Geffner. Planning as heuristic search. Artif. Intell., 129(1-2):5–33, 2001. doi:10.1016/S0004-3702(01)00108-4.
[8] L. Chang, X. Feng, X. Lin, L. Qin, and W. Zhang. Efficient graph edit distance computation and verification via anchor-aware lower bound estimation. CoRR, 2017. arXiv:1709.06810.
[9] X. Chen, H. Huo, J. Huan, and J. S. Vitter. An efficient algorithm for graph edit distance computation. Knowl.-Based Syst., 163:762–775, 2019. doi:10.1016/J.KNOSYS.2018.10.002.
[10] X. Chen, H. Huo, J. Huan, J. S. Vitter, W. Zheng, and L. Zou. MSQ-Index: A succinct index for fast graph similarity search. IEEE Trans. Knowl Data Eng., 33(6):2654–2668, 2021. doi:10.1109/TKDE.2019.2954527.
[11] X. Chen, Y. Wang, H. Huo, and J. S. Vitter. An efficient heuristic for graph edit distance [source code], June 2019. URL: https://github.com/Hongweihuo-Lab/Heur-GED.
[12] James Cheng, Yiping Ke, and Wilfred Ng. GraphGen — a synthetic graph data generator. URL: https://cse.hkust.edu.hk/graphgen/.
[13] CMU house and hotel datasets. URL: https://github.com/dbblumenthal/gedlib/blob/master/data/datasets/CMU-GED.
[14] X. Cortés and F. Serratosa. Learning graph-matching edit-costs based on the optimality of the oracle’s node correspondences. Pattern Recogn Lett., 56:22–29, 2015. doi:10.1016/J.PATREC.2015.01.009.
[15] A. Fischer, R. Plamondon, Y. Savaria, K. Riesen, and H. Bunke. A Hausdorff heuristic for efficient computation of graph edit distance. Structural, Syntactic, and Statistical Pattern Recognition, LNCS 8621:83–92, 2014.
[16] A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke. Approximation of graph edit distance based on Hausdorff matching. Pattern Recogn., 48(2):331–343, 2015. doi:10.1016/J.PATCOG.2014.07.015.
[17] K. Gouda and M. Hassaan. CSI_GED: An efficient approach for graph edit similarity computation. In ICDE, pages 256–275, 2016.
[18] D. Justice and A. Hero. A binary linear programming formulation of the graph edit distance. IEEE Trans. Pattern Anal Mach Intell., 28(8):1200–1214, 2006. doi:10.1109/TPAMI.2006.152.
[19] J. Kim. Efficient graph edit distance computation using isomorphic vertices. Pattern Recogn Lett., 168(2023):71–78, 2023. doi:10.1016/J.PATREC.2023.03.002.
[20] H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955. doi:10.1002/nav.3800020109.
[21] J. Lerouge, Z. Abu-Aisheh, R. Raveaux, P. Héroux, and S. Adam. New binary linear programming formulation to compute the graph edit distance. Pattern Recogn., 72:254–265, 2017. doi:10.1016/J.PATCOG.2017.07.029.
[22] J. Liu, M. Zhou, S. Ma, and L. Pan. MATA*: Combining learnable node matching with A* algorithm for approximate graph edit distance computation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), pages 1503–1512, 2023.
[23] Chengzhi Piao, Tingyang Xu, Xiangguo Sun, Yu Rong, Kangfei Zhao, and Hong Cheng. Computing graph edit distance via neural graph matching. Proceedings of the VLDB Endowment, 16(8):1817–1829, 2023. doi:10.14778/3594512.3594514.
[24] K. Riesen and H. Bunke. IAM graph database repository for graph based pattern recognition and machine learning. Structural, Syntactic, and Statistical Pattern Recognition, pages 287–297, 2008.
[25] K. Riesen and H. Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image Vision Comput., 27(7):950–959, 2009. doi:10.1016/J.IMAVIS.2008.04.004.
[26] K. Riesen, S. Emmenegger, and H. Bunke. A novel software toolkit for graph edit distance computation. In GbRPR, pages 142–151, 2013.
[27] K. Riesen, S. Fankhauser, and H. Bunke. Speeding up graph edit distance computation with a bipartite heuristic. In MLG, pages 21–24, 2007.
[28] S. Russell and P. Norvig. Artificial Intelligence: a Modern Approach (2nd ed.). Prentice-Hall, New Jersey, USA, 2002.
[29] O Schütze, X. Esquivel, A. Lara, and C. A. C. Carlos. Using the averaged Hausdorff distance as a performance measure in evolutionary multiobjective optimization. IEEE Trans. Evol. Comput., 16(4):504–522, 2012. doi:10.1109/TEVC.2011.2161872.
[30] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou. Comparing stars: On approximating graph edit distance. PVLDB, 2(1):25–36, 2009. doi:10.14778/1687627.1687631.
[31] W. Zheng, L. Zou, X. Lian, D. Wang, and D. Zhao. Efficient graph similarity search over large graph databases. IEEE Trans. Knowl Data Eng., 27(4):964–978, 2015. doi:10.1109/TKDE.2014.2349924.

Appendix A Proof of Lemma 8

Proof.

We prove this lemma by considering the following two cases:

Case I.

When $u=\varepsilon$ or $v=\varepsilon$. We first discuss the case $v=\varepsilon$. By definition, $f_{H}(u,\varepsilon)=f_{B}(u,\varepsilon)=c(u\to\varepsilon)+\frac{1}{2}\sum_{e\in N_{u}}c(e\to\varepsilon)$. Similarly, when $u=\varepsilon$, we also have $f_{H}(\varepsilon,v)=f_{B}(\varepsilon,v)$. Thus, when $u=\varepsilon$ or $v=\varepsilon$, the lemma follows.

Case II.

When $u\neq\varepsilon$ and $v\neq\varepsilon$. Then, we know that

$$\begin{aligned} f_{B}(u,v) &= c(u\to v)+\frac{1}{2}\mathit{MLS}(N_{u},N_{v}), && (12)\\ f_{H}(u,v) &= \frac{1}{2}\bigg\{c(u\to v)+\frac{1}{2}\mathit{HED}(N_{u},N_{v})\bigg\}, && (13) \end{aligned}$$

where $\mathit{MLS}(N_{u},N_{v})=\min_{\varrho:N_{u}^{\varepsilon}\to N_{v}^{\varepsilon}}\sum_{e\in N_{u}^{\varepsilon}}c(e\to\varrho(e))$, and $\varrho$ is a bijection from $N_{u}^{\varepsilon}$ to $N_{v}^{\varepsilon}$.

In order to prove $f_{H}(u,v)\leq\frac{1}{2}f_{B}(u,v)$ , it suffices from (12) and (13) to prove

$$\mathit{HED}(N_{u},N_{v})\leq\mathit{MLS}(N_{u},N_{v}),$$

which we do as follows:

(i)

Rewriting $\mathit{MLS}(N_{u},N_{v})$ and $\mathit{HED}(N_{u},N_{v})$:

$$\mathit{MLS}(N_{u},N_{v})=\sum_{e\in N_{u}^{\varepsilon}}c(e\to\varrho_{\min}(e))=\sum_{e\in N_{u}^{\varepsilon}}c(e\to y),$$

where $\varrho_{\min}$ is the bijection from $N_{u}^{\varepsilon}$ to $N_{v}^{\varepsilon}$ at which $\mathit{MLS}(N_{u},N_{v})$ attains its minimum value, and $y=\varrho_{\min}(e)\in N_{v}^{\varepsilon}$ is $e$’s mapped edge under the bijection $\varrho_{\min}$, for all $e\in N_{u}^{\varepsilon}$;

$$\mathit{HED}(N_{u},N_{v})=\sum_{e\in N_{u}^{\varepsilon}}\bigg\{f_{H}(e,\chi_{1}(e))+f_{H}(\chi_{2}(y),y)\bigg\},$$

where $\chi_{1}$ is the mapping from $N_{u}^{\varepsilon}$ to $N_{v}^{\varepsilon}$ satisfying $\chi_{1}(e)=\arg\min_{e^{\prime}\in N_{v}^{\varepsilon}}f_{H}(e,e^{\prime})$, for all $e\in N_{u}^{\varepsilon}$; and $\chi_{2}$ is the mapping from $N_{v}^{\varepsilon}$ to $N_{u}^{\varepsilon}$ satisfying $\chi_{2}(y)=\arg\min_{e\in N_{u}^{\varepsilon}}f_{H}(e,y)$.
(ii)

Proving $f_{H}(e,\chi_{1}(e))+f_{H}(\chi_{2}(y),y)\leq c(e\to y)$: According to the definitions of $\chi_{1}$ and $\chi_{2}$, $f_{H}(e,\chi_{1}(e))\leq\min\{f_{H}(e,y),f_{H}(e,\varepsilon)\}$ and $f_{H}(\chi_{2}(y),y)\leq\min\{f_{H}(e,y),f_{H}(\varepsilon,y)\}$. We discuss the following cases (a)–(d):

(a) When $e=\varepsilon$ and $y=\varepsilon$, then $f_{H}(e,\chi_{1}(e))+f_{H}(\chi_{2}(y),y)\leq f_{H}(\varepsilon,\varepsilon)+f_{H}(\varepsilon,\varepsilon)=c(\varepsilon\to\varepsilon)=0$;

(b) When $e\neq\varepsilon$ and $y=\varepsilon$, then $f_{H}(e,\chi_{1}(e))+f_{H}(\chi_{2}(y),y)\leq f_{H}(e,\varepsilon)+f_{H}(\varepsilon,\varepsilon)=c(e\to\varepsilon)$;

(c) When $e=\varepsilon$ and $y\neq\varepsilon$, the analysis is similar to that of (b);

(d) When $e\neq\varepsilon$ and $y\neq\varepsilon$, then $f_{H}(e,\chi_{1}(e))+f_{H}(\chi_{2}(y),y)\leq f_{H}(e,y)+f_{H}(e,y)=2f_{H}(e,y)=2\times\frac{1}{2}c(e\to y)=c(e\to y)$.
(iii)

Combining (i) and (ii), we have

$$\begin{aligned} \mathit{HED}(N_{u},N_{v}) &= \sum_{e\in N_{u}^{\varepsilon}}\bigg\{f_{H}(e,\chi_{1}(e))+f_{H}(\chi_{2}(y),y)\bigg\}\\ &\leq \sum_{e\in N_{u}^{\varepsilon}}c(e\to y)=\mathit{MLS}(N_{u},N_{v}). \end{aligned}$$

Therefore, $f_{H}(u,v)\leq\frac{1}{2}f_{B}(u,v)$ when $u\neq\varepsilon$ and $v\neq\varepsilon$ . This completes the proof.

$\hfill\blacktriangleleft$
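The key inequality of the proof, $\mathit{HED}(N_{u},N_{v})\leq\mathit{MLS}(N_{u},N_{v})$, can also be checked numerically on small edge-label sets. The sketch below is only a sanity check under assumptions made here: a Dirac edge cost of 1, the edge-level rules $f_{H}(e,y)=\frac{1}{2}c(e\to y)$ and $f_{H}(e,\varepsilon)=c(e\to\varepsilon)$ used in cases (b) and (d), padding of each side with as many $\varepsilon$ elements as the other side has edges, and brute-force enumeration of bijections in place of an assignment solver.

```python
from itertools import permutations, product

EPS = None  # stands for the dummy edge

def c(e1, e2):
    """Dirac edge cost (assumption): 0 for equal labels, 1 otherwise."""
    return 0.0 if e1 == e2 else 1.0

def f_h(e1, e2):
    """Edge-level Hausdorff cost: full cost against the dummy, half cost otherwise."""
    if e1 is EPS or e2 is EPS:
        return c(e1, e2)
    return 0.5 * c(e1, e2)

def hed(xs, ys):
    """Bidirectional greedy matching between two edge-label sets."""
    left = sum(min(f_h(x, y) for y in list(ys) + [EPS]) for x in xs)
    right = sum(min(f_h(x, y) for x in list(xs) + [EPS]) for y in ys)
    return left + right

def mls(xs, ys):
    """Minimum linear sum over all bijections between the padded sets (brute force)."""
    rows = list(xs) + [EPS] * len(ys)
    cols = list(ys) + [EPS] * len(xs)
    return min(sum(c(rows[i], cols[j]) for i, j in enumerate(p))
               for p in permutations(range(len(rows))))

def bound_holds(alphabet="ab", max_size=3):
    """Check HED <= MLS for every pair of label tuples up to max_size."""
    for nx in range(max_size + 1):
        for ny in range(max_size + 1):
            for xs in product(alphabet, repeat=nx):
                for ys in product(alphabet, repeat=ny):
                    if hed(xs, ys) > mls(xs, ys) + 1e-12:
                        return False
    return True

print(hed(["a", "b"], ["b"]), mls(["a", "b"], ["b"]), bound_holds())
```

For the label sets $\{a,b\}$ and $\{b\}$, the greedy matching charges only 0.5 while the optimal bijection must pay 1 for deleting the unmatched edge, which is the gap the lemma exploits.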

Appendix B Examples of computing LED, HED and BED

In this section, we give an example of calculating three GED lower bounds, LED, HED and BED.

Figure 2: Graphs $G$ (left) and $Q$ (right).

Figure 2 shows two graphs $G$ and $Q$ , where “A”, “B” and “C” denote vertex labels, and “a” and “b” denote edge labels. Consider the cost function $c$ satisfying: (i) the cost of each vertex edit operation is 2, that is, $c(u\to v)=2$ when two vertices $u\in V_{G}^{\varepsilon}$ and $v\in V_{Q}^{\varepsilon}$ have different labels, and $c(u\to v)=0$ otherwise; (ii) the cost of each edge edit operation is 1, that is, $c(e_{1}\to e_{2})=1$ when two edges $e_{1}\in E_{G}^{\varepsilon}$ and $e_{2}\in E_{Q}^{\varepsilon}$ have different labels, and $c(e_{1}\to e_{2})=0$ otherwise. Based upon this cost function $c$ , we discuss how to compute $\mathit{LED}(G,Q)$ , $\mathit{HED}(G,Q)$ and $\mathit{BED}(G,Q)$ below using the examples shown in Figure 2.

(1) Computing $\mathit{LED}(G,Q)$

In $\mathit{LED}(G,Q)$ (see Definition 2 in main text), we need to compute the minimum substitution cost of vertices and edges of $G$ and $Q$, i.e., $\lambda_{V}(G,Q)$ and $\lambda_{E}(G,Q)$. For $\lambda_{V}(G,Q)=\min\nolimits_{\phi:V_{G}^{\varepsilon}\to V_{Q}^{\varepsilon}}\sum\nolimits_{u\in V_{G}^{\varepsilon}}c(u\to\phi(u))$, we seek a bijection $\phi$ from $V_{G}^{\varepsilon}$ to $V_{Q}^{\varepsilon}$ that minimizes the linear sum $\lambda_{V}(G,Q)$; this is the well-studied linear sum assignment problem (LSAP), which can be solved by the Hungarian algorithm [20] through the following two steps:

(1)

Construct the vertex substitution cost matrix $W^{V}$ , such that $W^{V}_{u,v}=c(u\to v)$ is the cost of substituting vertices $u\in V_{G}$ and $v\in V_{Q}$ ; $W^{V}_{u,\varepsilon}=c(u\to\varepsilon)$ is the cost of deleting $u$ ; and $W^{V}_{\varepsilon,v}=c(\varepsilon\to v)$ is the cost of inserting $v$ . In this example, we compute $W^{V}$ as
(2)

Find the optimal assignment $\phi_{\min}$ that minimizes the linear sum on $W^{V}$. In this example, we find that $\phi_{\min}=\{(u_{1}\to v_{1}),(u_{2}\to v_{2}),(u_{3}\to v_{3}),(u_{4}\to v_{4}),(\varepsilon\to\varepsilon)\}$ is the optimal assignment, and then obtain $\lambda_{V}(G,Q)=W^{V}_{u_{1},v_{1}}+W^{V}_{u_{2},v_{2}}+W^{V}_{u_{3},v_{3}}+W^{V}_{u_{4},v_{4}}+W^{V}_{\varepsilon,\varepsilon}=2$.

Similar to the above process, we can compute the edge substitution cost matrix $W^{E}$ as follows:

With the Hungarian algorithm, we know that the optimal assignment on $W^{E}$ is $\varphi_{\min}=\{(e(u_{1},u_{2})\to\varepsilon),(e(u_{1},u_{3})\to e(v_{1},v_{4})),(e(u_{2},u_{4})\to e(v_{2},v_{4})),(e(u_{3},u_{4})\to e(v_{3},v_{4})),(\varepsilon\to\varepsilon)\}$. Then, $\lambda_{E}(G,Q)=W^{E}_{e(u_{1},u_{2}),\varepsilon}+W^{E}_{e(u_{1},u_{3}),e(v_{1},v_{4})}+W^{E}_{e(u_{2},u_{4}),e(v_{2},v_{4})}+W^{E}_{e(u_{3},u_{4}),e(v_{3},v_{4})}+W^{E}_{\varepsilon,\varepsilon}=2$. Combining $\lambda_{V}(G,Q)$ and $\lambda_{E}(G,Q)$, we have $\mathit{LED}(G,Q)=\lambda_{V}(G,Q)+\lambda_{E}(G,Q)=2+2=4$.
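The two-step LED computation can be sketched in Python as follows. The label lists are hypothetical and do not correspond to the graphs of Figure 2; the cost function follows the one stated for this appendix (2 for a vertex operation, 1 for an edge operation, 0 for equal labels). For brevity the sketch enumerates all bijections instead of running the Hungarian algorithm, and it pads each side with as many $\varepsilon$ entries as the other side has elements, which is an assumption about the padding.

```python
from itertools import permutations

EPS = None  # stands for the dummy element

def c_v(a, b):
    """Vertex cost from the example: 2 if the labels differ, 0 otherwise."""
    return 0.0 if a == b else 2.0

def c_e(a, b):
    """Edge cost from the example: 1 if the labels differ, 0 otherwise."""
    return 0.0 if a == b else 1.0

def lsap(xs, ys, cost):
    """Step 1: build the padded cost matrix. Step 2: minimize the linear sum.

    Brute force over bijections is a stand-in for the Hungarian algorithm and is
    only practical for very small inputs.
    """
    rows = list(xs) + [EPS] * len(ys)
    cols = list(ys) + [EPS] * len(xs)
    w = [[cost(r, col) for col in cols] for r in rows]
    return min(sum(w[i][j] for i, j in enumerate(p))
               for p in permutations(range(len(rows))))

def led(g_vertices, g_edges, q_vertices, q_edges):
    """LED = lambda_V + lambda_E, computed on label multisets only."""
    return lsap(g_vertices, q_vertices, c_v) + lsap(g_edges, q_edges, c_e)

# Hypothetical label multisets of two small graphs.
print(led(["A", "B", "C"], ["a", "b", "b"], ["A", "B", "B"], ["a", "b"]))
```

In this hypothetical case one vertex label must be substituted (cost 2) and one edge must be deleted (cost 1), so the sketch prints 3.0. Note that only label multisets enter the computation, which is why LED ignores the graph structure.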

(2) Computing $\mathit{HED}(G,Q)$

According to the definition of $\mathit{HED}(G,Q)$ (see Definition 3 in main text), we need to calculate the Hausdorff matching cost $f_{H}(u,v)$ between two vertices $u\in V_{G}^{\varepsilon}$ and $v\in V_{Q}^{\varepsilon}$, and then perform a bidirectional matching between $G$ and $Q$. When performing a matching from $G$ to $Q$, we greedily seek the minimum matching cost $\min_{v\in V_{Q}^{\varepsilon}}f_{H}(u,v)$ of each vertex $u$; then, the sum of these minimum costs is the matching cost from $G$ to $Q$, i.e., $\sum_{u\in V_{G}}\min_{v\in V_{Q}^{\varepsilon}}f_{H}(u,v)$. Similarly, the matching cost from $Q$ to $G$ is $\sum_{v\in V_{Q}}\min_{u\in V_{G}^{\varepsilon}}f_{H}(u,v)$. Finally, the sum of the above two matching costs is $\mathit{HED}(G,Q)$. We can summarize the computation process of $\mathit{HED}(G,Q)$ in two steps:

(1)

Construct the Hausdorff matching cost matrix $W^{H}$, such that $W^{H}_{u,v}=f_{H}(u,v)$ is the Hausdorff cost of matching vertex $u\in V_{G}$ to vertex $v\in V_{Q}$; $W^{H}_{u,\varepsilon}=f_{H}(u,\varepsilon)$ is the Hausdorff cost of deleting $u$; and $W^{H}_{\varepsilon,v}=f_{H}(\varepsilon,v)$ is the Hausdorff cost of inserting $v$, where $f_{H}(\cdot,\cdot)$ is defined in (2) in main text. In this example, we can compute $W^{H}$ as
(2)

Based upon $W^{H}$, compute $\sum_{u\in V_{G}}\min_{v\in V_{Q}^{\varepsilon}}W^{H}_{u,v}$ and $\sum_{v\in V_{Q}}\min_{u\in V_{G}^{\varepsilon}}W^{H}_{u,v}$. In this example, we obtain $\mathit{HED}(G,Q)=\sum_{u\in V_{G}}\min_{v\in V_{Q}^{\varepsilon}}W^{H}_{u,v}+\sum_{v\in V_{Q}}\min_{u\in V_{G}^{\varepsilon}}W^{H}_{u,v}=(W^{H}_{u_{1},v_{1}}+W^{H}_{u_{2},v_{1}}+W^{H}_{u_{3},v_{1}}+W^{H}_{u_{4},v_{4}})+(W^{H}_{u_{2},v_{1}}+W^{H}_{u_{2},v_{2}}+W^{H}_{u_{3},v_{2}}+W^{H}_{u_{4},v_{4}})=(1.375+0.125+0.125+0)+(0.125+0.125+0.125+0)=2$.

Note that when calculating $W^{H}_{u,v}$ (i.e., $f_{H}(u,v)$), we need to calculate $\mathit{HED}(N_{u},N_{v})$ (see Equation (3) in main text), where $N_{u}$ and $N_{v}$ are the sets of edges adjacent to $u$ and $v$, respectively. The computation of $\mathit{HED}(N_{u},N_{v})$ is similar to the above process of computing $\mathit{HED}(G,Q)$; thus, we omit the detailed calculation here.
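The nested Hausdorff matching (vertices on the outside, adjacent edges on the inside) can be sketched as follows. The graphs are hypothetical and are not those of Figure 2: each vertex is represented as a pair of its label and the list of labels of its adjacent edges. The vertex-level costs follow (13) and Case I of Appendix A, and the edge-level costs follow cases (b) and (d) there; the cost function is the one stated for this appendix.

```python
EPS = None  # stands for the dummy vertex or edge

def c_v(a, b):
    """Vertex cost from the example: 2 if the labels differ, 0 otherwise."""
    return 0.0 if a == b else 2.0

def c_e(a, b):
    """Edge cost from the example: 1 if the labels differ, 0 otherwise."""
    return 0.0 if a == b else 1.0

def hausdorff_sum(xs, ys, f):
    """Greedy matching in both directions; every element may also match the dummy."""
    left = sum(min(f(x, y) for y in list(ys) + [EPS]) for x in xs)
    right = sum(min(f(x, y) for x in list(xs) + [EPS]) for y in ys)
    return left + right

def f_h_edge(e1, e2):
    """Half the substitution cost, or the full cost of deleting/inserting an edge."""
    if e1 is EPS or e2 is EPS:
        return c_e(e1, e2)
    return 0.5 * c_e(e1, e2)

def f_h_vertex(u, v):
    """Hausdorff matching cost between two (label, adjacent edge labels) pairs."""
    if u is EPS and v is EPS:
        return 0.0
    if v is EPS:   # deleting u together with half of each adjacent edge
        return c_v(u[0], EPS) + 0.5 * sum(c_e(e, EPS) for e in u[1])
    if u is EPS:   # inserting v together with half of each adjacent edge
        return c_v(EPS, v[0]) + 0.5 * sum(c_e(EPS, e) for e in v[1])
    return 0.5 * (c_v(u[0], v[0]) + 0.5 * hausdorff_sum(u[1], v[1], f_h_edge))

def hed(G, Q):
    return hausdorff_sum(G, Q, f_h_vertex)

# Hypothetical graphs: G has edges A-B (a) and B-C (b); Q has edges A-C (a) and B-C (b).
G = [("A", ["a"]), ("B", ["a", "b"]), ("C", ["b"])]
Q = [("A", ["a"]), ("B", ["b"]), ("C", ["a", "b"])]
print(hed(G, Q))
```

For these two graphs the sketch prints 0.5: each direction contributes $0+0.125+0.125$, where the value 0.125 arises from one half of one half of an edge-level Hausdorff cost of 0.5.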

(3) Computing $\mathit{BED}(G,Q)$

The process of calculating $\mathit{BED}(G,Q)$ is similar to that of calculating $\lambda_{V}(G,Q)$: we look for a bijection $\rho$ from $V_{G}^{\varepsilon}$ to $V_{Q}^{\varepsilon}$ that minimizes the linear sum $\sum_{u\in V_{G}^{\varepsilon}}f_{B}(u,\rho(u))$. The computation contains two steps:

(1)

Construct the branch matching cost matrix $W^{B}$ , such that $W^{B}_{u,v}=f_{B}(u,v)$ is the branch cost of matching vertex $u\in V_{G}$ to vertex $v\in V_{Q}$ ; $W^{B}_{u,\varepsilon}=f_{B}(u,\varepsilon)$ is the branch cost of deleting $u$ ; and $W^{B}_{\varepsilon,v}=f_{B}(\varepsilon,v)$ is the branch cost of inserting $v$ , where $f_{B}(\cdot,\cdot)$ is defined in (5) in main text. In this example, we can compute $W^{B}$ as
(2)

Find the optimal assignment $\rho_{\min}$ that minimizes the linear sum on $W^{B}$. In this example, we find that $\rho_{\min}=\{(u_{1}\to v_{1}),(u_{2}\to v_{2}),(u_{3}\to v_{3}),(u_{4}\to v_{4}),(\varepsilon\to\varepsilon)\}$ is the optimal assignment, and then obtain $\mathit{BED}(G,Q)=W^{B}_{u_{1},v_{1}}+W^{B}_{u_{2},v_{2}}+W^{B}_{u_{3},v_{3}}+W^{B}_{u_{4},v_{4}}+W^{B}_{\varepsilon,\varepsilon}=4.5$.

Note that when calculating $W^{B}_{u,v}$ (i.e., $f_{B}(u,v)$), we need to calculate the minimum edge substitution cost between $N_{u}$ and $N_{v}$, which is similar to the process of calculating $\lambda_{E}(\cdot,\cdot)$; thus, we omit the detailed computation here.
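A corresponding sketch for BED solves one assignment problem over branch costs, each of which contains an inner assignment problem over adjacent edges, following (12). As before, the graphs are hypothetical (not those of Figure 2), vertices are pairs of a label and a list of adjacent edge labels, and brute-force enumeration stands in for the Hungarian algorithm; the padding scheme is an assumption of this sketch.

```python
from itertools import permutations

EPS = None  # stands for the dummy vertex or edge

def c_v(a, b):
    """Vertex cost from the example: 2 if the labels differ, 0 otherwise."""
    return 0.0 if a == b else 2.0

def c_e(a, b):
    """Edge cost from the example: 1 if the labels differ, 0 otherwise."""
    return 0.0 if a == b else 1.0

def lsap(xs, ys, cost):
    """Minimum linear sum over bijections of the padded sets (brute force)."""
    rows = list(xs) + [EPS] * len(ys)
    cols = list(ys) + [EPS] * len(xs)
    w = [[cost(r, col) for col in cols] for r in rows]
    return min(sum(w[i][j] for i, j in enumerate(p))
               for p in permutations(range(len(rows))))

def f_b(u, v):
    """Branch cost: vertex cost plus half the minimum edge substitution cost."""
    if u is EPS and v is EPS:
        return 0.0
    if v is EPS:
        return c_v(u[0], EPS) + 0.5 * sum(c_e(e, EPS) for e in u[1])
    if u is EPS:
        return c_v(EPS, v[0]) + 0.5 * sum(c_e(EPS, e) for e in v[1])
    return c_v(u[0], v[0]) + 0.5 * lsap(u[1], v[1], c_e)

def bed(G, Q):
    return lsap(G, Q, f_b)

# Hypothetical graphs: G has edges A-B (a) and B-C (b); Q has edges A-C (a) and B-C (b).
G = [("A", ["a"]), ("B", ["a", "b"]), ("C", ["b"])]
Q = [("A", ["a"]), ("B", ["b"]), ("C", ["a", "b"])]
print(bed(G, Q))
```

The sketch prints 1.0 for this pair. The label multisets of the two graphs coincide, so LED would be 0, and the greedy Hausdorff matching under the same costs gives 0.5, while mapping equally labeled vertices to each other costs 2 (one edge deletion and one insertion); the values are therefore ordered as the tightness analysis predicts.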

For graphs $G$ and $Q$ in Figure 2, we finally obtain that $\mathit{LED}(G,Q)=4$ , $\mathit{HED}(G,Q)=2$ and $\mathit{BED}(G,Q)=4.5$ . Clearly, $\mathit{BED}(G,Q)\geq\mathit{LED}(G,Q)$ and $\mathit{BED}(G,Q)\geq\mathit{HED}(G,Q)$ .

Appendix C Successor generation

We discuss how to generate successors of each node in the GED search tree with Algorithm 2.

Consider an inner node $r=\{(u_{1}\to v_{j_{1}}),\dots,(u_{\ell}\to v_{j_{\ell}})\}$, where $v_{j_{k}}$ is the mapped vertex of $u_{k}$ in the GED search tree, for $1\leq k\leq\ell$. BasicGenSuccr generates all the possible successors of $r$. First, we compute the sets of unmapped vertices in $G$ and $Q$, respectively, i.e., $V_{G}^{r}=V_{G}\backslash\{u_{1},\dots,u_{\ell}\}$ and $V_{Q}^{r}=V_{Q}\backslash\{v_{j_{1}},\dots,v_{j_{\ell}}\}$. If $|V_{G}^{r}|>0$, then for each vertex $z\in V_{Q}^{r}\cup\{\varepsilon\}$ we take $z$ as the mapped vertex of $u_{\ell+1}$ and obtain a successor $\mathit{child}$ of $r$ such that $\mathit{child}=r\cup\{(u_{\ell+1}\to z)\}$. Otherwise, all the vertices of $G$ have been processed, and we obtain a single leaf node $s=r\cup\bigcup_{z\in V_{Q}^{r}}\{(\varepsilon\to z)\}$.

Algorithm 2 BasicGenSuccr($r$).
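The successor generation described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the pseudocode of [9]: the function name `basic_gen_succ`, the list-of-pairs representation of a node, and the use of `None` for $\varepsilon$ are assumptions of this example, and vertices of $G$ are processed in the order in which they appear in `V_G`.

```python
EPS = None  # stands for the dummy vertex

def basic_gen_succ(r, V_G, V_Q):
    """Return all successors of node r, a list of (u, v) pairs covering V_G[:len(r)]."""
    mapped_q = {v for _, v in r if v is not EPS}
    free_q = [v for v in V_Q if v not in mapped_q]      # unmapped vertices of Q
    if len(r) < len(V_G):                               # some vertex of G is unmapped
        u_next = V_G[len(r)]
        # One successor per unmapped vertex of Q, plus one for deleting u_next.
        return [r + [(u_next, z)] for z in free_q + [EPS]]
    # All vertices of G are processed: insert the remaining vertices of Q.
    return [r + [(EPS, z) for z in free_q]]

V_G, V_Q = ["u1", "u2"], ["v1"]
print(basic_gen_succ([], V_G, V_Q))
print(basic_gen_succ([("u1", None), ("u2", None)], V_G, V_Q))
```

The first call yields two successors (map `u1` to `v1`, or delete `u1`); the second call yields the single leaf that inserts the remaining vertex `v1`.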

