An Efficient Heuristic for Graph Edit Distance
Abstract
The graph edit distance (GED) is a flexible distance measure widely used in many applications. Existing GED computation methods are usually based upon the tree-based search algorithm that explores all possible vertex (or edge) mappings between two compared graphs. During this process, various GED lower bounds are adopted as heuristic estimations to accelerate the tree-based search algorithm. For the first time, we analyze the relationship among three state-of-the-art GED lower bounds, label edit distance (LED), Hausdorff edit distance (HED), and branch edit distance (BED). Specifically, we demonstrate that and for any two undirected graphs and . Furthermore, for BED we propose an efficient heuristic BED+ for improving the tree-based search algorithm. Extensive experiments on real and synthetic datasets confirm that BED+ achieves smaller deviation and larger solvable ratios than LED, HED and BED when they are employed as heuristic estimations. The source code is available online.
Keywords and phrases:
Graph edit distance, Label edit distance, Hausdorff edit distance, Branch edit distance, Tree-based search, HeuristicsCategory:
ResearchCopyright and License:
2012 ACM Subject Classification:
Information systems Query optimizationSupplementary Material:
Funding:
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62272358.Editors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott VitterSeries and Publisher:
Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Graphs are frequently used to represent a wide variety of various objects, such as networks, maps, handwriting, molecular compounds, and protein structures. The process of evaluating the similarity of two graphs is referred to as error-tolerant graph matching, aiming to find a correspondence between their vertices. In this paper, we focus upon the similarity measure graph edit distance (GED) because it can be applied to all types of graphs and can precisely capture structural differences between the compared graphs. The GED of two graphs is defined as the minimum cost of transforming one graph into another through a sequence of edit operations (inserting, deleting and substituting vertices or edges). An edit cost is assigned to each edit operation to measure its strength, which can be obtained by combining specific knowledge of the domain or learning from a set of sample graphs [14].
However, computing the GED is an NP-hard problem [30] and usually based upon the tree-based search algorithm. This search tree enumerates all possible mappings between vertices (or edges) of two compared graphs, where the inner nodes denote partial mappings and the leaf nodes denote complete mappings. Most existing GED computation methods employ different search paradigms to traverse this search tree to seek for the optimal mapping that induces the GED. Riesen et al. [25, 26] proposed the standard method, A⋆-GED, based upon the best-first search paradigm. It needs to store numerous inner nodes, resulting in high memory consumption. To overcome this bottleneck, Abu-Aisheh et al. [3] proposed a depth-first search based algorithm, DF-GED, whose memory requirement increases linearly with the number of vertices of graphs. On the other hand, Chen et al. [9] introduced a method for the GED computation based upon beam-stack search [28], achieving a flexible tradeoff between memory consumption and the time overhead of backtracking in the depth-first search. Chang et al. [8] developed a unified framework that can be instantiated into either a best-first search approach or a depth-first search approach. Gouda et al. [17] proposed a novel edge-mapping based approach, CSI_GED, and also employed the depth-first search paradigm. CSI_GED works only for the uniform cost model, and Blumenthal et al. [4, 6] generalized it to cover the non-uniform cost model. Kim [19] developed an efficient GED computation algorithm using isomorphic vertices [9]. Liu et al. [22] explored a learning-based method for the approximate GED computation. Piao et al. [23] propose a deep learning method for the GED computation. It is worth mentioning that many researchers have proposed various indexing techniques [31, 8, 10] to accelerate graph similarity searches under the GED metric. They use the above GED computation methods as the final phase to verify the candidate graphs that satisfy the GED constraint.
In the tree-based search algorithm, a heuristic estimation is usually adopted to prune the useless search space to accelerate the search process. In order to ensure that the optimal mapping is not erroneously pruned, this heuristic function must be admissible; namely, it estimates the cost of a tree node that is less than or equal to the real cost. In the previous works A⋆-GED and DF-GED, they adopted the label edit distance (LED) as the heuristic, which calculates the minimum substitution cost of vertices and edges of two compared graphs. After that, Fischer et al. [15, 16] proposed the Hausdorff edit distance (HED) as a heuristic estimation. HED, based upon Hausdorff matching [29], performs a bidirectional matching between two graphs and allows multiple assignments between their vertices. Recently, Blumenthal et al. [5] proposed another effective GED lower bound, branch edit distance (BED), which also can be adopted as a heuristic estimation.
As observed in other studies [7, 15, 27], the higher the heuristic estimates the cost, the better the tree-based search algorithm performs. The following question naturally arises: Which of these three state-of-the-art GED lower bounds (namely, LED, HED, or BED) is more effective? In this paper, we first analyze the relationship among these three lower bounds and then propose an effective heuristic estimation. Our contributions are summarized as follows:
-
(1)
We analyze the relationship among LED, HED and BED for the first time, and we derive that and for any two undirected graphs and .
-
(2)
We propose an efficient heuristic estimation BED+ based upon BED, and demonstrate that BED+ is still admissible.
-
(3)
We conduct extensive experiments to confirm BED+’s effectiveness on the real and synthetic datasets. The source code is available online [11].
The rest of this paper is organized as follows. In Section 2, we give the definition of the graph edit distance and revisit three state-of-the-art GED lower bounds. In Section 3, we theoretically analyze the relationship between LED, HED and BED. In Section 4, we propose the heuristic function BED+ for improving the GED computation. In Section 5, we report the experimental results. Finally, we summarize this paper in Section 6.
2 Graph edit distance
In this paper, we consider undirected, labeled graphs without multi-edges or self-loops. A labeled graph is a triplet , where is the set of vertices, is the set of edges, is a labeling function that assigns a label to a vertex or an edge, and is a set of labels. Also, we use a special symbol to denote a dummy vertex or a dummy edge.
Given two graphs and , six edit operations [18, 25, 21, 5] can be used to transform to (or vice versa): inserting or deleting a vertex or an edge, and substituting the label of a vertex or an edge. We denote the label substitution (or simply substitution) of vertices and by , the deletion of by , and the insertion of by . For the three edit operations on edges, we use similar notation.
An edit path between and is a sequence of edit operations that transforms to , such as , where graph is obtained by performing the edit operation on graph , for . During this transformation, each edit operation is assigned a penalty cost to reflect whether it can strongly change a graph. Note that the cost of editing two dummy vertices (or edges) is 0; that is, . Thus, ’s edit cost is defined as . We define the graph edit distance as follows:
Definition 1.
Given two graphs and , the graph edit distance between them, denoted by , is defined as the minimum cost of transforming to , namely,
| (1) |
where is the set of all edit paths between and , and is edit operation ’s cost.
Hereafter, for ease of presentation, we denote and as the expanded sets of and , respectively, so that and have the same number of vertices. Similarly, and denote the expanded sets of and , respectively.
2.1 State-of-the-art GED lower bounds
Below we introduce three state-of-the-art GED lower bounds, which can be used as heuristic estimations in the tree-based search algorithm to compute GED. Each of the methods gives a lower bound on GED because the operations are done in sets that do not have to be consistent with one another. For example, in the first method LED described below, the edit operations on the vertex labels can be done independently of the edit operations on the edge labels, and thus they may not be globally consistent.
Label Edit Distance.
Riesen et al. [27, 26] proposed the label edit distance (LED), which is the minimum cost of substituting vertices and edges of two graphs.
Definition 2 (Label edit distance).
Given two graphs and , the label edit distance between them is defined as , where is the minimum cost of substituting vertices of and , and is a bijection from to ; and is the minimum cost of substituting edges of and , and is a bijection from to .
Hausdorff Edit Distance.
Inspired by the Hausdorff distance [29] between two finite sets, Fischer et al. [16] proposed the Hausdorff edit distance (HED) between two graphs and . The key ideas of HED are to perform a bidirectional matching between and and to allow multiple assignments between their vertices.
Definition 3 (Hausdorff edit distance).
Given two graphs and , their Hausdorff edit distance is defined as , where is the Hausdorff cost of matching vertex to vertex .
The Hausdorff vertex matching cost considers not only the two vertices and but also their neighboring edges.
Definition 4 (Neighboring edges).
Given graph and a vertex , the neighboring edges of are defined as .
We define as
| (2) |
Similarly to Definition 3, the Hausdorff edit distance between and is defined as
| (3) |
where is the cost of matching two edges such that
| (4) |
Branch Edit Distance.
Blumenthal et al. [5] recently proposed the branch edit distance (BED), which computes the minimum cost of editing branch structures of two graphs.
Definition 5 (Branch structure).
The branch structure of vertex in graph is defined as , where is the set of neighboring edges of .
Given two branch structures and , the minimum cost of editing into is defined as
| (5) |
where and are expanded sets of and , respectively, and is a bijection from to .
Definition 6 (Branch edit distance).
Given two graphs and , the branch edit distance between them is defined as , where is a bijection from to , and is defined in (5).
3 Tightness analysis
In this section, we analyze the tightness of the three GED lower bounds: LED, HED and BED. Specifically, we will prove that BED is the strongest of all; that is, for any two undirected graphs and , we have and .
3.1 Relation of LED and BED
Theorem 7.
Given two graphs and , we have .
Proof.
For ease of proof, we insert dummy vertices and edges into to make it become a complete graph with vertices. Similarly, we transform into a complete graph that also has vertices. Then, we can simplify (5) as
where is the bijection from to for which achieves the minimum value. Thus, we have
where is the bijection from to for which achieves the minimum value, and is the bijection from to satisfying for .
3.2 Relation of HED and BED
Theorem 9.
Given two graphs and , we have .
Proof.
By ’s definition in (2), we know that when , ; and similarly when , . We can rewrite as
| (6) | |||||
where is a mapping from to satisfying ,; is a mapping from to satisfying ; and is the bijection from to for which achieves the minimum value. We know that
| (7) |
By (6) and (7), we can complete the proof by showing that
We do so by considering the following four exhaustive cases:
- Case I.
-
When and , then , by the definitions of , , and .
- Case II.
-
When and , then , by the definitions of , , and . By Lemma 8, we know that .
- Case III.
-
When and , the analysis is similar to that of Case II.
- Case IV.
-
When and , then we have and , by the definitions of , , and . By Lemma 8, we know that . Thus, we have .
4 Tree-based search algorithm
The previous section showed that BED achieves the tightest GED lower bound. In this section, based upon BED we propose an efficient heuristic estimation to improve the tree-based search algorithm [2, 3] for the GED computation.
4.1 Search tree
Computing the GED of graphs and is typically based upon a tree-based search procedure that explores all possible graph mappings from to . Starting from a dummy node, root = , we logically create the search tree layer by layer by iteratively generating successors using BasicGenSuccr [9]. This search space can be organized as an ordered search tree, where the inner nodes denote partial graph mappings and the leaf nodes denote complete graph mappings. Such a search tree is created dynamically at runtime by iteratively generating successors linked by edges to the currently considered node. For more details, please refer to Section 2 in the reference [9].
4.2 Heuristic cost estimation
For a node in the search tree, let be the estimated cost from to its descendant leaf node that is less than or equal to the real cost. Based upon BED, we introduce how to estimate in the tree-based search algorithm.
4.2.1 Heuristic function
Consider an inner node , where is ’s mapped vertex, for . We divide into two subgraphs and , where is the mapped part of such that and , and is the unmapped part such that and . We obtain and similarly.
Clearly, the lower bound can be used to estimate ’s cost. However, has not covered the potential edit cost on the edges between (resp., ) and (resp., ). Recently, [8, 9] proposed two different methods to cover this potential cost; nevertheless, these two methods only worked for the uniform cost function (i.e., for which the cost of each edit operation is 1). We expand the method in [8] to support for any cost function.
Definition 10.
Given vertices and , we define the cost of matching to as , where is the mapped vertex of the already processed vertex , and is the minimum cost of transforming to , which we defined in (5).
When there is no edge between and , we set , and similarly for . Based upon , we define the improved lower bound BED+ as
| (8) |
Theorem 11.
Proof.
We trivially obtain this theorem since for .
Theorem 12.
Given a descendant leaf node of , the heuristic estimation is admissible; that is, , where gives the incurred cost from the root node to the currently considered node.
Proof.
For ease of proof, we insert dummy vertices and edges into to transform it to a complete graph with vertices. Similarly, we transform to a complete graph that also has vertices.
Consider an internal node in the search tree, where is the mapped vertex of , for . For easy presentation, hereafter we use to denote ’s mapped vertex, i.e., . Given a descendent leaf node (i.e., is a complete vertex mapping from to ) of , then the incurred cost of is
| (9) | ||||
As we know, induces an edit path transforming to , where and are the already mapped subgraphs of and , respectively, and and . According to (9), we know that
Let be the partial mapping that contains the vertex mapping pairs belong to but not ; namely, . We can obtain that
where and . The second equality is due to when is partitioned into two disjoint sets and .
We give an example in Appendix B of computing three GED lower bounds: LED, HED and BED. The same optimization that produces BED+ from BED can be applied to LED and HED to achieve enhanced heuristics LED+ and HED+, but we do not include them in this paper.
4.3 Algorithm
In this section, we show how to incorporate the heuristic estimation into the anytime-based GED computation algorithm [2]. The reason we consider the anytime-based algorithm is that it is flexible and can control the algorithm to output tighter and tighter GED upper bounds until the exact GED value by setting more and more running time.
Algorithm 1 gives the anytime-based algorithm for computing the GED, where is the user-defined maximum running time. We perform a depth-first search over the GED search tree of and to find better and better GED upper bounds until the running time reaches . To accomplish this, we first employ the BP [25] algorithm to fast compute an initial GED upper bound ; then, we adopt a stack to finish the depth-first search. Each time we pop a node from . If is a leaf node, then we find a better solution. Otherwise, we call procedure BasicGenSuccr [9] (see Appendix C) to generate ’s successors and then insert them into . During this process, we adopt the branch-and-bound strategy to prune the useless space: for each successor , if , we can safely prune it, where is defined in (8).
5 Experiments
5.1 Datasets and settings
Datasets.
We chose four real (GREC, MUTA, PRO, and CMU) and one synthetic (SYN) datasets in the experiments. The datasets GREC, MUTA, and PRO were taken from the IAM Graph Database Repository [24]; the CMU dataset could be found at the CMU website [13]; and the SYN dataset was generated by the synthetic graph generator GraphGen [12]. Following the same procedure in [2, 21], we selected some subsets of GREC, MUTA, and PRO as the tested datasets, respectively, where each subset consists of graphs that have the same number of vertices. Specifically, the subsets of CREC contain 5, 10, 15, and 20 vertices, respectively; the subsets of MUTA contain 10, 20, , 70 vertices, respectively; the subsets of PRO contain 20, 30, and 40 vertices, respectively; and each subset consists of 10 graphs.
Table 1 summarizes the characteristic and applied cost function of each dataset. ED and SED are short for Euclidean distance and string edit distance functions, respectively. is the cost of inserting/deleting a vertex; is the cost of inserting/deleting an edge; and are the costs of substituting a vertex and an edge, respectively. In addition, we introduce a parameter to control whether edit operations on vertices or edges are more important.
Settings.
We conducted all the experiments on a HP Z800 PC running the Ubuntu 12.04 LTS operating system and equipped with a 2.67GHz CPU and 24 GB of memory. We implemented the algorithm in C++, using -O3 to compile and run it.
| Dataset | #Graphs | vertex labels | edge labels | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| GREC | 40 | 12.5 | 17.5 | (x, y) coord. | Line type | 90 | 15 | 0.5 | Ext. ED | Dirac |
| MUTA | 70 | 40 | 41.5 | Chem. symbol | Valence | 11 | 1.1 | 0.25 | Dirac | Dirac |
| PRO | 30 | 30 | 58.6 | Type/AA-seq. | Type/length | 11 | 1 | 0.75 | Ext. SED | Dirac |
| CMU | 111 | 30 | 79.1 | None | Distance | 0.5 | 0 | L1 norm | ||
| SYN | 100 | 14.5 | 20 | Symbol | Symbol | 0.3 | 0.5 | 0.75 | Dirac | Dirac |
5.2 Evaluation metrics
We discuss two metrics to evaluate algorithm performance: deviation () [1] and solvable ratio () [9]. The metric measures the deviation generated by an algorithm. Formally, given two graphs and , the deviation of the two graphs can be computed as , where is the (approximate) GED value produced by the algorithm, and is the best GED value produced in all the experiments done on the graph database repository in [1]. Based upon the pairwise comparison model, the deviation on the dataset can be computed as
| (10) |
The metric measures how often the exact GED value is obtained when reaching the maximum running time threshold . Formally, let indicate whether an algorithm outputs the exact GED of and within time; in other words, if the algorithm requires less than time to output the GED, ; otherwise, . The solvable ratio () on the dataset can be computed as
| (11) |
Obviously, a smaller and a larger reflects a better performance of an algorithm.
5.3 Experimental results
As described earlier in this paper, we first analyzed the relation of three GED lower bounds (i.e., LED, HED and BED). Then based upon BED we proposed an efficient heuristic estimation BED+. Thus, it is necessary to evaluate the contribution of these lower bounds to the GED computation.
5.3.1 Tightness of LED, HED and BED
In this section, we evaluate the tightness of three GED lower bounds LED, HED and BED as well as their running time. Table 2 shows the obtained results, where the abbreviation “ms” represents milliseconds.
As shown in Table 2, BED achieves the smallest , which means that BED is closest to the exact GED value. This result is consistent with the analysis in Section 3, i.e., and for any two graphs and . We also find in most cases tthat LED performs better than HED; the reason is that HED allows multiple assignments between vertices of and and greedily selects matched vertices with the lowest cost.
We also list the running time of each method in Table 2. It can be seen from this table that HED runs faster than LED and LED runs faster than BED. The reason is that HED runs in quadratic time, while both LED and BED run in cubic time. LED independently considers the cost of substituting vertices and edges and ignores the structures, thus it has a better running time than BED.
| Datasets | LED | HED | BED | |||
|---|---|---|---|---|---|---|
| GREC | 4.41 | 0.38 | 17.45 | 0.29 | 3.54 | 0.52 |
| MUTA | 12.54 | 1.07 | 30.13 | 0.47 | 11.49 | 2.72 |
| PRO | 4.61 | 3.75 | 21.31 | 2.84 | 3.25 | 6.07 |
| CMU | 61.56 | 9.53 | 57.6 | 3.77 | 25.1 | 13.2 |
| SYN | 67.41 | 0.28 | 91.92 | 0.14 | 46.61 | 0.45 |
5.3.2 Effect of heuristic
Observing that BED produces the tighter lower bound than LED and HED, we propose BED+ as a heuristic estimation to improve the GED computation. To achieve the comparison, we adopted LED, HED, BED, and BED+ as the heuristic estimations, respectively, and fixed the running time ms. Table 3 lists the obtained deviation and solvable ration .
| Datasets | LED | HED | BED | BED+ | ||||
|---|---|---|---|---|---|---|---|---|
| GREC | 0.36 | 69.87 | 0.56 | 54.38 | 0.22 | 67.88 | 0.01 | 90 |
| MUTA | 5.56 | 3.47 | 4.85 | 3.27 | 4.49 | 3.51 | 2.6 | 22.33 |
| PRO | 1.27 | 4 | 0.71 | 3.56 | 0.68 | 4.22 | 0.06 | 4.22 |
| CMU | 109.89 | 19.06 | 49.18 | 19.06 | 52.83 | 31.16 | 2.97 | 75.79 |
| SYN | 8.79 | 8.4 | 9.48 | 4.18 | 7.96 | 7.48 | 0.17 | 94.34 |
From Table 3, we know that using BED+ as a heuristic can produce the smallest . This is due to the fact that BED+ produces a higher estimated bound. For the solvable ratio , we also find that BED+ achieves the best performance.
We also varied the running time from ms to ms in order to evaluate the above heuristic estimations under different running times. From Figure 1, as the running time increases, using the above four heuristic estimations we obtain lower and lower deviation as well as higher and higher solvable ratio . Also, we find in most cases that BED+ achieves the best and under both small (e.g., ms) and large (e.g., ms) running times. Compared with the widely used heuristic estimation LED, using BED+ can decrease the by 72.2%, 34.1%, 48.1%, 39.4%, and 52.1% on average on the GREC, MUTA, PRO, CMU, and SYN datasets, respectively. Using BED+ can increase the by 54.3%, 293.2%, 3.4%, 113.1%, and 702.8% on average on the five above datasets, respectively. Thus, we conclude that using BED+ as a heuristic estimation can greatly improve the GED computation.
6 Conclusion and future works
In this paper, we analyze the relationship among three state-of-the-art GED lower bounds that are widely used as heuristic estimations in the tree-based search algorithm for the GED computation. Specifically, we demonstrate that and for any two undirected graphs and . Furthermore, based upon BED we propose an efficient heuristic estimation BED+ and demonstrate that BED+ still estimates a cost that is not greater than the real cost. Experimental results on four real and one synthetic datasets confirm that BED+ can achieve the best performance under both small and large running time.
When calculating the heuristic estimation , we first compute the transformation cost (i.e., ) of two compared branch structures. In fact, the transformation cost of these two branch structures may have been calculated many times in the previous traversal of the search tree. Future work will consider how to build a suitable index structure to maintain the transformation cost of these traversed branch structures in order to accelerate the computation of .
References
- [1] Z. Abu-Aisheh, R. Raveaux, and J. Y. Ramel. A graph database repository and performance evaluation metrics for graph edit distance. In GbRPR, pages 138–147, 2015.
- [2] Z. Abu-Aisheh, R. Raveaux, and J. Y. Ramel. Anytime graph matching. Pattern Recogn Lett., 84:215–224, 2016. doi:10.1016/J.PATREC.2016.10.004.
- [3] Z. Abu-Aisheh, R. Raveaux, J. Y. Ramel, and P. Martineau. An exact graph edit distance algorithm for solving pattern recognition problems. In ICPRAM, pages 271–278, 2015.
- [4] D. B. Blumenthal and J. Gamper. Exact computation of graph edit distance for uniform and non-uniform metric edit costs. In GbRPR, pages 211–221, 2017.
- [5] D. B. Blumenthal and J. Gamper. Improved lower bounds for graph edit distance. IEEE Trans. Knowl Data Eng., 30(3):503–516, 2018. doi:10.1109/TKDE.2017.2772243.
- [6] D. B. Blumenthal and J. Gamper. On the exact computation of the graph edit distance. Pattern Recogn Lett., 134:46–57, 2020. doi:10.1016/J.PATREC.2018.05.002.
- [7] B. Bonet and H. Geffner. Planning as heuristic search. Artif. Intell., 129(1-2):5–33, 2001. doi:10.1016/S0004-3702(01)00108-4.
- [8] L. Chang, X. Feng, X. Lin, L. Qin, and W. Zhang. Efficient graph edit distance computation and verification via anchor-aware lower bound estimation. CoRR, 2017. arXiv:1709.06810.
- [9] X. Chen, H. Huo, J. Huan, and J. S. Vitter. An efficient algorithm for graph edit distance computation. Knowl.-Based Syst., 163:762–775, 2019. doi:10.1016/J.KNOSYS.2018.10.002.
- [10] X. Chen, H. Huo, J. Huan, J. S. Vitter, W. Zheng, and L. Zou. MSQ-Index: A succinct index for fast graph similarity search. IEEE Trans. Knowl Data Eng., 33(6):2654–2668, 2021. doi:10.1109/TKDE.2019.2954527.
- [11] X. Chen, Y. Wang, H. Huo, and J. S. Vitter. An efficient heuristic for graph edit distance [source code], June 2019. URL: https://github.com/Hongweihuo-Lab/Heur-GED.
- [12] James Cheng, Yiping Ke, and Wilfred Ng. GraphGen — a synthetic graph data generator. URL: https://cse.hkust.edu.hk/graphgen/.
- [13] CMU house and hotel datasets. URL: https://github.com/dbblumenthal/gedlib/blob/master/data/datasets/CMU-GED.
- [14] X. Cortés and F. Serratosa. Learning graph-matching edit-costs based on the optimality of the oracle’s node correspondences. Pattern Recogn Lett., 56:22–29, 2015. doi:10.1016/J.PATREC.2015.01.009.
- [15] A. Fischer, R. Plamondon, Y. Savaria, K. Riesen, and H. Bunke. A Hausdorff heuristic for efficient computation of graph edit distance. Structural, Syntactic, and Statistical Pattern Recognition, LNCS 8621:83–92, 2014.
- [16] A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke. Approximation of graph edit distance based on Hausdorff matching. Pattern Recogn., 48(2):331–343, 2015. doi:10.1016/J.PATCOG.2014.07.015.
- [17] K. Gouda and M. Hassaan. CSI_GED: An efficient approach for graph edit similarity computation. In ICDE, pages 256–275, 2016.
- [18] D. Justice and A. Hero. A binary linear programming formulation of the graph edit distance. IEEE Trans. Pattern Anal Mach Intell., 28(8):1200–1214, 2006. doi:10.1109/TPAMI.2006.152.
- [19] J. Kim. Efficient graph edit distance computation using isomorphic vertices. Pattern Recogn Lett., 168(2023):71–78, 2023. doi:10.1016/J.PATREC.2023.03.002.
- [20] H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955. doi:10.1002/nav.3800020109.
- [21] J. Lerouge, Z. Abu-Aisheh, R. Raveaux, P. Héroux, and S. Adam. New binary linear programming formulation to compute the graph edit distance. Pattern Recogn., 72:254–265, 2017. doi:10.1016/J.PATCOG.2017.07.029.
- [22] J. Liu, M. Zhou, S. Ma, and L. Pan. MATA*: Combining learnable node matching with A* algorithm for approximate graph edit distance computation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), pages 1503–1512, 2023.
- [23] Chengzhi Piao, Tingyang Xu, Xiangguo Sun, Yu Rong, Kangfei Zhao, and Hong Cheng. Computing graph edit distance via neural graph matching. Proceedings of the VLDB Endowment, 16(8):1817–1829, 2023. doi:10.14778/3594512.3594514.
- [24] K. Riesen and H. Bunke. IAM graph database repository for graph based pattern recognition and machine learning. Structural, Syntactic, and Statistical Pattern Recognition, pages 287–297, 2008.
- [25] K. Riesen and H. Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image Vision Comput., 27(7):950–959, 2009. doi:10.1016/J.IMAVIS.2008.04.004.
- [26] K. Riesen, S. Emmenegger, and H. Bunke. A novel software toolkit for graph edit distance computation. In GbRPR, pages 142–151, 2013.
- [27] K. Riesen, S. Fankhauser, and H. Bunke. Speeding up graph edit distance computation with a bipartite heuristic. In MLG, pages 21–24, 2007.
- [28] S. Russell and P. Norvig. Artificial Intelligence: a Modern Approach (2nd ed.). Prentice-Hall, New Jersey, USA, 2002.
- [29] O Schütze, X. Esquivel, A. Lara, and C. A. C. Carlos. Using the averaged Hausdorff distance as a performance measure in evolutionary multiobjective optimization. IEEE Trans. Evol. Comput., 16(4):504–522, 2012. doi:10.1109/TEVC.2011.2161872.
- [30] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou. Comparing stars: On approximating graph edit distance. PVLDB, 2(1):25–36, 2009. doi:10.14778/1687627.1687631.
- [31] W. Zheng, L. Zou, X. Lian, D. Wang, and D. Zhao. Efficient graph similarity search over large graph databases. IEEE Trans. Knowl Data Eng., 27(4):964–978, 2015. doi:10.1109/TKDE.2014.2349924.
Appendix A Proof of Lemma 8
Proof.
We prove this lemma by considering the following two cases:
- Case I.
-
when or . We first discuss the case . It is trivial to know that . Similarly, when , we also have . Thus, when or , the lemma follows.
- Case II.
-
when and . Then, we know that
(12) (13) where , and is a bijection from to .
In order to prove , it suffices from (12) and (13) to prove
which we do as follows:
-
(i)
Rewriting and :
where is the bijection from to that achieves the minimum value; is ’s mapped edge under the bijection , for ;
where is the mapping from to satisfying , for ; and is the mapping from to satisfying .
-
(ii)
Proving : According to the definition of and , and . We discuss the following cases (a)–(d):
-
(a)
When and , then ;
-
(b)
When and , then ;
-
(c)
When and , the analysis is similar to that of (b);
-
(d)
When and , then .
-
(a)
-
(iii)
Combining both (i) and (ii), we have
Therefore, when and . This completes the proof.
-
(i)
Appendix B Examples of computing LED, HED and BED
In this section, we give an example of calculating three GED lower bounds, LED, HED and BED.
Figure 2 shows two graphs and , where “A”, “B” and “C” denote vertex labels, and “a” and “b” denote edge labels. Consider the cost function satisfying: (i) the cost of each vertex edit operation is 2, that is, when two vertices and have different labels, and otherwise; (ii) the cost of each edge edit operation is 1, that is, when two edges and have different labels, and otherwise. Based upon this cost function , we discuss how to compute , and below using the examples shown in Figure 2.
(1) Computing
In (see Definition 2 in main text), we need to compute the minimum substitution cost of vertices and edges of and , i.e., and . For , we seek for a bijection from to to minimize the linear sum ; this is a well-investigated linear sum assignment problem (LSAP) and can be solved by the Hungarian algorithm [20] through the following two steps:
-
(1)
Construct the vertex substitution cost matrix , such that is the cost of substituting vertices and ; is the cost of deleting ; and is the cost of inserting . In this example, we compute as
-
(2)
Find the optimal assignment that minimizes the linear sum on . In this example, we find that is the optimal assignment, and then obtain .
Similar to the above process, we can compute the edge substitution cost matrix as follows:
With the Hungarian algorithm, we know that the optimal assignment on is . Then, . Combing and , we have .
(2) Computing
According to the definition of (see Definition 3 in main text), we need to calculate the hausdorff matching cost between two vertices and , and then perform a bidirectional matching between and . When performing a matching from to , we greedily seek for the minimum matching cost of each vertex ; then, the sum of these minimum costs is the matching cost from to , i.e., . Similarly, the matching cost from to is . Finally, the sum of the above two matching costs is . We can summarize the computation process of as two steps:
-
(1)
Construct the hausdorff matching cost matrix , such that is the hausdorff cost of matching vertex to vertex ; is the hausdorff cost of deleting ; and is the hausdorff cost of inserting , where is defined in (2) in main text. In this example, we can compute as
-
(2)
Based upon , compute and . In this example, we trivially obtain .
Note that when calculating (i.e., ), we need to calculate (see Equation (3) in main context), where and are the sets of edges adjacent to and , respectively. The computation of is similar to the above process of computing ; and thus, we omit the detailed calculation here.
(3) Computing
The process of calculating is similar to that of calculating , which is also looking for a bijection from to to minimize the linear sum . The computation contains two steps:
-
(1)
Construct the branch matching cost matrix , such that is the branch cost of matching vertex to vertex ; is the branch cost of deleting ; and is the branch cost of inserting , where is defined in (5) in main text. In this example, we can compute as
-
(2)
Find the optimal assignment that minimizes the linear sum on . In this example, we find that is the optimal assignment, and then obtain .
Note that when calculating (i.e., ), we need to calculate the minimum edge substitution cost between and , which is similar to the process of calculating ; and thus, we omit the detailed computation here.
For graphs and in Figure 2, we finally obtain that , and . Clearly, and .
Appendix C Successor generation
We discuss how to generate successors of each node in the GED search tree with Algorithm 2.
Consider an inner node , where is the mapped vertex of in the GED search tree, for . BasicGenSuccr generates all the possible successors of . First, we compute the sets of unmapped vertices in and , respectively, i.e., and . If , then we select a vertex from as the mapped vertex of , and consequently, obtain a successor of such that . Otherwise, all the vertices of are processed; trivially, we obtain a leaf node .
