A Declarative Framework for Linking Entities

Authors: Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang-Chiew Tan



Cite As

Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and Wang-Chiew Tan. A Declarative Framework for Linking Entities. In 18th International Conference on Database Theory (ICDT 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 31, pp. 25-43, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015) https://doi.org/10.4230/LIPIcs.ICDT.2015.25

Abstract

The aim of this paper is to introduce and develop a truly declarative framework for entity linking and, in particular, for entity resolution. As in some earlier approaches, our framework is based on the systematic use of constraints. However, the constraints we adopt are link-to-source constraints, unlike in earlier approaches where source-to-link constraints were used to dictate how to generate links. Our approach makes it possible to focus entirely on the intended properties of the outcome of entity linking, thus separating the constraints from any procedure of how to achieve that outcome. The core language consists of link-to-source constraints that specify the desired properties of a link relation in terms of source relations and built-in predicates such as similarity measures. A key feature of the link-to-source constraints is that they employ disjunction, which enables the declarative listing of all the reasons as to why two entities should be linked. We also consider extensions of the core language that capture collective entity resolution, by allowing inter-dependence between links.
We identify a class of "good" solutions for entity linking specifications, which we call maximum-value solutions and which capture the strength of a link by counting the reasons that justify it. We study natural algorithmic problems associated with these solutions, including the problem of enumerating the "good" solutions, and the problem of finding the certain links, which are the links that appear in every "good" solution. We show that these problems are tractable for the core language, but may become intractable once we allow inter-dependence between link relations. We also make some surprising connections between our declarative framework, which is deterministic, and probabilistic approaches such as ones based on Markov Logic Networks.
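
To make the ideas in the abstract concrete, here is a toy Python sketch. It is not the paper's formalism or algorithm. It assumes a simplified reading in which a link is allowed only if at least one listed reason (disjunct) holds, a solution is scored by the total number of reasons across its links, and the certain links are the links shared by all top-scoring solutions. The relations `A` and `B`, the three reasons, and the rule that each `A` record links to at most one `B` record are all hypothetical choices made for this illustration. The enumeration is brute force and says nothing about the tractability results mentioned above.

```python
from itertools import combinations

# Hypothetical source relations: id -> (email, phone, surname).
A = {"a1": ("js@x.org", "111", "smith"), "a2": ("mk@x.org", "222", "khan")}
B = {
    "b1": ("js@x.org", "111", "smith"),
    "b2": ("other@x.org", "222", "lee"),
    "b3": ("mk@x.org", "999", "kahn"),
}


def reasons(a, b):
    """Count the disjuncts that hold: same email, same phone, same surname."""
    return sum(x == y for x, y in zip(A[a], B[b]))


# Link-to-source reading: a pair may be linked only if some reason justifies it.
candidates = {(a, b): reasons(a, b) for a in A for b in B if reasons(a, b) > 0}


def is_solution(links):
    """Assumed extra constraint: each A record links to at most one B record."""
    lefts = [a for a, _ in links]
    return len(lefts) == len(set(lefts))


def value(links):
    """Assumed scoring: total number of reasons over all links in the solution."""
    return sum(candidates[link] for link in links)


def maximum_value_solutions():
    """Brute-force enumeration of all solutions, keeping the top-scoring ones."""
    pairs = sorted(candidates)
    solutions = [
        frozenset(subset)
        for size in range(len(pairs) + 1)
        for subset in combinations(pairs, size)
        if is_solution(subset)
    ]
    best = max(value(s) for s in solutions)
    return [s for s in solutions if value(s) == best]


def certain_links(solutions):
    """Links that appear in every given solution."""
    return frozenset.intersection(*solutions)


if __name__ == "__main__":
    best = maximum_value_solutions()
    for solution in best:
        print(sorted(solution), value(solution))
    print("certain:", sorted(certain_links(best)))
```

In this toy data, `a1` matches `b1` for three reasons, while `a2` matches `b2` and `b3` for one reason each. Two top-scoring solutions therefore exist, and only the link between `a1` and `b1` is common to both.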

Keywords
  • entity linking
  • entity resolution
  • constraints
  • certain links
