Repairing Databases over Metric Spaces with Coincidence Constraints

Authors: Youri Kaminsky, Benny Kimelfeld, Ester Livshits, Felix Naumann, David Wajc



Author Details

Youri Kaminsky
  • Hasso Plattner Institute, University of Potsdam, Germany
Benny Kimelfeld
  • Technion, Haifa, Israel
Ester Livshits
  • Technion, Haifa, Israel
Felix Naumann
  • Hasso Plattner Institute, University of Potsdam, Germany
David Wajc
  • Technion, Haifa, Israel

Cite As

Youri Kaminsky, Benny Kimelfeld, Ester Livshits, Felix Naumann, and David Wajc. Repairing Databases over Metric Spaces with Coincidence Constraints. In 28th International Conference on Database Theory (ICDT 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 328, pp. 14:1-14:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025) https://doi.org/10.4230/LIPIcs.ICDT.2025.14

Abstract

Datasets often contain values that naturally reside in a metric space: numbers, strings, geographical locations, machine-learned embeddings in a vector space, and so on. We study the computational complexity of repairing inconsistent databases that violate integrity constraints, where the database values belong to an underlying metric space. The goal is to update the database values to restore consistency while minimizing the total distance between the original values and the repaired ones. We consider what we refer to as coincidence constraints, which include unary key constraints, inclusion constraints, foreign keys, and generally any restriction on the relationship between the numbers of cells of different labels (attributes) coinciding in a single value, for a fixed attribute set.
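To make the objective concrete, the following is a minimal sketch of the repair cost and of two of the constraint types named above. The representation is an assumption made for illustration only and is not taken from the paper: cells are keyed by hypothetical identifiers (`a1`, `b1`, ...), each cell carries one attribute label, and the helper names are invented here.

```python
from collections import Counter

def line_metric(x, y):
    # Distance between two numbers on the line.
    return abs(x - y)

def discrete_metric(x, y):
    # 0 if the value is unchanged, 1 otherwise.
    return 0 if x == y else 1

def repair_cost(original, repaired, dist):
    # Total distance between original and repaired cell values.
    return sum(dist(original[c], repaired[c]) for c in original)

def satisfies_inclusion(values, labels, sub, sup):
    # Inclusion constraint: every value held by a `sub` cell
    # is also held by at least one `sup` cell.
    sup_values = {values[c] for c in values if labels[c] == sup}
    return all(values[c] in sup_values for c in values if labels[c] == sub)

def satisfies_unary_key(values, labels, attr):
    # Unary key: no two `attr` cells coincide in the same value.
    counts = Counter(values[c] for c in values if labels[c] == attr)
    return all(n == 1 for n in counts.values())

# Hypothetical instance: two cells labeled A, two cells labeled B.
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
original = {"a1": 3, "a2": 5, "b1": 4, "b2": 8}

# One candidate repair (not claimed to be optimal): move each A value
# onto a distinct B value so that A is a key and A is included in B.
repaired = {"a1": 4, "a2": 8, "b1": 4, "b2": 8}

print(repair_cost(original, repaired, line_metric))      # |3-4| + |5-8|
print(repair_cost(original, repaired, discrete_metric))  # two changed cells
```

The same candidate repair has a different cost under each metric, which is why the choice of the underlying metric space matters for what counts as a minimal repair.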

We begin by showing that the problem is APX-hard for general metric spaces. We then present an algorithm solving the problem optimally for tree metrics, which generalize both the line metric (i.e., where repaired values are numbers) and the discrete metric (i.e., where we simply count the number of changed values). Combining our algorithm for tree metrics and a classic result on probabilistic tree embeddings, we design a (high probability) logarithmic-ratio approximation for general metrics. We also study the variant of the problem where we limit the allowed change of each individual value. In this variant, it is already NP-complete to decide the existence of any legal repair for a general metric, and we present a polynomial-time repairing algorithm for the case of a line metric.
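As an illustration of why tree metrics cover both special cases mentioned above, the sketch below computes distances in a weighted tree and builds two trees: a path whose edge weights are the gaps between consecutive numbers (reproducing the line metric on a finite set of points), and a star whose edges all have weight 1/2 (reproducing the discrete metric on its leaves). The function names and the auxiliary `hub` node are assumptions made here, and the sketch says nothing about which tree nodes the paper allows as repaired values.

```python
def tree_distance(adj, u, v):
    # Weight of the unique u-v path in a weighted tree (iterative DFS).
    stack = [(u, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == v:
            return dist
        for nxt, weight in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + weight))
    raise ValueError("nodes are not connected")

def path_tree(points):
    # Path over distinct numbers; each edge weighs the gap between neighbors.
    pts = sorted(points)
    adj = {p: [] for p in pts}
    for a, b in zip(pts, pts[1:]):
        adj[a].append((b, b - a))
        adj[b].append((a, b - a))
    return adj

def star_tree(values, hub="hub"):
    # Star with a hypothetical center node; every edge weighs 1/2,
    # so any two distinct leaves are at distance exactly 1.
    adj = {hub: []}
    for x in values:
        adj[x] = [(hub, 0.5)]
        adj[hub].append((x, 0.5))
    return adj

line = path_tree([1, 4, 9])
star = star_tree(["red", "green", "blue"])

print(tree_distance(line, 1, 9))           # equals |1 - 9|
print(tree_distance(star, "red", "blue"))  # distinct leaves
print(tree_distance(star, "red", "red"))   # same leaf
```

The path construction only covers a finite set of numbers, whereas the line metric itself is defined on all reals; the sketch is meant to show the correspondence of distances, not a full embedding.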

Subject Classification

ACM Subject Classification
  • Information systems → Data management systems
Keywords
  • Database repairs
  • metric spaces
  • coincidence constraints
  • inclusion constraints
  • foreign-key constraints
