Skyline Operators for Document Spanners

Authors: Antoine Amarilli, Benny Kimelfeld, Sébastien Labbé, Stefan Mengel



Author Details

Antoine Amarilli
  • LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Benny Kimelfeld
  • Technion - Israel Institute of Technology, Haifa, Israel
Sébastien Labbé
  • École normale supérieure, Paris, France
Stefan Mengel
  • Univ. Artois, CNRS, Centre de Recherche en Informatique de Lens (CRIL), France

Cite As

Antoine Amarilli, Benny Kimelfeld, Sébastien Labbé, and Stefan Mengel. Skyline Operators for Document Spanners. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ICDT.2024.7

Abstract

When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies across systems and tasks. For example, we may state that a tuple is dominated by tuples that extend it by assigning additional attributes, or by assigning larger intervals. The result of filtering the relation is then the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this end, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup in the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. More precisely, our analysis identifies classes of domination rules for which the combined complexity is tractable or intractable.
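
To make the filtering step concrete, the following is a minimal sketch of the naive post-extraction skyline that the abstract takes as its starting point; it is not the compiled construction studied in the paper. It assumes that a tuple is a mapping from variable names to half-open spans, with unassigned variables simply absent, and that a tuple is dominated by another when each of its spans is contained in the corresponding span of the other. The names `dominated_by` and `skyline` and the sample tuples are hypothetical and chosen for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a span is a half-open interval [start, end) of document positions.
Span = Tuple[int, int]
# Assumption: a tuple maps variable names to spans; unassigned variables are absent.
Mapping = Dict[str, Span]


def span_contains(big: Span, small: Span) -> bool:
    """Return True if `small` lies within `big`."""
    return big[0] <= small[0] and small[1] <= big[1]


def dominated_by(t: Mapping, u: Mapping) -> bool:
    """Return True if `u` strictly dominates `t`.

    Illustrative rule: every variable assigned by `t` is also assigned by `u`
    to a span containing it, and the two tuples differ. `u` may therefore
    assign additional variables, larger spans, or both.
    """
    if t == u:
        return False
    return all(var in u and span_contains(u[var], span) for var, span in t.items())


def skyline(
    tuples: List[Mapping],
    dominates: Callable[[Mapping, Mapping], bool],
) -> List[Mapping]:
    """Keep the tuples that no other tuple dominates (quadratic post-filter)."""
    return [t for t in tuples if not any(dominates(t, u) for u in tuples)]


# Hypothetical extraction result over some document.
a = {"x": (0, 5)}
b = {"x": (0, 15)}               # same variable, larger span than a
c = {"x": (0, 5), "y": (6, 15)}  # extends a with an additional variable
d = {"x": (20, 25)}              # incomparable with the others

result = skyline([a, b, c, d], dominated_by)
# a is dropped: both b and c dominate it; b, c and d are pairwise incomparable.
```

The filter compares every pair of extracted tuples, so all dominated tuples are materialized before being discarded; the abstract asks whether the domination rule can instead be compiled into the extractor.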

Subject Classification

ACM Subject Classification
  • Information systems → Information extraction
  • Theory of computation → Database query processing and optimization (theory)
Keywords
  • Information Extraction
  • Document Spanners
  • Query Evaluation
