The Complexity of Aggregates over Extractions by Regular Expressions

Doleschal, Johannes; Bratman, Noa; Kimelfeld, Benny; Martens, Wim

doi:10.4230/LIPIcs.ICDT.2021.10

Abstract

Regular expressions with capture variables, also known as "regex-formulas", extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).

Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In Pablo Barceló and Marco Calautti, editors, ICDT, volume 127 of LIPIcs, pages 22:1-22:19. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ICDT.2019.22.
Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. Efficient logspace classes for enumeration, counting, and uniform generation. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, PODS, pages 59-73. ACM, 2019. URL: https://doi.org/10.1145/3294052.3319704.
K. Y. Cockwell and I. G. Giles. Software tools for motif and pattern scanning: program descriptions including a universal sequence reading algorithm. Computer Applications in the Biosciences, 5(3):227-232, 1989.
Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and Frank Neven. Split-correctness in information extraction. In PODS, pages 149-163, 2019. URL: https://doi.org/10.1145/3294052.3319684.
Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. In ICDT, pages 8:1-8:18, 2020.
A. Ehrenfeucht and H. Zeiger. Complexity measures for regular expressions. Journal of Computer and System Sciences, 12(2):134-146, 1976.
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. Journal of the ACM, 62(2):12:1-12:51, 2015. URL: https://doi.org/10.1145/2699442.
Fernando Florenzano, Cristian Riveros, Martin Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant delay algorithms for regular document spanners. In PODS, page 165–177. Association for Computing Machinery, 2018. URL: https://doi.org/10.1145/3196959.3196987.
Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., 63(7):1679-1754, 2019. URL: https://doi.org/10.1007/s00224-018-9874-1.
Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In PODS, pages 137-149, 2018.
Dominik D. Freydenberger and Sam M. Thompson. Dynamic complexity of document spanners. In Carsten Lutz and Jean Christoph Jung, editors, ICDT, volume 155 of LIPIcs, pages 11:1-11:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.ICDT.2020.11.
Peter J. Haas and Joseph M. Hellerstein. Ripple joins for online aggregation. In SIGMOD Conference, pages 287-298. ACM Press, 1999.
Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. FAQ: questions asked frequently. In Tova Milo and Wang-Chiew Tan, editors, PODS, pages 13-28. ACM, 2016. URL: https://doi.org/10.1145/2902251.2902280.
Mark W. Krentel. The complexity of optimization problems. Journal of Computer and System Sciences, 36(3):490-509, 1988. URL: https://doi.org/10.1016/0022-0000(88)90039-6.
Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu. SystemT: A system for declarative information extraction. SIGMOD Record, 37(4):7-13, 2008. URL: https://doi.org/10.1145/1519103.1519105.
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join: Online aggregation via random walks. In SIGMOD Conference, pages 615-629. ACM, 2016.
Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham. SVM based learning system for information extraction. In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Computer Science, pages 319-339. Springer, 2004.
Yunyao Li, Frederick Reiss, and Laura Chiticariu. SystemT: A declarative information extraction system. In ACL, pages 109-114. ACL, 2011.
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In SIGMOD Conference, page 426–435. Association for Computing Machinery, 1998. URL: https://doi.org/10.1145/276304.276342.
Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125-136, 2018.
Franz Mayr and Sergio Yovine. Regular inference on artificial neural networks. In CD-MAKE, volume 11015 of Lecture Notes in Computer Science, pages 350-369. Springer, 2018.
Joshua J. Michalenko, Ameesh Shah, Abhinav Verma, Richard G. Baraniuk, Swarat Chaudhuri, and Ankit B. Patel. Representing formal languages: A comparison between finite automata and recurrent neural networks. In ICLR (Poster). OpenReview.net, 2019.
Matús Mihalák, Rastislav Srámek, and Peter Widmayer. Approximately counting approximately-shortest paths in directed acyclic graphs. Theory Comput. Syst., 58(1):45-59, 2016. URL: https://doi.org/10.1007/s00224-014-9571-7.
A. Neuwald and P. Green. Detecting patterns in protein sequences. Journal of Molecular Biology, 239:698-712, 1994.
Galia Nordon, Gideon Koren, Varda Shalev, Benny Kimelfeld, Uri Shalit, and Kira Radinsky. Building causal graphs from medical literature and electronic medical records. In AAAI, pages 1102-1109. AAAI Press, 2019.
Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Kröll. Complexity bounds for relational algebra over document spanners. In PODS, pages 320-334. ACM, 2019. URL: https://doi.org/10.1145/3294052.3319699.
Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In ICDT, volume 127 of LIPIcs, pages 13:1-13:18, 2019.
Hoifung Poon and Pedro M. Domingos. Joint inference in information extraction. In AAAI, pages 913-918. AAAI Press, 2007.
Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3):269-282, 2017.
Robert P. Schumaker and Hsinchun Chen. Textual analysis of stock market prediction using breaking financial news: The azfin text system. ACM Trans. Inf. Syst., 27(2):12:1-12:19, 2009. URL: https://doi.org/10.1145/1462198.1462204.
Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. Incremental knowledge base construction using deepdive. PVLDB, 8(11):1310-1321, 2015. URL: https://doi.org/10.14778/2809974.2809991.
Charles A. Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012.
Jan van Leeuwen, editor. Handbook of Theoretical Computer Science (Vol. A): Algorithms and Complexity. MIT Press, Cambridge, MA, USA, 1991.
Gail Weiss, Yoav Goldberg, and Eran Yahav. Extracting automata from recurrent neural networks using queries and counterexamples. In ICML, volume 80, pages 5244-5253. JMLR.org, 2018.

The Complexity of Aggregates over Extractions by Regular Expressions

Authors Johannes Doleschal , Noa Bratman, Benny Kimelfeld, Wim Martens

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

The Complexity of Aggregates over Extractions by Regular Expressions

Authors Johannes Doleschal , Noa Bratman, Benny Kimelfeld, Wim Martens

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message