The Complexity of Aggregates over Extractions by Regular Expressions

Authors Johannes Doleschal , Noa Bratman, Benny Kimelfeld, Wim Martens

Thumbnail PDF


  • Filesize: 0.9 MB
  • 20 pages

Document Identifiers

Author Details

Johannes Doleschal
  • Universität Bayreuth, Germany
  • Hasselt University, Belgium
Noa Bratman
  • Technion - Israel Institute of Technology, Haifa, Israel
Benny Kimelfeld
  • Technion - Israel Institute of Technology, Haifa, Israel
Wim Martens
  • Universität Bayreuth, Germany

Cite AsGet BibTex

Johannes Doleschal, Noa Bratman, Benny Kimelfeld, and Wim Martens. The Complexity of Aggregates over Extractions by Regular Expressions. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 10:1-10:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Regular expressions with capture variables, also known as "regex-formulas", extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).

Subject Classification

ACM Subject Classification
  • Information systems → Information extraction
  • Theory of computation → Database query languages (principles)
  • Theory of computation → Problems, reductions and completeness
  • Information extraction
  • document spanners
  • regular expressions
  • aggregation functions


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In Pablo Barceló and Marco Calautti, editors, ICDT, volume 127 of LIPIcs, pages 22:1-22:19. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019. URL:
  2. Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. Efficient logspace classes for enumeration, counting, and uniform generation. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, PODS, pages 59-73. ACM, 2019. URL:
  3. K. Y. Cockwell and I. G. Giles. Software tools for motif and pattern scanning: program descriptions including a universal sequence reading algorithm. Computer Applications in the Biosciences, 5(3):227-232, 1989. Google Scholar
  4. Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and Frank Neven. Split-correctness in information extraction. In PODS, pages 149-163, 2019. URL:
  5. Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. In ICDT, pages 8:1-8:18, 2020. Google Scholar
  6. A. Ehrenfeucht and H. Zeiger. Complexity measures for regular expressions. Journal of Computer and System Sciences, 12(2):134-146, 1976. Google Scholar
  7. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. Journal of the ACM, 62(2):12:1-12:51, 2015. URL:
  8. Fernando Florenzano, Cristian Riveros, Martin Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant delay algorithms for regular document spanners. In PODS, page 165–177. Association for Computing Machinery, 2018. URL:
  9. Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., 63(7):1679-1754, 2019. URL:
  10. Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In PODS, pages 137-149, 2018. Google Scholar
  11. Dominik D. Freydenberger and Sam M. Thompson. Dynamic complexity of document spanners. In Carsten Lutz and Jean Christoph Jung, editors, ICDT, volume 155 of LIPIcs, pages 11:1-11:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL:
  12. Peter J. Haas and Joseph M. Hellerstein. Ripple joins for online aggregation. In SIGMOD Conference, pages 287-298. ACM Press, 1999. Google Scholar
  13. Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. FAQ: questions asked frequently. In Tova Milo and Wang-Chiew Tan, editors, PODS, pages 13-28. ACM, 2016. URL:
  14. Mark W. Krentel. The complexity of optimization problems. Journal of Computer and System Sciences, 36(3):490-509, 1988. URL:
  15. Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu. SystemT: A system for declarative information extraction. SIGMOD Record, 37(4):7-13, 2008. URL:
  16. Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join: Online aggregation via random walks. In SIGMOD Conference, pages 615-629. ACM, 2016. Google Scholar
  17. Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham. SVM based learning system for information extraction. In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Computer Science, pages 319-339. Springer, 2004. Google Scholar
  18. Yunyao Li, Frederick Reiss, and Laura Chiticariu. SystemT: A declarative information extraction system. In ACL, pages 109-114. ACL, 2011. Google Scholar
  19. Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In SIGMOD Conference, page 426–435. Association for Computing Machinery, 1998. URL:
  20. Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125-136, 2018. Google Scholar
  21. Franz Mayr and Sergio Yovine. Regular inference on artificial neural networks. In CD-MAKE, volume 11015 of Lecture Notes in Computer Science, pages 350-369. Springer, 2018. Google Scholar
  22. Joshua J. Michalenko, Ameesh Shah, Abhinav Verma, Richard G. Baraniuk, Swarat Chaudhuri, and Ankit B. Patel. Representing formal languages: A comparison between finite automata and recurrent neural networks. In ICLR (Poster)., 2019. Google Scholar
  23. Matús Mihalák, Rastislav Srámek, and Peter Widmayer. Approximately counting approximately-shortest paths in directed acyclic graphs. Theory Comput. Syst., 58(1):45-59, 2016. URL:
  24. A. Neuwald and P. Green. Detecting patterns in protein sequences. Journal of Molecular Biology, 239:698-712, 1994. Google Scholar
  25. Galia Nordon, Gideon Koren, Varda Shalev, Benny Kimelfeld, Uri Shalit, and Kira Radinsky. Building causal graphs from medical literature and electronic medical records. In AAAI, pages 1102-1109. AAAI Press, 2019. Google Scholar
  26. Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Kröll. Complexity bounds for relational algebra over document spanners. In PODS, pages 320-334. ACM, 2019. URL:
  27. Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In ICDT, volume 127 of LIPIcs, pages 13:1-13:18, 2019. Google Scholar
  28. Hoifung Poon and Pedro M. Domingos. Joint inference in information extraction. In AAAI, pages 913-918. AAAI Press, 2007. Google Scholar
  29. Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3):269-282, 2017. Google Scholar
  30. Robert P. Schumaker and Hsinchun Chen. Textual analysis of stock market prediction using breaking financial news: The azfin text system. ACM Trans. Inf. Syst., 27(2):12:1-12:19, 2009. URL:
  31. Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. Incremental knowledge base construction using deepdive. PVLDB, 8(11):1310-1321, 2015. URL:
  32. Charles A. Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012. Google Scholar
  33. Jan van Leeuwen, editor. Handbook of Theoretical Computer Science (Vol. A): Algorithms and Complexity. MIT Press, Cambridge, MA, USA, 1991. Google Scholar
  34. Gail Weiss, Yoav Goldberg, and Eran Yahav. Extracting automata from recurrent neural networks using queries and counterexamples. In ICML, volume 80, pages 5244-5253., 2018. Google Scholar
Questions / Remarks / Feedback

Feedback for Dagstuhl Publishing

Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail