POSIX Lexing with Bitcoded Derivatives

Authors Chengsong Tan, Christian Urban



PDF
Thumbnail PDF

File

LIPIcs.ITP.2023.27.pdf
  • Filesize: 0.7 MB
  • 18 pages

Document Identifiers

Author Details

Chengsong Tan
  • Imperial College London, UK
Christian Urban
  • King’s College London, UK

Cite AsGet BibTex

Chengsong Tan and Christian Urban. POSIX Lexing with Bitcoded Derivatives. In 14th International Conference on Interactive Theorem Proving (ITP 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 268, pp. 27:1-27:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ITP.2023.27

Abstract

Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string - that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an "aggressive" simplification function that keeps the size of derivatives finitely bounded. Without simplification the size of some derivatives can grow arbitrarily big, resulting in an extremely slow lexing algorithm. In this paper we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our variant is correct and generates unique POSIX values (no such proof has been given for the original algorithm by Sulzmann and Lu); we also (ii) establish finite bounds for the size of our derivatives.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
  • Theory of computation → Formal languages and automata theory
Keywords
  • POSIX matching and lexing
  • derivatives of regular expressions
  • Isabelle/HOL

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. V. Antimirov. Partial Derivatives of Regular Expressions and Finite Automata Constructions. Theoretical Computer Science, 155:291-319, 1995. Google Scholar
  2. F. Ausaf, R. Dyckhoff, and C. Urban. POSIX Lexing with Derivatives of Regular Expressions (Proof Pearl). In Proc. of the 7th International Conference on Interactive Theorem Proving (ITP), volume 9807 of LNCS, pages 69-86, 2016. Google Scholar
  3. H. Björklund, W. Martens, and T. Timm. Efficient Incremental Evaluation of Succinct Regular Expressions. In Proc. of the 24th ACM Conf. on Information and Knowledge Management (CIKM), pages 1541-1550, 2015. Google Scholar
  4. J. A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM, 11(4):481-494, 1964. Google Scholar
  5. T. Coquand and V. Siles. A Decision Procedure for Regular Expression Equivalence in Type Theory. In Proc. of the 1st International Conference on Certified Programs and Proofs (CPP), volume 7086 of LNCS, pages 119-134, 2011. Google Scholar
  6. D. Egolf, S. Lasser, and K. Fisher. Verbatim: A Verified Lexer Generator. In 2021 IEEE Security and Privacy Workshops (SPW), pages 92-100, 2021. Google Scholar
  7. D. Egolf, S. Lasser, and K. Fisher. Verbatim++: Verified, Optimized, and Semantically Rich Lexing with Dderivatives. In Proc. of the 11th ACM SIGPLAN Conference on Certified Programs and Proofs (CPP), pages 27-39. ACM, 2022. Google Scholar
  8. A. Krauss and T. Nipkow. Proof Pearl: Regular Expression Equivalence and Relation Algebra. Journal of Automated Reasoning, 49:95-106, 2012. Google Scholar
  9. C. Kuklewicz. Regex Posix. URL: https://wiki.haskell.org/Regex_Posix.
  10. L. Nielsen and F. Henglein. Bit-Coded Regular Expression Parsing. In Proc. of the 5th International Conference on Language and Automata Theory and Applications (LATA), volume 6638 of LNCS, pages 402-413, 2011. Google Scholar
  11. S. Okui and T. Suzuki. Disambiguation in Regular Expression Matching via Position Automata with Augmented Transitions. In Proc. of the 15th International Conference on Implementation and Application of Automata (CIAA), volume 6482 of LNCS, pages 231-240, 2010. Google Scholar
  12. S. Owens and K. Slind. Adapting Functional Programs to Higher Order Logic. Higher-Order and Symbolic Computation, 21(4):377-409, 2008. Google Scholar
  13. The Open Group Base Specification Issue 6 IEEE Std 1003.1 2004 Edition, 2004. URL: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html.
  14. R. Ribeiro and A. Du Bois. Certified Bit-Coded Regular Expression Parsing. In Proc. of the 21st Brazilian Symposium on Programming Languages, pages 4:1-4:8, 2017. Google Scholar
  15. M. Sulzmann and K. Lu. POSIX Regular Expression Parsing with Derivatives. In Proc. of the 12th International Conference on Functional and Logic Programming (FLOPS), volume 8475 of LNCS, pages 203-220, 2014. Google Scholar
  16. L. Turoňová, L. Holík, O. Lengál, O. Saarikivi, M. Veanes, and T. Vojnar. Regex Matching with Counting-Set Automata. Proceedings of the ACM on Programming Languages (PACMPL), 4:218:1-218:30, 2020. Google Scholar
  17. S. Vansummeren. Type Inference for Unique Pattern Matching. ACM Transactions on Programming Languages and Systems, 28(3):389-428, 2006. Google Scholar