POSIX Lexing with Bitcoded Derivatives

Tan, Chengsong; Urban, Christian

doi:10.4230/LIPIcs.ITP.2023.27

Subject Classification

ACM Subject Classification

Theory of computation → Design and analysis of algorithms
Theory of computation → Formal languages and automata theory

Keywords

POSIX matching and lexing
derivatives of regular expressions
Isabelle/HOL

Abstract

Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string, that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an "aggressive" simplification function that keeps the size of derivatives finitely bounded. Without simplification the size of some derivatives can grow arbitrarily big, resulting in an extremely slow lexing algorithm. In this paper we describe a variant of Sulzmann and Lu’s algorithm: our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our variant is correct and generates unique POSIX values (no such proof has been given for the original algorithm by Sulzmann and Lu); we also (ii) establish finite bounds for the size of our derivatives.
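
To make the role of derivatives and simplification concrete, the following is a minimal sketch of plain Brzozowski-derivative matching with a basic simplifier. It is not the bitcoded algorithm of the paper: it neither carries bitcodes nor produces POSIX values, and the simplification rules are only the elementary ones, not the "aggressive" function the abstract refers to. All names (`Rexp`, `der`, `simp`, `ders`, `size`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass

class Rexp:
    """Base class for regular expressions (illustrative only)."""

@dataclass(frozen=True)
class Zero(Rexp):      # matches no string
    pass

@dataclass(frozen=True)
class One(Rexp):       # matches only the empty string
    pass

@dataclass(frozen=True)
class Char(Rexp):      # matches the single character c
    c: str

@dataclass(frozen=True)
class Alt(Rexp):       # r1 + r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):       # r1 . r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):      # r*
    r: Rexp

def nullable(r):
    """True if r can match the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Char)):
        return False
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    return nullable(r.r1) and nullable(r.r2)  # Seq

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    return Seq(der(c, r.r), r)  # Star

def simp(r):
    """Elementary simplification: drop Zero/One units and duplicate alternatives."""
    if isinstance(r, Alt):
        a, b = simp(r.r1), simp(r.r2)
        if a == Zero():
            return b
        if b == Zero() or a == b:
            return a
        return Alt(a, b)
    if isinstance(r, Seq):
        a, b = simp(r.r1), simp(r.r2)
        if a == Zero() or b == Zero():
            return Zero()
        if a == One():
            return b
        if b == One():
            return a
        return Seq(a, b)
    return r

def size(r):
    """Number of nodes in the syntax tree of r."""
    if isinstance(r, (Alt, Seq)):
        return 1 + size(r.r1) + size(r.r2)
    if isinstance(r, Star):
        return 1 + size(r.r)
    return 1

def ders(r, s, simplify=True):
    """Derivative of r with respect to the string s, optionally simplifying each step."""
    for c in s:
        r = der(c, r)
        if simplify:
            r = simp(r)
    return r

def matches(r, s, simplify=True):
    """r matches s exactly when the derivative with respect to s is nullable."""
    return nullable(ders(r, s, simplify))
```

For instance, with `r = Seq(Star(Star(Char("a"))), Char("b"))`, comparing `size(ders(r, "a" * 10, simplify=False))` against `size(ders(r, "a" * 10))` shows the growth that simplification is meant to prevent. Recovering how the string was matched, rather than only whether it was matched, is the part that the bitcodes address.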

Cite As

Chengsong Tan and Christian Urban. POSIX Lexing with Bitcoded Derivatives. In 14th International Conference on Interactive Theorem Proving (ITP 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 268, pp. 27:1-27:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.ITP.2023.27

Author Details

Chengsong Tan

Imperial College London, UK

Christian Urban

King’s College London, UK

References

V. Antimirov. Partial Derivatives of Regular Expressions and Finite Automata Constructions. Theoretical Computer Science, 155:291-319, 1995.
F. Ausaf, R. Dyckhoff, and C. Urban. POSIX Lexing with Derivatives of Regular Expressions (Proof Pearl). In Proc. of the 7th International Conference on Interactive Theorem Proving (ITP), volume 9807 of LNCS, pages 69-86, 2016.
H. Björklund, W. Martens, and T. Timm. Efficient Incremental Evaluation of Succinct Regular Expressions. In Proc. of the 24th ACM Conf. on Information and Knowledge Management (CIKM), pages 1541-1550, 2015.
J. A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM, 11(4):481-494, 1964.
T. Coquand and V. Siles. A Decision Procedure for Regular Expression Equivalence in Type Theory. In Proc. of the 1st International Conference on Certified Programs and Proofs (CPP), volume 7086 of LNCS, pages 119-134, 2011.
D. Egolf, S. Lasser, and K. Fisher. Verbatim: A Verified Lexer Generator. In 2021 IEEE Security and Privacy Workshops (SPW), pages 92-100, 2021.
D. Egolf, S. Lasser, and K. Fisher. Verbatim++: Verified, Optimized, and Semantically Rich Lexing with Derivatives. In Proc. of the 11th ACM SIGPLAN Conference on Certified Programs and Proofs (CPP), pages 27-39. ACM, 2022.
A. Krauss and T. Nipkow. Proof Pearl: Regular Expression Equivalence and Relation Algebra. Journal of Automated Reasoning, 49:95-106, 2012.
C. Kuklewicz. Regex Posix. URL: https://wiki.haskell.org/Regex_Posix.
L. Nielsen and F. Henglein. Bit-Coded Regular Expression Parsing. In Proc. of the 5th International Conference on Language and Automata Theory and Applications (LATA), volume 6638 of LNCS, pages 402-413, 2011.
S. Okui and T. Suzuki. Disambiguation in Regular Expression Matching via Position Automata with Augmented Transitions. In Proc. of the 15th International Conference on Implementation and Application of Automata (CIAA), volume 6482 of LNCS, pages 231-240, 2010.
S. Owens and K. Slind. Adapting Functional Programs to Higher Order Logic. Higher-Order and Symbolic Computation, 21(4):377-409, 2008.
The Open Group Base Specification Issue 6 IEEE Std 1003.1 2004 Edition, 2004. URL: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html.
R. Ribeiro and A. Du Bois. Certified Bit-Coded Regular Expression Parsing. In Proc. of the 21st Brazilian Symposium on Programming Languages, pages 4:1-4:8, 2017.
M. Sulzmann and K. Lu. POSIX Regular Expression Parsing with Derivatives. In Proc. of the 12th International Conference on Functional and Logic Programming (FLOPS), volume 8475 of LNCS, pages 203-220, 2014.
L. Turoňová, L. Holík, O. Lengál, O. Saarikivi, M. Veanes, and T. Vojnar. Regex Matching with Counting-Set Automata. Proceedings of the ACM on Programming Languages (PACMPL), 4:218:1-218:30, 2020.
S. Vansummeren. Type Inference for Unique Pattern Matching. ACM Transactions on Programming Languages and Systems, 28(3):389-428, 2006.