A complete regex engine implementation in Python featuring a recursive descent parser, Abstract Syntax Tree (AST) representation, and a backtracking matcher with support for advanced regex features.
This project implements a fully functional regular expression engine without relying on Python's built-in re module. It demonstrates:
- Parsing & Backtracking: Recursive descent parsing and backtracking search
- Pattern Matching: Support for quantifiers, character classes, groups, and anchors
| Feature | Syntax | Description | Example |
|---|---|---|---|
| Literals | abc |
Match exact characters | hello matches "hello" |
| Alternation | a|b |
Match either pattern | cat|dog matches "cat" or "dog" |
| Quantifiers | *, +, ?, {n,m} |
Repetition control | a+ matches "a", "aa", "aaa"... |
| Lazy Quantifiers | *?, +?, ??, {n,m}? |
Non-greedy matching | a*? matches as few 'a's as possible |
| Character Classes | [abc], [a-z] |
Match any character in set | [0-9] matches any digit |
| Negated Classes | [^abc] |
Match any character NOT in set | [^0-9] matches non-digits |
| Dot Metacharacter | . |
Match any character except newline | a.c matches "abc", "a1c" |
| Character Escapes | \d, \w, \s |
Predefined character classes | \d+ matches one or more digits |
| Anchors | ^, $, \b, \B |
Position assertions | ^\w+ matches word at start |
| Groups | (...) |
Capturing groups | (ab)+ captures repeated "ab" |
| Non-capturing | (?:...) |
Group without capture | (?:ab)+ groups but doesn't capture |
| Named Groups | (?<name>...) |
Named capture groups | (?<year>\d{4}) captures as "year" |
\d- Digits[0-9]\D- Non-digits[^0-9]\w- Word characters[a-zA-Z0-9_]\W- Non-word characters\s- Whitespace characters\S- Non-whitespace characters
\n- Newline\r- Carriage return\t- Tab\\- Literal backslash\.,\*,\+, etc. - Escaped metacharacters
- Boyer-Moore String matching algorithm to perform fast string matching on long, literal sequence
The architecture is similar to the java.util.regex regex engine.
The parser implements a recursive descent parsing algorithm that converts a regex pattern string into an Abstract Syntax Tree (AST).
The Complete Grammar Structure:
regex → alternation
alternation → concatenation ('|' concatenation)*
concatenation → quantified*
quantified → atom quantifier?
quantifier → ('*' | '+' | '?') '?'? | '{' number '}' '?'? | '{' number ',' '}' '?'? | '{' number ',' number '}' '?'?
atom → character | '.' | character-class | group | anchor | escape-sequence
The AST represents the parsed regex pattern in a tree structure:
ASTNode (base class)
├── Character # Literal character
├── Empty # ε (epsilon) - matches empty string
├── Concatenation # Sequence: abc
├── Alternation # Choice: a|b|c
├── Quantifier # Repetition: a*, a+, a{n,m}
├── Dot # Wildcard: .
├── CharacterClass # Set: [abc], [a-z], [^0-9]
├── Group # Grouping: (...), (?:...), (?<name>...)
└── Anchor # Positions: ^, $, \b, \BThe matcher uses a backtracking algorithm with generators to find matches.
Key Algorithm Features:
- Generator-based backtracking: Each matching function is a generator that yields all possible match positions
- Greedy vs. Lazy quantifiers: Controls the order of backtracking attempts
- Capture group tracking: Records matched text for groups
from parser import Parser
# Create parser and matcher
parser = Parser("hello|world")
matcher = parser.matcher("hello")
# Check for match
if matcher.match():
print("Pattern matched!")# Email-like pattern
parser = Parser(r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-z]+")
matcher = parser.matcher("user@example.com")
print(matcher.match()) # True
# Phone number pattern
parser = Parser(r"\d{3}-\d{3}-\d{4}")
matcher = parser.matcher("123-456-7890")
print(matcher.match()) # True
# URL pattern with groups
parser = Parser(r"(https?)://([a-z.]+)")
matcher = parser.matcher("https://example.com")
if matcher.match():
print(f"Groups: {matcher.groups}")# Greedy matching
parser = Parser("a*a")
matcher = parser.matcher("aaaa")
print(matcher.match()) # True - matches all 'a's
# Lazy matching
parser = Parser("a*?a")
matcher = parser.matcher("aaaa")
print(matcher.match()) # True - matches minimally
# Counted repetitions
parser = Parser("a{2,4}")
matcher = parser.matcher("aaa")
print(matcher.match()) # True- Performance: Not optimized for production use (no NFA/DFA compilation)
- Features: Limited compared to PCRE (no lookahead/lookbehind, backreferences)
- Unicode: Basic ASCII support (can be extended for full Unicode)
-
Optimization:
- NFA/DFA compilation for linear-time matching
- Memoization to avoid redundant computations
-
Advanced Features:
- Lookahead/lookbehind assertions
- Backreferences (\1, \2, etc.)
- Atomic groups and possessive quantifiers
-
Tooling:
- Interactive regex debugger with step-by-step visualization
- Performance benchmarking suite
Vikrant Mehta - vikrantmehta123@gmail.com
Note: This is an educational project demonstrating compiler construction and algorithm implementation. For production regex needs, use Python's built-in re module.