miniRegex: A Regular Expression Engine from Scratch

A complete regex engine implementation in Python featuring a recursive descent parser, an Abstract Syntax Tree (AST) representation, and a backtracking matcher with support for features such as lazy quantifiers, named groups, and word-boundary anchors.

Project Overview

This project implements a fully functional regular expression engine without relying on Python's built-in re module. It demonstrates:

- Parsing & Backtracking: Recursive descent parsing and backtracking search
- Pattern Matching: Support for quantifiers, character classes, groups, and anchors

Features

Supported Regex Syntax

| Feature | Syntax | Description | Example |
|---|---|---|---|
| Literals | `abc` | Match exact characters | `hello` matches "hello" |
| Alternation | `a\|b` | Match either pattern | `cat\|dog` matches "cat" or "dog" |
| Quantifiers | `*`, `+`, `?`, `{n,m}` | Repetition control | `a+` matches "a", "aa", "aaa", ... |
| Lazy Quantifiers | `*?`, `+?`, `??`, `{n,m}?` | Non-greedy matching | `a*?` matches as few 'a's as possible |
| Character Classes | `[abc]`, `[a-z]` | Match any character in set | `[0-9]` matches any digit |
| Negated Classes | `[^abc]` | Match any character NOT in set | `[^0-9]` matches non-digits |
| Dot Metacharacter | `.` | Match any character except newline | `a.c` matches "abc", "a1c" |
| Character Escapes | `\d`, `\w`, `\s` | Predefined character classes | `\d+` matches one or more digits |
| Anchors | `^`, `$`, `\b`, `\B` | Position assertions | `^\w+` matches a word at the start |
| Groups | `(...)` | Capturing groups | `(ab)+` captures repeated "ab" |
| Non-capturing | `(?:...)` | Group without capture | `(?:ab)+` groups but doesn't capture |
| Named Groups | `(?<name>...)` | Named capture groups | `(?<year>\d{4})` captures as "year" |

Character Class Shortcuts

- `\d` - Digits `[0-9]`
- `\D` - Non-digits `[^0-9]`
- `\w` - Word characters `[a-zA-Z0-9_]`
- `\W` - Non-word characters `[^a-zA-Z0-9_]`
- `\s` - Whitespace characters
- `\S` - Non-whitespace characters
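
As a rough illustration of how these shortcuts can be resolved, the sketch below maps each lower-case shortcut to a set of ASCII characters and treats the upper-case form as its negation. The names `SHORTCUTS` and `shortcut_matches` are hypothetical and are not taken from this project, and the whitespace set (space, tab, newline, carriage return, form feed, vertical tab) is an assumption, since the README does not list it.

```python
import string

# Hypothetical lookup table: shortcut letter -> set of ASCII characters it matches.
# The whitespace set below is an assumption, not taken from the project.
SHORTCUTS = {
    "d": set(string.digits),
    "w": set(string.ascii_letters + string.digits + "_"),
    "s": set(" \t\n\r\f\v"),
}


def shortcut_matches(letter: str, ch: str) -> bool:
    """Return True if the single character `ch` matches the shortcut `\\<letter>`."""
    inside = ch in SHORTCUTS[letter.lower()]
    # Upper-case shortcuts (\D, \W, \S) negate the corresponding lower-case set.
    return not inside if letter.isupper() else inside
```

For example, `shortcut_matches("d", "7")` is true, while `shortcut_matches("D", "7")` is false.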

Escape Sequences

- `\n` - Newline
- `\r` - Carriage return
- `\t` - Tab
- `\\` - Literal backslash
- `\.`, `\*`, `\+`, etc. - Escaped metacharacters

Optimization:

The engine uses the Boyer-Moore string matching algorithm to search quickly for long literal sequences.
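
To give a feel for this kind of literal search, here is a minimal sketch of the Boyer-Moore-Horspool variant, which keeps only the bad-character shift table and drops the good-suffix rule of full Boyer-Moore. It is a simplified stand-in rather than the code used in this project, and the function name `horspool_find` is hypothetical.

```python
def horspool_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of `pattern` in `text`, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Bad-character table: for each character of the pattern except the last,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        # Compare the current window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Slide the window based on the text character under the pattern's last
        # position; characters absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

The benefit grows with the length of the literal, because a mismatch on a character that does not occur in the pattern lets the search skip `len(pattern)` positions at once.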

Architecture

The architecture is similar to that of Java's java.util.regex engine.

1. Parser (`parser.py`)

The parser implements a recursive descent parsing algorithm that converts a regex pattern string into an Abstract Syntax Tree (AST).

The Complete Grammar Structure:

regex           → alternation
alternation     → concatenation ('|' concatenation)*
concatenation   → quantified*
quantified      → atom quantifier?
quantifier      → ('*' | '+' | '?') '?'? | '{' number '}' '?'? | '{' number ',' '}' '?'? | '{' number ',' number '}' '?'?
atom            → character | '.' | character-class | group | anchor | escape-sequence
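
The sketch below shows how a grammar of this shape maps onto recursive descent, with one method per rule. It covers only a subset (literals, `.`, groups, alternation, and the `*`, `+`, `?` quantifiers with an optional lazy `?`) and returns nested tuples instead of AST objects. The class name `MiniParser` and the tuple layout are assumptions made for illustration, not the project's actual parser.

```python
class MiniParser:
    """Hypothetical recursive descent parser for a subset of the grammar above."""

    def __init__(self, pattern: str):
        self.pattern = pattern
        self.pos = 0

    def peek(self):
        # Look at the next character without consuming it (None at end of input).
        return self.pattern[self.pos] if self.pos < len(self.pattern) else None

    def advance(self):
        ch = self.pattern[self.pos]
        self.pos += 1
        return ch

    def parse(self):
        node = self.alternation()
        if self.peek() is not None:
            raise ValueError(f"unexpected {self.peek()!r} at position {self.pos}")
        return node

    # alternation → concatenation ('|' concatenation)*
    def alternation(self):
        branches = [self.concatenation()]
        while self.peek() == "|":
            self.advance()
            branches.append(self.concatenation())
        return branches[0] if len(branches) == 1 else ("alt", branches)

    # concatenation → quantified*
    def concatenation(self):
        items = []
        while self.peek() is not None and self.peek() not in "|)":
            items.append(self.quantified())
        if not items:
            return ("empty",)
        return items[0] if len(items) == 1 else ("cat", items)

    # quantified → atom quantifier?   (only *, +, ? in this sketch)
    def quantified(self):
        node = self.atom()
        if self.peek() is not None and self.peek() in "*+?":
            op = self.advance()
            lazy = self.peek() == "?"
            if lazy:
                self.advance()
            node = ("quant", op, lazy, node)
        return node

    # atom → character | '.' | group   (subset of the full rule)
    def atom(self):
        ch = self.advance()
        if ch == "(":
            inner = self.alternation()
            if self.peek() != ")":
                raise ValueError("missing ')'")
            self.advance()
            return ("group", inner)
        if ch == ".":
            return ("dot",)
        if ch in "*+?":
            raise ValueError(f"nothing to repeat at position {self.pos - 1}")
        return ("char", ch)
```

Because each grammar rule calls only the rule below it, operator precedence falls out of the call structure: quantifiers bind tighter than concatenation, which binds tighter than `|`.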

2. AST Nodes (`matcher.py`)

The AST represents the parsed regex pattern in a tree structure:

ASTNode (base class)
├── Character       # Literal character
├── Empty           # ε (epsilon) - matches empty string
├── Concatenation   # Sequence: abc
├── Alternation     # Choice: a|b|c
├── Quantifier      # Repetition: a*, a+, a{n,m}
├── Dot             # Wildcard: .
├── CharacterClass  # Set: [abc], [a-z], [^0-9]
├── Group           # Grouping: (...), (?:...), (?<name>...)
└── Anchor          # Positions: ^, $, \b, \B
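
One plausible way to express a few of these node types is with dataclasses, as sketched below. The field names (`parts`, `options`, `min_count`, `max_count`, `capturing`, and so on) are assumptions for illustration and may differ from the definitions in `matcher.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


class ASTNode:
    """Base class for all nodes."""


@dataclass
class Character(ASTNode):
    char: str


@dataclass
class Concatenation(ASTNode):
    parts: List[ASTNode]


@dataclass
class Alternation(ASTNode):
    options: List[ASTNode]


@dataclass
class Quantifier(ASTNode):
    child: ASTNode
    min_count: int
    max_count: Optional[int]  # None means "no upper bound", as in a* or a+
    lazy: bool = False


@dataclass
class Group(ASTNode):
    child: ASTNode
    capturing: bool = True
    name: Optional[str] = None


# Hand-built tree for the pattern (ab)+|c
tree = Alternation([
    Quantifier(Group(Concatenation([Character("a"), Character("b")])), 1, None),
    Character("c"),
])
```

Keeping quantifier bounds as plain integers lets `*`, `+`, `?`, and `{n,m}` share a single node type.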

3. Matcher (`matcher.py`)

The matcher uses a backtracking algorithm with generators to find matches.

Key Algorithm Features:

- Generator-based backtracking: Each matching function is a generator that yields all possible match positions
- Greedy vs. Lazy quantifiers: Controls the order of backtracking attempts
- Capture group tracking: Records matched text for groups
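
The generator idea can be sketched in a few functions. The version below works on nested tuples instead of AST node classes, supports only literals, `.`, concatenation, alternation, and `*`, and leaves out capture tracking. All names in it (`match_node`, `match_sequence`, `match_star`, `full_match`) are hypothetical and not taken from `matcher.py`.

```python
def match_node(node, text, pos):
    """Yield every position where `node` can finish when it starts at `pos`."""
    kind = node[0]
    if kind == "char":
        if pos < len(text) and text[pos] == node[1]:
            yield pos + 1
    elif kind == "dot":
        if pos < len(text) and text[pos] != "\n":
            yield pos + 1
    elif kind == "alt":
        # Try each branch in order; a later branch is reached only by backtracking.
        for option in node[1]:
            yield from match_node(option, text, pos)
    elif kind == "cat":
        yield from match_sequence(node[1], text, pos)
    elif kind == "star":
        yield from match_star(node[1], node[2], text, pos)
    else:
        raise ValueError(f"unknown node kind: {kind!r}")


def match_sequence(parts, text, pos):
    """Match `parts` one after another, resuming earlier parts on failure."""
    if not parts:
        yield pos
        return
    for mid in match_node(parts[0], text, pos):
        yield from match_sequence(parts[1:], text, mid)


def match_star(child, lazy, text, pos):
    """Zero or more repetitions; `lazy` decides which candidates come first."""
    if lazy:
        yield pos  # lazy: offer the shortest option (stop here) first
    for mid in match_node(child, text, pos):
        if mid > pos:  # skip empty iterations to avoid infinite recursion
            yield from match_star(child, lazy, text, mid)
    if not lazy:
        yield pos  # greedy: offer "stop here" only after longer options


def full_match(node, text):
    """Return True if `node` can consume the whole of `text`."""
    return any(end == len(text) for end in match_node(node, text, 0))
```

Backtracking needs no explicit stack here: when a later part of a sequence fails, the `for` loop in `match_sequence` simply asks the earlier generator for its next candidate position.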

Usage

Basic Pattern Matching

from parser import Parser

# Create parser and matcher
parser = Parser("hello|world")
matcher = parser.matcher("hello")

# Check for match
if matcher.match():
    print("Pattern matched!")

Complex Patterns

# Email-like pattern
parser = Parser(r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-z]+")
matcher = parser.matcher("user@example.com")
print(matcher.match())  # True

# Phone number pattern
parser = Parser(r"\d{3}-\d{3}-\d{4}")
matcher = parser.matcher("123-456-7890")
print(matcher.match())  # True

# URL pattern with groups
parser = Parser(r"(https?)://([a-z.]+)")
matcher = parser.matcher("https://example.com")
if matcher.match():
    print(f"Groups: {matcher.groups}")

Quantifier Examples

# Greedy matching
parser = Parser("a*a")
matcher = parser.matcher("aaaa")
print(matcher.match())  # True - matches all 'a's

# Lazy matching
parser = Parser("a*?a")
matcher = parser.matcher("aaaa")
print(matcher.match())  # True - matches minimally

# Counted repetitions
parser = Parser("a{2,4}")
matcher = parser.matcher("aaa")
print(matcher.match())  # True
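
The examples above all print True, so the difference between greedy and lazy quantifiers is not visible from the result alone. The standalone sketch below makes it visible by listing the order in which candidate end positions would be tried for a single repeated character. The helper `repeat_char` is hypothetical and does not use this project's API; whether the final match differs in practice depends on whether `match()` requires the whole input to match, which this README does not state.

```python
def repeat_char(ch, text, pos, lo, hi, lazy=False):
    """Yield candidate end positions for `ch{lo,hi}` starting at `pos`."""
    # Count how many consecutive copies of `ch` are available, capped at `hi`.
    count = 0
    while count < hi and pos + count < len(text) and text[pos + count] == ch:
        count += 1
    if count < lo:
        return  # fewer than the required minimum: no match at all
    # Greedy tries the longest run first; lazy tries the shortest first.
    counts = range(lo, count + 1) if lazy else range(count, lo - 1, -1)
    for k in counts:
        yield pos + k


print(list(repeat_char("a", "aaaa", 0, 0, 4)))             # greedy a{0,4}
print(list(repeat_char("a", "aaaa", 0, 0, 4, lazy=True)))  # lazy a{0,4}?
print(list(repeat_char("a", "aaa", 0, 2, 4)))              # greedy a{2,4}
```

The greedy call lists positions from longest to shortest and the lazy call from shortest to longest; both cover the same set of candidates, which is why the two styles accept the same strings when the whole input must match.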

Limitations

- Performance: Not optimized for production use. Patterns are not compiled to an NFA or DFA, so matching time is not guaranteed to be linear.
- Features: Limited compared to PCRE (no lookahead/lookbehind assertions or backreferences)
- Unicode: Basic ASCII support only (can be extended for full Unicode)

Future Enhancements

Optimization:
- NFA/DFA compilation for linear-time matching
- Memoization to avoid redundant computations
Advanced Features:
- Lookahead/lookbehind assertions
- Backreferences (\1, \2, etc.)
- Atomic groups and possessive quantifiers
Tooling:
- Interactive regex debugger with step-by-step visualization
- Performance benchmarking suite

References

- Compilers course by Suresh Purini
- Java's regex engine (java.util.regex)

Author

Vikrant Mehta - vikrantmehta123@gmail.com

Note: This is an educational project demonstrating compiler construction and algorithm implementation. For production regex needs, use Python's built-in re module.
