A protocol-agnostic reverse engineering pipeline that analyzes binary protocol traffic from PCAP files and automatically infers protocol structure, message types, field boundaries, and semantic roles.
- Protocol-Agnostic Analysis - Works with any binary protocol without prior knowledge
- Automatic Message Clustering - Groups messages into families using unsupervised clustering
- Field Boundary Detection - Infers field boundaries while suppressing over-fragmentation
- Semantic Labeling - Assigns semantic roles to the inferred fields
- Request/Response Pairing - Discovers protocol interactions and relations
- LLM-Assisted Refinement - Uses LLM-assisted stages to refine the inferred analysis
- Comprehensive Reports - Generates Markdown and interactive HTML specifications
- Ground Truth Evaluation - Validates results against known protocol specifications
- Getting Started - Installation, first analysis, and basic usage
- Architecture - System design and technical details
- Documentation Guide - How to build and contribute to docs
- Python 3.10+
- TShark (Wireshark CLI)
- Dependencies: numpy, scikit-learn, hdbscan, scapy, torch
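The requirements above can be checked before a first run. The snippet below is a minimal sketch, not part of the repository: `check_environment` and `REQUIRED_MODULES` are hypothetical names, and the only assumptions are that `tshark` should be discoverable on `PATH` and that scikit-learn is imported as `sklearn`.

```python
import importlib.util
import shutil
import sys

# Import names for the listed dependencies (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["numpy", "sklearn", "hdbscan", "scapy", "torch"]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 10)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Interpreter version check (Python 3.10+ by default).
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, found {found}")
    # TShark is an external binary, so look it up on PATH.
    if shutil.which("tshark") is None:
        problems.append("tshark not found on PATH")
    # Check that each dependency is importable without actually importing it.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing Python module: {name}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("PROBLEM:", issue)
    print("Environment OK" if not issues else f"{len(issues)} problem(s) found")
```

Using `importlib.util.find_spec` avoids paying the import cost of heavy packages such as torch just to confirm they are installed.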
protocol_re/
├── src/protocol_re/ # Core library
├── scripts/ # Pipeline stages (01-24)
├── docs/ # Documentation
├── data/ # Intermediate artifacts
├── output/ # Final reports
├── pre_trained/ # Trained neural models
├── prompts/ # Prompts used in LLM-assisted stages
├── schema/ # Protocol model, evaluation schema
├── tests/ # Test modules
├── truth-files/ # Ground-truth protocol specifications, used for evaluation
└── main.py # Pipeline runner
The pipeline is protocol-agnostic and has been tested with:
- Modbus TCP
Typical runtime for 200K Modbus messages: ~6 minutes
Accuracy on Modbus TCP:
- Message type detection: 90%+ precision/recall
- Field boundary recall: 88%+
- Field boundary precision: 65%+
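To make the boundary metrics concrete, here is a minimal sketch of how precision and recall can be computed for one message. It assumes exact-offset matching between inferred and true boundaries; the project's actual evaluation may score boundaries differently. `boundary_scores` is a hypothetical helper, and the inferred offsets are made up for illustration. The true offsets follow the Modbus TCP layout of a Read Holding Registers request.

```python
def boundary_scores(inferred, truth):
    """Precision and recall of inferred boundary offsets against ground truth."""
    inferred, truth = set(inferred), set(truth)
    hits = len(inferred & truth)  # boundaries placed at exactly the right offset
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Internal field boundaries of a 12-byte Modbus TCP Read Holding Registers request:
# transaction ID (2) | protocol ID (2) | length (2) | unit ID (1) |
# function code (1) | start address (2) | quantity (2)
truth = [2, 4, 6, 7, 8, 10]

# Hypothetical pipeline output: misses offset 7 and adds spurious splits at 1, 9, 11.
inferred = [1, 2, 4, 6, 8, 9, 10, 11]

precision, recall = boundary_scores(inferred, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this example five of the eight inferred boundaries are correct (precision 0.625) and five of the six true boundaries are found (recall about 0.833). Recall higher than precision, as in the figures above, indicates that the inference tends to split fields too finely rather than miss boundaries.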
MIT License - See LICENSE file for details.
For questions or issues, please open an issue on GitHub.