Moosa-Salehi/protocol_reverse_engineering

Protocol Reverse Engineering

A protocol-agnostic reverse engineering pipeline that analyzes binary protocol traffic from PCAP files and automatically infers protocol structure, message types, field boundaries, and semantic roles.

Key Features

  • Protocol-Agnostic Analysis - Works with any binary protocol without prior knowledge
  • Automatic Message Clustering - Discovers message families using advanced clustering
  • Field Boundary Detection - Infers field boundaries with enhanced anti-fragmentation
  • Semantic Labeling - Assigns semantic roles to the inferred fields
  • Request/Response Pairing - Discovers protocol interactions and relations
  • LLM-Assisted Refinement - Uses LLM-assisted stages to refine the inferred structure
  • Comprehensive Reports - Generates Markdown and interactive HTML specifications
  • Ground Truth Evaluation - Validates results against known protocol specifications
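
To give a feel for what field boundary inference involves, here is a minimal sketch of one common heuristic: compute the entropy of the byte value at each offset across many messages, then flag offsets where the entropy changes sharply. This is an illustration only and is not taken from this repository; the function names `offset_entropy` and `candidate_boundaries` and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def offset_entropy(messages: list[bytes]) -> list[float]:
    """Shannon entropy (in bits) of the byte value at each offset.

    Only offsets present in every message are considered.
    """
    width = min(len(m) for m in messages)
    total = len(messages)
    entropies = []
    for i in range(width):
        counts = Counter(m[i] for m in messages)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


def candidate_boundaries(entropies: list[float], threshold: float = 0.5) -> list[int]:
    """Offsets whose entropy differs from the previous offset by more than `threshold` bits."""
    return [
        i
        for i in range(1, len(entropies))
        if abs(entropies[i] - entropies[i - 1]) > threshold
    ]


# Toy messages: two constant bytes, one counter byte, one constant byte.
msgs = [bytes([0xAA, 0xBB, i, 0x00]) for i in range(4)]
print(candidate_boundaries(offset_entropy(msgs)))  # [2, 3]
```

On the toy input, the counter byte at offset 2 has 2 bits of entropy while its neighbours have none, so a boundary is suggested on either side of it. A real pipeline would need more than this single signal, since multi-byte fields often mix low-entropy and high-entropy bytes.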

Documentation

Requirements

  • Python 3.10+
  • TShark (Wireshark CLI)
  • Dependencies: numpy, scikit-learn, hdbscan, scapy, torch

Project Structure

protocol_re/
├── src/protocol_re/          # Core library
├── scripts/                  # Pipeline stages (01-24)
├── docs/                     # Documentation
├── data/                     # Intermediate artifacts
├── output/                   # Final reports
├── pre_trained/              # Trained neural models
├── prompts/                  # Prompts used in LLM-assisted stages
├── schema/                   # Protocol model and evaluation schemas
├── tests/                    # Test modules
├── truth-files/              # Ground-truth protocol specifications used for evaluation
└── main.py                   # Pipeline runner

Supported Protocols

The pipeline is protocol-agnostic and has been tested with:

  • Modbus TCP

Performance

Typical runtime for 200K Modbus messages: ~6 minutes

Accuracy on Modbus TCP:

  • Message type detection: 90%+ precision/recall
  • Field boundary recall: 88%+
  • Field boundary precision: 65%+
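
As a reading aid for the boundary metrics above, here is a minimal sketch of how precision and recall over field boundaries are commonly computed from sets of byte offsets. The repository's own evaluation may differ (for example, by allowing an offset tolerance or averaging per message type); the function name `boundary_precision_recall` and the sample offsets are hypothetical toy values, not results from the pipeline.

```python
def boundary_precision_recall(
    inferred: set[int], truth: set[int]
) -> tuple[float, float]:
    """Exact-match precision and recall of inferred boundary offsets."""
    true_positives = len(inferred & truth)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy example: every true boundary is found, plus one spurious split at offset 3.
truth = {2, 4, 6, 7}
inferred = {2, 3, 4, 6, 7}
print(boundary_precision_recall(inferred, truth))  # (0.8, 1.0)
```

Over-segmentation, where a true field is split into several inferred fields, leaves recall high while lowering precision, which is one way recall can exceed precision.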

License

MIT License - See LICENSE file for details.

Contact

For questions or issues, please open an issue on GitHub.

About

LLM-assisted pipeline for reverse-engineering undocumented protocols from PCAP captures. It normalizes messages, clusters payloads, infers framing, fields, relations, and semantics, and outputs HTML and Markdown reports.
