# How It Works GraphOptim treats code as a **graph problem**. This page explains the full pipeline. ## Pipeline Overview ``` Source Code → AST → Control Flow Graph (CFG) → Weighted DiGraph → Analysis → Optimization ``` ### Step 1: Parse Python source code is parsed into an Abstract Syntax Tree (AST) using Python's built-in `ast` module. ### Step 2: Build CFG The AST is transformed into a **Control Flow Graph** — a directed graph where: - **Nodes** = Basic blocks (sequences of statements with no branches) - **Edges** = Control flow transitions (if/else branches, loop entries/exits, etc.) - **Edge weights** = Estimated computational cost of the transition ``` ┌─────────────┐ │ Entry Block │ │ x = input() │ └──────┬──────┘ │ ┌──────▼──────┐ │ if x > 0 │──── False ────┐ └──────┬──────┘ │ │ True │ ┌──────▼──────┐ ┌──────▼──────┐ │ return x │ │ return -x │ └──────┬──────┘ └──────┬──────┘ │ │ ┌──────▼──────────────────────▼──────┐ │ Exit Block │ └────────────────────────────────────┘ ``` ### Step 3: Extract Metrics Seven graph-theoretic metrics are extracted from the CFG: | # | Metric | Formula | What It Detects | |---|--------|---------|-----------------| | 1 | Cyclomatic Complexity | `E - N + 2P` | Overall branching density | | 2 | Dead Nodes | `in_degree=0 and not entry` | Unreachable code blocks | | 3 | CFG Diameter | Longest shortest path | Deep execution chains | | 4 | Avg Shortest Path | Mean over all pairs | Execution efficiency | | 5 | Max Betweenness Centrality | Node bottleneck score | Over-centralized functions | | 6 | Node/Edge Ratio | `E/N` | Branching density | | 7 | Lines of Code | Non-blank lines | Verbosity | ### Step 4: Detect Patterns Four pattern detectors run on the CFG: 1. **Dead Nodes** — Unreachable code after return/raise/break/continue 2. **Redundant Paths** — Duplicate conditional branches (structurally equal via AST hashing) 3. **Bottleneck Nodes** — High betweenness centrality (>0.7) indicating over-centralized code 4. **Deep Chains** — Execution paths deeper than threshold (default: 8) ### Step 5: Select Passes (Knapsack) Available optimization passes are evaluated. Each has: - `cost` — Risk of introducing bugs (0.0-1.0) - `benefit` — Expected improvement (0.0-1.0) A **0/1 Knapsack** dynamic programming solver selects the optimal subset within the user's risk budget. ### Step 6: Optimize Selected passes transform the AST: 1. Apply each pass sequentially 2. Validate output with `ast.parse()` after each pass 3. Skip any pass that produces invalid code 4. Post-process for readability ## Data Flow Graph (DFG) In addition to the CFG, GraphOptim builds a **Data Flow Graph** tracking variable definitions and usages. This enables: - Detection of unused variables - Understanding data dependencies between operations - Future dead data flow elimination ## Architecture ``` graphoptim/ ├── parser/ ← Source → Graph │ ├── cfg_builder ← Control Flow Graph │ ├── dfg_builder ← Data Flow Graph │ └── ast_utils ← AST helpers ├── analyzer/ ← Graph → Insights │ ├── metrics ← 7 graph metrics │ ├── patterns ← Pattern detectors │ └── reporter ← Scored reports ├── optimizer/ ← Insights → Better Code │ ├── passes/ ← Individual optimizations │ │ ├── dead_code │ │ ├── path_shortener │ │ ├── centrality │ │ └── knapsack ← Pass selection │ └── rewriter ← Orchestration ├── benchmark/ ← Empirical validation ├── cli.py ← Command-line interface └── config.py ← Configuration ```