A high-performance, cross-platform Sudoku solver featuring multiple algorithms, OCR image recognition, comprehensive benchmarking tools, and multi-platform CI/CD support.
I originally built a Sudoku solver as a small father-son coding challenge. It later evolved into a useful benchmark tool to explore algorithmic efficiency and cross-platform CPU performance.
I wrote two articles documenting this journey:
- Sudoku Solver — From a Father-Son Coding Challenge to a CPU Benchmark Tool
- Sudoku Solver — Cross-Platform Performance Deep Dive (Part II)
Multiple Algorithms
- Dancing Links (DLX) - Donald Knuth's Algorithm X implementation
- Backtracking with Constraint Propagation
- Automatic algorithm comparison
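As a rough, self-contained illustration of the second approach, a plain recursive backtracking solver for 9x9 grids can be sketched as follows (the project's solver additionally layers constraint propagation on top of this):

```cpp
#include <array>

using Grid = std::array<std::array<int, 9>, 9>;  // 0 = empty cell

// Check whether value v may be placed at (r, c) without violating
// the row, column, or 3x3 box constraint.
bool valid(const Grid& g, int r, int c, int v) {
    for (int i = 0; i < 9; ++i)
        if (g[r][i] == v || g[i][c] == v) return false;
    int br = r / 3 * 3, bc = c / 3 * 3;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            if (g[br + i][bc + j] == v) return false;
    return true;
}

// Fill the first empty cell, recurse, and undo the choice on failure.
bool solve(Grid& g) {
    for (int r = 0; r < 9; ++r)
        for (int c = 0; c < 9; ++c)
            if (g[r][c] == 0) {
                for (int v = 1; v <= 9; ++v)
                    if (valid(g, r, c, v)) {
                        g[r][c] = v;
                        if (solve(g)) return true;
                        g[r][c] = 0;  // backtrack
                    }
                return false;  // dead end: no value fits this cell
            }
    return true;  // no empty cells left: solved
}
```

The constraint-propagation variant additionally maintains a candidate set per cell and prunes it after each placement, which shrinks the search tree dramatically on 16x16 and 25x25 grids.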
Find All Solutions
- `--solve-all` option to find multiple solutions
- Configurable limit with `--max-solutions`
- Save all solutions to JSON
OCR Support (Windows/Linux/macOS)
- Extract puzzles from images (PNG, JPG, BMP, TIFF)
- Supports 9x9 and 16x16 grids
- Automatic grid detection and perspective correction
Built-in Test Puzzles
- 9x9 classic puzzle for quick tests
- 16x16 hard puzzle (77 clues) for standard benchmarks
- 25x25 mega puzzle for stress testing
Comprehensive Benchmarking
- Single-threaded and multi-threaded benchmarks
- Cross-platform system information detection
- Performance metrics: throughput, speedup, efficiency
- Professional benchmark reports
Cross-Platform Support
- Windows (x64)
- Linux (x64)
- macOS (Universal Binary - Intel & Apple Silicon)
- Android (ARM64) - Solver & Benchmark only
Requirements:
- CMake 3.20+
- C++20 compatible compiler (MSVC 2022, GCC 11+, Clang 14+)
- vcpkg package manager
# Clone repository
git clone https://github.com/user/sudoku-solver.git
cd sudoku-solver
# Windows
cmake --preset windows-x64-Release
cmake --build --preset windows-x64-Release
# Linux
cmake --preset linux-x64-Release
cmake --build --preset linux-x64-Release
# macOS (Universal Binary)
cmake --preset macos-universal-Release
cmake --build --preset macos-universal-Release

# Solve a puzzle from JSON
SudokuSolver -i puzzle.json
# Solve from image (OCR) - Windows/Linux/macOS only
SudokuSolver -i sudoku.png
# Use built-in test puzzles
SudokuSolver -t 9 # 9x9 puzzle
SudokuSolver -t 16 # 16x16 puzzle
SudokuSolver -t 25 # 25x25 puzzle (heavy)
# Compare algorithms
SudokuSolver -i puzzle.json -a compare
# Find all solutions (with default limit of 100)
SudokuSolver -i puzzle.json --solve-all
# Find up to 10 solutions
SudokuSolver -i puzzle.json --solve-all --max-solutions 10
# Find all solutions and save to JSON
SudokuSolver -i puzzle.json --solve-all -o solutions.json
# Run benchmark (1000 iterations, auto-detect threads)
SudokuSolver -t 9 -b 1000 -w 0
# Single-threaded benchmark
SudokuSolver -t 25 -b 100
# Multi-threaded benchmark with built-in puzzle
SudokuSolver -t 16 -b 1000 -w 0 -a compare
# Show system info
SudokuSolver -s

| Option | Description |
|---|---|
| `-i, --input <file>` | Input file (JSON or image) |
| `-t, --test <size>` | Use built-in test puzzle (9, 16, or 25) |
| `-a, --algorithm <algo>` | Algorithm: dlx, backtrack, compare |
| `-b, --benchmark <N>` | Run benchmark with N iterations |
| `-w, --workers <N>` | Number of parallel workers (0 = auto) |
| `-o, --output <file>` | Output solution to JSON file |
| `--solve-all` | Find all solutions (default limit: 100) |
| `--max-solutions <N>` | Maximum solutions to find (0 = unlimited) |
| `-s, --sysinfo` | Show system information |
| `-q, --quiet` | Minimal output |
                         Throughput (multi-threaded)
Efficiency (E) = ─────────────────────────────────────────
                  Single-threaded Throughput × Workers
Where:
- Throughput = Total solves per second (multi-threaded)
- Speedup = Multi-threaded throughput / Single-threaded throughput
- Efficiency = Speedup / Number of workers × 100%
Efficiency Interpretation:
| Efficiency | Rating | Description |
|---|---|---|
| 90-100% | ⭐⭐⭐⭐⭐ | Near-perfect scaling |
| 70-90% | ⭐⭐⭐⭐ | Excellent |
| 50-70% | ⭐⭐⭐ | Good (typical for CPU-bound tasks) |
| 30-50% | ⭐⭐ | Moderate (synchronization overhead) |
| <30% | ⭐ | Poor (bottlenecked) |
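A minimal version of the multi-threaded throughput measurement could look like this (an illustrative sketch under simple assumptions, not the project's actual harness):

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Run `work` a total of `iters` times, spread across `workers` threads,
// and return the observed rate in operations per second.
template <typename Fn>
double measure_throughput(Fn work, int workers, int iters) {
    std::atomic<int> remaining{iters};
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            // Claim one iteration at a time until the shared counter runs out.
            while (remaining.fetch_sub(1) > 0) work();
        });
    for (auto& t : pool) t.join();

    std::chrono::duration<double> sec =
        std::chrono::steady_clock::now() - start;
    return iters / sec.count();
}
```

The shared atomic counter acts as a simple work queue, so faster threads naturally take more iterations; this is what makes the efficiency figures sensitive to synchronization overhead at high worker counts.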
┌─────────────────────────────────────────────────────────────────────────────┐
│ GitHub Actions Runner Performance │
├────────────┬──────────────────┬─────────┬───────────┬───────────┬───────────┤
│ Platform │ CPU │ Cores │ BT/s │ DLX/s │ Winner │
├────────────┼──────────────────┼─────────┼───────────┼───────────┼───────────┤
│ Windows │ AMD EPYC 7763 │ 2C/4T │ 44.86/s │ 58.57/s │ DLX +31% │
│ Linux │ AMD EPYC 7763 │ 2C/4T │ 38.94/s │ 67.49/s │ DLX +73% │
│ macOS │ Apple M1 (VM) │ 3C/3T │ 35.56/s │ 94.92/s │ DLX+167% │ ⭐
└────────────┴──────────────────┴─────────┴───────────┴───────────┴───────────┘
BT = Backtracking, DLX = Dancing Links
┌─────────────────────────────────────────────────────────────────────────────┐
│ Single-Threaded Performance Comparison │
├────────────┬──────────────────┬──────────┬────────────┬─────────────────────┤
│ Platform │ CPU │ Avg(ms) │ Throughput │ Notes │
├────────────┼──────────────────┼──────────┼────────────┼─────────────────────┤
│ WSL2 │ AMD 9950X3D │ 17.41 │ 57.43/s │ Fastest │ ⭐
│ Windows │ AMD 9950X3D │ 19.27 │ 51.90/s │ MSVC overhead │
│ Android │ Snapdragon 8G3 │ 25.46 │ 39.27/s │ Mobile SoC! │
│ macOS VM │ Apple M1 (VM) │ 31.86 │ 31.39/s │ VM overhead ~30% │
│ Linux CI │ AMD EPYC 7763 │ 35.98 │ 27.79/s │ CI resource limit │
│ Win CI │ AMD EPYC 7763 │ 41.51 │ 24.09/s │ CI resource limit │
└────────────┴──────────────────┴──────────┴────────────┴─────────────────────┘
Key Observations:
- WSL2 (GCC) outperforms native Windows (MSVC) by ~10% on DLX algorithm
- Snapdragon 8 Gen 3 achieves 68% of desktop performance at 1/34 power consumption
- macOS Runner runs in VM, real M1 estimated ~45/s (based on min latency 22.24ms)
Per-Core Efficiency Comparison (DLX, 25x25 Puzzle)
══════════════════════════════════════════════════════════════════════════════
Platform │ Cores │ Total/s │ Per-Core/s │ Notes
──────────────────────┼───────┼──────────┼────────────┼───────────────────────
Ryzen 9950X3D (WSL2) │ 32T │ 903.59 │ 28.2 │ High core count
Ryzen 9950X3D (Win) │ 32T │ 885.59 │ 27.7 │ MSVC overhead
Apple M1 (VM) │ 3T │ 94.92 │ 31.6 │ Highest IPC!
EPYC 7763 (Linux) │ 4T │ 67.49 │ 16.9 │ CI resource limit
EPYC 7763 (Windows) │ 4T │ 58.57 │ 14.6 │ CI resource limit
Snapdragon 8 Gen 3 │ 8T │ 105.31 │ 13.2 │ big.LITTLE penalty
══════════════════════════════════════════════════════════════════════════════
Key Insight: Apple M1 has the HIGHEST per-core throughput (31.6/s)
Ryzen 9950X3D wins total throughput purely via core count (32T)
Throughput Comparison (GitHub Runners)
═════════════════════════════════════════════════════════════════════════════
DLX Algorithm:
macOS (M1) │████████████████████████████████████████████████████│ 94.92/s ⭐
Linux │███████████████████████████████████ │ 67.49/s
Windows │██████████████████████████████ │ 58.57/s
Backtracking Algorithm:
Windows │███████████████████████ │ 44.86/s
Linux │████████████████████ │ 38.94/s
macOS (M1) │██████████████████ │ 35.56/s
└────────────────────────────────────────────────────┘
0 20 40 60 80 100
Test Configuration: ./SudokuSolver -b 100 -w 0 -a compare -t 25
┌───────────────────────────────────────────────────────────────────────────┐
│ AMD Ryzen 9 9950X3D (16C/32T) - 192GB DDR5 │
├─────────────┬─────────────┬───────────────┬──────────┬────────────────────┤
│ Platform │ Compiler │ Throughput │ Speedup │ Efficiency │
├─────────────┼─────────────┼───────────────┼──────────┼────────────────────┤
│ Windows │ MSVC 1944 │ │ │ │
│ ├─ BT │ │ 805.40/s │ 15.02x │ 46.93% │
│ └─ DLX │ │ 885.59/s │ 17.68x │ 55.25% │
├─────────────┼─────────────┼───────────────┼──────────┼────────────────────┤
│ WSL2 │ GCC 11.4 │ │ │ │
│ ├─ BT │ │ 561.83/s │ 15.87x │ 49.60% │
│ └─ DLX │ │ 903.59/s │ 15.97x │ 49.92% │ ⭐
└─────────────┴─────────────┴───────────────┴──────────┴────────────────────┘
BT = Backtracking, DLX = Dancing Links
Key Observation: WSL2 DLX (903.59/s) outperforms native Windows (885.59/s) by ~2%!
┌─────────────────────────────────────────────────────────────────┐
│ WSL2 vs Windows Analysis │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Task Manager Color Legend: │
│ ┌────────┐ │
│ │ Green │ = Kernel Mode (WSL2 workload) │
│ │ Red │ = User Mode (Windows native) │
│ └────────┘ │
│ │
│ Why WSL2 shows as Kernel Mode? │
│ ├─ WSL2 runs a real Linux kernel inside Hyper-V │
│ ├─ From Windows perspective, VM operations = kernel mode │
│ └─ The Linux userspace code runs inside the VM │
│ │
│ Performance Implications: │
│ ┌─────────────────┬─────────────────┐ │
│ │ Algorithm │ WSL2 vs Windows │ │
│ ├─────────────────┼─────────────────┤ │
│ │ Backtracking │ -30% (slower) │ MSVC recursion wins │
│ │ DLX │ +2% (faster) │ GCC pointers win │
│ └─────────────────┴─────────────────┘ │
│ │
│ Conclusion: │
│ - WSL2 virtualization overhead is negligible for CPU-bound │
│ - Compiler optimization matters more than OS overhead │
│ - DLX benefits from GCC's superior pointer optimization │
│ │
└─────────────────────────────────────────────────────────────────┘
./SudokuSolver -b 100 -w 0 -a compare -t 25
┌───────────────────────────────────────────────────────────────────────────┐
│ Snapdragon 8 Gen 3 (Cortex-X4 + A720 + A520) - 12GB LPDDR5 │
├─────────────┬─────────────┬───────────────┬──────────┬────────────────────┤
│ Platform │ Compiler │ Throughput │ Speedup │ Efficiency │
├─────────────┼─────────────┼───────────────┼──────────┼────────────────────┤
│ Android │ Clang 21.0 │ │ │ │
│ ├─ BT │ │ 95.83/s │ 3.74x │ 46.77% │
│ └─ DLX │ │ 105.31/s │ 2.74x │ 34.23% │
└─────────────┴─────────────┴───────────────┴──────────┴────────────────────┘
Backtracking vs DLX by Compiler
═════════════════════════════════════════════════════════════════════════
MSVC (Windows)
├─ Backtracking │██████████████████████████████████████████│ 805/s
└─ DLX │██████████████████████████████████████████████│ 886/s
Δ = +10% (DLX wins)
GCC (Linux/WSL2)
├─ Backtracking │██████████████████████████████│ 562/s
└─ DLX │██████████████████████████████████████████████████│ 904/s
Δ = +61% (DLX dominates)
═════════════════════════════════════════════════════════════════════════
Key Findings:
| Compiler | Backtracking | DLX | Winner | Analysis |
|---|---|---|---|---|
| MSVC | 805/s | 886/s | DLX +10% | MSVC optimizes recursion well |
| GCC | 562/s | 904/s | DLX +61% | GCC excels at pointer/linked-list ops |
| Clang | 96/s | 105/s | DLX +10% | Similar to MSVC pattern |
┌─────────────────────────────────────────────────────────────────┐
│ Algorithm Characteristics │
├─────────────────────────────┬───────────────────────────────────┤
│ Backtracking │ DLX │
├─────────────────────────────┼───────────────────────────────────┤
│ - Deep recursion │ - Linked-list traversal │
│ - Stack-heavy operations │ - Pointer manipulation │
│ - Predictable branches │ - Cache-unfriendly access │
│ - Cache-friendly (array) │ - Complex data structures │
├─────────────────────────────┼───────────────────────────────────┤
│ MSVC: Excellent tail-call │ GCC: Superior pointer aliasing │
│ optimization & stack mgmt │ analysis & link-time optimization │
└─────────────────────────────┴───────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Apple Silicon Architecture Advantages │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ⚠️ Note: macOS GitHub Runner runs M1 in VM (~30% overhead) │
│ Real M1 estimated: ~45/s single-thread, ~114/s multi-thread │
│ │
│ 1. Unified Memory Architecture (UMA) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ P-Core │ │ E-Core │ │ GPU │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └──────────────┼──────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Unified RAM │ << Zero-copy, no NUMA penalty │
│ └─────────────┘ │
│ │
│ 2. Low Core Count = Simple Scheduling │
│ - 3 cores = minimal context switch overhead │
│ - No cross-CCD latency (unlike AMD Ryzen) │
│ - macOS scheduler highly optimized for Apple Silicon │
│ │
│ 3. High IPC (Instructions Per Clock) │
│ - M1 IPC rivals desktop CPUs despite lower clock │
│ - Wide execution units (8-wide decode) │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────┐
│ Unified Memory Architecture │
├────────────────┬────────────────┤
│ Advantages │ Disadvantages │
├────────────────┼────────────────┤
│ + No NUMA │ - Shared │
│ penalties │ bandwidth │
│ │ │
│ + Zero-copy │ - CPU & GPU │
│ GPU <-> CPU │ compete │
│ │ │
│ + Simplified │ - Limited by │
│ programming │ single pool │
│ │ │
│ + Lower power │ - Can't add │
│ consumption │ more RAM │
└────────────────┴────────────────┘
Test Configuration: AMD Ryzen 9 9950X3D (32T)
Puzzle Complexity Comparison
════════════════════════════════════════════════════════════════
Size │ Cells │ Empty │ Search Space │ Recommended Runs │
────────┼───────┼───────┼──────────────┼──────────────────┤
9x9 │ 81 │ ~51 │ ~10^20 │ 5000-10000 │
16x16 │ 256 │ ~220 │ ~10^80 │ 1000-5000 │
25x25 │ 625 │ ~500 │ ~10^200+ │ 50-200 │
════════════════════════════════════════════════════════════════
| Algorithm | System | Throughput | Speedup | Efficiency |
|---|---|---|---|---|
| DLX | 9950X3D (32T) | 23,075 /s | 19.87x | 62.10% |
| DLX | 5800X (16T) | 7,558 /s | 9.16x | 57.27% |
| Backtracking | 9950X3D (32T) | 9,393 /s | 15.83x | 49.48% |
| Backtracking | 5800X (16T) | 2,731 /s | 7.03x | 43.96% |
25x25 Mega Puzzle - Extreme Benchmark
────────────────────────────────────────────────────
Algorithm │ Throughput │ Speedup │ Efficiency
────────────────┼────────────┼─────────┼───────────
Backtracking │ 794/s │ 14.95x │ 46.71%
DLX │ 877/s │ 17.26x │ 53.95% ⭐
────────────────────────────────────────────────────
Note: DLX is 10% faster on 25x25 puzzles
┌─────────────────────────────────────────────────────────────────┐
│ When to Use Each Algorithm │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Use DLX when: │
│ ├─ Puzzle size ≥ 16x16 │
│ ├─ Multi-threaded execution │
│ ├─ Finding ALL solutions │
│ ├─ Running on Linux/GCC │
│ └─ Consistency across platforms is important │
│ │
│ Use Backtracking when: │
│ ├─ Puzzle size = 9x9 │
│ ├─ Single-threaded execution │
│ ├─ Only need ONE solution │
│ ├─ Running on Windows/MSVC │
│ └─ Memory efficiency is critical │
│ │
└─────────────────────────────────────────────────────────────────┘
| Platform | Solver | Benchmark | OCR | System Detection |
|---|---|---|---|---|
| Windows x64 | ✅ | ✅ | ✅ | ✅ Full |
| Linux x64 | ✅ | ✅ | ✅ | ✅ Full |
| macOS (Intel) | ✅ | ✅ | ✅ | ✅ Full |
| macOS (Apple Silicon) | ✅ | ✅ | ✅ | ✅ Full (ref clock) |
| Android ARM64 | ✅ | ✅ | ❌ | ✅ Full (SoC detection) |
- Windows: Best Backtracking performance due to MSVC optimizations
- Linux: Best DLX performance due to GCC pointer optimizations
- macOS: Excellent efficiency due to UMA and optimized scheduler
- Android: Full SoC detection (Snapdragon, Dimensity, Exynos, Tensor)
Array format:

{
  "puzzle": [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    ...
  ]
}

String format:

{
  "puzzle": "530070000600195000098000060800060003400803001700020006060000280000419005000080079"
}

Extended format with metadata:

{
  "name": "Easy 9x9 Puzzle",
  "difficulty": "easy",
  "size": 9,
  "box_rows": 3,
  "box_cols": 3,
  "grid": [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    ...
  ]
}

When metadata is present, it will be displayed:
Input Puzzle:
...
Name: Easy 9x9 Puzzle
Difficulty: easy
Size: 9x9
Empty cells: 51
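Loading the compact string format reduces to an index mapping (row = i / 9, col = i % 9). A hypothetical stdlib-only helper, not the project's actual loader:

```cpp
#include <array>
#include <stdexcept>
#include <string>

using Grid = std::array<std::array<int, 9>, 9>;

// Expand the 81-character puzzle string ('0' means empty) into a 9x9 grid.
Grid parse_puzzle_string(const std::string& s) {
    if (s.size() != 81)
        throw std::invalid_argument("expected exactly 81 characters");
    Grid g{};
    for (int i = 0; i < 81; ++i) {
        if (s[i] < '0' || s[i] > '9')
            throw std::invalid_argument("cells must be digits 0-9");
        g[i / 9][i % 9] = s[i] - '0';
    }
    return g;
}
```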
sudoku-solver/
├── include/
│ ├── board.hpp # Board representation
│ ├── solver.hpp # Solver interface
│ ├── solver_dlx.hpp # DLX implementation
│ ├── solver_backtrack.hpp # Backtracking implementation
│ ├── ocr_processor.hpp # OCR processing (conditional)
│ ├── benchmark.hpp # Benchmarking utilities
│ ├── system_info.hpp # Cross-platform system detection
│ ├── json_handler.hpp # JSON I/O
│ └── types.hpp # Common types
├── src/
│ ├── main.cpp # Entry point + built-in puzzles
│ ├── board.cpp
│ ├── solver_dlx.cpp
│ ├── solver_backtrack.cpp
│ ├── ocr_processor.cpp # Conditionally compiled
│ ├── benchmark.cpp
│ ├── system_info.cpp # Platform-specific detection
│ └── json_handler.cpp
├── .github/
│ └── workflows/ # CI/CD pipelines
├── CMakeLists.txt
├── CMakePresets.json # Cross-platform build presets
├── CMakeSettings.json # Visual Studio settings
├── vcpkg.json # Dependencies
└── README.md
Managed via vcpkg:
| Package | Purpose | Platforms |
|---|---|---|
| OpenCV | Image processing | All |
| Tesseract | OCR engine | Windows/Linux/macOS |
| nlohmann/json | JSON parsing | All |
| CLI11 | Command line parsing | All |
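The corresponding `vcpkg.json` manifest would look roughly like this (the port names are the standard vcpkg ones; the actual manifest in the repo may pin versions or enable specific features):

```json
{
  "name": "sudoku-solver",
  "dependencies": [
    "opencv",
    "tesseract",
    "nlohmann-json",
    "cli11"
  ]
}
```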
# List available presets
cmake --list-presets
# Configure and build
cmake --preset <preset-name>
cmake --build --preset <preset-name>

Manual configuration (without presets):

cmake -B build -S . \
  -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

| Option | Default | Description |
|---|---|---|
| `ENABLE_OCR` | ON | Enable Tesseract OCR support |
| `ENABLE_OPENMP` | OFF | Enable OpenMP parallelization |
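Both options can be overridden at configure time. For example, to build without the OCR stack (the preset name here is illustrative):

```shell
# Disabling OCR drops the OpenCV/Tesseract dependencies from the build
cmake --preset linux-x64-Release -DENABLE_OCR=OFF
```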
MIT License
- AllenK
- Kwyshell
- Donald Knuth for the Dancing Links algorithm
- The Tesseract OCR project
- OpenCV community
- vcpkg maintainers

