This project implements new machine learning models for RSA semiprime factorization research, building on prior work by Murat et al. and Nene & Uludag and introducing novel architectures.
Objective: Explore ML-based approaches to RSA semiprime factorization using:
- Enhanced ECPP (Elliptic Curve Primality Proving) features
- GNFS (General Number Field Sieve) inspired characteristics
- Advanced neural architectures (Transformers, GANs, Hybrid CNN+RNN)
Research Context: This work extends existing ML factorization research with mathematical insights from classical cryptanalysis methods.
- Baseline LSTM: Reproduction of Murat et al.'s architecture for benchmarking
- Transformer Factorizer: Multi-head attention for mathematical pattern recognition
- Factorization GAN: Adversarial training for prime factor generation
- Hybrid CNN+RNN: Combined local pattern recognition and sequence modeling
- Enhanced Feature Engineering: ECPP-based elliptic curve features, GNFS smoothness indicators
- Mathematical Constraints: Built-in primality and factorization constraints
- Multi-Scale Architecture: CNN for local patterns + RNN for sequences + Attention for relationships
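As a rough illustration of this multi-scale idea (a toy sketch, not the repository's model code; every function here is hypothetical), the snippet below runs a 1-D convolution over a semiprime's bit string to pick up local patterns, then a single-head self-attention pass over the resulting sequence to mix in long-range relationships:

```python
import numpy as np

rng = np.random.default_rng(0)

def bits(n, width=16):
    """Big-endian binary representation of n as a float vector."""
    return np.array([(n >> i) & 1 for i in reversed(range(width))], dtype=float)

def conv1d(x, kernel):
    """'Valid' 1-D convolution: a local bit-pattern detector (CNN stage)."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def self_attention(X):
    """Single-head self-attention over a (seq_len, d) feature matrix."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                      # pairwise similarities
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # row-wise softmax
    return w @ X                                       # weighted mixture of positions

x = bits(77 * 89)                      # N = 6853, a toy semiprime
local = conv1d(x, rng.normal(size=3))  # CNN stage: local bit patterns
seq = np.stack([local, np.arange(len(local), dtype=float)], axis=1)
ctx = self_attention(seq)              # attention stage: global relationships
```

The real models add learned projections, multiple heads, and an RNN over the sequence; this only shows how the three stages compose.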
```bash
# Clone the repository
git clone https://github.com/AureliusNguyen/ML-RSA.git
cd ML-RSA/rsa_ml_attack

# Install dependencies
pip install -r requirements.txt
# or
pip install -e .
```

Run the test suite to verify all models work:

```bash
python test_models.py
```

First, generate the training datasets:
```bash
# Generate all dataset sizes (tiny, small, medium, large, xlarge)
python generate_data.py --dataset all

# Or generate a specific size
python generate_data.py --dataset small
```

Train individual models with specific scripts:
```bash
# Binary LSTM (Murat et al. reproduction)
python train_binary_models.py --dataset small --epochs 30 --batch-size 4

# Dual Loss LSTM (predicts both p and q)
python train_dual_loss.py --dataset small --epochs 30 --batch-size 4

# Enhanced Transformer (with 125D features)
python train_enhanced_models.py --dataset small --epochs 30 --batch-size 4

# GAN (adversarial factor generation)
python train_gan.py --dataset small --epochs 30 --batch-size 4
```

Current performance on the small dataset (N < 10,000):
| Model | β₀ (Exact Match) | β₁ (≤1 bit error) | Parameters | Status |
|---|---|---|---|---|
| Binary LSTM | 0% | ~40% | ~500K | ✅ Working |
| Dual Loss LSTM | 0% | ~35% | ~800K | ✅ Working |
| Enhanced Transformer | 0% | 39.58% | 3.3M | ✅ Best Model |
| GAN | ~2% (exact) | N/A | ~700K | ✅ Working |
Key Findings:
- Enhanced Transformer achieves 39.58% β₁ accuracy (within 1-bit error)
- 64.58% β₂ accuracy (within 2-bit error) shows strong pattern learning
- Results significantly exceed random chance (~1.2% for 7-bit factors)
- Data leakage issues have been identified and resolved
- All models trained on clean datasets with verified train/test splits
- Number-theoretic properties and patterns
- Basic smoothness and divisibility indicators
- Mathematical constraints and heuristics
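To make the smoothness and divisibility indicators concrete, here is a toy feature extractor. The feature names are hypothetical and this is only a sketch of the idea; the repository's `crypto_utils.py` computes a much richer 125-dimensional vector:

```python
SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]

def smoothness_features(n):
    """Toy divisibility/smoothness indicators for a semiprime N."""
    feats = {}
    for p in SMALL_PRIMES:
        feats[f"res_mod_{p}"] = n % p            # residue classes mod small primes
    feats["bit_length"] = n.bit_length()
    feats["hamming"] = bin(n).count("1")         # popcount of the binary form
    # Is n - 1 smooth with respect to a tiny bound? (crude GNFS-style indicator)
    m = n - 1
    for p in SMALL_PRIMES:
        while m % p == 0:
            m //= p
    feats["n_minus_1_31_smooth"] = int(m == 1)
    return feats

f = smoothness_features(77 * 89)   # N = 6853
```

Residues mod small primes are cheap to compute yet constrain the factors (e.g. N odd implies both factors are odd), which is why such features are worth feeding to a model.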
- Enhanced binary representations with contextual information
- Odd number constraints (last bit = 1)
- Basic factorization consistency checks
- Dual prediction head coordination (p and q models)
- Mathematical validity enforcement during training
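Two of these constraints can be sketched in a few lines, assuming predictions are fixed-width bit lists (both helper names are hypothetical, not the repository's API):

```python
def enforce_odd(pred_bits):
    """Force the last bit to 1: every odd prime factor ends in 1."""
    fixed = list(pred_bits)
    fixed[-1] = 1
    return fixed

def factorization_valid(n, p_bits, q_bits):
    """Check whether the predicted factors actually multiply to n."""
    p = int("".join(map(str, p_bits)), 2)
    q = int("".join(map(str, q_bits)), 2)
    return p * q == n

# Toy model outputs for N = 77 * 89 = 6853; enforce_odd corrects the last bit
p_bits = enforce_odd([1, 0, 0, 1, 1, 0, 0])  # -> 1001101 = 77
q_bits = enforce_odd([1, 0, 1, 1, 0, 0, 1])  # already odd: 1011001 = 89
```

During training these checks become soft penalties rather than hard post-processing, so gradients can flow through them.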
- Data Generation: Create semiprimes from random prime pairs
- Feature Extraction: Apply enhanced feature engineering + binary encoding
- Model Training: Train with mathematical constraints and regularization
- Evaluation: β-metrics (exact match and near-miss accuracies)
- Comparison: Comprehensive analysis across all architectures
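The data-generation step above can be sketched as follows. This is a minimal stand-in for `generate_data.py` (function names and the CSV-like row layout are assumptions), using 7-bit primes as in the small dataset:

```python
import random

def is_prime(n):
    """Trial division; adequate for the small primes used here."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def random_prime(lo, hi, rng):
    """Sample uniformly from [lo, hi) until a prime turns up."""
    while True:
        c = rng.randrange(lo, hi)
        if is_prime(c):
            return c

def make_dataset(n_samples, bits=7, seed=0):
    """Semiprimes N = p*q from random prime pairs, with binary factor labels."""
    rng = random.Random(seed)
    lo, hi = 1 << (bits - 1), 1 << bits      # e.g. 7-bit primes in [64, 128)
    rows = []
    for _ in range(n_samples):
        p = random_prime(lo, hi, rng)
        q = random_prime(lo, hi, rng)
        if p > q:
            p, q = q, p                      # canonical order avoids label ambiguity
        rows.append({"N": p * q, "p": format(p, "b"), "q": format(q, "b")})
    return rows

data = make_dataset(100)
```

Deduplicating N values across train and test splits is the step that prevents the data leakage discussed below; it is omitted here for brevity.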
- Analyze RSA implementation vulnerabilities
- Guide key size recommendations
- Develop ML-resistant cryptographic practices
- Understand deep patterns in prime distributions
- Explore connections between elliptic curves and factorization
- Advance number-theoretic machine learning
- Combine classical algorithms with ML acceleration
- Develop quantum-classical factorization strategies
- Create adaptive cryptanalysis tools
```
ML-RSA/
├── .gitignore
├── README.md
├── Pre-Research/                      # Research background
│   ├── Integer Prime Factorization with Deep Learning.pdf
│   ├── MLApproachtoIntegerSemiprimeFactorisation.pdf
│   ├── application.md
│   └── nguy5272_UROP_Spring2020 (1).pdf
├── kaggle_testing/                    # Experimental notebooks and testing
│   ├── kaggle_binary_train.py
│   └── kaggle_binary_train_fixed.py
└── rsa_ml_attack/                     # Main ML implementation
    ├── src/
    │   ├── crypto_utils.py            # ECPP/GNFS feature engineering
    │   └── models/
    │       ├── baseline_lstm.py           # Murat et al. reproduction
    │       ├── transformer_factorizer.py  # Mathematical transformer
    │       ├── factorization_gan.py       # Adversarial prime generation
    │       └── hybrid_cnn_rnn.py          # Multi-scale hybrid model
    ├── data/                          # Clean datasets (no data leakage)
    │   ├── small_train.csv, small_test.csv, small_metadata.json
    │   ├── medium_train.csv, medium_test.csv, medium_metadata.json
    │   └── tiny_train.csv, tiny_test.csv, tiny_metadata.json
    ├── experiments/                   # Training results and saved models
    │   ├── transformer_enhanced_small/    # Best model results
    │   ├── binary_training_small/
    │   ├── dual_training_small/
    │   └── gan_training_small/
    ├── scripts/archive/               # Historical scripts
    │   └── fix_data_leakage.py        # Data leakage correction (archived)
    ├── generate_data.py               # Dataset generation script
    ├── train_binary_models.py         # Binary LSTM training
    ├── train_dual_loss.py             # Dual output LSTM training
    ├── train_enhanced_models.py       # Transformer with enhanced features
    ├── train_gan.py                   # GAN training script
    ├── evaluate_models.py             # Model evaluation utilities
    ├── test_models.py                 # Model verification tests
    ├── requirements.txt               # Python dependencies
    ├── setup.py                       # Package installation
    └── README.md                      # Project-specific documentation
```
This work builds directly on:
- Murat et al. (2020): "Integer Prime Factorization with Deep Learning" - LSTM baseline
- Nene & Uludag (2022): "Machine Learning Approach to Integer Prime Factorisation" - Binary approaches
- Atkin & Morain (1993): "Elliptic Curves and Primality Proving" - ECPP mathematical foundation
- Data Leakage Detection & Resolution: Identified and fixed critical train/test contamination issues
- Enhanced Feature Engineering: 125-dimensional feature vectors using ECPP and GNFS characteristics
- Transformer Architecture: Successful application of attention mechanisms to RSA factorization
- Robust Training Pipeline: BatchNorm → LayerNorm conversion for stable small-batch training
- Comprehensive Evaluation: ฮฒ-metrics analysis showing consistent learning patterns
- Configured for distributed GPU training
- Automatic experiment logging and result storage
- Scalable to larger semiprime sizes
- Built-in primality testing during training
- Factorization validity enforcement
- Smooth convergence through constraint regularization
- β₀: Exact factor match percentage (primary metric)
- β₁–β₄: Near-miss accuracies (1-4 bit errors allowed)
- Mathematical Validity: Percentage of predictions that actually factor input semiprimes
- Training Efficiency: Convergence speed and stability
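The β-metrics above can be computed as below, assuming predictions and targets are fixed-width bit vectors (the helper name is hypothetical; `evaluate_models.py` may differ):

```python
def beta_metrics(preds, targets, max_k=4):
    """β_k: fraction of predictions within k bit errors of the true factor.

    β_0 is exact match; preds and targets are equal-length lists of bit lists.
    """
    counts = [0] * (max_k + 1)
    for pred, true in zip(preds, targets):
        errors = sum(a != b for a, b in zip(pred, true))  # Hamming distance
        for k in range(max_k + 1):
            if errors <= k:
                counts[k] += 1
    return [c / len(preds) for c in counts]

preds   = [[1, 0, 0, 1, 1, 0, 1], [1, 0, 1, 1, 0, 1, 1]]
targets = [[1, 0, 0, 1, 1, 0, 1], [1, 0, 1, 1, 0, 0, 1]]
betas = beta_metrics(preds, targets)   # β₀ = 0.5, β₁..β₄ = 1.0
```

By construction the β values are monotone nondecreasing in k, which is why β₁ and β₂ can be high while β₀ is still zero.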
- Murat, B., et al. "Integer prime factorization with deep learning." (2020)
- Nene, R., & Uludag, S. "Machine learning approach to integer prime factorisation." (2022)
- Atkin, A.O.L., & Morain, F. "Elliptic curves and primality proving." (1993)
- Rivest, R.L., et al. "A method for obtaining digital signatures and public-key cryptosystems." (1978)
🤝 Contributing: This research supports the cryptographic community's understanding of ML-based cryptanalysis to develop more robust security systems.