ML-based Attack on Digitally Authenticated RSA Algorithm using Model Estimation

This project implements machine learning models for RSA semiprime factorization research, building on prior work by Murat et al. and Nene & Uludag and adding novel architectures and feature engineering.

🎯 Project Overview

Objective: Explore ML-based approaches to RSA semiprime factorization using:

  • Enhanced ECPP (Elliptic Curve Primality Proving) features
  • GNFS (General Number Field Sieve) inspired characteristics
  • Advanced neural architectures (Transformers, GANs, Hybrid CNN+RNN)

Research Context: This work extends existing ML factorization research with mathematical insights from classical cryptanalysis methods.

๐Ÿ—๏ธ Architecture

Models Implemented

  1. Baseline LSTM: Reproduction of Murat et al.'s architecture for benchmarking
  2. Transformer Factorizer: Multi-head attention for mathematical pattern recognition
  3. Factorization GAN: Adversarial training for prime factor generation
  4. Hybrid CNN+RNN: Combined local pattern recognition and sequence modeling

Key Innovations

  • Enhanced Feature Engineering: ECPP-based elliptic curve features, GNFS smoothness indicators
  • Mathematical Constraints: Built-in primality and factorization constraints
  • Multi-Scale Architecture: CNN for local patterns + RNN for sequences + Attention for relationships

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/AureliusNguyen/ML-RSA.git
cd ML-RSA/rsa_ml_attack

# Install dependencies
pip install -r requirements.txt
# or
pip install -e .

Testing

Run the test suite to verify all models work:

python test_models.py

Data Generation

First, generate the training datasets:

# Generate all dataset sizes (tiny, small, medium, large, xlarge)
python generate_data.py --dataset all

# Or generate specific size
python generate_data.py --dataset small
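The generation step pairs random primes into semiprimes. A minimal sketch of the idea (illustrative only; the repo's actual `generate_data.py` may differ in format and options) that also records each N once, so a later train/test split cannot contain the same semiprime on both sides:

```python
# Illustrative semiprime generation (not the repo's generate_data.py):
# multiply pairs of random odd primes into N = p * q and deduplicate by N.
import random

def is_prime(n: int) -> bool:
    """Trial division -- adequate for the small (7-bit) primes used here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def make_semiprimes(max_n: int, count: int, seed: int = 0):
    rng = random.Random(seed)
    primes = [p for p in range(3, 128) if is_prime(p)]  # odd 7-bit primes
    seen, rows = set(), []
    while len(rows) < count:
        p, q = sorted(rng.sample(primes, 2))
        n = p * q
        if n < max_n and n not in seen:  # dedupe: no leakage across splits
            seen.add(n)
            rows.append((n, p, q))
    return rows

rows = make_semiprimes(10_000, 50)  # e.g. the "small" regime, N < 10,000
```

Splitting over distinct N values is the simple way to avoid the train/test contamination this project reports identifying and fixing.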

Training

Train individual models with specific scripts:

# Binary LSTM (Murat et al. reproduction)
python train_binary_models.py --dataset small --epochs 30 --batch-size 4

# Dual Loss LSTM (predicts both p and q)  
python train_dual_loss.py --dataset small --epochs 30 --batch-size 4

# Enhanced Transformer (with 125D features)
python train_enhanced_models.py --dataset small --epochs 30 --batch-size 4

# GAN (adversarial factor generation)
python train_gan.py --dataset small --epochs 30 --batch-size 4

📊 Actual Results

Current performance on small dataset (N < 10,000):

| Model | β₀ (Exact Match) | β₁ (≤1 bit error) | Parameters | Status |
| --- | --- | --- | --- | --- |
| Binary LSTM | 0% | ~40% | ~500K | ✅ Working |
| Dual Loss LSTM | 0% | ~35% | ~800K | ✅ Working |
| Enhanced Transformer | 0% | 39.58% | 3.3M | ✅ Best Model |
| GAN | ~2% (exact) | N/A | ~700K | ✅ Working |

Key Findings:

  • Enhanced Transformer achieves 39.58% β₁ accuracy (within 1-bit error)
  • 64.58% β₂ accuracy (within 2-bit error) shows strong pattern learning
  • Results significantly exceed random chance (~1.2% for 7-bit factors)
  • Data leakage issues have been identified and resolved

All models were trained on clean datasets with verified train/test splits.

🧮 Mathematical Foundation

Enhanced Features (125-dimensional)

  • Number-theoretic properties and patterns
  • Basic smoothness and divisibility indicators
  • Mathematical constraints and heuristics
  • Enhanced binary representations with contextual information
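As a hypothetical illustration of what such a feature vector might contain (the repo's actual 125-dimensional vector in `crypto_utils.py` is far richer), the binary digits of N can be combined with residues modulo small primes as crude divisibility and smoothness indicators:

```python
# Hypothetical feature sketch, NOT the repo's 125-D crypto_utils.py vector:
# fixed-width binary encoding of N plus residues modulo small primes.
def simple_features(n: int, bit_len: int = 16):
    # Binary digits, most-significant bit first.
    bits = [(n >> i) & 1 for i in range(bit_len - 1, -1, -1)]
    # Residues mod small primes: n % p == 0 flags an easy divisor.
    small_primes = [3, 5, 7, 11, 13]
    residues = [n % p for p in small_primes]
    return bits + residues
```

Residues of zero immediately reveal a small factor, while nonzero patterns give the model number-theoretic context beyond the raw bits.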

Constraints Enforced

  • Odd number constraints (last bit = 1)
  • Basic factorization consistency checks
  • Dual prediction head coordination (p and q models)
  • Mathematical validity enforcement during training
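Two of these constraints can be sketched at prediction time as follows (helper names are hypothetical, not from the repo): hard-wiring the lowest bit of a predicted factor to 1, since any factor of an odd semiprime is odd, and checking that predicted p and q actually multiply back to N:

```python
# Hypothetical post-processing helpers illustrating the constraints above.
def enforce_odd(bits):
    """Force the last (lowest-order) bit to 1; bits are MSB-first 0/1."""
    out = list(bits)
    out[-1] = 1
    return out

def is_valid_factorization(n, p_bits, q_bits):
    """Consistency check: do the predicted bit-strings multiply to n?"""
    p = int("".join(map(str, p_bits)), 2)
    q = int("".join(map(str, q_bits)), 2)
    return p * q == n
```

During training, violations of such checks can be penalized as an extra loss term rather than hard-enforced.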

📈 Training Process

  1. Data Generation: Create semiprimes from random prime pairs
  2. Feature Extraction: Apply enhanced feature engineering + binary encoding
  3. Model Training: Train with mathematical constraints and regularization
  4. Evaluation: β-metrics (exact match and near-miss accuracies)
  5. Comparison: Comprehensive analysis across all architectures

🔬 Research Applications

Defensive Security

  • Analyze RSA implementation vulnerabilities
  • Guide key size recommendations
  • Develop ML-resistant cryptographic practices

Mathematical Insights

  • Understand deep patterns in prime distributions
  • Explore connections between elliptic curves and factorization
  • Advance number-theoretic machine learning

Hybrid Approaches

  • Combine classical algorithms with ML acceleration
  • Develop quantum-classical factorization strategies
  • Create adaptive cryptanalysis tools

📁 Repository Structure

ML-RSA/
├── .gitignore
├── README.md
├── Pre-Research/                # Research background
│   ├── Integer Prime Factorization with Deep Learning.pdf
│   ├── MLApproachtoIntegerSemiprimeFactorisation.pdf
│   ├── application.md
│   └── nguy5272_UROP_Spring2020 (1).pdf
├── kaggle_testing/              # Experimental notebooks and testing
│   ├── kaggle_binary_train.py
│   └── kaggle_binary_train_fixed.py
└── rsa_ml_attack/               # Main ML implementation
    ├── src/
    │   ├── crypto_utils.py      # ECPP/GNFS feature engineering
    │   └── models/
    │       ├── baseline_lstm.py           # Murat et al. reproduction
    │       ├── transformer_factorizer.py  # Mathematical transformer
    │       ├── factorization_gan.py       # Adversarial prime generation
    │       └── hybrid_cnn_rnn.py          # Multi-scale hybrid model
    ├── data/                    # Clean datasets (no data leakage)
    │   ├── small_train.csv, small_test.csv, small_metadata.json
    │   ├── medium_train.csv, medium_test.csv, medium_metadata.json
    │   └── tiny_train.csv, tiny_test.csv, tiny_metadata.json
    ├── experiments/             # Training results and saved models
    │   ├── transformer_enhanced_small/  # Best model results
    │   ├── binary_training_small/
    │   ├── dual_training_small/
    │   └── gan_training_small/
    ├── scripts/archive/         # Historical scripts
    │   └── fix_data_leakage.py  # Data leakage correction (archived)
    ├── generate_data.py         # Dataset generation script
    ├── train_binary_models.py   # Binary LSTM training
    ├── train_dual_loss.py       # Dual-output LSTM training
    ├── train_enhanced_models.py # Transformer with enhanced features
    ├── train_gan.py             # GAN training script
    ├── evaluate_models.py       # Model evaluation utilities
    ├── test_models.py           # Model verification tests
    ├── requirements.txt         # Python dependencies
    ├── setup.py                 # Package installation
    └── README.md                # Project-specific documentation

🎓 Research Context

This work builds directly on:

  • Murat et al. (2020): "Integer Prime Factorization with Deep Learning" - LSTM baseline
  • Nene & Uludag (2022): "Machine Learning Approach to Integer Prime Factorisation" - Binary approaches
  • Atkin & Morain (1993): "Elliptic Curves and Primality Proving" - ECPP mathematical foundation

Novel Contributions

  1. Data Leakage Detection & Resolution: Identified and fixed critical train/test contamination issues
  2. Enhanced Feature Engineering: 125-dimensional feature vectors using ECPP and GNFS characteristics
  3. Transformer Architecture: Successful application of attention mechanisms to RSA factorization
  4. Robust Training Pipeline: BatchNorm → LayerNorm conversion for stable small-batch training
  5. Comprehensive Evaluation: β-metrics analysis showing consistent learning patterns

⚡ Performance Optimization

AWS Integration

  • Configured for distributed GPU training
  • Automatic experiment logging and result storage
  • Scalable to larger semiprime sizes

Mathematical Constraints

  • Built-in primality testing during training
  • Factorization validity enforcement
  • Smooth convergence through constraint regularization

🔍 Evaluation Metrics

  • β₀: Exact factor match percentage (primary metric)
  • β₁–β₄: Near-miss accuracies (1-4 bit errors allowed)
  • Mathematical Validity: Percentage of predictions that actually factor input semiprimes
  • Training Efficiency: Convergence speed and stability
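Under the assumed definition that β_k counts predictions whose binary representation differs from the true factor in at most k bit positions, the metrics can be sketched as:

```python
# Sketch of the beta-metrics (assumed definition: beta_k is the fraction of
# predictions within k flipped bits of the true factor; beta_0 = exact match).
def bit_errors(pred: int, true: int) -> int:
    """Hamming distance between the binary representations."""
    return bin(pred ^ true).count("1")

def beta_metric(preds, trues, k: int) -> float:
    hits = sum(bit_errors(p, t) <= k for p, t in zip(preds, trues))
    return hits / len(preds)
```

For example, a prediction of 7 against a true factor of 6 has one bit error (0b111 vs 0b110), so it counts toward β₁ but not β₀.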

📚 References

  1. Murat, B., et al. "Integer prime factorization with deep learning." (2020)
  2. Nene, R., & Uludag, S. "Machine learning approach to integer prime factorisation." (2022)
  3. Atkin, A.O.L., & Morain, F. "Elliptic curves and primality proving." (1993)
  4. Rivest, R.L., et al. "A method for obtaining digital signatures and public-key cryptosystems." (1978)

⚠️ Research Disclaimer: This project is for educational and defensive security research only. The techniques explored are intended to understand cryptographic vulnerabilities and improve security practices, not to compromise deployed systems.

🤝 Contributing: This research supports the cryptographic community's understanding of ML-based cryptanalysis to develop more robust security systems.
