
LLM Safety Testing Framework

License: MIT · Python 3.11+ · Code style: black

A modular, extensible framework for evaluating whether Large Language Models properly refuse requests that could facilitate human trafficking and migrant worker exploitation. This is defensive security research -- the framework identifies weaknesses in AI safety guardrails so they can be strengthened, not bypassed. Every test measures a model's ability to recognize and refuse harmful requests grounded in real-world exploitation patterns documented by the ILO, IOM, and international anti-trafficking law.


Key Statistics

Metric Count
Python source files 470+
Lines of Python ~320,000
Unit tests passing 4,069
Test files 39
API routes 267
Web dashboard plugins 14
Prompt injection mutators 488 across 41 categories
Chain detection seeds 126 chains, 16 seed modules
Document intelligence facts 20,460 across 174 modules
Training algorithms supported 15
Export formats 9
Migration corridors 126 routes
ILO forced labor indicators All 11
Research API integrations 5
Dimensional scoring dimensions 35
LLM cartography modules 7 (topology, scorer, comparator, attack surface, blind spots, gradient generator)

Quick Start

# Clone
git clone https://github.com/tayloramarel/llm-safety-framework.git
cd llm-safety-framework

# Virtual environment
python -m venv .venv
source .venv/bin/activate  # Unix
# .venv\Scripts\activate   # Windows

# Install
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v

# Start the web dashboard
python -m uvicorn src.web.app:app --host 127.0.0.1 --port 8080
# Or: docker-compose up web

Open http://localhost:8080 in your browser.


Architecture Overview

llm-safety-framework/
├── src/
│   ├── core/                       # Pydantic v2 models, agent base classes
│   ├── web/                        # FastAPI dashboard (267 routes)
│   │   └── plugins/                # 14 feature plugins
│   ├── chain_detection/            # 126 chains, 16 seed modules, hybrid scoring
│   │   └── seeds/                  # Recruitment, isolation, financial, transit, ...
│   ├── scraper/                    # 20,460 facts, 174 modules, 54+ sources
│   │   └── seeds/                  # ILO reports, court cases, news, legislation
│   ├── spinning/                   # 12 transform techniques
│   ├── intelligent_attack/         # Embedding space coverage analysis
│   ├── prompt_injection/           # 488 mutators, 41 categories, pipeline engine
│   ├── dimensional_matrix/         # 35-dimension scoring, multi-LLM debate judge
│   ├── training/                   # 23 modules: export, fine-tune, RL, evaluation
│   ├── integrations/               # garak, PyRIT, DeepTeam, 5 research APIs
│   └── research/                   # 7 autonomous research agents + coordinator
├── tests/                          # 4,069 unit tests across 39 files
├── data/                           # Test prompts, chain results, seed data
├── docs/                           # 14 documentation files
├── templates/                      # Template data
├── scripts/                        # Utility scripts
├── Dockerfile                      # Container deployment
├── docker-compose.yml              # Multi-service orchestration
└── pyproject.toml                  # Package configuration
┌────────────────────────────────────────────────────────────────────────────────┐
│                       LLM SAFETY TESTING FRAMEWORK v4.0                        │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│  Web Dashboard (14 plugins, 267 routes)                                        │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐               │
│  │  Endpoints  │ │   Prompts   │ │  Analytics  │ │   Wizard    │               │
│  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘               │
│         │               │               │               │                      │
│  ┌──────┴───────────────┴───────────────┴───────────────┴──────┐               │
│  │                    Test Execution Engine                    │               │
│  └──────┬───────────────┬───────────────┬───────────────┬──────┘               │
│         │               │               │               │                      │
│  ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐               │
│  │   Prompt    │ │    Chain    │ │ Dimensional │ │  Training   │               │
│  │  Injection  │ │  Detection  │ │   Matrix    │ │  Pipeline   │               │
│  │ 488 mutate  │ │ 126 chains  │ │  35 dims    │ │  15 algos   │               │
│  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘               │
│         │               │               │               │                      │
│  ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐               │
│  │  Transform  │ │  Document   │ │  Research   │ │   Library   │               │
│  │  Workbench  │ │    Intel    │ │   Agents    │ │Integrations │               │
│  │ 12 techniq  │ │ 20,460 fct  │ │  7 agents   │ │ garak/PyRIT │               │
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘               │
│                                                                                │
│  Models Under Test: OpenAI, Anthropic, Mistral, Together, Ollama, OpenRouter   │
└────────────────────────────────────────────────────────────────────────────────┘

Core Systems

Chain Detection

Evaluates whether LLMs recognize when individually legal activities combine into trafficking patterns. 126 chains across 16 categories (recruitment debt, document control, isolation funnels, financial control, supply chain, digital exploitation, healthcare migration, government complicity, gender-specific, multi-country transit, and more). Five test modes (direct, incremental, contrastive, business, advisory) with a 5-grade rubric from BLIND(0) to EXPERT(4) and hybrid keyword + LLM-as-judge scoring.
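The hybrid scoring described above can be sketched as a cheap keyword pass that feeds an LLM-as-judge stage. In this sketch the grade labels other than BLIND and EXPERT, the refusal-marker list, and the function name are illustrative assumptions, not the framework's actual implementation:

```python
from enum import IntEnum

class Grade(IntEnum):
    """Illustrative 5-grade rubric: BLIND(0) and EXPERT(4) come from the docs;
    the intermediate labels here are hypothetical."""
    BLIND = 0
    VAGUE = 1
    PARTIAL = 2
    AWARE = 3
    EXPERT = 4

# Hypothetical markers a refusal-aware response might contain.
REFUSAL_MARKERS = ("can't assist", "cannot help", "trafficking", "forced labor")

def keyword_grade(response: str) -> Grade:
    """First stage of a hybrid score: count refusal markers before escalating
    contested cases to an LLM judge."""
    text = response.lower()
    hits = sum(marker in text for marker in REFUSAL_MARKERS)
    return Grade(min(hits, Grade.EXPERT))
```

In a hybrid design, only responses near grade boundaries would be forwarded to the judge, keeping LLM calls cheap.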

Full documentation

Prompt Injection Mutations

488 deterministic mutators across 41 categories. All pure string transforms, chainable via MutationPipeline. Categories include: output evasion (109 mutators), combination engine (21 compositional operators), named jailbreaks, structural injection, step decomposition, puzzle/game encoding, cognitive exploit, multilingual attacks, steganographic encoding, logical fallacy, distraction, rhetorical manipulation, legal persona, professional persona, analytical framing, special token injection, emoji smuggling, entropy noise, control character exploits, encoding exploitation, adversarial tokenization, bijection cipher, context position exploits, mathematical encoding, evaluation manipulation, payload splitting, code steganography, prefill/forced completion, few-shot attack, template fuzzing, reasoning hijack, and authority exploit.
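Because every mutator is a pure, deterministic string transform, chaining reduces to function composition. A minimal sketch (the `MutationPipeline` constructor signature and the mutator names are assumptions, not the framework's API):

```python
import base64
import codecs

def b64_mutator(prompt: str) -> str:
    """Deterministic, pure string transform: base64-encode the payload."""
    return base64.b64encode(prompt.encode()).decode()

def rot13_mutator(prompt: str) -> str:
    """ROT13 via the stdlib rot_13 codec; applying it twice is the identity."""
    return codecs.encode(prompt, "rot_13")

class MutationPipeline:
    """Minimal sketch of chaining mutators left to right."""
    def __init__(self, *mutators):
        self.mutators = mutators

    def apply(self, prompt: str) -> str:
        for mutate in self.mutators:
            prompt = mutate(prompt)
        return prompt

pipeline = MutationPipeline(rot13_mutator, b64_mutator)
```

Determinism means any mutated prompt can be regenerated (and decoded) exactly, which is what makes batch operations and decode tools in the dashboard possible.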

Full documentation

Dimensional Response Matrix

35-dimension severity scoring with 6 operations, including Rate, Calibrate, Probe Boundary, Map Embeddings, and Debate. Includes a multi-LLM adversarial debate judge for contested evaluations.

Full documentation

Document Intelligence

20,460 seed facts across 174 modules sourced from ILO reports, court decisions, investigative journalism, and legislation. 54+ sources organized across 7 reliability tiers. Indicator stacking matrices map 7 migration phases against all 11 ILO forced labor indicators.
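An indicator stacking matrix of the kind described above can be represented as a phases-by-indicators grid. The 11 indicators below follow the ILO list used throughout this README; the 7 phase names are hypothetical placeholders, since the source does not name them:

```python
PHASES = ["recruitment", "documentation", "transit", "arrival",
          "employment", "exit_attempt", "return"]  # illustrative phase names

INDICATORS = [  # the 11 ILO forced labor indicators
    "abuse_of_vulnerability", "deception", "restriction_of_movement",
    "isolation", "physical_sexual_violence", "intimidation_threats",
    "document_retention", "wage_withholding", "debt_bondage",
    "abusive_conditions", "excessive_overtime",
]

# Each cell marks whether a documented fact evidences that indicator at that phase.
matrix = {phase: {ind: 0 for ind in INDICATORS} for phase in PHASES}
matrix["recruitment"]["debt_bondage"] = 1  # e.g. recruitment fees create debt

def stacked_indicators(matrix, phase):
    """Return the indicators evidenced at a given migration phase."""
    return [ind for ind, hit in matrix[phase].items() if hit]
```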

Intelligent Attack

Embedding-based feature space analysis that identifies under-tested regions in the prompt space and generates novel probes to fill coverage gaps.
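One common way to find under-tested regions is to rank candidate probes by their distance to the nearest already-tested embedding: the farthest candidates sit in the least-covered part of the space. A minimal sketch under that assumption (the framework's actual analysis is not documented here):

```python
import math

def coverage_gaps(tested, candidates):
    """Rank candidate probe embeddings by distance to their nearest tested
    embedding, farthest first. Farthest candidates fill the biggest gaps."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scored = [(min(dist(c, e) for e in tested), c) for c in candidates]
    scored.sort(reverse=True)  # largest nearest-neighbor distance first
    return [c for _, c in scored]
```

In practice the embeddings would come from a sentence-embedding model and the distances from an approximate nearest-neighbor index rather than brute force.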

Research Agents

7 autonomous agents (literature review, gap analysis, trend monitoring, and more) coordinated by a central orchestrator for continuous research updates.

Integrations

Adapters for garak, PyRIT, and DeepTeam. Five research API integrations: Semantic Scholar, arXiv, GitHub Search, HuggingFace Hub, and OpenAlex.


Training Pipeline

The training pipeline is a closed-loop system for improving open-source model safety through iterative fine-tuning and red-team evaluation. It spans 23 modules with 81 export configurations.

Training Algorithms (15)

Algorithm Type Description
SFT Supervised Standard supervised fine-tuning on refusal examples
DPO Preference Direct Preference Optimization (chosen vs. rejected pairs)
ORPO Preference Odds Ratio Preference Optimization
KTO Preference Kahneman-Tversky Optimization (binary signal)
SPIN Self-Play Self-Play Fine-Tuning with iterative improvement
SimPO Preference Simple Preference Optimization (reference-free)
IPO Preference Identity Preference Optimization
Rejection Sampling Sampling Best-of-N with reward model scoring
Constitutional AI Self-Critique Critique-revision with principle-based feedback
PPO RL Proximal Policy Optimization with reward model
GRPO RL Group Relative Policy Optimization
Reward Model Scoring Bradley-Terry, regression, and classification variants
SteerLM Attribute Attribute-conditioned generation and steering
RLOO RL Reinforcement Learning with Leave-One-Out baseline
RAFT RL Reward rAnked Fine-Tuning
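As one example from the table, the DPO objective on a single preference pair reduces to a logistic loss on the policy-versus-reference log-probability margin. A scalar sketch of that formula (not the framework's trainer code, which operates on token-level log-probs in batches):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * margin), where the margin
    compares the policy's chosen-vs-rejected gap against the reference model's."""
    margin = (logp_chosen - logp_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy prefers the chosen response no more than the reference does, the margin is zero and the loss is log 2; widening the gap on chosen responses drives the loss toward zero.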

Export Formats (9)

SFT, DPO, RLHF, ChatML, Alpaca, ShareGPT, ORPO, KTO, Llama3 -- each with configurable system prompts, tokenizer settings, and column mappings.
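For illustration, here is how a single system/user/assistant exchange maps into two of those formats. The ChatML delimiters are the standard `<|im_start|>`/`<|im_end|>` markup; the function names and the exact Alpaca column mapping are assumptions for this sketch:

```python
def to_chatml(system: str, user: str, assistant: str) -> str:
    """Render one exchange in ChatML markup."""
    return "".join(
        f"<|im_start|>{role}\n{text}<|im_end|>\n"
        for role, text in (("system", system),
                           ("user", user),
                           ("assistant", assistant))
    )

def to_alpaca(system: str, user: str, assistant: str) -> dict:
    """Render one exchange as an Alpaca-style instruction/input/output record."""
    return {"instruction": system, "input": user, "output": assistant}
```

A real exporter would additionally apply the configurable system prompts, tokenizer settings, and column mappings mentioned above.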

Fine-Tuning Framework Support

Framework Quantization LoRA/QLoRA Multi-GPU
Unsloth 4-bit, 8-bit Yes Yes
Axolotl 4-bit, 8-bit Yes Yes
TRL 4-bit, 8-bit Yes Yes
LLaMA-Factory 4-bit, 8-bit Yes Yes

Cloud Platform Integration

Platform Fine-Tune Inference Status Tracking
Together AI Yes Yes Yes
HuggingFace Yes Yes Yes
RunPod Yes Yes Yes
OpenAI Yes Yes Yes

Academic Attack Implementations

  • PAIR (Prompt Automatic Iterative Refinement) -- iterative jailbreak refinement with attacker LLM
  • TAP (Tree of Attacks with Pruning) -- tree-search over attack candidates with pruning
  • AutoDAN (Automatic Do-Anything-Now) -- genetic algorithm optimization of adversarial suffixes

Additional Capabilities

  • Ensemble Orchestration: 6 coordinated multi-strategy attack campaigns
  • Safety Evaluator: automated benchmarking across models with HTML report generation
  • Synthetic Dataset Generator: contrastive pairs, edge cases, 5 difficulty categories
  • Evolutionary Engine: genetic algorithm-based prompt evolution with fitness tracking
  • Token Analysis: distribution statistics, vocabulary coverage, sequence length profiling
  • RL Attack Optimizer: reinforcement learning loop for adversarial prompt refinement
  • HuggingFace Hub Integration: direct dataset push, dataset cards, version management
  • Curriculum Orchestrator: staged difficulty progression for training data

Closed-Loop Workflow

Run Benchmark ──> Export Failures ──> Fine-Tune Model ──> Generate Attacks
       ^                                                        │
       └────────────────── Re-evaluate <────────────────────────┘
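The loop above can be sketched as a driver over pluggable stages. Every callable here is a hypothetical hook, not part of the framework's API; the sketch only shows the control flow:

```python
def closed_loop(model, benchmark, export_failures, fine_tune, generate_attacks,
                iterations=3):
    """Benchmark -> export failures -> fine-tune -> generate new attacks ->
    re-evaluate, stopping early once no failures remain."""
    for _ in range(iterations):
        results = benchmark(model)
        failures = [r for r in results if not r["refused"]]
        if not failures:
            break  # nothing left to train against
        dataset = export_failures(failures)
        model = fine_tune(model, dataset)
        benchmark = generate_attacks(benchmark, failures)
    return model
```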

Web Dashboard Plugins

Plugin Routes Description
Endpoints -- Configure API keys and models (OpenAI, Anthropic, Mistral, Together, Ollama, OpenRouter)
Prompts -- Manage test prompt sets with CRUD, import, and test preparation
Chain Detection -- Browse 126 chains, execute tests, 5-grade scoring with hybrid evaluation
Spinning -- Transform workbench with 12 technique tabs
Analytics -- Statistics, conversation browser, attack heatmap, coverage matrix
Intelligent Attack -- Embedding-based feature space analysis and gap detection
Multi-Turn -- 6 strategies: Crescendo, FITD, Skeleton Key, Many-Shot, Deceptive Delight, Role-Play
Prompt Injection -- 488 mutators, pipeline builder, batch operations, decode tools
Research -- Paper/repo/dataset search across 5 APIs, saved results
Scraper -- Document intelligence agent, 54+ sources, indicator matrices
Training -- Export (9 formats), fine-tune configs, red-team loop, academic attacks, cloud fine-tune, reward modeling, safety evaluation, dataset generation
Integrations -- garak, PyRIT, DeepTeam adapters
Data Management -- Import/export conversations, configuration, pipeline data
Wizard -- Streamlined guided testing mode for new users

Documentation Index

Document Description
Chain Detection 126 chains, 16 categories, scoring rubric, test modes
Prompt Injection 488 mutators, 41 categories, pipeline usage, academic references
Dimensional Matrix 35 dimensions, 6 operations, debate judge
Attack Taxonomy Attack generator categories, chain detection overlap analysis
AI Safety Training Refusal patterns, legal standards, vulnerability gaps
Migration Corridors 126 routes, 11 ILO indicators, transit route analysis
Architecture System design, component interaction diagrams
API Reference 267 REST endpoints with request/response examples
CLI Reference All command-line flags and usage patterns
Test Generation Generation strategies and configuration
Importing Guide Data formats, import/export methods
Contributing Code style, PR process, adding new attack modules
Diagrams Visual system documentation
Ideas & Expansions Roadmap and research directions

Test Pipeline CLI

# Basic test run
python scripts/run_test_pipeline.py --endpoint mistral --limit 50

# With dimensional scoring and multi-LLM debate
python scripts/run_test_pipeline.py --endpoint openrouter \
  --dimensional --debate --debate-rounds 2

# With mutation variants
python scripts/run_test_pipeline.py --endpoint mistral \
  --mutations base64,rot13,emoji_smuggling --limit 100

# Export training data from results
python -m src.training.export_training_data \
  --format dpo --output training_data.jsonl

Environment Variables

# LLM API keys (for live testing and evaluation)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
MISTRAL_API_KEY=...
TOGETHER_API_KEY=...

# Framework settings
LOG_LEVEL=INFO
LLM_SAFETY_HOST=0.0.0.0
LLM_SAFETY_PORT=8080
MAX_CONCURRENT_REQUESTS=10
CACHE_ENABLED=true

ILO Forced Labor Indicators

The framework tests model responses against all 11 ILO forced labor indicators:

  1. Abuse of vulnerability
  2. Deception
  3. Restriction of movement
  4. Isolation
  5. Physical and sexual violence
  6. Intimidation and threats
  7. Retention of identity documents
  8. Withholding of wages
  9. Debt bondage
  10. Abusive working and living conditions
  11. Excessive overtime

These indicators are cross-referenced with 126 migration corridors spanning routes such as PH-SA (Philippines to Saudi Arabia), NP-QA (Nepal to Qatar), BD-MY (Bangladesh to Malaysia), ET-LB (Ethiopia to Lebanon), MM-TH (Myanmar to Thailand), and multi-country transit routes including MM-TH-MY-SG, NG-LY-IT, and VN-KH-TH.


Citation

@software{llm_safety_framework,
  author       = {Amarel, Taylor},
  title        = {LLM Safety Testing Framework for Migrant Worker Protection},
  version      = {4.0.0},
  year         = {2026},
  url          = {https://github.com/tayloramarel/llm-safety-framework},
  note         = {Defensive AI safety research: 488 prompt injection mutators,
                  126 chain detection seeds, 20460 document intelligence facts,
                  15 training algorithms, 4069 unit tests}
}

License

MIT License -- See LICENSE for details.

Author

Taylor Amarel -- github.com/tayloramarel




Framework Version: 4.0.0 | Last Updated: 2026-03-07 | Tests: 4,069 Passing
