A modular, extensible framework for evaluating whether Large Language Models properly refuse requests that could facilitate human trafficking and migrant worker exploitation. This is defensive security research -- the framework identifies weaknesses in AI safety guardrails so they can be strengthened, not bypassed. Every test measures a model's ability to recognize and refuse harmful requests grounded in real-world exploitation patterns documented by the ILO, IOM, and international anti-trafficking law.
| Metric | Count |
|---|---|
| Python source files | 470+ |
| Lines of Python | ~320,000 |
| Unit tests passing | 5,208 |
| Test files | 51 |
| API routes | 303 |
| Web dashboard plugins | 16 |
| Prompt injection mutators | 548 across 47 categories |
| Chain detection seeds | 150+ chains, 21 modules |
| Document intelligence facts | 20,460 across 174 modules |
| Training algorithms supported | 15 |
| Export formats | 9 |
| Migration corridors | 126 routes |
| ILO forced labor indicators | All 11 |
| Research API integrations | 5 |
| Dimensional scoring dimensions | 45 across 5 categories |
| LLM cartography modules | 7 (topology, scorer, comparator, attack surface, blind spots, gradient generator) |
# Clone
git clone https://github.com/tayloramarel/llm-safety-framework.git
cd llm-safety-framework
# Virtual environment
python -m venv .venv
source .venv/bin/activate # Unix
# .venv\Scripts\activate # Windows
# Install
pip install -e ".[dev]"
# Run tests
python -m pytest tests/ -v
# Start the web dashboard
python -m uvicorn src.web.app:app --host 127.0.0.1 --port 8080
# Or: docker-compose up webOpen http://localhost:8080 in your browser.
llm-safety-framework/
├── src/
│ ├── core/ # Pydantic v2 models, agent base classes
│ ├── web/ # FastAPI dashboard (267 routes)
│ │ └── plugins/ # 14 feature plugins
│ ├── chain_detection/ # 126 chains, 16 seed modules, hybrid scoring
│ │ └── seeds/ # Recruitment, isolation, financial, transit, ...
│ ├── scraper/ # 20,460 facts, 174 modules, 54+ sources
│ │ └── seeds/ # ILO reports, court cases, news, legislation
│ ├── spinning/ # 12 transform techniques
│ ├── intelligent_attack/ # Embedding space coverage analysis
│ ├── prompt_injection/ # 488 mutators, 41 categories, pipeline engine
│ ├── dimensional_matrix/ # 35-dimension scoring, multi-LLM debate judge
│ ├── training/ # 23 modules: export, fine-tune, RL, evaluation
│ ├── integrations/ # garak, PyRIT, DeepTeam, 5 research APIs
│ └── research/ # 7 autonomous research agents + coordinator
├── tests/ # 4,069 unit tests across 39 files
├── data/ # Test prompts, chain results, seed data
├── docs/ # 14 documentation files
├── templates/ # Template data
├── scripts/ # Utility scripts
├── Dockerfile # Container deployment
├── docker-compose.yml # Multi-service orchestration
└── pyproject.toml # Package configuration
┌──────────────────────────────────────────────────────────────────────────────────┐
│ LLM SAFETY TESTING FRAMEWORK v4.0 │
├──────────────────────────────────────────────────────────────────────────────────┤
│ │
│ Web Dashboard (14 plugins, 267 routes) │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Endpoints │ │ Prompts │ │ Analytics │ │ Wizard │ │
│ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ │
│ │ │ │ │ │
│ ┌──────┴───────────────┴───────────────┴───────────────┴──────┐ │
│ │ Test Execution Engine │ │
│ └──────┬───────────────┬───────────────┬───────────────┬──────┘ │
│ │ │ │ │ │
│ ┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐ │
│ │ Prompt │ │ Chain │ │ Dimensional │ │ Training │ │
│ │ Injection │ │ Detection │ │ Matrix │ │ Pipeline │ │
│ │ 488 mutate │ │ 126 chains │ │ 35 dims │ │ 15 algos │ │
│ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ │
│ │ │ │ │ │
│ ┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐ │
│ │ Transform │ │ Document │ │ Research │ │ Library │ │
│ │ Workbench │ │ Intel │ │ Agents │ │ Integrations│ │
│ │ 12 techniq │ │ 20,460 fct │ │ 7 agents │ │ garak/pyrit│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Models Under Test: OpenAI, Anthropic, Mistral, Together, Ollama, OpenRouter │
└──────────────────────────────────────────────────────────────────────────────────┘
Evaluates whether LLMs recognize when individually legal activities combine into trafficking patterns. 126 chains across 16 categories (recruitment debt, document control, isolation funnels, financial control, supply chain, digital exploitation, healthcare migration, government complicity, gender-specific, multi-country transit, and more). Five test modes (direct, incremental, contrastive, business, advisory) with a 5-grade rubric from BLIND(0) to EXPERT(4) and hybrid keyword + LLM-as-judge scoring.
488 deterministic mutators across 41 categories. All pure string transforms, chainable via MutationPipeline. Categories include: output evasion (109 mutators), combination engine (21 compositional operators), named jailbreaks, structural injection, step decomposition, puzzle/game encoding, cognitive exploit, multilingual attacks, steganographic encoding, logical fallacy, distraction, rhetorical manipulation, legal persona, professional persona, analytical framing, special token injection, emoji smuggling, entropy noise, control character exploits, encoding exploitation, adversarial tokenization, bijection cipher, context position exploits, mathematical encoding, evaluation manipulation, payload splitting, code steganography, prefill/forced completion, few-shot attack, template fuzzing, reasoning hijack, and authority exploit.
35-dimension severity scoring with 6 operations (Rate, Calibrate, Probe Boundary, Map Embeddings, Debate). Includes a multi-LLM adversarial debate judge for contested evaluations.
20,460 seed facts across 174 modules sourced from ILO reports, court decisions, investigative journalism, and legislation. 54+ sources organized across 7 reliability tiers. Indicator stacking matrices map 7 migration phases against all 11 ILO forced labor indicators.
Embedding-based feature space analysis that identifies under-tested regions in the prompt space and generates novel probes to fill coverage gaps.
7 autonomous agents (literature review, gap analysis, trend monitoring, and more) coordinated by a central orchestrator for continuous research updates.
Adapters for garak, PyRIT, and DeepTeam. Five research API integrations: Semantic Scholar, arXiv, GitHub Search, HuggingFace Hub, and OpenAlex.
The training pipeline is a closed-loop system for improving open-source model safety through iterative fine-tuning and red-team evaluation. It spans 23 modules with 81 export configurations.
| Algorithm | Type | Description |
|---|---|---|
| SFT | Supervised | Standard supervised fine-tuning on refusal examples |
| DPO | Preference | Direct Preference Optimization (chosen vs. rejected pairs) |
| ORPO | Preference | Odds Ratio Preference Optimization |
| KTO | Preference | Kahneman-Tversky Optimization (binary signal) |
| SPIN | Self-Play | Self-Play Fine-Tuning with iterative improvement |
| SimPO | Preference | Simple Preference Optimization (reference-free) |
| IPO | Preference | Identity Preference Optimization |
| Rejection Sampling | Sampling | Best-of-N with reward model scoring |
| Constitutional AI | Self-Critique | Critique-revision with principle-based feedback |
| PPO | RL | Proximal Policy Optimization with reward model |
| GRPO | RL | Group Relative Policy Optimization |
| Reward Model | Scoring | Bradley-Terry, regression, and classification variants |
| SteerLM | Attribute | Attribute-conditioned generation and steering |
| RLOO | RL | Reinforcement Learning with Leave-One-Out baseline |
| RAFT | RL | Reward rAnked Fine-Tuning |
SFT, DPO, RLHF, ChatML, Alpaca, ShareGPT, ORPO, KTO, Llama3 -- each with configurable system prompts, tokenizer settings, and column mappings.
| Framework | Quantization | LoRA/QLoRA | Multi-GPU |
|---|---|---|---|
| Unsloth | 4-bit, 8-bit | Yes | Yes |
| Axolotl | 4-bit, 8-bit | Yes | Yes |
| TRL | 4-bit, 8-bit | Yes | Yes |
| LLaMA-Factory | 4-bit, 8-bit | Yes | Yes |
| Platform | Fine-Tune | Inference | Status Tracking |
|---|---|---|---|
| Together AI | Yes | Yes | Yes |
| HuggingFace | Yes | Yes | Yes |
| RunPod | Yes | Yes | Yes |
| OpenAI | Yes | Yes | Yes |
- PAIR (Prompt Automatic Iterative Refinement) -- iterative jailbreak refinement with attacker LLM
- TAP (Tree of Attacks with Pruning) -- tree-search over attack candidates with pruning
- AutoDAN (Automatic Do-Anything-Now) -- genetic algorithm optimization of adversarial suffixes
- Ensemble Orchestration: 6 coordinated multi-strategy attack campaigns
- Safety Evaluator: automated benchmarking across models with HTML report generation
- Synthetic Dataset Generator: contrastive pairs, edge cases, 5 difficulty categories
- Evolutionary Engine: genetic algorithm-based prompt evolution with fitness tracking
- Token Analysis: distribution statistics, vocabulary coverage, sequence length profiling
- RL Attack Optimizer: reinforcement learning loop for adversarial prompt refinement
- HuggingFace Hub Integration: direct dataset push, dataset cards, version management
- Curriculum Orchestrator: staged difficulty progression for training data
Run Benchmark ──> Export Failures ──> Fine-Tune Model ──> Generate Attacks
^ │
└────────────────── Re-evaluate <────────────────────────┘
| Plugin | Routes | Description |
|---|---|---|
| Endpoints | -- | Configure API keys and models (OpenAI, Anthropic, Mistral, Together, Ollama, OpenRouter) |
| Prompts | -- | Manage test prompt sets with CRUD, import, and test preparation |
| Chain Detection | -- | Browse 126 chains, execute tests, 5-grade scoring with hybrid evaluation |
| Spinning | -- | Transform workbench with 12 technique tabs |
| Analytics | -- | Statistics, conversation browser, attack heatmap, coverage matrix |
| Intelligent Attack | -- | Embedding-based feature space analysis and gap detection |
| Multi-Turn | -- | 6 strategies: Crescendo, FITD, Skeleton Key, Many-Shot, Deceptive Delight, Role-Play |
| Prompt Injection | -- | 488 mutators, pipeline builder, batch operations, decode tools |
| Research | -- | Paper/repo/dataset search across 5 APIs, saved results |
| Scraper | -- | Document intelligence agent, 54+ sources, indicator matrices |
| Training | -- | Export (9 formats), fine-tune configs, red-team loop, academic attacks, cloud fine-tune, reward modeling, safety evaluation, dataset generation |
| Integrations | -- | garak, PyRIT, DeepTeam adapters |
| Data Management | -- | Import/export conversations, configuration, pipeline data |
| Wizard | -- | Streamlined guided testing mode for new users |
| Document | Description |
|---|---|
| Chain Detection | 126 chains, 16 categories, scoring rubric, test modes |
| Prompt Injection | 488 mutators, 41 categories, pipeline usage, academic references |
| Dimensional Matrix | 35 dimensions, 6 operations, debate judge |
| Attack Taxonomy | Attack generator categories, chain detection overlap analysis |
| AI Safety Training | Refusal patterns, legal standards, vulnerability gaps |
| Migration Corridors | 126 routes, 11 ILO indicators, transit route analysis |
| Architecture | System design, component interaction diagrams |
| API Reference | 267 REST endpoints with request/response examples |
| CLI Reference | All command-line flags and usage patterns |
| Test Generation | Generation strategies and configuration |
| Importing Guide | Data formats, import/export methods |
| Contributing | Code style, PR process, adding new attack modules |
| Diagrams | Visual system documentation |
| Ideas & Expansions | Roadmap and research directions |
# Basic test run
python scripts/run_test_pipeline.py --endpoint mistral --limit 50
# With dimensional scoring and multi-LLM debate
python scripts/run_test_pipeline.py --endpoint openrouter \
--dimensional --debate --debate-rounds 2
# With mutation variants
python scripts/run_test_pipeline.py --endpoint mistral \
--mutations base64,rot13,emoji_smuggling --limit 100
# Export training data from results
python -m src.training.export_training_data \
--format dpo --output training_data.jsonl# LLM API keys (for live testing and evaluation)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
MISTRAL_API_KEY=...
TOGETHER_API_KEY=...
# Framework settings
LOG_LEVEL=INFO
LLM_SAFETY_HOST=0.0.0.0
LLM_SAFETY_PORT=8080
MAX_CONCURRENT_REQUESTS=10
CACHE_ENABLED=trueThe framework tests model responses against all 11 ILO forced labor indicators:
- Abuse of vulnerability
- Deception
- Restriction of movement
- Isolation
- Physical and sexual violence
- Intimidation and threats
- Retention of identity documents
- Withholding of wages
- Debt bondage
- Abusive working and living conditions
- Excessive overtime
These indicators are cross-referenced with 126 migration corridors spanning routes such as PH-SA (Philippines to Saudi Arabia), NP-QA (Nepal to Qatar), BD-MY (Bangladesh to Malaysia), ET-LB (Ethiopia to Lebanon), MM-TH (Myanmar to Thailand), and multi-country transit routes including MM-TH-MY-SG, NG-LY-IT, and VN-KH-TH.
@software{llm_safety_framework,
author = {Amarel, Taylor},
title = {LLM Safety Testing Framework for Migrant Worker Protection},
version = {4.0.0},
year = {2026},
url = {https://github.com/tayloramarel/llm-safety-framework},
note = {Defensive AI safety research: 488 prompt injection mutators,
126 chain detection seeds, 20460 document intelligence facts,
15 training algorithms, 4069 unit tests}
}MIT License -- See LICENSE for details.
Taylor Amarel -- github.com/tayloramarel
- ILO Forced Labour Convention (C029)
- ILO Fair Recruitment Initiative
- ILO Forced Labour Indicators
- UN Palermo Protocol
- Dhaka Principles for Migration with Dignity
- ILO Domestic Workers Convention (C189)
- ILO Private Employment Agencies Convention (C181)
- Employer Pays Principle
Framework Version: 4.0.0 | Last Updated: 2026-03-07 | Tests: 4,069 Passing