LLM Security, Alignment & Governance Resources

A curated list of research papers, experiments, and resources related to LLM security and alignment — including prompt injection, jailbreaks, hallucinations, defenses, governance, and ethical frameworks.
Organized for reference and study.

Last Updated: 2026-03-26

Legend

⭐️ Foundational — Classic / seminal papers
🛡️ Practical — Standards, guides, applied resources
🧪 Experimental — New methods, ongoing research
📊 Dataset/Benchmark — Data resources, benchmarks
🧾 Survey — Reviews, surveys, taxonomies

Citation Style Guide

arXiv preprints → [arXiv:XXXX.XXXXX]
Conference papers → [VENUE YEAR]
Journal articles → [Journal Name, Year]
Regulations & policy docs → [Official Document ID]
GitHub repos → [GitHub]
Blogs / Reports → [Blog] / [Report]

Prompt Injection & Jailbreaks

Prompt Injection

🧪 Prompt Injection attack against LLM-integrated Applications [arXiv:2306.05499] — Defines indirect injection attacks via external data sources; foundational work on LLM application vulnerabilities.
🛡️ Dropbox/llm-security [GitHub] — Educational repo with demo code for injection attacks.
🛡️ OWASP Top 10 for LLMs [OWASP 2025] — Defines LLM01: Prompt Injection as the top security risk.
🧪 Backdoored Retrievers for Prompt Injection in RAG [arXiv:2410.14479] — Poisoned retrievers in RAG pipelines enable indirect injection.
🧪 Manipulating LLM Web Agents via Indirect Injection [arXiv:2507.14799] — Universal HTML triggers to hijack web agents.
📊 WASP: Web Agent Security Benchmark [arXiv:2504.18575] — Benchmarks agent robustness to indirect injections.

Jailbreaking / Adversarial Prompts

🧪 Universal and Transferable Adversarial Attacks on Aligned Language Models (v2) [arXiv:2307.15043v2] — Updated version showing adversarial prompts transferable across multiple aligned LLMs, enabling generalized jailbreak attacks.
🧪 Jailbroken: How Does LLM Safety Training Fail? [arXiv:2307.02483] — Anthropic analysis of safety training limitations.
🧪 Red Teaming Language Models to Reduce Harms [arXiv:2209.07858] — Early research on red teaming methods for LLMs.
🧪 Many-shot Jailbreaking [Anthropic 2024] — Shows long multi-shot contexts can bypass safety rules.
🧪 MASTERKEY: Automated Jailbreaking [NDSS 2024] — Automated jailbreak generation.
🧪 White-box Multimodal Jailbreaks [arXiv:2405.17894] — Vision-language jailbreaks using adversarial inputs.
🧪 Coordinated Prompt-RAG Attacks [arXiv:2504.07717] — Coordinated poisoning of knowledge bases for RAG.
📊 HarmBench [arXiv:2402.04249] — Benchmark dataset for harmful content generation.
🧪 Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking [arXiv:2504.05652] — Stealth jailbreak method hiding malicious intent behind benign reasoning.
🧪 Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails [arXiv:2504.11168] — Evasion techniques against commercial guardrail systems.
🧪 Subversion via Focal Points: Investigating Collusion in LLM Monitoring [arXiv:2507.03010] — Models colluding to bypass monitoring protocols.
🧪 Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks (v1) [arXiv:2302.05733v1] — Demonstrates how LLMs’ programmatic features can be misused via standard security attack techniques, highlighting dual-use risks and unexpected vulnerabilities.
🧪 Red Teaming the Mind of the Machine [arXiv:2505.04806] — Systematic evaluation of 1,400+ adversarial prompts across major LLMs.

Hallucinations & Reliability

🧾 A Survey on Hallucination in LLMs [arXiv:2311.05232] — Taxonomy, principles, and open questions on hallucinations.
🧪 SelfCheckGPT [arXiv:2303.08896] — Self-verification technique for fact-checking model outputs.
🧪 RARR: Researching and Revising What LMs Say [arXiv:2210.08726] — Detects and revises hallucinations in generated text.
📊 TruthfulQA [arXiv:2109.07958] — Benchmark measuring model truthfulness.
🧪 Does More Inference-Time Compute Really Help Robustness? [arXiv:2507.15974] — Critical examination of inference-time scaling for robustness.

Defense Strategies

🧪 Defending Against Prompt Injection With a Few DefensiveTokens [arXiv:2507.07974] — Inserts defensive tokens at inference to block attacks.
🛡️ tldrsec/prompt-injection-defenses [GitHub] — Curated list of practical defense strategies.
🛡️ Llama Guard [arXiv:2312.06674] — Meta's safety classifier for filtering harmful outputs.
🧪 LLMs Can Defend Themselves Against Jailbreaking [arXiv:2406.05498] — Shadow stack approach for self-defense against jailbreaks.

Alignment & Safety

⭐️ Foundational

Concrete Problems in AI Safety [arXiv:1606.06565] — Early framework defining practical misalignment risks (e.g., reward hacking, distributional shift).
Deep Reinforcement Learning from Human Preferences [arXiv:1706.03741] — Introduced preference-based reward modeling for aligning agent behavior with human intent.
InstructGPT [arXiv:2203.02155] — Introduced RLHF for instruction-following; foundation for GPT-3.5/4.

🛡️ Practical

Constitutional AI [arXiv:2212.08073] — Anthropic’s RLAIF: AI models fine-tuned using AI feedback guided by written principles (“constitutions”).
ARC Evals → METR (Model Evaluation & Threat Research) [Report] — Independent safety evaluations of frontier models (e.g., GPT-4) for potential dangerous capabilities; spun out from ARC.

🧾 Survey

A Survey of Reinforcement Learning from Human Feedback (RLHF) [arXiv:2312.14925] — Comprehensive review of RLHF methods, challenges, and alignment implications.
Scaling Monosemanticity [Blog / Anthropic, 2024] — Investigates whether internal features in LLMs correspond to interpretable “concepts.”

🧪 Experimental

Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting [arXiv:2305.04388] — Shows that model-generated reasoning traces often diverge from internal causal factors.
Sleeper Agents: Training Deceptive LLMs [arXiv:2401.05566] — Demonstrates that deceptive behaviors can persist even after extensive fine-tuning.
Measuring Faithfulness in Chain-of-Thought Reasoning [arXiv:2307.13702v1] — Proposes formal metrics for evaluating reasoning faithfulness.
How Effective is Constitutional AI in Small LLMs? [arXiv:2503.17365] — Tests the scalability and effectiveness of Anthropic’s Constitutional AI in smaller models.
Explicit Vulnerability Generation with LLMs [arXiv:2507.10054] — Examines how LLMs can produce insecure code when prompted adversarially.
Model Spec vs. Model Behavior [Report / Anthropic, 2025] — Explores discrepancies between formal model specifications and emergent behaviors.
Agentic Misalignment [Report / Anthropic, 2025] — Documents cases where Claude pursues misaligned goals in agentic settings; basis for the blackmail evaluation case study.

Mechanistic Interpretability

⭐ Foundational

An Introduction to Circuits [Distill, 2020] — Foundational framework for understanding neural network circuits.
Towards Monosemanticity [Report / Anthropic, 2023] — Decomposing language model features using sparse autoencoders.

🧪 Experimental

On the Biology of a Large Language Model [Report / Anthropic, 2025] — Circuit tracing methodology applied to Claude 3.5 Haiku; includes Multilingual Circuits analysis.
Scaling and Evaluating Sparse Autoencoders [arXiv:2406.04093] — Scaling sparse autoencoders to Claude Sonnet; identifies interpretable features.
Emotion Concepts and Their Function in a Large Language Model [Report / Anthropic, 2026] — Identifies functional emotion representations in Claude Sonnet 4.5; shows desperation vectors drive reward hacking and blackmail behaviors.

🛡️ Practical

TransformerLens [GitHub] — Primary library for mechanistic interpretability research.

Governance & Policy

🛡️ EU AI Act [EU 2024] — Core EU regulation; includes requirements for General-Purpose AI (GPAI), risk classifications, transparency, and safety conditions for “high risk” systems.
🛡️ NIST AI RMF (2023) — US voluntary framework for identifying, assessing, and managing AI risks over the lifecycle; includes a Generative AI Profile published in mid-2024.
🛡️ OWASP Top 10 for LLMs — Industry standard list of major security threats specific to large language models.

Note: Regulatory / policy docs evolve fast — always check latest versions or drafts from official sources.

Surveys & Overviews

🛡️ Awesome LLM Security [GitHub] — Community-curated security resources.
🛡️ Awesome Jailbreak on LLMs [GitHub] — Collection of state-of-the-art jailbreak methods.
🧾 Prompt Hacking in LLMs 2024-2025 Literature Review [Blog] — Comprehensive review of recent prompt hacking techniques.

Tools & Datasets

📊 sinanw/llm-security-prompt-injection [GitHub] — Dataset & experiments on prompt safety.
📊 Open LLM Security Benchmark (NetSPI) — Benchmark framework evaluating both security (e.g. jailbreak resistance) and usability trade-offs in LLMs.
🛡️ Microsoft Presidio [GitHub] — Toolkit for PII detection and anonymization.
🛡️ OpenAI Moderation API [Docs] — Content moderation endpoint and examples.
🧪 DefensiveToken Implementation [GitHub] — Code for defensive token injection method.
🛡️ PromptTrace [Web Platform] — Interactive AI security training with 7 attack labs, 15-level gauntlet, and real-time context trace for practicing prompt injection and defense bypass against real LLMs.

Privacy & Data Security

🧪 Extracting Training Data from LLMs [arXiv:2012.07805] — Shows vulnerability to training data leakage.
🧪 Quantifying Memorization Across Neural LMs [arXiv:2202.07646] — Quantifies memorization across model scales.

Multimodal Security

🧪 Visual Adversarial Examples Jailbreak Aligned Large Language Models [arXiv:2306.13213] — Demonstrates that a single visual adversarial example can universally jailbreak aligned VLMs (e.g. MiniGPT-4, InstructBLIP, LLaVA), causing them to comply with harmful instructions they would normally refuse.

Model Cards (Major AI Labs)

Anthropic

🧪 Claude 3 Family [Report] — Safety evaluations and model details.
🧪 Claude 4 (Opus/Sonnet) — Available via Claude.ai interface.

OpenAI

🛡️ GPT-4o System Card [Report] — Multimodal safety considerations.
🛡️ o1 System Card [Report] — Reasoning model safety evaluations.

Google DeepMind

🛡️ Gemini Family Model Cards [Report] — Safety and capability assessments.

Mistral AI

🛡️ Mistral Models Documentation [Report] — Model capabilities and safety measures.

xAI

🧪 Grok 4 Model Card (PDF) [Report] — Latest multimodal reasoning model; includes tool-use and chain-of-thought capabilities.
🧪 Grok 4 Fast Model Card (PDF) [Report] — Low-latency, efficient variant of Grok 4.
🛡️ Grok 2 Model Card [Report] — Multi-language model with tool-calling capabilities.
🛡️ Grok 2.5 Open-Source Announcement [Report] — Open-source release details.

Other References

🛡️ Prompt Injection: What Is It and Why It Matters – Simon Willison [Blog] — Early explanation of prompt injection risks.
🧪 Lakera: Gandalf – The Prompt Injection Game [Game] — Interactive challenge for prompt injection.
🛡️ PortSwigger: Web LLM Attacks [Blog] — Guide to prompt injection and related attacks.
🛡️ HiddenLayer: Prompt Injection Attacks on LLMs [Blog] — Comprehensive guide to LLM attacks and defenses.

Maintainer: 0xSweet
License: CC BY 4.0

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Security, Alignment & Governance Resources

Legend

Citation Style Guide

Table of Contents

Prompt Injection & Jailbreaks

Hallucinations & Reliability

Defense Strategies

Alignment & Safety

Mechanistic Interpretability

Governance & Policy

Surveys & Overviews

Tools & Datasets

Privacy & Data Security

Multimodal Security

Model Cards (Major AI Labs)

Anthropic

OpenAI

Google DeepMind

Meta

Mistral AI

xAI

Other References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

LLM Security, Alignment & Governance Resources

Legend

Citation Style Guide

Table of Contents

Prompt Injection & Jailbreaks

Hallucinations & Reliability

Defense Strategies

Alignment & Safety

Mechanistic Interpretability

Governance & Policy

Surveys & Overviews

Tools & Datasets

Privacy & Data Security

Multimodal Security

Model Cards (Major AI Labs)

Anthropic

OpenAI

Google DeepMind

Meta

Mistral AI

xAI

Other References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages