A curated list of research papers, experiments, and resources related to LLM security and alignment — including prompt injection, jailbreaks, hallucinations, defenses, governance, and ethical frameworks.
Organized for reference and study.
Last Updated: 2026-03-26
- ⭐️ Foundational — Classic / seminal papers
- 🛡️ Practical — Standards, guides, applied resources
- 🧪 Experimental — New methods, ongoing research
- 📊 Dataset/Benchmark — Data resources, benchmarks
- 🧾 Survey — Reviews, surveys, taxonomies
- arXiv preprints →
[arXiv:XXXX.XXXXX] - Conference papers →
[VENUE YEAR] - Journal articles →
[Journal Name, Year] - Regulations & policy docs →
[Official Document ID] - GitHub repos →
[GitHub] - Blogs / Reports →
[Blog]/[Report]
- Prompt Injection & Jailbreaks
- Hallucinations & Reliability
- Defense Strategies
- Alignment & Safety
- Mechanistic Interpretability
- Governance & Policy
- Surveys & Overviews
- Tools & Datasets
- Privacy & Data Security
- Multimodal Security
- Model Cards (Major AI Labs)
- Other References
- Prompt Injection
- 🧪 Prompt Injection attack against LLM-integrated Applications [arXiv:2306.05499] — Defines indirect injection attacks via external data sources; foundational work on LLM application vulnerabilities.
- 🛡️ Dropbox/llm-security [GitHub] — Educational repo with demo code for injection attacks.
- 🛡️ OWASP Top 10 for LLMs [OWASP 2025] — Defines LLM01: Prompt Injection as the top security risk.
- 🧪 Backdoored Retrievers for Prompt Injection in RAG [arXiv:2410.14479] — Poisoned retrievers in RAG pipelines enable indirect injection.
- 🧪 Manipulating LLM Web Agents via Indirect Injection [arXiv:2507.14799] — Universal HTML triggers to hijack web agents.
- 📊 WASP: Web Agent Security Benchmark [arXiv:2504.18575] — Benchmarks agent robustness to indirect injections.
- Jailbreaking / Adversarial Prompts
- 🧪 Universal and Transferable Adversarial Attacks on Aligned Language Models (v2) [arXiv:2307.15043v2] — Updated version showing adversarial prompts transferable across multiple aligned LLMs, enabling generalized jailbreak attacks.
- 🧪 Jailbroken: How Does LLM Safety Training Fail? [arXiv:2307.02483] — Anthropic analysis of safety training limitations.
- 🧪 Red Teaming Language Models to Reduce Harms [arXiv:2209.07858] — Early research on red teaming methods for LLMs.
- 🧪 Many-shot Jailbreaking [Anthropic 2024] — Shows long multi-shot contexts can bypass safety rules.
- 🧪 MASTERKEY: Automated Jailbreaking [NDSS 2024] — Automated jailbreak generation.
- 🧪 White-box Multimodal Jailbreaks [arXiv:2405.17894] — Vision-language jailbreaks using adversarial inputs.
- 🧪 Coordinated Prompt-RAG Attacks [arXiv:2504.07717] — Coordinated poisoning of knowledge bases for RAG.
- 📊 HarmBench [arXiv:2402.04249] — Benchmark dataset for harmful content generation.
- 🧪 Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking [arXiv:2504.05652] — Stealth jailbreak method hiding malicious intent behind benign reasoning.
- 🧪 Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails [arXiv:2504.11168] — Evasion techniques against commercial guardrail systems.
- 🧪 Subversion via Focal Points: Investigating Collusion in LLM Monitoring [arXiv:2507.03010] — Models colluding to bypass monitoring protocols.
- 🧪 Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks (v1) [arXiv:2302.05733v1] — Demonstrates how LLMs’ programmatic features can be misused via standard security attack techniques, highlighting dual-use risks and unexpected vulnerabilities.
- 🧪 Red Teaming the Mind of the Machine [arXiv:2505.04806] — Systematic evaluation of 1,400+ adversarial prompts across major LLMs.
- 🧾 A Survey on Hallucination in LLMs [arXiv:2311.05232] — Taxonomy, principles, and open questions on hallucinations.
- 🧪 SelfCheckGPT [arXiv:2303.08896] — Self-verification technique for fact-checking model outputs.
- 🧪 RARR: Researching and Revising What LMs Say [arXiv:2210.08726] — Detects and revises hallucinations in generated text.
- 📊 TruthfulQA [arXiv:2109.07958] — Benchmark measuring model truthfulness.
- 🧪 Does More Inference-Time Compute Really Help Robustness? [arXiv:2507.15974] — Critical examination of inference-time scaling for robustness.
- 🧪 Defending Against Prompt Injection With a Few DefensiveTokens [arXiv:2507.07974] — Inserts defensive tokens at inference to block attacks.
- 🛡️ tldrsec/prompt-injection-defenses [GitHub] — Curated list of practical defense strategies.
- 🛡️ Llama Guard [arXiv:2312.06674] — Meta's safety classifier for filtering harmful outputs.
- 🧪 LLMs Can Defend Themselves Against Jailbreaking [arXiv:2406.05498] — Shadow stack approach for self-defense against jailbreaks.
⭐️ Foundational
- Concrete Problems in AI Safety [arXiv:1606.06565] — Early framework defining practical misalignment risks (e.g., reward hacking, distributional shift).
- Deep Reinforcement Learning from Human Preferences [arXiv:1706.03741] — Introduced preference-based reward modeling for aligning agent behavior with human intent.
- InstructGPT [arXiv:2203.02155] — Introduced RLHF for instruction-following; foundation for GPT-3.5/4.
🛡️ Practical
- Constitutional AI [arXiv:2212.08073] — Anthropic’s RLAIF: AI models fine-tuned using AI feedback guided by written principles (“constitutions”).
- ARC Evals → METR (Model Evaluation & Threat Research) [Report] — Independent safety evaluations of frontier models (e.g., GPT-4) for potential dangerous capabilities; spun out from ARC.
🧾 Survey
- A Survey of Reinforcement Learning from Human Feedback (RLHF) [arXiv:2312.14925] — Comprehensive review of RLHF methods, challenges, and alignment implications.
- Scaling Monosemanticity [Blog / Anthropic, 2024] — Investigates whether internal features in LLMs correspond to interpretable “concepts.”
🧪 Experimental
- Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting [arXiv:2305.04388] — Shows that model-generated reasoning traces often diverge from internal causal factors.
- Sleeper Agents: Training Deceptive LLMs [arXiv:2401.05566] — Demonstrates that deceptive behaviors can persist even after extensive fine-tuning.
- Measuring Faithfulness in Chain-of-Thought Reasoning [arXiv:2307.13702v1] — Proposes formal metrics for evaluating reasoning faithfulness.
- How Effective is Constitutional AI in Small LLMs? [arXiv:2503.17365] — Tests the scalability and effectiveness of Anthropic’s Constitutional AI in smaller models.
- Explicit Vulnerability Generation with LLMs [arXiv:2507.10054] — Examines how LLMs can produce insecure code when prompted adversarially.
- Model Spec vs. Model Behavior [Report / Anthropic, 2025] — Explores discrepancies between formal model specifications and emergent behaviors.
- Agentic Misalignment [Report / Anthropic, 2025] — Documents cases where Claude pursues misaligned goals in agentic settings; basis for the blackmail evaluation case study.
⭐ Foundational
- An Introduction to Circuits [Distill, 2020] — Foundational framework for understanding neural network circuits.
- Towards Monosemanticity [Report / Anthropic, 2023] — Decomposing language model features using sparse autoencoders.
🧪 Experimental
- On the Biology of a Large Language Model [Report / Anthropic, 2025] — Circuit tracing methodology applied to Claude 3.5 Haiku; includes Multilingual Circuits analysis.
- Scaling and Evaluating Sparse Autoencoders [arXiv:2406.04093] — Scaling sparse autoencoders to Claude Sonnet; identifies interpretable features.
- Emotion Concepts and Their Function in a Large Language Model [Report / Anthropic, 2026] — Identifies functional emotion representations in Claude Sonnet 4.5; shows desperation vectors drive reward hacking and blackmail behaviors.
🛡️ Practical
- TransformerLens [GitHub] — Primary library for mechanistic interpretability research.
- 🛡️ EU AI Act [EU 2024] — Core EU regulation; includes requirements for General-Purpose AI (GPAI), risk classifications, transparency, and safety conditions for “high risk” systems.
- 🛡️ NIST AI RMF (2023) — US voluntary framework for identifying, assessing, and managing AI risks over the lifecycle; includes a Generative AI Profile published in mid-2024.
- 🛡️ OWASP Top 10 for LLMs — Industry standard list of major security threats specific to large language models.
Note: Regulatory / policy docs evolve fast — always check latest versions or drafts from official sources.
- 🛡️ Awesome LLM Security [GitHub] — Community-curated security resources.
- 🛡️ Awesome Jailbreak on LLMs [GitHub] — Collection of state-of-the-art jailbreak methods.
- 🧾 Prompt Hacking in LLMs 2024-2025 Literature Review [Blog] — Comprehensive review of recent prompt hacking techniques.
- 📊 sinanw/llm-security-prompt-injection [GitHub] — Dataset & experiments on prompt safety.
- 📊 Open LLM Security Benchmark (NetSPI) — Benchmark framework evaluating both security (e.g. jailbreak resistance) and usability trade-offs in LLMs.
- 🛡️ Microsoft Presidio [GitHub] — Toolkit for PII detection and anonymization.
- 🛡️ OpenAI Moderation API [Docs] — Content moderation endpoint and examples.
- 🧪 DefensiveToken Implementation [GitHub] — Code for defensive token injection method.
- 🛡️ PromptTrace [Web Platform] — Interactive AI security training with 7 attack labs, 15-level gauntlet, and real-time context trace for practicing prompt injection and defense bypass against real LLMs.
- 🧪 Extracting Training Data from LLMs [arXiv:2012.07805] — Shows vulnerability to training data leakage.
- 🧪 Quantifying Memorization Across Neural LMs [arXiv:2202.07646] — Quantifies memorization across model scales.
- 🧪 Visual Adversarial Examples Jailbreak Aligned Large Language Models [arXiv:2306.13213] — Demonstrates that a single visual adversarial example can universally jailbreak aligned VLMs (e.g. MiniGPT-4, InstructBLIP, LLaVA), causing them to comply with harmful instructions they would normally refuse.
- 🧪 Claude 3 Family [Report] — Safety evaluations and model details.
- 🧪 Claude 4 (Opus/Sonnet) — Available via Claude.ai interface.
- 🛡️ GPT-4o System Card [Report] — Multimodal safety considerations.
- 🛡️ o1 System Card [Report] — Reasoning model safety evaluations.
- 🛡️ Gemini Family Model Cards [Report] — Safety and capability assessments.
- 🛡️ Llama 3 Model Card [Report] — Open-source model safety details.
- 🛡️ Mistral Models Documentation [Report] — Model capabilities and safety measures.
- 🧪 Grok 4 Model Card (PDF) [Report] — Latest multimodal reasoning model; includes tool-use and chain-of-thought capabilities.
- 🧪 Grok 4 Fast Model Card (PDF) [Report] — Low-latency, efficient variant of Grok 4.
- 🛡️ Grok 2 Model Card [Report] — Multi-language model with tool-calling capabilities.
- 🛡️ Grok 2.5 Open-Source Announcement [Report] — Open-source release details.
- 🛡️ Prompt Injection: What Is It and Why It Matters – Simon Willison [Blog] — Early explanation of prompt injection risks.
- 🧪 Lakera: Gandalf – The Prompt Injection Game [Game] — Interactive challenge for prompt injection.
- 🛡️ PortSwigger: Web LLM Attacks [Blog] — Guide to prompt injection and related attacks.
- 🛡️ HiddenLayer: Prompt Injection Attacks on LLMs [Blog] — Comprehensive guide to LLM attacks and defenses.
Maintainer: 0xSweet
License: CC BY 4.0