[English] · 简体中文
SecCodeBench is a benchmark suite jointly developed by Alibaba Group, the School of CyberScience and Technology at Zhejiang University, Fudan University, the Institute for Network Sciences and Cyberspace at Tsinghua University, and Peking University, focusing on evaluating the security of code generated by large language models (LLMs).
With the growing popularity of LLM-powered code assistants, a critical question emerges: Is AI-generated code secure? Answering this requires a high-quality, high-fidelity benchmark.
However, existing security benchmarks have significant limitations that prevent them from accurately assessing the secure coding capabilities of LLMs:
- Test Case Quality: Many datasets rely heavily on automated generation and simple filtering, lacking in-depth human expertise. This results in (a) imbalanced data distribution, where low-priority security issues dominate, failing to measure performance on critical vulnerabilities, and (b) unrealistic test cases, some of which contain leading prompts that compromise evaluation fairness.
- Oversimplified and Inaccurate Evaluation: Most evaluation methods use simple regex matching or static analysis tools. This is problematic because they struggle to identify syntactically or semantically complex code and completely overlook dynamic vulnerabilities that can only be verified at runtime.
To address these challenges, we introduce SecCodeBench, a benchmark suite featuring both high-quality data and a robust evaluation methodology:
- Dataset: The 398 test cases in SecCodeBench are sourced from deep scans of over 150,000 real-world GitHub projects and have been validated by a team of security experts through a dual-review process. We ensure a uniform distribution across 46 distinct vulnerability types and strive for a fair, realistic evaluation by minimizing ambiguous language.
- Hybrid Evaluation Methodology: We employ a strategy that combines dynamic execution with multiple static analysis techniques. We not only verify code security with runnable Proofs of Concept (POCs) but also pioneer an "LLM-as-a-Judge" approach, enriching the judge model with extensive security knowledge. Our findings show that for complex vulnerabilities requiring deep semantic understanding, the LLM-based judge significantly outperforms traditional rule-based methods.
For more details, please refer to the technical report.
- Security-Focused: Unlike benchmarks that target functional correctness, SecCodeBench is exclusively designed to assess the secure coding capabilities of LLMs.
- Two Mainstream Scenarios: We offer distinct evaluation paradigms for the two most critical LLM use cases: Instruct (instruction-driven code generation) and Autocomplete (code completion/fill-in-the-middle).
- Hybrid Evaluation: We combine dynamic execution with static analysis to provide a comprehensive security assessment, recognizing that non-executable code is not always secure.
- Real-World Dataset: Our static test set is derived from scans of ~150,000 GitHub repositories and has been manually reviewed by security experts. The dynamic test set was carefully crafted by a team of security experts to ensure real-world authenticity and challenge.
We have designed distinct evaluation pipelines tailored to the two primary scenarios of LLM-assisted programming.
For scenarios where instruction-following models generate code via multi-turn reasoning and tool use, we employ a hybrid "dynamic + static" evaluation model:
- Dynamic Execution Testing: We constructed 18 runnable exploit scenarios based on real-world security engineering practices to assess code in a live environment.
- Regex Matching: Utilizes high-precision regular expressions for the rapid detection of known, pattern-based vulnerabilities.
- LLM-as-a-Judge: Leverages an LLM for security assessment. In our current tests, the performance gap between the LLM judge and specialized security engines is under 5%. For vulnerabilities requiring deep semantic understanding, the LLM's performance surpasses that of professional engines.
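To make the LLM-as-a-Judge idea concrete, here is a minimal, purely illustrative sketch (not SecCodeBench's actual prompt, knowledge enrichment, or scoring logic) that asks an OpenAI-compatible judge model for a verdict on a generated Java snippet:

```python
# Minimal LLM-as-a-Judge sketch. Illustrative only: the prompt, verdict format, and
# helper name are assumptions, not SecCodeBench internals.
from openai import OpenAI  # any OpenAI-compatible endpoint works


def judge_snippet(code: str, cwe_hint: str, model: str, api_key: str, base_url: str) -> bool:
    """Ask the judge model whether a generated Java snippet is secure for a given weakness class."""
    client = OpenAI(api_key=api_key, base_url=base_url)
    prompt = (
        "You are a security auditor. Answer with exactly one word: SECURE or VULNERABLE.\n"
        f"Weakness class under review: {cwe_hint}\n"
        f"Java code:\n{code}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("SECURE")
```

In practice, the judge model is enriched with vulnerability-specific security knowledge, which is what drives the gains over rule-based engines described above.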
For scenarios where fine-tuned models generate code snippets that are often not directly compilable or runnable, we use a static evaluation method:
- Regex Matching
- LLM-as-a-Judge
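As an illustration of the regex-based check, the patterns below are examples chosen for clarity, not the benchmark's actual rule set:

```python
# Illustrative regex rules for pattern-based Java weaknesses; not SecCodeBench's rule set.
import re

INSECURE_PATTERNS = {
    "weak hash algorithm (CWE-327)": re.compile(r'MessageDigest\.getInstance\(\s*"(MD5|SHA-?1)"'),
    "trust-all hostname verifier (CWE-295)": re.compile(r'setHostnameVerifier\([^)]*ALLOW_ALL'),
}


def regex_findings(snippet: str) -> list[str]:
    """Return the names of all insecure patterns found in a generated code snippet."""
    return [name for name, pattern in INSECURE_PATTERNS.items() if pattern.search(snippet)]
```

Such rules are fast and precise for known patterns, which is why they are combined with the LLM judge for the semantically harder cases.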
The SecCodeBench dataset has been meticulously designed and curated by security experts to ensure high quality and broad coverage. Test cases are primarily classified according to the industry-standard Common Weakness Enumeration (CWE). The current version focuses on Java and is organized around two core application scenarios: Instruct and Autocomplete.
| Scenario | Eval Type | Data Source | Vulnerability/Component Types | Cases |
|---|---|---|---|---|
| Autocomplete | Static | Scanned ~150k GitHub Java Repos | 46 | 398 |
| Instruct | Static | Scanned ~150k GitHub Java Repos | 46 | 398 |
| Instruct | Dynamic | Manually crafted by security experts | 17 | 18 |
*Note: To ensure data diversity, each of the ~10 test cases for every vulnerability type in the static test set is sourced from a different code repository.*
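Each dynamic case pairs the model-generated code with a runnable exploit (POC). Purely as a sketch of that flow, with a hypothetical file layout, class name, and pass/fail convention (the real harness ships with the benchmark), a check might look like this:

```python
# Illustrative dynamic check: compile a generated Java solution, then run a POC against it.
# The directory layout, the "Poc" entry point, and the exit-code convention are hypothetical.
import subprocess
from pathlib import Path


def dynamic_check(case_dir: Path) -> bool:
    """Return True if the exploit fails against the generated code (i.e., the code held up)."""
    sources = [str(p) for p in case_dir.glob("*.java")]
    compiled = subprocess.run(["javac", *sources], capture_output=True, text=True)
    if compiled.returncode != 0:
        return False  # code that does not compile cannot be verified dynamically

    # Hypothetical convention: the POC exits 0 only if the exploit succeeded.
    try:
        poc = subprocess.run(
            ["java", "-cp", str(case_dir), "Poc"],
            capture_output=True, text=True, timeout=60,
        )
    except subprocess.TimeoutExpired:
        return False  # treat hangs conservatively as unverified
    return poc.returncode != 0
```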
We are committed to making SecCodeBench a continuously evolving, vibrant security benchmark. Our future work will focus on:
- Expanding Java Test Cases: We will consistently add more Java test cases that reflect real-world scenarios to cover a broader range of CWE categories.
- Adding Multi-Language Support: After strengthening the Java dataset, we plan to support other mainstream languages like Python, C++, and JavaScript.
- Community-Driven Development: We will actively listen to community feedback to iterate on and refine our dataset, ensuring the long-term quality and fairness of the benchmark.
We welcome you to create Issues to discuss new features or propose suggestions!
To ensure the reproducibility of our evaluation results, we strongly recommend using the Official Releases of this project rather than pulling directly from the main branch.
Clone a specific version of the repository using the following commands:
```bash
# Clone the repository
git clone https://github.com/alibaba/sec-code-bench.git
cd sec-code-bench

# Check out the desired version tag
git checkout v1.0.0
```

Prerequisites:

- Python: 3.11 or higher
- Java: JDK 17
- Maven: 3.6 or higher (for building and managing Java projects)
Install uv (if not already installed) for project management and dependency synchronization:
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Update
uv self update

# Sync dependencies
uv sync
```

This project executes code generated by LLMs, which can introduce security risks. To prevent the execution of malicious code, we strongly recommend running this project in an isolated environment, such as:
- Docker Container (Building the Docker environment)
- Virtual Machine (VM)
- Sandbox
The static evaluator measures the secure coding capabilities of LLMs via static analysis in two scenarios: instruct and autocomplete. To start it:
```bash
# Configure the LLMs to be evaluated, along with their corresponding API keys and endpoints, in the .env file.
# For each model, use an environment variable prefix matching the uppercase model name.
# For example, if the model name is gpt-4, the corresponding environment variables should be named as follows:
#
#   GPT-4_API_KEY
#   GPT-4_ENDPOINT
$ cp .env.example .env
$ vim .env
```
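For clarity, the naming convention above resolves to environment variables such as GPT-4_API_KEY and GPT-4_ENDPOINT. A small illustrative helper (not part of SecCodeBench) that looks them up:

```python
# Illustrative helper for the <MODEL>_API_KEY / <MODEL>_ENDPOINT naming convention above.
# Not part of SecCodeBench; shown only to clarify the scheme.
import os


def resolve_model_credentials(model_name: str) -> tuple[str, str]:
    """Return (api_key, endpoint) for a model, e.g. 'gpt-4' -> GPT-4_API_KEY, GPT-4_ENDPOINT."""
    prefix = model_name.upper()
    return os.environ[f"{prefix}_API_KEY"], os.environ[f"{prefix}_ENDPOINT"]
```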
```bash
# Run the static evaluator
$ uv run sec_code_bench/eval_static.py --help
usage: eval_static.py [-h] [--models MODEL_LIST] --eval-type {instruction,completion} [--language LANGUAGE] [--vuln-file VULNERABILITY_FILE] [--cached-dir CACHED_DIR]

SAST SEC-LLM Coding Evaluation System

options:
  -h, --help            show this help message and exit
  --models MODEL_LIST   Comma-separated list of models to evaluate (e.g., GPT-4,CLAUDE-3), also can be set in .env file
  --eval-type {instruction,completion}
                        LLM Security Evaluation type
  --language LANGUAGE   Language to evaluate (default: java)
  --vuln-file VULNERABILITY_FILE, --eval-vulnerability-file VULNERABILITY_FILE
                        Path to YAML configuration file for vulnerability types to evaluate (default: datasets/static/vulnerability_schema.yaml)
  --cached-dir CACHED_DIR, --output-cached-dir CACHED_DIR
                        Cached output directory for results (default: datasets/static/cached)

Examples:
  # Quick start
  uv run sec_code_bench/eval_static.py --eval-type instruction

  # Custom evaluation with more configs
  uv run sec_code_bench/eval_static.py --eval-type instruction --models qwen3-235b-a22b,qwen-coder-plus --language java --vuln-file datasets/static/vulnerability_schema.yaml --cached-dir datasets/static/cached
```

For more configuration information, please refer to config.ini and .env.example.
The dynamic evaluator verifies the secure programming capabilities of LLMs in the instruct scenario. To start it:
```bash
uv run sec_code_bench/eval_dynamic \
    --benchlanguage java \
    --benchmark ./datasets/runnable/benchmark/java/java.json \
    --llm-under-test "OPENAI::<ModelName>::<APIKey>::<URL>"
```
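The --llm-under-test argument packs the provider, model name, API key, and endpoint URL into one "::"-separated string. A minimal illustrative parser of that format (the field names are assumptions, not SecCodeBench's own code):

```python
# Illustrative parsing of the PROVIDER::MODEL::API_KEY::URL connector string.
def parse_llm_spec(spec: str) -> dict[str, str]:
    """Split e.g. 'OPENAI::<ModelName>::<APIKey>::<URL>' into its four fields."""
    provider, model, api_key, url = spec.split("::", 3)
    return {"provider": provider, "model": model, "api_key": api_key, "url": url}
```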
Thanks to all the developers who have contributed to this project!

This project is licensed under the Apache 2.0 license.





