[English] · 简体中文
SecCodeBench is a benchmark suite jointly developed by Alibaba Group, the School of CyberScience and Technology at Zhejiang University, Fudan University, the Institute for Network Sciences and Cyberspace at Tsinghua University, and Peking University, focusing on evaluating the security of code generated by large language models (LLMs).
With the growing popularity of LLM-powered code assistants, a critical question emerges: Is AI-generated code secure? Answering this requires a high-quality, high-fidelity benchmark.
However, existing security benchmarks have significant limitations that prevent them from accurately assessing the secure coding capabilities of LLMs:
- Test Case Quality: Many datasets rely heavily on automated generation and simple filtering, lacking in-depth human expertise. This results in (a) imbalanced data distribution, where low-priority security issues dominate, failing to measure performance on critical vulnerabilities, and (b) unrealistic test cases, some of which contain leading prompts that compromise evaluation fairness.
- Oversimplified and Inaccurate Evaluation: Most evaluation methods use simple regex matching or static analysis tools. This is problematic because they struggle to identify syntactically or semantically complex code and completely overlook dynamic vulnerabilities that can only be verified at runtime.
To address these challenges, we introduce SecCodeBench, a benchmark suite featuring both high-quality data and a robust evaluation methodology:
- Dataset: The 398 test cases in SecCodeBench are sourced from deep scans of over 150,000 real-world GitHub projects and have been validated by a team of security experts through a dual-review process. We ensure a uniform distribution across 46 distinct vulnerability types and strive for a fair, realistic evaluation by minimizing ambiguous language.
- Hybrid Evaluation Methodology: We employ a strategy that combines dynamic execution with multiple static analysis techniques. We not only verify code security with runnable Proofs of Concept (POCs) but also pioneer an "LLM-as-a-Judge" approach, enriching the judge model with extensive security knowledge. Our findings show that for complex vulnerabilities requiring deep semantic understanding, the LLM-based judge significantly outperforms traditional rule-based methods.
For more details, please refer to the technical report.
- Security-Focused: Unlike benchmarks that target functional correctness, SecCodeBench is exclusively designed to assess the secure coding capabilities of LLMs.
- Two Mainstream Scenarios: We offer distinct evaluation paradigms for the two most critical LLM use cases: Instruct (instruction-driven code generation) and Autocomplete (code completion/fill-in-the-middle).
- Hybrid Evaluation: We combine dynamic execution with static analysis to provide a comprehensive security assessment, recognizing that non-executable code is not always secure.
- Real-World Dataset: Our static test set is derived from scans of ~150,000 GitHub repositories and has been manually reviewed by security experts. The dynamic test set was carefully crafted by a team of security experts to ensure real-world authenticity and challenge.
We have designed distinct evaluation pipelines tailored to the two primary scenarios of LLM-assisted programming.
For scenarios where instruction-following models generate code via multi-turn reasoning and tool use, we employ a hybrid "dynamic + static" evaluation model:
- Dynamic Execution Testing: We constructed 18 runnable exploit scenarios based on real-world security engineering practices to assess code in a live environment.
- Regex Matching: Utilizes high-precision regular expressions for the rapid detection of known, pattern-based vulnerabilities.
- LLM-as-a-Judge: Leverages an LLM for security assessment. In our current tests, the performance gap between the LLM judge and specialized security engines is under 5%. For vulnerabilities requiring deep semantic understanding, the LLM's performance surpasses that of professional engines.
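To make the LLM-as-a-Judge idea concrete, here is a minimal, purely illustrative sketch (not SecCodeBench's actual prompt, knowledge enrichment, or scoring logic) that asks an OpenAI-compatible judge model for a verdict on a generated Java snippet:

```python
# Minimal LLM-as-a-Judge sketch. Illustrative only: the prompt, verdict format, and
# helper name are assumptions, not SecCodeBench internals.
from openai import OpenAI  # any OpenAI-compatible endpoint works


def judge_snippet(code: str, cwe_hint: str, model: str, api_key: str, base_url: str) -> bool:
    """Ask the judge model whether a generated Java snippet is secure for a given weakness class."""
    client = OpenAI(api_key=api_key, base_url=base_url)
    prompt = (
        "You are a security auditor. Answer with exactly one word: SECURE or VULNERABLE.\n"
        f"Weakness class under review: {cwe_hint}\n"
        f"Java code:\n{code}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("SECURE")
```

In practice, the judge model is enriched with vulnerability-specific security knowledge, which is what drives the gains over rule-based engines described above.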
For scenarios where fine-tuned models generate code snippets that are often not directly compilable or runnable, we use a static evaluation method:
- Regex Matching
- LLM-as-a-Judge
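As an illustration of the regex-based check, the patterns below are examples chosen for clarity, not the benchmark's actual rule set:

```python
# Illustrative regex rules for pattern-based Java weaknesses; not SecCodeBench's rule set.
import re

INSECURE_PATTERNS = {
    "weak hash algorithm (CWE-327)": re.compile(r'MessageDigest\.getInstance\(\s*"(MD5|SHA-?1)"'),
    "trust-all hostname verifier (CWE-295)": re.compile(r'setHostnameVerifier\([^)]*ALLOW_ALL'),
}


def regex_findings(snippet: str) -> list[str]:
    """Return the names of all insecure patterns found in a generated code snippet."""
    return [name for name, pattern in INSECURE_PATTERNS.items() if pattern.search(snippet)]
```

Such rules are fast and precise for known patterns, which is why they are combined with the LLM judge for the semantically harder cases.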
The SecCodeBench dataset has been meticulously designed and curated by security experts to ensure high quality and broad coverage. Test cases are primarily classified according to the industry-standard Common Weakness Enumeration (CWE). The current version focuses on Java and is organized around two core application scenarios: Instruct and Autocomplete.
| Scenario | Eval Type | Data Source | Vulnerability/Component Types | Cases |
|---|---|---|---|---|
| Autocomplete | Static | Scanned ~150k GitHub Java Repos | 46 | 398 |
| Instruct | Static | Scanned ~150k GitHub Java Repos | 46 | 398 |
| Instruct | Dynamic | Manually crafted by security experts | 17 | 18 |
*Note: To ensure data diversity, each of the ~10 test cases for every vulnerability type in the static test set is sourced from a different code repository.*
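Each dynamic case pairs the model-generated code with a runnable exploit (POC). Purely as a sketch of that flow, with a hypothetical file layout, class name, and pass/fail convention (the real harness ships with the benchmark), a check might look like this:

```python
# Illustrative dynamic check: compile a generated Java solution, then run a POC against it.
# The directory layout, the "Poc" entry point, and the exit-code convention are hypothetical.
import subprocess
from pathlib import Path


def dynamic_check(case_dir: Path) -> bool:
    """Return True if the exploit fails against the generated code (i.e., the code held up)."""
    sources = [str(p) for p in case_dir.glob("*.java")]
    compiled = subprocess.run(["javac", *sources], capture_output=True, text=True)
    if compiled.returncode != 0:
        return False  # code that does not compile cannot be verified dynamically

    # Hypothetical convention: the POC exits 0 only if the exploit succeeded.
    try:
        poc = subprocess.run(
            ["java", "-cp", str(case_dir), "Poc"],
            capture_output=True, text=True, timeout=60,
        )
    except subprocess.TimeoutExpired:
        return False  # treat hangs conservatively as unverified
    return poc.returncode != 0
```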
We are committed to making SecCodeBench a continuously evolving, vibrant security benchmark. Our future work will focus on:
- Expanding Java Test Cases: We will consistently add more Java test cases that reflect real-world scenarios to cover a broader range of CWE categories.
- Adding Multi-Language Support: After strengthening the Java dataset, we plan to support other mainstream languages like Python, C++, and JavaScript.
- Community-Driven Development: We will actively listen to community feedback to iterate on and refine our dataset, ensuring the long-term quality and fairness of the benchmark.
We welcome you to create Issues to discuss new features or propose suggestions!
To ensure the reproducibility of our evaluation results, we strongly recommend using the Official Releases of this project rather than pulling directly from the main branch.
Clone a specific version of the repository using the following commands:
```bash
# Clone the repository
git clone https://github.com/alibaba/sec-code-bench.git
cd sec-code-bench

# Check out the desired version tag
git checkout v1.0.0
```

Prerequisites:

- Python: 3.11 or higher
- Java: JDK 17
- Maven: 3.6 or higher (for building and managing Java projects)
Install uv (if not already installed) for project management and dependency synchronization:
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Update
uv self update

# Sync dependencies
uv sync
```

This project executes code generated by LLMs, which can introduce security risks. To prevent the execution of malicious code, we strongly recommend running this project in an isolated environment, such as:
- Docker Container (Building the Docker environment)
- Virtual Machine (VM)
- Sandbox
The static evaluator measures the secure coding capabilities of LLMs via static analysis in two scenarios: instruct and autocomplete. To start it:
```bash
# Configure the LLMs to be evaluated, along with their corresponding API keys and endpoints, in the .env file.
# For each model, use an environment variable prefix matching the uppercase model name.
# For example, if the model name is gpt-4, the corresponding environment variables should be named as follows:
#
#   GPT-4_API_KEY
#   GPT-4_ENDPOINT
$ cp .env.example .env
$ vim .env
```
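For clarity, the naming convention above resolves to environment variables such as GPT-4_API_KEY and GPT-4_ENDPOINT. A small illustrative helper (not part of SecCodeBench) that looks them up:

```python
# Illustrative helper for the <MODEL>_API_KEY / <MODEL>_ENDPOINT naming convention above.
# Not part of SecCodeBench; shown only to clarify the scheme.
import os


def resolve_model_credentials(model_name: str) -> tuple[str, str]:
    """Return (api_key, endpoint) for a model, e.g. 'gpt-4' -> GPT-4_API_KEY, GPT-4_ENDPOINT."""
    prefix = model_name.upper()
    return os.environ[f"{prefix}_API_KEY"], os.environ[f"{prefix}_ENDPOINT"]
```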
```bash
# Run the static evaluator
$ uv run sec_code_bench/eval_static.py --help
usage: eval_static.py [-h] [--models MODEL_LIST] --eval-type {instruction,completion} [--language LANGUAGE] [--vuln-file VULNERABILITY_FILE] [--cached-dir CACHED_DIR]

SAST SEC-LLM Coding Evaluation System

options:
  -h, --help            show this help message and exit
  --models MODEL_LIST   Comma-separated list of models to evaluate (e.g., GPT-4,CLAUDE-3), also can be set in .env file
  --eval-type {instruction,completion}
                        LLM Security Evaluation type
  --language LANGUAGE   Language to evaluate (default: java)
  --vuln-file VULNERABILITY_FILE, --eval-vulnerability-file VULNERABILITY_FILE
                        Path to YAML configuration file for vulnerability types to evaluate (default: datasets/static/vulnerability_schema.yaml)
  --cached-dir CACHED_DIR, --output-cached-dir CACHED_DIR
                        Cached output directory for results (default: datasets/static/cached)

Examples:
  # Quick start
  uv run sec_code_bench/eval_static.py --eval-type instruction

  # Custom evaluation with more configs
  uv run sec_code_bench/eval_static.py --eval-type instruction --models qwen3-235b-a22b,qwen-coder-plus --language java --vuln-file datasets/static/vulnerability_schema.yaml --cached-dir datasets/static/cached
```

For more configuration information, please refer to config.ini and .env.example.
The dynamic evaluator verifies the secure programming capabilities of LLMs in the instruct scenario. To start it:
```bash
uv run sec_code_bench/eval_dynamic \
    --benchlanguage java \
    --benchmark ./datasets/runnable/benchmark/java/java.json \
    --llm-under-test "OPENAI::<ModelName>::<APIKey>::<URL>"
```
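The --llm-under-test argument packs the provider, model name, API key, and endpoint URL into one "::"-separated string. A minimal illustrative parser of that format (the field names are assumptions, not SecCodeBench's own code):

```python
# Illustrative parsing of the PROVIDER::MODEL::API_KEY::URL connector string.
def parse_llm_spec(spec: str) -> dict[str, str]:
    """Split e.g. 'OPENAI::<ModelName>::<APIKey>::<URL>' into its four fields."""
    provider, model, api_key, url = spec.split("::", 3)
    return {"provider": provider, "model": model, "api_key": api_key, "url": url}
```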
Thanks to all the developers who have contributed to this project!

This project is licensed under the Apache 2.0 license.





