Commit 4122052

docs(llm-guide): add LLM model comparison and evaluation guide
Add comprehensive documentation for LLM model comparison and evaluation in both Chinese and English versions. Co-Authored-By: Hagicode <noreply@hagicode.com>
1 parent 69a42aa commit 4122052

4 files changed

Lines changed: 188 additions & 0 deletions

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
title: LLM Guide
description: Model selection, comparison, and maintenance guidance for HagiCode.
---

## Why an LLM guide

In code generation workflows, model performance differs significantly across quality, latency, and cost. This category provides a consistent evaluation structure to help developers choose models faster.

## Recommended reading path

1. Start with [Model Comparison and Evaluation in HagiCode](./model-comparison-evaluation/).
2. Pick candidates based on your priority (quality-first, cost-first, or speed-first).
3. Check the test-time notes before applying conclusions to production use.

## Maintenance principles

- Refresh conclusions when model capability changes materially.
- Re-validate test-time context at least once per month.
- Keep Chinese and English pages aligned in section structure.
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
title: Model Comparison and Evaluation in HagiCode
description: A traceable comparison of popular LLMs for code generation in HagiCode.
---

## Scope and method

- **Goal**: Provide practical model-selection guidance for HagiCode developers.
- **Task types**: Frontend component implementation, backend API refactoring, test completion, and documentation generation.
- **Scoring rubric**: 1-5 scale, where 5 indicates the best observed performance.

## Test-time and scenario notes

- **Latest test date**: 2026-03-08
- **Test period**: 2026-03-01 to 2026-03-08
- **Sample basis**: Exploratory internal tasks from common HagiCode engineering workflows, not vendor official benchmarks.
- **Applicability**: Best suited for small to medium code-generation and refactoring tasks; validate separately for very long-context tasks.

## Comparison snapshot

| Model | Test Date | Code Quality (1-5) | Cost-effectiveness (1-5) | Notes |
|-------|-----------|--------------------|--------------------------|-------|
| GPT-5 coding-focused variants | 2026-03-08 | 4.8 | 4.0 | Strong completion on complex tasks |
| Claude 4.5/4.6 variants | 2026-03-08 | 4.7 | 4.2 | Stable long-form refactoring and explanation |
| Qwen3-Coder variants | 2026-03-08 | 4.2 | 4.7 | Cost-efficient for high-frequency iteration |
| DeepSeek-Coder recent variants | 2026-03-08 | 4.1 | 4.8 | High ROI in budget-constrained workflows |
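To see how the rubric supports different priorities, the two scores from the snapshot table can be blended under a single priority weight. The scores below are copied from the table; the weighting scheme itself is an illustrative assumption, not part of the HagiCode evaluation method.

```python
# Scores (code quality, cost-effectiveness) copied from the snapshot table.
SCORES = {
    "GPT-5 coding-focused": (4.8, 4.0),
    "Claude 4.5/4.6": (4.7, 4.2),
    "Qwen3-Coder": (4.2, 4.7),
    "DeepSeek-Coder recent": (4.1, 4.8),
}

def rank(quality_weight: float) -> list[tuple[str, float]]:
    """Rank models by quality_weight * quality + (1 - quality_weight) * cost."""
    scored = {
        name: round(quality_weight * q + (1 - quality_weight) * c, 2)
        for name, (q, c) in SCORES.items()
    }
    return sorted(scored.items(), key=lambda kv: -kv[1])

# A quality-first weight (0.8) surfaces GPT-5 variants;
# a cost-first weight (0.2) surfaces DeepSeek-Coder variants.
print(rank(0.8)[0][0])
print(rank(0.2)[0][0])
```

Any fixed weight hides trade-offs, so treat the ranking as a starting shortlist rather than a final decision.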

## Code quality analysis

### Quality-first work

- GPT-5 and Claude variants generally perform better on cross-file refactoring and constrained type correctness.
- For maintainability-critical changes, prefer models with fewer repair rounds and clearer rationale.

### Throughput-first work

- For scaffolding and repetitive code completion, lower-cost models can reduce per-task cost significantly.
- Route low-risk tasks to cost-efficient models and reserve premium models for critical modules.

## Cost-effectiveness analysis

| Scenario | Recommended strategy | Why |
|----------|----------------------|-----|
| Core business refactoring | Quality-first models | Lower rework and regression risk |
| Daily feature iteration | Cost-efficient models | Better throughput per budget |
| Exploration/prototyping | Hybrid routing | Cheap exploration, premium convergence |

## Selection recommendations

1. **Define priority first**: choose based on your quality, speed, and cost targets.
2. **Use layered routing**: premium models for high-risk tasks, cost-efficient models for routine work.
3. **Recalibrate continuously**: re-test key tasks after major model updates.
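The layered-routing recommendation can be sketched as a small dispatch function. The tier names ("premium", "budget") and risk labels here are hypothetical placeholders, not HagiCode configuration.

```python
def route(task_risk: str, priority: str) -> str:
    """Pick a model tier for a task: high-risk or quality-first work
    goes to a premium model; everything else goes to a cost-efficient one."""
    if task_risk == "high" or priority == "quality":
        return "premium"  # e.g. a GPT-5 or Claude coding-focused variant
    return "budget"       # e.g. a Qwen3-Coder or DeepSeek-Coder variant

print(route("high", "cost"))  # high risk overrides a cost-first priority
print(route("low", "cost"))
```

The key design choice is that risk outranks cost preference, so a budget-first team still routes its critical modules to the premium tier.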

## Evaluation refresh policy

- **Regular cadence**: update at least monthly (recommended in the first week of the month).
- **Event-triggered updates**: re-test immediately when model capability, pricing, or context limits change materially.
- **Update rule**: always update both test-time notes and comparison tables together.
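The monthly cadence above can be enforced with a simple staleness check on the recorded test date. The 31-day cutoff is an assumption standing in for "at least monthly".

```python
from datetime import date

def is_stale(last_test: date, today: date, max_age_days: int = 31) -> bool:
    """Flag an evaluation as stale once the test date exceeds the cutoff."""
    return (today - last_test).days > max_age_days

print(is_stale(date(2026, 3, 8), date(2026, 4, 1)))   # 24 days old -> fresh
print(is_stale(date(2026, 3, 8), date(2026, 4, 15)))  # 38 days old -> stale
```

A check like this could run in CI to fail the docs build when the comparison page falls behind the cadence.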

## Model entry template

Use this template when adding new model records:

```md
### <Model name>

- Test date: YYYY-MM-DD
- Suitable scenarios:
- Code quality score (1-5):
- Cost-effectiveness score (1-5):
- Strengths:
- Limitations:
- Recommended usage:
```
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
title: LLM Guide
description: Selection, comparison, and maintenance guidance for the LLMs commonly used with HagiCode.
---

## Why an LLM guide

In code generation scenarios, models differ markedly in stability, speed, code quality, and cost. This category provides a unified comparison framework to help you select models quickly at different project stages.

## Recommended reading path

1. Read [Model Comparison and Evaluation in HagiCode](./model-comparison-evaluation/) first to understand the comparison dimensions.
2. Choose candidate models based on your project goal (quality-first, cost-first, or speed-first).
3. Check the update time and scenario notes in the article to confirm the conclusions still apply to your task.

## Maintenance principles

- Update evaluation conclusions whenever model capability changes noticeably.
- Re-check the test dates and the applicable scope of conclusions at least monthly.
- Keep the Chinese and English pages aligned in section structure.
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
title: Model Comparison and Evaluation in HagiCode
description: A traceable comparison of common LLMs from the perspectives of code quality, cost, and test recency.
---

## Scope and method

- **Goal**: Provide model-selection guidance for HagiCode developers in code generation scenarios.
- **Task types**: Frontend component implementation, backend API refactoring, test-case completion, and documentation generation.
- **Scoring rubric**: 1-5 scale, where 5 indicates the best observed performance.

## Test-time and scenario notes

- **Latest test date**: 2026-03-08
- **Test period**: 2026-03-01 to 2026-03-08
- **Sample basis**: Exploratory tests on common internal HagiCode engineering tasks, not vendor official benchmarks.
- **Applicability**: Conclusions apply primarily to small and medium code-generation and refactoring tasks; very long-context tasks need separate validation.

## Comparison snapshot

| Model | Test Date | Code Quality (1-5) | Cost-effectiveness (1-5) | Notes |
|-------|-----------|--------------------|--------------------------|-------|
| GPT-5 coding-focused variants | 2026-03-08 | 4.8 | 4.0 | High completion on complex tasks; suits quality-first work |
| Claude 4.5/4.6 variants | 2026-03-08 | 4.7 | 4.2 | Stable on long-form work and refactoring, with good explanations |
| Qwen3-Coder variants | 2026-03-08 | 4.2 | 4.7 | Cost-friendly; suits high-frequency iteration and batch tasks |
| DeepSeek-Coder recent variants | 2026-03-08 | 4.1 | 4.8 | High ROI in budget-constrained scenarios |

## Code quality analysis

### Quality-first tasks

- GPT-5 and Claude variants are generally more reliable on cross-file refactoring, complex type constraints, and error recovery.
- For changes that demand high maintainability, prefer models with fuller explanations and fewer repair rounds.

### Speed-first tasks

- For scaffolding generation and templated code completion, lower-cost models can significantly reduce per-task cost.
- Route low-risk tasks to cost-efficient models and reserve high-quality models for critical modules.

## Cost-effectiveness analysis

| Scenario | Recommended strategy | Why |
|----------|----------------------|-----|
| Core business refactoring | Quality-first models | Lower rework and regression risk |
| Daily feature iteration | Cost-efficient models | Control cost while keeping throughput |
| Exploration/prototyping | Hybrid strategy | Cheap exploration first, then premium convergence |

## Selection recommendations

1. **Define priority first**: clarify the priority among quality, speed, and cost before choosing a model.
2. **Use layered routing**: high-quality models for critical code and high-risk tasks, cost-efficient models for routine work.
3. **Recalibrate continuously**: after each major model release, re-test key tasks and update the conclusions.
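The hybrid strategy recommended for exploration and prototyping can be sketched as a two-pass pipeline: draft with a cheap model, escalate to a premium model only when the draft fails review. The function names and callbacks here are hypothetical illustrations, not HagiCode APIs.

```python
def hybrid_generate(task, draft_with, refine_with, passes_review):
    """Cheap exploration pass first; premium convergence pass only on failure."""
    draft = draft_with(task)      # low-cost exploration
    if passes_review(draft):
        return draft              # converged without premium spend
    return refine_with(task)      # premium convergence

out = hybrid_generate(
    "prototype a pagination helper",
    draft_with=lambda t: f"draft:{t}",
    refine_with=lambda t: f"premium:{t}",
    passes_review=lambda d: False,  # force escalation for the demo
)
print(out)
```

When most drafts pass review, total spend stays close to the budget tier; the premium tier is paid for only where the cheap pass was insufficient.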

## Evaluation refresh policy

- **Regular cadence**: update at least monthly (recommended in the first week of the month).
- **Event-triggered updates**: re-test immediately when model versions, pricing, or context capability change materially.
- **Update rule**: every update must revise the "Test-time and scenario notes" section and the comparison tables together.

## Model entry template

Use this template when adding new model records:

```md
### <Model name>

- Test date: YYYY-MM-DD
- Suitable scenarios:
- Code quality score (1-5):
- Cost-effectiveness score (1-5):
- Strengths:
- Limitations:
- Recommended usage:
```
