2 changes: 2 additions & 0 deletions .codex/skills/.gitignore
@@ -0,0 +1,2 @@
# Managed by Tessl
tessl:*
2 changes: 2 additions & 0 deletions .gemini/skills/.gitignore
@@ -0,0 +1,2 @@
# Managed by Tessl
tessl:*
2 changes: 2 additions & 0 deletions CLAUDE.md
@@ -33,3 +33,5 @@ Discovery → Ranking → Context Extraction → Test Gen + Optimization → Bas
# Agent Rules <!-- tessl-managed -->

@.tessl/RULES.md follow the [instructions](.tessl/RULES.md)

@AGENTS.md
12 changes: 12 additions & 0 deletions tessl.json
@@ -63,6 +63,18 @@
},
"tessl/pypi-filelock": {
"version": "3.19.0"
},
"codeflash/codeflash-rules": {
"version": "0.1.0"
},
"codeflash/codeflash-docs": {
"version": "0.1.0"
},
"codeflash/codeflash-skills": {
"version": "0.2.0"
},
"tessl-labs/tessl-skill-eval-scenarios": {
"version": "0.0.5"
}
}
}
108 changes: 108 additions & 0 deletions tiles/codeflash-docs/docs/ai-service.md
@@ -0,0 +1,108 @@
# AI Service

How codeflash communicates with the AI optimization backend.

## `AiServiceClient` (`api/aiservice.py`)

The client connects to the AI service at `https://app.codeflash.ai` (or `http://localhost:8000` when `CODEFLASH_AIS_SERVER=local`).

Authentication uses a Bearer token from `get_codeflash_api_key()`. All requests go through `make_ai_service_request()`, which handles JSON serialization via a Pydantic encoder.

Timeouts: 90 s for production, 300 s for local.
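The endpoint selection, auth header, and timeout rules above can be sketched as follows. This is illustrative only: `build_request()` and its return shape are hypothetical, not the actual `make_ai_service_request()` signature.

```python
# Sketch of request assembly, following the behavior described above.
# build_request() is a hypothetical helper, not the real codeflash API.
AIS_SERVER_URL = "https://app.codeflash.ai"
LOCAL_SERVER_URL = "http://localhost:8000"

def build_request(payload: dict, api_key: str, local: bool = False) -> dict:
    base = LOCAL_SERVER_URL if local else AIS_SERVER_URL
    return {
        "url": f"{base}/ai/optimize",
        "headers": {"Authorization": f"Bearer {api_key}"},
        # 90 s in production, 300 s when targeting a local server
        "timeout": 300 if local else 90,
        "json": payload,
    }
```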

## Endpoints

### `/ai/optimize` — Generate Candidates

Method: `optimize_code()`

Sends source code + dependency context to generate optimization candidates.

Payload:
- `source_code` — The read-writable code (markdown format)
- `dependency_code` — Read-only context code
- `trace_id` — Unique trace ID for the optimization run
- `language` — `"python"`, `"javascript"`, or `"typescript"`
- `n_candidates` — Number of candidates to generate (controlled by effort level)
- `is_async` — Whether the function is async
- `is_numerical_code` — Whether the code is numerical (affects optimization strategy)

Returns: `list[OptimizedCandidate]` with `source=OptimizedCandidateSource.OPTIMIZE`
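Assembled from the fields listed above, a payload might look like this. Values are placeholders; the exact wire format may differ.

```python
# Hypothetical /ai/optimize payload; field names follow the list above,
# values are illustrative.
payload = {
    "source_code": "```python\ndef f(x):\n    return x * 2\n```",  # markdown-wrapped code
    "dependency_code": "",          # read-only context
    "trace_id": "abc123",
    "language": "python",
    "n_candidates": 5,              # driven by effort level
    "is_async": False,
    "is_numerical_code": False,
}
```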

### `/ai/optimize_line_profiler` — Line-Profiler-Guided Candidates

Method: `optimize_python_code_line_profiler()`

Like `/optimize` but includes `line_profiler_results` to guide the LLM toward hot lines.

Returns: candidates with `source=OptimizedCandidateSource.OPTIMIZE_LP`

### `/ai/refine` — Refine Existing Candidate

Method: `refine_code()`

Request type: `AIServiceRefinerRequest`

Sends an existing candidate with runtime data and line profiler results to generate an improved version.

Key fields:
- `original_source_code` / `optimized_source_code` — Before and after
- `original_code_runtime` / `optimized_code_runtime` — Timing data
- `speedup` — Current speedup ratio
- `original_line_profiler_results` / `optimized_line_profiler_results`

Returns: candidates with `source=OptimizedCandidateSource.REFINE` and `parent_id` set to the ID of the candidate being refined

### `/ai/repair` — Fix Failed Candidate

Method: `repair_code()`

Request type: `AIServiceCodeRepairRequest`

Sends a failed candidate with test diffs showing what went wrong.

Key fields:
- `original_source_code` / `modified_source_code`
- `test_diffs: list[TestDiff]` — Each with `scope` (return_value/stdout/did_pass), original vs candidate values, and test source code

Returns: candidates with `source=OptimizedCandidateSource.REPAIR` and `parent_id` set
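The shape of a test diff can be sketched as below. Field names follow the description above; the actual `TestDiff` model may name or structure them differently.

```python
from dataclasses import dataclass

# Illustrative shape of a test diff sent to /ai/repair; an assumption
# based on the field descriptions above, not the real model definition.
@dataclass
class TestDiff:
    scope: str             # "return_value", "stdout", or "did_pass"
    original_value: str    # value produced by the original code
    candidate_value: str   # value produced by the failed candidate
    test_source: str       # source of the failing test

diff = TestDiff(
    scope="return_value",
    original_value="[1, 2, 3]",
    candidate_value="[3, 2, 1]",
    test_source="def test_sort(): ...",
)
```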

### `/ai/adaptive_optimize` — Multi-Candidate Adaptive

Method: `adaptive_optimize()`

Request type: `AIServiceAdaptiveOptimizeRequest`

Sends multiple previous candidates with their speedups for the LLM to learn from and generate better candidates.

Key fields:
- `candidates: list[AdaptiveOptimizedCandidate]` — Previous candidates with source code, explanation, source type, and speedup

Returns: candidates with `source=OptimizedCandidateSource.ADAPTIVE`

### `/ai/rewrite_jit` — JIT Rewrite

Method: `get_jit_rewritten_code()`

Rewrites code to use JIT compilation (e.g., Numba).

Returns: candidates with `source=OptimizedCandidateSource.JIT_REWRITE`

## Candidate Parsing

All endpoints return JSON with an `optimizations` array. Each entry has:
- `source_code` — Markdown-formatted code blocks
- `explanation` — LLM explanation
- `optimization_id` — Unique ID
- `parent_id` — Optional parent reference
- `model` — Which LLM model was used

`_get_valid_candidates()` parses the markdown code via `CodeStringsMarkdown.parse_markdown_code()` and filters out entries with empty code blocks.
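The filtering step can be sketched as follows. `extract_code_blocks()` is a crude regex stand-in for `CodeStringsMarkdown.parse_markdown_code()`, used only to show the "drop entries with empty code blocks" logic.

```python
import re

# Stand-in for CodeStringsMarkdown.parse_markdown_code(): pull fenced
# code blocks out of markdown and drop empty ones.
def extract_code_blocks(markdown: str) -> list[str]:
    blocks = re.findall(r"```[\w+-]*\n(.*?)```", markdown, re.DOTALL)
    return [b.strip() for b in blocks if b.strip()]

# Keep only optimization entries whose markdown contains real code,
# mirroring the _get_valid_candidates() behavior described above.
def valid_candidates(optimizations: list[dict]) -> list[dict]:
    return [o for o in optimizations if extract_code_blocks(o["source_code"])]
```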

## `LocalAiServiceClient`

Used when `CODEFLASH_EXPERIMENT_ID` is set. Mirrors `AiServiceClient` but sends to a separate experimental endpoint for A/B testing optimization strategies.

## LLM Call Sequencing

`AiServiceClient` tracks call sequence via `llm_call_counter` (itertools.count). Each request includes a `call_sequence` number, used by the backend to maintain conversation context across multiple calls for the same function.
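The sequencing pattern is simple enough to sketch directly. `tag_request()` is a hypothetical helper; the real client attaches the counter inside its request method.

```python
import itertools

# Monotonic call counter, as described above: each request for a trace
# carries the next call_sequence number.
llm_call_counter = itertools.count(1)

def tag_request(payload: dict) -> dict:
    return {**payload, "call_sequence": next(llm_call_counter)}
```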
79 changes: 79 additions & 0 deletions tiles/codeflash-docs/docs/configuration.md
@@ -0,0 +1,79 @@
# Configuration

Key configuration constants, effort levels, and thresholds.

## Constants (`code_utils/config_consts.py`)

### Test Execution

| Constant | Value | Description |
|----------|-------|-------------|
| `MAX_TEST_RUN_ITERATIONS` | 5 | Maximum test loop iterations |
| `INDIVIDUAL_TESTCASE_TIMEOUT` | 15s | Timeout per individual test case |
| `MAX_FUNCTION_TEST_SECONDS` | 60s | Max total time for function testing |
| `MAX_TEST_FUNCTION_RUNS` | 50 | Max test function executions |
| `MAX_CUMULATIVE_TEST_RUNTIME_NANOSECONDS` | 100 ms (stored in ns) | Max cumulative test runtime |
| `TOTAL_LOOPING_TIME` | 10s | Candidate benchmarking budget |
| `MIN_TESTCASE_PASSED_THRESHOLD` | 6 | Minimum test cases that must pass |

### Performance Thresholds

| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_IMPROVEMENT_THRESHOLD` | 0.05 (5%) | Minimum speedup to accept a candidate |
| `MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD` | 0.10 (10%) | Minimum async throughput improvement |
| `MIN_CONCURRENCY_IMPROVEMENT_THRESHOLD` | 0.20 (20%) | Minimum concurrency ratio improvement |
| `COVERAGE_THRESHOLD` | 60.0% | Minimum test coverage |

### Stability Thresholds

| Constant | Value | Description |
|----------|-------|-------------|
| `STABILITY_WINDOW_SIZE` | 0.35 | 35% of total iteration window |
| `STABILITY_CENTER_TOLERANCE` | 0.0025 | ±0.25% around median |
| `STABILITY_SPREAD_TOLERANCE` | 0.0025 | 0.25% window spread |

### Context Limits

| Constant | Value | Description |
|----------|-------|-------------|
| `OPTIMIZATION_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for optimization context |
| `TESTGEN_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for test generation context |
| `MAX_CONTEXT_LEN_REVIEW` | 1000 | Max context length for optimization review |

### Other

| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_CORRECT_CANDIDATES` | 2 | Min correct candidates before skipping repair |
| `REPEAT_OPTIMIZATION_PROBABILITY` | 0.1 | Probability of re-optimizing a function |
| `DEFAULT_IMPORTANCE_THRESHOLD` | 0.001 | Minimum addressable time to consider a function |
| `CONCURRENCY_FACTOR` | 10 | Number of concurrent executions for concurrency benchmark |
| `REFINED_CANDIDATE_RANKING_WEIGHTS` | (2, 1) | (runtime, diff) weights — runtime 2x more important |

## Effort Levels

`EffortLevel` enum: `LOW`, `MEDIUM`, `HIGH`

Effort controls the number of candidates, repairs, and refinements:

| Key | LOW | MEDIUM | HIGH |
|-----|-----|--------|------|
| `N_OPTIMIZER_CANDIDATES` | 3 | 5 | 6 |
| `N_OPTIMIZER_LP_CANDIDATES` | 4 | 6 | 7 |
| `N_GENERATED_TESTS` | 2 | 2 | 2 |
| `MAX_CODE_REPAIRS_PER_TRACE` | 2 | 3 | 5 |
| `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` | 0.2 | 0.3 | 0.4 |
| `TOP_VALID_CANDIDATES_FOR_REFINEMENT` | 2 | 3 | 4 |
| `ADAPTIVE_OPTIMIZATION_THRESHOLD` | 0 | 0 | 2 |
| `MAX_ADAPTIVE_OPTIMIZATIONS_PER_TRACE` | 0 | 0 | 4 |

Use `get_effort_value(EffortKeys.KEY, effort_level)` to retrieve values.
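A minimal sketch of that lookup, consistent with the table above. The real `EffortLevel`/`EffortKeys` live in codeflash's config module and may be structured differently; only two keys are shown.

```python
from enum import Enum

# Assumed enum ordering LOW < MEDIUM < HIGH, matching the table columns.
class EffortLevel(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2

# Subset of the effort table above: (LOW, MEDIUM, HIGH) tuples per key.
_EFFORT_TABLE = {
    "N_OPTIMIZER_CANDIDATES": (3, 5, 6),
    "MAX_CODE_REPAIRS_PER_TRACE": (2, 3, 5),
}

def get_effort_value(key: str, level: EffortLevel) -> int:
    return _EFFORT_TABLE[key][level.value]
```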

## Project Configuration

Configuration is read from `pyproject.toml` under `[tool.codeflash]`. Key settings are auto-detected by `setup/detector.py`:
- `module-root` — Root of the module to optimize
- `tests-root` — Root of test files
- `test-framework` — pytest, unittest, jest, etc.
- `formatter-cmds` — Code formatting commands
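An illustrative `[tool.codeflash]` section covering the settings above; paths and the formatter command are placeholders for a typical project layout, not required values.

```toml
[tool.codeflash]
module-root = "src/mypackage"
tests-root = "tests"
test-framework = "pytest"
formatter-cmds = ["black $file"]
```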
60 changes: 60 additions & 0 deletions tiles/codeflash-docs/docs/context-extraction.md
@@ -0,0 +1,60 @@
# Context Extraction

How codeflash extracts and limits code context for optimization and test generation.

## Overview

Context extraction (`context/code_context_extractor.py`) builds a `CodeOptimizationContext` containing all code needed for the LLM to understand and optimize a function, split into:

- **Read-writable code** (`CodeContextType.READ_WRITABLE`): The function being optimized plus its helper functions — code the LLM is allowed to modify
- **Read-only context** (`CodeContextType.READ_ONLY`): Dependency code for reference — imports, type definitions, base classes
- **Testgen context** (`CodeContextType.TESTGEN`): Context for test generation; may include imported class definitions and external base-class `__init__` methods
- **Hashing context** (`CodeContextType.HASHING`): Used for deduplication of optimization runs

## Token Limits

Both optimization and test generation contexts are token-limited:
- `OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000` tokens
- `TESTGEN_CONTEXT_TOKEN_LIMIT = 16000` tokens

Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`. Functions whose context exceeds these limits are skipped.
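The gate can be sketched as below. `count_tokens()` is a crude whitespace-based stand-in for `encoded_tokens_len()`, which uses a real tokenizer; only the skip-if-over-limit logic is the point here.

```python
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000

# Placeholder counter: the real encoded_tokens_len() is tokenizer-based.
def count_tokens(code: str) -> int:
    return len(code.split())

# Functions whose assembled context exceeds the limit are skipped.
def within_limit(context_code: str, limit: int = OPTIMIZATION_CONTEXT_TOKEN_LIMIT) -> bool:
    return count_tokens(context_code) <= limit
```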

## Context Building Process

### 1. Helper Discovery

For the target function (`FunctionToOptimize`), the extractor finds:
- **Helpers of the function**: Functions/classes in the same file that the target function calls
- **Helpers of helpers**: Transitive dependencies of the helper functions

These are organized as `dict[Path, set[FunctionSource]]` — mapping file paths to the set of helper functions found in each file.

### 2. Code Extraction

`extract_code_markdown_context_from_files()` builds `CodeStringsMarkdown` from the helper dictionaries. Each file's relevant code is extracted as a `CodeString` with its file path.

### 3. Testgen Context Enrichment

`build_testgen_context()` extends the basic context with:
- Imported class definitions (resolved from imports)
- External base class `__init__` methods
- External class `__init__` methods referenced in the context

### 4. Unused Definition Removal

`detect_unused_helper_functions()` and `remove_unused_definitions_by_function_names()` from `context/unused_definition_remover.py` prune definitions that are not transitively reachable from the target function, reducing token usage.

### 5. Deduplication

The hashing context (`hashing_code_context`) generates a hash (`hashing_code_context_hash`) used to detect when the same function context has already been optimized in a previous run, avoiding redundant work.
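The dedup check can be sketched as follows, under the assumption that the hash is a digest of the hashing context string; the real scheme may normalize the code differently before hashing.

```python
import hashlib

# Assumed: hash is a digest of the hashing context string.
def hashing_context_hash(hashing_code_context: str) -> str:
    return hashlib.sha256(hashing_code_context.encode("utf-8")).hexdigest()

seen_hashes: set[str] = set()

# Returns True when this exact context was already optimized in a
# previous run, so the work can be skipped.
def already_optimized(context: str) -> bool:
    h = hashing_context_hash(context)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```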

## Key Functions

| Function | Location | Purpose |
|----------|----------|---------|
| `build_testgen_context()` | `context/code_context_extractor.py` | Build enriched testgen context |
| `extract_code_markdown_context_from_files()` | `context/code_context_extractor.py` | Convert helper dicts to `CodeStringsMarkdown` |
| `detect_unused_helper_functions()` | `context/unused_definition_remover.py` | Find unused definitions |
| `remove_unused_definitions_by_function_names()` | `context/unused_definition_remover.py` | Remove unused definitions |
| `collect_top_level_defs_with_usages()` | `context/unused_definition_remover.py` | Analyze definition usage |
| `encoded_tokens_len()` | `code_utils/code_utils.py` | Count tokens in code |