diff --git a/docs.json b/docs.json index a25fe07..1273367 100644 --- a/docs.json +++ b/docs.json @@ -693,6 +693,10 @@ "source": "/llm-evaluation/real-time-evaluation", "destination": "/evaluations/online-evaluation/overview" }, + { + "source": "/llm-evaluation/offline/platform/ci-cd-execution", + "destination": "/evaluations/experiments/ci-cd" + }, { "source": "/llm-evaluation/:path*", "destination": "/evaluations/overview" diff --git a/evaluations/evaluators/custom-scoring.mdx b/evaluations/evaluators/custom-scoring.mdx index d6cb652..fcc0ee3 100644 --- a/evaluations/evaluators/custom-scoring.mdx +++ b/evaluations/evaluators/custom-scoring.mdx @@ -320,9 +320,9 @@ Custom scores appear in: href="/evaluations/experiments/overview" /> diff --git a/evaluations/evaluators/saved-evaluators.mdx b/evaluations/evaluators/saved-evaluators.mdx index 6dfe0c7..5792fca 100644 --- a/evaluations/evaluators/saved-evaluators.mdx +++ b/evaluations/evaluators/saved-evaluators.mdx @@ -28,9 +28,7 @@ Saved evaluators are pre-configured evaluation setups that you create on the Lan 4. Configure the settings (model, prompt, thresholds, etc.) 5. Give it a descriptive name and save - - Creating a saved evaluator - +{/* TODO: Add screenshot of saved evaluator creation UI */} ### Via the Evaluators Page diff --git a/llm-evaluation/offline/platform/ci-cd-execution.mdx b/llm-evaluation/offline/platform/ci-cd-execution.mdx deleted file mode 100644 index 9ff355e..0000000 --- a/llm-evaluation/offline/platform/ci-cd-execution.mdx +++ /dev/null @@ -1,499 +0,0 @@ ---- -title: Run Evaluations from CI/CD -sidebarTitle: CI/CD Execution -description: Execute platform-configured evaluations from your CI/CD pipelines using the LangWatch SDKs or REST API. ---- - -Run evaluations that you've configured in the LangWatch platform directly from your CI/CD pipelines. This enables automated quality gates for your LLM applications. 
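
As a sketch of what such a quality gate can look like in practice (the `passes_gate` helper and the 90% threshold are illustrative choices made for this example, not part of the SDK; `langwatch.experiment.run` and `result.pass_rate` are covered in detail in the sections that follow):

```python
import sys

def passes_gate(pass_rate: float, threshold: float = 90.0) -> bool:
    """Return True when a run's pass rate (0-100) meets the chosen threshold."""
    return pass_rate >= threshold

if __name__ == "__main__":
    import langwatch  # expects LANGWATCH_API_KEY in the environment

    # Trigger the platform-configured evaluation by its slug and wait for it to finish
    result = langwatch.experiment.run("your-evaluation-slug")
    result.print_summary(exit_on_failure=False)

    # Fail the pipeline when quality drops below the chosen threshold
    if not passes_gate(result.pass_rate):
        sys.exit(1)
```

Running this script as the final step of a CI job makes the build red whenever the evaluation's pass rate falls below the threshold you picked.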
- -## Overview - -After configuring an evaluation in LangWatch (setting up targets, evaluators, and datasets), you can trigger it programmatically using: - -- **Python SDK**: `langwatch.experiment.run("your-slug")` -- **TypeScript SDK**: `langwatch.experiments.run("your-slug")` -- **REST API**: `POST /api/evaluations/v3/{slug}/run` - -The execution uses the configuration saved in LangWatch, so you don't need to specify targets, evaluators, or datasets in your CI/CD script. - -## Quickstart - -### 1. Find Your Evaluation Slug - -Your evaluation slug is visible in the URL when viewing your evaluation: -``` -https://app.langwatch.ai/your-project/evaluations/v3/your-evaluation-slug - ^^^^^^^^^^^^^^^^^^^^^^^^^ -``` - -You can also find it by clicking the **CI/CD** button in the evaluation toolbar. - -### 2. Run from Your Pipeline - - - -```python -import langwatch - -result = langwatch.experiment.run("your-evaluation-slug") -result.print_summary() -``` - - -```typescript -import { LangWatch } from "langwatch"; - -const langwatch = new LangWatch(); - -const result = await langwatch.experiments.run("your-evaluation-slug"); -result.printSummary(); -``` - - -```bash -# Start the evaluation run -RUN_RESPONSE=$(curl -s -X POST "https://app.langwatch.ai/api/evaluations/v3/your-evaluation-slug/run" \ - -H "X-Auth-Token: ${LANGWATCH_API_KEY}") - -RUN_ID=$(echo $RUN_RESPONSE | jq -r '.runId') -echo "Started run: $RUN_ID" - -# Poll for completion -while true; do - STATUS_RESPONSE=$(curl -s "https://app.langwatch.ai/api/evaluations/v3/runs/$RUN_ID" \ - -H "X-Auth-Token: ${LANGWATCH_API_KEY}") - - STATUS=$(echo $STATUS_RESPONSE | jq -r '.status') - PROGRESS=$(echo $STATUS_RESPONSE | jq -r '.progress') - TOTAL=$(echo $STATUS_RESPONSE | jq -r '.total') - - echo "Progress: $PROGRESS/$TOTAL" - - if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then - break - fi - - sleep 2 -done - -# Show summary and exit -echo $STATUS_RESPONSE | jq '.summary' - -if [ "$STATUS" = "failed" ]; 
then - exit 1 -fi -``` - - - - -Set the `LANGWATCH_API_KEY` environment variable with your project API key. -You can find it in your [project settings](/project/setup). - - -## CI/CD Integration Examples - -### GitHub Actions - -```yaml -name: LLM Evaluation - -on: - pull_request: - branches: [main] - workflow_dispatch: - -jobs: - evaluate: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: '3.11' - - - name: Install dependencies - run: pip install langwatch - - - name: Run evaluation - env: - LANGWATCH_API_KEY: ${{ secrets.LANGWATCH_API_KEY }} - run: | - python -c " - import langwatch - - result = langwatch.experiment.run('my-evaluation') - result.print_summary() - " -``` - -### GitLab CI - -```yaml -evaluate: - stage: test - image: python:3.11 - script: - - pip install langwatch - - | - python -c " - import langwatch - - result = langwatch.experiment.run('my-evaluation') - result.print_summary() - " - variables: - LANGWATCH_API_KEY: $LANGWATCH_API_KEY -``` - -### CircleCI - -```yaml -version: 2.1 - -jobs: - evaluate: - docker: - - image: python:3.11 - steps: - - checkout - - run: - name: Run evaluation - command: | - pip install langwatch - python -c " - import langwatch - - result = langwatch.experiment.run('my-evaluation') - result.print_summary() - " -``` - -## Options - -### Progress Callback - -Track progress during long-running evaluations: - - - -```python -result = langwatch.experiment.run( - "my-evaluation", - on_progress=lambda completed, total: print(f"Progress: {completed}/{total}") -) -result.print_summary() -``` - - -```typescript -const result = await langwatch.experiments.run("my-evaluation", { - onProgress: (completed, total) => { - console.log(`Progress: ${completed}/${total}`); - } -}); -result.printSummary(); -``` - - - -### Timeout - -Set a maximum time to wait for completion: - - - -```python -result = langwatch.experiment.run( - "my-evaluation", 
- timeout=300.0 # 5 minutes (default: 600 seconds) -) -result.print_summary() -``` - - -```typescript -const result = await langwatch.experiments.run("my-evaluation", { - timeout: 300000 // 5 minutes in ms (default: 600000) -}); -result.printSummary(); -``` - - - -### Poll Interval - -Adjust how frequently to check for completion: - - - -```python -result = langwatch.experiment.run( - "my-evaluation", - poll_interval=5.0 # Check every 5 seconds (default: 2 seconds) -) -result.print_summary() -``` - - -```typescript -const result = await langwatch.experiments.run("my-evaluation", { - pollInterval: 5000 // Check every 5 seconds in ms (default: 2000) -}); -result.printSummary(); -``` - - - -### Exit on Failure - -By default, `print_summary()` / `printSummary()` exits with code 1 when there are failures. You can disable this: - - - -```python -result = langwatch.experiment.run("my-evaluation") -result.print_summary(exit_on_failure=False) # Don't exit automatically - -# Handle failures manually -if result.failed > 0: - print(f"Warning: {result.failed} failures, but continuing...") -``` - - -```typescript -const result = await langwatch.experiments.run("my-evaluation"); -result.printSummary(false); // Don't exit automatically - -// Handle failures manually -if (result.failed > 0) { - console.log(`Warning: ${result.failed} failures, but continuing...`); -} -``` - - - -## Results Summary - -The `print_summary()` / `printSummary()` method outputs a CI-friendly summary: - -``` -════════════════════════════════════════════════════════════ - EVALUATION RESULTS -════════════════════════════════════════════════════════════ - Run ID: run_abc123 - Status: COMPLETED - Duration: 45.2s -──────────────────────────────────────────────────────────── - Passed: 42 - Failed: 3 - Pass Rate: 93.3% -──────────────────────────────────────────────────────────── - TARGETS: - GPT-4o: 20 passed, 2 failed - Avg latency: 1250ms - Total cost: $0.0125 - Claude 3.5: 22 passed, 1 failed - Avg latency: 
980ms - Total cost: $0.0098 -──────────────────────────────────────────────────────────── - EVALUATORS: - Exact Match: 85.0% pass rate - Faithfulness: 95.0% pass rate - Avg score: 0.87 -──────────────────────────────────────────────────────────── - View details: https://app.langwatch.ai/project/experiments/my-eval?runId=run_abc123 -════════════════════════════════════════════════════════════ -``` - -## Result Object - -The result object contains detailed information about the run: - - - -```python -result = langwatch.experiment.run("my-evaluation") - -# Basic metrics -result.run_id # Unique run identifier -result.status # "completed", "failed", or "stopped" -result.passed # Number of passed evaluations -result.failed # Number of failed evaluations -result.pass_rate # Percentage passed (0-100) -result.duration # Total duration in milliseconds -result.run_url # URL to view in LangWatch - -# Detailed summary -result.summary.total_cells # Total cells executed -result.summary.completed_cells # Successfully completed -result.summary.failed_cells # Failed executions -result.summary.targets # Per-target statistics -result.summary.evaluators # Per-evaluator statistics - -# Print and exit on failure -result.print_summary() -``` - - -```typescript -const result = await langwatch.experiments.run("my-evaluation"); - -// Basic metrics -result.runId // Unique run identifier -result.status // "completed" | "failed" | "stopped" -result.passed // Number of passed evaluations -result.failed // Number of failed evaluations -result.passRate // Percentage passed (0-100) -result.duration // Total duration in milliseconds -result.runUrl // URL to view in LangWatch - -// Detailed summary -result.summary.totalCells // Total cells executed -result.summary.completedCells // Successfully completed -result.summary.failedCells // Failed executions -result.summary.targets // Per-target statistics -result.summary.evaluators // Per-evaluator statistics - -// Print and exit on failure 
-result.printSummary(); -``` - - - -## Error Handling - - - -```python -from langwatch.evaluation import ( - EvaluationNotFoundError, - EvaluationTimeoutError, - EvaluationRunFailedError, - EvaluationsApiError, -) - -try: - result = langwatch.experiment.run("my-evaluation", timeout=300) - result.print_summary() -except EvaluationNotFoundError: - print("Evaluation slug not found") - exit(1) -except EvaluationTimeoutError as e: - print(f"Timeout: {e.progress}/{e.total} completed") - exit(1) -except EvaluationRunFailedError as e: - print(f"Run failed: {e.error_message}") - exit(1) -except EvaluationsApiError as e: - print(f"API error: {e} (status: {e.status_code})") - exit(1) -``` - - -```typescript -import { - EvaluationNotFoundError, - EvaluationTimeoutError, - EvaluationRunFailedError, - EvaluationsApiError, -} from "langwatch"; - -try { - const result = await langwatch.experiments.run("my-evaluation", { timeout: 300000 }); - result.printSummary(); -} catch (error) { - if (error instanceof EvaluationNotFoundError) { - console.error("Evaluation slug not found"); - } else if (error instanceof EvaluationTimeoutError) { - console.error(`Timeout: ${error.progress}/${error.total} completed`); - } else if (error instanceof EvaluationRunFailedError) { - console.error(`Run failed: ${error.errorMessage}`); - } else if (error instanceof EvaluationsApiError) { - console.error(`API error: ${error.message} (status: ${error.statusCode})`); - } - process.exit(1); -} -``` - - - -## REST API Reference - -### Start a Run - -``` -POST /api/evaluations/v3/{slug}/run -``` - -**Headers:** -- `X-Auth-Token: your-api-key` or `Authorization: Bearer your-api-key` - -**Response:** -```json -{ - "runId": "run_abc123", - "status": "running", - "total": 45, - "runUrl": "https://app.langwatch.ai/project/experiments/my-eval?runId=run_abc123" -} -``` - -### Get Run Status - -``` -GET /api/evaluations/v3/runs/{runId} -``` - -**Headers:** -- `X-Auth-Token: your-api-key` or `Authorization: Bearer 
your-api-key` - -**Response (running):** -```json -{ - "runId": "run_abc123", - "status": "running", - "progress": 20, - "total": 45, - "startedAt": 1702500000000 -} -``` - -**Response (completed):** -```json -{ - "runId": "run_abc123", - "status": "completed", - "progress": 45, - "total": 45, - "startedAt": 1702500000000, - "finishedAt": 1702500045000, - "summary": { - "runId": "run_abc123", - "totalCells": 45, - "completedCells": 45, - "failedCells": 3, - "duration": 45000, - "runUrl": "https://app.langwatch.ai/project/experiments/my-eval?runId=run_abc123" - } -} -``` - -## What's Next? - - - - Configure your first evaluation in LangWatch - - - Write evaluations directly in code - - - Browse available evaluation metrics - - - Learn about dataset management - -