Benchmark

Vivek Raman edited this page Feb 19, 2026 · 1 revision

LaTeX Chatbot Benchmark

A CLI tool for evaluating the performance and accuracy of the LaTeX chatbot agent.

Functionality

The benchmark tool runs a suite of predefined tasks against the chatbot agent to measure:

  • Success Rate: Can the agent correctly invoke the compiler and produce a valid PDF?
  • Correction Capability: Can the agent fix intentionally broken LaTeX code?
  • Generation Quality: Heuristic (and partly subjective) checks on the content the agent generates.
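As a rough illustration of how per-case outcomes might be aggregated into these metrics, here is a minimal sketch. The `CaseResult` record and `success_rate` helper are hypothetical names, not the tool's actual internals:

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-test-case outcome record."""
    name: str
    compiled: bool   # did the agent invoke the compiler and produce a valid PDF?
    corrected: bool  # did the agent fix the intentionally broken LaTeX input?


def success_rate(results: list[CaseResult]) -> float:
    """Fraction of cases that produced a valid PDF."""
    if not results:
        return 0.0
    return sum(r.compiled for r in results) / len(results)


results = [
    CaseResult("fix-table", compiled=True, corrected=True),
    CaseResult("broken-env", compiled=False, corrected=False),
]
print(success_rate(results))  # → 0.5
```

Correction capability could be computed the same way over the `corrected` flag, restricted to the cases that start from broken input.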

Usage

Run Benchmarks

# From project root
make do-benchmark

Or manually:

cd benchmark
uv run benchmark --dir temp run

Workflow

  1. Locate Server: Ensures the backend server is running or reachable.
  2. Load Dataset: Reads benchmark definitions from data/.
  3. Clone Dataset: Sets up a temporary workspace for each test case.
  4. Run: Executes each test case and reports results.
flowchart TD
    Start([Start]) --> Check{Server Running?}
    Check -- Yes --> Load[Load Dataset]
    Check -- No --> Error[Exit]

    Load --> Setup[Clone to Temp Dir]
    Setup --> Run[Run Agent Task]
    Run --> Compile{Compile & Validate}

    Compile -- Success --> Pass[Record Success]
    Compile -- Fail --> Fail[Record Failure]

    Pass --> More{More Tests?}
    Fail --> More
    More -- Yes --> Setup
    More -- No --> Report[Generate Report]
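The workflow steps above can be sketched as a simple loop. `run_agent_task` and `compile_and_validate` are stand-ins for the real agent invocation and LaTeX compilation check, and the health-check URL is an assumption:

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

SERVER_URL = "http://localhost:8000/health"  # assumption: actual endpoint may differ


def server_reachable(url: str) -> bool:
    """Step 1: ensure the backend server is running or reachable."""
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except OSError:
        return False


def run_agent_task(case_dir: Path) -> Path:
    """Stand-in for the real agent call; returns the expected output PDF path."""
    return case_dir / "main.pdf"


def compile_and_validate(pdf_path: Path) -> bool:
    """Stand-in check: the real tool would compile and inspect the PDF."""
    return pdf_path.with_name("main.tex").exists()


def run_suite(dataset_dir: Path, cases: list[str]) -> dict[str, bool]:
    """Steps 2-4: load cases, clone each into a temp workspace, run, record."""
    results: dict[str, bool] = {}
    for case in cases:
        workspace = Path(tempfile.mkdtemp(prefix=f"bench-{case}-"))
        shutil.copytree(dataset_dir / case, workspace / case)
        try:
            output = run_agent_task(workspace / case)
            results[case] = compile_and_validate(output)
        finally:
            # each test case gets a fresh workspace, torn down afterwards
            shutil.rmtree(workspace, ignore_errors=True)
    return results
```

Cloning into a throwaway directory keeps test cases isolated, so a run that corrupts its workspace cannot affect the source dataset or later cases.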
Configuration

Benchmark behavior can be configured via CLI flags. Use --help to list the available options:

uv run benchmark --help
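For a sense of the CLI shape, here is a hypothetical argparse sketch. Only the `--dir` flag and `run` subcommand appear in the commands above; everything else here is illustrative, not the tool's real interface:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction: matches the documented
    # `uv run benchmark --dir temp run` invocation only.
    parser = argparse.ArgumentParser(prog="benchmark")
    parser.add_argument("--dir", default="temp",
                        help="workspace directory for cloned test cases")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("run", help="execute the benchmark suite")
    return parser


args = build_parser().parse_args(["--dir", "temp", "run"])
print(args.dir, args.command)  # → temp run
```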