Benchmark
Benchmark
Vivek Raman edited this page Feb 19, 2026
A CLI tool for evaluating the performance and accuracy of the LaTeX chatbot agents.
The benchmark tool runs a suite of predefined tasks against the chatbot agent to measure:
- Success Rate: Can the agent correctly invoke the compiler and produce a valid PDF?
- Correction Capability: Can the agent fix intentionally broken LaTeX code?
- Generation Quality: Heuristic (partly subjective) checks on specific properties of generated content.
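The first two metrics above reduce to simple ratios over per-task outcomes. A minimal sketch of that aggregation, assuming a hypothetical `TaskResult` shape (the real tool's result format is not shown here):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical shape)."""
    compiled: bool            # agent produced a valid PDF
    corrected: bool           # agent fixed the broken input (correction tasks only)
    is_correction_task: bool  # task started from intentionally broken LaTeX

def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task outcomes into the headline metrics."""
    total = len(results)
    corrections = [r for r in results if r.is_correction_task]
    return {
        # Success Rate: fraction of tasks where compilation produced a valid PDF
        "success_rate": sum(r.compiled for r in results) / total if total else 0.0,
        # Correction Capability: fraction of broken-input tasks the agent repaired
        "correction_rate": (
            sum(r.corrected for r in corrections) / len(corrections)
            if corrections else 0.0
        ),
    }
```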
```sh
# From project root
make do-benchmark
```

Or manually:

```sh
cd benchmark
uv run benchmark --dir temp run
```

- Locate Server: Ensures the backend server is running or reachable.
- Load Dataset: Reads benchmark definitions from `data/`.
- Clone Dataset: Sets up a temporary workspace for each test case.
- Run: Executes each test case and reports results.
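The steps above can be sketched as a driver loop. This is an illustrative outline, not the tool's actual implementation: the server probe and the agent invocation are injected as callables because those interfaces are project-specific and assumed here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable

def run_suite(
    data_dir: Path,
    check_server: Callable[[], bool],
    run_task: Callable[[Path], bool],
) -> list[dict]:
    """Run every test case under data_dir and collect results.

    check_server and run_task are hypothetical hooks standing in for
    the real server probe and agent/compile step.
    """
    if not check_server():                               # 1. locate server
        raise SystemExit("backend server not reachable")
    results = []
    cases = sorted(p for p in data_dir.iterdir() if p.is_dir())  # 2. load dataset
    for case in cases:
        with tempfile.TemporaryDirectory() as tmp:       # 3. clone to temp workspace
            workspace = Path(tmp) / case.name
            shutil.copytree(case, workspace)
            ok = run_task(workspace)                     # 4. run agent, record outcome
        results.append({"case": case.name, "success": ok})
    return results
```

Cloning each case into a throwaway directory keeps the source dataset pristine, so a failed or destructive agent run never contaminates the next test.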
```mermaid
flowchart TD
    Start([Start]) --> Check{Server Running?}
    Check -- Yes --> Load[Load Dataset]
    Check -- No --> Error[Exit]
    Load --> Setup[Clone to Temp Dir]
    Setup --> Run[Run Agent Task]
    Run --> Compile{Compile & Validate}
    Compile -- Success --> Pass[Record Success]
    Compile -- Fail --> Fail[Record Failure]
    Pass --> More{More Tests?}
    Fail --> More
    More -- Yes --> Setup
    More -- No --> Report[Generate Report]
```
Benchmark behavior can be configured via CLI flags. Use --help to see all options:

```sh
uv run benchmark --help
```