Benchmark
Benchmark
Vivek Raman edited this page Feb 19, 2026
A CLI tool for evaluating the performance and accuracy of the LaTeX chatbot agents.
The benchmark tool runs a suite of predefined tasks against the chatbot agent to measure:
- Success Rate: Can the agent correctly invoke the compiler and produce a valid PDF?
- Correction Capability: Can the agent fix intentionally broken LaTeX code?
- Generation Quality: Heuristic (partly subjective) checks on specific properties of generated content.
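The first two metrics above reduce to simple ratios over per-task outcomes. A minimal sketch of that aggregation, assuming a hypothetical `TaskResult` shape (the real tool's result format is not shown here):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical shape)."""
    compiled: bool            # agent produced a valid PDF
    corrected: bool           # agent fixed the broken input (correction tasks only)
    is_correction_task: bool  # task started from intentionally broken LaTeX

def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task outcomes into the headline metrics."""
    total = len(results)
    corrections = [r for r in results if r.is_correction_task]
    return {
        # Success Rate: fraction of tasks where compilation produced a valid PDF
        "success_rate": sum(r.compiled for r in results) / total if total else 0.0,
        # Correction Capability: fraction of broken-input tasks the agent repaired
        "correction_rate": (
            sum(r.corrected for r in corrections) / len(corrections)
            if corrections else 0.0
        ),
    }
```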
```sh
# From project root
make do-benchmark
```

Or manually:

```sh
cd benchmark
uv run benchmark --dir temp run
```

- Locate Server: Ensures the backend server is running or reachable.
- Load Dataset: Reads benchmark definitions from `data/`.
- Clone Dataset: Sets up a temporary workspace for each test case.
- Run: Executes each test case and reports results.
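The steps above can be sketched as a driver loop. This is an illustrative outline, not the tool's actual implementation: the server probe and the agent invocation are injected as callables because those interfaces are project-specific and assumed here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable

def run_suite(
    data_dir: Path,
    check_server: Callable[[], bool],
    run_task: Callable[[Path], bool],
) -> list[dict]:
    """Run every test case under data_dir and collect results.

    check_server and run_task are hypothetical hooks standing in for
    the real server probe and agent/compile step.
    """
    if not check_server():                               # 1. locate server
        raise SystemExit("backend server not reachable")
    results = []
    cases = sorted(p for p in data_dir.iterdir() if p.is_dir())  # 2. load dataset
    for case in cases:
        with tempfile.TemporaryDirectory() as tmp:       # 3. clone to temp workspace
            workspace = Path(tmp) / case.name
            shutil.copytree(case, workspace)
            ok = run_task(workspace)                     # 4. run agent, record outcome
        results.append({"case": case.name, "success": ok})
    return results
```

Cloning each case into a throwaway directory keeps the source dataset pristine, so a failed or destructive agent run never contaminates the next test.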
```mermaid
flowchart TD
    Start([Start]) --> Check{Server Running?}
    Check -- Yes --> Load[Load Dataset]
    Check -- No --> Error[Exit]
    Load --> Setup[Clone to Temp Dir]
    Setup --> Run[Run Agent Task]
    Run --> Compile{Compile & Validate}
    Compile -- Success --> Pass[Record Success]
    Compile -- Fail --> Fail[Record Failure]
    Pass --> More{More Tests?}
    Fail --> More
    More -- Yes --> Setup
    More -- No --> Report[Generate Report]
```
Benchmark behavior can be configured via CLI flags. Use --help to see all options:

```sh
uv run benchmark --help
```