Extract Bench

A benchmark for structured extraction from PDF documents, comprising:

Dataset -- 35 human-validated PDF-to-JSON extraction tasks across 5 schemas and 4 domains (finance, academia, hiring, sports).
Evaluation Suite -- A Python package that scores predicted JSON against gold JSON with per-field metrics (exact match, fuzzy, semantic/LLM-based, numeric tolerance, and more).

Getting Started

Installation

pip install -e .

# With dev dependencies
pip install -e ".[dev]"

Quick Start

import json
from pathlib import Path
from extract_bench import ReportBuilder, ReportConfig

# Load your data
schema = json.load(open("schema.json"))
gold = json.load(open("gold.json"))
extracted = json.load(open("model_output.json"))

# Configure and build report
config = ReportConfig(
    output_dir=Path("./eval_results"),
    output_name="nvidia-10k-extract-gemini-flash",  # Identifies this experiment
)
builder = ReportBuilder(config)
report = builder.build(schema, gold, extracted)

# Save all outputs
output_path = builder.save(report)
print(f"Results saved to: {output_path}")

This creates eval_results/nvidia-10k-extract-gemini-flash/ containing:

File	Purpose
`report.json`	Machine-readable full report (for programmatic analysis)
`summary.txt`	Human-readable one-page summary (for quick inspection)
`fields.csv`	Per-field outcomes (for csv analysis)
`fields.md`	Markdown table (for documentation/sharing)

Key metrics on the report object:

print(f"Overall pass rate: {report.overall_pass_rate:.1%}")
print(f"Overall score: {report.overall_score:.3f}")
print(f"Fields evaluated: {report.outcomes.total_evaluated}")
print(f"Passed: {report.outcomes.total_passed}")
print(f"Failed: {report.outcomes.total_failed}")

Evaluation API

ReportConfig Options

config = ReportConfig(
    output_dir=Path("./outputs"),      # Where to save reports
    output_name="my-experiment",       # Subdirectory name (auto-generated if None)
    max_reasoning_length=200,          # Truncate LLM reasoning in outputs
    top_n_lowest_fields=5,             # Track N lowest-scoring fields
    save_json=True,                    # Generate report.json
    save_text=True,                    # Generate summary.txt
    save_csv=True,                     # Generate fields.csv
    save_markdown=True,                # Generate fields.md
)

Batch Evaluation

For running many experiments:

import asyncio
import json
from pathlib import Path
from extract_bench import ReportBuilder, ReportConfig

async def evaluate_model_outputs(
    schema_path: Path,
    gold_path: Path,
    outputs_dir: Path,
    results_dir: Path,
):
    """Evaluate all model outputs in a directory."""
    schema = json.load(schema_path.open())
    gold = json.load(gold_path.open())

    results = []
    for output_file in outputs_dir.glob("*.json"):
        extracted = json.load(output_file.open())

        config = ReportConfig(
            output_dir=results_dir,
            output_name=output_file.stem,  # Use filename as experiment ID
        )
        builder = ReportBuilder(config)
        report = await builder.build_async(schema, gold, extracted)
        builder.save(report)

        results.append({
            "model": output_file.stem,
            "pass_rate": report.overall_pass_rate,
            "score": report.overall_score,
        })

    return results

# Run batch evaluation
results = asyncio.run(evaluate_model_outputs(
    schema_path=Path("schema.json"),
    gold_path=Path("gold.json"),
    outputs_dir=Path("./model_outputs"),
    results_dir=Path("./eval_results"),
))

# Print comparison
for r in sorted(results, key=lambda x: -x["score"]):
    print(f"{r['model']}: {r['pass_rate']:.1%} pass, {r['score']:.3f} avg score")

Report Output Format

summary.txt

================================================================================
                        EVALUATION REPORT: my-experiment
================================================================================

OVERALL RESULTS
---------------
Pass Rate: 85.2% (23/27 fields)
Average Score: 0.891

SCHEMA SHAPE
------------
Total nodes: 45
By type: object=12, string=18, number=8, array=5, boolean=2

COVERAGE
--------
Present in both: 25
Missing in extracted: 2
Spurious in extracted: 0

PASS/FAIL BY METRIC
-------------------
string_semantic: 15/18 passed (83.3%)
number_tolerance: 6/6 passed (100.0%)
integer_exact: 2/3 passed (66.7%)

LOWEST SCORING FIELDS
---------------------
1. borrower.address (0.45) - Partial match, missing suite number
2. terms.rate_type (0.60) - Semantic mismatch
...

fields.csv

Column	Description
`path`	Full JSONPath to the field
`normalized_path`	Human-readable path (e.g., `borrower.name`)
`metric_id`	Metric used for evaluation
`score`	Numeric score (0.0-1.0)
`passed`	Boolean pass/fail
`gold_value`	Expected value
`extracted_value`	Model's output value
`reasoning`	LLM reasoning (for semantic metrics)

Low-Level API

For direct access to evaluation results without reporting:

from extract_bench import StructuredEvaluator, StructuredEvaluatorConfig

evaluator = StructuredEvaluator(StructuredEvaluatorConfig(metrics=[]))
result = evaluator.evaluate(schema, gold, predicted)

# Raw results dict: path -> metric_id -> MetricResult
for path, metrics in result["results"].items():
    for metric_id, metric_result in metrics.items():
        print(f"{path} [{metric_id}]: passed={metric_result.passed}, score={metric_result.score}")

Use evaluate_async() for better performance with LLM-based metrics.

Metrics

Available Metrics

Category	Metric	Description
String	`string_exact`	Case-sensitive exact match
	`string_case_insensitive`	Case-insensitive match
	`string_fuzzy`	Levenshtein similarity
	`string_semantic`	LLM-based semantic comparison (default)
Number	`number_exact`	Exact numeric equality
	`number_tolerance`	Match within tolerance (default)
	`integer_exact`	Exact integer equality
Boolean	`boolean_exact`	Exact boolean equality
Array	`array_llm`	LLM-based array comparison
General	`string_llm`	LLM judge for any comparison

Evaluation Presets

Specify evaluation_config in schema fields to control which metric is used:

schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "evaluation_config": {
                "metrics": [{"metric_id": "number_tolerance", "params": {"tolerance": 0.01}}]
            }
        },
        "description": {
            "type": "string",
            "evaluation_config": "string_fuzzy"  # Use preset shorthand
        }
    }
}

Preset	Description
`string_exact`	Case-sensitive exact match
`string_fuzzy`	Levenshtein similarity (case-insensitive by default)
`string_case_insensitive`	Case-insensitive match
`string_semantic`	LLM-based semantic similarity (default for strings)
`number_exact`	Exact numeric equality
`number_tolerance`	Match within tolerance (default for numbers)
`integer_exact`	Exact integer equality (default for integers)
`boolean_exact`	Exact boolean equality (default for booleans)
`array_llm`	LLM evaluation of entire array (default for arrays)
`skip`	Skip evaluation for this node

Custom Metrics

from extract_bench import global_metric_registry
from extract_bench.evaluation.metrics import BaseMetric, MetricResult

class MyCustomMetric(BaseMetric):
    metric_id = "my_custom"

    async def evaluate(self, node, config=None):
        gold = node.get_gold_value()
        extracted = node.get_extracted_value()
        return MetricResult(
            metric_id=self.metric_id,
            score=1.0,
            passed=True,
            details={"custom": "data"}
        )

global_metric_registry.register_metric(MyCustomMetric)

Configuration

Environment Setup

LLM-based metrics use LiteLLM. Configure your provider:

# Vertex AI (Google Cloud)
gcloud auth application-default login

# OpenAI
export OPENAI_API_KEY=sk-...

# Or copy .env.example to .env

LLM Model Configuration

Default model: vertex_ai/gemini-2.5-flash (or set DEFAULT_LLM_MODEL in .env).

Override per-field in schema:

schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "string",
            "evaluation_config": {
                "metrics": [{"metric_id": "string_semantic", "params": {"model": "openai/gpt-4o-mini"}}]
            },
        }
    },
}

Architecture

extract-bench/
├── dataset/                   # Benchmark dataset (see dataset/README.md)
│   ├── {domain}/{schema}/     #   e.g. finance/10k/, academic/research/
│   │   ├── *-schema.json      #   JSON Schema with evaluation_config per field
│   │   └── pdf+gold/          #   Source PDFs + human-validated gold JSONs
├── src/extract_bench/         # Evaluation suite
│   ├── infra/                 #   Schema AST (nodes, visitors)
│   └── evaluation/
│       ├── metrics/           #   Metric implementations
│       └── reporting/         #   Report generation (see reporting/README.md)

Schema → AST → Values instantiated → Metrics evaluated async in parallel → Report generated.

Development

pip install -e ".[dev]"
pytest tests/ -v

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
dataset		dataset
src/extract_bench		src/extract_bench
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Extract Bench

Table of Contents

Getting Started

Installation

Quick Start

Evaluation API

ReportConfig Options

Batch Evaluation

Report Output Format

summary.txt

fields.csv

Low-Level API

Metrics

Available Metrics

Evaluation Presets

Custom Metrics

Configuration

Environment Setup

LLM Model Configuration

Architecture

Development

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

ContextualAI/extract-bench

Folders and files

Latest commit

History

Repository files navigation

Extract Bench

Table of Contents

Getting Started

Installation

Quick Start

Evaluation API

ReportConfig Options

Batch Evaluation

Report Output Format

summary.txt

fields.csv

Low-Level API

Metrics

Available Metrics

Evaluation Presets

Custom Metrics

Configuration

Environment Setup

LLM Model Configuration

Architecture

Development

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages