MIT — see LICENSE.
An embedding-model evaluation framework for comparing multiple request-building and chunking strategies across multiple embedding models.
There are two Streamlit entrypoints:
- Import UI (build data/sample.jsonl from CSV/JSON/JSONL):
  streamlit run ui/app.py
- Dashboard UI (run evaluation + build report + view charts/tables):
  streamlit run data_viewer.py

- Run the importer:
  streamlit run ui/app.py
- Import your dataset and write data/sample.jsonl
- Run the dashboard:
  streamlit run data_viewer.py
- Click Execute Model Evaluation to generate:
  - results/results.tsv
  - results/summary.csv
- CI: GitHub Actions runs tests on Python 3.9–3.12 (.github/workflows/ci.yml).
- Lint/format: ruff (configured for py39 via ruff.toml).
- Tests: pytest unit tests cover deterministic helpers (no model download required).
This generic version is designed for portfolio/public sharing:
- No company-specific database schemas
- No internal model names
- No Postgres dependency required (uses a local JSONL dataset)
- Strategy ablations, not just model swaps: compare request-building approaches (current vs chunking) to isolate whether retrieval quality is dominated by prompt construction rather than the embedding model itself.
- Field-aware retrieval design: weighted_chunking encodes fields independently, applies explicit per-field weights, and then pools, which is useful when certain attributes (e.g. tags vs description) should contribute more to similarity.
- Data hygiene as a first-class baseline: cleansed shows how lightweight normalization (ASCII cleanup, whitespace collapse, dedupe) can change embedding inputs and therefore ranking behavior.
- Production-parity input simulation: triton_input_simulator.py mirrors Triton-style string tensor formatting so you can catch serving-time text formatting issues locally before deploying.
- Evaluation plumbing that’s easy to extend: JSONL in → per-(row, model, strategy) TSV out → aggregated report (build_report.py) enables quick iteration and adding new strategies/metrics without changing the dataset format.
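
The normalization steps named for the cleansed baseline (ASCII cleanup, whitespace collapse, dedupe) can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: the helper names `cleanse` and `dedupe` and the case-insensitive dedupe rule are assumptions made for the example.

```python
import re
import unicodedata


def cleanse(text: str) -> str:
    """Fold text to ASCII and collapse runs of whitespace (illustrative helper)."""
    # NFKD splits accented characters into a base letter plus a combining mark,
    # so the ASCII encode step below drops only the mark, not the letter.
    folded = unicodedata.normalize("NFKD", text)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()


def dedupe(values):
    """Drop empty values and case-insensitive repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for value in values:
        key = value.lower()  # assumption: duplicates are compared case-insensitively
        if key and key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


# Toy field values; the second is a duplicate of the first once cleansed.
fields = ["Café   Talks ", "cafe talks", "AI\tpolicy", ""]
cleaned = dedupe([cleanse(f) for f in fields])
print(cleaned)
```

Because two raw strings can collapse to the same cleansed text, the request built from cleansed fields can be shorter than the one built from raw fields, which is one way this baseline changes the embedding input.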
flowchart TD
R[Row inputs]
S{Strategy}
BR1[build_request json_data]
FM[build_field_map json_data]
BR2[build_request field_map]
TOK[tokenize request]
KW[keyword]
TR[simulate_triton_encode\nTriton-style string tensor]
EMB[Embeddings\nkeyword_emb plus request_emb]
COS[cosine_similarity]
OUT[results.tsv row]
REP[build_report.py\naggregations plus correlations]
R --> S
S -->|current or cleansed| BR1
S -->|chunking or weighted_chunking| FM
FM --> BR2
BR1 --> TOK
BR2 --> TOK
R --> KW --> TR
TOK --> TR
TR --> EMB --> COS --> OUT --> REP
Given rows containing:
- an id
- an entity_type
- json_data fields (e.g. name, description, sponsors, tagCategories, tagTopics)
- a keyword
- a labeled relevance_score
…the runner builds different request texts, embeds keyword + request, computes cosine similarity, and writes results to TSV.
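
The similarity step above reduces to one formula applied to two vectors. The sketch below uses tiny hand-written vectors in place of real model embeddings, and returning 0.0 for an all-zero vector is a convention chosen for this example rather than something the runner is known to do.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


keyword_emb = [1.0, 0.0, 1.0]  # toy stand-in for a keyword embedding
request_emb = [1.0, 1.0, 0.0]  # toy stand-in for a request embedding
score = cosine_similarity(keyword_emb, request_emb)
print(round(score, 3))
```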
By default, the runner compares two public SentenceTransformer models:
- BAAI/bge-base-en-v1.5
- BAAI/bge-large-en-v1.5
You can override via --models.
The generic runner evaluates these strategies (see strategies/):
- current: concatenate normalized fields into one request
- chunking: embed each field independently and mean-normalize the pooled embedding
- weighted_chunking: like chunking but applies fixed field weights
- cleansed: uses basic text cleansing + dedupe before request construction
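
To make the difference between chunking and weighted_chunking concrete, here is a rough sketch of pooling per-field vectors with and without weights. The function name `weighted_pool`, the weight values, and the choice of a weighted mean followed by L2 normalization are all assumptions for illustration; the code in strategies/ may pool differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def weighted_pool(field_embeddings, field_weights, default_weight=1.0):
    """Weighted mean of per-field vectors, then L2-normalized (hypothetical helper)."""
    if not field_embeddings:
        raise ValueError("need at least one field embedding")
    dim = len(next(iter(field_embeddings.values())))
    pooled = [0.0] * dim
    total = 0.0
    for name, vec in field_embeddings.items():
        weight = field_weights.get(name, default_weight)
        total += weight
        for i, x in enumerate(vec):
            pooled[i] += weight * x
    if total == 0.0:
        raise ValueError("field weights sum to zero")
    return l2_normalize([x / total for x in pooled])


# Toy 2-d vectors standing in for per-field embeddings.
field_embeddings = {"name": [1.0, 0.0], "description": [0.0, 1.0]}

# Empty weights fall back to 1.0 everywhere, i.e. plain mean pooling.
uniform = weighted_pool(field_embeddings, {})
# Hypothetical weights that favor the name field.
weighted = weighted_pool(field_embeddings, {"name": 3.0, "description": 1.0})
print(uniform, weighted)
```

With uniform weights both fields contribute equally to the pooled direction; raising one field's weight tilts the pooled vector toward that field, which is the effect the weighted strategy is meant to test.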
A small synthetic dataset lives at:
data/sample.jsonl
Each line is a JSON object.
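
For orientation, a row with the fields listed earlier might look like the following. All values here are invented, and the value types (lists for sponsors and tags, an integer relevance score) are assumptions; check data/sample.jsonl for the actual shapes.

```python
import io
import json

# Hypothetical row: field names follow the README, values and types are made up.
row = {
    "id": "ep-001",
    "entity_type": "episode",
    "json_data": {
        "name": "Intro to vector search",
        "description": "A walkthrough of embedding-based retrieval.",
        "sponsors": [],
        "tagCategories": ["Technology"],
        "tagTopics": ["search", "embeddings"],
    },
    "keyword": "vector search",
    "relevance_score": 3,
}

# JSONL means one JSON object per line; an in-memory buffer stands in for the file.
buffer = io.StringIO()
buffer.write(json.dumps(row) + "\n")
buffer.seek(0)

rows = [json.loads(line) for line in buffer if line.strip()]
print(rows[0]["keyword"], rows[0]["relevance_score"])
```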
A TSV is written containing, per (row, model, strategy):
- request text
- tokens/token length
- cosine similarity
- relevance labels
Default output path:
results/results.tsv
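
A TSV of this shape lends itself to per-(model, strategy) aggregation. The sketch below is not build_report.py: the column names (`model`, `strategy`, `cosine_similarity`, `relevance_score`) and the inline sample rows are hypothetical, and it only shows one plausible summary, the mean cosine similarity and its Pearson correlation with the relevance label.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical TSV content; real column names in results.tsv may differ.
SAMPLE_TSV = (
    "model\tstrategy\tcosine_similarity\trelevance_score\n"
    "m1\tcurrent\t0.2\t1\n"
    "m1\tcurrent\t0.4\t2\n"
    "m1\tcurrent\t0.6\t3\n"
    "m1\tchunking\t0.5\t1\n"
    "m1\tchunking\t0.3\t3\n"
)


def pearson(xs, ys):
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Collect similarities and labels per (model, strategy) pair.
groups = defaultdict(lambda: ([], []))
for rec in csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"):
    sims, labels = groups[(rec["model"], rec["strategy"])]
    sims.append(float(rec["cosine_similarity"]))
    labels.append(float(rec["relevance_score"]))

summary = {
    key: {
        "n": len(sims),
        "mean_cosine": sum(sims) / len(sims),
        "pearson": pearson(sims, labels),
    }
    for key, (sims, labels) in groups.items()
}
for key, stats in summary.items():
    print(key, stats)
```

A strategy whose similarity scores rise with the relevance label shows a positive correlation; a negative value, as in the toy chunking rows here, would mean the scores rank relevant rows lower.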
Run evaluation:
python3 run_evaluation.py \
--data data/sample.jsonl \
--out results/results.tsv
Generate an aggregated summary report:
python3 build_report.py \
--in results/results.tsv \
--out results/summary.csv
Importer UI:
streamlit run ui/app.py
Dashboard UI:
streamlit run data_viewer.py
Filter to a single entity type:
python3 run_evaluation.py --entity-type episode
Evaluate different models:
python3 run_evaluation.py \
--models BAAI/bge-base-en-v1.5 BAAI/bge-large-en-v1.5
flowchart TD
A[JSONL dataset] --> B[run_evaluation.py]
B --> C[Load models\nSentenceTransformers]
B --> D[Strategies\ncurrent/chunking/weighted/cleansed]
D --> E[Cosine similarity\nkeyword vs request]
E --> F[results/results.tsv]
F --> G[build_report.py]
G --> H[results/summary.csv]
Run the unit tests:
.venv/bin/python -m pytest -q
Below are example screenshots from screenshots/ showing the Streamlit dashboard output.
- Models are downloaded/cached automatically by sentence-transformers.
- For a portfolio repo, avoid committing any internal/copyrighted datasets.
- On macOS, Hugging Face stores its cached models at ~/.cache/huggingface/hub. You can completely clear this cache by running:
  rm -rf ~/.cache/huggingface/hub








