feat: add semantic search marimo notebook#581

Open
jhamon wants to merge 8 commits into main from semantic-search-marimo

Conversation

jhamon (Collaborator) commented May 20, 2026

Summary

Adds a new marimo notebook demonstrating semantic search with Pinecone, converted and significantly expanded from the existing docs/semantic-search.ipynb. The notebook uses Pinecone's Integrated Inference with the multilingual-e5-large model to demonstrate cross-lingual semantic search across English and Spanish sentences.

Changes

  • New notebook docs/semantic-search.py (marimo format) with:
    • Pinecone SDK 9.0.1 API (pc.indexes.*, pc.index(), updated search signature)
    • multilingual-e5-large embedding model for cross-lingual retrieval
    • Refactored dataset prep: filter_pairs + extract_sentences(lang) to produce both English and Spanish records from Tatoeba
    • to_records parameterized on column name with ID prefixes for multi-language upsert
    • mo.ui.table for dataset inspection, mo.status.progress_bar replacing tqdm, mo.ui.run_button for safe index deletion
    • Interactive query section with mo.ui.text and mo.ui.radio for language filter
    • Language filtering section demonstrating metadata filters scoped to en/es
    • Prose interspersed between code cells narrating the process
    • "Meaning Over Keywords" and "How It Works" sections explaining model selection and cross-lingual retrieval
  • pyproject.toml: pins notebook dependencies (datasets==3.5.1, pinecone==9.0.1, numpy, tqdm)
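
As a rough illustration of the `to_records` change listed above, the sketch below builds records for two languages from the same sentence pairs. It is not the notebook's implementation: the flat row shape, the `_id`/`text`/`lang` field names, and the `prefix-index` ID format are assumptions made for this example.

```python
from typing import Iterable


def to_records(rows: Iterable[dict], column: str, id_prefix: str) -> list[dict]:
    """Turn sentence-pair rows into records for one language column.

    Hypothetical sketch: field names and ID format are assumptions.
    """
    return [
        {
            "_id": f"{id_prefix}-{i}",  # the prefix keeps IDs unique across languages
            "text": row[column],
            "lang": column,  # stored so queries can filter by language
        }
        for i, row in enumerate(rows)
    ]


# Made-up sample rows standing in for filtered Tatoeba pairs.
pairs = [
    {"en": "Where is the library?", "es": "¿Dónde está la biblioteca?"},
    {"en": "I like coffee.", "es": "Me gusta el café."},
]

# Both languages go into one list, ready for a single-namespace upsert.
records = to_records(pairs, "en", "en") + to_records(pairs, "es", "es")
```

Without the prefix, the English and Spanish versions of row 0 would share an ID and the second upsert would overwrite the first.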

Test Plan

  • Notebook runs end-to-end with a valid PINECONE_API_KEY
  • Index creation, upsert, and query cells execute without errors
  • Cross-lingual queries return results in both languages
  • Language filter correctly scopes results to en or es
  • Interactive query input updates results on change
  • Delete button safely removes the index

🤖 Generated with Claude Code


Note

Low Risk
This PR only adds a new documentation notebook/script and does not modify production code paths. The main impact is the added dependency and runtime expectations when running the notebook (a Pinecone API key, plus index creation and deletion).

Overview
Adds a new docs/semantic-search.py marimo notebook that walks through semantic search with Pinecone Integrated Inference, including index creation with the multilingual-e5-large model, preparing English/Spanish Tatoeba records, batched upsert_records, and index.search queries.

The notebook also adds an interactive section for running queries with an optional lang metadata filter, plus a guarded cleanup flow to delete the created index via a UI button.
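
The batching step mentioned in the overview could look something like the sketch below. The `batched` helper, the batch size, and the sample records are all hypothetical; the notebook's own batching code is not shown on this page.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Made-up records standing in for the prepared sentences.
records = [
    {"_id": f"en-{i}", "text": f"sentence {i}", "lang": "en"} for i in range(10)
]
batches = list(batched(records, batch_size=4))
```

In the notebook, each batch would be handed to the upsert call, with the loop wrapped in a progress indicator.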

Reviewed by Cursor Bugbot for commit fc1dc00. Bugbot is set up for automated code reviews on this repo. Configure here.

claude and others added 4 commits May 20, 2026 12:46
Converts docs/semantic-search.ipynb to a marimo notebook with:
- Pinecone SDK 9.0.1 API (pc.indexes.*, pc.index(), new search signature)
- Refactored dataset preparation into prepare_sentences/to_records functions
- Keyword filtering with deduplication
- mo.ui.table for dataset inspection
- mo.status.progress_bar replacing tqdm
- mo.ui.run_button for safe index deletion
- Improved prose structure with explanations interspersed between code cells

Also pins notebook dependencies in pyproject.toml.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace print-based result output with mo.vstack containing a bold
query header and a mo.ui.table, and fix three cells that were
incorrectly configured as markdown cells.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Switch embedding model to multilingual-e5-large
- Refactor data prep: filter_pairs + extract_sentences(lang) to embed
  both English and Spanish sentences with prefixed IDs
- Upsert both languages to a single namespace
- Add lang column to search results table
- Add cross-lingual queries (English + Spanish) and a no-keyword query
  to demonstrate meaning-over-keywords retrieval
- Add language filtering section with lang= parameter on search()
- Update How It Works to explain model selection's role in vector space
- Improve prose throughout querying and filtering sections

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add Try It Yourself section with mo.ui.text and mo.ui.radio for
  language filter, results update reactively on input change
- Fix empty cell and duplicate filter query cells
- Correct second language filter query to Spanish

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread docs/semantic-search.py
top_k=top_k,
inputs={"text": query},
filter={"lang": {"$eq": lang}} if lang else None,
)

Incorrect index.search call shape

High Severity

index.search passes top_k, inputs, and filter as top-level keyword arguments without a query dict. For integrated text search, Pinecone v9 still expects query={"inputs": {"text": ...}, "top_k": ..., "filter": ...}, so these calls will raise a TypeError or fail validation and break all query cells.


Reviewed by Cursor Bugbot for commit 72b422c. Configure here.
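
For reference, the nested query shape this comment describes can be sketched as a plain dict builder. Whether SDK 9.0.1 actually requires this shape is the reviewer's claim and is not verified here; `build_search_query` is a hypothetical helper, not part of the notebook or the SDK.

```python
from typing import Any, Optional


def build_search_query(
    text: str, top_k: int = 5, lang: Optional[str] = None
) -> dict[str, Any]:
    """Build the nested query dict described in the review comment.

    Hypothetical helper; the dict layout follows the comment, not a verified API.
    """
    query: dict[str, Any] = {"inputs": {"text": text}, "top_k": top_k}
    if lang:
        # Include the filter key only when a language is selected.
        query["filter"] = {"lang": {"$eq": lang}}
    return query


unfiltered = build_search_query("diseases that spread quickly", top_k=3)
spanish_only = build_search_query("diseases that spread quickly", top_k=3, lang="es")
```

Omitting the `filter` key when no language is chosen avoids passing an explicit `None`, which is one way to sidestep validation differences between SDK versions.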

claude and others added 4 commits May 20, 2026 15:30
Both were added during development but are no longer used in the
notebook — tqdm was replaced by mo.status.progress_bar.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
datasets and pinecone were added to [project.dependencies] by marimo's
package manager during development. Notebook-specific deps belong in
the notebook's inline PEP 723 metadata (# /// script block), not the
root project config. Run with --sandbox to use the inline deps.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
No pyproject.toml changes on this branch that would affect the lock file.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit fc1dc00. Configure here.

Comment thread docs/semantic-search.py
@app.cell
def _(delete_button, index_name, mo, pc):
    mo.stop(not delete_button.value)
    pc.indexes.delete(index_name)

Delete call uses positional instead of keyword argument

Low Severity

pc.indexes.delete(index_name) uses a positional argument, while the same call earlier (line 98) correctly uses name=index_name. The project review rules require preferring named keyword arguments over positional arguments, and the inconsistency within the same notebook makes the example harder to follow.


Triggered by project rule: Bugbot Configuration

Reviewed by Cursor Bugbot for commit fc1dc00. Configure here.
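
As a general Python aside on the rule cited in this comment, a project can enforce keyword-argument calls in its own helpers with keyword-only parameters. The sketch below uses a hypothetical `FakeIndexes` stand-in and a made-up index name; it makes no claim about how the Pinecone client itself is declared.

```python
class FakeIndexes:
    """Hypothetical stand-in for an index manager; not the Pinecone client."""

    def __init__(self) -> None:
        self.deleted: list[str] = []

    def delete(self, *, name: str) -> None:
        # The bare * makes `name` keyword-only, so positional calls fail fast.
        self.deleted.append(name)


indexes = FakeIndexes()
indexes.delete(name="semantic-search-demo")  # keyword form is accepted

try:
    indexes.delete("semantic-search-demo")  # positional form is rejected
    positional_rejected = False
except TypeError:
    positional_rejected = True
```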
