Seeklet is a minimal, educational web search engine written in Python.
It is designed as a hands-on learning playground for developers who want to understand the fundamentals of:
- web crawling
- HTML extraction
- text normalization and tokenization
- inverted indexing
- BM25 ranking
- SQLite-backed search systems
- open-source Python project structure and workflows
Seeklet intentionally favors clarity, simplicity, and contributor friendliness over advanced features or production-scale complexity.
Modern search systems can become complex quickly. Seeklet exists to show the core ideas with a small, readable codebase.
The project aims to be:
- minimal: only the essential moving parts
- practical: built around real website crawling and search
- educational: architecture and logic are easy to inspect
- open-source friendly: straightforward setup, tooling, and testing
Current MVP features:
- seeded website crawling from one or more URLs
- same-host crawl scoping
- robots.txt support
- HTML title, text, and link extraction
- normalized URL handling
- SQLite-backed local persistence
- inverted index storage
- BM25 ranking
- result snippets
- CLI commands for:
  - crawl
  - search
  - stats
  - reset
- automated tests with pytest
- linting and formatting with ruff
- GitHub Actions CI
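As an illustration of three of the features above (normalized URLs, same-host scoping, and robots.txt checks), here is a minimal sketch that uses only the standard library. The function names `normalize_url`, `same_host`, and `build_robots` are hypothetical and are not taken from Seeklet's code; the real crawl.py and normalize.py may apply different rules.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def same_host(url: str, seed_url: str) -> bool:
    """Return True only when url is on exactly the same host as the seed."""
    return urlsplit(url).hostname == urlsplit(seed_url).hostname


def build_robots(lines: list[str]) -> RobotFileParser:
    """Parse robots.txt rules from already-fetched lines (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


seed = "https://example.com/"
robots = build_robots(["User-agent: *", "Disallow: /private/"])

for candidate in ["HTTPS://Example.com/private/page#top", "https://blog.example.com/"]:
    url = normalize_url(candidate)
    allowed = same_host(url, seed) and robots.can_fetch("*", url)
    print(url, "->", "crawl" if allowed else "skip")
```

Both candidates are skipped here: the first is blocked by the robots.txt rule, the second is on a different host.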
Seeklet is intentionally not trying to be a full production search engine.
Not included in the MVP:
- JavaScript rendering
- distributed crawling
- asynchronous crawling
- PageRank or link-analysis ranking
- phrase search
- boolean search
- fuzzy search
- semantic/vector search
- REST API
- full browser UI
These can be added later as follow-up learning milestones.
seed URLs
-> crawl allowed pages
-> fetch HTML
-> extract title, text, and links
-> normalize URLs and tokenize text
-> rebuild SQLite index
-> execute BM25 search
-> print ranked CLI results
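The tokenize-and-index steps in the flow above can be sketched in a few lines. This is an in-memory illustration only: the `tokenize` and `build_index` names and the token regex are assumptions, and Seeklet itself persists its index in SQLite rather than in a dictionary.

```python
import re
from collections import Counter, defaultdict

# Assumed tokenization rule: lowercase runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs: dict[int, str]) -> dict[str, dict[int, int]]:
    """Map each term to {document id: term frequency}."""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return dict(index)


docs = {1: "Example Domain", 2: "Another example: example pages."}
index = build_index(docs)
print(index["example"])
```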
Core modules:
- crawl.py: seeded crawling and robots.txt handling
- extract.py: HTML parsing and content extraction
- normalize.py: URL normalization and tokenization
- storage.py: SQLite schema and storage helpers
- index.py: index rebuilding
- ranking.py: BM25 scoring helpers
- search.py: query execution
- snippet.py: result snippet generation
- cli.py: command-line interface
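To give a feel for what the extraction step involves, here is a rough sketch built on the standard library's `html.parser`. The `PageExtractor` class and `extract` function are hypothetical names, and this README does not say which parser extract.py actually uses, so treat this as one possible approach rather than Seeklet's implementation.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the page title, visible text, and link targets."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.text_parts: list[str] = []
        self.links: list[str] = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str) -> tuple[str, str, list[str]]:
    """Return (title, text, links) for one HTML document."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), " ".join(parser.text_parts), parser.links


page = (
    "<html><head><title>Example Domain</title><style>p{}</style></head>"
    "<body><p>Hello <a href='/about'>about us</a></p>"
    "<script>var x=1;</script></body></html>"
)
title, text, links = extract(page)
print(title, "|", text, "|", links)
```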
For more detail, see docs/architecture.md.
Seeklet is currently at the MVP stage.
It is ready to:
- crawl a small website
- extract and index its pages locally
- perform BM25-based keyword search from the CLI
It is not yet optimized for large-scale crawling or advanced retrieval features.
- Python 3.12
- Linux or macOS
- internet access for crawling live websites
git clone https://github.com/0xklkuo/seeklet.git
cd seeklet
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

seeklet crawl https://example.com --max-pages 20 --max-depth 1
seeklet stats
seeklet search "example domain"
seeklet reset --yes

Crawl and index one or more seed URLs.
seeklet crawl SEED_URL [SEED_URL ...] [--db PATH] [--max-pages N] [--max-depth N] [--delay-seconds N]
Example:
seeklet crawl https://example.com --max-pages 50 --max-depth 2
Options:
- --db: path to the SQLite database
- --max-pages: maximum number of pages to crawl
- --max-depth: maximum crawl depth from seed URLs
- --delay-seconds: delay between requests
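One way to picture how `--max-pages` and `--max-depth` bound a crawl is a breadth-first traversal over a link graph. The sketch below is an assumption about the general shape of such a loop, not a copy of crawl.py: `crawl_order` and the `get_links` callback are hypothetical, the graph is an in-memory stand-in for real pages, and the per-request delay is left out.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_order(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_pages: int,
    max_depth: int,
) -> list[str]:
    """Return pages in breadth-first order, bounded by page count and depth."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque((url, 0) for url in unique_seeds)
    seen = set(unique_seeds)
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical site: "/" links to "/a" and "/b", "/a" links to "/c".
graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/"], "/c": []}
shallow = crawl_order(["/"], lambda u: graph.get(u, []), max_pages=10, max_depth=1)
capped = crawl_order(["/"], lambda u: graph.get(u, []), max_pages=2, max_depth=5)
print(shallow, capped)
```

In this sketch a depth of 0 means "seeds only"; whether Seeklet counts depth the same way is not stated here.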
Search the local index.
seeklet search "query text" [--db PATH] [--top-k N]
Example:
seeklet search "python packaging"
Options:
- --db: path to the SQLite database
- --top-k: maximum number of results to return
Show index statistics.
seeklet stats [--db PATH]

Delete local index data.
seeklet reset [--db PATH] [--yes]

seeklet crawl https://example.com --max-pages 20 --max-depth 1
seeklet stats
seeklet search "example domain"
seeklet reset --yes

Example output shape for search:
1. Example Domain
URL: https://example.com/
Score: 1.2345
Snippet: This domain is for use in illustrative examples...
Exact results depend on the crawled site and its content.
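A snippet like the one in the output above can be produced by cutting a window of text around the first matching query term. The following is a simplified sketch under that assumption; `make_snippet` and its `width` parameter are hypothetical, and snippet.py may choose its window differently.

```python
def make_snippet(text: str, query_terms: list[str], width: int = 60) -> str:
    """Return up to `width` characters of text around the earliest query term."""
    lowered = text.lower()
    positions = []
    for term in query_terms:
        pos = lowered.find(term.lower())
        if pos != -1:
            positions.append(pos)
    # Center the window on the earliest match, or start at the beginning.
    start = max(0, min(positions) - width // 2) if positions else 0
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return f"{prefix}{text[start:end].strip()}{suffix}"


body = "This domain is for use in illustrative examples in documents."
snippet = make_snippet(body, ["illustrative"], width=30)
print(snippet)
```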
Seeklet stores its local index in SQLite.
Main tables:
- documents: one row per crawled page
- terms: one row per normalized term
- postings: term-to-document mapping with term frequency
This keeps the storage layer:
- simple
- inspectable
- easy to learn from
- easy to run locally without extra services
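A three-table layout like this could look roughly as follows. Only the table names come from this README; every column name below is an assumption made for illustration, and the actual schema in storage.py may differ.

```python
import sqlite3

# Assumed columns; only the table names are taken from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    length INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS terms (
    id INTEGER PRIMARY KEY,
    term TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS postings (
    term_id INTEGER NOT NULL REFERENCES terms(id),
    document_id INTEGER NOT NULL REFERENCES documents(id),
    term_frequency INTEGER NOT NULL,
    PRIMARY KEY (term_id, document_id)
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO documents (url, title, body, length) VALUES (?, ?, ?, ?)",
    ("https://example.com/", "Example Domain", "example domain", 2),
)
conn.execute("INSERT INTO terms (term) VALUES (?)", ("example",))
conn.execute(
    "INSERT INTO postings (term_id, document_id, term_frequency) VALUES (1, 1, 1)"
)

# Look up which documents contain a term, and how often.
rows = conn.execute(
    """
    SELECT d.url, p.term_frequency
    FROM postings AS p
    JOIN terms AS t ON t.id = p.term_id
    JOIN documents AS d ON d.id = p.document_id
    WHERE t.term = ?
    """,
    ("example",),
).fetchall()
print(rows)
```

Because everything lives in one SQLite file, the index can also be inspected by hand with the `sqlite3` command-line shell.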
Seeklet currently uses BM25 for ranking.
At query time:
- query text is normalized and tokenized
- matching terms are looked up in the index
- postings are loaded from SQLite
- BM25 scores are computed in Python
- top results are sorted and printed
This keeps the ranking logic explicit and educational.
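The scoring step can be sketched with a common BM25 formulation. The parameter defaults (`k1=1.2`, `b=0.75`) and the smoothed IDF variant used here are conventional choices and are assumptions, as is the `bm25_scores` name; ranking.py may use different constants or a different IDF form.

```python
import math


def bm25_scores(
    query_terms: list[str],
    postings: dict[str, dict[int, int]],  # term -> {doc id: term frequency}
    doc_lengths: dict[int, int],          # doc id -> number of tokens
    k1: float = 1.2,
    b: float = 0.75,
) -> dict[int, float]:
    """Return a BM25 score for every document matching at least one term."""
    n_docs = len(doc_lengths)
    if n_docs == 0:
        return {}
    avg_len = sum(doc_lengths.values()) / n_docs
    scores: dict[int, float] = {}
    for term in query_terms:
        docs_with_term = postings.get(term, {})
        df = len(docs_with_term)
        if df == 0:
            continue
        # Smoothed IDF that stays positive even for very common terms.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in docs_with_term.items():
            norm = 1 - b + b * doc_lengths[doc_id] / avg_len
            gain = idf * tf * (k1 + 1) / (tf + k1 * norm)
            scores[doc_id] = scores.get(doc_id, 0.0) + gain
    return scores


postings = {"example": {1: 1, 2: 2}, "domain": {1: 1}}
doc_lengths = {1: 2, 2: 4, 3: 5}
scores = bm25_scores(["example", "domain"], postings, doc_lengths)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Document 1 ranks first because it matches both query terms and is shorter than average; document 3 matches nothing and receives no score.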
Seeklet follows a few simple rules:
- prefer the standard library when it keeps the code clear
- add dependencies only when they clearly help
- keep modules small and focused
- favor readability over cleverness
- avoid premature optimization
- keep the contributor experience simple
Current limitations are intentional:
- crawl scope is limited to the original host(s)
- JavaScript-rendered pages are not indexed
- the crawler is synchronous
- each crawl rebuilds the full index
- tokenization is intentionally basic
- ranking uses only term-based BM25, not link signals or semantics
These are acceptable tradeoffs for the educational MVP.
ruff check .
ruff format --check .
pytest

python -m seeklet --help

pip install -e ".[dev]"

src/seeklet/
    __init__.py
    __main__.py
    cli.py
    config.py
    crawl.py
    extract.py
    index.py
    models.py
    normalize.py
    ranking.py
    search.py
    snippet.py
    storage.py
tests/
docs/
.github/workflows/
Contributions are welcome, especially if they preserve the project's goals:
- simplicity
- readability
- educational value
- minimalism
Please read CONTRIBUTING.md before opening a pull request.
Good first areas for contribution:
- better tokenization
- sitemap support
- benchmark scripts
- phrase search
- optional subdomain crawl mode
- improved documentation and examples
Possible next steps after the MVP:
- incremental recrawling
- phrase search
- boolean search
- optional subdomain crawling
- sitemap discovery
- better language handling
- benchmark scripts
- simple web UI
- link-analysis ranking experiments
These are future learning extensions, not required for the current MVP.
Seeklet is intentionally structured to be approachable for contributors.
The repo includes:
- tests
- linting/formatting
- CI
- architecture documentation
- contributor guidance
The goal is to make it easy for developers of different experience levels to:
- run the project
- inspect the code
- understand the design
- contribute improvements
This project is licensed under the MIT License.
Seeklet is inspired by the idea that the best way to understand search systems is to build one from first principles, with modern tooling but without unnecessary complexity.