Web Crawler

A performant, extensible, and lean web crawler that uses all available CPUs by default.

It uses an event loop for I/O and separate processes for analyzing pages.
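The split between event-loop I/O and process-based analysis can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the crawler's actual code: `download`, `analyze`, and `crawl` are hypothetical names, and the download is simulated so that no network access is needed.

```python
import asyncio
import os
import re
from concurrent.futures import ProcessPoolExecutor

LINK_PATTERN = re.compile(r'href="([^"]+)"')


async def download(url: str) -> str:
    """Hypothetical stand-in for a real downloader: waits briefly, returns canned HTML."""
    await asyncio.sleep(0.01)  # simulated network latency, no real request is made
    return f'<a href="{url}/a">a</a><a href="{url}/b">b</a>'


def analyze(html: str) -> list[str]:
    """CPU-bound step (here: link extraction). Runs in a worker process."""
    return LINK_PATTERN.findall(html)


async def crawl(urls: list[str]) -> dict[str, list[str]]:
    loop = asyncio.get_running_loop()
    # One worker per CPU, mirroring the "all available CPUs" default (an assumption
    # about how that default might be realized).
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:

        async def handle(url: str) -> tuple[str, list[str]]:
            html = await download(url)  # I/O stays on the event loop
            # Analysis is handed off to the process pool so it does not block the loop.
            links = await loop.run_in_executor(pool, analyze, html)
            return url, links

        return dict(await asyncio.gather(*(handle(u) for u in urls)))
```

To try it, call `asyncio.run(crawl(["https://example.com"]))` under an `if __name__ == "__main__":` guard, which process pools require on platforms that spawn worker processes.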

Batteries included

Basic httpx page downloader
S3 page storage
Local filesystem page storage

Usage

Have a look at tests/integration/test_crawl.py
Implement your own PageAnalyzer and PageDownloader classes
Optionally customize structlog logging (see configuration)
Have fun!
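As a rough idea of what custom downloader and analyzer classes might look like, here is a self-contained sketch. The class names, method names, and signatures below are assumptions made for illustration; the real base classes and the interface they expect are defined in the package and exercised in tests/integration/test_crawl.py, so check there before adapting this.

```python
import asyncio
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects href values from anchor tags using the standard library parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)


class StaticPageDownloader:
    """Hypothetical downloader that serves pages from a dict (useful in tests)."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def download(self, url: str) -> str:  # method name is an assumption
        return self._pages[url]


class LinkPageAnalyzer:
    """Hypothetical analyzer that returns the links found on a page."""

    def analyze(self, content: str) -> list[str]:  # method name is an assumption
        collector = _LinkCollector()
        collector.feed(content)
        return collector.links
```

A quick check of the two pieces together: download a page with `asyncio.run(StaticPageDownloader({...}).download(url))` and pass the result to `LinkPageAnalyzer().analyze(...)`.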

Customization

Every class in the modules folder can be replaced with your own implementation.
