⛏️ OARC-Crawlers ⛏️

OARC-Crawlers is a Python framework for acquiring, processing, and storing data from diverse online sources, including YouTube, GitHub, ArXiv, DuckDuckGo, and general websites. It is modular and asynchronous: each source has a specialized crawler, all crawlers persist data as Apache Parquet through a central ParquetStorage component, and both a command-line interface (CLI) and a Python API are provided. The framework aims to streamline data collection for research, analysis, and integration into agentic AI workflows by delivering structured, analysis-ready outputs.

Feature	Description
🎬 YouTube Crawler	Download videos, playlists, and extract captions
🐙 GitHub Crawler	Clone repositories and extract code for analysis
🦆 DuckDuckGo Crawler	Search for text, images, and news
🌐 Web Crawler	Extract content from websites using BeautifulSoup
📜 ArXiv Crawler	Download academic papers and their LaTeX source
💾 Parquet Storage	Utility for saving and loading data in Parquet format

Setup

OARC-Crawlers requires Python >=3.10 and <3.12 (Python 3.12+ is not yet fully supported due to some dependency compatibility issues).
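
If a script should fail early on an unsupported interpreter, a guard along these lines may help. This is a minimal sketch and not part of the package: the helper name `is_supported_python` is hypothetical, and the bounds simply mirror the requirement stated above.

```python
import sys

# Bounds mirror the stated requirement: Python >=3.10 and <3.12.
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 12)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the supported range.

    Hypothetical helper for illustration; not part of oarc-crawlers.
    With no argument, the running interpreter is checked.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


print(is_supported_python((3, 11)))  # inside the range
print(is_supported_python((3, 12)))  # outside the range
```

Calling `is_supported_python()` with no argument checks the interpreter that is currently running, so a script can exit with a clear message before any import fails.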

# Install package from PyPI
pip install oarc-crawlers

Using the CLI

Once installed with pip, OARC-Crawlers can be used immediately from the command line with no further configuration. See our Cheat Sheet for a quick reference of common commands.

# Basic CLI usage examples
oarc-crawlers yt download --url "https://youtube.com/watch?v=dQw4w9WgXcQ"
oarc-crawlers gh analyze --url "https://github.com/pytorch/pytorch"
oarc-crawlers arxiv download --id "2103.00020"
oarc-crawlers ddg text --query "python async programming" --max-results 5
oarc-crawlers web pypi --package "requests"

# Get help for any command
oarc-crawlers --help
oarc-crawlers yt --help

For more detailed CLI usage examples and command patterns, see the CLI Usage and Examples.

API Usage

OARC-Crawlers provides a simple, intuitive Python API that allows you to integrate any of the crawlers directly into your Python applications or workflows. Each crawler is designed with async support for efficient execution.

Basic Import Pattern

from oarc_crawlers import (
    YTCrawler,
    GHCrawler,
    ArxivCrawler,
    DDGCrawler,
    WebCrawler,
    ParquetStorage,
)

Quick Examples

import asyncio
from oarc_crawlers import YTCrawler, ParquetStorage

async def download_video():
    # Initialize the crawler
    yt = YTCrawler(data_dir="./data")
    
    # Download a video
    result = await yt.download_video("https://youtube.com/watch?v=dQw4w9WgXcQ")
    
    # Print the result
    print(f"Video saved to: {result.get('file_path', 'N/A')}")
    print(f"Video title: {result.get('title', 'Unknown')}")

# Run the async function
asyncio.run(download_video())
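
Because the crawler methods are coroutines, several requests can in principle be awaited concurrently. The sketch below shows the pattern with `asyncio.gather`. To stay runnable without network access it uses a stand-in coroutine, `fake_download` (hypothetical), in place of `yt.download_video`; the `title` and `file_path` keys mirror the ones read in the example above. Whether the real crawler tolerates concurrent calls is an assumption to verify before relying on it.

```python
import asyncio


async def fake_download(url: str) -> dict:
    """Stand-in for a crawler coroutine such as yt.download_video (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network-bound work
    video_id = url.rsplit("=", 1)[-1]
    return {"title": f"video {video_id}", "file_path": f"./data/{video_id}.mp4"}


async def download_many(urls: list[str]) -> list[dict]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_download(u) for u in urls))


results = asyncio.run(download_many([
    "https://youtube.com/watch?v=aaa",
    "https://youtube.com/watch?v=bbb",
]))
for r in results:
    print(r["title"], "->", r["file_path"])
```

Swapping `fake_download` for a real crawler method keeps the same shape: build one coroutine per input and await them together.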

Running Example Scripts

The package includes a collection of example scripts that demonstrate how to use each component:

# Run specific module example
python examples/run_example.py youtube
python examples/run_example.py github
python examples/run_example.py ddg
python examples/run_example.py bs # Note: 'bs' refers to BeautifulSoup/WebCrawler examples
python examples/run_example.py arxiv
python examples/run_example.py parquet

# Run the combined example
python examples/run_example.py combined

# Run all examples
python examples/run_example.py all

For detailed examples and advanced usage patterns, check our comprehensive Examples documentation.

Example Categories

Category	Examples
Basic Operations	Initializing Crawlers
	Configuring Storage Paths
	Error Handling
YouTube Operations	YouTube CLI Examples
	Download a Video
	Download a Playlist
	Extract Captions
	Search Videos
	Fetch Chat Messages
GitHub Operations	GitHub CLI Examples
	Clone a Repository
	Analyze Code
	Search Repositories
Search Operations	DuckDuckGo CLI Examples
	Text Search
	News and Image Search
ArXiv Operations	ArXiv CLI Examples
	Downloading Papers
	Extracting LaTeX Sources
	Extracting Keywords and References
	Working with Categories
	Generate Citation Network
Web Crawling	Web Crawler CLI Examples
	Crawling Websites
	Extracting Specific Content
Data Management	Data Management CLI Examples
	Working with Parquet Files
	Converting Between Formats
	Working with the Parquet Storage System

Development

For development purposes, oarc-crawlers can be installed by cloning the repository:

# Clone the repository
git clone https://github.com/Ollama-Agent-Roll-Cage/oarc-crawlers.git
cd oarc-crawlers

# Install UV package manager
pip install uv

# Create a virtual environment with UV (use Python 3.10 or 3.11)
uv venv --python 3.11

# Install the package and dependencies in one step
# (quotes keep shells such as zsh from expanding the brackets)
uv run pip install -e ".[dev]"

Package Structure

oarc-crawlers/
├── .github/                     # GitHub workflows and config
├── docs/                        # Core documentation
├── examples/                    # Example usage scripts
├── src/
│   ├── oarc_crawlers/           # Source code to package
│   └── tests/                   # Unit tests
├── README.md                    # Project overview
└── LICENSE                      # Apache 2.0

See the Project Structure for a detailed module breakdown.

Running OARC Tests

To run all tests:

uv run pytest

Or to run a specific test:

uv run pytest src/tests/test_parquet_storage.py

Architecture

The oarc-crawlers package is designed with a modular architecture, allowing for easy extension and maintenance. Each crawler (YTCrawler, GHCrawler, ArxivCrawler, DDGCrawler, WebCrawler) operates independently but shares a common interface for data storage via the ParquetStorage utility. The system leverages asynchronous programming (asyncio) for efficient I/O operations, especially crucial for network-bound tasks like downloading videos or cloning repositories. A unified Command Line Interface (CLI) built with click provides a consistent user experience across all modules.

graph TD
    subgraph UserInteraction
        CLI[CLI Click]
        API[Python API]
    end

    subgraph CoreModules
        Router{Module Router}
        YT[YTCrawler]
        GH[GHCrawler]
        AX[ArxivCrawler]
        DDG[DDGCrawler]
        WEB[WebCrawler]
    end

    subgraph DataHandling
        PS[ParquetStorage]
        FS[(File System)]
    end

    subgraph ExternalSources
        SourceYT[YouTube API]
        SourceGH[GitHub API]
        SourceAX[ArXiv API]
        SourceDDG[DuckDuckGo]
        SourceWeb[Web Pages]
    end

    CLI --> Router
    API --> Router
    Router --> YT
    Router --> GH
    Router --> AX
    Router --> DDG
    Router --> WEB

    YT --> SourceYT
    GH --> SourceGH
    AX --> SourceAX
    DDG --> SourceDDG
    WEB --> SourceWeb

    SourceYT -- Data --> YT
    SourceGH -- Data --> GH
    SourceAX -- Data --> AX
    SourceDDG -- Data --> DDG
    SourceWeb -- Data --> WEB

    YT -- Data --> PS
    GH -- Data --> PS
    AX -- Data --> PS
    DDG -- Data --> PS
    WEB -- Data --> PS

    PS --> FS
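
The flow in the diagram (interface, router, crawler, shared storage) can be sketched in plain Python. Every name here (`InMemoryStorage`, `StubCrawler`, `Router`) is a hypothetical stand-in chosen for illustration; the package's actual internals may be organized differently.

```python
import asyncio


class InMemoryStorage:
    """Hypothetical stand-in for ParquetStorage: keeps rows in a dict, not files."""

    def __init__(self):
        self.saved = {}

    def save(self, key, rows):
        self.saved[key] = list(rows)


class StubCrawler:
    """Hypothetical crawler: fetches from a source, then hands data to shared storage."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage

    async def crawl(self, target):
        await asyncio.sleep(0)  # placeholder for a network request
        rows = [{"source": self.name, "target": target}]
        self.storage.save(f"{self.name}:{target}", rows)
        return rows


class Router:
    """Maps a module name (as a CLI subcommand might) to its crawler instance."""

    def __init__(self, storage):
        names = ("yt", "gh", "arxiv", "ddg", "web")
        self.crawlers = {n: StubCrawler(n, storage) for n in names}

    async def dispatch(self, module, target):
        if module not in self.crawlers:
            return {"error": f"unknown module: {module}"}
        return await self.crawlers[module].crawl(target)


storage = InMemoryStorage()
router = Router(storage)
rows = asyncio.run(router.dispatch("arxiv", "2103.00020"))
print(rows)                 # rows returned to the caller
print(list(storage.saved))  # the same rows were also persisted
```

The point of the sketch is the shape: crawlers stay independent of each other, and the only thing they share is the storage object passed in at construction.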

FAQ

Q: What is OARC-Crawlers?
A: OARC-Crawlers is a modular Python framework for acquiring, processing, and storing data from sources like YouTube, GitHub, ArXiv, DuckDuckGo, and general websites. It provides both CLI and Python API interfaces, with unified Parquet storage for all outputs.

Q: Which Python versions are supported?
A: Python 3.10 and 3.11 are supported. Python 3.12+ is not yet fully supported due to dependency compatibility.

Q: How do I install OARC-Crawlers?
A: Install from PyPI with pip (pip install oarc-crawlers) or from source. See the "Setup" section above for the pip install and the "Development" section for a source install.

Q: What is the recommended way to manage environments?
A: Use the uv package manager to create a virtual environment with Python 3.10 or 3.11 for best compatibility.

Q: How do I run a crawler from the command line?
A: Use the CLI, e.g.,

oarc-crawlers yt download --url "https://youtube.com/watch?v=..."
oarc-crawlers gh analyze --url "https://github.com/..."

See the CLI Usage section for more examples.

Q: How do I use the Python API?
A: Import the relevant crawler and use async methods. Example:

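import asyncio
from oarc_crawlers import YTCrawler

async def main():
    yt = YTCrawler(data_dir="./data")
    return await yt.download_video("https://youtube.com/watch?v=...")

result = asyncio.run(main())  # await is only valid inside a coroutine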

Q: What data format is used for storage?
A: All crawlers use Apache Parquet for structured, efficient, and analysis-ready data storage.

Q: Can I use OARC-Crawlers in agentic or AI workflows?
A: Yes. The framework is designed for integration with agentic systems and supports asynchronous operation for non-blocking workflows.

Q: How does error handling work?
A: Errors are logged and returned as structured output (e.g., dictionaries with error messages) rather than raising exceptions, allowing workflows to continue when possible.
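
Since failures come back as data, calling code typically inspects each result instead of wrapping calls in try/except. The sketch below illustrates that pattern under two assumptions: the failure marker is a key named `"error"` (the exact key is not specified here), and `fake_crawl` is a hypothetical stand-in for a real crawler method.

```python
import asyncio


async def fake_crawl(url: str) -> dict:
    """Hypothetical crawler call that reports failure as data instead of raising."""
    if not url.startswith("https://"):
        return {"error": f"unsupported URL: {url}"}
    return {"url": url, "title": "example"}


async def crawl_all(urls):
    ok, failed = [], []
    for url in urls:
        result = await fake_crawl(url)
        # Assumed convention: an "error" key marks a failed result.
        (failed if "error" in result else ok).append(result)
    return ok, failed


ok, failed = asyncio.run(crawl_all(["https://example.com", "ftp://example.com"]))
print(f"{len(ok)} succeeded, {len(failed)} failed")
```

One failed item does not stop the loop, which is what lets a longer workflow continue past individual errors.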

Q: Where can I find more examples?
A: See the Examples documentation for detailed usage patterns.

Q: How do I report bugs or request features?
A: Please open an issue on GitHub Issues.

Q: Where can I get help?
A: Refer to the Troubleshooting Guide or contact the maintainers via email or GitHub Issues.

Troubleshooting

For common issues and their solutions, please refer to our Troubleshooting Guide.

MCP & VS Code Integration

OARC-Crawlers provides a Model Context Protocol (MCP) server for agentic workflows and VS Code Copilot Chat integration.

See docs/VSCodeMCP.md for setup and troubleshooting.

To run the MCP server:
```
oarc-crawlers mcp run
```

To install for VS Code:

oarc-crawlers mcp install --name "OARC Crawlers"

License

This project is licensed under the Apache 2.0 License.

Citations

Please use the following BibTeX entry to cite this project:

@software{oarc_crawlers,
  author = {OARC Team},
  title = {OARC-Crawlers: OARC's dynamic webcrawler module collection},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Ollama-Agent-Roll-Cage/oarc-crawlers}}
}

Contact

For questions or support, please contact us at:

Email: NotSetup@gmail.com
Issues: GitHub Issues
