⛏️ OARC-Crawlers ⛏️

OARC-Crawlers is a Python framework for acquiring, processing, and storing data from diverse online sources, including YouTube, GitHub, ArXiv, DuckDuckGo, and general websites. Built around modular, asynchronous components, it provides a specialized crawler for each source, unified data persistence in Apache Parquet through a central ParquetStorage component, and both a command-line interface (CLI) and a Python API. The framework aims to streamline data collection for research, analysis, and integration into agentic AI workflows by delivering structured, analysis-ready outputs.

| Feature | Description |
|---|---|
| 🎬 YouTube Crawler | Download videos, playlists, and extract captions |
| 🐙 GitHub Crawler | Clone repositories and extract code for analysis |
| 🦆 DuckDuckGo Crawler | Search for text, images, and news |
| 🌐 Web Crawler | Extract content from websites using BeautifulSoup |
| 📜 ArXiv Crawler | Download academic papers and their LaTeX source |
| 💾 Parquet Storage | Utility for saving and loading data in Parquet format |

Setup

OARC-Crawlers requires Python >=3.10 and <3.12 (Python 3.12+ is not yet fully supported due to some dependency compatibility issues).

# Install package from PyPI
pip install oarc-crawlers
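The version constraint above can be checked before installing. The following is a minimal sketch, not part of the package: the helper name `is_supported` and the two constants are assumptions made for illustration, and only the bounds (>=3.10 and <3.12) come from this README.

```python
import sys

# Supported range stated in this README: >=3.10 and <3.12.
MIN_VERSION = (3, 10)
MAX_EXCLUSIVE = (3, 12)


def is_supported(version=None):
    """Return True if the (major, minor) pair falls inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_EXCLUSIVE


if __name__ == "__main__":
    if not is_supported():
        current = f"{sys.version_info.major}.{sys.version_info.minor}"
        print(f"Python {current} is outside the supported range (3.10 or 3.11)")
```

Tuple comparison keeps the check readable and avoids parsing version strings.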

Using the CLI

Once installed with pip, OARC-Crawlers can be used immediately from the command line with no further configuration. See our Cheat Sheet for a quick reference of common commands.

# Basic CLI usage examples
oarc-crawlers yt download --url "https://youtube.com/watch?v=dQw4w9WgXcQ"
oarc-crawlers gh analyze --url "https://github.com/pytorch/pytorch"
oarc-crawlers arxiv download --id "2103.00020"
oarc-crawlers ddg text --query "python async programming" --max-results 5
oarc-crawlers web pypi --package "requests"

# Get help for any command
oarc-crawlers --help
oarc-crawlers yt --help

For more detailed CLI usage examples and command patterns, see the CLI Usage and Examples documentation.
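When the CLI needs to be driven from a script, one option is to build the argument list in Python and hand it to `subprocess`. The sketch below is illustrative only: `ddg_text_command` is a hypothetical helper, and the only flags it uses are the `--query` and `--max-results` options shown in the examples above.

```python
import shlex


def ddg_text_command(query: str, max_results: int) -> list[str]:
    """Build the argv list for the DuckDuckGo text search shown above."""
    return [
        "oarc-crawlers", "ddg", "text",
        "--query", query,
        "--max-results", str(max_results),
    ]


cmd = ddg_text_command("python async programming", 5)
# shlex.join quotes arguments so the printed line is safe to paste into a shell
printable = shlex.join(cmd)
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it, assuming the package is installed and `oarc-crawlers` is on the PATH. Using a list rather than a single string avoids shell-quoting problems with queries that contain spaces.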

API Usage

OARC-Crawlers provides a simple, intuitive Python API that allows you to integrate any of the crawlers directly into your Python applications or workflows. Each crawler is designed with async support for efficient execution.

Basic Import Pattern

from oarc_crawlers import (
    YTCrawler,
    GHCrawler,
    ArxivCrawler,
    DDGCrawler,
    WebCrawler,
    ParquetStorage,
)

Quick Examples

import asyncio
from oarc_crawlers import YTCrawler

async def download_video():
    # Initialize the crawler
    yt = YTCrawler(data_dir="./data")
    
    # Download a video
    result = await yt.download_video("https://youtube.com/watch?v=dQw4w9WgXcQ")
    
    # Print the result
    print(f"Video saved to: {result.get('file_path', 'N/A')}")
    print(f"Video title: {result.get('title', 'Unknown')}")

# Run the async function
asyncio.run(download_video())
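Because the crawler methods are coroutines, several requests can be scheduled concurrently instead of one after another. The sketch below shows the pattern with `asyncio.gather`. To stay runnable without network access it uses `fake_download`, a hypothetical stand-in for a crawler coroutine such as `download_video`, and the returned dictionary keys are assumptions.

```python
import asyncio


async def fake_download(url: str) -> dict:
    """Hypothetical stand-in for a crawler coroutine such as download_video."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"url": url, "title": f"stub for {url}"}


async def download_all(urls: list[str]) -> list[dict]:
    # gather schedules every coroutine at once and returns results in input order
    return await asyncio.gather(*(fake_download(u) for u in urls))


results = asyncio.run(
    download_all(["https://example.com/a", "https://example.com/b"])
)
```

With a real crawler, the body of `download_all` would call the crawler's own coroutine in place of `fake_download`.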

Running Example Scripts

The package includes a collection of example scripts that demonstrate how to use each component:

# Run specific module example
python examples/run_example.py youtube
python examples/run_example.py github
python examples/run_example.py ddg
python examples/run_example.py bs # Note: 'bs' refers to BeautifulSoup/WebCrawler examples
python examples/run_example.py arxiv
python examples/run_example.py parquet

# Run the combined example
python examples/run_example.py combined

# Run all examples
python examples/run_example.py all

For detailed examples and advanced usage patterns, check our comprehensive Examples documentation.

Example Categories

| Category | Examples |
|---|---|
| Basic Operations | Initializing Crawlers, Configuring Storage Paths, Error Handling |
| YouTube Operations | YouTube CLI Examples, Download a Video, Download a Playlist, Extract Captions, Search Videos, Fetch Chat Messages |
| GitHub Operations | GitHub CLI Examples, Clone a Repository, Analyze Code, Search Repositories |
| Search Operations | DuckDuckGo CLI Examples, Text Search, News and Image Search |
| ArXiv Operations | ArXiv CLI Examples, Downloading Papers, Extracting LaTeX Sources, Extracting Keywords and References, Working with Categories, Generate Citation Network |
| Web Crawling | Web Crawler CLI Examples, Crawling Websites, Extracting Specific Content |
| Data Management | Data Management CLI Examples, Working with Parquet Files, Converting Between Formats, Working with the Parquet Storage System |

Development

For development purposes, oarc-crawlers can be installed by cloning the repository:

# Clone the repository
git clone https://github.com/Ollama-Agent-Roll-Cage/oarc-crawlers.git
cd oarc-crawlers

# Install UV package manager
pip install uv

# Create a virtual environment with UV (use Python 3.10 or 3.11)
uv venv --python 3.11

# Install the package and dependencies in one step
# (the quotes keep shells such as zsh from expanding the brackets)
uv run pip install -e ".[dev]"

Package Structure

oarc-crawlers/
├── .github/                     # GitHub workflows and config
├── docs/                        # Core documentation
├── examples/                    # Example usage scripts
├── src/
│   ├── oarc_crawlers/           # Source code to package
│   └── tests/                   # Unit tests
├── README.md                    # Project overview
└── LICENSE                      # Apache 2.0

See the Project Structure for a detailed module breakdown.

Running OARC Tests

To run all tests:

uv run pytest

Or to run a specific test:

uv run pytest src/tests/test_parquet_storage.py

Architecture

The oarc-crawlers package is designed with a modular architecture, allowing for easy extension and maintenance. Each crawler (YTCrawler, GHCrawler, ArxivCrawler, DDGCrawler, WebCrawler) operates independently but shares a common interface for data storage via the ParquetStorage utility. The system leverages asynchronous programming (asyncio) for efficient I/O operations, especially crucial for network-bound tasks like downloading videos or cloning repositories. A unified Command Line Interface (CLI) built with click provides a consistent user experience across all modules.
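The shared-storage design described above can be sketched in a few lines. This is an illustration of the pattern, not the package's actual code: `Storage`, `JsonLinesStorage`, `StubCrawler`, and `build_router` are hypothetical names, and JSON Lines stands in for Parquet so that the example needs only the standard library. The module names (`yt`, `gh`, `arxiv`, `ddg`, `web`) are the CLI subcommands used earlier in this README.

```python
import json
import tempfile
from pathlib import Path
from typing import Protocol


class Storage(Protocol):
    """Hypothetical storage interface shared by every crawler."""

    def save(self, records: list[dict], path: Path) -> Path: ...


class JsonLinesStorage:
    """Stand-in for ParquetStorage; writes JSON Lines to stay stdlib-only."""

    def save(self, records: list[dict], path: Path) -> Path:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        return path


class StubCrawler:
    """Hypothetical crawler: produces records, then delegates persistence."""

    def __init__(self, name: str, storage: Storage, data_dir: Path):
        self.name = name
        self.storage = storage
        self.data_dir = data_dir

    def crawl(self, query: str) -> Path:
        records = [{"source": self.name, "query": query}]  # placeholder for real fetching
        return self.storage.save(records, self.data_dir / f"{self.name}.jsonl")


def build_router(storage: Storage, data_dir: Path) -> dict[str, StubCrawler]:
    # Module router: maps a module name to its crawler instance
    names = ("yt", "gh", "arxiv", "ddg", "web")
    return {name: StubCrawler(name, storage, data_dir) for name in names}


with tempfile.TemporaryDirectory() as tmp:
    router = build_router(JsonLinesStorage(), Path(tmp))
    out_path = router["ddg"].crawl("python async programming")
    saved = out_path.read_text(encoding="utf-8")
```

Because every crawler receives the storage object rather than writing files itself, swapping the storage backend does not require touching the crawlers.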

graph TD
    subgraph UserInteraction
        CLI[CLI Click]
        API[Python API]
    end

    subgraph CoreModules
        Router{Module Router}
        YT[YTCrawler]
        GH[GHCrawler]
        AX[ArxivCrawler]
        DDG[DDGCrawler]
        WEB[WebCrawler]
    end

    subgraph DataHandling
        PS[ParquetStorage]
        FS[(File System)]
    end

    subgraph ExternalSources
        SourceYT[YouTube API]
        SourceGH[GitHub API]
        SourceAX[ArXiv API]
        SourceDDG[DuckDuckGo]
        SourceWeb[Web Pages]
    end

    CLI --> Router
    API --> Router
    Router --> YT
    Router --> GH
    Router --> AX
    Router --> DDG
    Router --> WEB

    YT --> SourceYT
    GH --> SourceGH
    AX --> SourceAX
    DDG --> SourceDDG
    WEB --> SourceWeb

    SourceYT -- Data --> YT
    SourceGH -- Data --> GH
    SourceAX -- Data --> AX
    SourceDDG -- Data --> DDG
    SourceWeb -- Data --> WEB

    YT -- Data --> PS
    GH -- Data --> PS
    AX -- Data --> PS
    DDG -- Data --> PS
    WEB -- Data --> PS

    PS --> FS

FAQ

Q: What is OARC-Crawlers?
A: OARC-Crawlers is a modular Python framework for acquiring, processing, and storing data from sources like YouTube, GitHub, ArXiv, DuckDuckGo, and general websites. It provides both CLI and Python API interfaces, with unified Parquet storage for all outputs.

Q: Which Python versions are supported?
A: Python 3.10 and 3.11 are supported. Python 3.12+ is not yet fully supported due to dependency compatibility.

Q: How do I install OARC-Crawlers?
A: You can install via pip (pip install oarc-crawlers) or from source. See the "Setup" section above for details.

Q: What is the recommended way to manage environments?
A: Use the uv package manager to create a virtual environment with Python 3.10 or 3.11 for best compatibility.

Q: How do I run a crawler from the command line?
A: Use the CLI, for example:

oarc-crawlers yt download --url "https://youtube.com/watch?v=..."
oarc-crawlers gh analyze --url "https://github.com/..."

See the CLI Usage section for more examples.

Q: How do I use the Python API?
A: Import the relevant crawler and use async methods. Example:

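import asyncio
from oarc_crawlers import YTCrawler

async def main():
    # await is only valid inside a coroutine, so wrap the call
    yt = YTCrawler(data_dir="./data")
    return await yt.download_video("https://youtube.com/watch?v=...")

result = asyncio.run(main())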

Q: What data format is used for storage?
A: All crawlers use Apache Parquet for structured, efficient, and analysis-ready data storage.

Q: Can I use OARC-Crawlers in agentic or AI workflows?
A: Yes. The framework is designed for integration with agentic systems and supports asynchronous operation for non-blocking workflows.

Q: How does error handling work?
A: Errors are logged and returned as structured output (e.g., dictionaries with error messages) rather than raising exceptions, allowing workflows to continue when possible.
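A caller can therefore branch on the returned payload instead of wrapping each call in try/except. The sketch below illustrates that pattern under stated assumptions: `fake_crawl` is a hypothetical stand-in for a crawler method, and the `"error"` key name is assumed for illustration, so check the actual return structure of the crawler you use.

```python
import asyncio


async def fake_crawl(url: str) -> dict:
    """Hypothetical crawler call that reports failure in its return value."""
    if not url.startswith("https://"):
        return {"error": f"unsupported URL: {url}"}  # assumed key name
    return {"url": url, "title": "stub"}


async def crawl_many(urls):
    succeeded, failed = [], []
    for url in urls:
        result = await fake_crawl(url)
        # Sort results by payload shape; one failure does not stop the loop
        (failed if "error" in result else succeeded).append(result)
    return succeeded, failed


ok, errors = asyncio.run(
    crawl_many(["https://example.com", "ftp://example.com"])
)
```

Collecting failures separately lets a workflow finish the remaining items and report the errors at the end.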

Q: Where can I find more examples?
A: See the Examples documentation for detailed usage patterns.

Q: How do I report bugs or request features?
A: Please open an issue on the project's GitHub Issues page.

Q: Where can I get help?
A: Refer to the Troubleshooting Guide or contact the maintainers via email or GitHub Issues.

Troubleshooting

For common issues and their solutions, please refer to our Troubleshooting Guide.

MCP & VS Code Integration

OARC-Crawlers provides a Model Context Protocol (MCP) server for agentic workflows and VS Code Copilot Chat integration.

  • See docs/VSCodeMCP.md for setup and troubleshooting.
  • To run the MCP server:
    oarc-crawlers mcp run
  • To install for VS Code:
    oarc-crawlers mcp install --name "OARC Crawlers"

License

This project is licensed under the Apache 2.0 License.

Citations

Please use the following BibTeX entry to cite this project:

@software{oarc_crawlers,
  author = {OARC Team},
  title = {OARC-Crawlers: OARC's dynamic webcrawler module collection},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Ollama-Agent-Roll-Cage/oarc-crawlers}}
}

Contact

For questions or support, please contact us at:
