# Scrapfly Crawler API Examples

This directory contains examples demonstrating the Scrapfly Crawler API integration.

## Setup

### Get Your API Key

Get your API key from [https://scrapfly.io/dashboard](https://scrapfly.io/dashboard)

### Configure Your API Key

You have **two options** to provide your API key:

#### Option A: Environment Variable (Recommended)

Export the API key in your terminal:

```bash
export SCRAPFLY_API_KEY='scp-live-your-key-here'
```

Then run any example:

```bash
python3 sync_crawl.py
```

#### Option B: .env File

1. Copy the example .env file:

```bash
cp .env.example .env
```

2. Edit `.env` and replace the placeholder with your actual API key:

```
SCRAPFLY_API_KEY=scp-live-your-actual-key-here
```

3. Run any example (the .env file will be loaded automatically):

```bash
python3 sync_crawl.py
```

> **Note:** Install `python-dotenv` for automatic .env file loading: `pip install python-dotenv`
>
> If you don't install it, the examples will still work with environment variables exported in your shell.
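
The examples handle this loading themselves. If you want the same behaviour in your own script, a minimal sketch looks like this (it assumes `python-dotenv` is installed; the exact loader used by the examples may differ):

```python
import os

try:
    from dotenv import load_dotenv  # optional dependency
    load_dotenv()  # reads .env from the current directory into os.environ
except ImportError:
    pass  # fall back to variables already exported in the shell

api_key = os.environ.get("SCRAPFLY_API_KEY")
if not api_key:
    raise SystemExit("SCRAPFLY_API_KEY is not set (shell export or .env)")
```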
## Quick Start

The easiest way to use the Crawler API is with the high-level `Crawl` object (see [quickstart.py](quickstart.py)):

```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key='your-key')

# Method chaining for concise usage
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=5
    )
).crawl().wait()

# Get results
pages = crawl.warc().get_pages()
for page in pages:
    print(f"{page['url']} ({page['status_code']})")
```

## Examples

- **[quickstart.py](quickstart.py)** - Simplest example using the high-level `Crawl` API with method chaining
- **[sync_crawl.py](sync_crawl.py)** - Low-level API example showing the start, poll, and download workflow
- **[demo_markdown.py](demo_markdown.py)** - Build LLM.txt files from crawled documentation with batch content retrieval
- **[webhook_example.py](webhook_example.py)** - Handle Crawler API webhooks for real-time event notifications

## Crawl Object Features

The `Crawl` object provides a stateful, high-level interface:

### Methods

- **`crawl()`** - Start the crawler job
- **`wait(poll_interval=5, max_wait=None, verbose=False)`** - Wait for completion
- **`status(refresh=True)`** - Get current status
- **`warc(artifact_type='warc')`** - Download WARC artifact
- **`har()`** - Download HAR (HTTP Archive) artifact with timing data
- **`read(url, format='html')`** - Get content for specific URL
- **`read_batch(urls, formats=['html'])`** - Get content for multiple URLs efficiently (up to 100 per request)
- **`read_iter(pattern, format='html')`** - Iterate through URLs matching a wildcard pattern (both readers are sketched below this list)
- **`stats()`** - Get comprehensive statistics
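
As a rough illustration of the two readers on a finished crawl (a sketch only: the URLs are placeholders and the exact return shapes are assumptions; see [demo_markdown.py](demo_markdown.py) for confirmed batch usage):

```python
# Sketch: batch retrieval after crawl.crawl().wait() has finished.
# Assumption: read_batch returns the requested content keyed/ordered by URL.
batch = crawl.read_batch(
    ['https://web-scraping.dev/products', 'https://web-scraping.dev/reviews'],
    formats=['html', 'markdown'],
)

# Assumption: read_iter yields content for each crawled URL matching the wildcard.
for item in crawl.read_iter('https://web-scraping.dev/products/*', format='markdown'):
    print(item)
```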

### Properties

- **`uuid`** - Crawler job UUID
- **`started`** - Whether crawler has been started
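
For instance, a job reference can be logged right after starting (a minimal sketch; it assumes `uuid` is populated once `crawl()` has been called):

```python
crawl = Crawl(client, config)
crawl.crawl()
# Assumption: uuid is available as soon as the job has been started
print(f"Crawler job {crawl.uuid} started: {crawl.started}")
```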

### Usage Patterns

#### 1. Method Chaining (Most Concise)

```python
crawl = Crawl(client, config).crawl().wait()
pages = crawl.warc().get_pages()
```

#### 2. Step-by-Step (More Control)

```python
crawl = Crawl(client, config)
crawl.crawl()
crawl.wait(verbose=True, max_wait=300)

# Check status
status = crawl.status()
print(f"Crawled {status.urls_crawled} URLs")

# Get results
artifact = crawl.warc()
pages = artifact.get_pages()
```

#### 3. Read Specific URLs

```python
# Get content for a specific URL
html = crawl.read('https://example.com/page1')
if html:
    print(html.decode('utf-8'))
```

#### 4. Statistics

```python
stats = crawl.stats()
print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs crawled: {stats['urls_crawled']}")
print(f"Crawl rate: {stats['crawl_rate']:.1f}%")
print(f"Total size: {stats['total_size_kb']:.2f} KB")
```

## Configuration Options

The `CrawlerConfig` class supports all crawler parameters:

```python
config = CrawlerConfig(
    url='https://example.com',
    page_limit=100,
    max_depth=3,
    exclude_paths=['/admin/*', '/api/*'],
    include_paths=['/products/*'],
    content_formats=['html', 'markdown'],
    # ... and many more options
)
```

See the `CrawlerConfig` class documentation for all available parameters.

## Artifact Formats

### WARC Format

The crawler returns results in WARC (Web ARChive) format by default, which is automatically parsed:

```python
artifact = crawl.warc()

# Easy way: Get all pages as dictionaries
pages = artifact.get_pages()
for page in pages:
    url = page['url']
    status_code = page['status_code']
    headers = page['headers']
    content = page['content']  # bytes

# Memory-efficient: Iterate one record at a time
for record in artifact.iter_responses():
    print(f"{record.url}: {len(record.content)} bytes")

# Save to file
artifact.save('results.warc.gz')
```

### HAR Format

HAR (HTTP Archive) format includes detailed timing information for performance analysis:

```python
artifact = crawl.har()

# Access timing data
for entry in artifact.iter_responses():
    print(f"{entry.url}")
    print(f"  Status: {entry.status_code}")
    print(f"  Total time: {entry.time}ms")
    print(f"  Content type: {entry.content_type}")

    # Detailed timing breakdown
    timings = entry.timings
    print(f"  DNS: {timings.get('dns', 0)}ms")
    print(f"  Connect: {timings.get('connect', 0)}ms")
    print(f"  Wait: {timings.get('wait', 0)}ms")
    print(f"  Receive: {timings.get('receive', 0)}ms")

# Same easy interface as WARC
pages = artifact.get_pages()
```

## Error Handling

```python
from scrapfly import Crawl, CrawlerConfig

try:
    crawl = Crawl(client, config)
    crawl.crawl().wait(max_wait=300)

    if crawl.status().is_complete:
        pages = crawl.warc().get_pages()
        print(f"Success! Got {len(pages)} pages")
    elif crawl.status().is_failed:
        print("Crawler failed")

except RuntimeError as e:
    print(f"Error: {e}")
```

## Troubleshooting

### "SCRAPFLY_API_KEY environment variable not set"

Make sure you've either:
1. Exported the environment variable: `export SCRAPFLY_API_KEY='your-key'`
2. Created a `.env` file with your API key
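
To verify that the key is actually visible to Python, a quick check using only the standard library (independent of the SDK):

```python
import os

# Should print your key; None means neither the shell export nor .env loading worked
print(os.environ.get("SCRAPFLY_API_KEY"))
```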

### "Invalid API key" error

Double-check that:
1. Your API key is correct and starts with `scp-live-`
2. You have an active Scrapfly subscription
3. You're using the correct API key from your dashboard

### Import errors for dotenv

The `python-dotenv` package is optional. If you see import warnings, you can either:
1. Install it: `pip install python-dotenv`
2. Ignore them - environment variables will still work

## Learn More

- [Scrapfly Crawler API Documentation](https://scrapfly.io/docs/crawler-api)
- [Python SDK Documentation](https://scrapfly.io/docs/sdk/python)