
Commit 20a3bed

initial work to support crawler api
1 parent 4a353f5 commit 20a3bed


43 files changed (+14824, -13 lines)

.env.example

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
# Scrapfly API Configuration
# Copy this file to .env and fill in your actual values

# Your Scrapfly API key
SCRAPFLY_KEY=scp-live-your-api-key-here

# Scrapfly API host (optional, defaults to production)
SCRAPFLY_API_HOST=https://api.scrapfly.io
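
A minimal sketch of how a script might consume these two variables; it assumes `ScrapflyClient` accepts a `host` keyword override, which is not confirmed by this commit and should be checked against the SDK signature:

```python
import os

from scrapfly import ScrapflyClient

# Read the values defined in .env / the shell environment.
key = os.environ["SCRAPFLY_KEY"]  # raises KeyError if not set
host = os.environ.get("SCRAPFLY_API_HOST", "https://api.scrapfly.io")

# The `host` keyword here is an assumption for illustration, not documented by this commit.
client = ScrapflyClient(key=key, host=host)
```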

.gitignore

Lines changed: 3 additions & 1 deletion

@@ -6,4 +6,6 @@ scrapfly_sdk.egg-info
 venv
 examples/scrapy/demo/images
 examples/scrapy/demo/*.csv
-!examples/scrapy/demo/images/.gitkeep
+!examples/scrapy/demo/images/.gitkeep
+/tests/crawler/*.gz
+.env

examples/crawler/.env.example

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
# Scrapfly API Configuration
# Get your API key from: https://scrapfly.io/dashboard

# Required: Your Scrapfly API key
SCRAPFLY_API_KEY=scp-live-your-key-here

# Usage:
# 1. Copy this file to .env
# 2. Replace 'scp-live-your-key-here' with your actual API key
# 3. The examples will automatically load your API key from the .env file

examples/crawler/README.md

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
# Scrapfly Crawler API Examples

This directory contains examples demonstrating the Scrapfly Crawler API integration.

## Setup

### Get Your API Key

Get your API key from [https://scrapfly.io/dashboard](https://scrapfly.io/dashboard).

### Configure Your API Key

You have **two options** to provide your API key:

#### Option A: Environment Variable (Recommended)

Export the API key in your terminal:

```bash
export SCRAPFLY_API_KEY='scp-live-your-key-here'
```

Then run any example:

```bash
python3 sync_crawl.py
```

#### Option B: .env File

1. Copy the example .env file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` and replace the placeholder with your actual API key:

   ```
   SCRAPFLY_API_KEY=scp-live-your-actual-key-here
   ```

3. Run any example (the .env file will be loaded automatically):

   ```bash
   python3 sync_crawl.py
   ```

> **Note:** Install `python-dotenv` for automatic .env file loading: `pip install python-dotenv`
>
> If you don't install it, the examples will still work with environment variables exported in your shell.
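
The loading behavior described in the note above can be sketched as follows; the helper name `get_api_key` is illustrative and not part of the SDK:

```python
import os

def get_api_key() -> str:
    """Return the Scrapfly API key from .env (if python-dotenv is installed) or the shell environment."""
    try:
        from dotenv import load_dotenv  # optional dependency
        load_dotenv()  # merges values from a local .env file into os.environ
    except ImportError:
        pass  # fall back to variables already exported in the shell
    key = os.environ.get("SCRAPFLY_API_KEY")
    if not key:
        raise RuntimeError("SCRAPFLY_API_KEY environment variable not set")
    return key
```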

## Quick Start

The easiest way to use the Crawler API is with the high-level `Crawl` object (see [quickstart.py](quickstart.py)):

```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key='your-key')

# Method chaining for concise usage
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=5
    )
).crawl().wait()

# Get results
pages = crawl.warc().get_pages()
for page in pages:
    print(f"{page['url']} ({page['status_code']})")
```

## Examples

- **[quickstart.py](quickstart.py)** - Simplest example using the high-level `Crawl` API with method chaining
- **[sync_crawl.py](sync_crawl.py)** - Low-level API example showing the start, poll, and download workflow
- **[demo_markdown.py](demo_markdown.py)** - Build LLM.txt files from crawled documentation with batch content retrieval
- **[webhook_example.py](webhook_example.py)** - Handle Crawler API webhooks for real-time event notifications (a hedged receiver sketch follows this list)
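
For the webhook entry above, a minimal receiver sketch is shown below. It assumes the Crawler API delivers events as JSON `POST` requests; the field names (`event`, `crawler_uuid`) are illustrative placeholders, not the documented payload schema, which is covered in webhook_example.py and the Crawler API docs:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlerWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON body sent by the webhook.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")

        # Field names below are assumptions for illustration only.
        print(f"event={payload.get('event')} crawler={payload.get('crawler_uuid')}")

        # Acknowledge receipt so the sender does not retry.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CrawlerWebhookHandler).serve_forever()
```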

## Crawl Object Features

The `Crawl` object provides a stateful, high-level interface:

### Methods

- **`crawl()`** - Start the crawler job
- **`wait(poll_interval=5, max_wait=None, verbose=False)`** - Wait for completion
- **`status(refresh=True)`** - Get the current status
- **`warc(artifact_type='warc')`** - Download the WARC artifact
- **`har()`** - Download the HAR (HTTP Archive) artifact with timing data
- **`read(url, format='html')`** - Get content for a specific URL
- **`read_batch(urls, formats=['html'])`** - Get content for multiple URLs efficiently (up to 100 per request; see the sketch after these lists)
- **`read_iter(pattern, format='html')`** - Iterate through URLs matching a wildcard pattern (see the sketch after these lists)
- **`stats()`** - Get comprehensive statistics

### Properties

- **`uuid`** - Crawler job UUID
- **`started`** - Whether the crawler has been started
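
The batch and iterator readers listed above are not shown elsewhere in this README; the sketch below illustrates the intent, but the exact return shapes are assumptions and should be verified against the SDK:

```python
# Assumption: read_batch returns one result per requested URL in the requested
# formats; treat the result structure below as illustrative, not documented.
results = crawl.read_batch(
    ['https://web-scraping.dev/products', 'https://web-scraping.dev/reviews'],
    formats=['html', 'markdown'],
)
for result in results:
    print(result)

# Assumption: read_iter yields content for each crawled URL matching the
# wildcard pattern, in the requested format.
for content in crawl.read_iter('https://web-scraping.dev/products/*', format='markdown'):
    print(len(content), 'bytes of markdown')
```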

### Usage Patterns

#### 1. Method Chaining (Most Concise)

```python
crawl = Crawl(client, config).crawl().wait()
pages = crawl.warc().get_pages()
```

#### 2. Step-by-Step (More Control)

```python
crawl = Crawl(client, config)
crawl.crawl()
crawl.wait(verbose=True, max_wait=300)

# Check status
status = crawl.status()
print(f"Crawled {status.urls_crawled} URLs")

# Get results
artifact = crawl.warc()
pages = artifact.get_pages()
```

#### 3. Read Specific URLs

```python
# Get content for a specific URL
html = crawl.read('https://example.com/page1')
if html:
    print(html.decode('utf-8'))
```

#### 4. Statistics

```python
stats = crawl.stats()
print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs crawled: {stats['urls_crawled']}")
print(f"Crawl rate: {stats['crawl_rate']:.1f}%")
print(f"Total size: {stats['total_size_kb']:.2f} KB")
```

## Configuration Options

The `CrawlerConfig` class supports all crawler parameters:

```python
config = CrawlerConfig(
    url='https://example.com',
    page_limit=100,
    max_depth=3,
    exclude_paths=['/admin/*', '/api/*'],
    include_paths=['/products/*'],
    content_formats=['html', 'markdown'],
    # ... and many more options
)
```

See the `CrawlerConfig` class documentation for all available parameters.

## Artifact Formats

### WARC Format

The crawler returns results in WARC (Web ARChive) format by default, which is automatically parsed:

```python
artifact = crawl.warc()

# Easy way: Get all pages as dictionaries
pages = artifact.get_pages()
for page in pages:
    url = page['url']
    status_code = page['status_code']
    headers = page['headers']
    content = page['content']  # bytes

# Memory-efficient: Iterate one record at a time
for record in artifact.iter_responses():
    print(f"{record.url}: {len(record.content)} bytes")

# Save to file
artifact.save('results.warc.gz')
```

### HAR Format

HAR (HTTP Archive) format includes detailed timing information for performance analysis:

```python
artifact = crawl.har()

# Access timing data
for entry in artifact.iter_responses():
    print(f"{entry.url}")
    print(f"  Status: {entry.status_code}")
    print(f"  Total time: {entry.time}ms")
    print(f"  Content type: {entry.content_type}")

    # Detailed timing breakdown
    timings = entry.timings
    print(f"  DNS: {timings.get('dns', 0)}ms")
    print(f"  Connect: {timings.get('connect', 0)}ms")
    print(f"  Wait: {timings.get('wait', 0)}ms")
    print(f"  Receive: {timings.get('receive', 0)}ms")

# Same easy interface as WARC
pages = artifact.get_pages()
```

## Error Handling

```python
from scrapfly import Crawl, CrawlerConfig

try:
    crawl = Crawl(client, config)
    crawl.crawl().wait(max_wait=300)

    if crawl.status().is_complete:
        pages = crawl.warc().get_pages()
        print(f"Success! Got {len(pages)} pages")
    elif crawl.status().is_failed:
        print("Crawler failed")

except RuntimeError as e:
    print(f"Error: {e}")
```

## Troubleshooting

### "SCRAPFLY_API_KEY environment variable not set"

Make sure you've either:
1. Exported the environment variable: `export SCRAPFLY_API_KEY='your-key'`
2. Created a `.env` file with your API key

### "Invalid API key" error

Double-check that:
1. Your API key is correct and starts with `scp-live-`
2. You have an active Scrapfly subscription
3. You're using the correct API key from your dashboard

### Import errors for dotenv

The `python-dotenv` package is optional. If you see import warnings, you can either:
1. Install it: `pip install python-dotenv`
2. Ignore them; environment variables exported in your shell will still work

## Learn More

- [Scrapfly Crawler API Documentation](https://scrapfly.io/docs/crawler-api)
- [Python SDK Documentation](https://scrapfly.io/docs/sdk/python)
