
Commit aeddfb7

Merge pull request #13 from ScrapingBee/docs/advanced-usage-examples
Add advanced usage examples documentation
2 parents 10b58fc + 67fdf95 commit aeddfb7

2 files changed

Lines changed: 385 additions & 0 deletions


README.md

Lines changed: 1 addition & 0 deletions

@@ -90,6 +90,7 @@ scrapingbee schedule --list
 
 ## More information
 
+- **[Advanced usage examples](docs/advanced-usage.md)** – Shell piping, command chaining, batch workflows, monitoring scripts, NDJSON streaming, screenshots, Google search patterns, LLM chunking, and more.
 - **[ScrapingBee API documentation](https://www.scrapingbee.com/documentation/)** – Parameters, response formats, credit costs, and best practices.
 - **Claude / AI agents:** This repo includes a [Claude Skill](https://github.com/ScrapingBee/scrapingbee-cli/tree/main/skills/scrapingbee-cli) and [Claude Plugin](.claude-plugin/) for agent use with file-based output and security rules.

docs/advanced-usage.md

Lines changed: 384 additions & 0 deletions
@@ -0,0 +1,384 @@

# Advanced Usage Examples

Real-world workflows combining `scrapingbee` with standard shell tools. For basic usage and installation, see [README.md](../README.md).

---

## Piping and Text Processing

### Extract titles and filter by keyword

```bash
scrapingbee scrape "https://news.ycombinator.com/" \
  --extract-rules '{"titles":{"selector":".titleline > a","type":"list"}}' \
  --extract-field titles \
  | grep -i "python\|rust\|linux\|ai"
```

### Count links on a page

```bash
scrapingbee scrape "https://example.com" --preset extract-links \
  --extract-field links | wc -l
```

### Extract domains from Google results

```bash
scrapingbee google "web scraping tools" \
  --extract-field "organic_results.url" \
  | awk -F/ '{print $3}' | sort -u
```

### Format extraction output as markdown list

```bash
scrapingbee scrape "https://news.ycombinator.com/" \
  --extract-rules '{"titles":{"selector":".titleline > a","type":"list"}}' \
  --extract-field titles \
  | sed 's/^/- /'
```

---

## Chaining Commands

### Google search → scrape each result as markdown

```bash
scrapingbee google "scrapingbee tutorial" \
  --extract-field "organic_results.url" \
  | head -3 \
  | scrapingbee scrape --input-file - \
      --render-js false --return-page-markdown true \
      --output-dir tutorial_pages/
```

### Google search → scrape → extract with AI

```bash
scrapingbee google "best restaurants paris" \
  --extract-field "organic_results.url" \
  | head -5 \
  | scrapingbee scrape --input-file - \
      --render-js false \
      --ai-query "list of restaurant names and addresses" \
      --output-format ndjson
```

---

## Batch Processing

### Scrape URLs from a file

```bash
scrapingbee scrape --input-file urls.txt \
  --render-js false \
  --return-page-text true \
  --output-dir scraped_text/ \
  --concurrency 10
```

### Process CSV input, augment in place

```bash
# Input CSV has columns: url,company_name
# --update-csv adds extracted data as new columns
scrapingbee scrape --input-file companies.csv \
  --input-column url \
  --render-js false \
  --extract-rules '{"title":"h1","description":"meta[name=description]@content"}' \
  --update-csv
```

### Random sample from a large URL list

```bash
scrapingbee scrape --input-file 10k_urls.txt \
  --render-js false --sample 50 \
  --output-format ndjson \
  | python3 -c "
import sys, json
for line in sys.stdin:
    d = json.loads(line)
    print(f\"{d['status_code']} {d['latency_ms']}ms {d['input']}\")
"
```

### Deduplicate and resume interrupted batches

```bash
cat urls_*.txt \
  | scrapingbee scrape --input-file - \
      --render-js false --deduplicate \
      --output-dir deduped_results/ \
      --resume
```

`--resume` skips items already saved in `--output-dir` from a previous run.
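
If a batch is interrupted, re-running the identical command picks up where it left off. A minimal sketch of that flow, assuming roughly one saved file per completed URL in the output directory:

```bash
# Rough progress check before resuming (assumes one saved file per completed item)
ls deduped_results/ | wc -l

# Re-run the same command; --resume only fetches URLs not yet saved in --output-dir
cat urls_*.txt \
  | scrapingbee scrape --input-file - \
      --render-js false --deduplicate \
      --output-dir deduped_results/ \
      --resume
```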

---

## NDJSON Streaming Pipelines

### Stream results and process each line

```bash
scrapingbee scrape --input-file urls.txt \
  --render-js false --output-format ndjson \
  | while IFS= read -r line; do
      url=$(echo "$line" | python3 -c "import sys,json; print(json.loads(sys.stdin.read())['input'])")
      status=$(echo "$line" | python3 -c "import sys,json; print(json.loads(sys.stdin.read())['status_code'])")
      echo "$status $url"
    done
```

### Transform each result with --post-process

```bash
# Extract just the page title from each scraped page
scrapingbee scrape --input-file urls.txt \
  --render-js false --output-format ndjson \
  --post-process 'python3 -c "
import sys, re
html = sys.stdin.read()
m = re.search(r\"<title>(.*?)</title>\", html, re.IGNORECASE)
print(m.group(1) if m else \"(no title)\")
"'
```

---

## Monitoring and Diffing

### Detect content changes on a page

```bash
#!/bin/bash
URL="https://news.ycombinator.com/"
SNAP="/tmp/hn_snapshot.txt"
RULES='{"titles":{"selector":".titleline > a","type":"list"}}'

new=$(scrapingbee scrape "$URL" --extract-rules "$RULES" --extract-field titles 2>/dev/null)

if [ -f "$SNAP" ]; then
  diff <(cat "$SNAP") <(echo "$new") && echo "No changes" || echo "Page updated!"
fi
echo "$new" > "$SNAP"
```

### Monitor price changes

```bash
#!/bin/bash
URL="https://example.com/product"
PREV="/tmp/price_prev.txt"

price=$(scrapingbee scrape "$URL" \
  --ai-query "current price of the product" \
  --render-js false 2>/dev/null)

if [ -f "$PREV" ] && [ "$(cat "$PREV")" != "$price" ]; then
  echo "Price changed: $(cat "$PREV") → $price"
fi
echo "$price" > "$PREV"
```

---

## Screenshots

### Screenshot to file via redirect

```bash
scrapingbee scrape "https://example.com" \
  --screenshot true 2>/dev/null > page.png
```

### Full-page screenshot

```bash
scrapingbee scrape "https://example.com" \
  --screenshot true --screenshot-full-page true \
  --output-file fullpage.png
```

### Screenshot + HTML in one request

```bash
scrapingbee scrape "https://example.com" \
  --preset screenshot-and-html \
  --output-file result.json

# Extract the screenshot:
python3 -c "
import json, base64
with open('result.json') as f:
    data = json.load(f)
with open('screenshot.png', 'wb') as f:
    f.write(base64.b64decode(data['screenshot']))
print(f\"HTML length: {len(data['body'])} chars\")
"
```

---

## Scripting Patterns

### Credit budget check before batch

```bash
#!/bin/bash
set -e

credits=$(scrapingbee usage 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['max_api_credit']-d['used_api_credit'])")
url_count=$(wc -l < urls.txt)
cost=$((url_count * 1))  # 1 credit each with render_js=false

echo "Need ~$cost credits, have $credits"
if [ "$cost" -gt "$credits" ]; then
  echo "Not enough credits!" >&2
  exit 1
fi

scrapingbee scrape --input-file urls.txt --render-js false --output-dir results/
```

### Custom retry logic

```bash
#!/bin/bash
URL="https://example.com"
MAX_RETRIES=3

for i in $(seq 1 $MAX_RETRIES); do
  if result=$(scrapingbee scrape "$URL" --render-js false 2>/dev/null); then
    echo "$result"
    exit 0
  fi
  echo "Attempt $i failed, retrying..." >&2
  sleep $((i * 2))
done
echo "All $MAX_RETRIES attempts failed" >&2
exit 1
```

> **Note:** The CLI has built-in `--retries` and `--backoff` flags, so manual retry is only needed for custom logic.
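
For comparison, a minimal sketch using the built-in flags instead. The values `3` and `2` are illustrative rather than documented defaults; check the CLI help for the exact argument format:

```bash
# Built-in retry handling; the flag values below are illustrative, not documented defaults
scrapingbee scrape "https://example.com" \
  --render-js false \
  --retries 3 --backoff 2
```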

---

## Google Search Workflows

### Compare search results across countries

```bash
for cc in us gb de fr; do
  echo "=== $cc ==="
  scrapingbee google "best vpn" --country-code $cc \
    --extract-field "organic_results.title" 2>/dev/null | head -3
done
```

### Search and summarize with AI

```bash
scrapingbee google "climate change" --search-type news \
  --extract-field "organic_results.url" 2>/dev/null \
  | head -3 \
  | scrapingbee scrape --input-file - --render-js false \
      --ai-query "one-sentence summary of the article" \
      --output-format ndjson 2>/dev/null \
  | python3 -c "
import sys, json
for line in sys.stdin:
    d = json.loads(line)
    print(f\"[{d['input'][:60]}]\")
    print(f\" {d['body']}\")
    print()
"
```

### Google AI mode

```bash
scrapingbee google "how does CRISPR work" --search-type ai-mode \
  --extract-field "ai_mode_answer.response_text"
```

---

## Export and Format Conversion

### Merge batch output to single NDJSON file

```bash
scrapingbee export --input-dir batch_results/ --format ndjson --output-file all.ndjson
```

### Flatten nested JSON to CSV

```bash
scrapingbee export --input-dir batch_results/ --format csv --flatten --output-file flat.csv
```

### Select specific CSV columns

```bash
scrapingbee export --input-dir batch_results/ --format csv --flatten \
  --columns "_url,title,description" --output-file selected.csv
```

---

## LLM / RAG Chunking

### Split page into chunks for embeddings

```bash
scrapingbee scrape "https://en.wikipedia.org/wiki/Web_scraping" \
  --return-page-markdown true \
  --chunk-size 1000 --chunk-overlap 100 \
  | python3 -c "
import sys, json
for line in sys.stdin:
    chunk = json.loads(line)
    print(f\"Chunk {chunk['chunk_index']}/{chunk['total_chunks']}: {len(chunk['content'])} chars\")
"
```

---

## Presets

```bash
# Fastest possible scrape (no JS rendering, 1 credit)
scrapingbee scrape URL --preset fetch

# Get all links on a page
scrapingbee scrape URL --preset extract-links --extract-field links

# Get all emails on a page
scrapingbee scrape URL --preset extract-emails --extract-field emails

# Screenshot only
scrapingbee scrape URL --preset screenshot --output-file page.png

# Screenshot + HTML in one request
scrapingbee scrape URL --preset screenshot-and-html --output-file both.json
```

---

## Tips

- **Use `--render-js false`** whenever possible — 1 credit vs 5, and faster. `--preset fetch` is a shorthand.
- **Use `--verbose`** to see credit costs — output goes to stderr so it won't break pipes.
- **Use `--extract-field`** for pipe-friendly output (one value per line) instead of parsing JSON manually.
- **Use `--output-format ndjson`** for streaming batch results to stdout.
- **Use `--deduplicate`** to avoid wasting credits on duplicate URLs in batch input.
- **Use `--sample N`** to test your pipeline on a small subset before running a full batch.
- **Use `--resume`** to restart interrupted batches without re-fetching completed items.
- **Use `--post-process`** to transform each batch result with a shell command.
- **Use the `SCRAPINGBEE_API_KEY` env var** in scripts and CI instead of `scrapingbee auth` (see the sketch below).
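
For example, a minimal CI-style sketch that combines several of the tips above. The secret name `CI_SCRAPINGBEE_KEY`, the `urls.txt` input, and the output paths are placeholders, not part of the CLI:

```bash
#!/bin/bash
set -e

# Assumption: the API key is injected by the CI secret store instead of `scrapingbee auth`
export SCRAPINGBEE_API_KEY="$CI_SCRAPINGBEE_KEY"

# Test the pipeline on a small sample first; --verbose cost info goes to stderr
scrapingbee scrape --input-file urls.txt \
  --render-js false --sample 5 --verbose \
  --output-format ndjson > sample_check.ndjson

# Full batch: deduplicated and resumable
scrapingbee scrape --input-file urls.txt \
  --render-js false --deduplicate --resume \
  --output-dir results/
```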
