Getting Started with CrawlX

This guide will help you get up and running with CrawlX quickly.

Installation

npm install crawlx
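
If you prefer another package manager, the equivalent commands should work too:

yarn add crawlx
pnpm add crawlx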

Quick Start

Basic Crawling

The simplest way to use CrawlX is with the quickCrawl function:

import { quickCrawl } from 'crawlx';

const result = await quickCrawl('https://example.com', {
  title: 'title',
  description: 'meta[name="description"]@content'
});

console.log(result.parsed);
// Output: { title: "Example Domain", description: "..." }
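
Assuming quickCrawl returns the same result shape as crawler.crawl below, the raw response is available alongside the parsed data:

console.log(result.response.statusCode); // e.g. 200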

Creating a Crawler Instance

For more control, create a crawler instance:

import { CrawlX } from 'crawlx';

const crawler = new CrawlX({
  concurrency: 5,
  timeout: 10000,
  userAgent: 'MyBot/1.0'
});

const result = await crawler.crawl('https://example.com');
console.log(result.response.statusCode); // 200

await crawler.destroy(); // Clean up resources
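
To crawl several URLs with one instance, CrawlX also exposes crawlMany (shown later with the spider factory); a minimal sketch, assuming it is available on the base crawler as well:

const results = await crawler.crawlMany([
  'https://example.com',
  'https://example.org'
]);

results.forEach((r) => console.log(r.response.statusCode));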

Data Extraction

CrawlX uses CSS selectors for data extraction, extended with @attribute access and | filters:

Simple Selectors

const parseRule = {
  title: 'title',                    // Text content
  links: '[a@href]',                 // Attribute values from all matches
  images: ['img@src'],               // Array of values
  price: '.price | trim | number'    // With filters
};

const result = await crawler.crawl('https://shop.example.com', {
  parse: parseRule
});
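
Filters are chained left to right with |: here trim strips surrounding whitespace and number converts the remaining text. For a hypothetical page whose .price element contains " 19.99 ", the extracted data would look roughly like:

console.log(result.parsed);
// {
//   title: 'Shop Example',                        // hypothetical values
//   links: ['https://shop.example.com/item-1'],
//   images: ['/images/item-1.png'],
//   price: 19.99                                  // " 19.99 " -> number
// }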

Nested Data Structures

const parseRule = {
  products: {
    _scope: '.product',              // Scope to product elements
    name: '.name',
    price: '.price | trim | number',
    image: 'img@src',
    details: {
      _scope: '.details',
      description: '.desc',
      specs: ['.spec']
    }
  }
};
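
The _scope key restricts its sibling rules to elements inside the matched selector, so name, price, and image are resolved relative to each .product element. For a single product card, the parsed output would look roughly like the sketch below (values are illustrative; pages with several matching cards may yield an array instead):

// {
//   products: {
//     name: 'Widget A',
//     price: 9.99,
//     image: '/img/widget-a.png',
//     details: {
//       description: 'A small blue widget',
//       specs: ['Blue', '200 g']
//     }
//   }
// }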

Custom Functions

const parseRule = {
  title: 'title',
  url: () => window.location.href,
  timestamp: () => new Date().toISOString(),
  productCount: ($) => $('.product').length
};
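
Custom functions can be mixed freely with selector rules. Judging from the productCount example, functions that declare a $ parameter receive the parsed document (a Cheerio-style API), while the url example relies on window and therefore assumes a browser-like environment. A small sketch under the Cheerio assumption:

const parseRule = {
  // Collect the trimmed text of every <h2> on the page
  headings: ($) => $('h2').map((_, el) => $(el).text().trim()).get(),
  // Count links that point to other hosts (assumes absolute hrefs)
  externalLinks: ($) =>
    $('a[href^="http"]').filter((_, el) => !$(el).attr('href').includes('example.com')).length
};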

Factory Functions

CrawlX provides factory functions for common use cases:

Lightweight Crawler

import { createLightweightCrawler } from 'crawlx';

const crawler = createLightweightCrawler({
  concurrency: 2,
  timeout: 5000
});

Scraper

import { createScraper } from 'crawlx';

const scraper = createScraper();
const result = await scraper.crawl('https://example.com', {
  parse: {
    title: 'title',
    content: '.content'
  }
});

Spider (Link Following)

import { createSpider } from 'crawlx';

const spider = createSpider({
  plugins: {
    follow: {
      maxDepth: 3,
      sameDomainOnly: true
    }
  }
});

const results = await spider.crawlMany(['https://example.com'], {
  parse: { title: 'title' },
  follow: '[a@href]'  // Follow all links
});
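
crawlMany returns an array of results, one per crawled page, including pages reached by following links up to maxDepth; a typical way to consume it:

for (const result of results) {
  console.log(result.response.url, '->', result.parsed.title);
}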

Configuration

Basic Configuration

const crawler = new CrawlX({
  mode: 'high-performance',
  concurrency: 10,
  timeout: 30000,
  maxRetries: 3,
  userAgent: 'MyBot/1.0',
  headers: {
    'Accept': 'text/html,application/xhtml+xml'
  }
});

Plugin Configuration

const crawler = new CrawlX({
  plugins: {
    delay: {
      enabled: true,
      defaultDelay: 1000,
      randomDelay: true
    },
    rateLimit: {
      enabled: true,
      globalLimit: { requests: 100, window: 60000 }
    },
    retry: {
      enabled: true,
      maxRetries: 3,
      exponentialBackoff: true
    }
  }
});

Environment Variables

CRAWLX_MODE=high-performance
CRAWLX_CONCURRENCY=10
CRAWLX_TIMEOUT=30000
CRAWLX_PLUGINS_DELAY_ENABLED=true
CRAWLX_PLUGINS_DELAY_DEFAULT_DELAY=1000

Configuration Presets

import { ConfigPresets } from 'crawlx';

// Development preset
const devCrawler = ConfigPresets.development();

// Production preset
const prodCrawler = ConfigPresets.production();

// Testing preset
const testCrawler = ConfigPresets.testing();

Event Handling

const crawler = new CrawlX();

crawler.on('task-start', (task) => {
  console.log(`Starting: ${task.url}`);
});

crawler.on('task-complete', (result) => {
  console.log(`Completed: ${result.response.url}`);
});

crawler.on('data-extracted', (data, url) => {
  console.log(`Data from ${url}:`, data);
});

crawler.on('task-error', (error, task) => {
  console.error(`Failed ${task.url}:`, error.message);
});

Error Handling

import { CrawlXError, NetworkError, TimeoutError } from 'crawlx';

try {
  const result = await crawler.crawl('https://example.com');
} catch (error) {
  if (error instanceof NetworkError) {
    console.log('Network error:', error.statusCode);
  } else if (error instanceof TimeoutError) {
    console.log('Timeout after:', error.timeout);
  } else if (error instanceof CrawlXError) {
    console.log('CrawlX error:', error.code, error.context);
  }
}

Custom Plugins

class CustomPlugin {
  name = 'custom';
  version = '1.0.0';
  priority = 100;

  async onTaskComplete(result) {
    // Add custom processing
    result.customData = {
      processedAt: new Date().toISOString(),
      urlLength: result.response.url.length
    };
    return result;
  }
}

const crawler = new CrawlX();
crawler.addPlugin(new CustomPlugin());
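
Assuming the onTaskComplete hook runs for each finished task, the extra fields then show up on every result:

const result = await crawler.crawl('https://example.com');
console.log(result.customData.processedAt); // e.g. "2024-01-01T12:00:00.000Z"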

Best Practices

1. Resource Management

Always clean up resources:

const crawler = new CrawlX();
try {
  const result = await crawler.crawl('https://example.com');
  // Process result
} finally {
  await crawler.destroy();
}

2. Error Handling

Handle errors gracefully:

const crawler = new CrawlX({
  maxRetries: 3,
  timeout: 10000
});

crawler.on('task-error', (error, task) => {
  console.error(`Failed to crawl ${task.url}:`, error.message);
});

3. Rate Limiting

Be respectful to target websites:

const crawler = new CrawlX({
  plugins: {
    delay: {
      enabled: true,
      defaultDelay: 1000  // 1 second between requests
    },
    rateLimit: {
      enabled: true,
      perDomainLimit: { requests: 10, window: 60000 }
    }
  }
});

4. Memory Management

For large-scale crawling:

const crawler = new CrawlX({
  mode: 'high-performance',
  concurrency: 20,
  scheduler: {
    maxQueueSize: 1000,
    resourceLimits: {
      maxMemoryUsage: 1073741824  // 1GB
    }
  }
});

Next Steps