Performance Tuning Guide

Learn how to optimize CrawlX for maximum performance and efficiency.

Performance Modes

Lightweight Mode

Optimized for simple tasks and minimal resource usage:

const crawler = new CrawlX({
  mode: 'lightweight',
  concurrency: 2,
  timeout: 5000,
  plugins: {
    parse: { enabled: true },
    delay: { enabled: true, defaultDelay: 1000 }
  }
});

Best for:

Single page scraping
Small-scale operations
Development and testing
Resource-constrained environments

High-Performance Mode

Optimized for large-scale crawling operations:

const crawler = new CrawlX({
  mode: 'high-performance',
  concurrency: 20,
  timeout: 10000,
  plugins: {
    parse: { enabled: true },
    duplicateFilter: { enabled: true, maxCacheSize: 100000 },
    rateLimit: { enabled: true }
  }
});

Best for:

Large-scale crawling
Production environments
High-throughput requirements
Batch processing

Concurrency Optimization

Finding the Right Concurrency Level

// Start conservative
const crawler = new CrawlX({ concurrency: 5 });

// Monitor performance
crawler.on('task-complete', () => {
  const stats = crawler.getStats();
  console.log(`Queue size: ${stats.scheduler.queueSize}`);
  console.log(`Active tasks: ${stats.scheduler.activeTasks}`);
});

// Adjust based on results
if (stats.scheduler.queueSize > 1000) {
  crawler.updateConfig({ concurrency: 10 });
}

Concurrency Guidelines

Scenario	Recommended Concurrency
Single domain	2-5
Multiple domains	10-20
API endpoints	20-50
Local testing	1-2

Dynamic Concurrency Adjustment

class AdaptiveConcurrencyPlugin {
  name = 'adaptive-concurrency';
  version = '1.0.0';
  priority = 200;

  private errorRate = 0;
  private successCount = 0;
  private errorCount = 0;

  async onTaskComplete(result: TaskResult): Promise<TaskResult> {
    this.successCount++;
    this.adjustConcurrency();
    return result;
  }

  async onTaskError(error: Error, task: TaskOptions): Promise<void> {
    this.errorCount++;
    this.adjustConcurrency();
  }

  private adjustConcurrency(): void {
    const totalRequests = this.successCount + this.errorCount;
    if (totalRequests < 100) return; // Wait for enough data

    this.errorRate = this.errorCount / totalRequests;
    const currentConcurrency = crawler.getConfig().get('concurrency');

    if (this.errorRate > 0.1 && currentConcurrency > 1) {
      // Too many errors, reduce concurrency
      crawler.updateConfig({ concurrency: Math.max(1, currentConcurrency - 1) });
    } else if (this.errorRate < 0.05 && currentConcurrency < 20) {
      // Low error rate, increase concurrency
      crawler.updateConfig({ concurrency: currentConcurrency + 1 });
    }

    // Reset counters
    this.successCount = 0;
    this.errorCount = 0;
  }
}

Memory Optimization

Memory-Efficient Configuration

const crawler = new CrawlX({
  scheduler: {
    maxQueueSize: 1000, // Limit queue size
    resourceLimits: {
      maxMemoryUsage: 512 * 1024 * 1024, // 512MB
      maxCpuUsage: 80 // 80%
    }
  },
  plugins: {
    duplicateFilter: {
      enabled: true,
      maxCacheSize: 10000 // Limit cache size
    }
  }
});

Memory Monitoring

class MemoryMonitorPlugin {
  name = 'memory-monitor';
  version = '1.0.0';
  priority = 10;

  private checkInterval: NodeJS.Timeout;

  async onInit(): Promise<void> {
    this.checkInterval = setInterval(() => {
      const usage = process.memoryUsage();
      const usageMB = usage.heapUsed / 1024 / 1024;
      
      console.log(`Memory usage: ${usageMB.toFixed(2)} MB`);
      
      if (usageMB > 500) { // 500MB threshold
        console.warn('High memory usage detected');
        // Trigger garbage collection if available
        if (global.gc) {
          global.gc();
        }
      }
    }, 5000);
  }

  async onDestroy(): Promise<void> {
    if (this.checkInterval) {
      clearInterval(this.checkInterval);
    }
  }
}

Streaming Large Responses

class StreamingPlugin {
  name = 'streaming';
  version = '1.0.0';
  priority = 100;

  async onResponse(response: HttpResponse): Promise<HttpResponse> {
    // For large responses, process in chunks
    if (response.headers['content-length'] && 
        parseInt(response.headers['content-length']) > 10 * 1024 * 1024) {
      
      // Process response in streaming fashion
      response._streaming = true;
    }
    
    return response;
  }
}

Network Optimization

Connection Pooling

const crawler = new CrawlX({
  httpClient: {
    keepAlive: true,
    maxSockets: 50,
    maxFreeSockets: 10,
    timeout: 30000,
    freeSocketTimeout: 15000
  }
});

Request Optimization

const crawler = new CrawlX({
  headers: {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Cache-Control': 'no-cache'
  },
  timeout: 10000,
  maxRetries: 2
});

DNS Optimization

import { lookup } from 'dns';
import { promisify } from 'util';

class DNSCachePlugin {
  name = 'dns-cache';
  version = '1.0.0';
  priority = 200;

  private cache = new Map<string, string>();
  private dnsLookup = promisify(lookup);

  async onRequest(request: HttpRequest): Promise<HttpRequest> {
    const url = new URL(request.url);
    const hostname = url.hostname;

    if (!this.cache.has(hostname)) {
      try {
        const result = await this.dnsLookup(hostname);
        this.cache.set(hostname, result.address);
      } catch (error) {
        console.warn(`DNS lookup failed for ${hostname}:`, error);
      }
    }

    return request;
  }
}

Parser Optimization

Efficient Selectors

// Inefficient - searches entire document
const badRule = {
  title: 'title',
  content: 'p' // Too broad
};

// Efficient - targeted selectors
const goodRule = {
  title: 'head > title',
  content: 'article.main-content p'
};

Scoped Parsing

// Process large lists efficiently
const efficientRule = {
  products: {
    _scope: '.product-grid .product', // Limit scope first
    name: '.product-name',
    price: '.price',
    // Only parse what you need
  }
};

Custom Optimized Filters

import { Parser } from 'crawlx';

const parser = new Parser();

// Optimized number filter
parser.addFilter('fastNumber', (value) => {
  if (typeof value === 'number') return value;
  const num = parseFloat(value);
  return isNaN(num) ? 0 : num;
});

// Optimized text cleaning
parser.addFilter('fastTrim', (value) => {
  return typeof value === 'string' ? value.trim() : value;
});

Caching Strategies

Response Caching

class SmartCachePlugin {
  name = 'smart-cache';
  version = '1.0.0';
  priority = 150;

  private cache = new Map();
  private maxSize = 1000;
  private ttl = 300000; // 5 minutes

  async onRequest(request: HttpRequest): Promise<HttpRequest> {
    const cacheKey = this.getCacheKey(request);
    const cached = this.cache.get(cacheKey);

    if (cached && Date.now() - cached.timestamp < this.ttl) {
      request._cached = cached.response;
    }

    return request;
  }

  async onResponse(response: HttpResponse): Promise<HttpResponse> {
    if (!response._fromCache) {
      this.addToCache(response);
    }
    return response;
  }

  private getCacheKey(request: HttpRequest): string {
    return `${request.method || 'GET'}:${request.url}`;
  }

  private addToCache(response: HttpResponse): void {
    if (this.cache.size >= this.maxSize) {
      // LRU eviction
      const firstKey = this.cache.keys().next().value;
      this.cache.delete(firstKey);
    }

    this.cache.set(this.getCacheKey(response.request), {
      response,
      timestamp: Date.now()
    });
  }
}

Persistent Caching

import { createHash } from 'crypto';
import { promises as fs } from 'fs';
import { join } from 'path';

class PersistentCachePlugin {
  name = 'persistent-cache';
  version = '1.0.0';
  priority = 150;

  constructor(private cacheDir: string = './cache') {}

  async onInit(): Promise<void> {
    await fs.mkdir(this.cacheDir, { recursive: true });
  }

  async onRequest(request: HttpRequest): Promise<HttpRequest> {
    const cacheFile = this.getCacheFile(request);
    
    try {
      const cached = await fs.readFile(cacheFile, 'utf8');
      const data = JSON.parse(cached);
      
      if (Date.now() - data.timestamp < 3600000) { // 1 hour
        request._cached = data.response;
      }
    } catch {
      // Cache miss
    }

    return request;
  }

  async onResponse(response: HttpResponse): Promise<HttpResponse> {
    if (!response._fromCache) {
      const cacheFile = this.getCacheFile(response.request);
      const data = {
        response,
        timestamp: Date.now()
      };
      
      await fs.writeFile(cacheFile, JSON.stringify(data));
    }

    return response;
  }

  private getCacheFile(request: HttpRequest): string {
    const hash = createHash('md5')
      .update(`${request.method || 'GET'}:${request.url}`)
      .digest('hex');
    return join(this.cacheDir, `${hash}.json`);
  }
}

Monitoring and Profiling

Performance Metrics

class PerformancePlugin {
  name = 'performance';
  version = '1.0.0';
  priority = 5;

  private metrics = {
    totalRequests: 0,
    successfulRequests: 0,
    failedRequests: 0,
    totalTime: 0,
    averageTime: 0,
    minTime: Infinity,
    maxTime: 0
  };

  async onTaskStart(task: TaskOptions): Promise<void> {
    task._startTime = process.hrtime.bigint();
  }

  async onTaskComplete(result: TaskResult): Promise<TaskResult> {
    this.updateMetrics(result.task, true);
    return result;
  }

  async onTaskError(error: Error, task: TaskOptions): Promise<void> {
    this.updateMetrics(task, false);
  }

  private updateMetrics(task: TaskOptions, success: boolean): void {
    if (!task._startTime) return;

    const duration = Number(process.hrtime.bigint() - task._startTime) / 1000000; // ms
    
    this.metrics.totalRequests++;
    this.metrics.totalTime += duration;
    this.metrics.averageTime = this.metrics.totalTime / this.metrics.totalRequests;
    this.metrics.minTime = Math.min(this.metrics.minTime, duration);
    this.metrics.maxTime = Math.max(this.metrics.maxTime, duration);

    if (success) {
      this.metrics.successfulRequests++;
    } else {
      this.metrics.failedRequests++;
    }

    // Log metrics every 100 requests
    if (this.metrics.totalRequests % 100 === 0) {
      console.log('Performance Metrics:', this.metrics);
    }
  }
}

Best Practices Summary

1. Choose the Right Mode

Use lightweight mode for simple tasks
Use high-performance mode for large-scale operations

2. Optimize Concurrency

Start with conservative values
Monitor and adjust based on performance
Consider target server capacity

3. Manage Memory

Set appropriate limits
Monitor memory usage
Use streaming for large responses

4. Cache Strategically

Cache responses when appropriate
Implement TTL for cache entries
Consider persistent caching for long-running operations

5. Monitor Performance

Track key metrics
Set up alerts for performance degradation
Profile regularly to identify bottlenecks

6. Respect Target Servers

Implement appropriate delays
Use rate limiting
Follow robots.txt guidelines

By following these optimization strategies, you can achieve maximum performance while maintaining reliability and respecting target servers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Tuning Guide

Performance Modes

Lightweight Mode

High-Performance Mode

Concurrency Optimization

Finding the Right Concurrency Level

Concurrency Guidelines

Dynamic Concurrency Adjustment

Memory Optimization

Memory-Efficient Configuration

Memory Monitoring

Streaming Large Responses

Network Optimization

Connection Pooling

Request Optimization

DNS Optimization

Parser Optimization

Efficient Selectors

Scoped Parsing

Custom Optimized Filters

Caching Strategies

Response Caching

Persistent Caching

Monitoring and Profiling

Performance Metrics

Best Practices Summary

1. Choose the Right Mode

2. Optimize Concurrency

3. Manage Memory

4. Cache Strategically

5. Monitor Performance

6. Respect Target Servers

FilesExpand file tree

performance.md

Latest commit

History

performance.md

File metadata and controls

Performance Tuning Guide

Performance Modes

Lightweight Mode

High-Performance Mode

Concurrency Optimization

Finding the Right Concurrency Level

Concurrency Guidelines

Dynamic Concurrency Adjustment

Memory Optimization

Memory-Efficient Configuration

Memory Monitoring

Streaming Large Responses

Network Optimization

Connection Pooling

Request Optimization

DNS Optimization

Parser Optimization

Efficient Selectors

Scoped Parsing

Custom Optimized Filters

Caching Strategies

Response Caching

Persistent Caching

Monitoring and Profiling

Performance Metrics

Best Practices Summary

1. Choose the Right Mode

2. Optimize Concurrency

3. Manage Memory

4. Cache Strategically

5. Monitor Performance

6. Respect Target Servers