iyulab/WebFlux

WebFlux

A .NET SDK for preprocessing web content for RAG (Retrieval-Augmented Generation) systems.


Overview

WebFlux processes web content into chunks optimized for RAG systems. It handles web crawling, content extraction, and intelligent chunking with support for multiple content formats.

Installation

dotnet add package WebFlux

Quick Start

using WebFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register your AI service implementations
services.AddScoped<ITextEmbeddingService, YourEmbeddingService>();
services.AddScoped<ITextCompletionService, YourLLMService>(); // Optional

// Add WebFlux
services.AddWebFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IWebContentProcessor>();

// Process a single URL
var chunks = await processor.ProcessUrlAsync("https://example.com");
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
}

// Or stream a whole website
await foreach (var chunk in processor.ProcessWebsiteAsync("https://example.com"))
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
}

Features

  • Interface-Based Design: Bring your own AI services (OpenAI, Anthropic, Azure, local models)
  • Multiple Chunking Strategies: Auto, Smart, Semantic, Intelligent, MemoryOptimized, Paragraph, FixedSize, DomStructure
  • Content Formats: HTML, Markdown, JSON, XML, PDF
  • Web Standards: robots.txt, sitemap.xml, ai.txt, llms.txt, manifest.json
  • Streaming: Process large websites with AsyncEnumerable
  • Parallel Processing: Concurrent crawling and processing
  • Rich Metadata: Web document metadata extraction (SEO, Open Graph, Schema.org, Twitter Cards)
  • Progress Tracking: Real-time batch crawling progress with detailed statistics

Chunking Strategies

  • Auto: Automatically selects the best strategy based on content
  • Smart: Structured HTML documentation
  • Semantic: General web pages and articles
  • Intelligent: Blogs and knowledge bases
  • MemoryOptimized: Large documents with memory constraints
  • Paragraph: Markdown with natural boundaries
  • FixedSize: Uniform chunks for testing
  • DomStructure: HTML DOM structure-based chunking that preserves semantic boundaries
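A strategy is selected by name through ChunkingOptions (see the Configuration section). As a minimal sketch, assuming the property names shown in Configuration:

```csharp
// Pick a strategy by name; "Paragraph" suits Markdown with natural boundaries.
var chunkOptions = new ChunkingOptions
{
    Strategy = "Paragraph",
    MaxChunkSize = 512,
    OverlapSize = 64
};
```

The resulting options object is passed to the processor methods, e.g. ProcessWebsiteAsync(url, crawlOptions, chunkOptions).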

Core Interfaces

WebFlux uses the Interface Provider pattern. You provide AI service implementations, and WebFlux handles crawling, extraction, and chunking.

Required AI Services

ITextEmbeddingService (Required)

Vector embedding generation for semantic chunking:

public interface ITextEmbeddingService
{
    Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default);
    Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts, CancellationToken cancellationToken = default);
    int MaxTokens { get; }
    int EmbeddingDimension { get; }
}
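Any embedding provider can sit behind this interface. The class below is a hypothetical stand-in useful for wiring up and testing the pipeline: it derives deterministic pseudo-embeddings from a SHA-256 hash rather than calling a real model, so swap it for an OpenAI, Azure, or local-model implementation in production.

```csharp
using System.Security.Cryptography;
using System.Text;

// Hypothetical test double: deterministic pseudo-embeddings, no external calls.
public sealed class FakeEmbeddingService : ITextEmbeddingService
{
    public int MaxTokens => 8192;
    public int EmbeddingDimension => 32;

    public Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default)
    {
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(text));
        var vector = new float[EmbeddingDimension];
        for (var i = 0; i < EmbeddingDimension; i++)
            vector[i] = hash[i % hash.Length] / 255f; // components in [0, 1]
        return Task.FromResult(vector);
    }

    public async Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(
        IReadOnlyList<string> texts, CancellationToken cancellationToken = default)
    {
        var results = new List<float[]>(texts.Count);
        foreach (var text in texts)
            results.Add(await GetEmbeddingAsync(text, cancellationToken));
        return results;
    }
}
```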

Optional AI Services

ITextCompletionService (Optional)

LLM text completion for multimodal processing and content reconstruction:

public interface ITextCompletionService
{
    Task<string> CompleteAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    IAsyncEnumerable<string> CompleteStreamAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}
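Since this service is optional, a stub implementation is enough to satisfy the registration while you integrate a real LLM. A minimal sketch (the canned response text is, of course, placeholder):

```csharp
using System.Runtime.CompilerServices;

// Hypothetical stub: returns a canned completion instead of calling an LLM.
public sealed class StubCompletionService : ITextCompletionService
{
    public Task<string> CompleteAsync(string prompt, TextCompletionOptions? options = null,
        CancellationToken cancellationToken = default)
        => Task.FromResult($"[stub completion for {prompt.Length}-char prompt]");

    public async IAsyncEnumerable<string> CompleteStreamAsync(string prompt,
        TextCompletionOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Stream the stub response as a single token for simplicity.
        yield return await CompleteAsync(prompt, options, cancellationToken);
    }

    public Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default)
        => Task.FromResult(true);
}
```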

IImageToTextService (Optional)

Image-to-text conversion for multimodal content:

public interface IImageToTextService
{
    Task<string> ConvertImageToTextAsync(string imageUrl, ImageToTextOptions? options = null, CancellationToken cancellationToken = default);
    Task<string> ExtractTextFromImageAsync(string imageUrl, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}

Main Processor

IWebContentProcessor

The main entry point for all web content processing:

// Single URL processing
var chunks = await processor.ProcessUrlAsync("https://example.com");

// Website crawling (streaming)
await foreach (var chunk in processor.ProcessWebsiteAsync(url, crawlOptions, chunkOptions))
{
    // Process chunk
}

// Batch processing
var results = await processor.ProcessUrlsBatchAsync(urls, chunkOptions);

Focused Interfaces (ISP)

For consumers that only need extraction or chunking:

// Extraction only
var extractor = provider.GetRequiredService<IContentExtractService>();
var result = await extractor.ExtractContentAsync("https://example.com");

// Chunking only
var chunker = provider.GetRequiredService<IContentChunkService>();
var chunks = await chunker.ProcessUrlAsync("https://example.com");

Extensibility

IChunkingStrategy

Implement custom chunking strategies:

public interface IChunkingStrategy
{
    string Name { get; }
    string Description { get; }
    Task<IReadOnlyList<WebContentChunk>> ChunkAsync(ExtractedContent content, ChunkingOptions? options = null, CancellationToken cancellationToken = default);
}
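A custom strategy only needs to map ExtractedContent to a list of chunks. The sketch below splits on blank lines; it assumes ExtractedContent exposes a Text property and that WebContentChunk has settable Content and ChunkIndex members (ChunkIndex and Content appear in the Quick Start example, the rest is hypothetical):

```csharp
// Hypothetical sketch: chunk on blank-line paragraph boundaries.
public sealed class BlankLineChunkingStrategy : IChunkingStrategy
{
    public string Name => "BlankLine";
    public string Description => "Splits extracted text on blank lines.";

    public Task<IReadOnlyList<WebContentChunk>> ChunkAsync(
        ExtractedContent content,
        ChunkingOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // content.Text is an assumed property name.
        var paragraphs = content.Text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<WebContentChunk>(paragraphs.Length);
        for (var i = 0; i < paragraphs.Length; i++)
            chunks.Add(new WebContentChunk { ChunkIndex = i, Content = paragraphs[i].Trim() });
        return Task.FromResult<IReadOnlyList<WebContentChunk>>(chunks);
    }
}
```

Registering the strategy with the service collection would make it selectable by its Name, mirroring how the built-in strategies are chosen.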

IEventPublisher

Subscribe to pipeline events for monitoring, metrics collection, and observability. IEventPublisher is automatically registered as a singleton when you call AddWebFlux().

using WebFlux.Core.Interfaces;
using WebFlux.Core.Models.Events;

var publisher = provider.GetRequiredService<IEventPublisher>();

// Subscribe to specific event types
using var s1 = publisher.Subscribe<PageCrawledEvent>(async e =>
{
    Console.WriteLine($"Crawled {e.Url} [{e.StatusCode}] in {e.ProcessingTimeMs}ms");
    await metrics.RecordPageCrawl(e);
});

using var s2 = publisher.Subscribe<ChunkGeneratedEvent>(e =>
{
    Console.WriteLine($"Chunk #{e.SequenceNumber} ({e.ChunkSize} tokens) from {e.SourceUrl}");
});

using var s3 = publisher.Subscribe<ErrorOccurredEvent>(e =>
{
    Console.WriteLine($"[{e.ErrorCategory}] {e.ErrorCode}: {e.Message}");
});

// Or subscribe to ALL events
using var sAll = publisher.SubscribeAll(async e =>
{
    await logger.LogEventAsync(e.EventType, e);
});

Available event types (WebFlux.Core.Models.Events namespace):

  • Pipeline: ProcessingStartedEvent, ProcessingProgressEvent, ProcessingCompletedEvent, ProcessingFailedEvent
  • Crawling: CrawlingStartedEvent, CrawlingCompletedEvent, PageCrawledEvent, UrlProcessingStartedEvent, UrlProcessedEvent, UrlProcessingFailedEvent
  • Extraction: ContentExtractionStartedEvent, ContentExtractionCompletedEvent, ContentExtractionFailedEvent, ImageProcessedEvent
  • Chunking: ChunkingStartedEvent, ChunkingCompletedEvent, ChunkGeneratedEvent
  • Monitoring: ErrorOccurredEvent, PerformanceMetricsEvent

All events derive from ProcessingEvent (base class with EventId, EventType, Timestamp, Severity, CorrelationId).

For detailed implementation examples, see the Tutorial.

Configuration

var options = new CrawlOptions
{
    MaxDepth = 3,
    MaxPages = 100,
    RespectRobotsTxt = true,
    UserAgent = "MyBot/1.0"
};

var chunkOptions = new ChunkingOptions
{
    Strategy = "Auto",
    MaxChunkSize = 512,
    OverlapSize = 64
};

await foreach (var chunk in processor.ProcessWebsiteAsync(url, options, chunkOptions))
{
    // Handle chunk
}

License

MIT License - see the LICENSE file for details.
