iyulab/WebFlux

WebFlux

A .NET SDK for preprocessing web content for RAG (Retrieval-Augmented Generation) systems.


Overview

WebFlux processes web content into chunks optimized for RAG systems. It handles web crawling, content extraction, and intelligent chunking with support for multiple content formats.

Installation

dotnet add package WebFlux

Quick Start

using WebFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register your AI service implementations
services.AddScoped<ITextEmbeddingService, YourEmbeddingService>();
services.AddScoped<ITextCompletionService, YourLLMService>(); // Optional

// Add WebFlux
services.AddWebFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IWebContentProcessor>();

// Process a single URL
var chunks = await processor.ProcessUrlAsync("https://example.com");
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
}

// Or stream a whole website
await foreach (var chunk in processor.ProcessWebsiteAsync("https://example.com"))
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
}

Features

  • Interface-Based Design: Bring your own AI services (OpenAI, Anthropic, Azure, local models)
  • Multiple Chunking Strategies: Auto, Smart, Semantic, Intelligent, MemoryOptimized, Paragraph, FixedSize, DomStructure
  • Content Formats: HTML, Markdown, JSON, XML, PDF
  • Web Standards: robots.txt, sitemap.xml, ai.txt, llms.txt, manifest.json
  • Streaming: Process large websites with AsyncEnumerable
  • Parallel Processing: Concurrent crawling and processing
  • Rich Metadata: Web document metadata extraction (SEO, Open Graph, Schema.org, Twitter Cards)
  • Progress Tracking: Real-time batch crawling progress with detailed statistics

Chunking Strategies

  • Auto: Automatically selects the best strategy based on content
  • Smart: Structured HTML documentation
  • Semantic: General web pages and articles
  • Intelligent: Blogs and knowledge bases
  • MemoryOptimized: Large documents with memory constraints
  • Paragraph: Markdown with natural boundaries
  • FixedSize: Uniform chunks for testing
  • DomStructure: HTML DOM structure-based chunking that preserves semantic boundaries
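A strategy is selected by name through ChunkingOptions (see the Configuration section). As a minimal sketch, assuming the property names shown in Configuration:

```csharp
// Pick a strategy by name; "Paragraph" suits Markdown with natural boundaries.
var chunkOptions = new ChunkingOptions
{
    Strategy = "Paragraph",
    MaxChunkSize = 512,
    OverlapSize = 64
};
```

The resulting options object is passed to the processor methods, e.g. ProcessWebsiteAsync(url, crawlOptions, chunkOptions).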

Core Interfaces

WebFlux uses the Interface Provider pattern. You provide AI service implementations, and WebFlux handles crawling, extraction, and chunking.

Required AI Services

ITextEmbeddingService (Required)

Vector embedding generation for semantic chunking:

public interface ITextEmbeddingService
{
    Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default);
    Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts, CancellationToken cancellationToken = default);
    int MaxTokens { get; }
    int EmbeddingDimension { get; }
}
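Any embedding provider can sit behind this interface. The class below is a hypothetical stand-in useful for wiring up and testing the pipeline: it derives deterministic pseudo-embeddings from a SHA-256 hash rather than calling a real model, so swap it for an OpenAI, Azure, or local-model implementation in production.

```csharp
using System.Security.Cryptography;
using System.Text;

// Hypothetical test double: deterministic pseudo-embeddings, no external calls.
public sealed class FakeEmbeddingService : ITextEmbeddingService
{
    public int MaxTokens => 8192;
    public int EmbeddingDimension => 32;

    public Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default)
    {
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(text));
        var vector = new float[EmbeddingDimension];
        for (var i = 0; i < EmbeddingDimension; i++)
            vector[i] = hash[i % hash.Length] / 255f; // components in [0, 1]
        return Task.FromResult(vector);
    }

    public async Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(
        IReadOnlyList<string> texts, CancellationToken cancellationToken = default)
    {
        var results = new List<float[]>(texts.Count);
        foreach (var text in texts)
            results.Add(await GetEmbeddingAsync(text, cancellationToken));
        return results;
    }
}
```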

Optional AI Services

ITextCompletionService (Optional)

LLM text completion for multimodal processing and content reconstruction:

public interface ITextCompletionService
{
    Task<string> CompleteAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    IAsyncEnumerable<string> CompleteStreamAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}
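Since this service is optional, a stub implementation is enough to satisfy the registration while you integrate a real LLM. A minimal sketch (the canned response text is, of course, placeholder):

```csharp
using System.Runtime.CompilerServices;

// Hypothetical stub: returns a canned completion instead of calling an LLM.
public sealed class StubCompletionService : ITextCompletionService
{
    public Task<string> CompleteAsync(string prompt, TextCompletionOptions? options = null,
        CancellationToken cancellationToken = default)
        => Task.FromResult($"[stub completion for {prompt.Length}-char prompt]");

    public async IAsyncEnumerable<string> CompleteStreamAsync(string prompt,
        TextCompletionOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Stream the stub response as a single token for simplicity.
        yield return await CompleteAsync(prompt, options, cancellationToken);
    }

    public Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default)
        => Task.FromResult(true);
}
```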

IImageToTextService (Optional)

Image-to-text conversion for multimodal content:

public interface IImageToTextService
{
    Task<string> ConvertImageToTextAsync(string imageUrl, ImageToTextOptions? options = null, CancellationToken cancellationToken = default);
    Task<string> ExtractTextFromImageAsync(string imageUrl, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}

Main Processor

IWebContentProcessor

The main entry point for all web content processing:

// Single URL processing
var chunks = await processor.ProcessUrlAsync("https://example.com");

// Website crawling (streaming)
await foreach (var chunk in processor.ProcessWebsiteAsync(url, crawlOptions, chunkOptions))
{
    // Process chunk
}

// Batch processing
var results = await processor.ProcessUrlsBatchAsync(urls, chunkOptions);

Focused Interfaces (ISP)

For consumers that only need extraction or chunking:

// Extraction only
var extractor = provider.GetRequiredService<IContentExtractService>();
var result = await extractor.ExtractContentAsync("https://example.com");

// Chunking only
var chunker = provider.GetRequiredService<IContentChunkService>();
var chunks = await chunker.ProcessUrlAsync("https://example.com");

Extensibility

IChunkingStrategy

Implement custom chunking strategies:

public interface IChunkingStrategy
{
    string Name { get; }
    string Description { get; }
    Task<IReadOnlyList<WebContentChunk>> ChunkAsync(ExtractedContent content, ChunkingOptions? options = null, CancellationToken cancellationToken = default);
}
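A custom strategy only needs to map ExtractedContent to a list of chunks. The sketch below splits on blank lines; it assumes ExtractedContent exposes a Text property and that WebContentChunk has settable Content and ChunkIndex members (ChunkIndex and Content appear in the Quick Start example, the rest is hypothetical):

```csharp
// Hypothetical sketch: chunk on blank-line paragraph boundaries.
public sealed class BlankLineChunkingStrategy : IChunkingStrategy
{
    public string Name => "BlankLine";
    public string Description => "Splits extracted text on blank lines.";

    public Task<IReadOnlyList<WebContentChunk>> ChunkAsync(
        ExtractedContent content,
        ChunkingOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // content.Text is an assumed property name.
        var paragraphs = content.Text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<WebContentChunk>(paragraphs.Length);
        for (var i = 0; i < paragraphs.Length; i++)
            chunks.Add(new WebContentChunk { ChunkIndex = i, Content = paragraphs[i].Trim() });
        return Task.FromResult<IReadOnlyList<WebContentChunk>>(chunks);
    }
}
```

Registering the strategy with the service collection would make it selectable by its Name, mirroring how the built-in strategies are chosen.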

IEventPublisher

Subscribe to pipeline events for monitoring, metrics collection, and observability. IEventPublisher is automatically registered as a singleton when you call AddWebFlux().

using WebFlux.Core.Interfaces;
using WebFlux.Core.Models.Events;

var publisher = provider.GetRequiredService<IEventPublisher>();

// Subscribe to specific event types
using var s1 = publisher.Subscribe<PageCrawledEvent>(async e =>
{
    Console.WriteLine($"Crawled {e.Url} [{e.StatusCode}] in {e.ProcessingTimeMs}ms");
    await metrics.RecordPageCrawl(e);
});

using var s2 = publisher.Subscribe<ChunkGeneratedEvent>(e =>
{
    Console.WriteLine($"Chunk #{e.SequenceNumber} ({e.ChunkSize} tokens) from {e.SourceUrl}");
});

using var s3 = publisher.Subscribe<ErrorOccurredEvent>(e =>
{
    Console.WriteLine($"[{e.ErrorCategory}] {e.ErrorCode}: {e.Message}");
});

// Or subscribe to ALL events
using var sAll = publisher.SubscribeAll(async e =>
{
    await logger.LogEventAsync(e.EventType, e);
});

Available event types (WebFlux.Core.Models.Events namespace):

  • Pipeline: ProcessingStartedEvent, ProcessingProgressEvent, ProcessingCompletedEvent, ProcessingFailedEvent
  • Crawling: CrawlingStartedEvent, CrawlingCompletedEvent, PageCrawledEvent, UrlProcessingStartedEvent, UrlProcessedEvent, UrlProcessingFailedEvent
  • Extraction: ContentExtractionStartedEvent, ContentExtractionCompletedEvent, ContentExtractionFailedEvent, ImageProcessedEvent
  • Chunking: ChunkingStartedEvent, ChunkingCompletedEvent, ChunkGeneratedEvent
  • Monitoring: ErrorOccurredEvent, PerformanceMetricsEvent

All events derive from ProcessingEvent (base class with EventId, EventType, Timestamp, Severity, CorrelationId).

For detailed implementation examples, see the Tutorial.

Configuration

var options = new CrawlOptions
{
    MaxDepth = 3,
    MaxPages = 100,
    RespectRobotsTxt = true,
    UserAgent = "MyBot/1.0"
};

var chunkOptions = new ChunkingOptions
{
    Strategy = "Auto",
    MaxChunkSize = 512,
    OverlapSize = 64
};

await foreach (var chunk in processor.ProcessWebsiteAsync(url, options, chunkOptions))
{
    // Handle chunk
}

License

MIT License - see the LICENSE file for details.
