A .NET SDK for preprocessing web content for RAG (Retrieval-Augmented Generation) systems.
WebFlux processes web content into chunks optimized for RAG systems. It handles web crawling, content extraction, and intelligent chunking with support for multiple content formats.
## Installation

```bash
dotnet add package WebFlux
```

## Quick Start

```csharp
using WebFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register your AI service implementations
services.AddScoped<ITextEmbeddingService, YourEmbeddingService>();
services.AddScoped<ITextCompletionService, YourLLMService>(); // Optional

// Add WebFlux
services.AddWebFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IWebContentProcessor>();

// Process a single URL
var chunks = await processor.ProcessUrlAsync("https://example.com");
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
}

// Or stream a whole website
await foreach (var chunk in processor.ProcessWebsiteAsync("https://example.com"))
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
}
```

## Features

- Interface-Based Design: Bring your own AI services (OpenAI, Anthropic, Azure, local models)
- Multiple Chunking Strategies: Auto, Smart, Semantic, Intelligent, MemoryOptimized, Paragraph, FixedSize, DomStructure
- Content Formats: HTML, Markdown, JSON, XML, PDF
- Web Standards: robots.txt, sitemap.xml, ai.txt, llms.txt, manifest.json
- Streaming: Process large websites with AsyncEnumerable
- Parallel Processing: Concurrent crawling and processing
- Rich Metadata: Web document metadata extraction (SEO, Open Graph, Schema.org, Twitter Cards)
- Progress Tracking: Real-time batch crawling progress with detailed statistics
## Chunking Strategies

| Strategy | Use Case |
|---|---|
| Auto | Automatically selects best strategy based on content |
| Smart | Structured HTML documentation |
| Semantic | General web pages and articles |
| Intelligent | Blogs and knowledge bases |
| MemoryOptimized | Large documents with memory constraints |
| Paragraph | Markdown with natural boundaries |
| FixedSize | Uniform chunks for testing |
| DomStructure | HTML DOM structure-based chunking preserving semantic boundaries |
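A strategy is selected by name through `ChunkingOptions`. As a brief sketch (reusing the `processor` and the `ProcessWebsiteAsync` overload shown elsewhere in this README; the parameterless `CrawlOptions` constructor is an assumption):

```csharp
// Pick a strategy from the table above by name.
var chunkOptions = new ChunkingOptions
{
    Strategy = "Semantic", // any strategy name from the table
    MaxChunkSize = 512
};

// Stream chunks produced by the chosen strategy.
await foreach (var chunk in processor.ProcessWebsiteAsync(
    "https://example.com", new CrawlOptions(), chunkOptions))
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
}
```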
## Core Interfaces

WebFlux uses the Interface Provider pattern: you provide the AI service implementations, and WebFlux handles crawling, extraction, and chunking.
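For example, an embedding provider is plugged in by implementing `ITextEmbeddingService` (defined below). The following is an illustrative stub that returns deterministic pseudo-embeddings, useful for wiring up and testing; a real implementation would call an embedding API instead, and the chosen `MaxTokens`/`EmbeddingDimension` values here are arbitrary:

```csharp
// Stub implementation of WebFlux's ITextEmbeddingService for testing.
public sealed class StubEmbeddingService : ITextEmbeddingService
{
    public int MaxTokens => 8192;
    public int EmbeddingDimension => 384;

    public Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default)
    {
        // Stable hash of the text (string.GetHashCode is randomized per process,
        // so roll a simple one) to seed a reproducible pseudo-embedding.
        var seed = 0;
        foreach (var ch in text) seed = unchecked(seed * 31 + ch);

        var rng = new Random(seed);
        var vector = new float[EmbeddingDimension];
        for (var i = 0; i < vector.Length; i++)
            vector[i] = (float)(rng.NextDouble() * 2 - 1);
        return Task.FromResult(vector);
    }

    public async Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts, CancellationToken cancellationToken = default)
    {
        var results = new float[texts.Count][];
        for (var i = 0; i < texts.Count; i++)
            results[i] = await GetEmbeddingAsync(texts[i], cancellationToken);
        return results;
    }
}
```

Register it in place of `YourEmbeddingService` from the Quick Start: `services.AddScoped<ITextEmbeddingService, StubEmbeddingService>();`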
### ITextEmbeddingService

Vector embedding generation for semantic chunking:

```csharp
public interface ITextEmbeddingService
{
    Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default);
    Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts, CancellationToken cancellationToken = default);
    int MaxTokens { get; }
    int EmbeddingDimension { get; }
}
```

### ITextCompletionService

LLM text completion for multimodal processing and content reconstruction:
```csharp
public interface ITextCompletionService
{
    Task<string> CompleteAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    IAsyncEnumerable<string> CompleteStreamAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}
```

### IImageToTextService

Image-to-text conversion for multimodal content:
```csharp
public interface IImageToTextService
{
    Task<string> ConvertImageToTextAsync(string imageUrl, ImageToTextOptions? options = null, CancellationToken cancellationToken = default);
    Task<string> ExtractTextFromImageAsync(string imageUrl, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}
```

### IWebContentProcessor

The main entry point for all web content processing:
```csharp
// Single URL processing
var chunks = await processor.ProcessUrlAsync("https://example.com");

// Website crawling (streaming)
await foreach (var chunk in processor.ProcessWebsiteAsync(url, crawlOptions, chunkOptions))
{
    // Process chunk
}

// Batch processing
var results = await processor.ProcessUrlsBatchAsync(urls, chunkOptions);
```

### IContentExtractService and IContentChunkService

For consumers that only need extraction or chunking:
```csharp
// Extraction only
var extractor = provider.GetRequiredService<IContentExtractService>();
var result = await extractor.ExtractContentAsync("https://example.com");

// Chunking only
var chunker = provider.GetRequiredService<IContentChunkService>();
var chunks = await chunker.ProcessUrlAsync("https://example.com");
```

### IChunkingStrategy

Implement custom chunking strategies:
```csharp
public interface IChunkingStrategy
{
    string Name { get; }
    string Description { get; }
    Task<IReadOnlyList<WebContentChunk>> ChunkAsync(ExtractedContent content, ChunkingOptions? options = null, CancellationToken cancellationToken = default);
}
```

## Events

Subscribe to pipeline events for monitoring, metrics collection, and observability.
IEventPublisher is automatically registered as a singleton when you call AddWebFlux().
```csharp
using WebFlux.Core.Interfaces;
using WebFlux.Core.Models.Events;

var publisher = provider.GetRequiredService<IEventPublisher>();

// Subscribe to specific event types
using var s1 = publisher.Subscribe<PageCrawledEvent>(async e =>
{
    Console.WriteLine($"Crawled {e.Url} [{e.StatusCode}] in {e.ProcessingTimeMs}ms");
    await metrics.RecordPageCrawl(e);
});

using var s2 = publisher.Subscribe<ChunkGeneratedEvent>(e =>
{
    Console.WriteLine($"Chunk #{e.SequenceNumber} ({e.ChunkSize} tokens) from {e.SourceUrl}");
});

using var s3 = publisher.Subscribe<ErrorOccurredEvent>(e =>
{
    Console.WriteLine($"[{e.ErrorCategory}] {e.ErrorCode}: {e.Message}");
});

// Or subscribe to ALL events
using var sAll = publisher.SubscribeAll(async e =>
{
    await logger.LogEventAsync(e.EventType, e);
});
```

Available event types (in the WebFlux.Core.Models.Events namespace):
| Category | Events |
|---|---|
| Pipeline | ProcessingStartedEvent, ProcessingProgressEvent, ProcessingCompletedEvent, ProcessingFailedEvent |
| Crawling | CrawlingStartedEvent, CrawlingCompletedEvent, PageCrawledEvent, UrlProcessingStartedEvent, UrlProcessedEvent, UrlProcessingFailedEvent |
| Extraction | ContentExtractionStartedEvent, ContentExtractionCompletedEvent, ContentExtractionFailedEvent, ImageProcessedEvent |
| Chunking | ChunkingStartedEvent, ChunkingCompletedEvent, ChunkGeneratedEvent |
| Monitoring | ErrorOccurredEvent, PerformanceMetricsEvent |
All events derive from ProcessingEvent (base class with EventId, EventType, Timestamp, Severity, CorrelationId).
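Because every event carries a CorrelationId, a subscriber can group related events across the pipeline. A sketch (assuming CorrelationId is a string key; adjust if the actual model uses another type):

```csharp
using System.Collections.Concurrent;

// Collect all events, bucketed by correlation id, for later inspection.
var byCorrelation = new ConcurrentDictionary<string, List<ProcessingEvent>>();

using var sub = publisher.SubscribeAll(async e =>
{
    var bucket = byCorrelation.GetOrAdd(e.CorrelationId, _ => new List<ProcessingEvent>());
    lock (bucket)
    {
        bucket.Add(e);
    }
    await Task.CompletedTask; // handler is async to match the SubscribeAll usage above
});
```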
For detailed implementation examples, see the Tutorial.
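As a minimal illustration of the IChunkingStrategy extension point shown above, here is a fixed-window strategy. The property names on `ExtractedContent` and `WebContentChunk` (`Text`, `Content`, `ChunkIndex`) are assumptions for illustration; check the real model types before using this:

```csharp
// Illustrative custom strategy: split extracted text into fixed-size windows.
public sealed class FixedWindowStrategy : IChunkingStrategy
{
    public string Name => "FixedWindow";
    public string Description => "Splits extracted text into fixed-size character windows.";

    public Task<IReadOnlyList<WebContentChunk>> ChunkAsync(
        ExtractedContent content,
        ChunkingOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var size = options?.MaxChunkSize ?? 512;
        var text = content.Text; // assumed property name
        var chunks = new List<WebContentChunk>();

        for (int i = 0, index = 0; i < text.Length; i += size, index++)
        {
            chunks.Add(new WebContentChunk // assumed object-initializer shape
            {
                ChunkIndex = index,
                Content = text.Substring(i, Math.Min(size, text.Length - i))
            });
        }
        return Task.FromResult<IReadOnlyList<WebContentChunk>>(chunks);
    }
}
```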
## Configuration

```csharp
var options = new CrawlOptions
{
    MaxDepth = 3,
    MaxPages = 100,
    RespectRobotsTxt = true,
    UserAgent = "MyBot/1.0"
};

var chunkOptions = new ChunkingOptions
{
    Strategy = "Auto",
    MaxChunkSize = 512,
    OverlapSize = 64
};

await foreach (var chunk in processor.ProcessWebsiteAsync(url, options, chunkOptions))
{
    // Handle chunk
}
```

## Documentation

- Tutorial - Step-by-step guide with practical examples
- Architecture - System design and pipeline
- Interfaces - API contracts and implementation guide
- Chunking Strategies - Detailed strategy guide
- Changelog - Version history and release notes
## License

MIT License - see the LICENSE file for details.
## Support

- Issues: GitHub Issues
- Package: NuGet