.NET document processing library for RAG systems
FileFlux is a .NET library that transforms various document formats into optimized chunks for RAG (Retrieval-Augmented Generation) systems. Built on high-performance Rust FFI libraries for document parsing.
- 5-Stage Stateful Pipeline: Extract → Rule-Refine → LLM-Refine → Chunk → Enrich
- Native Document Readers: Rust FFI-based readers (Unpdf, Undoc, Unhwp) for 2-5x faster processing
- Multiple Document Formats: PDF, DOCX, XLSX, PPTX, HWP, HWPX, Markdown, HTML, TXT, JSON, CSV
- Flexible Chunking Strategies: Auto, Smart, Intelligent, Semantic, Paragraph, FixedSize, Hierarchical, PageLevel
- Interface-Driven AI: Define AI service interfaces, implement with your preferred provider
- Document Graph: Inter-chunk relationship tracking with sequential, hierarchical, and semantic edges
- Structural Metadata: HeadingPath, page numbers, ContextDependency scores for enhanced RAG
- Language Detection: Automatic language detection using NTextCat
- IEnrichedChunk Interface: Standardized interface for RAG system integration
- Metadata Enrichment: AI-powered metadata extraction with caching and fallback
- Extensible Architecture: Interface-based design for easy customization
- Async Processing: Streaming and parallel processing for large documents
```bash
dotnet add package FileFlux
```

```bash
dotnet add package FileFlux.Core
```

Package Comparison:
| Feature | FileFlux.Core | FileFlux |
|---|---|---|
| Document Readers (PDF, DOCX, etc.) | ✅ | ✅ |
| Core Interfaces & Models | ✅ | ✅ |
| AI Service Interfaces | ✅ | ✅ |
| Chunking Strategies | ❌ | ✅ |
| FluxCurator & FluxImprover | ❌ | ✅ |
| DocumentProcessor | ❌ | ✅ |
| Use Case | Custom chunking | Full RAG pipeline |
```csharp
using FileFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Optional: Register AI services for advanced features
// services.AddScoped<IDocumentAnalysisService, YourLLMService>();

// Register FileFlux services (no logger required)
services.AddFileFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IDocumentProcessor>();

// Process document
var chunks = await processor.ProcessAsync("document.pdf");
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.Index}: {chunk.Content}");
}
```

```csharp
await foreach (var result in processor.ProcessStreamAsync("document.pdf"))
{
    if (result.IsSuccess && result.Result != null)
    {
        foreach (var chunk in result.Result)
        {
            Console.WriteLine($"Chunk {chunk.Index}: {chunk.Content.Length} chars");
        }
    }
}
```

```csharp
var options = new ChunkingOptions
{
    Strategy = "Auto",   // Automatic strategy selection
    MaxChunkSize = 512,  // Maximum chunk size
    OverlapSize = 64     // Overlap between chunks
};

var chunks = await processor.ProcessAsync("document.pdf", options);
```

The new stateful pipeline provides explicit control over each processing stage:
```csharp
using FileFlux;
using FileFlux.Infrastructure.Factories;

// Create processor via factory
var factory = provider.GetRequiredService<IDocumentProcessorFactory>();
using var processor = factory.Create("document.pdf");

// Execute stages explicitly
await processor.ExtractAsync();    // Stage 1: Raw content extraction
await processor.RefineAsync();     // Stage 2: Rule-based text cleaning
await processor.LlmRefineAsync();  // Stage 3: LLM-powered refinement (optional)
await processor.ChunkAsync();      // Stage 4: Content chunking
await processor.EnrichAsync();     // Stage 5: LLM-powered enrichment (optional)

// Access results at each stage
Console.WriteLine($"State: {processor.State}");
Console.WriteLine($"Raw text length: {processor.Result.Raw?.Text.Length}");
Console.WriteLine($"Sections found: {processor.Result.Refined?.Sections.Count}");
Console.WriteLine($"Chunks created: {processor.Result.Chunks?.Count}");

// Or run the full pipeline at once
await processor.ProcessAsync(new ProcessingOptions
{
    IncludeEnrich = true,
    Enrich = new EnrichOptions { BuildGraph = true }
});

// Access the document graph
if (processor.Result.Graph != null)
{
    Console.WriteLine($"Graph nodes: {processor.Result.Graph.NodeCount}");
    Console.WriteLine($"Graph edges: {processor.Result.Graph.EdgeCount}");
}
```

Pipeline Stages:
| Stage | Interface | AI | Description |
|---|---|---|---|
| Extract | `IDocumentReader` | ❌ | Raw content extraction from files |
| Rule-Refine | `IDocumentRefiner` | ❌ | Text cleaning, normalization, structure analysis |
| LLM-Refine | `ILlmRefiner` | ✅ | AI-powered noise removal, sentence restoration |
| Chunk | `IChunkerFactory` | Optional | Content segmentation with various strategies |
| Enrich | `IDocumentEnricher` | ✅ | LLM-powered summaries, keywords, contextual text |
```csharp
var options = new ChunkingOptions
{
    Strategy = "Auto",
    MaxChunkSize = 512,
    CustomProperties = new Dictionary<string, object>
    {
        ["enableMetadataEnrichment"] = true,
        ["metadataSchema"] = MetadataSchema.General
    }
};

var chunks = await processor.ProcessAsync("document.pdf", options);

// Access enriched metadata
foreach (var chunk in chunks)
{
    var keywords = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_keywords");
    var description = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_description");
    var documentType = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_documentType");
    var language = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_language");
}
```

FileFlux defines AI service interfaces; consumer applications provide the implementations.
| Interface | Purpose | Example Implementations |
|---|---|---|
| `IDocumentAnalysisService` | Text generation, intelligent chunking | OpenAI, Anthropic, LMSupply |
| `IImageToTextService` | Image captioning, OCR | OpenAI Vision, LMSupply Captioner/OCR |
| `IEmbeddingService` | Embedding generation | OpenAI, LMSupply Embedder |
```csharp
using FileFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Implement your own AI services
services.AddScoped<IDocumentAnalysisService, YourOpenAIService>();
services.AddScoped<IImageToTextService, YourVisionService>();
services.AddScoped<IEmbeddingService, YourEmbeddingService>();

// Register FileFlux
services.AddFileFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IDocumentProcessor>();
```

For local AI processing without external API calls, see LMSupply. The FileFlux CLI demonstrates LMSupply integration:
```csharp
// Example from FileFlux.CLI - local AI processing
var lmSupplyOptions = new LMSupplyOptions
{
    UseGpuAcceleration = true,
    EmbeddingModel = "default",
    GeneratorModel = "microsoft/Phi-4-mini-instruct-onnx"
};

// Create LMSupply service implementations
var embedder = await LMSupplyEmbedderService.CreateAsync(lmSupplyOptions);
var generator = await LMSupplyGeneratorService.CreateAsync(lmSupplyOptions);

// Register as AI service implementations
services.AddSingleton<IEmbeddingService>(embedder);
services.AddSingleton<IDocumentAnalysisService>(generator);
services.AddFileFlux();
```

Note: LMSupply is not a direct dependency of FileFlux. Consumer applications that need local AI should reference LMSupply packages directly.
| Format | Extension | Reader | Features |
|---|---|---|---|
| PDF | .pdf | Unpdf (Rust FFI) | Text, tables, image extraction |
| Word | .docx | Undoc (Rust FFI) | Style and structure preservation |
| Excel | .xlsx | Undoc (Rust FFI) | Multi-sheet and table structure |
| PowerPoint | .pptx | Undoc (Rust FFI) | Slide and notes extraction |
| HWP | .hwp, .hwpx | Unhwp (Rust FFI) | Native Korean document support |
| Markdown | .md | Built-in | Structure preservation |
| HTML | .html, .htm | Built-in | Web content extraction |
| Text | .txt, .json, .csv | Built-in | Basic text processing |
- Vector Graphics Tables: Tables created with drawing primitives (lines/rectangles) instead of text layout may not be detected. These are rendered as images in most PDF viewers.
- Complex Multi-column Layouts: Documents with intricate multi-column arrangements may have suboptimal text ordering.
- Scanned Documents: OCR is not included; scanned PDFs require pre-processing with external OCR tools.
- Partial Extraction: When whole-document extraction fails, FileFlux automatically falls back to per-page extraction. Pages that cannot be extracted are skipped and recorded in `RawContent.Errors`. `RawContent.Status` is set to `ProcessingStatus.Partial` when some pages fail, allowing RAG pipelines to use the successfully extracted content rather than losing the entire document.
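This fallback can be inspected from consumer code. A minimal sketch using the stateful pipeline, assuming the `Status` and `Errors` members described above are exposed on the raw result and that error entries are printable (the file name is hypothetical):

```csharp
using FileFlux;
using FileFlux.Infrastructure.Factories;

var factory = provider.GetRequiredService<IDocumentProcessorFactory>();
using var processor = factory.Create("damaged.pdf"); // hypothetical file name

// Stage 1 only: raw content extraction
await processor.ExtractAsync();

var raw = processor.Result.Raw;
if (raw?.Status == ProcessingStatus.Partial)
{
    // Some pages failed, but the successfully extracted content is still usable
    Console.WriteLine($"Partial extraction: {raw.Errors.Count} page error(s)");
    foreach (var error in raw.Errors)
        Console.WriteLine($"  - {error}");
}
```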
FileFlux uses layout-based table detection with confidence scoring:
- Tables with confidence score ≥ 0.5 are converted to Markdown format
- Low-confidence tables fall back to plain text to prevent garbled output
- Table quality metrics are exposed via `StructuralHints` for consumer applications
- Excel: Very large worksheets (>100K rows) may impact memory usage
- PowerPoint: Embedded objects are extracted as placeholder text
- HTML: JavaScript-rendered content is not supported
| Strategy | Use Case |
|---|---|
| Auto | Automatic selection based on document type (recommended) |
| Smart | Legal, medical, academic documents |
| Intelligent | Technical documentation, API docs |
| Semantic | General documents, papers |
| Paragraph | Markdown, blogs |
| FixedSize | When uniform size is required |
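For illustration, a strategy from the table is selected via `ChunkingOptions.Strategy`. A minimal sketch comparing how different strategies segment the same document, assuming the `processor` from the quick-start example and that the returned sequence supports LINQ's `Count()`:

```csharp
using System.Linq;

// Compare chunk counts across strategies for the same document
foreach (var strategy in new[] { "Paragraph", "FixedSize", "Auto" })
{
    var chunks = await processor.ProcessAsync("document.pdf", new ChunkingOptions
    {
        Strategy = strategy,
        MaxChunkSize = 512,
        OverlapSize = 64
    });
    Console.WriteLine($"{strategy}: {chunks.Count()} chunks");
}
```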
FileFlux defines interfaces while implementation is up to the consumer application.
```csharp
// Optional: Register AI services for advanced features
// - IDocumentAnalysisService: For intelligent chunking and metadata enrichment
// - IImageToTextService: For multimodal document processing
services.AddScoped<IDocumentAnalysisService, YourLLMService>();
services.AddScoped<IImageToTextService, YourVisionService>();

// Register FileFlux services (works without AI services too)
services.AddFileFlux();
```

Note: Logger registration is optional. FileFlux uses `NullLogger` internally if no logger is provided.
For AI service implementation examples, see the samples/ directory.
FileFlux defines the interfaces; you implement them with your preferred AI provider.
```csharp
// Register your AI service implementation
services.AddScoped<IDocumentAnalysisService, YourAIService>();
services.AddFileFlux();
```

Features enabled with AI services:
- Intelligent structure analysis for optimal chunking
- Semantic content summarization
- AI-powered quality assessment
- Q&A benchmark generation for RAG testing
See Tutorial for AI service implementation examples.
Evaluate and optimize chunking quality for RAG systems:
```csharp
var analyzer = serviceProvider.GetRequiredService<IDocumentQualityAnalyzer>();

// Analyze document quality
var report = await analyzer.AnalyzeQualityAsync("document.pdf");
Console.WriteLine($"Quality Score: {report.OverallQualityScore:P2}");

// Generate Q&A benchmark for RAG testing
var benchmark = await analyzer.GenerateQABenchmarkAsync("document.pdf", questionCount: 20);

// Compare different chunking strategies
var strategies = new[] { "Intelligent", "Semantic", "Smart" };
var comparison = await analyzer.BenchmarkChunkingAsync("document.pdf", strategies);
```

See Architecture for quality analysis details.
FileFlux works with or without AI services:
```csharp
// Minimal setup (no AI)
services.AddFileFlux();

// With AI service
services.AddScoped<IDocumentAnalysisService, YourAIService>();
services.AddFileFlux();

// Environment-specific configuration
if (Environment.IsDevelopment())
    services.AddScoped<IDocumentAnalysisService, MockTextCompletionService>();
else
    services.AddScoped<IDocumentAnalysisService, ProductionAIService>();
services.AddFileFlux();
```

See Tutorial for more DI patterns and examples.
- Tutorial - Detailed usage guide and examples
- Architecture - System design and pipeline documentation
- Changelog - Version history and release notes
```
FileFlux/
├── src/
│   ├── FileFlux.Core/           # Extraction only (zero AI dependencies)
│   │   ├── Contracts/           # IDocumentProcessor, ProcessingResult
│   │   ├── Core/                # IDocumentRefiner, IDocumentEnricher
│   │   └── Domain/              # DocumentGraph, RefinedContent, StructuredElement
│   └── FileFlux/                # Full RAG pipeline (interface-driven)
│       └── Infrastructure/      # StatefulDocumentProcessor, DocumentRefiner, DocumentEnricher
├── cli/                         # CLI with LMSupply integration (not published)
│   └── FileFlux.CLI/
│       └── Services/LMSupply/   # LMSupply service implementations
├── tests/
│   └── FileFlux.Tests/          # Test suite (343+ tests)
└── samples/
    └── FileFlux.SampleApp/      # Usage examples
```
- Create and discuss an issue
- Work on a feature branch
- Add/modify tests
- Submit a pull request
MIT License - See LICENSE file
- Issue Reports: GitHub Issues
- Feature Requests: GitHub Discussions