
FileFlux

.NET document processing library for RAG systems


Overview

FileFlux is a .NET library that transforms various document formats into optimized chunks for RAG (Retrieval-Augmented Generation) systems. It is built on high-performance Rust FFI libraries for document parsing.

Key Features

  • 5-Stage Stateful Pipeline: Extract β†’ Rule-Refine β†’ LLM-Refine β†’ Chunk β†’ Enrich
  • Native Document Readers: Rust FFI-based readers (Unpdf, Undoc, Unhwp) for 2-5x faster processing
  • Multiple Document Formats: PDF, DOCX, XLSX, PPTX, HWP, HWPX, Markdown, HTML, TXT, JSON, CSV
  • Flexible Chunking Strategies: Auto, Smart, Intelligent, Semantic, Paragraph, FixedSize, Hierarchical, PageLevel
  • Interface-Driven AI: Define AI service interfaces, implement with your preferred provider
  • Document Graph: Inter-chunk relationship tracking with sequential, hierarchical, and semantic edges
  • Structural Metadata: HeadingPath, page numbers, ContextDependency scores for enhanced RAG
  • Language Detection: Automatic language detection using NTextCat
  • IEnrichedChunk Interface: Standardized interface for RAG system integration
  • Metadata Enrichment: AI-powered metadata extraction with caching and fallback
  • Extensible Architecture: Interface-based design for easy customization
  • Async Processing: Streaming and parallel processing for large documents

Installation

Full RAG Pipeline

```bash
dotnet add package FileFlux
```

Extraction Only (Minimal Dependencies)

```bash
dotnet add package FileFlux.Core
```

Package Comparison:

| Feature | FileFlux.Core | FileFlux |
|---|---|---|
| Document Readers (PDF, DOCX, etc.) | βœ… | βœ… |
| Core Interfaces & Models | βœ… | βœ… |
| AI Service Interfaces | βœ… | βœ… |
| Chunking Strategies | ❌ | βœ… |
| FluxCurator & FluxImprover | ❌ | βœ… |
| DocumentProcessor | ❌ | βœ… |
| Use Case | Custom chunking | Full RAG pipeline |

Quick Start

Basic Usage

```csharp
using FileFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Optional: Register AI services for advanced features
// services.AddScoped<IDocumentAnalysisService, YourLLMService>();

// Register FileFlux services (no logger required)
services.AddFileFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IDocumentProcessor>();

// Process document
var chunks = await processor.ProcessAsync("document.pdf");

foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.Index}: {chunk.Content}");
}
```

Streaming Processing

```csharp
await foreach (var result in processor.ProcessStreamAsync("document.pdf"))
{
    if (result.IsSuccess && result.Result != null)
    {
        foreach (var chunk in result.Result)
        {
            Console.WriteLine($"Chunk {chunk.Index}: {chunk.Content.Length} chars");
        }
    }
}
```

Chunking Options

```csharp
var options = new ChunkingOptions
{
    Strategy = "Auto",      // Automatic strategy selection
    MaxChunkSize = 512,     // Maximum chunk size
    OverlapSize = 64        // Overlap between chunks
};

var chunks = await processor.ProcessAsync("document.pdf", options);
```
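To make `MaxChunkSize` and `OverlapSize` concrete, here is a standalone sketch of fixed-size chunking with character overlap. This is not FileFlux's implementation (`FixedSizeChunks` is a hypothetical helper); it only illustrates how the two options interact.

```csharp
// Standalone illustration of how MaxChunkSize and OverlapSize interact.
// NOT FileFlux's implementation -- FixedSizeChunks is a hypothetical helper.
using System;
using System.Collections.Generic;

static List<string> FixedSizeChunks(string text, int maxChunkSize, int overlapSize)
{
    var chunks = new List<string>();
    int step = maxChunkSize - overlapSize; // each chunk starts `step` chars after the previous one
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(maxChunkSize, text.Length - start);
        chunks.Add(text.Substring(start, length));
        if (start + length >= text.Length) break; // last chunk reached the end of the text
    }
    return chunks;
}

var chunks = FixedSizeChunks(new string('x', 1000), maxChunkSize: 512, overlapSize: 64);
Console.WriteLine(chunks.Count); // 3 chunks: [0..512), [448..960), [896..1000)
```

With a 512-character limit and 64-character overlap, each chunk repeats the last 64 characters of the previous one, which helps retrieval when a relevant sentence straddles a chunk boundary.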

Stateful Pipeline (v0.9.0+)

The new stateful pipeline provides explicit control over each processing stage:

```csharp
using FileFlux;
using FileFlux.Infrastructure.Factories;

// Create processor via factory
var factory = provider.GetRequiredService<IDocumentProcessorFactory>();
using var processor = factory.Create("document.pdf");

// Execute stages explicitly
await processor.ExtractAsync();     // Stage 1: Raw content extraction
await processor.RefineAsync();      // Stage 2: Rule-based text cleaning
await processor.LlmRefineAsync();   // Stage 3: LLM-powered refinement (optional)
await processor.ChunkAsync();       // Stage 4: Content chunking
await processor.EnrichAsync();      // Stage 5: LLM-powered enrichment (optional)

// Access results at each stage
Console.WriteLine($"State: {processor.State}");
Console.WriteLine($"Raw text length: {processor.Result.Raw?.Text.Length}");
Console.WriteLine($"Sections found: {processor.Result.Refined?.Sections.Count}");
Console.WriteLine($"Chunks created: {processor.Result.Chunks?.Count}");

// Or run full pipeline at once
await processor.ProcessAsync(new ProcessingOptions
{
    IncludeEnrich = true,
    Enrich = new EnrichOptions { BuildGraph = true }
});

// Access the document graph
if (processor.Result.Graph != null)
{
    Console.WriteLine($"Graph nodes: {processor.Result.Graph.NodeCount}");
    Console.WriteLine($"Graph edges: {processor.Result.Graph.EdgeCount}");
}
```

Pipeline Stages:

| Stage | Interface | AI | Description |
|---|---|---|---|
| Extract | IDocumentReader | ❌ | Raw content extraction from files |
| Rule-Refine | IDocumentRefiner | ❌ | Text cleaning, normalization, structure analysis |
| LLM-Refine | ILlmRefiner | βœ… | AI-powered noise removal, sentence restoration |
| Chunk | IChunkerFactory | Optional | Content segmentation with various strategies |
| Enrich | IDocumentEnricher | βœ… | LLM-powered summaries, keywords, contextual text |

Metadata Enrichment

```csharp
var options = new ChunkingOptions
{
    Strategy = "Auto",
    MaxChunkSize = 512,
    CustomProperties = new Dictionary<string, object>
    {
        ["enableMetadataEnrichment"] = true,
        ["metadataSchema"] = MetadataSchema.General
    }
};

var chunks = await processor.ProcessAsync("document.pdf", options);

// Access enriched metadata
foreach (var chunk in chunks)
{
    var keywords = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_keywords");
    var description = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_description");
    var documentType = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_documentType");
    var language = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_language");
}
```

AI Service Interfaces

FileFlux defines the AI service interfaces; consumer applications provide the implementations.

Available Interfaces

| Interface | Purpose | Example Implementations |
|---|---|---|
| IDocumentAnalysisService | Text generation, intelligent chunking | OpenAI, Anthropic, LMSupply |
| IImageToTextService | Image captioning, OCR | OpenAI Vision, LMSupply Captioner/OCR |
| IEmbeddingService | Embedding generation | OpenAI, LMSupply Embedder |

Example: Custom AI Provider

```csharp
using FileFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Implement your own AI service
services.AddScoped<IDocumentAnalysisService, YourOpenAIService>();
services.AddScoped<IImageToTextService, YourVisionService>();
services.AddScoped<IEmbeddingService, YourEmbeddingService>();

// Register FileFlux
services.AddFileFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IDocumentProcessor>();
```

Local AI with LMSupply (CLI Example)

For local AI processing without external API calls, see LMSupply. The FileFlux CLI demonstrates LMSupply integration:

```csharp
// Example from FileFlux.CLI - local AI processing
var lmSupplyOptions = new LMSupplyOptions
{
    UseGpuAcceleration = true,
    EmbeddingModel = "default",
    GeneratorModel = "microsoft/Phi-4-mini-instruct-onnx"
};

// Create LMSupply service implementations
var embedder = await LMSupplyEmbedderService.CreateAsync(lmSupplyOptions);
var generator = await LMSupplyGeneratorService.CreateAsync(lmSupplyOptions);

// Register as AI service implementations
services.AddSingleton<IEmbeddingService>(embedder);
services.AddSingleton<IDocumentAnalysisService>(generator);
services.AddFileFlux();
```

Note: LMSupply is not a direct dependency of FileFlux. Consumer applications that need local AI should reference LMSupply packages directly.

Supported Document Formats

| Format | Extension | Reader | Features |
|---|---|---|---|
| PDF | .pdf | Unpdf (Rust FFI) | Text, tables, image extraction |
| Word | .docx | Undoc (Rust FFI) | Style and structure preservation |
| Excel | .xlsx | Undoc (Rust FFI) | Multi-sheet and table structure |
| PowerPoint | .pptx | Undoc (Rust FFI) | Slide and notes extraction |
| HWP | .hwp, .hwpx | Unhwp (Rust FFI) | Native Korean document support |
| Markdown | .md | Built-in | Structure preservation |
| HTML | .html, .htm | Built-in | Web content extraction |
| Text | .txt, .json, .csv | Built-in | Basic text processing |

Known Limitations

PDF Processing

  • Vector Graphics Tables: Tables created with drawing primitives (lines/rectangles) instead of text layout may not be detected. These are rendered as images in most PDF viewers.
  • Complex Multi-column Layouts: Documents with intricate multi-column arrangements may have suboptimal text ordering.
  • Scanned Documents: OCR is not included; scanned PDFs require pre-processing with external OCR tools.
  • Partial Extraction: When whole-document extraction fails, FileFlux automatically falls back to per-page extraction. Pages that cannot be extracted are skipped and recorded in RawContent.Errors. RawContent.Status is set to ProcessingStatus.Partial when some pages fail, allowing RAG pipelines to use the successfully extracted content rather than losing the entire document.
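The partial-extraction contract above can be pictured with a self-contained sketch. The `RawContent` and `ProcessingStatus` types below are simplified stand-ins for FileFlux's actual types, reduced to the members this section mentions (`Text`, `Status`, `Errors`):

```csharp
// Simplified stand-ins for FileFlux's RawContent/ProcessingStatus,
// reduced to the members described above (Text, Status, Errors).
using System;
using System.Collections.Generic;

var raw = new RawContent(
    Text: "text recovered from pages 1-2",
    Status: ProcessingStatus.Partial,
    Errors: new List<string> { "page 3: extraction failed" });

// A RAG pipeline can keep the partial text instead of dropping the document.
if (raw.Status == ProcessingStatus.Partial)
    Console.WriteLine($"partial: {raw.Text.Length} chars, {raw.Errors.Count} page error(s)");
else
    Console.WriteLine($"complete: {raw.Text.Length} chars");

enum ProcessingStatus { Completed, Partial, Failed }
record RawContent(string Text, ProcessingStatus Status, List<string> Errors);
```

The point of the `Partial` status is that a consumer can branch on it: log the per-page errors, but still chunk and index whatever text was recovered.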

Table Extraction

FileFlux uses layout-based table detection with confidence scoring:

  • Tables with confidence score β‰₯ 0.5 are converted to Markdown format
  • Low-confidence tables fall back to plain text to prevent garbled output
  • Table quality metrics are exposed via StructuralHints for consumer applications
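As an illustration of the confidence gate described above (not FileFlux's internal renderer; `EmitTable` is a hypothetical helper), a detected table either becomes Markdown or degrades to plain text:

```csharp
// Illustrative confidence gate for table output.
// EmitTable is a hypothetical helper, not part of the FileFlux API;
// the 0.5 threshold matches the behavior described above.
using System;
using System.Collections.Generic;
using System.Linq;

static string EmitTable(string[][] rows, double confidence)
{
    if (confidence < 0.5)
        // Low confidence: plain text avoids emitting a garbled Markdown table.
        return string.Join("\n", rows.Select(r => string.Join(" ", r)));

    // High confidence: first row becomes the header, followed by a separator row.
    var lines = new List<string>
    {
        "| " + string.Join(" | ", rows[0]) + " |",
        "|" + string.Concat(Enumerable.Repeat(" --- |", rows[0].Length)),
    };
    lines.AddRange(rows.Skip(1).Select(r => "| " + string.Join(" | ", r) + " |"));
    return string.Join("\n", lines);
}

var rows = new[] { new[] { "Name", "Score" }, new[] { "Ann", "0.9" } };
Console.WriteLine(EmitTable(rows, confidence: 0.8)); // Markdown table
Console.WriteLine(EmitTable(rows, confidence: 0.3)); // plain-text fallback
```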

Document-Specific Notes

  • Excel: Very large worksheets (>100K rows) may impact memory usage
  • PowerPoint: Embedded objects are extracted as placeholder text
  • HTML: JavaScript-rendered content is not supported

Chunking Strategies

| Strategy | Use Case |
|---|---|
| Auto | Automatic selection based on document type (recommended) |
| Smart | Legal, medical, academic documents |
| Intelligent | Technical documentation, API docs |
| Semantic | General documents, papers |
| Paragraph | Markdown, blogs |
| FixedSize | When uniform size is required |

AI Service Integration

FileFlux defines the interfaces; implementation is up to the consumer application.

```csharp
// Optional: Register AI services for advanced features
// - IDocumentAnalysisService: For intelligent chunking and metadata enrichment
// - IImageToTextService: For multimodal document processing
services.AddScoped<IDocumentAnalysisService, YourLLMService>();
services.AddScoped<IImageToTextService, YourVisionService>();

// Register FileFlux services (works without AI services too)
services.AddFileFlux();
```

Note: Logger registration is optional. FileFlux uses NullLogger internally if no logger is provided.

For AI service implementation examples, see the samples/ directory.

Advanced Features

πŸ€– AI Integration (Optional)

FileFlux defines the interfaces; you implement them with your preferred AI provider.

```csharp
// Register your AI service implementation
services.AddScoped<IDocumentAnalysisService, YourAIService>();
services.AddFileFlux();
```

Features enabled with AI services:

  • Intelligent structure analysis for optimal chunking
  • Semantic content summarization
  • AI-powered quality assessment
  • Q&A benchmark generation for RAG testing

πŸ“– See Tutorial for AI service implementation examples.

πŸ“Š Quality Analysis

Evaluate and optimize chunking quality for RAG systems:

```csharp
var analyzer = serviceProvider.GetRequiredService<IDocumentQualityAnalyzer>();

// Analyze document quality
var report = await analyzer.AnalyzeQualityAsync("document.pdf");
Console.WriteLine($"Quality Score: {report.OverallQualityScore:P2}");

// Generate Q&A benchmark for RAG testing
var benchmark = await analyzer.GenerateQABenchmarkAsync("document.pdf", questionCount: 20);

// Compare different chunking strategies
var strategies = new[] { "Intelligent", "Semantic", "Smart" };
var comparison = await analyzer.BenchmarkChunkingAsync("document.pdf", strategies);
```

πŸ“– See Architecture for quality analysis details.

πŸ”§ Dependency Injection

FileFlux works with or without AI services:

```csharp
// Minimal setup (no AI)
services.AddFileFlux();

// With AI service
services.AddScoped<IDocumentAnalysisService, YourAIService>();
services.AddFileFlux();

// Environment-specific configuration
if (Environment.IsDevelopment())
    services.AddScoped<IDocumentAnalysisService, MockTextCompletionService>();
else
    services.AddScoped<IDocumentAnalysisService, ProductionAIService>();

services.AddFileFlux();
```

πŸ“– See Tutorial for more DI patterns and examples.

Documentation

  • Tutorial - Detailed usage guide and examples
  • Architecture - System design and pipeline documentation
  • Changelog - Version history and release notes

Project Structure

```
FileFlux/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ FileFlux.Core/               # Extraction only (zero AI dependencies)
β”‚   β”‚   β”œβ”€β”€ Contracts/               # IDocumentProcessor, ProcessingResult
β”‚   β”‚   β”œβ”€β”€ Core/                    # IDocumentRefiner, IDocumentEnricher
β”‚   β”‚   └── Domain/                  # DocumentGraph, RefinedContent, StructuredElement
β”‚   └── FileFlux/                    # Full RAG pipeline (interface-driven)
β”‚       └── Infrastructure/          # StatefulDocumentProcessor, DocumentRefiner, DocumentEnricher
β”œβ”€β”€ cli/                             # CLI with LMSupply integration (not published)
β”‚   └── FileFlux.CLI/
β”‚       └── Services/LMSupply/       # LMSupply service implementations
β”œβ”€β”€ tests/
β”‚   └── FileFlux.Tests/              # Test suite (343+ tests)
└── samples/
    └── FileFlux.SampleApp/          # Usage examples
```

Contributing

  1. Create and discuss an issue
  2. Work on a feature branch
  3. Add/modify tests
  4. Submit a pull request

License

MIT License - See LICENSE file


About

.NET RAG document processing library that transforms PDF, DOCX, HWP, and more into optimized chunks via a 5-stage pipeline with Rust-based FFI readers.
