replace docx binary fixture with generated stream #20
Pull Request Overview
This PR replaces a binary DOCX regression test asset with a programmatic document generator and implements comprehensive conversion middleware infrastructure for AI-powered image enrichment. The changes enable document converters to capture raw extraction artifacts and pass them through a configurable middleware pipeline before Markdown composition.
Key changes:
- Introduced conversion middleware architecture with pipeline execution for document post-processing
- Added AI image enrichment middleware that generates detailed descriptions using chat clients
- Replaced binary test fixture with generated DOCX containing inline PNG images
- Enhanced PDF, DOCX, and PPTX converters to capture image artifacts and execute middleware pipelines
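
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: every type name here (`ImageArtifact`, `ConversionContext`, `IConversionMiddleware`, `ConversionPipeline`, `StubImageDescriptionMiddleware`) is hypothetical, and the stub middleware stands in for the AI enrichment step, which in the real change would call a chat client.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical artifact captured by a converter during extraction.
public sealed class ImageArtifact
{
    public ImageArtifact(string placeholder) => Placeholder = placeholder;
    public string Placeholder { get; }
    public string? Description { get; set; }
}

// Hypothetical context handed to every middleware before Markdown composition.
public sealed class ConversionContext
{
    public List<ImageArtifact> Images { get; } = new();
}

// Hypothetical middleware contract; the PR's real interface may differ.
public interface IConversionMiddleware
{
    Task InvokeAsync(ConversionContext context, CancellationToken cancellationToken);
}

// Runs each middleware in registration order over the same context.
public sealed class ConversionPipeline
{
    private readonly IReadOnlyList<IConversionMiddleware> _middleware;

    public ConversionPipeline(IReadOnlyList<IConversionMiddleware> middleware)
        => _middleware = middleware;

    public async Task ExecuteAsync(ConversionContext context, CancellationToken cancellationToken = default)
    {
        foreach (var middleware in _middleware)
        {
            cancellationToken.ThrowIfCancellationRequested();
            await middleware.InvokeAsync(context, cancellationToken);
        }
    }
}

// Stand-in for the AI enrichment step: fills in a canned description
// for every image that does not have one yet.
public sealed class StubImageDescriptionMiddleware : IConversionMiddleware
{
    public Task InvokeAsync(ConversionContext context, CancellationToken cancellationToken)
    {
        foreach (var image in context.Images)
        {
            image.Description ??= $"Description for {image.Placeholder}";
        }
        return Task.CompletedTask;
    }
}

public static class Program
{
    public static async Task Main()
    {
        var context = new ConversionContext();
        context.Images.Add(new ImageArtifact("image-1"));

        var pipeline = new ConversionPipeline(
            new IConversionMiddleware[] { new StubImageDescriptionMiddleware() });
        await pipeline.ExecuteAsync(context);

        Console.WriteLine(context.Images[0].Description);
    }
}
```

Keeping the context mutable and running middleware sequentially means a converter only needs to populate the context once and call the pipeline before composing Markdown. Whether the PR uses this shape or a `next`-delegate chain is not stated in the overview.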
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/MarkItDown.Tests/Fixtures/DocxInlineImageFactory.cs | Programmatic DOCX generator that creates documents with embedded PNG images |
| tests/MarkItDown.Tests/RecordingPipeline.cs | Test harness that records pipeline execution and injects test content after image placeholders |
| src/MarkItDown/Conversion/ | New middleware infrastructure including pipeline execution, context objects, and AI enrichment middleware |
| src/MarkItDown/Converters/DocxConverter.cs | Enhanced to extract image artifacts and execute conversion pipeline |
| src/MarkItDown/Converters/PptxConverter.cs | Enhanced to extract slide images and execute conversion pipeline |
| src/MarkItDown/Converters/PdfConverter.cs | Enhanced to capture page snapshots and execute conversion pipeline |
| src/MarkItDown/MarkItDown.cs | Updated to build conversion pipeline and pass to converters |

      #pragma warning disable CA1416
    - foreach (var bitmap in Conversion.ToImages(pdfBytes, password: null, options))
    + foreach (var bitmap in PDFtoImage.Conversion.ToImages(pdfBytes, password: null, options))

Namespace alias inconsistency. A using directive (not shown in this hunk) already makes the PDFtoImage conversion type available as Conversion, but this line references the fully qualified PDFtoImage.Conversion. Remove the PDFtoImage. prefix to match the existing alias pattern.

    {
        var metadata = new Dictionary<string, string>
        {
            ["page"] = pageNumber.ToString(CultureInfo.InvariantCulture)

Metadata key inconsistency. Use the constant MetadataKeys.Page instead of the hardcoded string "page" to maintain consistency with the established pattern used elsewhere in the codebase.
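
The suggested change might look like the sketch below. The `MetadataKeys` class shown here is a hypothetical stand-in so the example compiles on its own. The real class already exists in the codebase according to the comment, and the value of `MetadataKeys.Page` is assumed to be `"page"`, which the review does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical stand-in for the codebase's existing MetadataKeys class.
public static class MetadataKeys
{
    // Assumed value; the actual constant may differ.
    public const string Page = "page";
}

public static class Program
{
    public static void Main()
    {
        var pageNumber = 3;

        // Use the shared constant as the key instead of a hardcoded string literal.
        var metadata = new Dictionary<string, string>
        {
            [MetadataKeys.Page] = pageNumber.ToString(CultureInfo.InvariantCulture)
        };

        Console.WriteLine(metadata[MetadataKeys.Page]);
    }
}
```

Referencing the constant keeps producers and consumers of the metadata dictionary in sync if the key is ever renamed.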
Summary
Testing
dotnet test MarkItDown.slnx

https://chatgpt.com/codex/tasks/task_e_68e9f8ffeda0832686f0fff19e585de0