Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown#196
Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown#196PredictiveManish wants to merge 9 commits intousemoss:mainfrom
Conversation
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
| async def create_index_from_files( | ||
| self, | ||
| name: str, | ||
| file_paths: List[str], | ||
| model_id: Optional[str] = None, | ||
| ) -> MutationResult: | ||
| """Create a new index by parsing files and adding their contents. | ||
|
|
||
| Supports PDF, DOCX, PPTX, HTML, and Markdown files. | ||
| Files are parsed into DocumentInfo objects before index creation. | ||
|
|
||
| Args: | ||
| name: Name of the index to create | ||
| file_paths: List of file paths to parse and add to the index | ||
| model_id: Optional model ID to use for embeddings | ||
|
|
||
| Returns: | ||
| MutationResult containing job information | ||
| """ | ||
| # Import moss-doc-parser here to avoid hard dependency | ||
| try: | ||
| from moss_doc_parser import FileTypeDetector | ||
| except ImportError as e: | ||
| raise ImportError( | ||
| "moss-doc-parser is required for create_index_from_files. " | ||
| "Install it with: pip install moss-doc-parser" | ||
| ) from e | ||
|
|
||
| # Parse all files | ||
| all_docs: List[DocumentInfo] = [] | ||
| detector = FileTypeDetector() | ||
|
|
||
| for file_path in file_paths: | ||
| try: | ||
| # Get appropriate parser for file type | ||
| parser = detector.get_parser_for_file(file_path) | ||
| # Parse the file | ||
| parse_result = parser.parse(file_path) | ||
| # Convert MossDocument objects to DocumentInfo objects | ||
| for doc in parse_result.documents: | ||
| all_docs.append( | ||
| DocumentInfo( | ||
| id=doc.id, | ||
| text=doc.text, | ||
| metadata=doc.metadata, | ||
| ) | ||
| ) | ||
| except Exception as e: | ||
| raise ValueError(f"Failed to parse file {file_path}: {str(e)}") | ||
|
|
||
| # Create index with parsed documents | ||
| resolved_model_id = self._resolve_model_id(all_docs, model_id) | ||
| return await asyncio.to_thread( | ||
| self._manage.create_index, | ||
| name, | ||
| all_docs, | ||
| resolved_model_id, | ||
| ) |
There was a problem hiding this comment.
🔴 New public API method create_index_from_files missing from .pyi type stub
The new create_index_from_files method is added to MossClient in moss_client.py:94-151 but is not declared in the type stub at sdks/python/sdk/src/moss/__init__.pyi. Every other public method on MossClient (e.g., create_index, add_docs, delete_docs, load_index, query) has a corresponding declaration in the .pyi file (see sdks/python/sdk/src/moss/__init__.pyi:13-61). This means type checkers, IDE autocompletion, and documentation generators won't see the new method. This also violates the CONTRIBUTING.md rule: "If you've changed APIs, update the documentation."
Prompt for agents
The new create_index_from_files method in MossClient (sdks/python/sdk/src/moss/client/moss_client.py:94-151) needs a corresponding type stub declaration in sdks/python/sdk/src/moss/__init__.pyi. Add it after the existing create_index stub (line 18), following the same pattern as other methods. The stub should be:
async def create_index_from_files(
self,
name: str,
file_paths: List[str],
model_id: Optional[str] = ...,
) -> MutationResult: ...
This ensures type checkers, IDE autocompletion, and documentation tools recognize the new public API method.
Was this helpful? React with 👍 or 👎 to provide feedback.
| class PPTXParser(BaseParser): | ||
| """Parser for PPTX files.""" | ||
|
|
||
| def parse(self, file_path: str) -> ParseResult: | ||
| """Parse a PPTX file and extract text from each slide. | ||
|
|
||
| Args: | ||
| file_path: Path to the PPTX file. | ||
|
|
||
| Returns: | ||
| ParseResult containing one document per slide (or chunked if needed). | ||
| """ | ||
| start_time = time.time() | ||
|
|
||
| documents = [] | ||
| prs = Presentation(file_path) | ||
|
|
||
| for slide_num, slide in enumerate(prs.slides): | ||
| text_runs = [] | ||
| for shape in slide.shapes: | ||
| if not shape.has_text_frame: | ||
| continue | ||
| for paragraph in shape.text_frame.paragraphs: | ||
| for run in paragraph.runs: | ||
| text_runs.append(run.text) | ||
| text = "\n".join(text_runs) | ||
| if text.strip(): # Only add non-empty slides | ||
| doc_id = f"{file_path}_slide_{slide_num}" | ||
| metadata = { | ||
| "source_file": file_path, | ||
| "slide_number": slide_num + 1, # 1-indexed for humans | ||
| "total_slides": len(prs.slides), | ||
| } | ||
| documents.append( | ||
| MossDocument( | ||
| id=doc_id, | ||
| text=text.strip(), | ||
| metadata=metadata, | ||
| ) | ||
| ) | ||
|
|
||
| parse_time_ms = (time.time() - start_time) * 1000 | ||
| return ParseResult( | ||
| documents=documents, | ||
| source_path=file_path, | ||
| parse_time_ms=parse_time_ms, | ||
| ) | ||
|
|
||
| def supported_extensions(self) -> List[str]: | ||
| """Return a list of file extensions this parser supports. | ||
|
|
||
| Returns: | ||
| List of file extensions (without the dot). | ||
| """ | ||
| return ["pptx"] |
There was a problem hiding this comment.
🟡 Missing test for PPTXParser violates CONTRIBUTING.md
CONTRIBUTING.md mandates: "If you've added code that should be tested, add tests." All four other parsers (PDF, DOCX, HTML, Markdown) have corresponding test files in packages/moss-doc-parser/tests/, but the new PPTXParser at packages/moss-doc-parser/src/moss_doc_parser/parsers/pptx.py has no test file. There is no test_pptx_parser.py.
Prompt for agents
Add a test file packages/moss-doc-parser/tests/test_pptx_parser.py following the same pattern as the existing test files (test_pdf_parser.py, test_docx_parser.py, etc.). The test should mock pptx.Presentation, create mock slides with shapes containing text frames and runs, and verify that PPTXParser.parse() returns a ParseResult with the expected documents, metadata (source_file, slide_number, total_slides), and text content.
Was this helpful? React with 👍 or 👎 to provide feedback.
| ) | ||
| ) | ||
| except Exception as e: | ||
| raise ValueError(f"Failed to parse file {file_path}: {str(e)}") |
There was a problem hiding this comment.
🟡 Exception chain lost in create_index_from_files error handler
At moss_client.py:142, raise ValueError(...) is used without from e, which discards the explicit exception chain. While Python 3 still attaches the original as __context__, the traceback message will say "During handling of the above exception, another exception occurred" instead of the clearer "The above exception was the direct cause of the following exception". This makes debugging harder, especially for file-not-found or parser-internal errors that get wrapped as generic ValueError.
| raise ValueError(f"Failed to parse file {file_path}: {str(e)}") | |
| raise ValueError(f"Failed to parse file {file_path}: {str(e)}") from e |
Was this helpful? React with 👍 or 👎 to provide feedback.
| metadata = { | ||
| "source_file": file_path, | ||
| "page_number": page_num + 1, # 1-indexed for humans | ||
| "total_pages": len(pdf.pages), |
There was a problem hiding this comment.
🚩 Metadata type mismatch between MossDocument and DocumentInfo
All parsers create metadata dicts with integer values (e.g., page_number: int, total_pages: int in packages/moss-doc-parser/src/moss_doc_parser/parsers/pdf.py:35-36). However, DocumentInfo.metadata is typed as Optional[Dict[str, str]] per the .pyi stub at sdks/python/sdk/src/moss/__init__.pyi:158. If the Rust-backed moss_core.DocumentInfo strictly enforces Dict[str, str], passing integer metadata values would cause runtime errors. In practice, existing code (e.g., _dict_to_doc in packages/moss-cli/src/moss_cli/documents.py:153 which passes through arbitrary JSON metadata) suggests the Rust binding accepts non-string values, but this is worth verifying — especially since all five parsers depend on this assumption.
Was this helpful? React with 👍 or 👎 to provide feedback.
| def test_parse_returns_documents(self, mock_bs, mock_md, mock_file): | ||
| # Setup mock for markdown conversion | ||
| mock_md_instance = MagicMock() | ||
| mock_md_instance.convert.return_value = ( | ||
| "<h1>Title</h1><p>This is a paragraph.</p>" | ||
| ) | ||
| mock_md.return_value = mock_md_instance | ||
|
|
||
| # Setup mock for BeautifulSoup | ||
| mock_soup = MagicMock() | ||
| # We need to mock the find_all method to return a list of elements | ||
| # For simplicity, we'll return two elements: a header and a paragraph | ||
| mock_h1 = MagicMock() | ||
| mock_h1.name = "h1" | ||
| mock_h1.get_text.return_value = "Title" | ||
| mock_p = MagicMock() | ||
| mock_p.name = "p" | ||
| mock_p.get_text.return_value = "This is a paragraph." | ||
| mock_soup.find_all.return_value = [mock_h1, mock_p] | ||
| mock_bs.return_value = mock_soup | ||
|
|
||
| parser = MarkdownParser() | ||
| result = parser.parse("dummy_path.md") |
There was a problem hiding this comment.
🚩 Markdown test mock setup has dead code but test still passes
In packages/moss-doc-parser/tests/test_markdown_parser.py:18-21, the test creates mock_md_instance and sets mock_md.return_value = mock_md_instance. However, MarkdownParser() is instantiated at line 37 (which calls markdown.Markdown(...) — i.e., the mock), and self.md captures the auto-generated return value before mock_md.return_value is reassigned. So parser.md is never mock_md_instance, and the convert.return_value setup is dead code. The test still passes because BeautifulSoup is also mocked and returns controlled elements regardless of what self.md.convert() returns. Not a production bug, but it means the test doesn't actually verify the markdown-to-HTML conversion step.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
Hi, you might wanna wait on this feature. we can come up with service internally for doc-parsing. |
Got it, thanks for the heads-up. I’ll pause further work on this feature for now. Should I close this PR? |
Pull Request Checklist
Please ensure that your PR meets the following requirements:
Description
Add doc-parsing connectors to enable native ingestion of various document formats (PDF, DOCX, PPTX, HTML, Markdown) into Moss indexes without requiring users to pre-process or convert documents themselves.
This feature adds a
preprocessinglayer that converts arbitrary files into theDocumentInfoformat expected by Moss, eliminating the need for users to manually extract text from documents before ingestion.Key Components Added:
moss-doc-parserpackage (packages/moss-doc-parser/) - A standalone Python package providing:Abstract
BaseParserclass defining the parser interfaceConcrete implementations for:
PDFParser(using pypdf)DocxParser(using python-docx)PptxParser(using python-pptx)HtmlParser(usingBeautifulSoup)MarkdownParser(with structured header/paragraph/code/list extraction)FileTypeDetectorfor automatic file type detection using extension-based and MIME type detectionDocument chunking capabilities (planned for future enhancement)
packages/moss-cli/src/moss_cli/documents.py:load_documents()functioncreate_index_from_files()method toMossClientinsdks/python/sdk/src/moss/client/moss_client.py:DocumentInfoobjectsFixes #98
Type of Change