Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown by PredictiveManish · Pull Request #196 · usemoss/moss

PredictiveManish · 2026-05-01T13:46:58Z

Pull Request Checklist

Please ensure that your PR meets the following requirements:

I have read the CONTRIBUTING guide.
I have updated the documentation (if applicable).
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have added tests that prove my fix is effective or that my feature works.
New and existing unit tests pass locally with my changes.

Description

Add doc-parsing connectors to enable native ingestion of various document formats (PDF, DOCX, PPTX, HTML, Markdown) into Moss indexes without requiring users to pre-process or convert documents themselves.

This feature adds a preprocessing layer that converts arbitrary files into the DocumentInfo format expected by Moss, eliminating the need for users to manually extract text from documents before ingestion.

Key Components Added:

New moss-doc-parser package (packages/moss-doc-parser/) - A standalone Python package providing:

Abstract BaseParser class defining the parser interface
Concrete implementations for:
- PDFParser (using pypdf)
- DocxParser (using python-docx)
- PptxParser (using python-pptx)
- HtmlParser (using BeautifulSoup)
- MarkdownParser (with structured header/paragraph/code/list extraction)
FileTypeDetector for automatic file type detection using extension-based and MIME type detection
Document chunking capabilities (planned for future enhancement)

CLI Integration - Enhanced packages/moss-cli/src/moss_cli/documents.py:

Automatic detection and parsing of supported document files
Seamless integration with existing load_documents() function
Maintains backward compatibility with JSON/CSV/stdin input
Updated help text to reflect new capabilities

SDK Enhancement - Added create_index_from_files() method to MossClient in sdks/python/sdk/src/moss/client/moss_client.py:

Accepts a list of file paths of various supported formats
Automatically parses files into DocumentInfo objects
Creates a new index with the parsed content
Lazy imports moss-doc-parser to avoid hard dependency
Includes proper error handling and informative error messages

Fixes #98

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration

Devin Review found 5 new potential issues.

View 3 additional findings in Devin Review.

devin-ai-integration · 2026-05-01T15:01:18Z

+    async def create_index_from_files(
+        self,
+        name: str,
+        file_paths: List[str],
+        model_id: Optional[str] = None,
+    ) -> MutationResult:
+        """Create a new index by parsing files and adding their contents.
+
+        Supports PDF, DOCX, PPTX, HTML, and Markdown files.
+        Files are parsed into DocumentInfo objects before index creation.
+
+        Args:
+            name: Name of the index to create
+            file_paths: List of file paths to parse and add to the index
+            model_id: Optional model ID to use for embeddings
+
+        Returns:
+            MutationResult containing job information
+        """
+        # Import moss-doc-parser here to avoid hard dependency
+        try:
+            from moss_doc_parser import FileTypeDetector
+        except ImportError as e:
+            raise ImportError(
+                "moss-doc-parser is required for create_index_from_files. "
+                "Install it with: pip install moss-doc-parser"
+            ) from e
+
+        # Parse all files
+        all_docs: List[DocumentInfo] = []
+        detector = FileTypeDetector()
+
+        for file_path in file_paths:
+            try:
+                # Get appropriate parser for file type
+                parser = detector.get_parser_for_file(file_path)
+                # Parse the file
+                parse_result = parser.parse(file_path)
+                # Convert MossDocument objects to DocumentInfo objects
+                for doc in parse_result.documents:
+                    all_docs.append(
+                        DocumentInfo(
+                            id=doc.id,
+                            text=doc.text,
+                            metadata=doc.metadata,
+                        )
+                    )
+            except Exception as e:
+                raise ValueError(f"Failed to parse file {file_path}: {str(e)}")
+
+        # Create index with parsed documents
+        resolved_model_id = self._resolve_model_id(all_docs, model_id)
+        return await asyncio.to_thread(
+            self._manage.create_index,
+            name,
+            all_docs,
+            resolved_model_id,
+        )


🔴 New public API method create_index_from_files missing from .pyi type stub

The new create_index_from_files method is added to MossClient in moss_client.py:94-151 but is not declared in the type stub at sdks/python/sdk/src/moss/__init__.pyi. Every other public method on MossClient (e.g., create_index, add_docs, delete_docs, load_index, query) has a corresponding declaration in the .pyi file (see sdks/python/sdk/src/moss/__init__.pyi:13-61). This means type checkers, IDE autocompletion, and documentation generators won't see the new method. This also violates the CONTRIBUTING.md rule: "If you've changed APIs, update the documentation."

Prompt for agents

The new create_index_from_files method in MossClient (sdks/python/sdk/src/moss/client/moss_client.py:94-151) needs a corresponding type stub declaration in sdks/python/sdk/src/moss/__init__.pyi. Add it after the existing create_index stub (line 18), following the same pattern as other methods. The stub should be: async def create_index_from_files( self, name: str, file_paths: List[str], model_id: Optional[str] = ..., ) -> MutationResult: ... This ensures type checkers, IDE autocompletion, and documentation tools recognize the new public API method.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-01T15:01:20Z

+class PPTXParser(BaseParser):
+    """Parser for PPTX files."""
+
+    def parse(self, file_path: str) -> ParseResult:
+        """Parse a PPTX file and extract text from each slide.
+
+        Args:
+            file_path: Path to the PPTX file.
+
+        Returns:
+            ParseResult containing one document per slide (or chunked if needed).
+        """
+        start_time = time.time()
+
+        documents = []
+        prs = Presentation(file_path)
+
+        for slide_num, slide in enumerate(prs.slides):
+            text_runs = []
+            for shape in slide.shapes:
+                if not shape.has_text_frame:
+                    continue
+                for paragraph in shape.text_frame.paragraphs:
+                    for run in paragraph.runs:
+                        text_runs.append(run.text)
+            text = "\n".join(text_runs)
+            if text.strip():  # Only add non-empty slides
+                doc_id = f"{file_path}_slide_{slide_num}"
+                metadata = {
+                    "source_file": file_path,
+                    "slide_number": slide_num + 1,  # 1-indexed for humans
+                    "total_slides": len(prs.slides),
+                }
+                documents.append(
+                    MossDocument(
+                        id=doc_id,
+                        text=text.strip(),
+                        metadata=metadata,
+                    )
+                )
+
+        parse_time_ms = (time.time() - start_time) * 1000
+        return ParseResult(
+            documents=documents,
+            source_path=file_path,
+            parse_time_ms=parse_time_ms,
+        )
+
+    def supported_extensions(self) -> List[str]:
+        """Return a list of file extensions this parser supports.
+
+        Returns:
+            List of file extensions (without the dot).
+        """
+        return ["pptx"]


🟡 Missing test for PPTXParser violates CONTRIBUTING.md

CONTRIBUTING.md mandates: "If you've added code that should be tested, add tests." All four other parsers (PDF, DOCX, HTML, Markdown) have corresponding test files in packages/moss-doc-parser/tests/, but the new PPTXParser at packages/moss-doc-parser/src/moss_doc_parser/parsers/pptx.py has no test file. There is no test_pptx_parser.py.

Prompt for agents

Add a test file packages/moss-doc-parser/tests/test_pptx_parser.py following the same pattern as the existing test files (test_pdf_parser.py, test_docx_parser.py, etc.). The test should mock pptx.Presentation, create mock slides with shapes containing text frames and runs, and verify that PPTXParser.parse() returns a ParseResult with the expected documents, metadata (source_file, slide_number, total_slides), and text content.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-01T15:01:21Z

+                        )
+                    )
+            except Exception as e:
+                raise ValueError(f"Failed to parse file {file_path}: {str(e)}")


🟡 Exception chain lost in create_index_from_files error handler

At moss_client.py:142, raise ValueError(...) is used without from e, which discards the explicit exception chain. While Python 3 still attaches the original as __context__, the traceback message will say "During handling of the above exception, another exception occurred" instead of the clearer "The above exception was the direct cause of the following exception". This makes debugging harder, especially for file-not-found or parser-internal errors that get wrapped as generic ValueError.

Suggested change

raise ValueError(f"Failed to parse file {file_path}: {str(e)}")

raise ValueError(f"Failed to parse file {file_path}: {str(e)}") from e

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-01T15:01:23Z

+                    metadata = {
+                        "source_file": file_path,
+                        "page_number": page_num + 1,  # 1-indexed for humans
+                        "total_pages": len(pdf.pages),


🚩 Metadata type mismatch between MossDocument and DocumentInfo

All parsers create metadata dicts with integer values (e.g., page_number: int, total_pages: int in packages/moss-doc-parser/src/moss_doc_parser/parsers/pdf.py:35-36). However, DocumentInfo.metadata is typed as Optional[Dict[str, str]] per the .pyi stub at sdks/python/sdk/src/moss/__init__.pyi:158. If the Rust-backed moss_core.DocumentInfo strictly enforces Dict[str, str], passing integer metadata values would cause runtime errors. In practice, existing code (e.g., _dict_to_doc in packages/moss-cli/src/moss_cli/documents.py:153 which passes through arbitrary JSON metadata) suggests the Rust binding accepts non-string values, but this is worth verifying — especially since all five parsers depend on this assumption.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-01T15:01:24Z

+    def test_parse_returns_documents(self, mock_bs, mock_md, mock_file):
+        # Setup mock for markdown conversion
+        mock_md_instance = MagicMock()
+        mock_md_instance.convert.return_value = (
+            "<h1>Title</h1><p>This is a paragraph.</p>"
+        )
+        mock_md.return_value = mock_md_instance
+
+        # Setup mock for BeautifulSoup
+        mock_soup = MagicMock()
+        # We need to mock the find_all method to return a list of elements
+        # For simplicity, we'll return two elements: a header and a paragraph
+        mock_h1 = MagicMock()
+        mock_h1.name = "h1"
+        mock_h1.get_text.return_value = "Title"
+        mock_p = MagicMock()
+        mock_p.name = "p"
+        mock_p.get_text.return_value = "This is a paragraph."
+        mock_soup.find_all.return_value = [mock_h1, mock_p]
+        mock_bs.return_value = mock_soup
+
+        parser = MarkdownParser()
+        result = parser.parse("dummy_path.md")


🚩 Markdown test mock setup has dead code but test still passes

In packages/moss-doc-parser/tests/test_markdown_parser.py:18-21, the test creates mock_md_instance and sets mock_md.return_value = mock_md_instance. However, MarkdownParser() is instantiated at line 37 (which calls markdown.Markdown(...) — i.e., the mock), and self.md captures the auto-generated return value before mock_md.return_value is reassigned. So parser.md is never mock_md_instance, and the convert.return_value setup is dead code. The test still passes because BeautifulSoup is also mocked and returns controlled elements regardless of what self.md.convert() returns. Not a production bug, but it means the test doesn't actually verify the markdown-to-HTML conversion step.

Was this helpful? React with 👍 or 👎 to provide feedback.

yatharthk2 · 2026-05-01T15:29:26Z

Hi, you might wanna wait on this feature. we can come up with service internally for doc-parsing.

PredictiveManish · 2026-05-01T15:46:02Z

Hi, you might wanna wait on this feature. we can come up with service internally for doc-parsing.

Got it, thanks for the heads-up. I’ll pause further work on this feature for now. Should I close this PR?

PredictiveManish added 4 commits April 30, 2026 04:18

Added docs-parser

268fca5

Added docs-parser

2ccc10e

Merge branch 'usemoss:main' into doc-parsing

0a38e90

Removed unnecessary file

7f97a4a

PredictiveManish requested review from CoderOMaster, HarshaNalluru, r4ghu and yatharthk2 as code owners May 1, 2026 13:46

This comment was marked as resolved.

Sign in to view

PredictiveManish marked this pull request as draft May 1, 2026 14:00

PredictiveManish and others added 5 commits May 1, 2026 19:32

Update packages/moss-doc-parser/src/moss_doc_parser/parsers/markdown.py

effdb8e

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

removed unnecessary imports

3d15e07

removed unnecessary imports

078128a

Added tests

c91e1c8

added module initialized and proper instruction added

38ceda6

PredictiveManish marked this pull request as ready for review May 1, 2026 14:54

devin-ai-integration Bot reviewed May 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown#196

Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown#196
PredictiveManish wants to merge 9 commits intousemoss:mainfrom
PredictiveManish:doc-parsing

PredictiveManish commented May 1, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 1, 2026

Uh oh!

devin-ai-integration Bot May 1, 2026

Uh oh!

devin-ai-integration Bot May 1, 2026

Uh oh!

devin-ai-integration Bot May 1, 2026

Uh oh!

devin-ai-integration Bot May 1, 2026

Uh oh!

yatharthk2 commented May 1, 2026

Uh oh!

PredictiveManish commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	raise ValueError(f"Failed to parse file {file_path}: {str(e)}")
	raise ValueError(f"Failed to parse file {file_path}: {str(e)}") from e

Conversation

PredictiveManish commented May 1, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Checklist

Description

Key Components Added:

Type of Change

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

yatharthk2 commented May 1, 2026

Uh oh!

PredictiveManish commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PredictiveManish commented May 1, 2026 •

edited by devin-ai-integration Bot

Loading