Skip to content

Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown#196

Open
PredictiveManish wants to merge 9 commits intousemoss:mainfrom
PredictiveManish:doc-parsing
Open

Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown#196
PredictiveManish wants to merge 9 commits intousemoss:mainfrom
PredictiveManish:doc-parsing

Conversation

@PredictiveManish
Copy link
Copy Markdown
Contributor

@PredictiveManish PredictiveManish commented May 1, 2026

Pull Request Checklist

Please ensure that your PR meets the following requirements:

  • I have read the CONTRIBUTING guide.
  • I have updated the documentation (if applicable).
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.

Description

Add doc-parsing connectors to enable native ingestion of various document formats (PDF, DOCX, PPTX, HTML, Markdown) into Moss indexes without requiring users to pre-process or convert documents themselves.

This feature adds a preprocessing layer that converts arbitrary files into the DocumentInfo format expected by Moss, eliminating the need for users to manually extract text from documents before ingestion.

Key Components Added:

  1. New moss-doc-parser package (packages/moss-doc-parser/) - A standalone Python package providing:
  • Abstract BaseParser class defining the parser interface

  • Concrete implementations for:

    • PDFParser (using pypdf)
    • DocxParser (using python-docx)
    • PptxParser (using python-pptx)
    • HtmlParser (using BeautifulSoup)
    • MarkdownParser (with structured header/paragraph/code/list extraction)
  • FileTypeDetector for automatic file type detection using extension-based and MIME type detection

  • Document chunking capabilities (planned for future enhancement)

  1. CLI Integration - Enhanced packages/moss-cli/src/moss_cli/documents.py:
  • Automatic detection and parsing of supported document files
  • Seamless integration with existing load_documents() function
  • Maintains backward compatibility with JSON/CSV/stdin input
  • Updated help text to reflect new capabilities
  1. SDK Enhancement - Added create_index_from_files() method to MossClient in sdks/python/sdk/src/moss/client/moss_client.py:
  • Accepts a list of file paths of various supported formats
  • Automatically parses files into DocumentInfo objects
  • Creates a new index with the parsed content
  • Lazy imports moss-doc-parser to avoid hard dependency
  • Includes proper error handling and informative error messages

Fixes #98

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Open in Devin Review

devin-ai-integration[bot]

This comment was marked as resolved.

@PredictiveManish PredictiveManish marked this pull request as draft May 1, 2026 14:00
@PredictiveManish PredictiveManish marked this pull request as ready for review May 1, 2026 14:54
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 5 new potential issues.

View 3 additional findings in Devin Review.

Open in Devin Review

Comment on lines +94 to +151
async def create_index_from_files(
self,
name: str,
file_paths: List[str],
model_id: Optional[str] = None,
) -> MutationResult:
"""Create a new index by parsing files and adding their contents.

Supports PDF, DOCX, PPTX, HTML, and Markdown files.
Files are parsed into DocumentInfo objects before index creation.

Args:
name: Name of the index to create
file_paths: List of file paths to parse and add to the index
model_id: Optional model ID to use for embeddings

Returns:
MutationResult containing job information
"""
# Import moss-doc-parser here to avoid hard dependency
try:
from moss_doc_parser import FileTypeDetector
except ImportError as e:
raise ImportError(
"moss-doc-parser is required for create_index_from_files. "
"Install it with: pip install moss-doc-parser"
) from e

# Parse all files
all_docs: List[DocumentInfo] = []
detector = FileTypeDetector()

for file_path in file_paths:
try:
# Get appropriate parser for file type
parser = detector.get_parser_for_file(file_path)
# Parse the file
parse_result = parser.parse(file_path)
# Convert MossDocument objects to DocumentInfo objects
for doc in parse_result.documents:
all_docs.append(
DocumentInfo(
id=doc.id,
text=doc.text,
metadata=doc.metadata,
)
)
except Exception as e:
raise ValueError(f"Failed to parse file {file_path}: {str(e)}")

# Create index with parsed documents
resolved_model_id = self._resolve_model_id(all_docs, model_id)
return await asyncio.to_thread(
self._manage.create_index,
name,
all_docs,
resolved_model_id,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 New public API method create_index_from_files missing from .pyi type stub

The new create_index_from_files method is added to MossClient in moss_client.py:94-151 but is not declared in the type stub at sdks/python/sdk/src/moss/__init__.pyi. Every other public method on MossClient (e.g., create_index, add_docs, delete_docs, load_index, query) has a corresponding declaration in the .pyi file (see sdks/python/sdk/src/moss/__init__.pyi:13-61). This means type checkers, IDE autocompletion, and documentation generators won't see the new method. This also violates the CONTRIBUTING.md rule: "If you've changed APIs, update the documentation."

Prompt for agents
The new create_index_from_files method in MossClient (sdks/python/sdk/src/moss/client/moss_client.py:94-151) needs a corresponding type stub declaration in sdks/python/sdk/src/moss/__init__.pyi. Add it after the existing create_index stub (line 18), following the same pattern as other methods. The stub should be:

async def create_index_from_files(
    self,
    name: str,
    file_paths: List[str],
    model_id: Optional[str] = ...,
) -> MutationResult: ...

This ensures type checkers, IDE autocompletion, and documentation tools recognize the new public API method.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +12 to +66
class PPTXParser(BaseParser):
"""Parser for PPTX files."""

def parse(self, file_path: str) -> ParseResult:
"""Parse a PPTX file and extract text from each slide.

Args:
file_path: Path to the PPTX file.

Returns:
ParseResult containing one document per slide (or chunked if needed).
"""
start_time = time.time()

documents = []
prs = Presentation(file_path)

for slide_num, slide in enumerate(prs.slides):
text_runs = []
for shape in slide.shapes:
if not shape.has_text_frame:
continue
for paragraph in shape.text_frame.paragraphs:
for run in paragraph.runs:
text_runs.append(run.text)
text = "\n".join(text_runs)
if text.strip(): # Only add non-empty slides
doc_id = f"{file_path}_slide_{slide_num}"
metadata = {
"source_file": file_path,
"slide_number": slide_num + 1, # 1-indexed for humans
"total_slides": len(prs.slides),
}
documents.append(
MossDocument(
id=doc_id,
text=text.strip(),
metadata=metadata,
)
)

parse_time_ms = (time.time() - start_time) * 1000
return ParseResult(
documents=documents,
source_path=file_path,
parse_time_ms=parse_time_ms,
)

def supported_extensions(self) -> List[str]:
"""Return a list of file extensions this parser supports.

Returns:
List of file extensions (without the dot).
"""
return ["pptx"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Missing test for PPTXParser violates CONTRIBUTING.md

CONTRIBUTING.md mandates: "If you've added code that should be tested, add tests." All four other parsers (PDF, DOCX, HTML, Markdown) have corresponding test files in packages/moss-doc-parser/tests/, but the new PPTXParser at packages/moss-doc-parser/src/moss_doc_parser/parsers/pptx.py has no test file. There is no test_pptx_parser.py.

Prompt for agents
Add a test file packages/moss-doc-parser/tests/test_pptx_parser.py following the same pattern as the existing test files (test_pdf_parser.py, test_docx_parser.py, etc.). The test should mock pptx.Presentation, create mock slides with shapes containing text frames and runs, and verify that PPTXParser.parse() returns a ParseResult with the expected documents, metadata (source_file, slide_number, total_slides), and text content.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

)
)
except Exception as e:
raise ValueError(f"Failed to parse file {file_path}: {str(e)}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Exception chain lost in create_index_from_files error handler

At moss_client.py:142, raise ValueError(...) is used without from e, which discards the explicit exception chain. While Python 3 still attaches the original as __context__, the traceback message will say "During handling of the above exception, another exception occurred" instead of the clearer "The above exception was the direct cause of the following exception". This makes debugging harder, especially for file-not-found or parser-internal errors that get wrapped as generic ValueError.

Suggested change
raise ValueError(f"Failed to parse file {file_path}: {str(e)}")
raise ValueError(f"Failed to parse file {file_path}: {str(e)}") from e
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +33 to +36
metadata = {
"source_file": file_path,
"page_number": page_num + 1, # 1-indexed for humans
"total_pages": len(pdf.pages),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Metadata type mismatch between MossDocument and DocumentInfo

All parsers create metadata dicts with integer values (e.g., page_number: int, total_pages: int in packages/moss-doc-parser/src/moss_doc_parser/parsers/pdf.py:35-36). However, DocumentInfo.metadata is typed as Optional[Dict[str, str]] per the .pyi stub at sdks/python/sdk/src/moss/__init__.pyi:158. If the Rust-backed moss_core.DocumentInfo strictly enforces Dict[str, str], passing integer metadata values would cause runtime errors. In practice, existing code (e.g., _dict_to_doc in packages/moss-cli/src/moss_cli/documents.py:153 which passes through arbitrary JSON metadata) suggests the Rust binding accepts non-string values, but this is worth verifying — especially since all five parsers depend on this assumption.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +16 to +38
def test_parse_returns_documents(self, mock_bs, mock_md, mock_file):
# Setup mock for markdown conversion
mock_md_instance = MagicMock()
mock_md_instance.convert.return_value = (
"<h1>Title</h1><p>This is a paragraph.</p>"
)
mock_md.return_value = mock_md_instance

# Setup mock for BeautifulSoup
mock_soup = MagicMock()
# We need to mock the find_all method to return a list of elements
# For simplicity, we'll return two elements: a header and a paragraph
mock_h1 = MagicMock()
mock_h1.name = "h1"
mock_h1.get_text.return_value = "Title"
mock_p = MagicMock()
mock_p.name = "p"
mock_p.get_text.return_value = "This is a paragraph."
mock_soup.find_all.return_value = [mock_h1, mock_p]
mock_bs.return_value = mock_soup

parser = MarkdownParser()
result = parser.parse("dummy_path.md")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Markdown test mock setup has dead code but test still passes

In packages/moss-doc-parser/tests/test_markdown_parser.py:18-21, the test creates mock_md_instance and sets mock_md.return_value = mock_md_instance. However, MarkdownParser() is instantiated at line 37 (which calls markdown.Markdown(...) — i.e., the mock), and self.md captures the auto-generated return value before mock_md.return_value is reassigned. So parser.md is never mock_md_instance, and the convert.return_value setup is dead code. The test still passes because BeautifulSoup is also mocked and returns controlled elements regardless of what self.md.convert() returns. Not a production bug, but it means the test doesn't actually verify the markdown-to-HTML conversion step.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@yatharthk2
Copy link
Copy Markdown
Collaborator

Hi, you might wanna wait on this feature. we can come up with service internally for doc-parsing.

@PredictiveManish
Copy link
Copy Markdown
Contributor Author

Hi, you might wanna wait on this feature. we can come up with service internally for doc-parsing.

Got it, thanks for the heads-up. I’ll pause further work on this feature for now. Should I close this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add Doc-parsing connectors

2 participants