Merged
129 changes: 129 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,129 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

spacypdfreader is a Python library that extracts text from PDF documents and converts them into spaCy `Doc` objects with custom extensions for page tracking. The library supports multiple PDF parsing backends and multiprocessing for performance. Refer to @README.md for more details about the library.

## Development Commands

This project uses `uv` for dependency management and `just` as a command runner.

### Testing

```bash
# Run tests with a specific Python version (default 3.12)
just test 3.12

# Run tests across multiple Python versions
just test-matrix

# Test doctests in the code
just test-docs

# Trigger GitHub Actions workflow
just test-gha
```

### Code Quality

```bash
# Format code (imports and style)
just format

# Run linting
just lint
```

### Documentation

```bash
# Preview docs locally
just preview-docs

# Publish docs to GitHub Pages
just publish-docs
```

### Building and Publishing

```bash
# Build the package
just build

# Publish to test PyPI
just publish-test

# Publish to PyPI
just publish
```

## Architecture

### Core Components

- **`spacypdfreader.spacypdfreader.pdf_reader()`**: Main entry point function that converts a PDF to a spaCy `Doc` object
  - Takes a PDF path and a spaCy `Language` object
  - Returns a `Doc` object with custom extensions
  - Supports multiprocessing via `n_processes` parameter
  - Supports page range extraction via `page_range` parameter

### Parser System

The library uses a pluggable parser architecture in `spacypdfreader/parsers/`:

- **pdfminer** (`parsers/pdfminer.py`): Default parser, fast but lower accuracy
  - Uses `pdfminer.high_level.extract_text()`
  - Zero-indexed internally but converts from 1-indexed API

- **pytesseract** (`parsers/pytesseract.py`): OCR-based parser, slower but higher accuracy
  - Converts PDF pages to images first
  - Requires optional dependencies: `pip install 'spacypdfreader[pytesseract]'`

Each parser implements a `parser(pdf_path: str, page_number: int, **kwargs)` function that returns text for a single page.
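
As a rough illustration of that contract, the sketch below defines a custom parser with the same signature. It is not one of the library's parsers: `formfeed_parser`, the `encoding` keyword, and the form-feed-delimited text file are all assumptions, chosen so the example runs with only the standard library. Treating `page_number` as 1-indexed follows the public API described in this file.

```python
from pathlib import Path


def formfeed_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    """Hypothetical parser: return the text of one 1-indexed page.

    Assumes the input is a plain-text file whose pages are separated by
    form-feed characters, standing in for a real PDF backend.
    """
    encoding = kwargs.get("encoding", "utf-8")  # illustrative kwarg only
    pages = Path(pdf_path).read_text(encoding=encoding).split("\f")
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page_number must be between 1 and {len(pages)}")
    # Convert the 1-indexed page number to a 0-indexed list position.
    return pages[page_number - 1].strip()
```

A function shaped like this could presumably be passed wherever `pytesseract.parser` is passed in the changelog examples, though that is untested here.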

### spaCy Custom Extensions

The library registers several custom attributes on spaCy tokens and docs:

- `token._.page_number`: Page number for each token (1-indexed)
- `doc._.pdf_file_name`: Original PDF file path
- `doc._.first_page`: First page number in the doc
- `doc._.last_page`: Last page number in the doc
- `doc._.page_range`: Tuple of (first_page, last_page)
- `doc._.page(int)`: Method to extract text from a specific page

These extensions are registered in `spacypdfreader/spacypdfreader.py` at module import time.

### Processing Flow

1. PDF path and spaCy Language object provided to `pdf_reader()`
2. PDF page count determined using pdfminer's `PDFParser`
3. Pages extracted in parallel (if `n_processes` specified) or sequentially
4. Each page text converted to a spaCy `Doc` via `nlp.pipe()`
5. Page numbers assigned to all tokens
6. Individual page `Doc` objects combined using `Doc.from_docs()`
7. Custom extensions set on the combined doc
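
The flow above can be approximated with the standard library alone. In this sketch, `read_pages` and its arguments are hypothetical names, `str.split` stands in for spaCy tokenization, and a list of `(token, page_number)` pairs stands in for the combined `Doc`. It is a simplified model of the steps, not the library's implementation.

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, List, Optional, Tuple


def read_pages(
    pdf_path: str,
    parser: Callable[..., str],
    page_range: Tuple[int, int],
    n_processes: Optional[int] = None,
) -> List[Tuple[str, int]]:
    """Return (token, page_number) pairs for the 1-indexed, inclusive page_range."""
    first_page, last_page = page_range
    page_numbers = list(range(first_page, last_page + 1))

    def extract(page_number: int) -> str:
        return parser(pdf_path, page_number)

    # Step 3: extract page text in parallel if requested, else sequentially.
    if n_processes:
        with ThreadPool(n_processes) as pool:
            texts = pool.map(extract, page_numbers)  # map preserves page order
    else:
        texts = [extract(number) for number in page_numbers]

    # Steps 4-6: tokenize each page (str.split is a stand-in for nlp.pipe),
    # tag every token with its page number, and combine in page order.
    tokens: List[Tuple[str, int]] = []
    for page_number, text in zip(page_numbers, texts):
        tokens.extend((token, page_number) for token in text.split())
    return tokens
```

Because `pool.map` returns results in input order, the parallel and sequential paths produce the same token sequence.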

## Important Notes

- This library breaks spaCy convention: it does NOT use `nlp.add_pipe()` because text extraction must happen before spaCy processing
- Page numbers use 1-based indexing in the public API (but pdfminer uses 0-based internally)
- When using pdfminer parser, do NOT pass `page_numbers` kwarg - use `page_range` instead
- Multiprocessing uses `ThreadPool` not `ProcessPool` (see imports in spacypdfreader.py:4)
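
To make the indexing note concrete, here is a minimal sketch of converting a 1-based, inclusive `page_range` into the 0-based page indices that pdfminer expects. The helper name `to_zero_based_pages` is hypothetical, and whether the library performs the conversion exactly this way is an assumption.

```python
from typing import List, Tuple


def to_zero_based_pages(page_range: Tuple[int, int]) -> List[int]:
    """Convert an inclusive 1-based (first, last) range to 0-based page indices."""
    first_page, last_page = page_range
    if first_page < 1 or last_page < first_page:
        raise ValueError("page_range must satisfy 1 <= first_page <= last_page")
    # Page 1 in the public API is index 0 internally; the range end stays
    # at last_page because range() excludes its upper bound.
    return list(range(first_page - 1, last_page))
```

For example, `page_range=(2, 3)` from the changelog example maps to indices `[1, 2]`.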

## Testing Notes

- Test files are in `tests/data/` directory
- Tests use spaCy model `en_core_web_sm` which is installed via uv from a wheel URL
- The project supports Python 3.9 through 3.13 (Python 3.14+ not supported)

## Python Version Support

This project supports Python 3.9 through 3.13. **Python 3.14 is not supported** due to a dependency constraint:

- spaCy (the core dependency) requires `Python <3.14, >=3.9`
- spaCy uses Pydantic v1 internally, which is incompatible with Python 3.14
- This is a known upstream issue tracked in spaCy issue #13885
- Support for Python 3.14 will be added once spaCy releases a compatible version
61 changes: 36 additions & 25 deletions docs/changelog.md
@@ -1,5 +1,16 @@
# Changelog

## 0.4.0 (2025-10-30)

**Changes**

- Support for Python 3.9 to Python 3.13
- Use `uv_build` back end instead of hatchling.

**Fixes**

None

## 0.3.2 (2024-10-04)

**Changes**
@@ -19,14 +30,14 @@ None

- Support for `page_range` argument ([#16](https://github.com/SamEdwardes/spacypdfreader/issues/16), [#18](https://github.com/SamEdwardes/spacypdfreader/issues/18)).

```python
import spacy
from spacypdfreader import pdf_reader
from spacypdfreader.parsers import pytesseract
```python
import spacy
from spacypdfreader import pdf_reader
from spacypdfreader.parsers import pytesseract

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4, page_range=(2, 3))
```
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4, page_range=(2, 3))
```

**Fixes**

@@ -38,19 +49,19 @@ None

- Added support for multi-processing. For example:

```python
import spacy
```python
import spacy

from spacypdfreader.parsers import pytesseract
from spacypdfreader.spacypdfreader import pdf_reader
from spacypdfreader.parsers import pytesseract
from spacypdfreader.spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4)
print(doc._.first_page)
print(doc._.last_page)
print(doc[12].text)
print(doc[12]._.page_number)
```
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4)
print(doc._.first_page)
print(doc._.last_page)
print(doc[12].text)
print(doc[12]._.page_number)
```

- Changed the way in which parsers are implemented. They are now implemented with a function as opposed to a class. See <https://github.com/SamEdwardes/spacypdfreader/tree/feature/multi-processing/spacypdfreader/parsers> for examples.

@@ -66,15 +77,15 @@ None
## 0.2.0 (2021-12-10)

- Added support for additional pdf to text extraction engines:
- [pytesseract](https://pypi.org/project/pytesseract/)
- [textract](https://textract.readthedocs.io/en/stable/index.html)
- [pytesseract](https://pypi.org/project/pytesseract/)
- [textract](https://textract.readthedocs.io/en/stable/index.html)
- Added the ability to bring your own pdf to text extraction engine.
- Added new spacy extension attributes and methods:
- `doc._.page_range`
- `doc._.first_page`
- `doc._.last_page`
- `doc._.pdf_file_name`
- `doc._.page(int)`
- `doc._.page_range`
- `doc._.first_page`
- `doc._.last_page`
- `doc._.pdf_file_name`
- `doc._.page(int)`
- Built a new documentation site: [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)

## 0.1.1 (2021-12-10)
8 changes: 2 additions & 6 deletions justfile
@@ -33,18 +33,14 @@ lint:

[group('tests')]
test version="3.12":
-    uv run --python {{version}} --all-extras pytest
+    UV_PROJECT_ENVIRONMENT="./.venv-{{version}}" uv run --python {{version}} --all-extras pytest

[group('tests')]
test-matrix:
just test 3.9
just test 3.10
just test 3.11
just test 3.12

-[group('tests')]
-test-pre-release-python:
-    # As of 2024-10-04 3.13 is failing
just test 3.13

[group('tests')]
@@ -63,4 +59,4 @@ publish-docs:

[group('docs')]
test-docs:
-    uv run --python 3.12 --all-extras pytest --doctest-modules spacypdfreader/
+    uv run --all-extras pytest --doctest-modules src/spacypdfreader/
56 changes: 30 additions & 26 deletions pyproject.toml
@@ -1,18 +1,25 @@
[project]
name = "spacypdfreader"
-version = "0.3.2"
+version = "0.4.0"
description = "A PDF to text extraction pipeline component for spaCy."
license = "MIT"
readme = "README.md"
-maintainers = [
-    {name = "Sam Edwardes", email = "edwardes.s@gmail.com"}
-]
+maintainers = [{ name = "Sam Edwardes", email = "edwardes.s@gmail.com" }]
keywords = ["python", "spacy", "nlp", "pdf", "pdfs"]
-requires-python = ">=3.9"
-dependencies = [
-    "pdfminer-six>=20240706",
-    "rich>=13.9.2",
-    "spacy>=3.8.2",
+requires-python = ">=3.9, <3.14"
+dependencies = ["pdfminer-six>=20240706", "rich>=13.9.2", "spacy>=3.8.7"]
+classifiers = [
+    "Development Status :: 5 - Production/Stable",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Software Development :: Libraries :: Python Modules",
+    "Topic :: Text Processing",
]

[project.urls]
@@ -23,26 +30,23 @@ Issues = "https://github.com/SamEdwardes/spaCyPDFreader/issues"
Changelog = "https://github.com/SamEdwardes/spacypdfreader/blob/main/docs/changelog.md"

[project.optional-dependencies]
-pytesseract = [
-    "pdf2image>=1.17.0",
-    "pillow>=10.4.0",
-    "pytesseract>=0.3.13",
-]
+pytesseract = ["pdf2image>=1.17.0", "pillow>=10.4.0", "pytesseract>=0.3.13"]

[build-system]
-requires = ["hatchling"]
-build-backend = "hatchling.build"
+requires = ["uv_build>=0.9.5,<0.10.0"]
+build-backend = "uv_build"

-[tool.uv]
-dev-dependencies = [
-    "mkdocs>=1.6.1",
-    "mkdocs-include-markdown-plugin>=6.2.2",
-    "mkdocs-material>=9.5.39",
-    "pytest>=8.3.3",
-    "en-core-web-sm",
-    "mkdocstrings>=0.26.1",
-    "mkdocstrings-python>=1.11.1",
-]

[tool.uv.sources]
en-core-web-sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl" }

+[dependency-groups]
+dev = [
+    "mkdocs>=1.6.1",
+    "mkdocs-include-markdown-plugin>=6.2.2",
+    "mkdocs-material>=9.5.39",
+    "pytest>=8.3.3",
+    "en-core-web-sm",
+    "mkdocstrings>=0.26.1",
+    "mkdocstrings-python>=1.11.1",
+]
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion tests/test_spacypdfreader.py
@@ -3,7 +3,7 @@


def test_version():
-    assert __version__ == "0.3.2"
+    assert __version__ == "0.4.0"


def test_get_number_of_pages():