AI-powered file organization using CLIP vision, OCR, Schema.org metadata, and entity detection.
Version: 2.0.0 | Python: 3.8 - 3.14 | Files Processed: 265,000+
# Setup
git clone https://github.com/aledlie/schema-org-file-system.git
cd schema-org-file-system
python3 -m venv venv && source venv/bin/activate
pip install -e ".[all]"
brew install tesseract poppler
# Run
organize-files content --source ~/Downloads --dry-run --limit 100
organize-files health # Check dependencies| Command | Description |
|---|---|
organize-files content |
AI-powered organization (CLIP, OCR) |
organize-files name |
Filename pattern organization |
organize-files type |
Extension-based organization |
organize-files health |
Check system dependencies |
organize-files migrate-ids |
Run database migration |
organize-files update-site |
Update dashboard data |
flowchart LR
A[Source Files] --> B[CLI]
B --> C{Organizer}
C --> D[CLIP Vision]
C --> E[OCR]
C --> F[Entity Detection]
D & E & F --> G[Category Assignment]
G --> H[GraphStore]
H --> I[(SQLite)]
H --> J[Schema.org JSON-LD]
I & J --> K[Dashboard]
- Organization - client, vendor, invoice, company names
- Person - resume, contact, signatures (OCR-enhanced)
- Legal/Contract - contracts, agreements, terms
- E-commerce - product listings, shopping carts
- Software UI - app interfaces, dashboards
- Game Assets - 200+ patterns, sprites, textures, audio
- Filepath - directory structure patterns
- Content Analysis - OCR text + CLIP vision
- MIME Type - file extension fallback
├── src/
│ ├── cli.py # CLI entry point
│ ├── generators.py # Schema.org generators
│ ├── api/
│ │ ├── schema_org_api.py # FastAPI JSON-LD REST endpoints
│ │ └── schema_org_models.py # Pydantic models
│ └── storage/
│ ├── graph_store.py # GraphStore + canonical IDs
│ ├── models.py # ORM models with to_schema_org()
│ ├── schema_org_exporter.py # Bulk export (JSON / NDJSON / @graph)
│ ├── schema_org_context.py # JSON-LD @context generation
│ └── schema_org_variants.py # Typed representation variants
├── scripts/ # Organizer scripts
├── tests/
│ ├── unit/ # 102 unit tests
│ ├── integration/ # Export pipeline integration tests
│ ├── performance/ # pytest-benchmark suite
│ └── e2e/ # Playwright + OpenTelemetry
├── _site/ # Web dashboard
└── results/ # Database & reports
~/Documents/
├── Organization/{Company}/ # Vendor/partner files
├── Person/{Name}/ # Person-related files
├── GameAssets/ # Sprites, textures, models
├── Financial/ # Invoices, receipts
├── Technical/ # Code, configs
└── Media/ # Photos, videos, audio
- Entity Detection - Prioritizes Organization and Person identification
- Canonical IDs - UUID v5 + SHA256 for persistent identification
- Schema.org JSON-LD - Full JSON-LD generation with validated spec URLs on every emitted property
- REST API - FastAPI endpoints returning
{"@context":…,"@graph":[…]}for all entity types - Bulk Export - JSON, NDJSON, and
@graphformats viaSchemaOrgExporter - Cost Tracking - ROI calculation with manual time savings
- E2E Testing - Playwright with OpenTelemetry instrumentation
| Layer | Technology |
|---|---|
| AI/ML | PyTorch, CLIP, OpenCV |
| OCR | Tesseract |
| Database | SQLite + SQLAlchemy |
| API | FastAPI |
| Monitoring | Sentry SDK |
| Testing | pytest, pytest-benchmark, Playwright |
- CHANGELOG - Version history
- DEPENDENCIES - Installation guide
- ARCHITECTURE_REFACTOR - Design decisions
- SCHEMA_ORG_ARCHITECTURE - Schema.org type mappings, IRI patterns, JSON-LD context, and implementation reference
Schema.org Integration
SchemaOrgExporter— bulk export in JSON, NDJSON, and@graphformatsschema_org_context.py— standalone JSON-LD@contextdocument withschema:andml:prefixesschema_org_variants.py—CategoryVariants,PersonVariants,FileVariants- All five
to_schema_org()methods annotated with validated# https://schema.org/spec URLs
REST API
- FastAPI app at
src/api/schema_org_api.py - Bulk endpoints return proper
{"@context":…,"@graph":[…]}JSON-LD documents /api/schema-org/export,/api/schema-org/graph,/schema/contextendpoints
Testing
- 102 unit tests, 26 integration tests, performance benchmarks (100 / 1k / 10k entities)
- Per-entity
to_schema_org()benchmarks and relationship-overhead baseline
Features
- Typed subdirectories for screenshot categories
- Enhanced weak image classification with full CLIP + OCR fallback
- Shared utilities module consolidating 576 lines of duplication
See full history: git log --oneline
| Variable | Description |
|---|---|
FILE_SYSTEM_SENTRY_DSN |
Sentry error tracking (Doppler) |
--sentry-dsn |
CLI override |
| Issue | Solution |
|---|---|
| HEIC fails | pip install pillow-heif |
| No OCR | brew install tesseract |
| No AI | pip install torch transformers |
| Check deps | organize-files health |
flowchart TB
subgraph Input
U[User] --> CLI[organize-files CLI]
F[Source Files] --> CLI
end
subgraph Processing["Processing Pipeline"]
CLI --> CO{Organizer Type}
CO -->|content| AI[AI Organizer]
CO -->|name| NM[Name Organizer]
CO -->|type| TY[Type Organizer]
AI --> CLIP[CLIP Vision]
AI --> OCR[Tesseract OCR]
AI --> ED[Entity Detection]
AI --> GAD[Game Asset Detection]
AI --> LCD[Legal/Contract]
AI --> ECD[E-commerce]
AI --> SUI[Software UI]
end
subgraph Storage
CLIP & OCR & ED & GAD & LCD & ECD & SUI --> GS[GraphStore]
GS --> DB[(SQLite)]
GS --> JSON[Schema.org JSON-LD]
end
subgraph Output
DB --> DASH[Web Dashboard]
JSON --> DASH
DB --> RPT[Reports]
end
subgraph Monitoring
AI -.-> SENTRY[Sentry]
AI -.-> COST[Cost Tracker]
end
subgraph External
CLIP -.-> HF[HuggingFace]
ED -.-> NOM[Nominatim]
end
erDiagram
File ||--o{ FileCategory : has
File ||--o{ FileCompany : has
File ||--o{ FilePerson : has
File ||--o{ FileLocation : has
File ||--o{ FileRelationship : source
File ||--o{ FileRelationship : target
File }o--|| OrganizationSession : belongs_to
File {
string id PK "SHA-256"
string canonical_id "UUID v5"
string filename
string original_path
string current_path
enum status
string schema_type
string content_hash
}
Category {
int id PK
string canonical_id
string name
int parent_id FK
string full_path
}
Company {
int id PK
string canonical_id
string name
string normalized_name
string domain
}
Person {
int id PK
string canonical_id
string name
string email
string role
}
Location {
int id PK
string canonical_id
string city
string state
float lat
float lng
}
OrganizationSession {
uuid id PK
datetime started_at
int total_files
float total_cost
}
CostRecord {
int id PK
uuid session_id FK
string file_id FK
string feature_name
float processing_time
float cost
}
graph TB
subgraph CLI
cli[src/cli.py]
end
subgraph API
soa[schema_org_api.py]
som[schema_org_models.py]
end
subgraph Scripts
foc[file_organizer_content_based.py]
icr[image_content_renamer.py]
ica[image_content_analyzer.py]
end
subgraph Core
gen[generators.py]
err[error_tracking.py]
cost[cost_roi_calculator.py]
end
subgraph Storage
gs[graph_store.py]
models[models.py]
exp[schema_org_exporter.py]
ctx[schema_org_context.py]
var[schema_org_variants.py]
end
subgraph External
torch[PyTorch/CLIP]
tess[Tesseract]
sentry[Sentry SDK]
sa[SQLAlchemy]
fa[FastAPI]
end
cli --> foc
foc --> gen & err & cost & gs
icr & ica --> torch
gs --> models --> sa
err --> sentry
foc --> torch & tess
soa --> exp & ctx & models & som
soa --> fa
exp --> models
var --> models