diff --git a/.env.example b/.env.example index 7e8e872b..bf67ca9d 100644 --- a/.env.example +++ b/.env.example @@ -43,7 +43,7 @@ BLUESKY_CREDENTIALS_ENCRYPTION_KEY= LINKEDIN_CREDENTIALS_ENCRYPTION_KEY= LINKEDIN_CLIENT_ID= LINKEDIN_CLIENT_SECRET= -LINKEDIN_OAUTH_SCOPES=openid profile email offline_access +LINKEDIN_OAUTH_SCOPES="openid profile email offline_access" # Outbound mail provider. Use Resend or Amazon SES. EMAIL_BACKEND=anymail.backends.resend.EmailBackend diff --git a/.gitignore b/.gitignore index 19f12e61..c27f7586 100644 --- a/.gitignore +++ b/.gitignore @@ -17,7 +17,7 @@ frontend/.next/ frontend/coverage/ frontend/node_modules/ -docs/ +docs/_internal_only/ *storybook.log storybook-static diff --git a/README.md b/README.md index dbbf97d7..c4a167ca 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,48 @@ An AI-powered content curation platform for technically-oriented newsletters. Ne The system is organized into projects: each newsletter project has its own tracked entities, relevance model, and content pipeline. Projects are assigned to Django groups so editorial access can be shared cleanly. Designed for non-technical editors who don't know what a vector database is and don't need to. +## Local Development + +Use the repo's `just` commands to get a local stack running: + +Linux: + +```bash +python3 -m venv .venv +source .venv/bin/activate +just install +just build +just dev +just seed +xdg-open http://localhost:8080/ +``` + +macOS: + +```bash +python3 -m venv .venv +source .venv/bin/activate +just install +just build +just dev +just seed +open http://localhost:8080/ +``` + +Windows PowerShell: + +```powershell +py -m venv .venv +.\.venv\Scripts\Activate.ps1 +just install +just build +just dev +just seed +Start-Process http://localhost:8080/ +``` + +`just build` prepares the backend image and frontend bundle, `just dev` starts the Docker Compose stack, and `just seed` loads demo data into the running app. 
For the full workflow and troubleshooting notes, see [docs/developer-guide/local-development.md](docs/developer-guide/local-development.md). + ## What This Does That Existing Tools Don't Tools like Feedly, UpContent, and ContentStudio handle parts of the content curation problem. Newsletter Maker combines several capabilities none of them offer: @@ -58,14 +100,14 @@ The roadmap progresses from contextual actions (MVP) to multi-step skill chainin Each data source implements a common interface (`fetch_new_content`, `get_entity_profile`, `health_check`) and handles its own auth and rate limiting. The core system just calls the interface. Planned integrations: -| Source | Purpose | Priority | -| ------ | ------- | -------- | -| RSS | Blog/site tracking for followed entities | Phase 1 | -| Reddit | Trend detection and community sentiment | Phase 1 | -| Resend Inbound | Newsletter email ingestion and authority signals | Phase 2 | -| Bluesky | Entity content tracking (open AT Protocol) | Phase 2 | -| Mastodon | Entity content tracking (ActivityPub) | Phase 3 | -| LinkedIn | Entity enrichment and article discovery | Phase 4 | +| Source | Purpose | +| ------ | ------- | +| RSS | Blog/site tracking for followed entities | +| Reddit | Trend detection and community sentiment | +| Resend Inbound | Newsletter email ingestion and authority signals | +| Bluesky | Entity content tracking (open AT Protocol) | +| Mastodon | Entity content tracking (ActivityPub) | +| LinkedIn | Entity enrichment and article discovery | ### Production-Grade Error Handling @@ -83,102 +125,14 @@ The system is designed for graceful failure, not silent corruption. Unparseable ## Project Documentation -- [Developer Guide](docs/DEVELOPER_GUIDE.md) gives a fast "where to look first" map for new contributors. -- [Deployment Guide](docs/DEPLOYMENT.md) covers Docker Compose, Helm, Minikube, and deployment-aware CI. 
-- [Implementation Overview](docs/IMPLEMENTATION_OVERVIEW.md) summarizes the main features and current architecture. -- [Data Models](docs/MODELS.md) describes the purpose of each core model. -- [Relevance Scoring](docs/RELEVANCE_SCORING.md) explains how similarity scoring and review thresholds work. -- [Logging](docs/LOGGING.md) explains where application logs go in local and containerized environments. - -## Local Development - -```bash -python3 -m venv .venv -source .venv/bin/activate -just install -``` - -`just install` installs the backend and frontend dependencies and registers the repository's `pre-commit` hooks, so `git commit` runs the configured lint and test hooks locally. - -There are two intentionally separate workflows: +Newsletter Maker documentation is organized by audience inside the `docs/` folder: -- `just lint` and `just test` run on the host without Docker. The backend half of those commands uses `.env.test`. -- Runtime, data, and Django management commands run against the Docker Compose stack. - -1. Run `just dev` to start Django, Celery, Postgres, Redis, Qdrant, and Nginx. On the first run Docker builds the app image automatically. After that, `just dev` reuses the existing image so normal restarts are fast. If `.env` is missing, the `just` command copies `.env.example` automatically. -2. Run `just build` after changing `requirements.txt` or `docker/web/Dockerfile`. It does not copy or depend on local env files. -3. For a fully fresh local stack after schema changes, run `just reset-volumes` before starting the containers again. This drops the Docker-backed Postgres, Redis, and Qdrant state so regenerated migrations apply cleanly. -4. Run Django management commands against the running backend container. `just migrate`, `just shell`, `just embed-all`, `just embed-project `, `just embed-smoke`, `just embed-smoke-content `, and `just bootstrap-live-sources ` all use `docker compose exec django ...`. -5. 
`.env.example` is Compose-oriented and uses Docker service hostnames for the backend runtime. Update `.env` with non-default secrets before using the stack outside local development. -6. Open `http://localhost:8080/healthz/` for a liveness check and `http://localhost:8080/admin/` for Django admin. Use `just seed` after the stack is up if you want the demo project and sample content. - -### Testing - -Run the test suite with: - -```bash -just test -``` - -Pytest auto-loads `.env.test` during test startup. That file is intentionally checked in and only contains non-sensitive placeholder values used by tests, such as fake API keys, fake Reddit credentials, and localhost service URLs. - -`.env.test` also pins Django tests to an explicit SQLite configuration so backend tests stay independent from the Compose-backed Postgres development database. - -`backend-lint` also runs Django-aware host-side checks (`mypy` with the Django plugin and `manage.py check`) under `.env.test`, so `just lint` stays independent from Docker. - -Use `.env.test` for stable dummy values that make tests deterministic. Do not put real secrets in it. Real local or production secrets belong in `.env`, which remains ignored. - -### Embedding Backends - -The embedding layer is provider-based. Configure it with `EMBEDDING_PROVIDER` and `EMBEDDING_MODEL`: - -- `sentence-transformers`: loads a Hugging Face / SentenceTransformers model inside the Django process -- `ollama`: calls a local Ollama server for embeddings -- `openrouter`: calls OpenRouter's embeddings API using the configured model id - -Common examples: - -```dotenv -EMBEDDING_PROVIDER=sentence-transformers -EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 -``` - -```dotenv -EMBEDDING_PROVIDER=ollama -EMBEDDING_MODEL=nomic-embed-text -OLLAMA_URL=http://localhost:11434 -``` - -```dotenv -EMBEDDING_PROVIDER=openrouter -EMBEDDING_MODEL=openai/text-embedding-3-small -OPENROUTER_API_KEY=... 
-OPENROUTER_API_BASE=https://openrouter.ai/api/v1 -``` - -For SentenceTransformers models that require custom remote code, set `EMBEDDING_TRUST_REMOTE_CODE=true`. - -### Embedding Commands - -Use these commands to backfill or refresh embeddings for existing content: - -```bash -just embed-all -just embed-project 1 -docker compose exec django python manage.py sync_embeddings --content-id 42 -docker compose exec django python manage.py sync_embeddings --references-only -``` - -When `just dev` is running, Django admin and the developer-facing `just` wrappers all operate against the Compose-backed Postgres database. - -Create or update an admin user for the running Docker stack with: - -```bash -just createsuperuser -just changepassword your-username -``` +- [User Guide](docs/user-guide/getting-started-saas.md) covers managing projects, intaking content, and curating drafts. +- [Admin Guide](docs/admin-guide/overview.md) covers installation, configuration, user management, and operational health. +- [Developer Guide](docs/developer-guide/overview.md) covers local workflows, backend/frontend conventions, and testing logic. +- [Reference](docs/reference/data-model.md) details the backend API, algorithms, pipeline definitions, and tunables. -For the default local bootstrap, `.env` also seeds an `admin` superuser in the container database using `DJANGO_SUPERUSER_USERNAME`, `DJANGO_SUPERUSER_EMAIL`, and `DJANGO_SUPERUSER_PASSWORD`. +Start at the [Documentation Root](docs/README.md) to navigate to the specific section you need. ## License diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..73218c9e --- /dev/null +++ b/docs/README.md @@ -0,0 +1,14 @@ +# Newsletter Maker Documentation + +Newsletter Maker is an AI-powered platform for ingesting, scoring, and writing domain-specific newsletters. It uses LangGraph to orchestrate Claude Skills against incoming RSS, Reddit, and forwarded email content to synthesize high-quality reading lists. 
+ +These documents are organized by audience. + +* **I am an Editor or Curator using the product day-to-day**: Head to the [User Guide](user-guide/getting-started-saas.md) to learn how to ingest content, manage authority, and synthesize drafts. +* **I am an Administrator installing or managing the platform**: Head to the [Admin Guide](admin-guide/overview.md) to understand Docker deployments, API keys, and queue troubleshooting. +* **I am a Developer contributing code to this repository**: Head to the [Developer Guide](developer-guide/overview.md) to understand local workflows, architecture, and coding conventions. +* **I need to understand the underlying Math and Logic**: Head to the [Reference Section](reference/data-model.md) to see how LangGraph, LangChain, Celery, Qdrant, and the Cosine similarity algorithms are wired together. + +## Terminology Note +In this repository, a distinct newsletter workspace is called a **Project** (not a Tenant, not a Workspace). An article or extracted text is called **Content**. +See the full [Glossary](reference/glossary.md) for clarification on Entities, Skills, and Velocity. diff --git a/docs/admin-guide/backups-and-retention.md b/docs/admin-guide/backups-and-retention.md new file mode 100644 index 00000000..ad345f3d --- /dev/null +++ b/docs/admin-guide/backups-and-retention.md @@ -0,0 +1,21 @@ +# Backups and Retention + +## Postgres Backup +Back up Postgres using standard `pg_dump`: +```bash +docker compose exec postgres pg_dump -U newsletter newsletter_maker > backup.sql +``` + +## Qdrant Snapshot +Qdrant manages internal snapshots. See Qdrant Snapshot API documentation for exporting raw vector archives. Otherwise, vector data can be entirely reconstructed from Postgres text if necessary (though it costs API tokens to recalculate). 
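For operators who want to script the snapshot step, the sketch below shows one way to call Qdrant's REST snapshot endpoint (`POST /collections/{collection_name}/snapshots`) from Python's standard library. It is an illustration only: the collection name `content_embeddings` and the base URL are assumptions, not values taken from this repository.

```python
import json
import urllib.request


def snapshot_url(base_url: str, collection: str) -> str:
    """Build the snapshot endpoint URL for one Qdrant collection."""
    return f"{base_url.rstrip('/')}/collections/{collection}/snapshots"


def create_snapshot(base_url: str, collection: str) -> dict:
    """Ask Qdrant to create a snapshot and return the decoded JSON reply."""
    request = urllib.request.Request(snapshot_url(base_url, collection), method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # "content_embeddings" is a hypothetical collection name; substitute your own.
    print(snapshot_url("http://localhost:6333", "content_embeddings"))
```

Only the URL is printed here, so nothing is sent over the network. Calling `create_snapshot` requires a reachable Qdrant instance.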
+ +## Observability Retention Windows +To prevent unbounded database growth, old logs and task runs are deleted according to: +- `OBSERVABILITY_SNAPSHOT_RETENTION_DAYS` (default 90) +- `OBSERVABILITY_TREND_TASK_RUN_RETENTION_DAYS` (default 30) + +## Restore Drill +To restore the platform: +1. `docker compose down -v` +2. Restore the Postgres DB volume. +3. Bring the system up. (If Qdrant is empty, trigger an embedding backfill from Postgres text.) diff --git a/docs/admin-guide/configuration.md b/docs/admin-guide/configuration.md new file mode 100644 index 00000000..b2b6fd7b --- /dev/null +++ b/docs/admin-guide/configuration.md @@ -0,0 +1,31 @@ +# Configuration + +See the [Tunables Reference](../reference/tunables.md) for the exact list of algorithms and thresholds. + +## Required vs Optional Variables +**Required**: +* `DATABASE_URL`, `REDIS_URL`, `QDRANT_URL`, `SECRET_KEY`, `NEWSLETTER_API_BASE_URL`. +**Optional but critical for AI**: +* `OPENROUTER_API_KEY` (Required for relevance tie-breaking and categorization). + +## Secrets Handling +* In Docker Compose: Loaded from the `.env` file that Compose passes to the container. +* In Kubernetes: Expected to be mapped into the Pod `env` spec via Secrets. + +## Internal vs Public URLs +Due to container networking: +* `NEWSLETTER_API_BASE_URL` (Internal) will reference inner hostnames like `http://nginx`. +* `NEWSLETTER_PUBLIC_URL` (Public) should point to your real FQDN (e.g. `https://news.mydomain.com`) used in emails. + +## Email Provider (Anymail) +Newsletter intake relies on Resend webhooks and Django Anymail forwarding. +Configured via: +* `RESEND_API_KEY` +* `RESEND_INBOUND_SECRET` +* `DEFAULT_FROM_EMAIL` + +## LLM Provider Routing +Select between `local`, `ollama`, or remote providers using `EMBEDDING_PROVIDER`. Set URLs correctly to point to either the internal container (`http://ollama:11434`) or external APIs (`https://api.openai.com/v1`). 
+ +## OAuth Provider Toggles +If `LINKEDIN_CLIENT_ID` or `REDDIT_CLIENT_ID` are present, their respective capabilities light up dynamically in the application. diff --git a/docs/admin-guide/installation.md b/docs/admin-guide/installation.md new file mode 100644 index 00000000..81533d63 --- /dev/null +++ b/docs/admin-guide/installation.md @@ -0,0 +1,31 @@ +# Installation + +## Minimum Requirements +* 4 CPU Cores (8 recommended if running Ollama locally). +* 8 GB RAM (16GB recommended if running Ollama locally). +* Postgres 14+, Redis 7+. + +## Docker Compose Path +The easiest way to stand up a single VPS is Docker Compose: +1. Clone the repository. +2. Copy `.env.example` to `.env` and fill in secrets. +3. Run `docker compose build`. +4. Run `docker compose up -d`. +5. Run migrations: `docker compose exec django python manage.py migrate`. + +## Helm + ArgoCD Path +For Kubernetes usage, an ArgoCD App configuration lives in `deploy/argocd` pointing to the Helm chart in `deploy/helm`. Configure your values file with the required secrets (or rely on ExternalSecrets). + +## First-Run Checklist +1. Ensure containers are healthy (`docker compose ps`). +2. Run database migrations. +3. Create the superuser (see below). +4. Run `docker compose exec django python manage.py bootstrap_live_sources` to seed default RSS/Reddit connections. + +## Creating the First Superuser +```bash +docker compose exec django python manage.py createsuperuser +``` + +## Smoke Test +Log into the dashboard. Go to settings and add an RSS feed. If the `Ingestion Settings` page shows health check successes within 5 minutes, Celery, Postgres, and the Network are functional. diff --git a/docs/admin-guide/operations.md b/docs/admin-guide/operations.md new file mode 100644 index 00000000..d3f1220d --- /dev/null +++ b/docs/admin-guide/operations.md @@ -0,0 +1,24 @@ +# Operations + +## Daily/Weekly Health Checks +* Check Celery Beat logs to see if scheduled ticks are executing. 
+* Look for `429 Too Many Requests` in your API provider logs (OpenRouter). + +## Celery Beat & Worker Monitoring +Use Celery Flower (if enabled in Compose) or monitor queue depth via the length of the Celery queue in Redis. + +## Qdrant Collection Health +Ensure Qdrant is snapshotting to disk and not constantly OOM-killed. If it runs out of memory, increase VPS limits. + +## Embeddings Backfill +If you switch embedding providers (e.g., moving from `local` to `openai`), previous cosine scores are invalidated. You must run a backfill management command to rewrite all `Content` vectors. + +## Re-running Pipeline +If LLM failures occurred due to an outage: +Go to Django Admin -> `PipelineRun` -> select failed items and re-trigger. + +## Clearing Stuck Items +Use the `ReviewQueue` in the Next.js frontend to clear items the LLM had zero confidence about. + +## Messaging/Channels Health +If real-time notifications fail, verify Daphne is alive and the `REDIS_URL` matches the Channels configuration. diff --git a/docs/admin-guide/overview.md b/docs/admin-guide/overview.md new file mode 100644 index 00000000..ba737b3a --- /dev/null +++ b/docs/admin-guide/overview.md @@ -0,0 +1,26 @@ +# Admin Overview + +Welcome to the operator's manual for Newsletter Maker. This guide assumes you are running the system, not writing code for it. + +## Component Map +* **Django (API & Workers)**: Python core running the REST API. +* **Celery Worker**: Asynchronous task runner for LangGraph skills and entity extraction. +* **Celery Beat**: Cron scheduler for trend gathering and fetching RSS/Reddit plugins. +* **PostgreSQL**: Holds all standard application state and configuration. +* **Redis**: Acts as the message broker for Celery and the WebSockets channel layer. +* **Qdrant**: The Vector Database storing the high-dimensional embeddings for Cosine relevance calculations. +* **Ollama** (Optional): A containerized local LLM server for generating embeddings locally without paying OpenAI/OpenRouter. 
+* **Nginx**: Reverse proxy to route `/api/` traffic to Django and `/` traffic to Next.js. +* **Next.js**: The frontend App Router. + +## Request Path +Browser -> Nginx -> Next.js (for HTML) -> Nginx -> Django Gunicorn -> Postgres. + +## Ingestion Path +Beat triggers fetch -> Celery Worker -> Fetches RSS array -> Django DB -> Triggers Embedding -> Saves to Qdrant -> Enqueues LangGraph Pipeline -> Celery Worker Executes Skills. + +## AI Pipeline Path +Orchestrated by LangGraph inside a Celery task. Calls out to the specific `OPENROUTER_API_BASE` or Local Ollama instance. State transitions are saved continuously to Postgres mapping to `SkillResult`s. + +## Realtime Path +Browser -> Nginx (WebSocket Upgrade) -> Django Daphne ASGI -> Redis `CHANNEL_LAYER` -> Broadcast to users. diff --git a/docs/admin-guide/sources-and-allowlist.md b/docs/admin-guide/sources-and-allowlist.md new file mode 100644 index 00000000..cc5a409b --- /dev/null +++ b/docs/admin-guide/sources-and-allowlist.md @@ -0,0 +1,29 @@ +# Sources and Allowlist + +Administrators must babysit the influx of content to keep the system healthy. + +## Per-Plugin Config +* **RSS**: Relies purely on outbound GET fetches. +* **Reddit / Bluesky / Mastodon**: Relies entirely on their respective API limitations. If you hit 429 Too Many Requests, throttle the polling intervals. + +## Health Check Semantics +Plugins record timestamped failures into `IngestionRun`. If an ingestion source fails 5 times consecutively, the frontend highlights it in red. + +## Bootstrap Live Sources +You can instantly seed a fresh database with: +```bash +docker compose exec django python manage.py bootstrap_live_sources +``` + +## Intake Allowlist Lifecycle +When you forward a newsletter to your ingest address: +1. `Pending`: The system receives the email but quarantines it. +2. `Confirmation Sent`: The system emails the sender back with a one-time link. +3. `Confirmed`: The user clicks the link. Their address is now Trusted. +4. 
`Expired`: Stalled after 7 days. + +## Revoking Senders +If a newsletter breaks or creates spam, remove it via the Django Admin panel under `newsletters.IntakeAllowlist`. + +## Investigating Dropped Subscriptions +Check the `NewsletterIntake` model in Django admin. Emails that fail to parse HTML correctly record their Stack Traces there. diff --git a/docs/admin-guide/troubleshooting.md b/docs/admin-guide/troubleshooting.md new file mode 100644 index 00000000..d8fa0204 --- /dev/null +++ b/docs/admin-guide/troubleshooting.md @@ -0,0 +1,25 @@ +# Troubleshooting + +## No Content Appearing +* Check Celery Beat: Are tasks scheduling? +* Check Celery Worker: Are tasks crashing? +* Check `SourceConfig`: Are there active integrations? + +## Newsletters Never Confirm +* Ensure `RESEND_API_KEY` is valid. +* Ensure `NEWSLETTER_PUBLIC_URL` is set to an address routing correctly to the Reverse Proxy, or else the generated links in the emails will point to `127.0.0.1` and fail. + +## Relevance Scores Near Zero +* If all scores sit at 0.1, the Qdrant `is_reference=True` corpus is either empty or entirely corrupted. Tag at least 5 articles as explicitly "Reference Quality" to set a project boundary. + +## Qdrant Search Returns Nothing +* Connect to Qdrant UI mapping on port `6333`. Ensure the collections exist and match the `EMBEDDING_PROVIDER` token sizes. + +## Embeddings Worker Idle +* Double check `OLLAMA_URL` network resolution if running local containers. + +## Pipeline Stuck +* Review `SkillResult` tables in Postgres for unhandled tracebacks. + +## Messaging Not Delivering +* Ensure Daphne ASGI is handling traffic and isn't overridden by WSGI Gunicorn blocks inside `nginx/conf.d`. diff --git a/docs/admin-guide/users-and-access.md b/docs/admin-guide/users-and-access.md new file mode 100644 index 00000000..ccaea253 --- /dev/null +++ b/docs/admin-guide/users-and-access.md @@ -0,0 +1,20 @@ +# Users and Access + +## Account Creation Paths +1. Standard Username/Password (Local). 
+2. LinkedIn OAuth (If configured via `LINKEDIN_CLIENT_ID`). + +## Project Membership Model +We do not use standard Django Groups for multi-tenancy. Access is strictly mapped via `ProjectMembership` with explicit Roles: `admin`, `member`, `reader`. + +## Django Groups and Roles +Django `Group` and `Permission` models are retained ONLY for granting staff/Superuser global abilities (e.g., viewing standard Django Admin), NOT for managing newsletter workspaces. + +## Service Accounts +If you need scripts to push data, you can generate a Long Lived API token attached to a standard User account flagged programmatically. + +## LinkedIn OAuth Admin Steps +To enable LinkedIn SSO: +1. Create a LinkedIn Developer Application. +2. Whitelist your `NEWSLETTER_PUBLIC_URL/api/v1/auth/linkedin/callback` endpoint. +3. Inject the ID and Secret into your `.env` configuration. diff --git a/docs/developer-guide/architecture.md b/docs/developer-guide/architecture.md new file mode 100644 index 00000000..8bd9086f --- /dev/null +++ b/docs/developer-guide/architecture.md @@ -0,0 +1,32 @@ +# Architecture + +This document maps the flow of data and requests through the Newsletter Maker platform. + +## Sync Request Path +Standard API requests (e.g., fetching content, editing entity thresholds) follow a standard synchronous Django flow: +1. **Nginx** terminates SSL and proxies the request to the Gunicorn WSGI worker. +2. **Django & DRF** authenticate the request (session or token). +3. The ViewSet enforces project scoping via `ProjectOwnedQuerysetMixin`. +4. State is read/written to **PostgreSQL**. + +## Async Path +Real-time features, like the messaging drawer, use WebSockets: +1. **Nginx** upgrades the connection and routes to Daphne (ASGI). +2. **Django Channels** accepts the WebSocket. +3. Broadcasts and channel layers are coordinated via **Redis**. + +## Ingestion Path +How articles enter the system: +1. **Celery Beat** runs scheduled cron jobs (e.g., `core.tasks.fetch_rss`). +2. 
The task queries `SourceConfig` and invokes the matching `SourcePlugin` (e.g., `RssPlugin`). +3. Raw items are parsed. Novel items are passed to the `embeddings` module. +4. Text is passed to the configured embedding provider (`local`, `ollama`, or `openai`). +5. The `Content` record is saved to Postgres, and the Vector is pushed to **Qdrant**. +6. The item is queued into the LangGraph pipeline. + +## LangGraph Orchestration Overview +The AI Pipeline is managed as a State Graph (see [Pipeline](../reference/pipeline.md)). A Celery worker processes each node. If an LLM call fails, the node emits an error state, and the graph gracefully terminates or routes the item to a `ReviewQueue`. + +## Frontend Rendering & Data Fetching +* The Next.js 15 App Router utilizes both Server Components (for initial page loads) and Client Components. +* **TanStack React Query** manages client-side data fetching and caching against the Django DRF endpoints. diff --git a/docs/developer-guide/backend-conventions.md b/docs/developer-guide/backend-conventions.md new file mode 100644 index 00000000..ff2bca47 --- /dev/null +++ b/docs/developer-guide/backend-conventions.md @@ -0,0 +1,28 @@ +# Backend Conventions + +## Project Scoping Invariants +Because this is a multi-project (not tenant) system, data leakage is the worst possible bug. +* Almost all ViewSets must inherit from `ProjectOwnedQuerysetMixin`. +* URLs are nested: `/api/v1/projects/{project_id}/...`. +* The `project_id` must be explicitly verified when linking related objects via Foreign Key in serializers. + +## DRF Patterns +* Keep views and viewsets extremely thin. +* Put operational logic in `core/tasks.py`, `core/pipeline.py`, or application-specific helpers like `newsletters/intake.py`. +* Pass the `project` object via `serializer.context` during creation overrides. + +## Code Placement +Do not dump everything into `core/`. +If you are adding functionality for trend clustering, it belongs in `trends/`. 
If you are adding a new ingestion source, it belongs in `ingestion/`. `core/` is exclusively for plumbing (auth, WSGI, abstract models, LLM wrappers). + +## Plugin Interface +All sources must implement the `SourcePlugin` interface, keeping `fetch()` operations standardized whether they connect to RSS, Reddit, or Bluesky. + +## Celery Task Conventions +Always pass database IDs (e.g., `content_id`), not serialized ORM objects, as arguments to Celery tasks. + +## drf-spectacular Schema Metadata +If you change an API return shape, use `@extend_schema` to update the type hints so the OpenAPI spec remains accurate. + +## Docstring Rules +Follow Google-style docstrings with PEP 257 conventions. Add docstrings to modules, public classes, and public functions. Skip obvious one-liners. diff --git a/docs/developer-guide/contributing.md b/docs/developer-guide/contributing.md new file mode 100644 index 00000000..bc271ce7 --- /dev/null +++ b/docs/developer-guide/contributing.md @@ -0,0 +1,18 @@ +# Contributing + +## Branch Naming +Standard `feature/`, `bugfix/`, or `chore/` prefixes. + +## Commit-Time Validation +You are expected to pass formatting and linting locally. Run: +```bash +just lint +``` +This runs `ruff`, `mypy`, `eslint`, and typechecks via `tsc`. + +## Instruction Files +This repository uses contextual instruction files to guide AI code generation natively in VS Code. If you are changing a core pattern, update the relevant file in `.github/instructions/`. + +## Skills System +We maintain discrete prompt-files for LLM features under `skills/`. +If you are modifying how an AI feature works, edit the `SKILL.md` file rather than burying prompt text in Python strings. See `docs/reference/skills.md` for more info. 
diff --git a/docs/developer-guide/deployment.md b/docs/developer-guide/deployment.md new file mode 100644 index 00000000..641dc000 --- /dev/null +++ b/docs/developer-guide/deployment.md @@ -0,0 +1,19 @@ +# Deployment + +## just build Contract +The `just build` target makes zero assumptions about the environment file. It uses `DOCKER_BUILDKIT=0` to ensure legacy build isolation and host image cache utilization. No `.env` copies are made during build time. + +## Docker Compose +Used primarily for local testing and running the application on a single VPS. See [Admin Installation](../admin-guide/installation.md) for details. + +## Helm Chart Layout +For Kubernetes deployments, a reusable Helm chart sits in `deploy/helm/`. + +## ArgoCD Application +We maintain an ArgoCD application manifest in `deploy/argocd/` to support GitOps continuous delivery. + +## Staging Overlay +Staging branches utilize encrypted / sealed secrets (or external secret operators) pushed into the cluster. + +## Prometheus ServiceMonitor +If deployed alongside the `kube-prometheus-stack`, the chart deploys a `ServiceMonitor` to scrape port 8000 for Django metrics exposed by `django-prometheus`. diff --git a/docs/developer-guide/frontend-conventions.md b/docs/developer-guide/frontend-conventions.md new file mode 100644 index 00000000..a3e306a9 --- /dev/null +++ b/docs/developer-guide/frontend-conventions.md @@ -0,0 +1,27 @@ +# Frontend Conventions + +## App Router Layout +* `app/`: Next.js page components, layouts, and route handlers. +* `components/`: UI pieces, divided into: + * `elements/`: App-owned smart components that combine primitives with project logic. + * `layout/`: Shared navigation and page chrome. + * `ui/`: Raw `shadcn/ui` installed components. **Do not modify these unless absolutely necessary.** +* `providers/`: Context wrappers (theme, query client). +* `lib/`: API fetchers, types, and hooks. 
+ +## Components Structure +For components in `elements/` or `layout/`: +Let the folder carry the name, e.g., `components/elements/UserAvatar/`. +Inside, use: +* `index.tsx` +* `index.test.tsx` +* `index.stories.tsx` +**Do not use barrel `index.ts` files** to simply re-export unless architecturally required. + +## Shared Types and API +* Shared backend-facing types live in `frontend/src/lib/types.ts`. +* Data requests live in `frontend/src/lib/api.ts`. +* **Preserve `snake_case`**: Do not arbitrarily convert the backend's `snake_case` properties into `camelCase` on the frontend. Consume them as they arrive to keep grepping simple. + +## Test-with-the-Change Rule +When you add a route, component, or helper, you must write or update the colocated `.test.tsx` Vitest file in the same PR. diff --git a/docs/developer-guide/local-development.md b/docs/developer-guide/local-development.md new file mode 100644 index 00000000..4af7dd8d --- /dev/null +++ b/docs/developer-guide/local-development.md @@ -0,0 +1,38 @@ +# Local Development + +Newsletter Maker uses a **two-workflow split** to separate fast local iteration from full-stack fidelity. + +## The Two-Workflow Split +1. **Host-Side Track**: Used for fast linting, typechecking, and unit tests WITHOUT spinning up Docker. +2. **Docker Track**: Used for running the application, seeing the UI, background workers, and Postgres. + +## Host-Side Track +When you run commands on your local OS (e.g., `just lint`, `just test`, `just frontend-lint`): +- Django reads from `.env.test`. +- `DATABASE_URL` defaults to `sqlite:///:memory:` for instantaneous migrations/tests. +- No Redis or Qdrant is required for basic unit test stubs. 
+ +## Docker Track +When you want to run the app: +```bash +just build # Env-free container build (DOCKER_BUILDKIT=0) +docker compose up -d +``` +When running the Docker track, all runtime commands must be executed **inside the container**: +```bash +docker compose exec django python manage.py migrate +docker compose exec django python manage.py bootstrap_live_sources +``` + +## Celery Beat Schedule +The Celery beat schedule file (`celerybeat-schedule`) is written to `.cache/` to prevent dirtying the project root or colliding between host/container environments. + +## Frontend Dev Loop +For iterating purely on the Next.js app while the backend runs in Docker: +```bash +cd frontend && npm run dev +``` + +## When to Use Which Workflow +* **Writing code, running tests, checking types**: Host-side (`just lint`, `just test`). +* **Testing LLMs, seeing the UI, testing ingestion, full pipelines**: Docker Track (`docker compose up`). diff --git a/docs/developer-guide/overview.md b/docs/developer-guide/overview.md new file mode 100644 index 00000000..6f2bd659 --- /dev/null +++ b/docs/developer-guide/overview.md @@ -0,0 +1,35 @@ +# Developer Overview + +Welcome to the Newsletter Maker codebase! This folder documents how developers build, test, and run the backend and frontend. 
+ +## Repo Map + +``` +newsletter-maker/ +├── core/ # Cross-cutting plumbing (LLM wrappers, Qdrant, auth, tasks) +├── content/ # Content models, deduplication, relevance saving +├── entities/ # Entity extraction, authority scoring +├── frontend/ # Next.js App Router application +├── ingestion/ # RSS, Reddit, Bluesky, Mastodon integrations +├── newsletter_maker/ # Primary Django Config / WSGI / ASGI routing +├── newsletters/ # Email intake via Anymail, confirmation loops +├── pipeline/ # LangGraph orchestration, Review queues +├── projects/ # Core isolated tenancy setup, global Config models +├── trends/ # Velocity clustering, Theme and Idea suggestion logic +├── users/ # Custom User model and invitations +├── docs/ # Documentation +└── skills/ # Independent Markdown prompts used by LLM Wrapper +``` + +## Tech Stack +* **Backend**: Python 3.12, Django 5+, Django REST Framework, Celery, Django Channels (WebSockets) +* **Frontend**: React 19, Next.js 15 App Router, TypeScript, TailwindCSS, shadcn/ui, TanStack Query +* **Infrastructure**: Postgres, Redis, Qdrant (Vector DB), Ollama (Local AI) + +## Where Apps Live +Unlike monolithic Django applications where `core` is a kitchen sink, this project is modularized by feature. `newsletters.intake` strictly handles ingest; `trends.clustering` handles the trend-velocity algorithms. Only truly global plumbing (like DRF mixins) lives in `core`. + +## How to Read This Doc Set +If you are standing up the codebase for the first time, head to [Local Development](local-development.md). +If you're investigating a pipeline failure, head to `docs/reference/pipeline.md`. +If you are writing the Next.js UI, head to [Frontend Conventions](frontend-conventions.md).
diff --git a/docs/developer-guide/testing.md b/docs/developer-guide/testing.md new file mode 100644 index 00000000..0c2f46e7 --- /dev/null +++ b/docs/developer-guide/testing.md @@ -0,0 +1,21 @@ +# Testing + +## pytest Layout (Backend) +* Application-specific tests live inside their respective apps (e.g., `users/tests/`, `pipeline/tests/`). +* The top-level `tests/` directory is reserved strictly for full-system integration tests. +* Ensure you are running under the host-side track (SQLite `:memory:`) for speed. Example: `just backend-test`. + +## Vitest Layout (Frontend) +* Tests live immediately beside the files they test (e.g., `index.test.tsx`). +* We do not use separate `__tests__/` folders. + +## Storybook Usage +* UI components in `elements/` and `layout/` should have `.stories.tsx` files demonstrating their permutations. + +## Coverage Expectations +Use the `.github/skills/coverage-auditor/SKILL.md` to spot gaps. All new logic branches in APIs, serializers, and utility functions should be covered. + +## When to Use just Test Targets +* `just test`: Runs all backend AND frontend tests. +* `just backend-test`: Only runs `pytest`. +* `just frontend-test`: Only runs `vitest`. diff --git a/docs/reference/algorithms.md b/docs/reference/algorithms.md new file mode 100644 index 00000000..ac6b03dd --- /dev/null +++ b/docs/reference/algorithms.md @@ -0,0 +1,52 @@ +# Algorithms + +This document breaks down the major decision-making mathematics behind the AI pipeline. Having these defined explicitly helps debug strange behaviors and informs how to adjust the tunables. + +## Embedding Model & Vector Space +* **What it computes**: Converts extracted content text into high-dimensional vector coordinates. +* **Inputs**: Candidate text. +* **Outputs**: A dense vector stored into the Qdrant database payload. +* **Tunables**: `EMBEDDING_PROVIDER`, `EMBEDDING_MODEL` ([see Tunables](tunables.md)).
+* **Location**: `core/embeddings.py` + +## Cosine Relevance Scoring +* **What it computes**: Decides if a candidate article is relevant to the project's specific topic. +* **Inputs**: The candidate Article Vector, and the top-5 $k$-Nearest-Neighbor vectors tagged as `is_reference=True` in Qdrant for that project. +* **Formula / Rules**: + 1. If Similarity $\ge 0.85$: The candidate is considered a **clear match** and is deemed highly relevant (LLM bypass). + 2. If Similarity $< 0.50$: The candidate is a **clear non-match** (LLM bypass). + 3. If $0.50 \le$ Similarity $< 0.85$: The candidate falls into the **ambiguous band**, triggering the Relevance LLM skill which returns a score to break the tie, assuming `OPENROUTER_API_KEY` is present. +* **Tunables**: Static bounds `0.50` and `0.85`. +* **Location**: `core/pipeline.py` and `core/ai.py`. + +## Topic Centroid Feedback Loop +* **What it computes**: Drifts the project baseline "topic vector" based on explicit editorial thumbs-up or thumbs-down feedback. +* **Inputs**: Explicit `UserFeedback` records. +* **Formula**: The sum of positively-ranked content vectors minus negatively-ranked vectors, proportionally shifting the project's reference similarity center point. +* **Tunables**: `ProjectConfig.recompute_topic_centroid_on_feedback_save` ([see Tunables](tunables.md)). +* **Location**: `trends/` and `core/embeddings.py`. + +## Authority Scoring +* **What it computes**: Assigns an influence multiplier (`authority_score`) to an `Entity` based on how frequently and prominently it is mentioned or referenced. +* **Inputs**: Detected mentions, source quality signals. +* **Algorithmic Model**: A multi-signal model including raw mention frequency layered against a time-based decay function ($score_{new} = score_{previous} \times decay\_rate$), bounded engagement corroboration. +* **Tunables**: `ProjectConfig.authority_decay_rate`. +* **Location**: `entities/` models and Celery tasks. 
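The decay term in the Authority Scoring model above can be illustrated with a small sketch. It assumes one decay application per scoring period and borrows the `authority_decay_rate` default of 0.95 listed in the Tunables document. The function name `decay_authority` is hypothetical, and the mention-frequency and engagement signals of the real model are left out.

```python
def decay_authority(previous_score: float, decay_rate: float = 0.95, periods: int = 1) -> float:
    """Apply score_new = score_previous * decay_rate once per elapsed period.

    Hypothetical helper for illustration only; not the project's implementation.
    """
    if not 0.0 <= decay_rate <= 1.0:
        raise ValueError("decay_rate must be between 0 and 1")
    if periods < 0:
        raise ValueError("periods must be non-negative")
    # Applying the same multiplier repeatedly collapses to a single exponent.
    return previous_score * decay_rate ** periods
```

Under these assumptions, an entity that goes ten periods without a mention keeps roughly 60% of its score, since 0.95 raised to the tenth power is about 0.599.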
+ +## Deduplication Thresholding +* **What it computes**: Determines whether an incoming ingestion is identical to a piece of content already in the dataset. +* **Inputs**: Incoming article text embedding. +* **Formula**: $L_2$ Distance nearest-neighbor search. Extremely close items are flagged as duplicate and ignored. +* **Location**: `content/deduplication.py` and `pipeline`. + +## Trend Velocity Calculation +* **What it computes**: Isolates topics that are accelerating in mentions, not just frequently used. +* **Formula**: $\frac{Count(Recent Window)}{Count(Baseline Window)}$. Identifies a delta derivative of topic popularity. + +## Source Diversity Metric +* **What it computes**: Quantifies concentration risk (e.g., pointing out if 90% of content is drawn from the same single Reddit community). +* **Formula**: Herfindahl-Hirschman Index (HHI) style proportionality test on plugin sources over total ingestions. + +## Entity Candidate Confidence Scoring +* **What it computes**: Determines whether an unknown text fragment (e.g., "OpenAI") extracted by the LLM should be auto-promoted into a new Tracked Entity or kept pending human review. +* **Formula**: Evaluates proximity to known aliases, capitalization strictness, and recurrence volume. diff --git a/docs/reference/api.md b/docs/reference/api.md new file mode 100644 index 00000000..be71958c --- /dev/null +++ b/docs/reference/api.md @@ -0,0 +1,42 @@ +# API + +Our backend exposes a REST API powered by Django REST Framework (DRF), and a WebSocket API via Django Channels. + +## Auth +- **Session Auth**: Primarily used by the Next.js frontend during browser sessions. +- **Token Auth**: Supported for programmatic access. + +## Base Path +Because of the project-centric design, almost all resources are nested under: +`/api/v1/projects/{project_id}/...` + +This enforces the strict scoping of data and prevents cross-project spillage. 
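As a rough sketch of the nesting rule above, a client could build every resource path from the project ID outward, so an unscoped path is hard to produce by accident. The helper name `project_path` and the constant `API_PREFIX` are assumptions for illustration and are not taken from the codebase.

```python
from typing import Union
from urllib.parse import quote

# Prefix taken from the Base Path section; the constant name is hypothetical.
API_PREFIX = "/api/v1/projects"


def project_path(project_id: Union[int, str], *segments: Union[int, str]) -> str:
    """Build a project-scoped API path with a trailing slash."""
    parts = [API_PREFIX, quote(str(project_id), safe="")]
    # Percent-encode each segment so a stray "/" cannot escape the project scope.
    parts.extend(quote(str(segment), safe="") for segment in segments)
    return "/".join(parts) + "/"
```

For example, `project_path(42, "content")` yields `/api/v1/projects/42/content/`.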
+ +## Per-Resource Endpoint Table + +| Resource | Path | Methods | Notes | +| --- | --- | --- | --- | +| **Projects** | `/api/v1/projects/` | `GET`, `POST` | List projects you have membership in. | +| **Content** | `/api/v1/projects/{pid}/content/` | `GET`, `PATCH` | Read ingested items. Patch `is_relevant`. | +| **Entities** | `/api/v1/projects/{pid}/entities/` | `GET`, `PATCH` | Tracked entities over time. | +| **Sources** | `/api/v1/projects/{pid}/sources/` | `GET`, `POST`, `PATCH`, `DELETE` | Configure plugin ingestion settings. | +| **Members** | `/api/v1/projects/{pid}/members/` | `GET`, `POST`, `DELETE` | Invite users / manage roles. | +| **Feedback** | `/api/v1/projects/{pid}/content/{cid}/feedback/` | `POST` | Submit upvote/downvote for topic centroid drift. | + +## Inbound Webhook +`/api/v1/inbound/...` +This surface is deliberately **unscoped** by project ID in the path because external systems (like Anymail handling Resend callbacks) do not know our internal `project_id`. The application code looks up the correct project using tokens or sender records. + +## Messaging +* **REST Path**: `/api/v1/messages/` +* **WebSocket Path**: `/ws/messages/` (Served via Django Channels / ASGI) + +## Pagination & Filtering +* Cursor-based or PageNumber pagination depending on the endpoint scale. +* Filtering standardizes on `django-filter` params (e.g., `?category=tutorial&min_relevance=0.8`). + +## drf-spectacular Schema Link +A full OpenAPI V3 schema is auto-generated by `drf-spectacular`. +Available in dev at: +* Swagger UI: `/api/schema/swagger-ui/` +* Raw YAML: `/api/schema/` diff --git a/docs/reference/data-model.md b/docs/reference/data-model.md new file mode 100644 index 00000000..a9b4c133 --- /dev/null +++ b/docs/reference/data-model.md @@ -0,0 +1,45 @@ +# Data Model + +This document outlines the core domain entities mapping to the database, ensuring developers understand the relational boundaries and project-scoping invariants.
+ +## Model Diagram +*(Future: Add Mermaid ER diagram for visual reference)* + +## Per-App Model List + +| App | Model | Description | +| --- | --- | --- | +| **`users`** | `AppUser` | The project's custom user model, with profile fields and avatar metadata layered onto Django's historical auth tables. | +| `users` | `MembershipInvitation` | Invite one email address into a project with a predefined role and one-time redemption token. | +| **`projects`** | `Project` | Top-level workspace for one newsletter topic, scoped through project memberships rather than a legacy Django group. | +| `projects` | `ProjectMembership` | Join table assigning one user a per-project role such as admin, member, or reader. | +| `projects` | `ProjectConfig` | Per-project tuning values for authority weighting, decay, and topic-centroid recomputation. | +| `projects` | `SourceConfig` | Per-project configuration for each ingestion plugin (RSS, Reddit), including activation state and fetch configuration. | +| `projects` | `BlueskyCredentials` | Stored account credentials and verification state for a single project's Bluesky plugin. | +| **`content`** | `Content` | The canonical record for ingested articles or posts, including source metadata, extracted text, relevance scoring, embeddings, and entity association. | +| `content` | `UserFeedback` | Explicit upvote or downvote feedback on a content item, used to capture editorial preference signals. | +| **`ingestion`** | `IngestionRun` | Audit/log record for an ingestion execution, tracking plugin, timing, item counts, status, and failure messages. | +| **`entities`** | `Entity` | A person, vendor, or organization tracked within a project to associate content with a known source or subject. | +| `entities` | `EntityAuthoritySnapshot` | One persisted authority-score recomputation for a tracked entity. | +| `entities` | `EntityMention` | A detected mention of a tracked entity inside one content item, including role and sentiment metadata. 
| +| `entities` | `EntityCandidate` | An extracted named entity awaiting acceptance, rejection, or merge into an existing tracked entity. | +| **`newsletters`** | `IntakeAllowlist` | Approved sender list for project newsletter intake; confirming who can submit inbound newsletter emails. | +| `newsletters` | `NewsletterIntake` | Raw inbound newsletter email captured before and after extraction, holding subject, body, status, and errors. | +| **`pipeline`** | `PipelineRun` | Audit model for an execution round through LangGraph. | +| `pipeline` | `SkillResult` | Output record for an enrichment skill run, storing status, payload, confidence, latency, and model metadata. | +| `pipeline` | `ReviewQueue` | Human review item created for content needing manual judgment (e.g., borderline relevance). | +| **`trends`** | `ThemeSuggestion` | Clustered topic trends presented to human editors for newsletter inclusion. | +| `trends` | `TopicCentroidSnapshot` | One snapshot of the project's feedback-weighted topic centroid and its drift metrics. | +| `trends` | `OriginalContentIdea` | Auto-generated suggestions for topics the editor should write original content about. | + +## Project-Scoping Invariants + +By architectural rule, almost every model (except `AppUser`) belongs securely to a specific `Project` (e.g., via `project_id`). This scoping is heavily enforced at the API layer (refer to `developer-guide/backend-conventions.md`). + +- Never execute unbounded or unscoped wide queries out of the API. +- All relationships crossing between `content`, `entities`, `newsletters`, and `pipeline` models must enforce that the foreign keys belong to the *same* `Project`. + +## Key Indexes + +- `project_id` scoping indexes exist uniformly. +- **Qdrant Vector Index**: Content embeddings are maintained synchronously alongside Postgres data, identified by string UUIDs linking `Content.id` into the Qdrant document payload. 
Project ID is routinely attached to vector payloads to allow tenant-safe cosine similarity searches. diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md new file mode 100644 index 00000000..592eb969 --- /dev/null +++ b/docs/reference/glossary.md @@ -0,0 +1,40 @@ +# Glossary + +This glossary clarifies our domain language to ensure we all use the same terminology in code, discussion, and user documentation. + +## Core Terms + +### Project (not Tenant) +The top-level container for a newsletter workspace. A Project has settings, sources, team members, content, and an intake inbox. **Never use the word "Tenant" in this codebase or user documentation.** We use "Project" exclusively. + +### Content +The canonical record for ingested articles, posts, or extracted newsletter links. Everything the system curates or analyzes begins as a Content item. + +### Entity and Entity Candidate +An **Entity** is a person, organization, vendor, or open-source project tracked by the system to monitor authority and mention frequency. +An **Entity Candidate** is a temporary object generated by the Entity Extraction skill, representing a potential new entity that needs human editor review before being merged into a tracked Entity. + +### Theme and Idea +A **Theme Suggestion** is a drafted topic or category generated from clustering accelerating trends (velocity). Editors can promote these into newsletter sections. +An **Original Content Idea** is an angle the system suggests the editor write themselves, triangulated from current trends, accepted themes, and editor feedback. + +### Source / Plugin +A **Source Plugin** is the integration method (e.g., RSS, Reddit, Bluesky, Mastodon, LinkedIn). +A **Source Config** is the specific project-level instance of that plugin (e.g., the specific RSS feed URL or Subreddit name). + +### Intake / Allowlist +**Intake** refers to the process of parsing forwarded email newsletters. 
+The **Intake Allowlist** protects the project by ensuring only authorized sender email addresses can inject content into the ingestion pipeline. + +### Skill +A **Skill** is a discrete prompt-based operation using an LLM (e.g., Categorization, Relevance Scoring, Summarization, Deduplication). They live in the `skills/` directory. + +### Run / Pipeline Run +A **Pipeline Run** represents the execution state of the AI pipeline for a piece or batch of content, orchestrating the sequence of Skills via LangGraph. +An **Ingestion Run** is the execution of a Source Plugin fetching new data. + +### Internal vs Public URL +Due to our architecture running inside Docker Compose, there is a distinction: +- **Internal API URL**: Like `http://nginx`, used for frontend-container-to-backend-container traffic. +- **Public URL**: The externally resolvable URL used to generate links in confirmation emails or OAuth callbacks. + diff --git a/docs/reference/logging-and-observability.md b/docs/reference/logging-and-observability.md new file mode 100644 index 00000000..9c262305 --- /dev/null +++ b/docs/reference/logging-and-observability.md @@ -0,0 +1,37 @@ +# Logging and Observability + +Because pipelines run silently in the background via Celery and LangGraph, strong observability is required to debug node failures, bad LLM parses, and data pipeline locks. + +## Structured Log Format +We mandate structured logging via a library (e.g., `structlog`) to emit `JSON` locally and in production. +Required standard fields: +* `project_id`: Present whenever logging within the context of a project. +* `content_id`: Present when acting on a specific ingested item. + +## Trace/Correlation IDs +Celery tasks and Django requests inject a correlation ID into the log context, providing the ability to stitch together an `IngestionRun` that spawned a task, which invoked a LangGraph `PipelineRun`, which hit an OpenRouter error.
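One way to picture the correlation-ID idea is the sketch below. It deliberately uses only the standard library (`contextvars` plus a `logging.Filter`) rather than `structlog`, so it is not the project's actual wiring. The names `correlation_id`, `CorrelationIdFilter`, and `start_correlated_work` are hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the ID for the current request or task; "-" means none has been set.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True  # never drop the record, only annotate it


def start_correlated_work() -> str:
    """Assign a fresh ID at the entry point of a request or task."""
    new_id = uuid.uuid4().hex
    correlation_id.set(new_id)
    return new_id
```

With the filter attached to a handler, a formatter can reference `%(correlation_id)s`, and every record emitted within the same context carries the same value.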
+ +## Prometheus Metrics Exposed +When Helm/Kubernetes deployments activate the `ServiceMonitor`, we expose: +* `django_http_requests_total` +* `django_http_requests_latency_seconds` +* `celery_task_status_total` +* `ai_skill_invocation_total` (Labeled by skill name and model) + +*Metrics enablement requires `METRICS_TOKEN` authentication on the scraping endpoint.* + +## Dashboards +In Kubernetes setups, Grafana dashboards attach to Prometheus data sources. Look for: +- API Latency & Error Rates +- Celery Queue Depth & Worker Saturation +- LLM Token Costs & Skill Reliability + +## Retention Windows +Since raw telemetry can overwhelm DBs quickly, the application enforces automated pruning of its internal audit models based on defaults set in [Tunables](tunables.md): +- `OBSERVABILITY_SNAPSHOT_RETENTION_DAYS` (90 days) +- `OBSERVABILITY_TREND_TASK_RUN_RETENTION_DAYS` (30 days) +- `OBSERVABILITY_REVIEW_QUEUE_RETENTION_DAYS` (30 days) + +## How to Add a New Log/Metric +Always fetch the bound logger instance initialized at the module level rather than calling `logging.info()` directly. +When logging errors for a pipeline node, pass the `payload` kwargs so the JSON log payload has exactly what the LLM received. diff --git a/docs/reference/pipeline.md b/docs/reference/pipeline.md new file mode 100644 index 00000000..4d8f4ae2 --- /dev/null +++ b/docs/reference/pipeline.md @@ -0,0 +1,32 @@ +# Pipeline + +This document details the LangGraph orchestration of the system's core capabilities. + +## Entry Points +The pipeline is invoked after basic ingestion has successfully persisted a new `Content` item. +* **Per-content invocation**: Typically called by a Celery task right after a plugin (RSS, Reddit) saves an item. +* **Batch invocation**: Run nightly or hourly to generate aggregate assets (Theme suggestions, trend velocity). + +## Node Order +When a `Content` item enters LangGraph, it passes through several state nodes: +1. 
**Classification**: Categorize the content (News, Tutorial, Opinion, etc.). +2. **Relevance**: Calculate Cosine Similarity. If ambiguous, trigger Relevance Skill. +3. **Deduplication**: Suppress if $L_2$ distance to already-saved items is extremely tight. +4. **Entity Extraction**: Detect Candidate Entities (people, orgs) referenced in the text. +5. **Summarization**: Generate the 2-3 sentence `summary_text` exposed in the UI. + +In a separate batch phase: +6. **Theme Detection & Trend Clustering**: Evaluates recent content to draft newsletter sections. + +## Retries & Partial Failure +Our pipeline is designed for **Resilience**: +* If an LLM skill node fails (e.g., timeout, context limit, 429), the node records a `SkillResult` with a failure status. +* The pipeline gracefully degrades. For example, if Relevance fails, it falls back to the baseline Cosine score. If Categorization fails, it is marked `Undefined`. +* "Stuck" items are enqueued to a `ReviewQueue` or can be re-run via Celery. + +## Where Prompts Live +The raw LLM system prompts and logic for each step do not live in the pipeline file itself. They are organized independently. See [Skills](skills.md) and look at the directories under `skills//SKILL.md`. + +## Scheduled vs On-Demand +- **On-Demand**: Categorization, Relevance, Summarization, Deduplication (runs immediately upon content fetch). +- **Scheduled**: Authority Scoring calculation, Trend Velocity compilation, Diversity analysis. diff --git a/docs/reference/skills.md b/docs/reference/skills.md new file mode 100644 index 00000000..da57ebe2 --- /dev/null +++ b/docs/reference/skills.md @@ -0,0 +1,53 @@ +# Skills + +A "Skill" represents one discrete prompt-and-extract operation run by an LLM in our system. + +## Skills Runtime +Skills are invoked dynamically. `core/llm.py` loads the prompt text directly from `skills//SKILL.md`. 
This allows prompt-tuning to happen independently of Python application logic, inside markdown files that AI assistants (like Copilot) can natively parse format-wise. + +## Skill Catalog + +### Content Classification +* **Purpose**: Assign a general topic bucket for filtering (e.g., tutorial, opinion, news, release). +* **Inputs**: Candidate text. +* **Outputs**: A mapped category. +* **Prompt Location**: `skills/content-classification/SKILL.md`. +* **Failure Mode**: Item is categorized as `Unknown`. + +### Relevance Scoring +* **Purpose**: Act as tie-breaker for items sitting in the ambiguous cosine-similarity band (`0.50 - 0.85`). +* **Inputs**: The project's topic description, precomputed reference similarity score, candidate title, candidate text (trimmed to 5000 chars). +* **Outputs**: A JSON payload containing a `relevance_score` and `explanation`. +* **Prompt Location**: `skills/relevance-scoring/SKILL.md`. +* **Failure Mode**: Aborts and relies purely on the Cosine Similarity score. + +### Deduplication +* **Purpose**: Determine if a new post is functionally identical (or a direct repost) to one already stored recently. +* **Inputs**: Candidate text and closest-distance match. +* **Outputs**: Boolean flag suppressing the clone. +* **Prompt Location**: `skills/deduplication/SKILL.md`. + +### Summarization +* **Purpose**: Condense an input article into a fast 2-3 sentence overview that editors read on the dashboard. +* **Inputs**: Candidate text. +* **Outputs**: Short paragraph saved to `Content.summary_text`. +* **Prompt Location**: `skills/summarization/SKILL.md`. +* **Failure Mode**: Retries; if fatal, leaves UI text empty requiring manual skim. + +### Newsletter Email Extraction +* **Purpose**: Parse raw forwarded HTML emails to decouple multiple article links from the sender's flavor-text wrapping. +* **Inputs**: HTML email body. +* **Outputs**: A list of extracted URLs and parsed titles. +* **Prompt Location**: `skills/newsletter-extraction/SKILL.md`. 
+ +### Entity Extraction +* **Purpose**: Find proper nouns (people, vendors, technologies) referenced in the item so we can track their authority and mention-velocity. +* **Inputs**: Candidate text. +* **Outputs**: Unresolved `EntityCandidate` names. +* **Prompt Location**: `skills/entity-extraction/SKILL.md`. + +### Theme Detection +* **Purpose**: Turn clustered articles into human-readable newsletter draft sections. +* **Inputs**: Groupings of high-velocity related content. +* **Outputs**: Proposed newsletter headings and a contextual summary for the grouping. +* **Prompt Location**: `skills/theme-detection/SKILL.md`. diff --git a/docs/reference/tunables.md b/docs/reference/tunables.md new file mode 100644 index 00000000..eb041eac --- /dev/null +++ b/docs/reference/tunables.md @@ -0,0 +1,50 @@ +# Tunables + +This document collects all parameters, thresholds, and variables that change how the system behaves. Most global tunables are configured via environment variables and loaded into Django settings, while project-specific algorithms use `ProjectConfig`. + +## How Settings Are Read +1. Environment variables set at the Docker Compose / Kubernetes pod level. +2. Loaded in `newsletter_maker/settings/base.py` and combined with defaults. +3. Consumed via `django.conf.settings` across the project. + +## LLM & Embeddings +These map directly to global inference capability. +* `EMBEDDING_PROVIDER`: Options include `local` (HuggingFace `sentence-transformers`), `ollama`, or `openai`/`openrouter`. +* `EMBEDDING_MODEL`: The identifier for the dense vector model. +* `OLLAMA_URL`: Local instance of Ollama, defaulting to `http://ollama:11434`. +* `OPENROUTER_API_KEY`: Fallback or primary inference provider key for OpenRouter or OpenAI compatible APIs. +* `OPENROUTER_API_BASE`: Endpoint for inference. + +## Relevance & Scoring Thresholds +Relevance rules divide candidate articles into clear-match, ambiguous, and clear-non-match bands. 
See [Algorithms](algorithms.md) for how the pipeline evaluates these. +* **Similarity Thresholds**: Embedding cosine similarity above `0.85` assumes auto-relevant. Below `0.5` assumes irrelevant. The `0.5 - 0.85` band asks the LLM. + +## Deduplication Thresholds +* Usually implemented via nearest-neighbor distance (e.g., threshold `< 0.05` means near duplication). + +## Authority Weights +Configured per-project in `ProjectConfig`: +* `authority_decay_rate` (default: 0.95): The rate at which an entity's authority metric decays over time without recent mentions. + +## Topic Centroid +Configured per-project in `ProjectConfig`: +* `recompute_topic_centroid_on_feedback_save` (default: True): Determines if a user's thumbs up/down immediately recomputes the vector centroid representing the project's topic. + +## URL Settings +* `NEWSLETTER_API_BASE_URL`: **Internal API base URL** (e.g. `http://nginx` within Compose) and historically used as a **Public API URL** for generated links (requires external DNS resolution). This is pending split into distinct explicit variables. +* `FRONTEND_BASE_URL`: Where the Next.js app sits. + +## Observability Retention +Keeps the database from ballooning over time. +* `OBSERVABILITY_SNAPSHOT_RETENTION_DAYS` (default: 90) +* `OBSERVABILITY_TREND_TASK_RUN_RETENTION_DAYS` (default: 30) +* `OBSERVABILITY_REVIEW_QUEUE_RETENTION_DAYS` (default: 30) + +## OAuth Provider Toggles +Requires specific API keys to be populated to become available: +* **LinkedIn**: `LINKEDIN_CLIENT_ID`, `LINKEDIN_CLIENT_SECRET`, `LINKEDIN_OAUTH_SCOPES` +* **Reddit**: `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT` + +## Channels / Messaging +* `CHANNEL_LAYER_URL`: URL to the Redis instance used by Django Channels for ASGI web socket propagation (e.g., `redis://redis:6379/1`). +* `MESSAGING_ENABLED` (frontend/build feature flags). 
diff --git a/docs/user-guide/entities-and-authority.md b/docs/user-guide/entities-and-authority.md new file mode 100644 index 00000000..c3313b26 --- /dev/null +++ b/docs/user-guide/entities-and-authority.md @@ -0,0 +1,19 @@ +# Entities and Authority + +## What an Entity Is +An **Entity** is a semantic object tracked by the system—typically a Person, Company, Product, or Vendor. + +## Candidate Review Queue +When the AI reads an article, it highlights Proper Nouns it doesn't recognize and marks them as **Entity Candidates**. These are pooled in the Candidate Review Queue awaiting human triage. + +## Approving, Rejecting, and Merging +In the Review Queue, you can: +* **Approve**: Turn it into a tracked Entity. +* **Reject**: Tell the system "don't track this word." +* **Merge**: Combine it with an existing entity (e.g., merging "MSFT" to "Microsoft"). + +## What Authority Means +Tracked entities generate **Authority**. If an entity is repeatedly mentioned in high-value, highly-relevant articles, its Authority Score rises. + +## How Authority Influences Content +Incoming articles that mention High-Authority entities are artificially boosted in your content feed. Tracking authority allows the system to surface industry thought leaders dynamically. diff --git a/docs/user-guide/feedback-and-tuning.md b/docs/user-guide/feedback-and-tuning.md new file mode 100644 index 00000000..fa92f961 --- /dev/null +++ b/docs/user-guide/feedback-and-tuning.md @@ -0,0 +1,13 @@ +# Feedback and Tuning + +## What Gets Recorded +Every time you approve an Entity Candidate, flag an article as Highly Relevant, or Dismiss an Idea, a feedback record is durably stored attached to your Project. + +## The Topic Centroid +Feedback mathematically shifts the underlying definition of your Project. When a project is created, the AI defines its "Topic Centroid." Positive reviews drag this Centroid closer to the approved topics, while Negative reviews push the Centroid away from junk text. 
+ +## Time to See Effects +Topic Centroids adjust dynamically but evaluate newly ingested articles only. Pushing "Thumbs down" will not re-score historical items in your list, but within an hour, incoming fetches will reflect the newly tuned boundaries. + +## Re-training vs General Feedback +If you fundamentally pivot the purpose of a Newsletter (e.g. pivoting from "General AI" strictly to "Medical Robotics"), you may need to purge the project references and start fresh. Normal feedback adjusts nuance—it cannot smoothly pivot a topic 180 degrees. diff --git a/docs/user-guide/getting-started-saas.md b/docs/user-guide/getting-started-saas.md new file mode 100644 index 00000000..bc889aab --- /dev/null +++ b/docs/user-guide/getting-started-saas.md @@ -0,0 +1,25 @@ +# Getting Started (Hosted SaaS) + +Welcome to Newsletter Maker! Depending on whether you are using our cloud version or self-hosting, your first steps differ slightly. + +## Signup & Email Verification +Navigate to the hosted application URL and register for an account using your email. We will send you a secure login link. Once verified, you will be taken to your dashboard. + +## Creating Your First Project +A **Project** is your workspace. It represents ONE specific newsletter topic (e.g., "AI in Healthcare"). Create a project and give it a brief description. This description actually informs the AI about what content it should look for. + +## Inviting Collaborators +Head to the **Members** tab to invite coworkers as Editors or Readers. + +## Adding Your First Source +The system has no content until you tell it where to look. +Navigate to **Sources**. Add an RSS Feed (e.g., `https://news.ycombinator.com/rss`) or a Reddit Subreddit (e.g., `r/MachineLearning`). The system will immediately begin fetching historical items. + +## Viewing the Content Dashboard +Click into the **Content** tab. 
Within roughly 5 minutes, you will see articles appearing from your sources, automatically assigned Relevance scores and Categorized by AI. + +## Forwarding Your First Newsletter +Find your project's unique Inbox Address in settings. Forward an old newsletter to it. The system will parse out all links within the email and add them to your Content pile. + +## Next Steps +Curate your incoming feed by identifying which content is a perfect fit. See [Projects and Content](projects-and-content.md) to understand Relevance. diff --git a/docs/user-guide/getting-started-selfhost.md b/docs/user-guide/getting-started-selfhost.md new file mode 100644 index 00000000..351e5bbc --- /dev/null +++ b/docs/user-guide/getting-started-selfhost.md @@ -0,0 +1,16 @@ +# Getting Started (Self-Hosted) + +## Logging In +Navigate to the URL provided by your IT Administrator. You can log in using your provided Username/Password, or if configured, via Single Sign-On (like LinkedIn). + +## What Your Admin Configured +Your system administrator has already set up the background AI models and databases. Depending on their configuration, some features (like email intake or specific AI generation speeds) might perform differently than a hosted system. + +## Creating Your First Project +Click "Create Project" and define the topic area of your newsletter. + +## Adding Sources +Go to **Sources**. Click Add Source. Paste an RSS or Social Media configuration. + +## Next Steps +Head to [Projects and Content](projects-and-content.md) to learn how to curate the resulting article feed. diff --git a/docs/user-guide/newsletter-drafts.md b/docs/user-guide/newsletter-drafts.md new file mode 100644 index 00000000..0bbb9cc3 --- /dev/null +++ b/docs/user-guide/newsletter-drafts.md @@ -0,0 +1,18 @@ +# Newsletter Drafts + +This is where curation pays off. + +## Assembling a Draft +Drafts are built by composing: +1. High-Relevance fresh articles. +2. Promoted Theme Suggestions. +3. Editor-generated Original Content Ideas.
+ +## Reordering +The Draft builder lets you drag and drop these items and automatically rewrites transition language where necessary. + +## Exporting +Click Export to generate raw Markdown or HTML ready to be pasted into Mailchimp, Substack, Hacker News, or a Ghost CMS. + +## Iterating with Feedback +If the synthesized draft output misses your tone, use the thumbs up/down controls on individual sections to train the system. diff --git a/docs/user-guide/newsletter-intake.md b/docs/user-guide/newsletter-intake.md new file mode 100644 index 00000000..9ee51fe8 --- /dev/null +++ b/docs/user-guide/newsletter-intake.md @@ -0,0 +1,24 @@ +# Newsletter Intake + +You can populate your project by forwarding competitor or complementary newsletters directly to the system. + +## Finding Your Project's Intake Address +Go to Project Settings. You will find an email address specifically generated for this project. + +## Forwarding Etiquette +Forward unmodified HTML emails. The AI `Newsletter Email Extraction` skill is highly tuned to find hyperlinks embedded in paragraphs while ignoring unsubscribe footers. + +## Sender Confirmation Email +To prevent spam engines from filling your project with junk, the first time you forward an email from a new address, the email is quarantined. +You will receive an automated reply containing a Confirmation Link. You must click it. + +## Pending vs Confirmed vs Expired +* **Pending**: Email held in quarantine awaiting confirmation. +* **Confirmed**: Sender is trusted. Future forwards are processed immediately. +* **Expired**: Over 7 days without confirmation. Discarded. + +## Managing Your Allowlist +Under Settings -> Allowlist, you can view your trusted senders, pre-clear new addresses, or revoke access if an address is compromised. + +## Troubleshooting Drops +If a newsletter fails to show up in your Content view after 10 minutes, contact your administrator to verify the email successfully reached the Anymail webhook. 
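The Pending / Confirmed / Expired lifecycle described in the intake guide above can be sketched as a small function. This is only an illustration under stated assumptions: the name `sender_status`, its parameters, and the rule that a click after the 7-day window does not count are hypothetical and are not taken from the codebase. Only the three state names and the 7-day limit come from the guide.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# From the guide: a sender expires after 7 days without confirmation.
CONFIRMATION_WINDOW = timedelta(days=7)


def sender_status(
    first_seen: datetime,
    confirmed_at: Optional[datetime],
    now: datetime,
) -> str:
    """Return 'pending', 'confirmed', or 'expired' for a forwarding address.

    Hypothetical helper: treats a confirmation click as valid only if it
    happened inside the confirmation window (an assumption).
    """
    if confirmed_at is not None and confirmed_at - first_seen <= CONFIRMATION_WINDOW:
        return "confirmed"
    if now - first_seen > CONFIRMATION_WINDOW:
        return "expired"
    return "pending"


first_forward = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(sender_status(first_forward, None, first_forward + timedelta(days=3)))
print(sender_status(first_forward, None, first_forward + timedelta(days=8)))
```

Keeping the window in a single constant makes the 7-day rule easy to find if an administrator ever needs to compare it against the real configuration.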
diff --git a/docs/user-guide/projects-and-content.md b/docs/user-guide/projects-and-content.md new file mode 100644 index 00000000..f2fffc23 --- /dev/null +++ b/docs/user-guide/projects-and-content.md @@ -0,0 +1,26 @@ +# Projects and Content + +## What a Project Is +In this platform, a **Project** equates to a specific newsletter brand or topic. Everything (Settings, Articles, Team Members, and especially the AI's "understanding" of your taste) lives strictly inside the boundaries of a single Project. + +## Content List & Filters +The Content Dashboard displays every article discovered from your sources. +* Use the **Category** dropdown to filter items down to just *Tutorials* or just *News*. +* Use the **Relevance** slider to hide junk items. + +## Relevance Scores +The system assigns a score to everything it ingests, representing "How likely is this article a good fit for this project?" +* **Low score (0.0 - 0.4)**: Probably off-topic. +* **High score (0.8+)**: Perfect fit. +* **Mid score (between 0.4 and 0.8)**: Ambiguous. The AI read it and made a guess. + +Read more about the math in [Algorithms](../reference/algorithms.md). + +## Opening a Content Item +Clicking an item reveals the original abstract, AI-generated summary, source metadata, and extracted entities (people/companies mentioned inside it). + +## Marking Content as Relevant/Not-Relevant +Under each item, you will see a Thumbs Up / Thumbs Down control. This is explicit feedback. Click Thumbs Up *only* on articles that perfectly exemplify your Newsletter. + +## How Feedback Affects Future Ranking +When you thumbs-up an article, its underlying data mathematically shifts the Project's "Topic Centroid." This means future incoming articles that resemble it will score *higher*. See [Feedback and Tuning](feedback-and-tuning.md). 
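To make the "Topic Centroid" idea above concrete, here is a minimal sketch. It assumes that each article is represented by an embedding vector, that the centroid is the plain mean of thumbs-up vectors, and that relevance is cosine similarity against that centroid. None of this is confirmed by the guide, which defers the actual math to the Algorithms reference. All function names are hypothetical, and thumbs-down handling is not modeled. Only the score bands (low up to 0.4, high from 0.8) come from the guide.

```python
import math


def centroid(vectors):
    """Mean of equally sized vectors (assumed centroid definition)."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def band(score):
    """Map a score onto the guide's Low / Mid / High bands."""
    if score >= 0.8:
        return "high"
    if score <= 0.4:
        return "low"
    return "mid"


# Toy 2-D "embeddings" of articles the editor has already liked.
liked = [[1.0, 0.0], [0.8, 0.2]]
candidate = [0.5, 0.5]  # a newly ingested article

score_before = cosine(centroid(liked), candidate)

# A thumbs-up on an article similar to the candidate pulls the centroid toward it.
liked.append([0.4, 0.6])
score_after = cosine(centroid(liked), candidate)

print(band(score_before), "->", band(score_after))
```

In this toy setup a single thumbs-up moves the candidate from the ambiguous band into the high band, which is the effect the guide describes: feedback shifts the reference point, and later articles are scored against the new position.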
diff --git a/docs/user-guide/themes-and-trends.md b/docs/user-guide/themes-and-trends.md new file mode 100644 index 00000000..92d5604d --- /dev/null +++ b/docs/user-guide/themes-and-trends.md @@ -0,0 +1,16 @@ +# Themes and Trends + +## Trend Velocity vs Raw Mention Count +The system does not just look for "topics that are mentioned a lot." It looks for **Trend Velocity**: topics whose mention rate is *accelerating*. A topic mentioned twice yesterday and twelve times today is highlighted over a topic mentioned six times every day for a year. + +## Theme Suggestions +The system clusters these high-velocity trends and automatically drafts "Theme Suggestions." These appear as pre-written context groupings summarizing *why* each trend is exploding. + +## Promoting a Theme +If you like an AI-suggested theme, click **Promote**. It instantly enters your Newsletter Drafts queue as an assembled section. + +## Dismissing Themes +If the AI is hallucinating a fake trend or tracking something you do not care about, hit **Dismiss** so it doesn't clutter your view. + +## Source Diversity Warnings +The dashboard will highlight if your trends are skewed entirely towards a single domain or community. A source diversity warning indicates you might be creating an echo chamber and should ingest broader sources. diff --git a/justfile b/justfile index a2087e95..57685254 100644 --- a/justfile +++ b/justfile @@ -1,4 +1,4 @@ -set dotenv-load := true +set dotenv-load := false compose := "docker compose" backend_env := "if [ ! -f .env ]; then cp .env.example .env; fi"