docs: update site after scraper extraction to standalone repo #31

Merged
nitaibezerra merged 5 commits into main from docs/update-scraper-extraction on Feb 25, 2026
@nitaibezerra
Contributor

Summary

  • Rewrite modulos/scraper.md: standalone repo with API endpoints, DAGs, deploy info
  • Update modulos/data-platform.md: remove scraping, CLI scrape commands, StorageAdapter, env vars
  • Rewrite workflows/scraper-pipeline.md: two-stage architecture (Airflow scraping + GH Actions enrichment)
  • Update workflows/docker-builds.md: scraper uses Artifact Registry + Cloud Run (not GHCR)
  • Update workflows/airflow-dags.md: add scraper DAGs, bucket subdirectories, remove pandas
  • Update onboarding/setup-backend.md: remove scrape CLI commands, point to scraper repo
  • Update arquitetura/fluxo-de-dados.md: diagrams and steps reflect Airflow-based scraping

Context

The scraper was extracted from data-platform to destaquesgovbr/scraper. All 7 doc pages had stale references to the old architecture (CLI scraping, GHCR images, single-repo pipeline).

Test plan

  • mkdocs build compiles without errors
  • Zero references to data-platform scrape or ghcr.io/...scraper remain
  • Internal links verified
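
The "zero references" check in the test plan could be automated along these lines. This is a minimal sketch, not the check actually used for this PR: the function name `find_stale_references`, the `docs` directory layout, and the two regular expressions are assumptions. In particular, the `ghcr.io/...scraper` pattern is approximated with a wildcard because the original elides the middle of the image path.

```python
import re
from pathlib import Path

# Assumed patterns for the stale references named in the test plan.
# The second one uses \S* as a stand-in for the elided part of the image path.
STALE_PATTERNS = [
    re.compile(r"data-platform scrape"),
    re.compile(r"ghcr\.io/\S*scraper"),
]


def find_stale_references(docs_dir):
    """Return (path, line_number, line) for every Markdown line matching a stale pattern."""
    hits = []
    for path in sorted(Path(docs_dir).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if any(pattern.search(line) for pattern in STALE_PATTERNS):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_stale_references("docs")` from the repository root and failing the build when the returned list is non-empty would turn the manual check into a repeatable one.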

🤖 Generated with Claude Code

nitaibezerra and others added 5 commits January 13, 2026 09:59
- Add docs/arquitetura/postgresql.md with the detailed schema
- Add docs/modulos/data-platform.md documenting the new repo
- Add docs/workflows/airflow-dags.md with the Composer DAGs
- Update diagrams in visao-geral.md and fluxo-de-dados.md
- Update index.md with the new architecture and repositories
- Update componentes-estruturantes.md (HF is now the distribution layer)
- Update mkdocs.yml with the new files in the navigation

Main change: PostgreSQL (Cloud SQL) is now the source of truth, and
HuggingFace becomes the distribution layer for open data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Simplify scraper.md with a redirect to data-platform
- Update typesense-local.md to use PostgreSQL as the source
- Update cogfy-integracao.md with the PostgreSQL flow
- Update scraper-pipeline.md with 7 jobs (including embeddings)
- Update typesense-data.md to sync from PostgreSQL
- Update arquitetura-gcp.md with Cloud SQL, Composer, Embeddings API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update setup-backend.md to use data-platform and PostgreSQL
- Update roteiro-onboarding.md with the new repositories
- Update setup-datascience.md with the PostgreSQL diagram
- Remove references to the archived repos (scraper, typesense)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rewrite scraper module page for the new standalone repo with API + DAGs.
Remove scraping from data-platform module (CLI, StorageAdapter, env vars).
Rewrite scraper pipeline as two-stage architecture (Airflow + GH Actions).
Update docker builds, airflow DAGs, onboarding, and data flow diagrams.
…xtraction

# Conflicts:
#	docs/arquitetura/fluxo-de-dados.md
#	docs/modulos/data-platform.md
#	docs/modulos/scraper.md
#	docs/onboarding/setup-backend.md
#	docs/workflows/airflow-dags.md
#	docs/workflows/scraper-pipeline.md
@nitaibezerra nitaibezerra merged commit dd66900 into main Feb 25, 2026