Add support for marking agencies as active/inactive in the YAML config, replacing the need to comment out problematic URLs. This preserves the configuration history and allows programmatic filtering.

Changes:
- Update _load_urls_from_yaml() to filter inactive agencies
- Add _extract_url() and _is_agency_inactive() helper methods
- Migrate 6 commented agencies to active: false format
- Add 15 unit tests for the new functionality
- Document the new YAML format in CLAUDE.md

The implementation maintains full backward compatibility with the legacy string format during gradual migration.

Closes #64

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…acy support

- Convert all 162 agencies in site_urls.yaml from string to dict format
- Remove legacy string format support from _extract_url() and _is_agency_inactive()
- Update unit tests to use only the new dict format (12 tests passing)
- Update CLAUDE.md documentation to reflect the single supported format

Note: mypy errors are pre-existing (run_scraper, _process_and_upload_data, _preprocess_data)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create ebc_urls.yaml config file with active/inactive sources
- Mark memoria.ebc.com.br as inactive (502 Bad Gateway - issue #50)
- Add agenciabrasil.ebc.com.br as a new active source
- Adapt EBCWebScraper to receive base_url as a required parameter
- Update scrape_index_page to support the new HTML structure
- Add _load_urls_from_yaml methods to EBCScrapeManager
- Add 12 unit tests for EBCScrapeManager URL loading
- Update existing tests to pass the base_url parameter
- Document the ebc_urls.yaml configuration in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change ebc_urls.yaml root key from 'sources' to 'agencies'
- Rename keys from underscores to hyphens (agencia_brasil → agencia-brasil)
- Rename EBCScrapeManager methods/params: source → agency
- Update unit tests to match the new naming convention

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
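After this rename, ebc_urls.yaml plausibly looks like the fragment below. Only the root key, the hyphenated agency keys, and the active/inactive states are taken from the commits; the nested field names (`url`, `active`) are assumptions.

```yaml
agencies:
  agencia-brasil:
    url: https://agenciabrasil.ebc.com.br
    active: true
  memoria:
    url: https://memoria.ebc.com.br
    active: false   # 502 Bad Gateway - issue #50
  tvbrasil:
    url: https://tvbrasil.ebc.com.br/noticias
    active: false
```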
Add support for scraping the tvbrasil.ebc.com.br/noticias index page by implementing Strategy 3 in the scrape_index_page() method.

Changes:
- Add Strategy 3 to recognize the TV Brasil HTML structure (view-ultimas class with h3.heading links)
- Enable the tvbrasil agency in ebc_urls.yaml (active: true)
- Add unit tests for Strategy 3 (URL extraction, relative-to-absolute conversion, duplicate filtering)

This enables TV Brasil as a second functional EBC news source alongside Agência Brasil.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
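A stdlib-only sketch of what Strategy 3 might do, covering the three behaviors the tests exercise (URL extraction from h3.heading links, relative-to-absolute conversion, duplicate filtering). This is not the repository's code: the class and function names are invented, and for brevity it does not scope matching to the view-ultimas container.

```python
# Minimal sketch of the Strategy 3 extraction described in the commit.
# Assumes each article link is an <a href> inside <h3 class="heading">.
from html.parser import HTMLParser
from urllib.parse import urljoin

class TVBrasilIndexParser(HTMLParser):
    """Collects hrefs of links nested inside h3.heading elements."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "").split()
        if tag == "h3" and "heading" in classes:
            self._in_heading = True
        elif tag == "a" and self._in_heading and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_heading = False

def extract_tvbrasil_links(html, base_url):
    parser = TVBrasilIndexParser()
    parser.feed(html)
    seen, urls = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)   # relative -> absolute
        if url not in seen:             # filter duplicates
            seen.add(url)
            urls.append(url)
    return urls
```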
Align the scrape-ebc CLI interface with the scrape command by exposing the agencies parameter that EBCScrapeManager already supports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _extract_url() helper to extract the URL from both string and dict formats
- Add _is_agency_active() helper to check whether an agency is active
- Skip inactive agencies in duplicate URL validation
- Update tests to use _extract_url for URL extraction

This fixes compatibility between the URL validator (PR #69) and the new dict format for agencies (issue #64).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
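The validator change might be sketched as follows. Only the skip-inactive behavior and the string/dict duality come from the commit; the function name and duplicate-reporting logic are hypothetical.

```python
# Hypothetical sketch: duplicate-URL validation that ignores inactive
# agencies, so a retired URL can coexist with its active replacement.
from collections import Counter

def find_duplicate_urls(agencies):
    """Return URLs that appear in more than one *active* agency entry."""
    urls = []
    for entry in agencies:
        if isinstance(entry, dict) and entry.get("active", True) is False:
            continue                                    # skip inactive agencies
        urls.append(entry if isinstance(entry, str) else entry["url"])
    return sorted(url for url, count in Counter(urls).items() if count > 1)
```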
Add test_02b_scrape_ebc to validate EBC scraper functionality in the integration pipeline. Uses a recent date (2026-02-19) and only tests agencia-brasil due to TV Brasil date parsing issues.

Changes:
- Add EBC_DATE and EBC_AGENCIES constants
- Add the test_02b_scrape_ebc function after the gov.br scraping test
- Update the final validation to include EBC statistics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update the tvbrasil agency link. The test_full_pipeline.py test now scrapes news from this agency as well.

Changes:
- Update the tvbrasil agency link in ebc_urls.yaml
- test_02b_scrape_ebc now scrapes tvbrasil news
Change the agency key from `agencia_brasil` to `agencia-brasil` for consistency with the naming convention used in other agency keys. Update URLs in test data to include the `/ultimas` path and fix a minor typo in a docstring.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add deduplication logic before INSERT to handle race conditions where the same article appears on multiple pages during scraping. This matches the existing pattern in the HuggingFace backend (drop_duplicates).

The issue occurred when pagination content shifted during scraping, causing the same article to be collected twice with the same unique_id. PostgreSQL's ON CONFLICT cannot handle duplicate unique_ids within the same batch INSERT statement.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
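The client-side deduplication described above can be sketched in one line of pandas, mirroring the drop_duplicates pattern the commit attributes to the HuggingFace backend. The function name and the `unique_id` column choice as the dedup key are assumptions based on the commit text.

```python
# Hypothetical sketch: drop in-batch duplicates before the batch INSERT,
# since PostgreSQL's ON CONFLICT cannot resolve two rows sharing the same
# unique_id within a single INSERT statement.
import pandas as pd

def dedupe_before_insert(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each unique_id within the batch."""
    return df.drop_duplicates(subset="unique_id", keep="first")
```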
mauriciomendonca approved these changes on Feb 23, 2026.