Feat/active field agencies #70

Open
miguellsfilho wants to merge 11 commits into main from feat/active-field-agencies

Conversation

@miguellsfilho
Contributor

No description provided.

Miguel Lopes and others added 7 commits February 13, 2026 16:00
Add support for marking agencies as active/inactive in the YAML config,
replacing the need to comment out problematic URLs. This preserves the
configuration history and allows programmatic filtering.

Changes:
- Update _load_urls_from_yaml() to filter inactive agencies
- Add _extract_url() and _is_agency_inactive() helper methods
- Migrate 6 commented agencies to active: false format
- Add 15 unit tests for the new functionality
- Document new YAML format in CLAUDE.md

The implementation maintains full backward compatibility with the
legacy string format during gradual migration.
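The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the YAML has already been parsed into a dict, that dict entries carry a `url` key (an assumed name), and that a missing `active` flag means the agency is active. The function names mirror the helpers named in the commit but are written here as standalone functions, and the example.org URLs are placeholders.

```python
from typing import Any, Dict, Union

# An entry is either the legacy plain URL string or the new dict format.
AgencyEntry = Union[str, Dict[str, Any]]


def extract_url(entry: AgencyEntry) -> str:
    """Return the URL from a legacy string entry or a dict entry."""
    if isinstance(entry, str):
        return entry
    return entry["url"]  # "url" is an assumed key name


def is_agency_inactive(entry: AgencyEntry) -> bool:
    """Only dict entries with an explicit `active: false` count as inactive."""
    if isinstance(entry, dict):
        return entry.get("active", True) is False
    return False


def load_active_urls(agencies: Dict[str, AgencyEntry]) -> Dict[str, str]:
    """Map agency key to URL, skipping inactive agencies."""
    return {
        key: extract_url(entry)
        for key, entry in agencies.items()
        if not is_agency_inactive(entry)
    }


# Hypothetical parsed config mixing both formats.
parsed = {
    "agency-a": "https://example.org/a/noticias",
    "agency-b": {"url": "https://example.org/b/noticias", "active": False},
    "agency-c": {"url": "https://example.org/c/noticias"},
}
active_urls = load_active_urls(parsed)
```

With this shape, an agency that used to be commented out stays in the file as an `active: false` entry and is simply skipped at load time.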

Closes #64

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…acy support

- Convert all 162 agencies in site_urls.yaml from string to dict format
- Remove legacy string format support from _extract_url() and _is_agency_inactive()
- Update unit tests to use only the new dict format (12 tests passing)
- Update CLAUDE.md documentation to reflect the single supported format

Note: mypy errors are pre-existing (run_scraper, _process_and_upload_data, _preprocess_data)
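For illustration only, a one-off conversion from the string format to the dict format might look like the sketch below. The `url` key name and the choice to write `active: True` explicitly on migrated entries are assumptions; the actual migration of site_urls.yaml may have been done differently (for example by hand).

```python
from typing import Any, Dict, Union


def to_dict_format(
    agencies: Dict[str, Union[str, Dict[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Convert every entry to the dict format, leaving dict entries unchanged."""
    migrated: Dict[str, Dict[str, Any]] = {}
    for key, entry in agencies.items():
        if isinstance(entry, str):
            # Legacy string entry: wrap it and mark it active (assumed default).
            migrated[key] = {"url": entry, "active": True}
        else:
            # Already a dict: copy so the input is not mutated.
            migrated[key] = dict(entry)
    return migrated
```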

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create ebc_urls.yaml config file with active/inactive sources
- Mark memoria.ebc.com.br as inactive (502 Bad Gateway - issue #50)
- Add agenciabrasil.ebc.com.br as new active source
- Adapt EBCWebScraper to receive base_url as required parameter
- Update scrape_index_page to support new HTML structure
- Add _load_urls_from_yaml methods to EBCScrapeManager
- Add 12 unit tests for EBCScrapeManager URL loading
- Update existing tests to pass base_url parameter
- Document ebc_urls.yaml configuration in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change ebc_urls.yaml root key from 'sources' to 'agencies'
- Rename keys from underscores to hyphens (agencia_brasil → agencia-brasil)
- Rename EBCScrapeManager methods/params: source → agency
- Update unit tests to match new naming convention
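As a rough sketch of what this rename amounts to on the parsed config, assuming the file is a mapping with a single root key: the helper name `normalize_ebc_config` is hypothetical and is not part of this PR.

```python
from typing import Any, Dict


def normalize_ebc_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Move entries under the 'agencies' root key and hyphenate their keys."""
    # Accept either root key so the helper works on old and new files.
    entries = config.get("agencies", config.get("sources", {}))
    return {
        "agencies": {key.replace("_", "-"): value for key, value in entries.items()}
    }
```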

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for scraping tvbrasil.ebc.com.br/noticias index page by
implementing Strategy 3 in scrape_index_page() method.

Changes:
- Add Strategy 3 to recognize TV Brasil HTML structure (view-ultimas
  class with h3.heading links)
- Enable tvbrasil agency in ebc_urls.yaml (active: true)
- Add unit tests for Strategy 3 (URL extraction, relative-to-absolute
  conversion, duplicate filtering)

This enables TV Brasil as a second functional EBC news source alongside
Agência Brasil.
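A sketch of the kind of extraction Strategy 3 describes is shown below. It is not the code in this PR: it uses the standard-library `html.parser` rather than whatever parser the scraper actually uses, assumes the `view-ultimas` container is a `div`, and the article paths in the sample markup are made up.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class HeadingLinkParser(HTMLParser):
    """Collect hrefs of <a> tags inside <h3 class="heading"> within a view-ultimas div."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []
        self._div_depth = 0       # > 0 while inside the view-ultimas div
        self._in_heading = False  # True while inside a matching <h3>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div":
            # Start counting at the container, then track nested divs.
            if self._div_depth or "view-ultimas" in classes:
                self._div_depth += 1
        elif self._div_depth and tag == "h3" and "heading" in classes:
            self._in_heading = True
        elif self._in_heading and tag == "a" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "h3":
            self._in_heading = False


def extract_article_urls(html: str, base_url: str) -> List[str]:
    """Return absolute, de-duplicated article URLs in page order."""
    parser = HeadingLinkParser()
    parser.feed(html)
    seen = set()
    urls: List[str] = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # relative-to-absolute conversion
        if absolute not in seen:            # duplicate filtering
            seen.add(absolute)
            urls.append(absolute)
    return urls


SAMPLE_HTML = """
<div class="view view-ultimas">
  <div><h3 class="heading"><a href="/noticias/2026/02/a">A</a></h3></div>
  <div><h3 class="heading"><a href="https://tvbrasil.ebc.com.br/noticias/2026/02/a">A again</a></h3></div>
  <div><h3 class="heading"><a href="/noticias/2026/02/b">B</a></h3></div>
</div>
<h3 class="heading"><a href="/outside">not in the container</a></h3>
"""
article_urls = extract_article_urls(SAMPLE_HTML, "https://tvbrasil.ebc.com.br/noticias")
```

The sample covers the three cases the unit tests mention: a relative link, the same article repeated as an absolute link, and a heading outside the container that should be ignored.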

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Align scrape-ebc CLI interface with scrape command by exposing
the agencies parameter that EBCScrapeManager already supports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _extract_url() helper to extract URL from both string and dict formats
- Add _is_agency_active() helper to check if agency is active
- Skip inactive agencies in duplicate URL validation
- Update tests to use _extract_url for URL extraction

This fixes compatibility between the URL validator (PR #69) and the new
dict format for agencies (Issue #64).
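To illustrate the validator change, a duplicate check that ignores inactive agencies could look roughly like this. The function `find_duplicate_urls`, the `url` key name, and the default of treating a missing `active` flag as active are assumptions made for this sketch; the validator from PR #69 may be structured differently.

```python
from collections import defaultdict
from typing import Any, Dict, List, Union

AgencyEntry = Union[str, Dict[str, Any]]


def extract_url(entry: AgencyEntry) -> str:
    """Return the URL from either a string entry or a dict entry."""
    return entry if isinstance(entry, str) else entry["url"]


def is_agency_active(entry: AgencyEntry) -> bool:
    """String entries and dicts without an `active` flag are treated as active."""
    if isinstance(entry, dict):
        return entry.get("active", True) is not False
    return True


def find_duplicate_urls(agencies: Dict[str, AgencyEntry]) -> Dict[str, List[str]]:
    """Return URLs shared by more than one active agency, with the agency keys."""
    keys_by_url: Dict[str, List[str]] = defaultdict(list)
    for key, entry in agencies.items():
        if not is_agency_active(entry):
            continue  # inactive agencies do not take part in the check
        keys_by_url[extract_url(entry)].append(key)
    return {url: keys for url, keys in keys_by_url.items() if len(keys) > 1}
```

Under this sketch, an inactive entry may keep the same URL as its active replacement without tripping the validator.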

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Miguel Lopes and others added 4 commits February 19, 2026 15:01
Add test_02b_scrape_ebc to validate EBC scraper functionality in the
integration pipeline. Uses a recent date (2026-02-19) and only tests
agencia-brasil due to TV Brasil date parsing issues.

Changes:
- Add EBC_DATE and EBC_AGENCIES constants
- Add test_02b_scrape_ebc function after gov.br scraping test
- Update final validation to include EBC statistics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update the link for the tvbrasil agency. The test_full_pipeline.py test
now scrapes news from this agency as well.

Changes:
- tvbrasil agency link in ebc_urls.yaml
- test_02b_scrape_ebc scrapes tvbrasil news
Change agency key from `agencia_brasil` to `agencia-brasil` for consistency
with the naming convention used in other agency keys. Update URLs in test
data to include `/ultimas` path and fix minor typo in docstring.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add deduplication logic before INSERT to handle race conditions where
the same article appears on multiple pages during scraping. This matches
the existing pattern in the HuggingFace backend (drop_duplicates).

The issue occurred when pagination content shifted during scraping,
causing the same article to be collected twice with the same unique_id.
PostgreSQL's ON CONFLICT cannot handle duplicate unique_ids within the
same batch INSERT statement.
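The deduplication step can be sketched as below, assuming each scraped article is a dict with a `unique_id` field and that the first occurrence is the one to keep. Which occurrence the real code keeps is an assumption here, chosen to mirror the default behaviour of pandas `drop_duplicates`.

```python
from typing import Any, Dict, Iterable, List


def dedupe_by_unique_id(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first row seen for each unique_id, preserving input order."""
    first_seen: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        # setdefault stores the row only if this unique_id is new.
        first_seen.setdefault(row["unique_id"], row)
    return list(first_seen.values())


# Hypothetical batch where pagination shifted and one article was collected twice.
batch = [
    {"unique_id": "abc", "title": "Article 1 (page 1)"},
    {"unique_id": "def", "title": "Article 2"},
    {"unique_id": "abc", "title": "Article 1 (page 2)"},
]
rows_to_insert = dedupe_by_unique_id(batch)
```

Running this before building the batch INSERT means each `unique_id` appears at most once in the statement, so ON CONFLICT only has to deal with rows already in the table.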

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
3 participants