Add support for marking agencies as active/inactive in the YAML config, replacing the need to comment out problematic URLs. This preserves the configuration history and allows programmatic filtering.

Changes:
- Update _load_urls_from_yaml() to filter inactive agencies
- Add _extract_url() and _is_agency_inactive() helper methods
- Migrate 6 commented agencies to active: false format
- Add 15 unit tests for the new functionality
- Document the new YAML format in CLAUDE.md

The implementation maintains full backward compatibility with the legacy string format during gradual migration.

Closes #64

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…acy support

- Convert all 162 agencies in site_urls.yaml from string to dict format
- Remove legacy string format support from _extract_url() and _is_agency_inactive()
- Update unit tests to use only the new dict format (12 tests passing)
- Update CLAUDE.md documentation to reflect the single supported format

Note: mypy errors are pre-existing (run_scraper, _process_and_upload_data, _preprocess_data)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create ebc_urls.yaml config file with active/inactive sources
- Mark memoria.ebc.com.br as inactive (502 Bad Gateway - issue #50)
- Add agenciabrasil.ebc.com.br as a new active source
- Adapt EBCWebScraper to receive base_url as a required parameter
- Update scrape_index_page to support the new HTML structure
- Add _load_urls_from_yaml methods to EBCScrapeManager
- Add 12 unit tests for EBCScrapeManager URL loading
- Update existing tests to pass the base_url parameter
- Document the ebc_urls.yaml configuration in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change ebc_urls.yaml root key from 'sources' to 'agencies'
- Rename keys from underscores to hyphens (agencia_brasil → agencia-brasil)
- Rename EBCScrapeManager methods/params: source → agency
- Update unit tests to match the new naming convention

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
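After this rename, ebc_urls.yaml plausibly looks like the fragment below. Only the root key, the hyphenated agency keys, and the active/inactive states are taken from the commits; the nested field names (`url`, `active`) are assumptions.

```yaml
agencies:
  agencia-brasil:
    url: https://agenciabrasil.ebc.com.br
    active: true
  memoria:
    url: https://memoria.ebc.com.br
    active: false   # 502 Bad Gateway - issue #50
  tvbrasil:
    url: https://tvbrasil.ebc.com.br/noticias
    active: false
```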
Add support for scraping the tvbrasil.ebc.com.br/noticias index page by implementing Strategy 3 in the scrape_index_page() method.

Changes:
- Add Strategy 3 to recognize the TV Brasil HTML structure (view-ultimas class with h3.heading links)
- Enable the tvbrasil agency in ebc_urls.yaml (active: true)
- Add unit tests for Strategy 3 (URL extraction, relative-to-absolute conversion, duplicate filtering)

This enables TV Brasil as a second functional EBC news source alongside Agência Brasil.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
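A stdlib-only sketch of what Strategy 3 might do, covering the three behaviors the tests exercise (URL extraction from h3.heading links, relative-to-absolute conversion, duplicate filtering). This is not the repository's code: the class and function names are invented, and for brevity it does not scope matching to the view-ultimas container.

```python
# Minimal sketch of the Strategy 3 extraction described in the commit.
# Assumes each article link is an <a href> inside <h3 class="heading">.
from html.parser import HTMLParser
from urllib.parse import urljoin

class TVBrasilIndexParser(HTMLParser):
    """Collects hrefs of links nested inside h3.heading elements."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "").split()
        if tag == "h3" and "heading" in classes:
            self._in_heading = True
        elif tag == "a" and self._in_heading and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_heading = False

def extract_tvbrasil_links(html, base_url):
    parser = TVBrasilIndexParser()
    parser.feed(html)
    seen, urls = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)   # relative -> absolute
        if url not in seen:             # filter duplicates
            seen.add(url)
            urls.append(url)
    return urls
```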
Align the scrape-ebc CLI interface with the scrape command by exposing the agencies parameter that EBCScrapeManager already supports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _extract_url() helper to extract the URL from both string and dict formats
- Add _is_agency_active() helper to check whether an agency is active
- Skip inactive agencies in duplicate URL validation
- Update tests to use _extract_url for URL extraction

This fixes compatibility between the URL validator (PR #69) and the new dict format for agencies (issue #64).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
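The validator change might be sketched as follows. Only the skip-inactive behavior and the string/dict duality come from the commit; the function name and duplicate-reporting logic are hypothetical.

```python
# Hypothetical sketch: duplicate-URL validation that ignores inactive
# agencies, so a retired URL can coexist with its active replacement.
from collections import Counter

def find_duplicate_urls(agencies):
    """Return URLs that appear in more than one *active* agency entry."""
    urls = []
    for entry in agencies:
        if isinstance(entry, dict) and entry.get("active", True) is False:
            continue                                    # skip inactive agencies
        urls.append(entry if isinstance(entry, str) else entry["url"])
    return sorted(url for url, count in Counter(urls).items() if count > 1)
```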
Add test_02b_scrape_ebc to validate EBC scraper functionality in the integration pipeline. Uses a recent date (2026-02-19) and only tests agencia-brasil due to TV Brasil date parsing issues.

Changes:
- Add EBC_DATE and EBC_AGENCIES constants
- Add the test_02b_scrape_ebc function after the gov.br scraping test
- Update the final validation to include EBC statistics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update the tvbrasil agency link. The test_full_pipeline.py test now scrapes news from this agency as well.

Changes:
- Update the tvbrasil agency link in ebc_urls.yaml
- test_02b_scrape_ebc now scrapes tvbrasil news
Change the agency key from `agencia_brasil` to `agencia-brasil` for consistency with the naming convention used in other agency keys. Update URLs in test data to include the `/ultimas` path and fix a minor typo in a docstring.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add deduplication logic before INSERT to handle race conditions where the same article appears on multiple pages during scraping. This matches the existing pattern in the HuggingFace backend (drop_duplicates).

The issue occurred when pagination content shifted during scraping, causing the same article to be collected twice with the same unique_id. PostgreSQL's ON CONFLICT cannot handle duplicate unique_ids within the same batch INSERT statement.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
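The client-side deduplication described above can be sketched in one line of pandas, mirroring the drop_duplicates pattern the commit attributes to the HuggingFace backend. The function name and the `unique_id` column choice as the dedup key are assumptions based on the commit text.

```python
# Hypothetical sketch: drop in-batch duplicates before the batch INSERT,
# since PostgreSQL's ON CONFLICT cannot resolve two rows sharing the same
# unique_id within a single INSERT statement.
import pandas as pd

def dedupe_before_insert(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each unique_id within the batch."""
    return df.drop_duplicates(subset="unique_id", keep="first")
```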
mauriciomendonca approved these changes on Feb 23, 2026.