Support cursor-based pagination for large GitHub repos #238

@neuromechanist

Problem

GitHub's REST API limits page-based pagination to page 100 (10,000 items with per_page=100). Repositories with more than 10,000 issues/PRs (like mne-tools/mne-python with 13,000+) trigger an HTTP 422 error when requesting beyond page 100.

Current error:

HTTP 422 on page 101 for mne-tools/mne-python issues

This means issues/PRs beyond the first 10,000 are never synced.

Proposed Solution

Switch from page-based to cursor-based pagination, using either the REST API's since parameter (Option A) or GitHub's GraphQL API (Option B).

Option A: Use since parameter (REST API)

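For issues/PRs, pass the since parameter (an ISO 8601 timestamp) together with sort=updated and direction=asc, and advance since to the updated_at of the last item received instead of incrementing the page number. Note that since filters by last-update time, not creation time. Because every request asks for page 1 of a new time window, this avoids the page limit entirely.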
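A minimal sketch of how Option A could look, assuming the REST issues endpoint is queried with sort=updated and direction=asc so that since can act as a moving cursor. The names iter_issues_since, make_rest_fetcher, and FetchPage are hypothetical and not taken from github_sync.py. Since since matches items updated at or after the timestamp, the boundary item comes back again on the next request and is dropped by id.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Hypothetical callable: takes query parameters, returns one page of issue dicts.
FetchPage = Callable[[dict], list]


def make_rest_fetcher(repo: str, token: str) -> FetchPage:
    """Build a fetcher for GET /repos/{owner}/{repo}/issues (needs network access)."""
    def fetch(params: dict) -> list:
        query = urllib.parse.urlencode(params)
        url = f"https://api.github.com/repos/{repo}/issues?{query}"
        req = urllib.request.Request(url, headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        })
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch


def iter_issues_since(
    fetch_page: FetchPage,
    start: str = "1970-01-01T00:00:00Z",
    per_page: int = 100,
) -> Iterator[dict]:
    """Yield every issue/PR by sliding `since` forward; `page` always stays at 1."""
    since = start
    seen = set()  # ids already yielded, used to drop the repeated boundary item
    while True:
        page = fetch_page({
            "state": "all",
            "sort": "updated",
            "direction": "asc",
            "since": since,
            "per_page": per_page,
            "page": 1,
        })
        fresh = [item for item in page if item["id"] not in seen]
        for item in fresh:
            seen.add(item["id"])
            yield item
        # A short page means the end was reached. A page with nothing new means
        # the cursor cannot advance (more than per_page items share a timestamp);
        # this sketch stops there rather than looping forever.
        if len(page) < per_page or not fresh:
            break
        since = page[-1]["updated_at"]
```

Usage would be along the lines of `for issue in iter_issues_since(make_rest_fetcher("mne-tools/mne-python", token)): ...`. One caveat of this design: an item that is updated while the sync is running moves later in the ordering, so it may be yielded a second time only if its id was not already seen; callers that need the newest version would have to upsert by id instead of skipping.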

Option B: Switch to GraphQL API

GitHub's GraphQL API uses cursor-based pagination natively and has no page limit.
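A rough sketch of Option B, assuming a query against the repository.issues connection that follows pageInfo.endCursor until hasNextPage is false. The names ISSUES_QUERY, make_graphql_runner, and iter_issues_graphql are hypothetical, and the selected fields are only an example. In the GraphQL schema, issues and pullRequests are separate connections, so pull requests would need a second, analogous query.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

# Example query; the selected node fields are an assumption about what the sync needs.
ISSUES_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor,
           orderBy: {field: UPDATED_AT, direction: ASC}) {
      nodes { number title updatedAt }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""

# Hypothetical callable: (query, variables) -> the "data" object of the response.
RunQuery = Callable[[str, dict], dict]


def make_graphql_runner(token: str) -> RunQuery:
    """Build a runner that POSTs to GitHub's GraphQL endpoint (needs network access)."""
    def run(query: str, variables: dict) -> dict:
        body = json.dumps({"query": query, "variables": variables}).encode()
        req = urllib.request.Request(
            "https://api.github.com/graphql",
            data=body,
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req) as resp:
            payload = json.load(resp)
        if payload.get("errors"):
            raise RuntimeError(payload["errors"])
        return payload["data"]
    return run


def iter_issues_graphql(run_query: RunQuery, owner: str, name: str) -> Iterator[dict]:
    """Yield issue nodes page by page, following the opaque endCursor."""
    cursor: Optional[str] = None  # None requests the first page
    while True:
        data = run_query(ISSUES_QUERY, {"owner": owner, "name": name, "cursor": cursor})
        connection = data["repository"]["issues"]
        yield from connection["nodes"]
        if not connection["pageInfo"]["hasNextPage"]:
            break
        cursor = connection["pageInfo"]["endCursor"]
```

Passing the runner in as a parameter keeps the loop testable without network access; a test can supply a fake that returns canned pages.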

Implementation

  • Modify src/knowledge/github_sync.py sync functions
  • Replace page=N iteration with cursor-based approach
  • Keep backward compatibility with existing incremental sync logic
  • Add tests for repos with >10,000 items
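For the last item, one possible test fixture is an in-memory fake that enforces the same cap as the real API, so the >10,000 case can be exercised without network access. This is only a sketch: FakeIssuesAPI, PageLimitError, and count_with_page_walk are hypothetical names, and the page walk below imitates page=N iteration in general rather than the actual code in github_sync.py.

```python
from datetime import datetime, timedelta, timezone


class PageLimitError(Exception):
    """Stands in for the HTTP 422 returned beyond page 100 (hypothetical name)."""


class FakeIssuesAPI:
    """In-memory stand-in for the issues endpoint, including the page cap."""

    MAX_PAGE = 100

    def __init__(self, total: int):
        base = datetime(2020, 1, 1, tzinfo=timezone.utc)
        # One item per second, so updated_at is strictly increasing with id.
        self.items = [
            {
                "id": i,
                "updated_at": (base + timedelta(seconds=i)).strftime("%Y-%m-%dT%H:%M:%SZ"),
            }
            for i in range(total)
        ]

    def list_issues(self, params: dict) -> list:
        page = params.get("page", 1)
        per_page = params.get("per_page", 100)
        if page > self.MAX_PAGE:
            raise PageLimitError(f"HTTP 422 on page {page}")
        since = params.get("since", "")
        # Same-format UTC timestamps compare correctly as plain strings.
        matching = [item for item in self.items if item["updated_at"] >= since]
        start = (page - 1) * per_page
        return matching[start:start + per_page]


def count_with_page_walk(api: FakeIssuesAPI, per_page: int = 100) -> int:
    """Page-number iteration; fails once a repo needs more than MAX_PAGE pages."""
    count, page = 0, 1
    while True:
        batch = api.list_issues({"page": page, "per_page": per_page})
        count += len(batch)
        if len(batch) < per_page:
            return count
        page += 1
```

A regression test could first assert that count_with_page_walk raises PageLimitError for FakeIssuesAPI(13_000), then assert that the new cursor-based sync function returns all 13,000 items from the same fake.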

Context

Discovered during MNE community onboarding. mne-tools/mne-python has 13,000+ issues, exceeding GitHub's page-based pagination limit. The sync currently stops at 10,000 items silently (or errors on page 101).

Metadata

Assignees

No one assigned

    Labels

      • P2 (Priority 2: Important, fix when possible)
      • enhancement (New feature or request)
      • operations (Operations, monitoring, and observability)
      • testing (Testing and quality assurance)
