Detect versioned doc URLs in sitemaps and filter to current version #22

@dacharyc

Problem

Documentation sites that publish versioned docs (Docusaurus, MkDocs, Sphinx, Read the Docs) often include all versions in a single sitemap. When afdocs samples from these sitemaps, the deterministic stride spreads across versions, and the sample may not include any current-version pages at all.

In the Docusaurus scoring run, the sitemap contained docs across 11 version prefixes (/docs/2.x/, /docs/3.0.1/, /docs/3.1.1/, ... /docs/3.9.2/) plus ~186 unversioned (current) pages. The 69-page sample contained zero current-version pages. Every sampled page came from an old version, including 11 from 2.x, 12 from 3.0.1, 12 from 3.1.1, and 7 from 3.2.1.

Proposed improvement

After collecting sitemap URLs (and applying path-prefix filtering per #TBD), detect whether the URL set contains versioned duplicates. Common patterns:

  • /docs/2.x/foo, /docs/3.1.1/foo, /docs/foo (Docusaurus)
  • /en/stable/foo, /en/latest/foo, /en/v2/foo (Sphinx/Read the Docs)
  • /docs/1.0/foo, /docs/2.0/foo (generic versioning)

When detected, filter to the "current" version before sampling. The current version is typically the unversioned/unprefixed path, or the one matching latest/stable.
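As a rough illustration of the detection step, here is a minimal Python sketch that labels a URL with its first version-like path segment and picks which label to treat as current. The names `VERSION_SEGMENT`, `CURRENT_ALIASES`, `version_label`, and `current_label` are hypothetical and not part of afdocs. The regex and the preference order (unversioned, then stable, then latest) are assumptions based on the patterns listed above.

```python
import re
from typing import Iterable, Optional
from urllib.parse import urlparse

# Assumption: a segment is "version-like" if it looks like v2, v2.1, 2.x, 1.0 or 3.1.1.
# Bare integers such as "404" are deliberately not matched, since they are ambiguous.
VERSION_SEGMENT = re.compile(r"^(v\d+(\.(\d+|x))*|\d+(\.(\d+|x))+)$", re.IGNORECASE)

# Assumption: Read the Docs style aliases, listed in order of preference.
CURRENT_ALIASES = ("stable", "latest")


def version_label(url: str) -> Optional[str]:
    """Return the first version-like path segment, or None if the URL is unversioned."""
    for segment in urlparse(url).path.split("/"):
        if VERSION_SEGMENT.match(segment) or segment.lower() in CURRENT_ALIASES:
            return segment
    return None


def current_label(labels: Iterable[Optional[str]]) -> Optional[str]:
    """Pick the label to treat as current: unversioned first, then stable, then latest."""
    labels = set(labels)
    if None in labels:
        return None  # an unversioned variant exists, so prefer it
    lowered = {label.lower(): label for label in labels}
    for alias in CURRENT_ALIASES:
        if alias in lowered:
            return lowered[alias]
    # Only numbered versions were seen; this sketch does not guess which is newest.
    raise LookupError("no unversioned or stable/latest variant found")
```

With these heuristics, `/docs/2.x/foo` is labeled `2.x`, `/en/stable/foo` is labeled `stable`, and `/docs/foo` is treated as unversioned. A real implementation would likely need per-framework tuning.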

This is similar to the existing guidance for filtering localized duplicates (/cn/docs/, /ja/docs/), but for version prefixes instead of locale prefixes.

Complexity notes

  • Version detection heuristics will vary by framework. A simple approach: group URLs by path-after-prefix, identify groups where the same suffix appears under multiple version-like prefixes, and keep only the shortest/unversioned variant.
  • Some sites intentionally maintain old versions as separate doc sets (e.g., migration guides reference old versions). Filtering should be best-effort, not perfect.
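The grouping approach in the first note could be sketched roughly as follows. This is a best-effort illustration, not the afdocs implementation: `split_version`, `rank`, and `filter_to_current` are hypothetical names, and the version regex and stable/latest preference order are assumptions. URLs whose suffix appears under only one prefix are kept, so doc sets that exist only in an old version are not dropped.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: version-like segments look like v2, v2.1, 2.x, 1.0 or 3.1.1.
VERSION_SEGMENT = re.compile(r"^(v\d+(\.(\d+|x))*|\d+(\.(\d+|x))+)$", re.IGNORECASE)
ALIASES = ("stable", "latest")  # assumption: preferred in this order


def split_version(url):
    """Return (key, label): the path with its first version-like segment removed,
    and that segment (None when the URL is unversioned)."""
    parsed = urlparse(url)
    segments = parsed.path.strip("/").split("/")
    for i, seg in enumerate(segments):
        if VERSION_SEGMENT.match(seg) or seg.lower() in ALIASES:
            rest = segments[:i] + segments[i + 1:]
            return (parsed.netloc, "/".join(rest)), seg
    return (parsed.netloc, "/".join(segments)), None


def rank(label):
    """Lower is preferred: unversioned (0), stable (1), latest (2), numbered (3)."""
    if label is None:
        return 0
    lowered = label.lower()
    return 1 + ALIASES.index(lowered) if lowered in ALIASES else len(ALIASES) + 1


def filter_to_current(urls):
    """Keep one variant per duplicated page; leave everything else untouched."""
    groups = defaultdict(list)
    for url in urls:
        key, label = split_version(url)
        groups[key].append((label, url))

    kept = set()
    for variants in groups.values():
        labels = {label for label, _ in variants}
        best = min(rank(label) for label in labels)
        if len(labels) < 2 or best > len(ALIASES):
            # Not duplicated, or only numbered versions: no safe choice, keep all.
            kept.update(url for _, url in variants)
        else:
            kept.update(url for label, url in variants if rank(label) == best)
    return [url for url in urls if url in kept]  # preserve sitemap order
```

For example, given `/docs/2.x/foo`, `/docs/3.1.1/foo`, and `/docs/foo`, only `/docs/foo` survives. A group containing only `/docs/1.0/foo` and `/docs/2.0/foo` is left intact, because this sketch does not try to order numbered versions. A stricter variant could instead drop every versioned URL once duplication is detected anywhere in the sitemap.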

Context

Discovered during proportional scoring of Docusaurus (https://docusaurus.io/docs). See friction log entry: https://github.com/agent-ecosystem/agent-docs-report/blob/main/sites/friction-log.md#docusaurus
