Problem
Documentation sites that publish versioned docs (Docusaurus, MkDocs, Sphinx, Read the Docs) often include all versions in a single sitemap. When afdocs samples from these sitemaps, the deterministic stride spreads across versions, and the sample may not include any current-version pages at all.
In the Docusaurus scoring run, the sitemap contained docs across 11 version prefixes (/docs/2.x/, /docs/3.0.1/, /docs/3.1.1/, ... /docs/3.9.2/) plus ~186 unversioned (current) pages. The 69-page sample contained zero current-version pages; it included 11 from 2.x, 12 from 3.0.1, 12 from 3.1.1, and 7 from 3.2.1.
Proposed improvement
After collecting sitemap URLs (and applying path-prefix filtering per #TBD), detect whether the URL set contains versioned duplicates. Common patterns:
- /docs/2.x/foo, /docs/3.1.1/foo, /docs/foo (Docusaurus)
- /en/stable/foo, /en/latest/foo, /en/v2/foo (Sphinx/Read the Docs)
- /docs/1.0/foo, /docs/2.0/foo (generic versioning)
When detected, filter to the "current" version before sampling. The current version is typically the unversioned/unprefixed path, or the one matching latest/stable.
This is similar to the existing guidance for filtering localized duplicates (/cn/docs/, /ja/docs/), but for version prefixes instead of locale prefixes.
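As a rough illustration of the filtering step, here is a minimal Python sketch. It is not afdocs code: the names `VERSION_SEGMENT`, `CURRENT_ALIASES`, `split_version`, and `filter_to_current` are hypothetical, the regex is only one guess at what "version-like" means, and the preference order (unversioned first, then stable, then latest) is an assumption based on the patterns listed above.

```python
import re
from urllib.parse import urlparse

# Assumed shape of a version-like path segment: "v2", "2.x", "3.0.1", "1.0".
# Bare integers such as "2023" are deliberately not treated as versions.
VERSION_SEGMENT = re.compile(r"^(v\d+(\.(\d+|x))*|\d+(\.(\d+|x))+)$", re.IGNORECASE)
# Assumed aliases for the current docs (Sphinx / Read the Docs style).
CURRENT_ALIASES = ("stable", "latest")


def split_version(url):
    """Return (version_label_or_None, path_with_that_segment_removed)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    for i, seg in enumerate(segments):
        if VERSION_SEGMENT.match(seg) or seg.lower() in CURRENT_ALIASES:
            rest = segments[:i] + segments[i + 1:]
            return seg.lower(), "/" + "/".join(rest)
    return None, "/" + "/".join(segments)


def filter_to_current(urls):
    """Keep only URLs that look like the current version (best effort)."""
    labeled = [(split_version(u)[0], u) for u in urls]
    labels = {label for label, _ in labeled}
    if labels == {None}:
        return list(urls)  # no version-like segments detected
    # Prefer unversioned paths, then "stable", then "latest".
    for preferred in (None, *CURRENT_ALIASES):
        if preferred in labels:
            return [u for label, u in labeled if label == preferred]
    return list(urls)  # cannot tell which version is current; leave untouched
```

With the Docusaurus-style example above, `filter_to_current` would keep only `/docs/foo`; with the Read the Docs-style example it would keep only `/en/stable/foo`. When no unversioned or aliased variant exists, the sketch leaves the URL set unchanged rather than guessing.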
Complexity notes
- Version detection heuristics will vary by framework. A simple approach: group URLs by the path that remains once the version-like segment is removed, identify groups where the same remaining path appears under multiple version-like prefixes, and keep only the shortest/unversioned variant in each group.
- Some sites intentionally maintain old versions as separate doc sets (e.g., migration guides reference old versions). Filtering should be best-effort, not perfect.
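The grouping heuristic from the first note could be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: `VERSION_LIKE`, `suffix_key`, and `dedupe_versions` are hypothetical names, and "shortest path wins" stands in for "unversioned variant".

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed shape of a version-like path segment: "v2", "2.x", "3.0.1", "1.0".
VERSION_LIKE = re.compile(r"^(v\d+(\.(\d+|x))*|\d+(\.(\d+|x))+)$", re.IGNORECASE)


def suffix_key(url):
    """Path with version-like segments removed, used as the grouping key."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    kept = [s for s in segments if not VERSION_LIKE.match(s)]
    return "/" + "/".join(kept)


def dedupe_versions(urls):
    """Keep one URL per page: the variant with the shortest path."""
    groups = defaultdict(list)
    for url in urls:
        groups[suffix_key(url)].append(url)
    result = []
    for variants in groups.values():
        # The unversioned variant has the shortest path when it exists.
        # Ties are broken lexicographically, which is arbitrary.
        result.append(min(variants, key=lambda u: (len(urlparse(u).path), u)))
    return result
```

Because the grouping is per page, a page that exists only under an old version (for example a migration guide under /docs/2.x/) survives the filter, which fits the best-effort goal in the second note. One known weakness: when a page has several versioned variants and no unversioned one, the shortest-path rule does not necessarily pick the newest version.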
Context
Discovered during proportional scoring of Docusaurus (https://docusaurus.io/docs). See friction log entry: https://github.com/agent-ecosystem/agent-docs-report/blob/main/sites/friction-log.md#docusaurus