Problem
When a sitemap index contains per-locale sub-sitemaps, afdocs fetches and merges all of them. This causes the deterministic sample to draw from translations rather than the default (typically English) documentation.
Django is the clearest example. The sitemap index at docs.djangoproject.com/sitemap.xml contains 12 locale-specific sitemaps.
Because deterministic sampling sorts URLs alphabetically and strides through them, the Django proportional run sampled 129 pages entirely from the Greek (el) locale, with zero English pages. The Greek sitemap appears first alphabetically, so it dominates the sample window.
This is distinct from version filtering (#22), which operates on URL patterns within a single sitemap. Here, the sitemap index already provides clean locale separation; afdocs just needs to select the right sub-sitemap rather than merging all of them.
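To illustrate the separation a sitemap index provides, here is a minimal sketch that lists the sub-sitemap URLs in an index using Python's standard library. The helper name `sub_sitemap_urls` and the example index are hypothetical; the example is not Django's actual file and this is not how afdocs itself parses sitemaps.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sub_sitemap_urls(index_xml: str) -> List[str]:
    """Return the <loc> URL of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    urls = []
    for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical index for illustration only.
EXAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://docs.example.com/sitemap-el.xml</loc></sitemap>
  <sitemap><loc>https://docs.example.com/sitemap-en.xml</loc></sitemap>
  <sitemap><loc>https://docs.example.com/sitemap-fr.xml</loc></sitemap>
</sitemapindex>"""

for url in sub_sitemap_urls(EXAMPLE_INDEX):
    print(url)
```

Each entry is a separate file, so a locale filter can be applied to this list before any sub-sitemap is fetched.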
Proposed improvement
When a sitemap index contains sub-sitemaps whose filenames or URLs indicate locale variants (e.g., sitemap-en.xml, sitemap-fr.xml, or paths containing /en/, /fr/), select only the default locale before merging.
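Detection of locale variants could look roughly like the sketch below. The function name `detect_locale` and the regular expressions are assumptions made for illustration. The pattern is deliberately narrow and best-effort: it will misread short non-locale names such as `api` or `dev` as locales.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed locale shape: two or three letters plus an optional region or
# script subtag, e.g. "en", "pt-br", "zh-hans".
_LOCALE = r"[a-z]{2,3}(?:[-_][a-z0-9]{2,4})?"
_FILENAME_RE = re.compile(rf"^sitemap[-_.]({_LOCALE})\.xml$", re.IGNORECASE)
_SEGMENT_RE = re.compile(_LOCALE, re.IGNORECASE)


def detect_locale(sitemap_url: str) -> Optional[str]:
    """Return a lowercase locale tag found in the URL, or None."""
    segments = [s for s in urlparse(sitemap_url).path.split("/") if s]
    if not segments:
        return None
    # Filename form: sitemap-en.xml
    match = _FILENAME_RE.match(segments[-1])
    if match:
        return match.group(1).lower()
    # Directory form: /fr/sitemap.xml
    for segment in segments[:-1]:
        if _SEGMENT_RE.fullmatch(segment):
            return segment.lower()
    return None


print(detect_locale("https://docs.example.com/sitemap-en.xml"))
print(detect_locale("https://docs.example.com/fr/sitemap.xml"))
print(detect_locale("https://docs.example.com/sitemap.xml"))
```

One way to limit false positives is to treat an index as locale-split only when several of its sub-sitemaps yield a locale.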
Heuristics for default locale:
Prefer en if present
Otherwise, prefer the unprefixed or shortest variant
If ambiguous, fall back to merging all (current behavior)
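The heuristics above can be sketched as a small selection function. The name `select_sitemaps` and its input shape (a mapping from detected locale to sub-sitemap URL, with `None` standing for a sitemap that has no locale marker) are hypothetical. "Shortest variant" is read here as the shortest URL, which is one possible interpretation.

```python
from typing import Dict, List, Optional


def select_sitemaps(by_locale: Dict[Optional[str], str]) -> List[str]:
    """Pick the default-locale sub-sitemap, or return all when unsure."""
    all_urls = list(by_locale.values())
    locales = [loc for loc in by_locale if loc is not None]
    if not locales:
        # No locale variants detected: nothing to filter.
        return all_urls
    # 1. Prefer "en" if present.
    if "en" in by_locale:
        return [by_locale["en"]]
    # 2. Otherwise prefer the unprefixed variant ...
    if None in by_locale:
        return [by_locale[None]]
    # ... or the shortest URL, but only when exactly one URL is shortest.
    shortest = min(len(url) for url in all_urls)
    candidates = [url for url in all_urls if len(url) == shortest]
    if len(candidates) == 1:
        return candidates
    # 3. Ambiguous: fall back to merging everything (current behavior).
    return all_urls


print(select_sitemaps({
    "el": "https://docs.example.com/sitemap-el.xml",
    "en": "https://docs.example.com/sitemap-en.xml",
    "fr": "https://docs.example.com/sitemap-fr.xml",
}))
```

Returning a list in every branch means the merge step after selection does not change. It receives one URL when a default locale is found and the full set otherwise.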
Complexity notes
Detection should be straightforward, since locale sitemaps typically follow naming conventions (sitemap-{locale}.xml or {locale}/sitemap.xml).
Some sites may use locale sitemaps for genuinely distinct content (not translations). Best-effort filtering is acceptable.
Context
Discovered during proportional scoring of Django (https://docs.djangoproject.com). After locale filtering, the English sitemap still contains 10,457 URLs across 17 versions (1.8 through 6.0 plus dev), which is the version filtering problem tracked in #22.