Filter sitemap index to default locale before merging sub-sitemaps #30

@dacharyc

Problem

When a sitemap index contains per-locale sub-sitemaps, afdocs fetches and merges all of them. This causes the deterministic sample to draw from translations rather than the default (typically English) documentation.

Django is the clearest example. The sitemap index at docs.djangoproject.com/sitemap.xml contains 12 locale-specific sitemaps:

sitemap-el.xml
sitemap-en.xml
sitemap-es.xml
sitemap-fr.xml
...

Because deterministic sampling sorts alphabetically and strides, the Django proportional run sampled 129 pages entirely from the Greek (el) locale — zero English pages. The Greek sitemap appears first alphabetically, so it dominates the sample window.

This is distinct from version filtering (#22), which operates on URL patterns within a single sitemap. Here, the sitemap index already provides clean locale separation; afdocs just needs to select the right sub-sitemap rather than merging all of them.

Proposed improvement

When a sitemap index contains sub-sitemaps whose filenames or URLs indicate locale variants (e.g., sitemap-en.xml, sitemap-fr.xml, or paths containing /en/, /fr/), select only the default locale before merging.

Heuristics for default locale:

  • Prefer en if present
  • Otherwise, prefer the unprefixed or shortest variant
  • If ambiguous, fall back to merging all (current behavior)
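The heuristics above could be sketched roughly as follows. This is only an illustration, not afdocs code: `select_default_locale` and its input shape (a mapping from locale tag to sub-sitemap URLs, with `None` for unprefixed entries) are hypothetical, and treating "shortest variant" as "the single shortest locale tag" is one possible reading of that heuristic.

```python
from typing import Dict, List, Optional


def select_default_locale(by_locale: Dict[Optional[str], List[str]]) -> List[str]:
    """Pick the sub-sitemaps to fetch from a locale-grouped sitemap index.

    Hypothetical helper. Keys are lowercase locale tags such as "en" or
    "pt-br"; the key None holds sub-sitemaps with no locale marker.
    """
    all_urls = [url for urls in by_locale.values() for url in urls]
    tagged = [loc for loc in by_locale if loc is not None]

    # Fewer than two tagged locales: not treated as a locale split, merge all.
    if len(tagged) < 2:
        return all_urls

    # 1. Prefer en if present.
    if "en" in by_locale:
        return by_locale["en"]

    # 2. Otherwise prefer the unprefixed variant...
    if None in by_locale:
        return by_locale[None]

    # ...or the shortest tag, but only when exactly one tag is shortest
    # (assumed interpretation of "shortest variant").
    shortest = min(len(loc) for loc in tagged)
    candidates = [loc for loc in tagged if len(loc) == shortest]
    if len(candidates) == 1:
        return by_locale[candidates[0]]

    # 3. Ambiguous: fall back to merging all (current behavior).
    return all_urls
```

For the Django index, a mapping with `el`, `en`, `es`, and `fr` keys would return only the `sitemap-en.xml` entry, while an index with just `fr` and `de` would stay ambiguous and keep the current merge-all behavior.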

Complexity notes

  • Detection should be straightforward since locale sitemaps typically follow naming conventions (sitemap-{locale}.xml or {locale}/sitemap.xml)
  • This is simpler than version filtering (#22) because the sitemap structure already provides the separation
  • Some sites may use locale sitemaps for genuinely distinct content (not translations). Best-effort filtering is acceptable.
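As a rough sketch of what detection against those two naming conventions might look like: the function name `detect_locale`, the regexes, and the assumed shape of a locale tag (two or three letters, optionally followed by a region or script suffix) are all illustrative assumptions rather than anything afdocs defines today.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed shape of a locale tag: "en", "pt-br", "zh_hans". Best-effort only.
_TAG = r"[a-z]{2,3}(?:[-_][a-z0-9]{2,4})?"

# Convention 1: sitemap-{locale}.xml
_FILE = re.compile(rf"(?:^|/)sitemap[-_]({_TAG})\.xml$", re.IGNORECASE)
# Convention 2: {locale}/sitemap.xml
_DIR = re.compile(rf"/({_TAG})/sitemap\.xml$", re.IGNORECASE)


def detect_locale(url: str) -> Optional[str]:
    """Return a normalized locale tag for a sub-sitemap URL, or None.

    Hypothetical helper: it only inspects the URL path, so a short
    non-locale segment such as "api" would be misread as a locale.
    """
    path = urlparse(url).path
    for pattern in (_FILE, _DIR):
        match = pattern.search(path)
        if match:
            # Normalize "pt_BR" and "pt-br" to the same key.
            return match.group(1).lower().replace("_", "-")
    return None
```

Because this matches on shape alone, it cannot tell translations from genuinely distinct content, which fits the best-effort framing above. Checking candidates against a list of known language codes would be one way to cut false positives.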

Context

Discovered during proportional scoring of Django (https://docs.djangoproject.com). After locale filtering, the English sitemap still contains 10,457 URLs across 17 versions (1.8 through 6.0 plus dev), which is the version filtering problem tracked in #22.
