Discover sitemaps at docs subpath, not just origin #32

@dacharyc

Problem

When docs live at a subpath (e.g., https://swagger.io/docs/), discoverSitemapUrls() only checks:

  1. {origin}/robots.txt, for Sitemap: directives
  2. {origin}/sitemap.xml, as a fallback

If neither exists, discovery fails entirely, even when a valid sitemap is available at the docs URL path.
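For context, the first origin-level step amounts to scanning robots.txt for Sitemap: lines. The sketch below is an illustration only and is not taken from the afdocs source; the function name parseSitemapDirectives is hypothetical, and the real discoverSitemapUrls() may handle this differently.

```typescript
// Hypothetical helper, not the afdocs implementation.
// Extracts the URLs named by "Sitemap:" directives in a robots.txt body.
function parseSitemapDirectives(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    // The directive name is matched case-insensitively; the value is the
    // first run of non-whitespace characters after the colon.
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) {
      urls.push(match[1]);
    }
  }
  return urls;
}
```

When robots.txt is missing (as in the Swagger case below), there is no body to scan, so this step yields nothing and only the {origin}/sitemap.xml fallback remains.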

Example: Swagger UI docs at https://swagger.io/docs/ have a Starlight-generated sitemap at /docs/sitemap-index.xml (90 pages). But both robots.txt and /sitemap.xml return 404, so afdocs finds 0 pages and falls back to testing only the root URL.

Proposed fix

When the input URL has a path component (i.e., it's not at the origin root), add these as fallback candidates in discoverSitemapUrls() after the existing origin-level checks:

  • {url}/sitemap.xml
  • {url}/sitemap-index.xml

The sitemap-index.xml filename is the Astro/Starlight convention (Starlight v0.21.5+ generates this by default). Other SSGs may use sitemap.xml at the docs subpath.
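A minimal sketch of how the subpath candidates could be derived from the input URL. The helper name subpathSitemapCandidates is hypothetical and not part of afdocs; it assumes the input URL points at a directory-like docs root rather than a specific file, and it normalizes trailing slashes so that /docs and /docs/ produce the same candidates.

```typescript
// Hypothetical helper, not part of afdocs. Assumes inputUrl is an absolute
// URL whose path is the docs root (e.g. https://swagger.io/docs/).
function subpathSitemapCandidates(inputUrl: string): string[] {
  const url = new URL(inputUrl);
  // Strip trailing slashes so "/docs/" and "/docs" behave identically
  // and the joined candidate never contains "//".
  const basePath = url.pathname.replace(/\/+$/, "");
  // At the origin root there is nothing to add: the existing
  // origin-level checks already cover this case.
  if (basePath === "") {
    return [];
  }
  const base = `${url.origin}${basePath}`;
  return [`${base}/sitemap.xml`, `${base}/sitemap-index.xml`];
}
```

For https://swagger.io/docs/ this produces https://swagger.io/docs/sitemap.xml and https://swagger.io/docs/sitemap-index.xml, which would be tried only after the origin-level checks come up empty.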

Note: fetchDocsSitemap() in llms-txt-freshness.ts already checks {baseUrl}/sitemap.xml for the freshness comparison, but that logic isn't used during primary page discovery. This would bring the same awareness to the discovery path.

Affected sites

  • Swagger UI (swagger.io/docs/) — Astro Starlight, sitemap at /docs/sitemap-index.xml
  • Potentially any Starlight-based docs site hosted at a subpath
