Consider how multi-product doc sites with subsection URLs should be handled #25

@dacharyc

Context

Some documentation sites host many products under a single origin. GitHub Docs (docs.github.com) is a clear example: it covers Actions, Pages, Codespaces, Copilot, and dozens more products, all under one domain with shared infrastructure (llms.txt, sitemap, content negotiation).

When scoring a specific product's docs (e.g., docs.github.com/en/pages), the base URL points to a subsection rather than the doc root. This raises a few questions about how afdocs should behave.

Observations from GitHub Pages scoring

  • llms.txt lives at docs.github.com/llms.txt (the root), not under /en/pages/. It uses a Page List API pattern rather than direct page links. afdocs found it in both cases (root and subsection URL), so discovery worked.
  • Getting a page count required calling the Page List API and filtering the results to /en/pages paths (28 pages out of thousands). The llms.txt itself doesn't segment by product.
  • Deterministic sampling from docs.github.com/en/pages and docs.github.com produced identical scores (69 D) with the same 21 tested pages, suggesting the crawler found the same pages regardless of starting point.
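
To make the filtering step in the second observation concrete, here is a minimal sketch of scoping a site-wide page list to a subsection. It is not afdocs code: `in_subsection` and `scope_pages` are hypothetical helper names, and the sample URLs are illustrative only. It assumes "in scope" means same host plus a path-prefix match on a segment boundary.

```python
from urllib.parse import urlsplit


def in_subsection(page_url: str, base_url: str) -> bool:
    """Return True if page_url is on base_url's host and at or below its path."""
    page, base = urlsplit(page_url), urlsplit(base_url)
    if page.netloc.lower() != base.netloc.lower():
        return False
    base_path = base.path.rstrip("/")
    page_path = page.path.rstrip("/")
    # Match on a segment boundary so "/en/pages" does not also
    # capture a sibling such as "/en/pages-legacy".
    return page_path == base_path or page_path.startswith(base_path + "/")


def scope_pages(all_pages: list[str], base_url: str) -> list[str]:
    """Keep only the pages that fall under the base URL (hypothetical helper)."""
    return [url for url in all_pages if in_subsection(url, base_url)]


# Illustrative URLs only; these are not taken from a real Page List API response.
pages = [
    "https://docs.github.com/en/pages",
    "https://docs.github.com/en/pages/some-article",
    "https://docs.github.com/en/pages-legacy/some-article",
    "https://docs.github.com/en/actions/some-article",
]
scoped = scope_pages(pages, "https://docs.github.com/en/pages")
print(len(scoped), "of", len(pages), "pages are in scope")
```

When the base URL is the origin root, every same-host page passes the filter, so the same helper would cover both the scoped and unscoped case.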

Questions to consider

  1. Should afdocs be aware of subsection scoping? If a user passes example.com/product-a, should checks like llms.txt and sitemap discovery look only at the origin root, or should they also check for product-scoped variants?

  2. Should page counts be scoped? For a site like GitHub Docs, the total page count across all products is thousands, but the relevant count for a specific product subsection might be 28. The --max-links sample size depends on this number, so scoping matters for proportional accuracy.

  3. Should the output note when the base URL is a subsection? It might be useful for consumers of the JSON output to know that the tested URL is a subsection of a larger doc site, so they can interpret discoverability scores in context (e.g., llms.txt exists but is site-wide, not product-scoped).
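
One way to picture questions 1 and 3 together is a small helper that derives the origin, flags whether the base URL is a subsection, and lists the locations a scoped discovery check could try. This is a sketch under assumptions, not a proposal for afdocs' actual schema: `describe_scope` and every field name in its result (`origin`, `scope_path`, `is_subsection`, `discovery_candidates`) are hypothetical.

```python
from urllib.parse import urlsplit


def describe_scope(base_url: str, well_known: str = "llms.txt") -> dict:
    """Summarize how a base URL relates to its origin (hypothetical field names)."""
    parts = urlsplit(base_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]

    # Walk from the most specific path prefix up to the origin root, so a
    # product-scoped file would be tried before the site-wide one.
    candidates = []
    for depth in range(len(segments), -1, -1):
        prefix = "/".join(segments[:depth])
        if prefix:
            candidates.append(f"{origin}/{prefix}/{well_known}")
        else:
            candidates.append(f"{origin}/{well_known}")

    return {
        "base_url": base_url,
        "origin": origin,
        "scope_path": "/" + "/".join(segments),
        "is_subsection": bool(segments),
        "discovery_candidates": candidates,
    }


scope = describe_scope("https://docs.github.com/en/pages")
for url in scope["discovery_candidates"]:
    print(url)
```

Trying the most specific prefix first is a design choice made for this sketch; a root-first order would be equally easy to express and may suit sites where only the origin hosts these files.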

Not a bug

This isn't blocking anything today. In the GitHub Pages case, scores were identical from both URLs. But as afdocs is used across more multi-product sites (AWS, Microsoft Learn, Google Cloud), this could become more relevant.
