Context
Some documentation sites host many products under a single origin. GitHub Docs (docs.github.com) is a clear example: it covers Actions, Pages, Codespaces, Copilot, and dozens more products, all under one domain with shared infrastructure (llms.txt, sitemap, content negotiation).
When scoring a specific product's docs (e.g., docs.github.com/en/pages), the base URL points to a subsection rather than the doc root. This raises a few questions about how afdocs should behave.
Observations from GitHub Pages scoring
- llms.txt lives at docs.github.com/llms.txt (the root), not under /en/pages/. It uses a Page List API pattern rather than direct page links. afdocs found it in both cases (root and subsection URL), so discovery worked.
- Getting a page count for the product required querying the Page List API and filtering to /en/pages paths (28 pages out of thousands). The llms.txt itself doesn't segment by product.
- Deterministic sampling from docs.github.com/en/pages and docs.github.com produced identical scores (69 D) with the same 21 tested pages, suggesting the crawler found the same pages regardless of starting point.
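For reference, the path filtering described in the second observation can be sketched roughly as follows. This is a minimal illustration, not how afdocs does it: `scope_pages` is a hypothetical helper, and the sample URLs (including `/en/pages-extra`) are made up to show the edge case rather than taken from the real Page List API output.

```python
from urllib.parse import urlsplit


def scope_pages(page_urls, base_url):
    """Keep only the URLs that fall under base_url's path on the same host."""
    base = urlsplit(base_url)
    prefix = base.path.rstrip("/")
    scoped = []
    for url in page_urls:
        parts = urlsplit(url)
        if parts.netloc != base.netloc:
            continue
        path = parts.path.rstrip("/")
        # Match the prefix itself or anything below it, but not siblings
        # that merely share the leading characters (e.g. /en/pages-extra).
        if path == prefix or path.startswith(prefix + "/"):
            scoped.append(url)
    return scoped


# Illustrative page list; real entries would come from the Page List API.
pages = [
    "https://docs.github.com/en/pages",
    "https://docs.github.com/en/pages/quickstart",
    "https://docs.github.com/en/actions/quickstart",
    "https://docs.github.com/en/pages-extra",
]
scoped = scope_pages(pages, "https://docs.github.com/en/pages")
```

Matching on `prefix + "/"` rather than a bare `startswith(prefix)` is what keeps sibling paths out of the scoped count.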
Questions to consider
- Should afdocs be aware of subsection scoping? If a user passes example.com/product-a, should checks like llms.txt and sitemap discovery automatically look at the origin root, or should they also check for product-scoped variants?
- Should page counts be scoped? For a site like GitHub Docs, the total page count across all products is thousands, but the relevant count for a specific product subsection might be 28. The --max-links sample size depends on this number, so scoping matters for proportional accuracy.
- Should the output note when the base URL is a subsection? It might be useful for consumers of the JSON output to know that the tested URL is a subsection of a larger doc site, so they can interpret discoverability scores in context (e.g., llms.txt exists but is site-wide, not product-scoped).
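To make the first and third questions concrete, here is one possible shape, purely as a sketch. Both functions and all field names (`is_subsection`, `llms_txt_scope`) are hypothetical and do not reflect the current afdocs implementation or its JSON schema. The idea is to try discovery from the most specific path up to the origin root, then record where the file was actually found.

```python
from urllib.parse import urlsplit


def discovery_candidates(base_url, filename="llms.txt"):
    """List candidate URLs for filename, from the base path up to the origin root."""
    parts = urlsplit(base_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Walk from the most specific path down to zero segments (the origin root).
    for depth in range(len(segments), -1, -1):
        prefix = "/".join(segments[:depth])
        if prefix:
            candidates.append(f"{origin}/{prefix}/{filename}")
        else:
            candidates.append(f"{origin}/{filename}")
    return candidates


def describe_scope(base_url, found_at, filename="llms.txt"):
    """Build a hypothetical output fragment noting subsection context."""
    parts = urlsplit(base_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    return {
        "base_url": base_url,
        # True when the base URL has a non-empty path below the origin.
        "is_subsection": parts.path.strip("/") != "",
        # "site-wide" when the file was found at the origin root.
        "llms_txt_scope": "site-wide" if found_at == f"{origin}/{filename}" else "scoped",
    }


candidates = discovery_candidates("https://docs.github.com/en/pages")
scope = describe_scope(
    "https://docs.github.com/en/pages",
    "https://docs.github.com/llms.txt",
)
```

For the GitHub Pages case, this would yield three candidates ending at docs.github.com/llms.txt, and a scope note saying the base URL is a subsection while the llms.txt is site-wide.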
Not a bug
This isn't blocking anything today. In the GitHub Pages case, scores were identical from both URLs. But as afdocs is used across more multi-product sites (AWS, Microsoft Learn, Google Cloud), this could become more relevant.