Problem
The content-negotiation check in markdown-availability can produce false positives when a site returns an error page formatted as markdown with HTTP 200 and Content-Type: text/markdown.
Found this while scoring Next.js. Requesting a page with Accept: text/markdown returns:
HTTP 200
Content-Type: text/markdown
# Page Not Found
The URL `/docs/llm-digest/app/getting-started/installation` does not exist.
## How to find the correct page
...
The check passes because:
- Status is 200 (never validated)
- Content-Type is text/markdown
- Body has markdown headings and links, so looksLikeMarkdown() returns true
- Body has no HTML tags
This gave Next.js a falsely inflated markdown-availability score of 100 (A+), even though content negotiation doesn't actually work for that site (only the .md URL suffix does).
Affected code
src/checks/markdown-availability/content-negotiation.ts (lines 46-69) — classification logic never checks status code or body semantics
src/helpers/detect-markdown.ts — looksLikeMarkdown() is purely structural, not semantic
src/checks/markdown-availability/markdown-url-support.ts — shares the same vulnerability for .md suffix checks (though less likely to trigger in practice)
Secondary impact
The error page content gets cached in pageCache with source 'content-negotiation' (lines 52-66), which can poison downstream checks like markdown-content-parity.
Existing prior art
The http-status-codes check in url-stability already has a SOFT_404_PATTERNS regex:
const SOFT_404_PATTERNS = /not\s*found|page\s*not\s*found|404|does\s*not\s*exist/i;
This could be reused or adapted.
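As a rough illustration of how that regex behaves on the response above, here is a minimal sketch. The helper name `looksLikeSoft404` and the `maxLines` limit are hypothetical and not part of the codebase. Limiting the scan to the first few lines is an assumption made here to reduce false rejections, since the bare `404` alternative would otherwise match any legitimate doc page that mentions 404 somewhere in its body.

```typescript
// Regex as quoted above from the http-status-codes check.
const SOFT_404_PATTERNS = /not\s*found|page\s*not\s*found|404|does\s*not\s*exist/i;

// Hypothetical helper (not in the codebase): scan only the first few lines,
// where an error page normally announces itself, instead of the whole body.
function looksLikeSoft404(body: string, maxLines = 5): boolean {
  const head = body.split("\n").slice(0, maxLines).join("\n");
  return SOFT_404_PATTERNS.test(head);
}

// Body shaped like the Next.js response from this issue.
const errorBody = [
  "# Page Not Found",
  "The URL `/docs/llm-digest/app/getting-started/installation` does not exist.",
  "## How to find the correct page",
].join("\n");

console.log(looksLikeSoft404(errorBody));
```

Whether a head-only scan is the right trade-off is open; scanning the full body would be stricter but would likely reject real documentation about error handling.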
Suggested fix
Two complementary checks:
- Validate that the HTTP status code is 2xx before classifying the response as successful
- Scan the body for error-page patterns (reuse SOFT_404_PATTERNS or similar) and reject matches
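The two checks could be combined roughly as follows. This is only a sketch under stated assumptions: `NegotiationResponse`, `Classification`, and `classifyNegotiatedResponse` are hypothetical names, the real classification logic in content-negotiation.ts is not reproduced here, and `looksLikeMarkdown` is passed in as a parameter because its actual implementation is not shown in this issue.

```typescript
// Regex as quoted from the http-status-codes check.
const SOFT_404_PATTERNS = /not\s*found|page\s*not\s*found|404|does\s*not\s*exist/i;

// Hypothetical shapes for illustration; the real types may differ.
interface NegotiationResponse {
  status: number;
  contentType: string | null;
  body: string;
}

type Classification = "markdown" | "http-error" | "soft-404" | "not-markdown";

function classifyNegotiatedResponse(
  res: NegotiationResponse,
  looksLikeMarkdown: (body: string) => boolean,
): Classification {
  // Check 1: anything outside 2xx is never a successful negotiation.
  if (res.status < 200 || res.status >= 300) return "http-error";

  // Structural checks, as described in this issue.
  const isMarkdownType = (res.contentType ?? "")
    .toLowerCase()
    .startsWith("text/markdown");
  if (!isMarkdownType || !looksLikeMarkdown(res.body)) return "not-markdown";

  // Check 2: reject markdown-formatted error pages served with HTTP 200.
  // Scanning only the first five lines is an assumption made in this sketch.
  const head = res.body.split("\n").slice(0, 5).join("\n");
  if (SOFT_404_PATTERNS.test(head)) return "soft-404";

  return "markdown";
}
```

With a classifier like this, only a "markdown" result would be counted as a pass and written to pageCache, which would also keep the error page out of downstream checks such as markdown-content-parity.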