Skip to content

Send SERVFAIL response on DNS query failure to prevent client timeouts#14573

Open
syed-dawood wants to merge 2 commits intomicrosoft:masterfrom
syed-dawood:fix/dns-servfail-on-query-failure
Open

Send SERVFAIL response on DNS query failure to prevent client timeouts#14573
syed-dawood wants to merge 2 commits intomicrosoft:masterfrom
syed-dawood:fix/dns-servfail-on-query-failure

Conversation

@syed-dawood
Copy link
Copy Markdown

@syed-dawood syed-dawood commented Mar 28, 2026

Summary of the Pull Request

When DnsQueryRaw fails on the Windows side (synchronously or via async callback), no response is sent back to the Linux DNS client. This forces the client to wait for its full retransmit timeout (typically 5-10 seconds per query) before retrying or failing. This change sends a minimal DNS SERVFAIL response (RFC 1035, RCODE=2) back to the Linux client on failure, giving it an immediate error signal.

This also fixes a memory leak: when no response was sent for a UDP query, the tracking entry in DnsServer::m_udpRequests on the Linux side was never cleaned up. The SERVFAIL response now triggers the normal response handling path which erases the tracking entry.

PR Checklist

Detailed Description of the Pull Request / Additional comments

Problem

There are three failure paths in DnsResolver where no DNS response is sent back to the Linux client:

  1. Synchronous DnsQueryRaw failureProcessDnsRequest returns error code ≠ DNS_REQUEST_PENDING
  2. Async callback with queryResults == nullptr — null result from completion callback
  3. Async callback with queryRawResponse == nullptr — API returned failure status code

In all cases, the Linux DNS client (systemd-resolved or glibc stub resolver) never receives a response and must wait for its retransmit timeout. This produces the 5-10 second DNS hangs many users have reported, especially after network transitions, VPN toggles, or sleep/resume.

Additionally, on the Linux side, DnsServer::m_udpRequests entries for failed queries are never cleaned up because the HandleUdpDnsResponse cleanup path is never invoked. Over time this constitutes a slow memory leak.

Fix

Added DnsResolver::SendServfailResponse() which constructs a minimal 12-byte DNS SERVFAIL response per RFC 1035 section 4.1.1:

  • Copies the original 2-byte transaction ID from the request (stored in DnsQueryContext::m_dnsTransactionId)
  • Sets QR=1 (response) and RCODE=2 (Server Failure)
  • Sends via the existing m_dnsResponseQueueDnsTunnelingChannel path

This is called from all three failure paths (gated on !m_stopped to avoid sending during shutdown).

On the Linux side, the SERVFAIL flows through HandleUdpDnsResponse/HandleTcpDnsResponse normally, which cleans up the m_udpRequests tracking entry — fixing the leak with no Linux-side code changes.

Files Changed

  • src/windows/common/DnsResolver.h — Added m_dnsTransactionId to DnsQueryContext, added SendServfailResponse declaration
  • src/windows/common/DnsResolver.cpp — Store transaction ID, call SendServfailResponse on all three failure paths, added implementation

Validation Steps Performed

  1. Code audit: Traced all three failure paths end-to-end through ProcessDnsRequestHandleDnsQueryCompletionSendServfailResponseDnsTunnelingChannel::SendDnsMessageDnsServer::HandleUdpDnsResponse/HandleTcpDnsResponse
  2. Thread safety: All writes happen under m_dnsLock (recursive_mutex). The m_dnsResponseQueue.submit() lambda captures by value. SendServfailResponse is called before m_dnsRequests.erase() so the context is still valid.
  3. Shutdown safety: SERVFAIL sending is gated on !m_stopped. DnsTunnelingChannel::SendDnsMessage also checks m_stopEvent.is_signaled().
  4. DNS wire format: The 12-byte SERVFAIL response follows RFC 1035 section 4.1.1 with QDCOUNT=0, which is accepted by all major resolvers (glibc, systemd-resolved, musl).
  5. clang-format: Verified clean with clang-format-18 --dry-run --style=file

@syed-dawood
Copy link
Copy Markdown
Author

cc @benhillis @OneBlue @craigloewen-msft @damanm24 @Brian-Perkins for review — this fixes the silent DNS black hole that causes 5-10s query timeouts when DnsQueryRaw fails (related to #4285, #5256, #4737).

@microsoft-github-policy-service agree

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves WSL DNS tunneling failure handling by ensuring the Linux-side DNS client receives an immediate DNS error (SERVFAIL) when Windows DnsQueryRaw fails, avoiding multi-second client retransmit timeouts and triggering normal Linux-side request cleanup.

Changes:

  • Track the original DNS transaction ID per request so a failure response can be constructed.
  • Send a minimal DNS SERVFAIL response on both synchronous DnsQueryRaw failure and async completion failure paths.
  • Add a SendServfailResponse helper for constructing and enqueueing the failure response.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/windows/common/DnsResolver.h Adds per-request transaction ID storage and declares SendServfailResponse.
src/windows/common/DnsResolver.cpp Captures transaction ID, sends SERVFAIL on failure paths, and implements SendServfailResponse.

@benhillis
Copy link
Copy Markdown
Member

I think these Copilot comments are valid.

@syed-dawood syed-dawood force-pushed the fix/dns-servfail-on-query-failure branch from e077644 to 20234fc Compare March 29, 2026 20:06
When DnsQueryRaw fails (synchronously or asynchronously), the Windows DNS
resolver currently sends no response back to the Linux DNS client. This
forces the client to wait for its full retransmit timeout (typically 5-10
seconds per query) before moving on, causing significant DNS delays.

This change sends a minimal 12-byte DNS SERVFAIL response (RFC 1035,
RCODE=2) back to the Linux DNS client whenever:
- DnsQueryRaw returns a synchronous failure (not DNS_REQUEST_PENDING)
- The async completion callback receives null results
- The async completion returns a non-null result with no raw response

The SERVFAIL response uses the original request's transaction ID so the
client can match it to the pending query and immediately retry or fail.

This also fixes a memory leak on the Linux side: when no response was
sent, UDP request tracking entries in DnsServer::m_udpRequests were never
cleaned up. The SERVFAIL response now triggers the normal response
handling path which erases the tracking entry.

Refs: microsoft#4285 microsoft#5256 microsoft#4737
@syed-dawood syed-dawood force-pushed the fix/dns-servfail-on-query-failure branch from 20234fc to 40f6fd5 Compare March 29, 2026 20:16
Copilot AI review requested due to automatic review settings March 29, 2026 20:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +412 to +413
dnsHeader[2] = gsl::byte{0x80}; // QR=1 (response), OPCODE=0 (standard query)
dnsHeader[3] = gsl::byte{0x02}; // RA=0, Z=0, RCODE=2 (Server Failure)
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SERVFAIL header flags are hard-coded to clear RD (and RA is forced to 0). Per RFC 1035, RD is copied from the query into the response, and some stub resolvers validate this. Consider capturing the RD bit from the original request (alongside the transaction ID) and reflecting it in the SERVFAIL response; also consider setting RA appropriately to match normal resolver behavior.

Suggested change
dnsHeader[2] = gsl::byte{0x80}; // QR=1 (response), OPCODE=0 (standard query)
dnsHeader[3] = gsl::byte{0x02}; // RA=0, Z=0, RCODE=2 (Server Failure)
// Flags: QR=1 (response), OPCODE=0 (standard query), AA=0, TC=0, RD=1
dnsHeader[2] = gsl::byte{0x81};
// Flags: RA=1 (recursion available), Z=0, RCODE=2 (Server Failure)
dnsHeader[3] = gsl::byte{0x82};

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended?

void HandleDnsQueryCompletion(_Inout_ DnsQueryContext* dnsQueryContext, _Inout_opt_ DNS_QUERY_RAW_RESULT* queryResults) noexcept;

// Build and send a minimal DNS SERVFAIL response (RFC 1035, RCODE=2) back to the Linux DNS client.
// This is used when the Windows DNS API fails, to prevent the Linux client from waiting until timeout.
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SendServfailResponse takes a raw uint16_t transactionId but the implementation assumes it is already in network byte order (it memcpy's the value directly). Please document this explicitly on the SendServfailResponse declaration (or normalize to host order in the context and htons() when writing) to prevent future callers from accidentally passing host-order IDs.

Suggested change
// This is used when the Windows DNS API fails, to prevent the Linux client from waiting until timeout.
// This is used when the Windows DNS API fails, to prevent the Linux client from waiting until timeout.
//
// Arguments:
// transactionId - DNS transaction ID to place in the response header, in network byte order.
// Callers that have a host-order transaction ID must convert it (e.g. via htons)
// before passing it to this method.
// dnsClientIdentifier - struct containing protocol (TCP/UDP) and unique id of the Linux DNS client
// to which the SERVFAIL response will be sent.

Copilot uses AI. Check for mistakes.
Comment on lines +370 to +375
else if (!m_stopped)
{
// The Windows DNS API failed to resolve the request. Send a SERVFAIL response to the Linux DNS client
// so it gets an immediate error instead of waiting for a timeout (which can take 5-10 seconds).
SendServfailResponse(queryContext->m_dnsTransactionId, queryContext->m_dnsClientIdentifier);
}
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behavior adds a SERVFAIL fallback path when DnsQueryRaw fails/nulls. There are existing DNS tunneling integration tests (test/windows/NetworkTests.cpp) but none appear to validate the failure path; adding a test that forces a DnsQueryRaw failure and asserts that the Linux client receives an immediate SERVFAIL (for both UDP and TCP framing) would help prevent regressions and ensure the timeout/leak fix stays covered.

Copilot uses AI. Check for mistakes.
@syed-dawood syed-dawood requested a review from Copilot March 31, 2026 22:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines +238 to +244
if (dnsBuffer.size() >= transactionIdOffset + sizeof(localContext->m_dnsTransactionId))
{
memcpy(&localContext->m_dnsTransactionId, dnsBuffer.data() + transactionIdOffset, sizeof(localContext->m_dnsTransactionId));
}
else
{
WSL_LOG(
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the DNS buffer is too small to contain the transaction ID (e.g., <2 bytes for UDP or <4 bytes for TCP), m_dnsTransactionId remains at its default value and later failure paths can send a SERVFAIL with an incorrect transaction ID. That response won’t correlate to the outstanding query and can reintroduce client timeouts (or, worse, collide with a legitimate query that happens to use ID 0). Consider treating this as invalid input: validate the minimum header size up-front and return early (and/or track whether the transaction ID was successfully extracted and only send SERVFAIL when it was).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems legitimate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants