Improve retry strategy for artifact download #3561
Merged
maddieford merged 4 commits into Azure:develop, Feb 27, 2026
Remove local testing change Fix e2e test Improve comments and tests
maddieford commented Feb 24, 2026
| def hgap_download(uri): | ||
| request_uri, request_headers = host_ga_plugin.get_artifact_request(uri, use_verify_header=use_verify_header, artifact_manifest_url=host_ga_plugin.manifest_uri) | ||
| return self.stream(request_uri, target_file, headers=request_headers, use_proxy=False) | ||
| return self.stream(request_uri, target_file, headers=request_headers, use_proxy=False, retry_codes=restutil.HGAP_GET_EXTENSION_ARTIFACT_RETRY_CODES) |
Contributor
Author
We weren't treating a BAD_REQUEST response as retryable on zip downloads via HGAP. Rohit confirmed it should be retryable for zip downloads (same as manifest and VMAP downloads).
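A minimal sketch of how a per-call set of retryable status codes can change the retry decision. The names `DEFAULT_RETRY_CODES`, `HGAP_ARTIFACT_RETRY_CODES`, and `should_retry` are hypothetical, and the codes in the default set are assumptions chosen for illustration; they are not the agent's actual `restutil` definitions.

```python
# Plain integer status codes keep the sketch independent of any HTTP library.
BAD_REQUEST = 400
SERVICE_UNAVAILABLE = 503
GATEWAY_TIMEOUT = 504

# Hypothetical default set of retryable statuses (illustrative only).
DEFAULT_RETRY_CODES = frozenset([SERVICE_UNAVAILABLE, GATEWAY_TIMEOUT])

# Hypothetical set for artifact requests via the host channel:
# the defaults plus BAD_REQUEST, which may be a transient cache failure.
HGAP_ARTIFACT_RETRY_CODES = DEFAULT_RETRY_CODES | frozenset([BAD_REQUEST])


def should_retry(status, retry_codes=DEFAULT_RETRY_CODES):
    """Return True if a response with this status should be retried."""
    return status in retry_codes
```

With this shape, the caller opts in to the wider set only for the request type that needs it, so other requests keep treating a 400 as a final answer.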
| redact_data=False, | ||
| return_raw_response=False): | ||
| return_raw_response=False, | ||
| fail_fast_on_timeout=False): |
Contributor
Author
I opted to add an additional option to this method since all of the retry logic is here, and I only wanted to change the retry logic for IOErrors on artifact download requests via the direct channel.
| # If the error was an IOError and fail_fast_on_timeout is True, check if the error message contains | ||
| # "timed out" or "timeout" and break the loop to fail fast in the case of no outbound connection on the VM. | ||
| # Otherwise, continue with the retries. | ||
| if fail_fast_on_timeout and "timed out" in ustr(e) or "timeout" in ustr(e): |
Contributor
Author
errno is None for the timeout exception, so I am using the string here.
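A small sketch of the string-based check, assuming `str` as a stand-in for the agent's `ustr`; the helper names `is_timeout_error` and `should_fail_fast` are hypothetical. One thing to keep in mind when reading the condition in the diff: in Python, `and` binds tighter than `or`, so `a and b or c` parses as `(a and b) or c`. The sketch groups the two substring checks so that the flag gates both of them.

```python
import socket


def is_timeout_error(error):
    """Detect a timeout from the message, since errno is not populated."""
    message = str(error)
    return "timed out" in message or "timeout" in message


def should_fail_fast(error, fail_fast_on_timeout):
    """Break out of the retry loop only when the caller opted in."""
    return fail_fast_on_timeout and is_timeout_error(error)
```

Without the grouping, a message containing "timeout" would satisfy the condition even when `fail_fast_on_timeout` is False; whether that is intended is not stated in the thread.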
narrieta reviewed Feb 26, 2026
| # 2026-02-20T22:39:30.733714Z INFO ExtHandler ExtHandler ProcessExtensionsGoalState started [incarnation_1 channel: WireServer source: Fabric activity: 1082ed79-bbc7-4925-a932-aefe84bca6e9 correlation 2ad608f9-327b-4b2a-8673-6b40ca7ee6e7 created: 2026-02-20T22:33:22.800011Z] | ||
| # Patterns to match | ||
| downloading_vmap = re.compile(r"Downloading artifacts profile blob") # This is the artifact download we expect to see the channel change happen on | ||
| fetch_failed_on_direct_pattern = re.compile(r"Fetch failed:.*1 attempts made") # There should only be one attempt on the Direct channel before falling back to HostGAPlugin |
Member
Suggested change
| fetch_failed_on_direct_pattern = re.compile(r"Fetch failed:.*1 attempts made") # There should only be one attempt on the Direct channel before falling back to HostGAPlugin | |
| fetch_failed_on_direct_pattern = re.compile(r"Fetch failed:.* 1 attempts made") # There should only be one attempt on the Direct channel before falling back to HostGAPlugin |
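The reviewer gives no rationale, but one plausible reason for the added space is that the original pattern also matches any attempt count ending in 1. The sketch below compares the two patterns; the log lines are hypothetical and only approximate the agent's message format.

```python
import re

original = re.compile(r"Fetch failed:.*1 attempts made")
suggested = re.compile(r"Fetch failed:.* 1 attempts made")

# Hypothetical log lines for illustration.
one_attempt = "Fetch failed: [HttpError] timed out, 1 attempts made"
many_attempts = "Fetch failed: [HttpError] timed out, 11 attempts made"

# Both patterns accept the single-attempt line.
both_match_one = bool(original.search(one_attempt)) and bool(suggested.search(one_attempt))

# Only the original pattern accepts a count that merely ends in "1".
original_matches_many = bool(original.search(many_attempts))
suggested_matches_many = bool(suggested.search(many_attempts))
```

Requiring whitespace before the digit makes the test assert exactly one attempt rather than any count whose last digit is 1.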
nagworld9 reviewed Feb 26, 2026
| try: | ||
| # If fail_fast_on_timeout is True, use a shorter timeout for the first attempt to fail fast in the case of | ||
| # no outbound connection on the VM. | ||
| req_timeout = timeout if not fail_fast_on_timeout or attempt > 1 else min(timeout, 5) |
Contributor
Can we add a constant for the 5-second default?
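One way the suggestion could look, as a sketch only: the constant name `FAIL_FAST_FIRST_ATTEMPT_TIMEOUT` and the helper `request_timeout` are hypothetical, and attempts are assumed to be numbered from 1 as in the diff.

```python
# Hypothetical constant replacing the literal 5 in the diff (seconds).
FAIL_FAST_FIRST_ATTEMPT_TIMEOUT = 5


def request_timeout(timeout, attempt, fail_fast_on_timeout):
    """Pick the timeout for this attempt.

    Only the first attempt of a fail-fast request is shortened; later
    attempts, and requests that did not opt in, use the full timeout.
    """
    if fail_fast_on_timeout and attempt <= 1:
        return min(timeout, FAIL_FAST_FIRST_ATTEMPT_TIMEOUT)
    return timeout
```

Using `min` keeps a caller-supplied timeout that is already shorter than the constant.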
narrieta approved these changes Feb 26, 2026
nagworld9 approved these changes Feb 26, 2026
maddieford added a commit to maddieford/WALinuxAgent that referenced this pull request Mar 2, 2026
* Improve retry strategy for artifact download Remove local testing change Fix e2e test Improve comments and tests * Add comment for direct download logic' * Address comments (cherry picked from commit 4d1c365)
maddieford added a commit that referenced this pull request Mar 2, 2026
Description
Current artifact download retries are expensive for VMs where outbound connections are disabled. Since direct downloads are the primary download channel, each request on these VMs takes 10 seconds to time out. The agent currently retries up to 6 times, with a varying delay between retries (up to 2.5 minutes in total), before falling back to the host channel for downloads. This is impacting TDPR for those VMs.
We did a release to Canary where HGAP was the primary download channel in an attempt to address the above issue. We saw an increase in 400 responses on HGAP requests due to a cache issue in HGAP that may produce a BAD_REQUEST failure for valid URIs when using the extensionArtifact API. Retries should help address that cache issue, but the HGAP dev also cited performance concerns with making HGAP the primary download channel.
Given the above concerns about making HGAP the primary download channel, this PR keeps the direct download channel as the primary, but makes some changes to account for VMs where outbound connections are disabled. For direct artifact downloads, timed-out requests will not be retried (and the timeout will be decreased to 5 seconds for the first request). As a result, the agent will fall back to the host channel much faster, reducing the delay on those VMs from 2.5 minutes to 5 seconds.
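The behavior described above can be sketched as a retry loop plus a channel fallback. This is an illustration under stated assumptions, not the agent's implementation: `fetch`, `download`, and the `send` callables are hypothetical, the error message format is made up, and the delay between retries is omitted.

```python
def fetch(send, timeout=10, max_attempts=6, fail_fast_on_timeout=False):
    """Call send(timeout) until it succeeds or the attempts run out."""
    last_error = None
    attempts = 0
    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        # Shorten only the first attempt of a fail-fast request.
        if fail_fast_on_timeout and attempt == 1:
            req_timeout = min(timeout, 5)
        else:
            req_timeout = timeout
        try:
            return send(req_timeout)
        except IOError as error:
            last_error = error
            message = str(error)
            timed_out = "timed out" in message or "timeout" in message
            if fail_fast_on_timeout and timed_out:
                break  # no outbound connectivity is likely; stop retrying
            # A real loop would sleep here before the next attempt.
    raise IOError("Fetch failed: {0}, {1} attempts made".format(last_error, attempts))


def download(direct_send, host_send):
    """Try the direct channel first, then fall back to the host channel."""
    try:
        return fetch(direct_send, fail_fast_on_timeout=True)
    except IOError:
        return fetch(host_send)
```

On a VM with no outbound connectivity, the direct channel in this sketch is tried once with a 5-second timeout before the host channel is used, instead of six times with the full timeout.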
In addition, the PR makes the BAD_REQUEST response retryable on zip downloads via the host channel to address the cache issue described above. BAD_REQUEST is already treated as retryable for manifest and VMAP downloads on the host channel, and HGAP devs confirmed it should be treated as retryable for zip downloads too.