Improve retry strategy for artifact download #3561

Merged
maddieford merged 4 commits into Azure:develop from maddieford:improve_download_retry_logic
Feb 27, 2026

@maddieford
Contributor

Description

Current artifact download retries are expensive for VMs where outbound connections are disabled. Since direct downloads are the primary download channel, each request on these VMs takes 10 seconds to time out. The agent currently retries up to 6 times, with a varying delay between retries (up to 2.5 minutes in total), before falling back to the host channel for downloads. This is impacting TDPR for those VMs.

We did a release to Canary where HGAP was the primary download channel in an attempt to address the above issue. We saw an increase in 400 responses on HGAP requests due to a cache issue in HGAP that may produce a BAD_REQUEST failure for valid URIs when using the extensionArtifact API. Retries should help address that cache issue, but the HGAP dev also cited performance concerns about making HGAP the primary download channel.

Given the above concerns about making HGAP the primary download channel, this PR keeps the direct download channel as the primary, but makes some changes to account for VMs where outbound connections are disabled. For direct artifact downloads, timed-out requests will not be retried, and the timeout for the first request will be decreased to 5 seconds. As a result, the agent will fall back to the host channel much faster, reducing the delay on those VMs from 2.5 minutes to 5 seconds.
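
To make the intended behaviour concrete, here is a minimal sketch of the channel fallback, not the agent's actual code. `download_with_fallback`, `blocked_direct`, and `host_channel` are hypothetical names used only for illustration, and the example URI is made up.

```python
import socket


def download_with_fallback(uri, direct_fetch, host_fetch):
    """Try the direct channel once; on a timeout, switch to the host channel.

    direct_fetch and host_fetch are hypothetical callables standing in for
    the agent's real download helpers.
    """
    try:
        return "direct", direct_fetch(uri)
    except socket.timeout:
        # A timeout suggests outbound connections may be disabled, so skip
        # further direct attempts and go straight to the host channel.
        return "host", host_fetch(uri)


def blocked_direct(uri):
    # Simulates a VM with no outbound connectivity.
    raise socket.timeout("timed out")


def host_channel(uri):
    return b"artifact-bytes"


channel, payload = download_with_fallback(
    "https://example.invalid/artifact.zip", blocked_direct, host_channel)
```

In this sketch only timeouts trigger the immediate fallback; other errors would still go through whatever retry policy the direct channel uses.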

In addition, the PR makes the BAD_REQUEST response retryable on zip downloads via the host channel to address the cache issue described above. BAD_REQUEST is already treated as retryable for manifest and VMAP downloads on the host channel, and HGAP devs confirmed it should be treated as retryable for zip downloads too.

Issue #

PR information

  • Ensure development PR is based on the develop branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines


Distro maintenance information, if applicable

  • This is a contribution from a distro maintainer
  • The changes in this PR have been taken as a downstream patch (Note: it is not recommended to patch the agent without upstream review and approval)

Remove local testing change

Fix e2e test

Improve comments and tests
 def hgap_download(uri):
     request_uri, request_headers = host_ga_plugin.get_artifact_request(uri, use_verify_header=use_verify_header, artifact_manifest_url=host_ga_plugin.manifest_uri)
-    return self.stream(request_uri, target_file, headers=request_headers, use_proxy=False)
+    return self.stream(request_uri, target_file, headers=request_headers, use_proxy=False, retry_codes=restutil.HGAP_GET_EXTENSION_ARTIFACT_RETRY_CODES)
Contributor Author

We weren't treating the BAD_REQUEST response as retryable on zip downloads via HGAP. Rohit confirmed it should be retryable for zip downloads (same as manifest and VMAP downloads).
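
As a rough illustration of what a per-request set of retryable status codes does, here is a minimal sketch. The set contents and the names `EXAMPLE_RETRY_CODES` and `fetch_with_retries` are assumptions; the real contents of `restutil.HGAP_GET_EXTENSION_ARTIFACT_RETRY_CODES` are not shown in this PR.

```python
# Assumption: an illustrative set of retryable HTTP status codes.
# 400 is BAD_REQUEST; 503 is included only as a second example.
EXAMPLE_RETRY_CODES = {400, 503}


def fetch_with_retries(send, max_attempts=3, retry_codes=EXAMPLE_RETRY_CODES):
    """Call send() until it returns a status outside retry_codes.

    send is a hypothetical callable returning an HTTP status code.
    Returns (last_status, attempts_made). A real implementation would
    also sleep between attempts.
    """
    status = None
    attempts = 0
    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        status = send()
        if status not in retry_codes:
            break
    return status, attempts


# Simulate the cache issue: one spurious 400, then success.
responses = iter([400, 200])
status, attempts = fetch_with_retries(lambda: next(responses))
```

With 400 in the set, the spurious first response is retried instead of failing the download outright.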

     redact_data=False,
-    return_raw_response=False):
+    return_raw_response=False,
+    fail_fast_on_timeout=False):
Contributor Author

I opted to add an additional option to this method since all of the retry logic is here, and I only wanted to change the retry logic for IOErrors on artifact download requests via the direct channel.
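
A minimal sketch of how an opt-in flag like this can sit inside a shared retry loop, so that only callers who pass it get the new behaviour. `request_with_retries` and its parameters are hypothetical, and the timeout test is simplified to an `isinstance` check for brevity.

```python
import socket


def request_with_retries(send, max_retries=6, fail_fast_on_timeout=False):
    """Retry send() on IOError; optionally stop at the first timeout.

    Returns the result of send(). Raises IOError with the attempt count
    if every attempt fails or the loop is cut short.
    """
    last_error = None
    attempts = 0
    for attempt in range(1, max_retries + 1):
        attempts = attempt
        try:
            return send()
        except IOError as e:
            last_error = e
            if fail_fast_on_timeout and isinstance(e, socket.timeout):
                break  # do not spend more attempts on a likely blocked VM
    raise IOError("{0} ({1} attempts made)".format(last_error, attempts))


calls = []


def always_times_out():
    calls.append(1)
    raise socket.timeout("timed out")


try:
    request_with_retries(always_times_out, fail_fast_on_timeout=True)
except IOError as e:
    message = str(e)
```

Because the flag defaults to `False`, existing callers keep the full retry behaviour.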

# If the error was an IOError and fail_fast_on_timeout is True, check if the error message contains
# "timed out" or "timeout" and break the loop to fail fast in the case of no outbound connection on the VM.
# Otherwise, continue with the retries.
if fail_fast_on_timeout and ("timed out" in ustr(e) or "timeout" in ustr(e)):
Contributor Author

errno is None for the timeout exception, so I am matching on the error string here.
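
For illustration, a small sketch of why string matching is needed, assuming the exception is constructed with only a message as it is here. `is_timeout` and `should_fail_fast` are hypothetical helpers, and `str` stands in for the agent's `ustr`. Note that in Python `a and b or c` parses as `(a and b) or c`, so the sketch keeps the flag check separate from the two substring tests.

```python
import socket


def is_timeout(error):
    """Return True if the error text looks like a timeout."""
    text = str(error)
    return "timed out" in text or "timeout" in text


def should_fail_fast(error, fail_fast_on_timeout):
    # The flag gates both substring tests because they are evaluated
    # together inside is_timeout before being combined with the flag.
    return bool(fail_fast_on_timeout and is_timeout(error))


timeout_error = socket.timeout("timed out")
errno_value = timeout_error.errno  # None: only a message was supplied
```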

# 2026-02-20T22:39:30.733714Z INFO ExtHandler ExtHandler ProcessExtensionsGoalState started [incarnation_1 channel: WireServer source: Fabric activity: 1082ed79-bbc7-4925-a932-aefe84bca6e9 correlation 2ad608f9-327b-4b2a-8673-6b40ca7ee6e7 created: 2026-02-20T22:33:22.800011Z]
# Patterns to match
downloading_vmap = re.compile(r"Downloading artifacts profile blob") # This is the artifact download we expect to see the channel change happen on
fetch_failed_on_direct_pattern = re.compile(r"Fetch failed:.*1 attempts made") # There should only be one attempt on the Direct channel before falling back to HostGAPlugin
Member

Suggested change
-fetch_failed_on_direct_pattern = re.compile(r"Fetch failed:.*1 attempts made") # There should only be one attempt on the Direct channel before falling back to HostGAPlugin
+fetch_failed_on_direct_pattern = re.compile(r"Fetch failed:.* 1 attempts made") # There should only be one attempt on the Direct channel before falling back to HostGAPlugin

Member

(.*1 would match 01, 11, etc)
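
A quick sketch of the difference between the two patterns. The log lines are made up for this illustration and are not taken from real agent output.

```python
import re

loose = re.compile(r"Fetch failed:.*1 attempts made")
strict = re.compile(r"Fetch failed:.* 1 attempts made")

# Hypothetical log lines, invented for this example.
one_attempt = "Fetch failed: [HttpError] request timed out, 1 attempts made"
eleven_attempts = "Fetch failed: [HttpError] request timed out, 11 attempts made"

loose_matches_eleven = loose.search(eleven_attempts) is not None
strict_matches_eleven = strict.search(eleven_attempts) is not None
strict_matches_one = strict.search(one_attempt) is not None
```

The space before `1` forces the digit to start the number, so counts that merely end in 1 no longer match.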

try:
    # If fail_fast_on_timeout is True, use a shorter timeout for the first attempt to fail fast in the case of
    # no outbound connection on the VM.
    req_timeout = timeout if not fail_fast_on_timeout or attempt > 1 else min(timeout, 5)
Contributor

Can we add a constant for the 5-second default?
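
One possible shape for this, as a sketch only. The constant name `FAIL_FAST_FIRST_ATTEMPT_TIMEOUT` and the helper `request_timeout` are hypothetical; the value 5 and the selection logic come from the snippet under review.

```python
# Hypothetical constant name; 5 seconds is the value used in the reviewed line.
FAIL_FAST_FIRST_ATTEMPT_TIMEOUT = 5


def request_timeout(timeout, attempt, fail_fast_on_timeout):
    """Pick the timeout for a given attempt.

    Only the first attempt is shortened, and only when fail-fast is
    requested; the caller's timeout is never lengthened.
    """
    if fail_fast_on_timeout and attempt <= 1:
        return min(timeout, FAIL_FAST_FIRST_ATTEMPT_TIMEOUT)
    return timeout


first_attempt = request_timeout(10, 1, True)
second_attempt = request_timeout(10, 2, True)
```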

@maddieford maddieford merged commit 4d1c365 into Azure:develop Feb 27, 2026
14 checks passed
@maddieford maddieford deleted the improve_download_retry_logic branch February 27, 2026 15:38
maddieford added a commit to maddieford/WALinuxAgent that referenced this pull request Mar 2, 2026
* Improve retry strategy for artifact download

Remove local testing change

Fix e2e test

Improve comments and tests

* Add comment for direct download logic'

* Address comments

(cherry picked from commit 4d1c365)
maddieford added a commit that referenced this pull request Mar 2, 2026

3 participants