Skip to content

feat(toolkit-lib): Implement Hotswap with CCAPIs for broader resource coverage and improve intrinsic function resolution for Get::Attr#1310

Open
ShadowCat567 wants to merge 5 commits intomainfrom
ccapi-hotswap
Open

feat(toolkit-lib): Implement Hotswap with CCAPIs for broader resource coverage and improve intrinsic function resolution for Get::Attr#1310
ShadowCat567 wants to merge 5 commits intomainfrom
ccapi-hotswap

Conversation

@ShadowCat567
Copy link
Copy Markdown

@ShadowCat567 ShadowCat567 commented Apr 3, 2026

The primary change in this PR is the introduction of hotswap with CCAPIs. This allows us to support resource hotswapping for new resource types without having to write custom implementations for each of them.

Other changes in this PR:

  • Improvements to resolution of Fn::GetAttr: previously we could only resolve the CloudFormation intrinsic function, Fn::GetAttr for a small number of resources (11). This stems from us having hardcoded implementations for each of those resources. Now if a resource is not one we have a hardcoded attribute resolution for, we try to get it from the deployed Cloudformation template. If we cannot get it from the deployed Cloudformation template, we attempt to resolve it using its live state through CCAPIs. Only if we cannot resolve it with CCAPIs, do we throw an error that we cannot resolve that attribute.
  • Lambda Function hotswapping takes advantage of exponential backoff: waiters in AWS SDK V3 support exponential back off and we were not taking advantage of that while hotswapping Lambda functions, hotswapping Lambda functions that involve Docker or VPCs should be a bit faster.
  • Parallelize hotswapping with nested stacks: Previously when hotswapping nested stacks we would do it sequentially, now nested stack hotswapping is done in parallel so it should be a bit faster

Resource types that are now supported by hotswap through CCAPIs:

  • AWS::ApiGateway::RestApi
  • AWS::ApiGateway::Deployment
  • AWS::ApiGateway::Method
  • AWS::ApiGatewayV2::Api
  • AWS::ApiGatewayV2::Integration
  • AWS::Bedrock::Agent
  • AWS::Events::Rule
  • AWS::DynamoDB::Table
  • AWS::DynamoDB::GlobalTable
  • AWS::SQS::Queue
  • AWS::CloudWatch::Alarm
  • AWS::CloudWatch::CompositeAlarm
  • AWS::CloudWatch::Dashboard
  • AWS::StepFunctions::StateMachine - previously supported by custom SDK implementation
  • AWS::BedrockAgentCore::Runtime - previously supported by custom SDK implementation

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added the p2 label Apr 3, 2026
@aws-cdk-automation aws-cdk-automation requested a review from a team April 3, 2026 16:33
@ShadowCat567 ShadowCat567 marked this pull request as ready for review April 7, 2026 18:46
Comment on lines +104 to +105
* The processed template from CloudFormation with all intrinsic functions
* resolved to their concrete values from the last successful deployment.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not what "processed" is, unfortunately, so I hope you're not basing too much functionality off of this assumption.

For troubleshooting, use the processed template. If a template doesn't reference macros, the original and processed templates are identical.


// The processed template has all intrinsics resolved by CloudFormation from the last deployment.
// Used as a fallback when local evaluation of Fn::GetAtt fails for unsupported resource types.
this.processedTemplate = props.processedTemplate ?? {};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order should be

  • { Ref } -- get from CFN's DescribeStackResources
  • { GetAtt } -- either get from a custom handler, otherwise from CCAPI, otherwise fail

We could potentially get a { GetAtt } from the template, if it is an input property that is set. But if we can parallellize all the necessary { GetAtt }s I don't see why this should add a lot of time, and the extremely fast path is very speculative and rare so probably not worth it.

// This will fail for resource types with compound primaryIdentifiers because
// CloudFormation's PhysicalResourceId only contains a single value.
try {
const cloudControl = this.sdk.cloudControl();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want a new SDK client for every single invocation? I imagine there will be a bunch.

Comment on lines +484 to +485
// This will fail for resource types with compound primaryIdentifiers because
// CloudFormation's PhysicalResourceId only contains a single value.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... introduce a mapper!

type CfnToCcapiIdentifierMapper = (cfnPhysicalId: string) => string;

Have a table of these, with (x) => x as a default for resources that don't occur in the table.

Aaannndd... it's probably fine for 90% of resources?

For compound primary identifiers (that is, one that consists of multiple resource properties strung together), to specify the primary identifier as a string, list the property values in the order they are specified in the primary identifier definition, separated by |.

return props[attribute];
}
} catch {
// Cloud Control lookup failed — fall through to the error below
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give a more descriptive error here, include the CCAPI one perhaps. How otherwise will we know what to fix?

Comment on lines +121 to +122
* built by joining each component with "|". CloudFormation's PhysicalResourceId
* only returns a single value, which doesn't work for compound keys.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9 times out of 10 the PhysicalResourceId will already be that string with components separated by |.

It should be only for certain legacy resources that they aren't (singular in CFN, compound in CCAPI).

In fact, that list is here. I would say we add a dependency on the service spec, and at build time extract that information and dump it into a JSON file that we use to decide what to do here (or that's the list we must have custom identifier mappers for, as mentioned before).

const delaySeconds = functionIsInVpcOrUsesDockerForCode ? 5 : 1;
// otherwise, the update will be quick. This is reflected in the delay times we
// set for exponential backoff while waiting
const minDelaySeconds = functionIsInVpcOrUsesDockerForCode ? 3 : 1, maxDelaySeconds = functionIsInVpcOrUsesDockerForCode ? 5 : 3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a readable way of writing this (at least imo).

More code but more readable:

let minDelaySeconds;
let maxDelaySeconds;
if (functionIsInVpcOrUsesDockerForCode) {
  minDelaySeconds = 3;
  maxDelaySeconds = 5;
} else {
  minDelaySeconds = 1;
  maxDelaySeconds = 3;
}

Btw the difference between these number is so small that we might as well not bother.

The nice thing about exponential backoff is that it will automatically find its own stable point. Do not try to outsmart it, do not go "oh but what if the update finished between poll 13 and 14, then the user is waiting 30 seconds for nothing, better make the poll shorter so they don't have to wait..."

If you've already been waiting 30 seconds, chances are high you'll be waiting 30 seconds longer. Either things are fast and your polling intervals don't matter, or the service is busted and quicker polling won't make (much of) a difference and just degrades the service for everyone.


Relatedly, I was a little confused while reading this:

Lambda Function hotswapping takes advantage of exponential backoff: waiters in AWS SDK V3 support exponential back off and we were not taking advantage of that while hotswapping Lambda functions, hotswapping Lambda functions that involve Docker or VPCs should be a bit faster.

Exponential backoff usually makes things slower (for good reason!). So why would the benefit of using it be that we are faster for Docker or VPCs?

'AWS::AppSync::GraphQLSchema': isHotswappableAppSyncChange,
'AWS::AppSync::ApiKey': isHotswappableAppSyncChange,

'AWS::BedrockAgentCore::Runtime': isHotswappableBedrockAgentCoreRuntimeChange,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not be deleting these functions now then?

service: 'cloudcontrol',
apply: async () => {
const cloudControl = sdk.cloudControl();
const currentResource = await cloudControl.getResource({
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in doubt about this.

Upstairs, we already decided we had a "change" that's hotswappable. That means we already look at a "current state" and a "new state", and made a calculation based on the diff between them.

(It might be that that state is inaccurate, but that's a different problem!)

If we already have a current state that we calculated a diff on, it seems dangerous to me to fetch another current state, and do another diff calculation based on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants