diff --git a/README.md b/README.md index 93b9dbb7..9368d443 100644 --- a/README.md +++ b/README.md @@ -41,10 +41,11 @@ To maximize the benefits of plugin-assisted development while maintaining securi | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------- | | **amazon-location-service** | Add maps, geocoding, routing, places search, and geospatial features to applications with Amazon Location Service | Available | | **aws-amplify** | Build full-stack apps with AWS Amplify Gen 2 using guided workflows for auth, data, storage, and functions | Available | -| **aws-serverless** | Build serverless applications with Lambda, API Gateway, EventBridge, Step Functions, and durable functions | Available | +| **aws-serverless** | Build serverless applications with Lambda, API Gateway, EventBridge, Step Functions, durable functions, and Lambda Managed Instances | Available | | **codebase-documentor-for-aws** | Analyze AWS-deployed services and codebases to generate structured technical documentation with source-of-truth citations | Available | | **databases-on-aws** | Database guidance for the AWS database portfolio — schema design, queries, migrations, and multi-tenant patterns | Some Services Available (Aurora DSQL) | | **deploy-on-aws** | Deploy applications to AWS with architecture recommendations, cost estimates, and IaC deployment | Available | +| **migration-to-aws** | Migrate GCP infrastructure to AWS with resource discovery, architecture mapping, cost analysis, and execution planning | Available | | **sagemaker-ai** | Build, train, and deploy AI models with deep AWS AI/ML expertise brought directly into your coding assistants, covering the surface area of [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/) | Available | ## Installation @@ 
-214,15 +215,16 @@ Build full-stack apps with AWS Amplify Gen 2 using TypeScript code-first develop ## aws-serverless -Design, build, deploy, test, and debug serverless applications with AWS Lambda, API Gateway, EventBridge, Step Functions, and durable functions. Includes SAM and CDK deployment workflows, a SAM template validation hook, and the AWS Lambda durable functions skill for building resilient, long-running, multi-step applications. +Design, build, deploy, test, and debug serverless applications with AWS Lambda, API Gateway, EventBridge, Step Functions, and durable functions. Includes SAM and CDK deployment workflows, a SAM template validation hook, the AWS Lambda durable functions skill for building resilient, long-running, multi-step applications, and the Lambda Managed Instances skill for evaluating, configuring, and migrating workloads to EC2-backed Lambda. ### Agent Skill Triggers -| Agent Skill | Triggers | -| -------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **aws-lambda** | "Lambda function", "event source", "serverless application", "API Gateway", "EventBridge", "Step Functions", "serverless API", "event-driven architecture", "Lambda trigger" | -| **aws-serverless-deployment** | "use SAM", "SAM template", "SAM init", "SAM deploy", "CDK serverless", "CDK Lambda construct", "NodejsFunction", "PythonFunction", "serverless CI/CD pipeline" | -| **aws-lambda-durable-functions** | "lambda durable functions", "workflow orchestration", "state machines", "retry/checkpoint patterns", "long-running stateful Lambda", "saga pattern", "human-in-the-loop" | +| Agent Skill | Triggers | +| -------------------------------- | 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **aws-lambda** | "Lambda function", "event source", "serverless application", "API Gateway", "EventBridge", "Step Functions", "serverless API", "event-driven architecture", "Lambda trigger" | +| **aws-serverless-deployment** | "use SAM", "SAM template", "SAM init", "SAM deploy", "CDK serverless", "CDK Lambda construct", "NodejsFunction", "PythonFunction", "serverless CI/CD pipeline" | +| **aws-lambda-durable-functions** | "lambda durable functions", "workflow orchestration", "state machines", "retry/checkpoint patterns", "long-running stateful Lambda", "saga pattern", "human-in-the-loop" | +| **aws-lambda-managed-instances** | "Lambda Managed Instances", "LMI", "capacity provider", "multi-concurrency Lambda", "EC2-backed Lambda", "cold start elimination", "Graviton Lambda", "Lambda cost optimization with Reserved Instances" | ### MCP Servers diff --git a/plugins/aws-serverless/.claude-plugin/plugin.json b/plugins/aws-serverless/.claude-plugin/plugin.json index 2b0d7ddf..46190f3b 100644 --- a/plugins/aws-serverless/.claude-plugin/plugin.json +++ b/plugins/aws-serverless/.claude-plugin/plugin.json @@ -8,6 +8,8 @@ "aws", "lambda", "durable functions", + "managed-instances", + "lmi", "serverless", "development", "sam", diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/SKILL.md b/plugins/aws-serverless/skills/aws-lambda-managed-instances/SKILL.md new file mode 100644 index 00000000..ef15303c --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/SKILL.md @@ -0,0 +1,216 @@ +--- +name: aws-lambda-managed-instances +description: > + Evaluate, configure, and migrate workloads to AWS Lambda Managed Instances (LMI). 
+ Triggers on: Lambda Managed Instances, LMI, capacity provider, multi-concurrency Lambda, + dedicated instance Lambda, EC2-backed Lambda, cold start elimination, Graviton Lambda, + instance type for Lambda, Lambda cost optimization with Reserved Instances or Savings Plans. + Also trigger when users describe high-volume predictable workloads seeking cost savings, + or compare Lambda vs EC2 for steady-state traffic. For standard Lambda without LMI, + use the aws-lambda skill instead. +argument-hint: "[describe your workload or what you need help with]" +metadata: + tags: lambda, lmi, managed-instances, ec2, capacity-provider, multi-concurrency, cost-optimization +--- + +# AWS Lambda Managed Instances (LMI) + +Run Lambda functions on current-generation EC2 instances in your account while AWS manages provisioning, patching, scaling, routing, and load balancing. Combines Lambda's developer experience with EC2's pricing and hardware options. + +For standard Lambda development, see [aws-lambda skill](../aws-lambda/). For SAM/CDK deployment, see [aws-serverless-deployment skill](../aws-serverless-deployment/). 
+ +## When to Load Reference Files + +- **Cost comparison**, **pricing analysis**, **Lambda vs LMI cost**, **Savings Plans**, or **Reserved Instances** -> see [references/cost-comparison.md](references/cost-comparison.md) +- **Instance types**, **memory sizing**, **vCPU ratios**, **scaling tuning**, or **capacity provider config** -> see [references/configuration-guide.md](references/configuration-guide.md) +- **Thread safety**, **concurrency model**, **code review checklist**, **Powertools compatibility**, or **multi-concurrency readiness** -> see [references/thread-safety.md](references/thread-safety.md) +- **Before/after code examples**, **runtime-specific migration** (Node.js, Python, Java, .NET), or **connection pooling** -> see [references/migration-patterns.md](references/migration-patterns.md) +- **IAM roles**, **VPC setup**, **CLI commands**, **SAM template**, or **CDK example** -> see [references/infrastructure-setup.md](references/infrastructure-setup.md) and [scripts/setup-lmi.sh](scripts/setup-lmi.sh) +- **Errors**, **throttling**, **debugging**, or **stuck deployments** -> see [references/troubleshooting.md](references/troubleshooting.md) + +## Quick Decision: Is LMI Right for This Workload? 
| Signal         | LMI is a strong fit                                                                          | Standard Lambda is better                              |
| -------------- | -------------------------------------------------------------------------------------------- | ------------------------------------------------------ |
| Traffic        | Steady, predictable, 50M+ req/mo                                                              | Bursty, unpredictable, long idle                       |
| Cost           | Duration-heavy spend at scale                                                                 | Low or sporadic invocations                            |
| Cold starts    | Unacceptable (LMI eliminates them for provisioned capacity; scale-out may have brief delays)  | Tolerable or mitigated by SnapStart                    |
| Compute        | Latest CPUs, specific families, high network bandwidth                                        | Standard Lambda memory/CPU sufficient                  |
| Isolation      | Dedicated EC2 instances in your account, full VPC control                                     | Shared Firecracker micro-VMs acceptable                |
| Scale-to-zero  | Not needed (min 3 instances always run)                                                       | Required (pay nothing when idle)                       |
| Code readiness | Thread-safe (Node.js/Java/.NET) or any Python code                                            | Non-thread-safe Node.js/Java/.NET, expensive to change |

## Instructions

### Step 1: Assess the Workload

Gather these signals before recommending:

1. **Traffic pattern**: Steady vs bursty? Requests per second?
2. **Current costs**: Monthly Lambda spend? Existing Savings Plans?
3. **Runtime**: Node.js, Java, .NET, or Python?
4. **Memory/CPU**: How much memory? CPU-bound or I/O-bound?
5. **Execution duration**: Average and P99?
6. **Concurrency readiness**: Thread safety (Node.js/Java/.NET)? Shared `/tmp` paths? Per-invocation DB connections?
7. **VPC**: Already in a VPC? Private resource access needed?

### Step 2: Build the Cost Comparison

REQUIRED: Present a cost comparison before recommending LMI. Compare at minimum:

| Scenario         | When it wins                |
| ---------------- | --------------------------- |
| Lambda on-demand | Low volume, bursty traffic  |
| LMI on-demand    | High volume, steady traffic |

Rule of thumb: LMI becomes cost-competitive at 50-100M+ req/month with steady traffic.
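The rule of thumb can be encoded as a quick pre-screen to run before the full comparison (a sketch only — the thresholds are the ones quoted above, not an official sizing tool):

```python
def lmi_first_pass(requests_per_month: int, traffic_steady: bool,
                   needs_scale_to_zero: bool) -> str:
    """Rough screen based on the rule of thumb above -- not a pricing tool."""
    if needs_scale_to_zero:
        # LMI keeps a minimum of 3 instances running, so it never scales to zero
        return "standard Lambda"
    if traffic_steady and requests_per_month >= 50_000_000:
        return "evaluate LMI: build the Step 2 cost comparison"
    return "standard Lambda"

print(lmi_first_pass(120_000_000, traffic_steady=True, needs_scale_to_zero=False))
```

This only filters out obvious non-fits; a real recommendation still requires the cost comparison below.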
+ +For discount analysis (Savings Plans, Reserved Instances), refer users to the [AWS Pricing Calculator](https://calculator.aws/) and [references/cost-comparison.md](references/cost-comparison.md) for formulas and worked examples. Discount recommendations require workload-specific forecasting beyond this skill's scope. + +### Step 3: Configure the Deployment + +**Instance families** (400+ types, .large and up): C-series (compute), M-series (general), R-series (memory). ARM (Graviton) for best price-performance. + +**Memory-to-vCPU ratios**: 2:1 (compute), 4:1 (general, default), 8:1 (memory). Min 2 GB, max 32 GB. + +**Multi-concurrency defaults/vCPU**: Node.js 64, Java 32, .NET 32, Python 16. + +**Scaling**: MinExecutionEnvironments (default 3), MaxVCpuCount (required), TargetResourceUtilization. + +See [references/configuration-guide.md](references/configuration-guide.md) for decision trees and detailed tuning. + +### Step 4: Migrate the Code + +Review code for concurrency safety. LMI runs multiple invocations concurrently per execution environment, but the model differs by runtime: + +- **Python**: Process-based isolation — globals are NOT shared. No thread-safety changes needed. Focus on `/tmp` conflicts and memory sizing (per-process × concurrency). +- **Node.js**: Worker threads — globals shared within a worker. Requires async safety. Callback handlers not supported on Node.js 22. +- **Java/.NET**: OS threads/Tasks — handler shared across threads. Requires full thread safety. + +**Common issues (all runtimes)**: shared `/tmp` paths, per-invocation DB connections. +**Thread-safety issues (Node.js/Java/.NET only)**: mutable globals, non-thread-safe libs. + +See [references/thread-safety.md](references/thread-safety.md) for the review checklist and [references/migration-patterns.md](references/migration-patterns.md) for runtime-specific before/after code. + +### Step 5: Set Up Infrastructure + +1. 
Create two IAM roles: execution role (for the function) and operator role (for capacity provider EC2 management) +2. Configure VPC with subnets across 3+ AZs +3. Create capacity provider with VPC config and scaling limits +4. Create or update function with capacity provider attachment +5. Publish a version (triggers instance provisioning) + +See [references/infrastructure-setup.md](references/infrastructure-setup.md) for CLI commands and SAM templates. + +### Step 6: Validate and Cut Over + +1. Deploy to a non-production environment first +2. Monitor CloudWatch: CPU utilization, memory, concurrency, throttle rate +3. Gradual traffic shift with weighted aliases (10% → 50% → 100%) +4. Compare costs after 1-2 weeks of production data +5. Decommission standard Lambda once stable + +## Best Practices + +### Configuration + +- Do: Start with 4:1 ratio and runtime default concurrency +- Do: Use ARM (Graviton) unless x86 dependencies exist +- Do: Let Lambda choose instance types unless specific hardware needed +- Do: Set MaxVCpuCount to control cost ceiling +- Don't: Set MinExecutionEnvironments below 3 (breaks AZ resiliency) +- Don't: Over-restrict instance types (lowers availability) + +### Migration + +- Do: Start with I/O-heavy functions (benefit most from multi-concurrency; CPU-bound functions compete for same CPU) +- Do: Review code for concurrency safety before attaching to capacity provider (thread safety for Node.js/Java/.NET; `/tmp` and memory for Python) +- Do: Use weighted aliases for gradual traffic shift +- Do: Include request IDs in all log statements +- Do: Initialize DB pools and SDK clients outside the handler +- Don't: Write to hardcoded `/tmp` paths without request-unique naming +- Don't: Skip cost comparison — LMI is not always cheaper + +### Operations + +- Do: Set CloudWatch alarms on throttle rate > 1% and CPU > 80% +- Do: Plan for 14-day instance rotation (automatic) +- Don't: Manually terminate LMI EC2 instances (delete the capacity provider 
instead) +- Don't: Forget to publish a version — unpublished functions cannot run on LMI + +## Limits Quick Reference + +| Resource | Limit | +| ----------------- | ----------------------------------------- | +| Memory | 2 GB min, 32 GB max | +| Instances | 3 minimum (AZ resiliency) | +| Instance lifespan | 14 days (auto-replaced) | +| Concurrency/vCPU | 64 (Node.js), 32 (Java/.NET), 16 (Python) | +| Runtimes | Node.js, Java, .NET, Python | +| Instance families | C, M, R (.large and up) | +| Scaling | Absorbs 50% spike; doubles within 5 min | + +## Troubleshooting Quick Reference + +| Issue | Cause | Fix | +| -------------------------- | --------------------------------- | -------------------------------------------------------------------- | +| 429 throttles | Traffic exceeds scaling speed | Increase MinExecutionEnvironments or lower TargetResourceUtilization | +| Function stuck PENDING | Provisioning instances | Wait; check VPC/IAM config | +| Architecture mismatch | Function ≠ capacity provider arch | Align both to same architecture | +| Cannot terminate instances | Managed by capacity provider | Delete capacity provider instead | +| Race conditions | Code not thread-safe | See [references/thread-safety.md](references/thread-safety.md) | + +See [references/troubleshooting.md](references/troubleshooting.md) for detailed resolution steps. + +## Configuration + +### AWS CLI Setup + +REQUIRED: AWS credentials configured on the host machine. + +**Verify access**: Run `aws sts get-caller-identity` + +### Regional Availability + +Check the [Lambda Managed Instances documentation](https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html) for current regional availability. + +## Language Selection + +Default: TypeScript + +Override: "use Python" → Python, "use JavaScript" → JavaScript. When not specified, ALWAYS use TypeScript. + +## IaC Framework Selection + +Default: CDK + +Override: "use SAM" → SAM YAML, "use CloudFormation" → CloudFormation YAML. 
When not specified, ALWAYS use CDK. + +## Error Scenarios + +### Serverless MCP Server Unavailable + +- Inform user: "AWS Serverless MCP not responding" +- Ask: "Proceed without MCP support?" +- DO NOT continue without user confirmation + +### Unsupported Runtime + +- State: "Lambda Managed Instances does not yet support [runtime]" +- List supported runtimes +- Suggest standard Lambda as alternative + +### Unsupported Region + +- State: "Lambda Managed Instances is not yet available in [region]" +- List available regions + +## Resources + +- [Lambda Managed Instances Docs](https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html) +- [Introducing LMI (AWS Blog)](https://aws.amazon.com/blogs/aws/introducing-aws-lambda-managed-instances-serverless-simplicity-with-ec2-flexibility/) +- [Build High-Performance Apps with LMI](https://aws.amazon.com/blogs/compute/build-high-performance-apps-with-aws-lambda-managed-instances/) +- [Migrating Functions to LMI (AWS Blog)](https://aws.amazon.com/blogs/compute/migrating-your-functions-to-aws-lambda-managed-instances/) +- [LMI Pricing Calculator](https://aws-samples.github.io/sample-aws-lambda-managed-instances/) +- [LMI Samples Repository](https://github.com/aws-samples/sample-aws-lambda-managed-instances) +- [AWS Lambda Pricing](https://aws.amazon.com/lambda/pricing/) diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/configuration-guide.md b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/configuration-guide.md new file mode 100644 index 00000000..9b2bc458 --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/configuration-guide.md @@ -0,0 +1,69 @@ +# LMI Configuration Guide + +## Instance Type Decision Tree + +- **CPU-intensive** (encoding, ML, compression) → C-series, 2:1 ratio, concurrency=1/vCPU +- **Memory-intensive** (caching, large datasets) → R-series, 8:1 ratio +- **Network-intensive** (streaming, data transfer) 
→ Use AllowedInstanceTypes for n-suffix types, 4:1 ratio +- **General/balanced** (web APIs, microservices) → M-series, 4:1 ratio, default concurrency + +Architecture: ARM (Graviton, g-suffix) for price-performance. x86 (i=Intel, a=AMD) when dependencies require it. + +## Memory-to-vCPU Ratios + +| Ratio | Profile | When to use | Memory examples | +| ----- | ------- | -------------------------- | --------------------- | +| 2:1 | Compute | CPU-bound work | 2GB/1vCPU, 4GB/2vCPU | +| 4:1 | General | Most workloads (default) | 4GB/1vCPU, 8GB/2vCPU | +| 8:1 | Memory | Caching, data, Python apps | 8GB/1vCPU, 16GB/2vCPU | + +Min: 2 GB / 1 vCPU. Max: 32 GB. Memory must align with ratio multiples. + +## Memory Sizing from Existing Lambda + +| Current Lambda | LMI memory | Ratio | Rationale | +| -------------- | ------------- | ---------- | -------------------------------------------- | +| 128-512 MB | 2048 MB | 4:1 | LMI minimum; multi-concurrency shares memory | +| 512 MB-1 GB | 2048 MB | 4:1 | Room for concurrent requests | +| 1-2 GB | 4096 MB | 4:1 | Standard upgrade path | +| 2-4 GB | 4096-8192 MB | 4:1 or 8:1 | Depends on memory vs CPU bottleneck | +| 4-10 GB | 8192-16384 MB | 8:1 | Likely memory-heavy workload | + +## Concurrency Tuning + +| Runtime | Default/vCPU | I/O-bound | CPU-bound | +| ------- | ------------ | ---------------- | ---------- | +| Node.js | 64 | Keep or increase | 1 per vCPU | +| Java | 32 | Keep | 1 per vCPU | +| .NET | 32 | Keep | 1 per vCPU | +| Python | 16 | Keep | 1 per vCPU | + +Total capacity = MinExecutionEnvironments × PerExecutionEnvironmentMaxConcurrency + +## Capacity Provider Scaling Controls + +| Control | Default | Guidance | +| ------------------------- | ------------- | --------------------------------------------- | +| MinExecutionEnvironments | 3 | Increase for baseline capacity; never below 3 | +| MaxExecutionEnvironments | — | Set based on cost budget | +| MaxVCpuCount | Required | Start at 30, adjust by load | +| 
TargetResourceUtilization | ~50% headroom | Raise for cost savings (less burst tolerance) | +| AllowedInstanceTypes | All | Restrict only for specific hardware needs | +| ExcludedInstanceTypes | None | Exclude expensive types in dev/test | + +## Monitoring Thresholds + +- **CPU > 80%**: reduce concurrency or add vCPUs +- **CPU < 20%**: increase concurrency for better utilization +- **Throttle rate (429s) > 1%**: increase MinExecutionEnvironments or reduce utilization target +- **Memory > 90%**: increase memory or reduce concurrency +- **ExecutionEnvironmentConcurrency near ExecutionEnvironmentConcurrencyLimit**: saturation — reduce concurrency or scale out + +## CloudWatch Metrics Dimensions + +LMI metrics are split across two CloudWatch dimensions: + +- **Alias (live)**: Invocations, Errors, Throttles, Duration +- **Version ($LATEST or numbered)**: CPUUtilization, MemoryUtilization, ExecutionEnvironmentConcurrency, ExecutionEnvironmentCount + +Create a unified dashboard combining both views to monitor LMI performance effectively. diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/cost-comparison.md b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/cost-comparison.md new file mode 100644 index 00000000..d57c9031 --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/cost-comparison.md @@ -0,0 +1,17 @@ +# Lambda vs LMI Cost Comparison + +Use the [LMI Pricing Calculator](https://aws-samples.github.io/sample-aws-lambda-managed-instances/) for accurate, up-to-date cost comparisons based on your specific workload parameters (region, instance type, request volume, duration). + +When building a cost comparison for a user, gather: region, runtime, requests/month, average duration, memory, and architecture (x86 vs ARM). Plug these into the calculator rather than relying on hardcoded estimates. 
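For a back-of-the-envelope pass, the comparison has roughly this shape. Every rate below is an illustrative placeholder, not current AWS pricing — substitute real, region-specific rates from the calculators above — and the LMI side is deliberately simplified to instance-hours (real LMI billing has more components):

```python
def lambda_monthly_cost(requests, avg_ms, memory_gb,
                        per_million_req, per_gb_second):
    # On-demand Lambda: request charge + duration charge (GB-seconds)
    gb_seconds = requests * (avg_ms / 1000.0) * memory_gb
    return (requests / 1_000_000) * per_million_req + gb_seconds * per_gb_second

def lmi_monthly_cost(instance_hourly, instance_count=3, hours=730):
    # Simplified: instance-hours only, with the minimum 3-instance fleet
    return instance_hourly * instance_count * hours

# PLACEHOLDER rates -- replace with real pricing for your region
lam = lambda_monthly_cost(100_000_000, 120, 1.0,
                          per_million_req=0.20, per_gb_second=0.0000167)
lmi = lmi_monthly_cost(instance_hourly=0.10)
print(f"Lambda ~${lam:,.2f}/mo vs LMI ~${lmi:,.2f}/mo")
```

The fixed fleet cost explains the shape of the break-even: below the rule-of-thumb volume, the 3-instance floor dominates and standard Lambda wins.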
+ +## When LMI is NOT Cheaper + +- < 50M req/month (fixed 3-instance cost exceeds Lambda) +- Very short functions (< 100ms duration) +- Highly bursty, unpredictable traffic +- Workloads needing scale-to-zero + +## Tools + +- [LMI Pricing Calculator](https://aws-samples.github.io/sample-aws-lambda-managed-instances/) — interactive comparison tool +- [AWS Pricing Calculator](https://calculator.aws/) — general AWS cost estimation diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/infrastructure-setup.md b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/infrastructure-setup.md new file mode 100644 index 00000000..81c234dd --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/infrastructure-setup.md @@ -0,0 +1,234 @@ +# LMI Infrastructure Setup + +## IAM Roles (Two Required) + +### 1. Execution Role (for the function) + +Trust policy: + +```json +{ + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Principal": { "Service": "lambda.amazonaws.com" }, + "Action": "sts:AssumeRole" + }] +} +``` + +Minimum permissions: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "logs:CreateLogGroup", + "logs:CreateLogStream", + "logs:PutLogEvents" + ], + "Resource": "arn:aws:logs:*:*:log-group:/aws/lambda/*" + } + ] +} +``` + +Add VPC permissions only if the function accesses VPC resources: + +```json +{ + "Effect": "Allow", + "Action": [ + "ec2:CreateNetworkInterface", + "ec2:DescribeNetworkInterfaces", + "ec2:DeleteNetworkInterface" + ], + "Resource": "*" +} +``` + +### 2. 
Operator Role (for capacity provider EC2 management) + +Trust policy: + +```json +{ + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Principal": { "Service": "lambda.amazonaws.com" }, + "Action": "sts:AssumeRole" + }] +} +``` + +Minimum permissions (scoped with conditions): + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": ["ec2:RunInstances", "ec2:CreateTags", "ec2:AttachNetworkInterface"], + "Resource": [ + "arn:aws:ec2:*:*:instance/*", + "arn:aws:ec2:*:*:network-interface/*", + "arn:aws:ec2:*:*:volume/*" + ], + "Condition": { + "StringEquals": { + "ec2:ManagedResourceOperator": "scaler.lambda.amazonaws.com" + } + } + }, + { + "Effect": "Allow", + "Action": [ + "ec2:DescribeAvailabilityZones", + "ec2:DescribeCapacityReservations", + "ec2:DescribeInstances", + "ec2:DescribeInstanceStatus", + "ec2:DescribeInstanceTypeOfferings", + "ec2:DescribeInstanceTypes", + "ec2:DescribeSecurityGroups", + "ec2:DescribeSubnets" + ], + "Resource": "*" + }, + { + "Effect": "Allow", + "Action": ["ec2:RunInstances", "ec2:CreateNetworkInterface"], + "Resource": [ + "arn:aws:ec2:*:*:subnet/*", + "arn:aws:ec2:*:*:security-group/*" + ] + }, + { + "Effect": "Allow", + "Action": "ec2:RunInstances", + "Resource": "arn:aws:ec2:*:*:image/*", + "Condition": { + "StringEquals": { "ec2:Owner": "amazon" } + } + }, + { + "Effect": "Allow", + "Action": "iam:PassRole", + "Resource": "" + } + ] +} +``` + +The `ec2:ManagedResourceOperator` condition ensures RunInstances/CreateTags only apply to Lambda-managed instances. First-time capacity provider creation also requires `iam:CreateServiceLinkedRole`. + +## VPC Requirements + +LMI runs functions on EC2 instances inside the VPC. These instances need VPC endpoints or NAT to reach AWS services. 
+ +- 3+ subnets across different AZs (for default 3-instance fleet) +- Security groups: HTTPS egress (port 443) for AWS API calls; no ingress needed +- Required VPC endpoints: + +| Endpoint | Type | Purpose | +| --------------------- | --------- | --------------------- | +| S3 | Gateway | Object storage access | +| DynamoDB | Gateway | Table access | +| SQS | Interface | Queue operations | +| CloudWatch Logs | Interface | Log delivery | +| CloudWatch Monitoring | Interface | Metrics/EMF | +| X-Ray | Interface | Distributed tracing | + +## CLI Workflow + +### Required Parameters + +| Parameter | Description | +| -------------------- | ------------------------------------------- | +| `SUBNET_IDS` | Comma-separated subnet IDs across 3+ AZs | +| `SECURITY_GROUP_ID` | Security group ID for the capacity provider | +| `ACCOUNT_ID` | AWS account ID | +| `OPERATOR_ROLE_ARN` | ARN of the operator role (see above) | +| `EXECUTION_ROLE_ARN` | ARN of the execution role (see above) | +| `FUNCTION_NAME` | Name for the Lambda function | +| `CP_NAME` | Name for the capacity provider | +| `ARCHITECTURE` | `arm64` (Graviton) or `x86_64` | + +### Automated Setup + +See [`scripts/setup-lmi.sh`](../scripts/setup-lmi.sh) — set the environment variables above and run: + +```bash +./scripts/setup-lmi.sh +``` + +### Manual Steps + +```bash +# 1. Create capacity provider +aws lambda create-capacity-provider \ + --capacity-provider-name $CP_NAME \ + --vpc-config "SubnetIds=[$SUBNET_IDS],SecurityGroupIds=[$SECURITY_GROUP_ID]" \ + --permissions-config "CapacityProviderOperatorRoleArn=$OPERATOR_ROLE_ARN" \ + --instance-requirements "Architectures=[$ARCHITECTURE]" \ + --capacity-provider-scaling-config "MaxVCpuCount=30" + +# 2. 
Create function +aws lambda create-function --function-name $FUNCTION_NAME --runtime python3.13 \ + --handler app.handler --zip-file fileb://function.zip \ + --role $EXECUTION_ROLE_ARN --architectures $ARCHITECTURE \ + --memory-size 4096 \ + --capacity-provider-config \ + "LambdaManagedInstancesCapacityProviderConfig={CapacityProviderArn=arn:aws:lambda:$AWS_REGION:$ACCOUNT_ID:capacity-provider:$CP_NAME}" + +# 3. Publish version (triggers provisioning — takes several minutes) +aws lambda publish-version --function-name $FUNCTION_NAME + +# 4. Invoke (must use versioned ARN) +aws lambda invoke --function-name $FUNCTION_NAME:1 --payload '{}' response.json +``` + +Architecture must match between function and capacity provider. + +## SAM Template + +```yaml +Resources: + MyCP: + Type: AWS::Lambda::CapacityProvider + Properties: + CapacityProviderName: my-cp + VpcConfig: + SubnetIds: [!Ref Sub1, !Ref Sub2, !Ref Sub3] + SecurityGroupIds: [!Ref SG] + PermissionsConfig: + CapacityProviderOperatorRoleArn: !GetAtt OpRole.Arn + InstanceRequirements: + Architectures: [arm64] + CapacityProviderScalingConfig: + MaxVCpuCount: 30 + + MyFn: + Type: AWS::Serverless::Function + Properties: + Runtime: python3.13 + Handler: app.handler + MemorySize: 4096 + Architectures: [arm64] + CapacityProviderConfig: + LambdaManagedInstancesCapacityProviderConfig: + CapacityProviderArn: !GetAtt MyCP.Arn +``` + +## Cleanup + +```bash +aws lambda delete-function --function-name my-fn +aws lambda delete-capacity-provider --capacity-provider-name my-cp +``` + +Deleting the capacity provider destroys all associated EC2 instances. 
diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/migration-patterns.md b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/migration-patterns.md new file mode 100644 index 00000000..7898f03f --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/migration-patterns.md @@ -0,0 +1,143 @@ +# LMI Migration Patterns + +Before/after code examples for migrating to multi-concurrency. + +## Node.js + +### Global State + +```javascript +// BEFORE (race condition) +let requestCount = 0; +exports.handler = async (event) => { + requestCount++; + return { count: requestCount }; +}; + +// AFTER (request-isolated) +const { AsyncLocalStorage } = require('node:async_hooks'); +const als = new AsyncLocalStorage(); +exports.handler = async (event) => { + return als.run({ id: event.requestContext?.requestId }, async () => { + return await processEvent(event); + }); +}; +``` + +### File I/O + +```javascript +// BEFORE (shared path) +fs.writeFileSync('/tmp/output.json', JSON.stringify(data)); + +// AFTER (request-unique path) +const path = `/tmp/output-${event.requestContext?.requestId}.json`; +try { fs.writeFileSync(path, JSON.stringify(data)); } +finally { fs.unlinkSync(path); } +``` + +### Database + +```javascript +// BEFORE (per-invocation connection) +exports.handler = async (event) => { + const conn = await mysql.createConnection({/*...*/}); + const [rows] = await conn.execute('SELECT ...'); + await conn.end(); +}; + +// AFTER (shared pool) +const pool = mysql.createPool({ connectionLimit: 10, /*...*/ }); +exports.handler = async (event) => { + const [rows] = await pool.execute('SELECT ...'); + return rows; +}; +``` + +## Python + +Python on LMI uses **process-based isolation**. Each concurrent invocation runs in its own process with independent memory. Global state is NOT shared, so no locking is needed. The main migration concerns are `/tmp` conflicts, memory sizing, and connection pooling. 
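The isolation claim can be demonstrated with a standard-library sketch, using `multiprocessing` as a stand-in for LMI's per-process model (not the actual Lambda runtime):

```python
import multiprocessing as mp

cache = {}  # module-level global, like the module-level caches discussed above

def worker(queue):
    cache["key"] = "set-in-child"  # mutates this child process's copy only
    queue.put(dict(cache))

if __name__ == "__main__":
    queue = mp.Queue()
    child = mp.Process(target=worker, args=(queue,))
    child.start()
    seen_by_child = queue.get()
    child.join()
    assert seen_by_child == {"key": "set-in-child"}
    assert cache == {}  # parent's copy is untouched: no cross-process sharing
    print("globals are per-process, not shared")
```

The same property is why memory multiplies: each concurrent process carries its own copy of every module-level object.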
+ +### Global State (No Changes Needed) + +```python +# This is SAFE on LMI — each process has its own copy of cache +cache = {} +def handler(event, context): + cache[event['key']] = compute(event) + return cache[event['key']] + +# Module-level clients are also safe (isolated per process) +s3_client = boto3.client('s3') +dynamodb = boto3.resource('dynamodb') +``` + +### File I/O (Change Required — `/tmp` is shared across processes) + +```python +# BEFORE (conflict — all processes share /tmp) +with open('/tmp/data.json', 'w') as f: json.dump(event, f) + +# AFTER (request-unique path) +path = f'/tmp/data-{context.aws_request_id}.json' +try: + with open(path, 'w') as f: json.dump(event, f) +finally: + os.unlink(path) +``` + +### Database (Change Required — each process needs pooled connections) + +```python +# BEFORE (per-invocation connection — exhausts limits at concurrency) +def handler(event, context): + conn = psycopg2.connect(host='...') + +# AFTER (pool per process — initialized at module level) +from psycopg2 import pool +db_pool = pool.SimpleConnectionPool(1, 3, host=os.environ['DB_HOST']) +def handler(event, context): + conn = db_pool.getconn() + try: return query(conn, event) + finally: db_pool.putconn(conn) +# Note: total connections = pool_size × concurrency (e.g., 3 × 16 = 48) +``` + +### Memory Sizing + +```python +# A function using 200 MB per process with default concurrency of 16: +# Total memory ≈ 200 MB × 16 = 3.2 GB +# Use 4:1 or 8:1 memory-to-vCPU ratio to accommodate +# Monitor MemoryUtilization metric and adjust as needed +``` + +## Java + +### Global State + +```java +// BEFORE (race condition) +private static Map cache = new HashMap<>(); + +// AFTER (thread-safe) +private static final ConcurrentHashMap cache = new ConcurrentHashMap<>(); +// Use cache.computeIfAbsent(key, k -> compute(k)); +``` + +### Database + +```java +// BEFORE (per-invocation) +Connection conn = DriverManager.getConnection("jdbc:..."); + +// AFTER (HikariCP pool, static 
init) +private static final HikariDataSource ds; +static { + HikariConfig c = new HikariConfig(); + c.setJdbcUrl(System.getenv("DB_URL")); + c.setMaximumPoolSize(10); + ds = new HikariDataSource(c); +} +// Use: try (Connection conn = ds.getConnection()) { ... } +``` diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/thread-safety.md b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/thread-safety.md new file mode 100644 index 00000000..d6d677f2 --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/thread-safety.md @@ -0,0 +1,106 @@ +# Concurrency Safety for LMI + +LMI runs multiple invocations concurrently in the same execution environment. The concurrency model differs by runtime — some require thread safety, others provide process isolation. + +## Code Review Checklist + +When reviewing a function for LMI readiness, check each item: + +- [ ] No shared `/tmp` paths (use request ID in filenames, clean up after — shared across ALL runtimes) +- [ ] Database connections use pools (initialized outside handler, not per-invocation) +- [ ] SDK clients outside handler (module-level singletons are fine — they are thread-safe) +- [ ] Logging includes request ID (for tracing concurrent requests) +- [ ] **Node.js/Java/.NET only:** No global/static mutable variables (use immutable or request-local state) +- [ ] **Node.js/Java/.NET only:** Thread-safe libraries only (check DB drivers, HTTP clients, caching libs) +- [ ] **Node.js/Java/.NET only:** No request state in global scope (use AsyncLocalStorage, ThreadLocal, or AsyncLocal) +- [ ] **Node.js/Java/.NET only:** No environment variable mutation during requests +- [ ] **Python only:** Memory budget accounts for per-process multiplication (memory × concurrency) + +## Runtime-Specific Guidance + +### Python (Process-Based Isolation) + +Python uses **multiple independent processes**, each with its own interpreter and memory space.
Global variables, module-level caches, and singleton objects are duplicated per process, not shared. If a function works on standard Lambda today, it works on LMI without code changes related to shared state. + +**Key concerns:** + +- Memory consumption: total footprint ≈ per-process memory × concurrency. A 200 MB function with 16 concurrent processes can consume 3+ GB. +- `/tmp` filesystem is shared across all processes — use `context.aws_request_id` in filenames +- Each process needs its own connection pool — size pools per-process, not globally +- Prefer 4:1 or 8:1 memory-to-vCPU ratio to accommodate memory multiplication +- Monitor `MemoryUtilization` metric and adjust ratio if needed + +**Safe patterns (no locking needed):** + +- Module-level mutable globals (isolated per process) +- Module-level SDK clients and caches +- `os.environ` reads + +### Node.js (Worker Threads + Async/Await) + +Uses worker threads (configurable via `AWS_LAMBDA_NODEJS_WORKER_COUNT`) combined with async/await event loops. The handler and global state are **shared across concurrent invocations within a worker thread**. + +The `await` keyword yields control to the event loop, which may execute another invocation that overwrites shared state before the first resumes. + +**Key concerns:** + +- Use `AsyncLocalStorage` from `node:async_hooks` for request context +- Keep mutable state within handler local scope +- Initialize SDK clients and DB pools at module level (they are thread-safe) +- Avoid module-level mutable state (`let count = 0` is a race condition) +- Callback-based handlers are NOT supported on Node.js 22 — use async handlers + +### Java (OS Threads) + +Uses OS-level threads. Lambda loads the handler class once and invokes `handleRequest` from multiple threads simultaneously (identical to a Java app server). 
+ +**Key concerns:** + +- Use immutable objects and thread-safe collections (`ConcurrentHashMap`, `Collections.synchronizedList`) +- Initialize SDK clients and connection pools in constructor or static block +- Avoid mutable `static` fields +- Use `ThreadLocal` for request-specific state +- Use HikariCP or similar for connection pooling (AWS SDK for Java 2.x clients are thread-safe) + +### .NET (Task-Based Concurrency) + +Uses a single process with .NET Tasks (same model as ASP.NET Core). The handler object is shared across all Tasks. + +**Key concerns:** + +- Use `AsyncLocal` for request-scoped data +- Inject scoped services via DI container +- Initialize `HttpClient` and SDK clients as singletons +- Use `ConcurrentDictionary` and `SemaphoreSlim` for thread-safe access +- Invocation timeouts are NOT enforced by the runtime — use `ILambdaContext.RemainingTime` to detect approaching timeouts + +## Common Anti-Patterns + +| Anti-pattern | Affected Runtimes | Risk | Fix | +| -------------------------------- | ------------------- | --------------------------------- | --------------------------------------------- | +| New DB connection per invocation | All | Exhausts connection limits | Module-level connection pool | +| Hardcoded `/tmp` paths | All | File conflicts across processes | Use `aws_request_id` in path | +| Logging without request ID | All | Unreadable interleaved logs | Include `aws_request_id` | +| Mutable module-level state | Node.js, Java, .NET | Race condition / state corruption | Request-local scope or concurrent collections | +| Setting env vars during request | Node.js, Java, .NET | Race condition | Pass state via parameters | +| Assuming sequential execution | Node.js, Java, .NET | State corruption | Each invocation must be self-contained | +| Ignoring memory multiplication | Python | OOM at high concurrency | Account for per-process × concurrency | + +## Powertools for AWS Lambda Compatibility + +Powertools handles multi-concurrency transparently 
(structured logging, tracing, metrics). No code changes needed. + +| Runtime | Package | Minimum Version | +| ---------- | -------------------------------------- | --------------- | +| Python | Powertools for AWS Lambda (Python) | 3.23.0 | +| TypeScript | Powertools for AWS Lambda (TypeScript) | 2.29.0 | +| Java | Powertools for AWS Lambda (Java) | 2.8.0 | +| .NET | Powertools for AWS Lambda (.NET) | 3.1.0 | + +AWS SDK and X-Ray minimum versions: + +| Runtime | AWS SDK minimum | X-Ray SDK minimum | +| ------- | ----------------------------------- | ----------------------------- | +| Node.js | AWS SDK for JavaScript v3 (3.933.0) | 3.12.0 | +| Java | AWS SDK for Java 2.0 (2.34.0) | 2.20.0 | +| .NET | AWSSDK.Core (4.0.0.32) | AWSXRayRecorder.Core (2.16.0) | diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/troubleshooting.md b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/troubleshooting.md new file mode 100644 index 00000000..4b17579e --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/references/troubleshooting.md @@ -0,0 +1,42 @@ +# LMI Troubleshooting + +## Common Issues + +| Issue | Cause | Resolution | +| ------------------------------ | ------------------------------------------------ | ----------------------------------------------------------------------------------- | +| 429 throttles during scale-up | Traffic doubled faster than 5-min scaling window | Increase MinExecutionEnvironments or lower TargetResourceUtilization | +| Function stuck in PENDING | Capacity provider provisioning instances | Wait several minutes; verify VPC subnets have IP capacity and IAM roles are correct | +| Architecture mismatch error | Function architecture ≠ capacity provider | Align both to arm64 or x86_64 | +| Cannot terminate EC2 instances | LMI instances managed by capacity provider | Delete capacity provider to destroy instances; cannot use EC2 console | +| High CPU, low throughput | 
Concurrency too high for CPU-bound work | Reduce PerExecutionEnvironmentMaxConcurrency to 1 per vCPU | +| Race conditions in production | Code not thread-safe for multi-concurrency | Review with checklist in thread-safety.md | +| Function version not ACTIVE | Fewer than 3 execution environments ready | Wait for provisioning; check capacity provider status | +| Unexpected 500 errors | Unhandled concurrent access to shared state | Add thread-safe patterns from migration-patterns.md | +| CloudWatch logs missing | VPC egress not configured | Add NAT Gateway or CloudWatch Logs VPC endpoint | +| High costs despite low traffic | Minimum 3 instances always running | Evaluate if standard Lambda is more cost-effective | + +## Debugging Steps + +### Function Not Starting + +1. Check capacity provider status: `aws lambda get-capacity-provider --capacity-provider-name <capacity-provider-name>` +2. Verify subnets span 3+ AZs with available IPs +3. Confirm security group allows necessary egress +4. Check operator role has `AWSLambdaManagedEC2ResourceOperator` policy +5. Look for `Operator` field in EC2 DescribeInstances or `aws:lambda:capacity-provider` tag + +### Performance Issues + +1. Check CloudWatch metrics (5-min intervals): CPU utilization, memory, concurrency/env +2. If CPU > 80%: reduce concurrency or add vCPUs (increase memory with appropriate ratio) +3. If throttles > 1%: increase MinExecutionEnvironments +4. If CPU < 20%: increase concurrency — resources are underutilized +5. For Python: verify 4:1 or 8:1 ratio (GIL limits CPU parallelism) + +### Cost Issues + +1. Verify instance count matches actual need (not over-provisioned) +2. Check if Savings Plans or RIs are applied to these instances +3. Compare actual costs against the 4-column estimate from cost-comparison.md +4. If traffic is lower than expected, consider reducing MaxVCpuCount +5.
For dev/test: use ExcludedInstanceTypes to avoid expensive instance families diff --git a/plugins/aws-serverless/skills/aws-lambda-managed-instances/scripts/setup-lmi.sh b/plugins/aws-serverless/skills/aws-lambda-managed-instances/scripts/setup-lmi.sh new file mode 100755 index 00000000..8485d6f9 --- /dev/null +++ b/plugins/aws-serverless/skills/aws-lambda-managed-instances/scripts/setup-lmi.sh @@ -0,0 +1,69 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Setup script for AWS Lambda Managed Instances (LMI) +# Usage: ./setup-lmi.sh <function-name> <capacity-provider-name> [architecture] +# +# Prerequisites: +# - AWS CLI configured with appropriate credentials +# - VPC subnets and security group created +# - IAM roles created (see references/infrastructure-setup.md) +# +# Environment variables (required): +# SUBNET_IDS - Comma-separated subnet IDs (3+ AZs) +# SECURITY_GROUP_ID - Security group ID +# ACCOUNT_ID - AWS account ID +# OPERATOR_ROLE_ARN - ARN of the LMI operator role +# EXECUTION_ROLE_ARN - ARN of the Lambda execution role +# +# Environment variables (optional): +# AWS_REGION - AWS region (default: from AWS CLI config) +# MAX_VCPU_COUNT - Max vCPU limit (default: 30) +# MEMORY_SIZE - Function memory in MB (default: 4096) +# RUNTIME - Lambda runtime (default: python3.13) +# HANDLER - Function handler (default: app.handler) + +FUNCTION_NAME="${1:?Usage: $0 <function-name> <capacity-provider-name> [architecture]}" +CP_NAME="${2:?Usage: $0 <function-name> <capacity-provider-name> [architecture]}" +ARCHITECTURE="${3:-arm64}" + +: "${SUBNET_IDS:?Set SUBNET_IDS (comma-separated, 3+ AZs)}" +: "${SECURITY_GROUP_ID:?Set SECURITY_GROUP_ID}" +: "${ACCOUNT_ID:?Set ACCOUNT_ID}" +: "${OPERATOR_ROLE_ARN:?Set OPERATOR_ROLE_ARN}" +: "${EXECUTION_ROLE_ARN:?Set EXECUTION_ROLE_ARN}" + +MAX_VCPU_COUNT="${MAX_VCPU_COUNT:-30}" +MEMORY_SIZE="${MEMORY_SIZE:-4096}" +RUNTIME="${RUNTIME:-python3.13}" +HANDLER="${HANDLER:-app.handler}" +REGION="${AWS_REGION:-$(aws configure get region)}" +echo "==> Creating capacity provider: ${CP_NAME}" +aws lambda create-capacity-provider \ + --capacity-provider-name "${CP_NAME}" \ + --vpc-config
"SubnetIds=[${SUBNET_IDS}],SecurityGroupIds=[${SECURITY_GROUP_ID}]" \ + --permissions-config "CapacityProviderOperatorRoleArn=${OPERATOR_ROLE_ARN}" \ + --instance-requirements "Architectures=[${ARCHITECTURE}]" \ + --capacity-provider-scaling-config "MaxVCpuCount=${MAX_VCPU_COUNT}" + +CP_ARN="arn:aws:lambda:${REGION}:${ACCOUNT_ID}:capacity-provider:${CP_NAME}" + +echo "==> Creating function: ${FUNCTION_NAME}" +aws lambda create-function \ + --function-name "${FUNCTION_NAME}" \ + --runtime "${RUNTIME}" \ + --handler "${HANDLER}" \ + --zip-file fileb://function.zip \ + --role "${EXECUTION_ROLE_ARN}" \ + --architectures "${ARCHITECTURE}" \ + --memory-size "${MEMORY_SIZE}" \ + --capacity-provider-config \ + "LambdaManagedInstancesCapacityProviderConfig={CapacityProviderArn=${CP_ARN}}" + +echo "==> Publishing version (triggers instance provisioning — may take several minutes)" +VERSION=$(aws lambda publish-version --function-name "${FUNCTION_NAME}" --query 'Version' --output text) + +echo "==> Done. 
Function version: ${VERSION}" +echo " Invoke with: aws lambda invoke --function-name ${FUNCTION_NAME}:${VERSION} --cli-binary-format raw-in-base64-out --payload '{}' response.json" +echo " Monitor provisioning: aws lambda get-capacity-provider --capacity-provider-name ${CP_NAME}" diff --git a/plugins/aws-serverless/skills/aws-lambda/SKILL.md b/plugins/aws-serverless/skills/aws-lambda/SKILL.md index 9e074af2..4dd14e2f 100644 --- a/plugins/aws-serverless/skills/aws-lambda/SKILL.md +++ b/plugins/aws-serverless/skills/aws-lambda/SKILL.md @@ -16,6 +16,7 @@ Use SAM CLI for project initialization and deployment, Lambda Web Adapter for we - **Web Application Deployment**: Deploy full-stack applications with Lambda Web Adapter - **Event Source Mappings**: Configure Lambda triggers for DynamoDB, Kinesis, SQS, Kafka - **Lambda durable functions**: Resilient multi-step applications with checkpointing — see the [durable-functions skill](../aws-lambda-durable-functions/) for guidance +- **Lambda Managed Instances**: Run Lambda on dedicated EC2 instances with managed lifecycle — see the [managed-instances skill](../aws-lambda-managed-instances/) for evaluation, configuration, and migration guidance - **Schema Management**: Type-safe EventBridge integration with schema registry - **Observability**: CloudWatch logs, metrics, and X-Ray tracing - **Performance Optimization**: Right-sizing, cost optimization, and troubleshooting @@ -30,6 +31,7 @@ Load the appropriate reference file based on what the user is working on: - **Event sources**, **DynamoDB Streams**, **Kinesis**, **SQS**, **Kafka**, **S3 notifications**, or **SNS** -> see [references/event-sources.md](references/event-sources.md) - **EventBridge**, **event bus**, **event patterns**, **event design**, **Pipes**, or **schema registry** -> see [references/event-driven-architecture.md](references/event-driven-architecture.md) - **Durable functions**, **checkpointing**, **replay model**, **saga pattern**, or **long-running Lambda workflows** -> see the
[durable-functions skill](../aws-lambda-durable-functions/) (separate skill in this plugin with full SDK reference, testing, and deployment guides) +- **Lambda Managed Instances**, **LMI**, **capacity providers**, **multi-concurrency**, **EC2-backed Lambda**, **cold start elimination**, or **Lambda cost optimization with Reserved Instances** -> see the [managed-instances skill](../aws-lambda-managed-instances/) (separate skill in this plugin for evaluation, configuration, and migration) - **Orchestration**, **workflows**, or **Durable Functions vs Step Functions** -> see [references/orchestration-and-workflows.md](references/orchestration-and-workflows.md) - **Step Functions**, **ASL**, **state machines**, **JSONata**, **Distributed Map**, or **SDK integrations** -> see [references/step-functions.md](references/step-functions.md) - **Step Functions testing**, **TestState API**, **mocking service integrations**, or **state machine unit tests** -> see [references/step-functions-testing.md](references/step-functions-testing.md)