diff --git a/aws-lambda-managed-instances/POWER.md b/aws-lambda-managed-instances/POWER.md new file mode 100644 index 0000000..293f231 --- /dev/null +++ b/aws-lambda-managed-instances/POWER.md @@ -0,0 +1,161 @@ +--- +name: "aws-lambda-managed-instances" +displayName: "AWS Lambda Managed Instances" +description: "Evaluate, configure, and migrate workloads to AWS Lambda Managed Instances (LMI). Run Lambda functions on EC2 instances in your account while AWS manages provisioning, patching, scaling, routing, and load balancing." +keywords: ["lambda", "lmi", "managed-instances", "ec2", "capacity-provider", "multi-concurrency", "cold-start", "graviton", "cost-optimization", "serverless", "lambda-pricing", "reserved-instances", "savings-plans"] +author: "AWS" +--- + +# AWS Lambda Managed Instances (LMI) + +Run Lambda functions on current-generation EC2 instances in your account while AWS manages provisioning, patching, scaling, routing, and load balancing. Combines Lambda's developer experience with EC2's pricing and hardware options. + +## Onboarding + +### Step 1: Validate AWS CLI access + +Before using this power, ensure AWS credentials are configured: + +```bash +aws sts get-caller-identity +``` + +If this fails, configure credentials via `aws configure` or set `AWS_PROFILE`. + +### Step 2: Check regional availability + +Lambda Managed Instances is available in select regions. Verify availability: + +- [Lambda Managed Instances documentation](https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html) + +## When to Load Steering Files + +- **Cost comparison**, **pricing analysis**, **Lambda vs LMI cost**, **Savings Plans**, or **Reserved Instances** → `cost-comparison.md` +- **Instance types**, **memory sizing**, **vCPU ratios**, **scaling tuning**, or **capacity provider config** → `configuration-guide.md` +- **Thread safety**, **concurrency model**, **code review checklist**, **Powertools compatibility**, or **multi-concurrency readiness** → `thread-safety.md` +- **Before/after code examples**, **runtime-specific migration** (Node.js, Python, Java, .NET), or **connection pooling** → `migration-patterns.md` +- **IAM roles**, **VPC setup**, **CLI commands**, **SAM template**, or **CDK example** → `infrastructure-setup.md` +- **Errors**, **throttling**, **debugging**, or **stuck deployments** → `troubleshooting.md` + +## Quick Decision: Is LMI Right for This Workload? + +| Signal | LMI is a strong fit | Standard Lambda is better | +|--------|---------------------|---------------------------| +| Traffic | Steady, predictable, 50M+ req/mo | Bursty, unpredictable, long idle | +| Cost | Duration-heavy spend at scale | Low or sporadic invocations | +| Cold starts | Unacceptable (LMI eliminates for provisioned capacity) | Tolerable or mitigated by SnapStart | +| Compute | Latest CPUs, specific families, high network bandwidth | Standard Lambda memory/CPU sufficient | +| Isolation | Dedicated EC2 instances in your account, full VPC control | Shared Firecracker micro-VMs acceptable | +| Scale-to-zero | Not needed (min 3 instances always run) | Required (pay nothing when idle) | +| Code readiness | Thread-safe (Node.js/Java/.NET) or any Python code | Non-thread-safe code, expensive to change | + +## Workflow + +### Step 1: Assess the Workload + +Gather these signals before recommending: + +1. **Traffic pattern**: Steady vs bursty? Requests per second? +2. **Current costs**: Monthly Lambda spend? Existing Savings Plans? +3. **Runtime**: Node.js, Java, .NET, or Python? +4. 
**Memory/CPU**: How much memory? CPU-bound or I/O-bound? +5. **Execution duration**: Average and P99? +6. **Concurrency readiness**: Thread safety (Node.js/Java/.NET)? Shared `/tmp` paths? Per-invocation DB connections? +7. **VPC**: Already in a VPC? Private resource access needed? + +### Step 2: Build the Cost Comparison + +REQUIRED: Present a cost comparison before recommending LMI. Compare at minimum: + +| Scenario | When it wins | +|----------|-------------| +| Lambda on-demand | Low volume, bursty traffic | +| LMI on-demand | High volume, steady traffic | + +Rule of thumb: LMI becomes cost-competitive at 50-100M+ req/month with steady traffic. + +Use the [LMI Pricing Calculator](https://aws-samples.github.io/sample-aws-lambda-managed-instances/) for accurate comparisons. + +### Step 3: Configure the Deployment + +- **Instance families** (400+ types, .large and up): C-series (compute), M-series (general), R-series (memory). ARM (Graviton) for best price-performance. +- **Memory-to-vCPU ratios**: 2:1 (compute), 4:1 (general, default), 8:1 (memory). Min 2 GB, max 32 GB. +- **Multi-concurrency defaults/vCPU**: Node.js 64, Java 32, .NET 32, Python 16. +- **Scaling**: MinExecutionEnvironments (default 3), MaxVCpuCount (required), TargetResourceUtilization. + +See `configuration-guide.md` for decision trees and detailed tuning. + +### Step 4: Migrate the Code + +Review code for concurrency safety. LMI runs multiple invocations concurrently per execution environment: + +- **Python**: Process-based isolation — globals are NOT shared. No thread-safety changes needed. Focus on `/tmp` conflicts and memory sizing. +- **Node.js**: Worker threads — globals shared within a worker. Requires async safety. +- **Java/.NET**: OS threads/Tasks — handler shared across threads. Requires full thread safety. + +See `thread-safety.md` for the review checklist and `migration-patterns.md` for before/after code. + +### Step 5: Set Up Infrastructure + +1. Create two IAM roles: execution role (for the function) and operator role (for capacity provider EC2 management) +2. Configure VPC with subnets across 3+ AZs +3. Create capacity provider with VPC config and scaling limits +4. Create or update function with capacity provider attachment +5. Publish a version (triggers instance provisioning) + +See `infrastructure-setup.md` for CLI commands and SAM templates. + +### Step 6: Validate and Cut Over + +1. Deploy to a non-production environment first +2. Monitor CloudWatch: CPU utilization, memory, concurrency, throttle rate +3. Gradual traffic shift with weighted aliases (10% → 50% → 100%) +4. Compare costs after 1-2 weeks of production data +5. 
Decommission standard Lambda once stable + +## Best Practices + +### Configuration + +- Start with 4:1 ratio and runtime default concurrency +- Use ARM (Graviton) unless x86 dependencies exist +- Let Lambda choose instance types unless specific hardware needed +- Set MaxVCpuCount to control cost ceiling +- Never set MinExecutionEnvironments below 3 (breaks AZ resiliency) + +### Migration + +- Start with I/O-heavy functions (benefit most from multi-concurrency) +- Review code for concurrency safety before attaching to capacity provider +- Use weighted aliases for gradual traffic shift +- Include request IDs in all log statements +- Initialize DB pools and SDK clients outside the handler + +### Operations + +- Set CloudWatch alarms on throttle rate > 1% and CPU > 80% +- Plan for 14-day instance rotation (automatic) +- Never manually terminate LMI EC2 instances (delete the capacity provider instead) +- Always publish a version — unpublished functions cannot run on LMI + +## Limits Quick Reference + +| Resource | Limit | +|----------|-------| +| Memory | 2 GB min, 32 GB max | +| Instances | 3 minimum (AZ resiliency) | +| Instance lifespan | 14 days (auto-replaced) | +| Concurrency/vCPU | 64 (Node.js), 32 (Java/.NET), 16 (Python) | +| Runtimes | Node.js, Java, .NET, Python | +| Instance families | C, M, R (.large and up) | +| Scaling | Absorbs 50% spike; doubles within 5 min | + +## Resources + +- [Lambda Managed Instances Docs](https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html) +- [Introducing LMI (AWS Blog)](https://aws.amazon.com/blogs/aws/introducing-aws-lambda-managed-instances-serverless-simplicity-with-ec2-flexibility/) +- [Build High-Performance Apps with LMI](https://aws.amazon.com/blogs/compute/build-high-performance-apps-with-aws-lambda-managed-instances/) +- [Migrating Functions to LMI](https://aws.amazon.com/blogs/compute/migrating-your-functions-to-aws-lambda-managed-instances/) +- [LMI Pricing Calculator](https://aws-samples.github.io/sample-aws-lambda-managed-instances/) +- [LMI Samples Repository](https://github.com/aws-samples/sample-aws-lambda-managed-instances) +- [AWS Lambda Pricing](https://aws.amazon.com/lambda/pricing/) diff --git a/aws-lambda-managed-instances/steering/configuration-guide.md b/aws-lambda-managed-instances/steering/configuration-guide.md new file mode 100644 index 0000000..89717db --- /dev/null +++ b/aws-lambda-managed-instances/steering/configuration-guide.md @@ -0,0 +1,69 @@ +# LMI Configuration Guide + +## Instance Type Decision Tree + +- **CPU-intensive** (encoding, ML, compression) → C-series, 2:1 ratio, concurrency=1/vCPU +- **Memory-intensive** (caching, large datasets) → R-series, 8:1 ratio +- **Network-intensive** (streaming, data transfer) → Use AllowedInstanceTypes for n-suffix types, 4:1 ratio +- **General/balanced** (web APIs, microservices) → M-series, 4:1 ratio, default concurrency + +Architecture: ARM (Graviton, g-suffix) for price-performance. x86 (i=Intel, a=AMD) when dependencies require it. + +## Memory-to-vCPU Ratios + +| Ratio | Profile | When to use | Memory examples | +|-------|---------|-------------|-----------------| +| 2:1 | Compute | CPU-bound work | 2GB/1vCPU, 4GB/2vCPU | +| 4:1 | General | Most workloads (default) | 4GB/1vCPU, 8GB/2vCPU | +| 8:1 | Memory | Caching, data, Python apps | 8GB/1vCPU, 16GB/2vCPU | + +Min: 2 GB / 1 vCPU. Max: 32 GB. Memory must align with ratio multiples. 
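+
+To make the ratio arithmetic concrete, here is a quick sanity check (a sketch only; `vcpus_for` is a hypothetical helper, not an LMI API):
+
+```python
+def vcpus_for(memory_gb: int, ratio: int) -> int:
+    """vCPU count implied by an LMI memory setting at a 2:1, 4:1, or 8:1 ratio."""
+    if ratio not in (2, 4, 8):
+        raise ValueError("ratio must be 2, 4, or 8 (GB of memory per vCPU)")
+    if not 2 <= memory_gb <= 32:
+        raise ValueError("LMI memory must be between 2 GB and 32 GB")
+    if memory_gb % ratio:
+        raise ValueError(f"{memory_gb} GB is not a multiple of the {ratio}:1 ratio")
+    return memory_gb // ratio
+
+print(vcpus_for(8, 4))   # 8 GB at the default 4:1 ratio -> 2 vCPUs
+print(vcpus_for(16, 8))  # 16 GB at 8:1 -> 2 vCPUs
+```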
+
+## Memory Sizing from Existing Lambda
+
+| Current Lambda | LMI memory | Ratio | Rationale |
+|---------------|------------|-------|-----------|
+| 128-512 MB | 2048 MB | 4:1 | LMI minimum; multi-concurrency shares memory |
+| 512 MB-1 GB | 2048 MB | 4:1 | Room for concurrent requests |
+| 1-2 GB | 4096 MB | 4:1 | Standard upgrade path |
+| 2-4 GB | 4096-8192 MB | 4:1 or 8:1 | Depends on memory vs CPU bottleneck |
+| 4-10 GB | 8192-16384 MB | 8:1 | Likely memory-heavy workload |
+
+## Concurrency Tuning
+
+| Runtime | Default/vCPU | I/O-bound | CPU-bound |
+|---------|-------------|-----------|-----------|
+| Node.js | 64 | Keep or increase | 1 per vCPU |
+| Java | 32 | Keep | 1 per vCPU |
+| .NET | 32 | Keep | 1 per vCPU |
+| Python | 16 | Keep | 1 per vCPU |
+
+Total concurrent capacity = MinExecutionEnvironments × PerExecutionEnvironmentMaxConcurrency
+
+## Capacity Provider Scaling Controls
+
+| Control | Default | Guidance |
+|---------|---------|----------|
+| MinExecutionEnvironments | 3 | Increase for baseline capacity; never below 3 |
+| MaxExecutionEnvironments | — | Set based on cost budget |
+| MaxVCpuCount | Required | Start at 30, adjust by load |
+| TargetResourceUtilization | ~50% headroom | Raise for cost savings (less burst tolerance) |
+| AllowedInstanceTypes | All | Restrict only for specific hardware needs |
+| ExcludedInstanceTypes | None | Exclude expensive types in dev/test |
+
+## Monitoring Thresholds
+
+- **CPU > 80%**: reduce concurrency or add vCPUs
+- **CPU < 20%**: increase concurrency for better utilization
+- **Throttle rate (429s) > 1%**: increase MinExecutionEnvironments or reduce utilization target
+- **Memory > 90%**: increase memory or reduce concurrency
+- **ExecutionEnvironmentConcurrency near limit**: saturation — reduce concurrency or scale out
+
+## CloudWatch Metrics Dimensions
+
+LMI metrics are split across two CloudWatch dimensions:
+
+- **Alias (live)**: Invocations, Errors, Throttles, Duration
+- **Version ($LATEST or numbered)**: CPUUtilization, MemoryUtilization, ExecutionEnvironmentConcurrency, ExecutionEnvironmentCount
+
+Create a unified dashboard combining both views to monitor LMI performance effectively.
diff --git a/aws-lambda-managed-instances/steering/cost-comparison.md b/aws-lambda-managed-instances/steering/cost-comparison.md
new file mode 100644
index 0000000..d57c903
--- /dev/null
+++ b/aws-lambda-managed-instances/steering/cost-comparison.md
@@ -0,0 +1,17 @@
+# Lambda vs LMI Cost Comparison
+
+Use the [LMI Pricing Calculator](https://aws-samples.github.io/sample-aws-lambda-managed-instances/) for accurate, up-to-date cost comparisons based on your specific workload parameters (region, instance type, request volume, duration).
+
+When building a cost comparison for a user, gather: region, runtime, requests/month, average duration, memory, and architecture (x86 vs ARM). Plug these into the calculator rather than relying on hardcoded estimates.
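+
+For a first-pass sense of scale before opening the calculator, the standard-Lambda side of the comparison is roughly requests × request rate + GB-seconds × duration rate. A minimal sketch, assuming illustrative x86 on-demand rates (verify current figures on the Lambda pricing page; the helper below is hypothetical):
+
+```python
+# Illustrative rates only; confirm current pricing for your region.
+REQUEST_RATE = 0.20 / 1_000_000  # USD per request (example x86 rate)
+GB_SECOND_RATE = 0.0000166667    # USD per GB-second (example x86 rate)
+
+def lambda_on_demand_monthly(requests: int, avg_ms: float, memory_gb: float) -> float:
+    """Rough standard-Lambda monthly cost from the inputs gathered above."""
+    gb_seconds = requests * (avg_ms / 1000) * memory_gb
+    return requests * REQUEST_RATE + gb_seconds * GB_SECOND_RATE
+
+# 100M requests/month, 120 ms average, 1 GB memory -> roughly $220/month
+print(f"${lambda_on_demand_monthly(100_000_000, 120, 1):,.2f}")
+```
+
+This ignores the free tier and any Savings Plans; it is only a baseline to weigh against the LMI fleet cost the calculator reports.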
+ +## When LMI is NOT Cheaper + +- < 50M req/month (fixed 3-instance cost exceeds Lambda) +- Very short functions (< 100ms duration) +- Highly bursty, unpredictable traffic +- Workloads needing scale-to-zero + +## Tools + +- [LMI Pricing Calculator](https://aws-samples.github.io/sample-aws-lambda-managed-instances/) — interactive comparison tool +- [AWS Pricing Calculator](https://calculator.aws/) — general AWS cost estimation diff --git a/aws-lambda-managed-instances/steering/infrastructure-setup.md b/aws-lambda-managed-instances/steering/infrastructure-setup.md new file mode 100644 index 0000000..7ba80f9 --- /dev/null +++ b/aws-lambda-managed-instances/steering/infrastructure-setup.md @@ -0,0 +1,226 @@ +# LMI Infrastructure Setup + +## IAM Roles (Two Required) + +### 1. Execution Role (for the function) + +Trust policy: + +```json +{ + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Principal": {"Service": "lambda.amazonaws.com"}, + "Action": "sts:AssumeRole" + }] +} +``` + +Minimum permissions: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "logs:CreateLogGroup", + "logs:CreateLogStream", + "logs:PutLogEvents" + ], + "Resource": "arn:aws:logs:*:*:log-group:/aws/lambda/*" + } + ] +} +``` + +Add VPC permissions only if the function accesses VPC resources: + +```json +{ + "Effect": "Allow", + "Action": [ + "ec2:CreateNetworkInterface", + "ec2:DescribeNetworkInterfaces", + "ec2:DeleteNetworkInterface" + ], + "Resource": "*" +} +``` + +### 2. Operator Role (for capacity provider EC2 management) + +Trust policy: + +```json +{ + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Principal": {"Service": "lambda.amazonaws.com"}, + "Action": "sts:AssumeRole" + }] +} +``` + +Minimum permissions (scoped with conditions): + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": ["ec2:RunInstances", "ec2:CreateTags", "ec2:AttachNetworkInterface"], + "Resource": [ + "arn:aws:ec2:*:*:instance/*", + "arn:aws:ec2:*:*:network-interface/*", + "arn:aws:ec2:*:*:volume/*" + ], + "Condition": { + "StringEquals": { + "ec2:ManagedResourceOperator": "scaler.lambda.amazonaws.com" + } + } + }, + { + "Effect": "Allow", + "Action": [ + "ec2:DescribeAvailabilityZones", + "ec2:DescribeCapacityReservations", + "ec2:DescribeInstances", + "ec2:DescribeInstanceStatus", + "ec2:DescribeInstanceTypeOfferings", + "ec2:DescribeInstanceTypes", + "ec2:DescribeSecurityGroups", + "ec2:DescribeSubnets" + ], + "Resource": "*" + }, + { + "Effect": "Allow", + "Action": ["ec2:RunInstances", "ec2:CreateNetworkInterface"], + "Resource": [ + "arn:aws:ec2:*:*:subnet/*", + "arn:aws:ec2:*:*:security-group/*" + ] + }, + { + "Effect": "Allow", + "Action": "ec2:RunInstances", + "Resource": "arn:aws:ec2:*:*:image/*", + "Condition": { + "StringEquals": { "ec2:Owner": "amazon" } + } + }, + { + "Effect": "Allow", + "Action": "iam:PassRole", + "Resource": "" + } + ] +} +``` + +The `ec2:ManagedResourceOperator` condition ensures RunInstances/CreateTags only apply to Lambda-managed instances. + +## VPC Requirements + +LMI runs functions on EC2 instances inside the VPC. These instances need VPC endpoints or NAT to reach AWS services. 
+
+- 3+ subnets across different AZs (for default 3-instance fleet)
+- Security groups: HTTPS egress (port 443) for AWS API calls; no ingress needed
+- Required VPC endpoints:
+
+| Endpoint | Type | Purpose |
+|----------|------|---------|
+| S3 | Gateway | Object storage access |
+| DynamoDB | Gateway | Table access |
+| SQS | Interface | Queue operations |
+| CloudWatch Logs | Interface | Log delivery |
+| CloudWatch Monitoring | Interface | Metrics/EMF |
+| X-Ray | Interface | Distributed tracing |
+
+## CLI Workflow
+
+### Required Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `SUBNET_IDS` | Comma-separated subnet IDs across 3+ AZs |
+| `SECURITY_GROUP_ID` | Security group ID for the capacity provider |
+| `ACCOUNT_ID` | AWS account ID |
+| `AWS_REGION` | AWS region (used in the capacity provider ARN) |
+| `OPERATOR_ROLE_ARN` | ARN of the operator role |
+| `EXECUTION_ROLE_ARN` | ARN of the execution role |
+| `FUNCTION_NAME` | Name for the Lambda function |
+| `CP_NAME` | Name for the capacity provider |
+| `ARCHITECTURE` | `arm64` (Graviton) or `x86_64` |
+
+### Manual Steps
+
+```bash
+# 1. Create capacity provider
+aws lambda create-capacity-provider \
+  --capacity-provider-name $CP_NAME \
+  --vpc-config "SubnetIds=[$SUBNET_IDS],SecurityGroupIds=[$SECURITY_GROUP_ID]" \
+  --permissions-config "CapacityProviderOperatorRoleArn=$OPERATOR_ROLE_ARN" \
+  --instance-requirements "Architectures=[$ARCHITECTURE]" \
+  --capacity-provider-scaling-config "MaxVCpuCount=30"
+
+# 2. Create function
+aws lambda create-function --function-name $FUNCTION_NAME --runtime python3.13 \
+  --handler app.handler --zip-file fileb://function.zip \
+  --role $EXECUTION_ROLE_ARN --architectures $ARCHITECTURE \
+  --memory-size 4096 \
+  --capacity-provider-config \
+    "LambdaManagedInstancesCapacityProviderConfig={CapacityProviderArn=arn:aws:lambda:$AWS_REGION:$ACCOUNT_ID:capacity-provider:$CP_NAME}"
+
+# 3. Publish version (triggers provisioning — takes several minutes)
+aws lambda publish-version --function-name $FUNCTION_NAME
+
+# 4. Invoke (must use versioned ARN)
+aws lambda invoke --function-name $FUNCTION_NAME:1 --payload '{}' response.json
+```
+
+Architecture must match between function and capacity provider.
+
+## SAM Template
+
+```yaml
+Resources:
+  MyCP:
+    Type: AWS::Lambda::CapacityProvider
+    Properties:
+      CapacityProviderName: my-cp
+      VpcConfig:
+        SubnetIds: [!Ref Sub1, !Ref Sub2, !Ref Sub3]
+        SecurityGroupIds: [!Ref SG]
+      PermissionsConfig:
+        CapacityProviderOperatorRoleArn: !GetAtt OpRole.Arn
+      InstanceRequirements:
+        Architectures: [arm64]
+      CapacityProviderScalingConfig:
+        MaxVCpuCount: 30
+
+  MyFn:
+    Type: AWS::Serverless::Function
+    Properties:
+      Runtime: python3.13
+      Handler: app.handler
+      MemorySize: 4096
+      Architectures: [arm64]
+      CapacityProviderConfig:
+        LambdaManagedInstancesCapacityProviderConfig:
+          CapacityProviderArn: !GetAtt MyCP.Arn
+```
+
+## Cleanup
+
+```bash
+aws lambda delete-function --function-name my-fn
+aws lambda delete-capacity-provider --capacity-provider-name my-cp
+```
+
+Deleting the capacity provider destroys all associated EC2 instances.
diff --git a/aws-lambda-managed-instances/steering/migration-patterns.md b/aws-lambda-managed-instances/steering/migration-patterns.md
new file mode 100644
index 0000000..7898f03
--- /dev/null
+++ b/aws-lambda-managed-instances/steering/migration-patterns.md
@@ -0,0 +1,143 @@
+# LMI Migration Patterns
+
+Before/after code examples for migrating to multi-concurrency.
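+
+Before the per-runtime patterns, a quick local smoke test can surface shared-state and shared-`/tmp` bugs ahead of deployment. A rough sketch for a Python handler, mirroring LMI's process-based Python model (`app.handler` is a hypothetical module path; this is not an official harness):
+
+```python
+# Drive the handler from multiple processes and look for collisions or corrupt results.
+import uuid
+from concurrent.futures import ProcessPoolExecutor
+from types import SimpleNamespace
+
+from app import handler  # hypothetical handler module; adjust to your project
+
+def invoke(i: int):
+    ctx = SimpleNamespace(aws_request_id=str(uuid.uuid4()))
+    return handler({"key": f"k{i}"}, ctx)
+
+if __name__ == "__main__":
+    with ProcessPoolExecutor(max_workers=16) as pool:
+        results = list(pool.map(invoke, range(200)))
+    print(f"{len(results)} concurrent invocations completed")
+```
+
+For Node.js, Java, and .NET, drive the handler from multiple threads instead, since those runtimes share one process across invocations.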
+ +## Node.js + +### Global State + +```javascript +// BEFORE (race condition) +let requestCount = 0; +exports.handler = async (event) => { + requestCount++; + return { count: requestCount }; +}; + +// AFTER (request-isolated) +const { AsyncLocalStorage } = require('node:async_hooks'); +const als = new AsyncLocalStorage(); +exports.handler = async (event) => { + return als.run({ id: event.requestContext?.requestId }, async () => { + return await processEvent(event); + }); +}; +``` + +### File I/O + +```javascript +// BEFORE (shared path) +fs.writeFileSync('/tmp/output.json', JSON.stringify(data)); + +// AFTER (request-unique path) +const path = `/tmp/output-${event.requestContext?.requestId}.json`; +try { fs.writeFileSync(path, JSON.stringify(data)); } +finally { fs.unlinkSync(path); } +``` + +### Database + +```javascript +// BEFORE (per-invocation connection) +exports.handler = async (event) => { + const conn = await mysql.createConnection({/*...*/}); + const [rows] = await conn.execute('SELECT ...'); + await conn.end(); +}; + +// AFTER (shared pool) +const pool = mysql.createPool({ connectionLimit: 10, /*...*/ }); +exports.handler = async (event) => { + const [rows] = await pool.execute('SELECT ...'); + return rows; +}; +``` + +## Python + +Python on LMI uses **process-based isolation**. Each concurrent invocation runs in its own process with independent memory. Global state is NOT shared, so no locking is needed. The main migration concerns are `/tmp` conflicts, memory sizing, and connection pooling. + +### Global State (No Changes Needed) + +```python +# This is SAFE on LMI — each process has its own copy of cache +cache = {} +def handler(event, context): + cache[event['key']] = compute(event) + return cache[event['key']] + +# Module-level clients are also safe (isolated per process) +s3_client = boto3.client('s3') +dynamodb = boto3.resource('dynamodb') +``` + +### File I/O (Change Required — `/tmp` is shared across processes) + +```python +# BEFORE (conflict — all processes share /tmp) +with open('/tmp/data.json', 'w') as f: json.dump(event, f) + +# AFTER (request-unique path) +path = f'/tmp/data-{context.aws_request_id}.json' +try: + with open(path, 'w') as f: json.dump(event, f) +finally: + os.unlink(path) +``` + +### Database (Change Required — each process needs pooled connections) + +```python +# BEFORE (per-invocation connection — exhausts limits at concurrency) +def handler(event, context): + conn = psycopg2.connect(host='...') + +# AFTER (pool per process — initialized at module level) +from psycopg2 import pool +db_pool = pool.SimpleConnectionPool(1, 3, host=os.environ['DB_HOST']) +def handler(event, context): + conn = db_pool.getconn() + try: return query(conn, event) + finally: db_pool.putconn(conn) +# Note: total connections = pool_size × concurrency (e.g., 3 × 16 = 48) +``` + +### Memory Sizing + +```python +# A function using 200 MB per process with default concurrency of 16: +# Total memory ≈ 200 MB × 16 = 3.2 GB +# Use 4:1 or 8:1 memory-to-vCPU ratio to accommodate +# Monitor MemoryUtilization metric and adjust as needed +``` + +## Java + +### Global State + +```java +// BEFORE (race condition) +private static Map cache = new HashMap<>(); + +// AFTER (thread-safe) +private static final ConcurrentHashMap cache = new ConcurrentHashMap<>(); +// Use cache.computeIfAbsent(key, k -> compute(k)); +``` + +### Database + +```java +// BEFORE (per-invocation) +Connection conn = DriverManager.getConnection("jdbc:..."); + +// AFTER (HikariCP pool, static init) +private static 
final HikariDataSource ds;
+static {
+    HikariConfig c = new HikariConfig();
+    c.setJdbcUrl(System.getenv("DB_URL"));
+    c.setMaximumPoolSize(10);
+    ds = new HikariDataSource(c);
+}
+// Use: try (Connection conn = ds.getConnection()) { ... }
+```
diff --git a/aws-lambda-managed-instances/steering/thread-safety.md b/aws-lambda-managed-instances/steering/thread-safety.md
new file mode 100644
index 0000000..0eadb02
--- /dev/null
+++ b/aws-lambda-managed-instances/steering/thread-safety.md
@@ -0,0 +1,106 @@
+# Concurrency Safety for LMI
+
+LMI runs multiple invocations concurrently in the same execution environment. The concurrency model differs by runtime — some require thread safety, others provide process isolation.
+
+## Code Review Checklist
+
+When reviewing a function for LMI readiness, check each item:
+
+- [ ] No shared `/tmp` paths (use request ID in filenames, clean up after — shared across ALL runtimes)
+- [ ] Database connections use pools (initialized outside handler, not per-invocation)
+- [ ] SDK clients outside handler (module-level singletons are fine — they are thread-safe)
+- [ ] Logging includes request ID (for tracing concurrent requests)
+- [ ] **Node.js/Java/.NET only:** No global/static mutable variables (use immutable or request-local state)
+- [ ] **Node.js/Java/.NET only:** Thread-safe libraries only (check DB drivers, HTTP clients, caching libs)
+- [ ] **Node.js/Java/.NET only:** No request state in global scope (use AsyncLocalStorage, ThreadLocal, or AsyncLocal)
+- [ ] **Node.js/Java/.NET only:** No environment variable mutation during requests
+- [ ] **Python only:** Memory budget accounts for per-process multiplication (memory × concurrency)
+
+## Runtime-Specific Guidance
+
+### Python (Process-Based Isolation)
+
+Python uses **multiple independent processes**, each with its own interpreter and memory space. Global variables, module-level caches, and singleton objects are duplicated per process, not shared. If a function works on standard Lambda today, it works on LMI without code changes related to shared state.
+
+**Key concerns:**
+
+- Memory consumption: total footprint ≈ per-process memory × concurrency. A 200 MB function with 16 concurrent processes can consume 3+ GB.
+- `/tmp` filesystem is shared across all processes — use `context.aws_request_id` in filenames
+- Each process needs its own connection pool — size pools per-process, not globally
+- Prefer 4:1 or 8:1 memory-to-vCPU ratio to accommodate memory multiplication
+- Monitor `MemoryUtilization` metric and adjust ratio if needed
+
+**Safe patterns (no locking needed):**
+
+- Module-level mutable globals (isolated per process)
+- Module-level SDK clients and caches
+- `os.environ` reads
+
+### Node.js (Worker Threads + Async/Await)
+
+Uses worker threads combined with async/await event loops. The handler and global state are **shared across concurrent invocations within a worker thread**.
+
+The `await` keyword yields control to the event loop, which may execute another invocation that overwrites shared state before the first resumes.
+
+**Key concerns:**
+
+- Use `AsyncLocalStorage` from `node:async_hooks` for request context
+- Keep mutable state within handler local scope
+- Initialize SDK clients and DB pools at module level (they are thread-safe)
+- Avoid module-level mutable state (`let count = 0` is a race condition)
+- Callback-based handlers are NOT supported on Node.js 22 — use async handlers
+
+### Java (OS Threads)
+
+Uses OS-level threads.
Lambda loads the handler class once and invokes `handleRequest` from multiple threads simultaneously. + +**Key concerns:** + +- Use immutable objects and thread-safe collections (`ConcurrentHashMap`, `Collections.synchronizedList`) +- Initialize SDK clients and connection pools in constructor or static block +- Avoid mutable `static` fields +- Use `ThreadLocal` for request-specific state +- Use HikariCP or similar for connection pooling (AWS SDK for Java 2.x clients are thread-safe) + +### .NET (Task-Based Concurrency) + +Uses a single process with .NET Tasks (same model as ASP.NET Core). The handler object is shared across all Tasks. + +**Key concerns:** + +- Use `AsyncLocal` for request-scoped data +- Inject scoped services via DI container +- Initialize `HttpClient` and SDK clients as singletons +- Use `ConcurrentDictionary` and `SemaphoreSlim` for thread-safe access +- Invocation timeouts are NOT enforced by the runtime — use `ILambdaContext.RemainingTime` + +## Common Anti-Patterns + +| Anti-pattern | Affected Runtimes | Risk | Fix | +|-------------|-------------------|------|-----| +| New DB connection per invocation | All | Exhausts connection limits | Module-level connection pool | +| Hardcoded `/tmp` paths | All | File conflicts across processes | Use `aws_request_id` in path | +| Logging without request ID | All | Unreadable interleaved logs | Include `aws_request_id` | +| Mutable module-level state | Node.js, Java, .NET | Race condition / state corruption | Request-local scope or concurrent collections | +| Setting env vars during request | Node.js, Java, .NET | Race condition | Pass state via parameters | +| Assuming sequential execution | Node.js, Java, .NET | State corruption | Each invocation must be self-contained | +| Ignoring memory multiplication | Python | OOM at high concurrency | Account for per-process × concurrency | + +## Powertools for AWS Lambda Compatibility + +Powertools handles multi-concurrency transparently. No code changes needed. 
+
+| Runtime | Package | Minimum Version |
+|---------|---------|-----------------|
+| Python | Powertools for AWS Lambda (Python) | 3.23.0 |
+| TypeScript | Powertools for AWS Lambda (TypeScript) | 2.29.0 |
+| Java | Powertools for AWS Lambda (Java) | 2.8.0 |
+| .NET | Powertools for AWS Lambda (.NET) | 3.1.0 |
+
+AWS SDK and X-Ray minimum versions:
+
+| Runtime | AWS SDK minimum | X-Ray SDK minimum |
+|---------|----------------|-------------------|
+| Node.js | AWS SDK for JavaScript v3 (3.933.0) | 3.12.0 |
+| Java | AWS SDK for Java 2.x (2.34.0) | 2.20.0 |
+| .NET | AWSSDK.Core (4.0.0.32) | AWSXRayRecorder.Core (2.16.0) |
diff --git a/aws-lambda-managed-instances/steering/troubleshooting.md b/aws-lambda-managed-instances/steering/troubleshooting.md
new file mode 100644
index 0000000..45fbd45
--- /dev/null
+++ b/aws-lambda-managed-instances/steering/troubleshooting.md
@@ -0,0 +1,42 @@
+# LMI Troubleshooting
+
+## Common Issues
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| 429 throttles during scale-up | Traffic doubled faster than 5-min scaling window | Increase MinExecutionEnvironments or lower TargetResourceUtilization |
+| Function stuck in PENDING | Capacity provider provisioning instances | Wait several minutes; verify VPC subnets have IP capacity and IAM roles are correct |
+| Architecture mismatch error | Function architecture ≠ capacity provider | Align both to arm64 or x86_64 |
+| Cannot terminate EC2 instances | LMI instances managed by capacity provider | Delete capacity provider to destroy instances; cannot use EC2 console |
+| High CPU, low throughput | Concurrency too high for CPU-bound work | Reduce PerExecutionEnvironmentMaxConcurrency to 1/vCPU |
+| Race conditions in production | Code not thread-safe for multi-concurrency | Review with checklist in thread-safety.md |
+| Function version not ACTIVE | Fewer than 3 execution environments ready | Wait for provisioning; check capacity provider status |
+| Unexpected 500 errors | Unhandled concurrent access to shared state | Add thread-safe patterns from migration-patterns.md |
+| CloudWatch logs missing | VPC egress not configured | Add NAT Gateway or CloudWatch Logs VPC endpoint |
+| High costs despite low traffic | Minimum 3 instances always running | Evaluate if standard Lambda is more cost-effective |
+
+## Debugging Steps
+
+### Function Not Starting
+
+1. Check capacity provider status: `aws lambda get-capacity-provider --capacity-provider-name $CP_NAME`
+2. Verify subnets span 3+ AZs with available IPs
+3. Confirm security group allows necessary egress
+4. Check operator role has required permissions
+5. Look for `Operator` field in EC2 DescribeInstances or `aws:lambda:capacity-provider` tag
+
+### Performance Issues
+
+1. Check CloudWatch metrics (5-min intervals): CPU utilization, memory, concurrency/env
+2. If CPU > 80%: reduce concurrency or add vCPUs (increase memory with appropriate ratio); a boto3 alarm sketch appears at the end of this guide
+3. If throttles > 1%: increase MinExecutionEnvironments
+4. If CPU < 20%: increase concurrency — resources are underutilized
+5. For Python: verify 4:1 or 8:1 ratio (per-process memory multiplies with concurrency)
+
+### Cost Issues
+
+1. Verify instance count matches actual need (not over-provisioned)
+2. Check if Savings Plans or RIs are applied to these instances
+3. Compare actual costs against the LMI Pricing Calculator
+4. If traffic is lower than expected, consider reducing MaxVCpuCount
+5. For dev/test: use ExcludedInstanceTypes to avoid expensive instance families
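+
+## CPU Alarm Sketch
+
+To codify the CPU threshold from the performance steps above, an alarm can be created programmatically. A hedged boto3 sketch; the `AWS/Lambda` namespace and the `FunctionName`/`Version` dimension names are assumptions based on the metrics split described in configuration-guide.md, so verify exact names in the CloudWatch console:
+
+```python
+# Alarm when an LMI function version runs hot (CPU > 80% across three 5-minute periods).
+# Namespace and dimension names are assumptions; confirm them in CloudWatch first.
+import boto3
+
+cloudwatch = boto3.client("cloudwatch")
+cloudwatch.put_metric_alarm(
+    AlarmName="lmi-my-fn-cpu-high",
+    Namespace="AWS/Lambda",  # assumed namespace for LMI metrics
+    MetricName="CPUUtilization",
+    Dimensions=[
+        {"Name": "FunctionName", "Value": "my-fn"},  # hypothetical function name
+        {"Name": "Version", "Value": "1"},
+    ],
+    Statistic="Average",
+    Period=300,  # matches the 5-minute metric interval noted above
+    EvaluationPeriods=3,
+    Threshold=80.0,
+    ComparisonOperator="GreaterThanThreshold",
+    AlarmDescription="CPU > 80%: reduce concurrency or add vCPUs",
+)
+```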