Environment
Extension version: latest
Runtime: Node.js
reserved_concurrent_executions: 1
Invocation type: synchronous (AWS Step Functions)
Description
When a Lambda function with reserved_concurrent_executions=1 returns its response, the Datadog extension enters the post-runtime flush phase to send telemetry data to Datadog. During this phase, the execution environment is still considered "busy" by AWS Lambda, consuming the single reserved concurrency slot.
If a second invocation arrives immediately after the function has returned its response (but while the extension is still flushing), AWS throttles it with a Lambda.TooManyRequestsException (HTTP 429).
Steps to reproduce
1. Deploy a Lambda with reserved_concurrent_executions=1 and the Datadog extension enabled.
2. Invoke it synchronously from a Step Function, with strictly sequential (never concurrent) invocations.
3. When a second invocation is triggered right after the first one completes, a 429 is returned.
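The sequential invocation pattern can be sketched as a minimal Step Functions state machine. This is a hypothetical definition; the function name is a placeholder, and the real workflow may pass payloads between states:

```json
{
  "Comment": "Two back-to-back synchronous invocations of the same Lambda (sketch)",
  "StartAt": "InvokeFirst",
  "States": {
    "InvokeFirst": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "my-instrumented-fn" },
      "Next": "InvokeSecond"
    },
    "InvokeSecond": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "my-instrumented-fn" },
      "End": true
    }
  }
}
```

Even though InvokeSecond only starts after InvokeFirst's response is received, it can still hit the throttle if the single execution environment is held by the extension's post-runtime flush.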
Expected behavior
The 429 should not occur between two sequential (non-concurrent) invocations.
Observed behavior
Lambda.TooManyRequestsException is raised on the second call because the extension's post-runtime flush keeps the execution environment busy beyond the function's response time.
Workarounds
- Setting reserved_concurrent_executions=2 absorbs the overlap between the post-runtime flush of invocation N and the start of invocation N+1.
- Setting DD_SERVERLESS_FLUSH_STRATEGY=periodically defers the flush to a periodic interval. While this reduces the frequency of the issue, it is not acceptable in our case: we require complete telemetry coverage for every Lambda invocation. With a periodic strategy, invocations that complete between two flush intervals may have their telemetry dropped or delayed, making observability unreliable.
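For reference, both workarounds can be expressed in deployment configuration. The excerpt below is a hypothetical Serverless Framework fragment (function name and flush interval are placeholders; the periodically,&lt;interval-ms&gt; value format is assumed from Datadog's flush-strategy setting):

```yaml
# Hypothetical serverless.yml excerpt illustrating the two workarounds.
functions:
  worker:
    handler: src/handler.main
    # Workaround 1: allow one extra slot to absorb the post-runtime
    # flush of invocation N overlapping the start of invocation N+1.
    reservedConcurrency: 2
    environment:
      # Workaround 2: defer flushing to a periodic interval (value in ms).
      # Rejected in our case because per-invocation telemetry may be
      # delayed or dropped between intervals.
      DD_SERVERLESS_FLUSH_STRATEGY: "periodically,20000"
```

Neither option is satisfactory: the first doubles the reserved concurrency for a workload that is strictly sequential, and the second sacrifices per-invocation telemetry guarantees.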
Requested solution
We are looking for a solution that allows the extension to flush telemetry without blocking the concurrency slot, so that reserved_concurrent_executions=1 remains usable for sequential workloads. Ideally, the extension would either:
- Release the execution environment to AWS before completing its flush, or
- Expose a configuration option to cap the post-runtime flush duration, so the slot is not held beyond an acceptable threshold, without dropping telemetry.