
Extension post-runtime flush causes 429 TooManyRequestsException with reserved_concurrent_executions=1 on synchronous invocation #1175

@jean-alain-re

Environment
Extension version: latest
Runtime: Node.js
reserved_concurrent_executions: 1
Invocation type: synchronous (AWS Step Functions)

Description
When a Lambda function with reserved_concurrent_executions=1 returns its response, the Datadog extension enters the post-runtime flush phase to send telemetry data to Datadog. During this phase, the execution environment is still considered "busy" by AWS Lambda, consuming the single reserved concurrency slot.

If a second invocation arrives immediately after the function has returned its response (but while the extension is still flushing), AWS throttles it with a Lambda.TooManyRequestsException (HTTP 429).
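To make the timing concrete, here is a minimal sketch of the behavior described above. It is a simplified model, not Lambda's actual scheduler: it assumes a slot stays busy until the handler time plus the extension flush time has elapsed, and that a synchronous invoke finding no free slot is throttled immediately. The names `Invocation`, `simulate`, and all durations are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    arrival: float       # time the invoke request reaches Lambda (seconds)
    handler_time: float  # time until the function returns its response
    flush_time: float    # assumed post-runtime flush duration of the extension


def simulate(invocations, reserved_concurrency=1):
    """Return "ok" or "429" per invocation, in arrival order.

    Assumption: a slot is released only after handler_time + flush_time,
    and a request that finds every slot busy is throttled right away.
    """
    busy_until = []  # release time of each currently occupied slot
    results = []
    for inv in sorted(invocations, key=lambda i: i.arrival):
        # Drop slots whose environment finished flushing before this arrival.
        busy_until = [t for t in busy_until if t > inv.arrival]
        if len(busy_until) >= reserved_concurrency:
            results.append("429")
            continue
        busy_until.append(inv.arrival + inv.handler_time + inv.flush_time)
        results.append("ok")
    return results


# The caller receives the first response at t=0.20 and invokes again at t=0.21,
# while the (assumed) 50 ms flush still holds the only slot until t=0.25.
with_flush = simulate([Invocation(0.0, 0.2, 0.05), Invocation(0.21, 0.2, 0.05)])
without_flush = simulate([Invocation(0.0, 0.2, 0.0), Invocation(0.21, 0.2, 0.0)])
```

Under these assumptions the two calls never overlap from the caller's point of view, yet the second one is throttled only when the flush time is non-zero.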

Steps to reproduce
  1. Deploy a Lambda function with reserved_concurrent_executions=1 and the Datadog extension enabled.
  2. Invoke it synchronously from a Step Functions state machine that issues sequential invocations.
  3. When the second invocation is triggered right after the first one completes, a 429 is returned.

Expected behavior
The 429 should not occur between two sequential (non-concurrent) invocations.

Observed behavior
Lambda.TooManyRequestsException is raised on the second call because the extension's post-runtime flush keeps the execution environment busy beyond the function's response time.

Workaround

  1. Setting reserved_concurrent_executions=2 absorbs the overlap between the post-runtime flush of invocation N and the start of invocation N+1.
  2. Setting DD_SERVERLESS_FLUSH_STRATEGY=periodically defers the flush to a periodic interval. While this reduces the frequency of the issue, it is not acceptable in our case: we require complete telemetry coverage for every Lambda invocation. With a periodic strategy, telemetry from invocations that complete between two flush intervals may be dropped or delayed, making observability unreliable.
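As a rough illustration of why the first workaround helps, the sketch below estimates how many slots a strictly sequential caller occupies at peak. It rests on the same simplifying assumption as the rest of this report (a slot is held until the flush ends); the helper names and the durations are hypothetical, not measured values.

```python
def sequential_intervals(n, handler_time, flush_time, gap):
    """Busy intervals (start, release) for n back-to-back invocations.

    gap is the delay between receiving a response and sending the next invoke.
    """
    intervals, start = [], 0.0
    for _ in range(n):
        response = start + handler_time
        # Assumed: the slot is released only once the flush has finished.
        intervals.append((start, response + flush_time))
        start = response + gap
    return intervals


def peak_busy_slots(intervals):
    """Peak number of simultaneously busy slots across all intervals."""
    events = []
    for start, release in intervals:
        events.append((start, 1))
        events.append((release, -1))
    # Tuples sort releases (-1) before starts (+1) at the same instant.
    events.sort()
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak


# Next invoke arrives 10 ms after the response, flush assumed to take 50 ms.
tight = peak_busy_slots(sequential_intervals(5, 0.2, 0.05, 0.01))
# Next invoke arrives 100 ms after the response, i.e. after the flush has ended.
relaxed = peak_busy_slots(sequential_intervals(5, 0.2, 0.05, 0.1))
```

In this model the peak is 2 whenever the gap between invocations is shorter than the flush, which matches the observation that reserved_concurrent_executions=2 absorbs the overlap.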

Requested solution
We are looking for a solution that allows the extension to flush telemetry without blocking the concurrency slot, so that reserved_concurrent_executions=1 remains usable for sequential workloads. Ideally, the extension would either:

  1. Release the execution environment to AWS before completing its flush, or
  2. Expose a configuration option to cap the post-runtime flush duration to avoid holding the slot beyond an acceptable threshold, without dropping telemetry.
