From a427daf9454621df387ce0c1f1ec8cb91c374110 Mon Sep 17 00:00:00 2001 From: billwirz1 Date: Fri, 27 Mar 2026 12:20:40 +0000 Subject: [PATCH 1/7] docs: add-current-state-and-future-recs --- ...d-error-handling-future-recommendations.md | 115 ++++++++++++++++++ docs/developer-guides/sqs-current-state.md | 83 +++++++++++++ 2 files changed, 198 insertions(+) create mode 100644 docs/developer-guides/sqs-and-error-handling-future-recommendations.md create mode 100644 docs/developer-guides/sqs-current-state.md diff --git a/docs/developer-guides/sqs-and-error-handling-future-recommendations.md b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md new file mode 100644 index 00000000..bfc476bb --- /dev/null +++ b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md @@ -0,0 +1,115 @@ +# SQS & Error Handling — Future Recommendations + +## 1. Clean up unused queues and the duplicate `events` ESM + +`order-results`, `notifications`, and `events` are provisioned but carry no messages. Each still incurs CloudWatch alarms and ongoing KMS/encryption costs. + +- If `order-results` and `notifications` are intended for future use, add a `TODO` comment in `sqs.tf` naming the planned producer/consumer so intent is clear. +- If they have no near-term roadmap, remove them. +- The `events` queue creates a redundant event source mapping on `order-router-lambda` (which already has its own dedicated ESM to `order-placement`). Unless `events` is intended as a separate trigger path, remove the generic `aws_lambda_event_source_mapping.sqs` resource or decouple it from `order-router-lambda`. + +## 2. Fix the visibility timeout and add exponential backoff with jitter + +AWS recommends visibility timeout ≥ 6× Lambda timeout. `order-router-lambda` has a 60 s timeout; the queue has a 300 s visibility timeout (5×). Raise to at least 360 s (6×) so a message isn't re-queued while a slow invocation is still running. + +```hcl +visibility_timeout_seconds = 360 +``` + +Beyond the static timeout, SQS does not natively back off between retries — every re-delivery is immediate once the visibility window expires. For a retryable failure against a recovering supplier API, hammering every 360 seconds is unhelpful. The `order-router-lambda` handler can call `ChangeMessageVisibility` on a failing record before returning it in `batchItemFailures`, extending the timeout exponentially based on `attributes.ApproximateReceiveCount` (available on every `SQSRecord`). Adding jitter prevents a convoy of messages all retrying in sync after an outage clears. + +This is more targeted than simply raising `max_receive_count` (see item 7) — backoff controls _when_ retries happen; receive count controls _how many_. Both should be set together. + +## 3. Classify errors as retryable or non-retryable in `order-router-lambda` + +Currently all failures are treated as retryable — any exception pushes the message ID into `batchItemFailures`, which causes SQS to re-deliver. A 400 from the supplier API (malformed FHIR payload, invalid correlation ID, etc.) will never succeed on retry; retrying it wastes `max_receive_count` attempts and eventually sends a legitimate patient order to the DLQ unnecessarily. + +The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. 
It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost. + +A `isRetryable(error)` helper can classify based on HTTP status (4xx = non-retryable, 5xx/network = retryable) and known error types (schema/validation failures = non-retryable). + +## 4. Enable `bisect_batch_on_function_error` + +Not currently set on either ESM. When enabled, Lambda automatically halves a failing batch to isolate a poison-pill message more quickly, reducing wasted retry attempts before the bad message reaches the DLQ. Safe to enable alongside `ReportBatchItemFailures`. + +```hcl +bisect_batch_on_function_error = true +``` + +## 5. Address the QUEUED status / enqueue race condition in `order-service-lambda` + +If `sendMessage` succeeds but the subsequent `QUEUED` status write fails, the order is in the queue with no `QUEUED` DB status. `order-router-lambda` will process it and attempt to write `SUBMITTED`, skipping `QUEUED` entirely. + +**Option A (preferred):** Wrap the `sendMessage` and `addOrderStatusUpdate` calls in a single DB transaction with an outbox pattern — persist the message to an `outbox` table inside the transaction, then publish from the outbox. This eliminates the race entirely. + +**Option B (simpler):** Reverse the order — write `QUEUED` to DB first, then send to SQS. If SQS fails, the DB record stays as `QUEUED` and a separate reconciliation process (or manual intervention) can re-queue it. Easier to implement; less robust under sustained SQS unavailability. + +## 6. Integrated error handling: DLQ, persistent failure store, and triage + +The DLQ alone is insufficient for production. SQS has a 14-day retention limit, provides no random access, and gives no structured way to track which failed messages have been reviewed or reprocessed. Given these are patient orders, audit trail and targeted reprocessing matter. + +The right mental model is: **queues are excellent for retryable failures** (outage recovery, bad deploy — drain the DLQ and retry everything); **a database is the right home for audit, triage, and non-retryable failures**. They should be used together. + +Below are three progressively more capable approaches: + +### Option A — DLQ + manual triage (current state) + +- Inspect messages in the AWS Console or via CLI; redrive in bulk. +- Pro: no additional infrastructure. +- Con: no random access, 14-day retention, no structured audit trail, poor UX for per-message decisions. Does not handle non-retryable errors (item 3). + +### Option B — DLQ consumer Lambda writes to a `failed_messages` table + +- A Lambda consumes the DLQ and writes failed message payloads (with error context, receive count, timestamp) to a DB table. Reprocessing is a DB update + re-enqueue. +- Pro: indefinite retention, random access, queryable by order UID. Relatively simple to implement. +- Con: the DB table is a separate concern from the main order state; two places to check per failed order. Does not solve the enqueue race condition (item 5). + +### Option C — Outbox pattern as primary store (addresses items 5 and 6 together) + +- Messages are written to an `outbox` DB table _inside the originating transaction_ (alongside the order record). A separate process polls the outbox and enqueues to SQS. The DB row is the source of truth; the queue is the delivery mechanism. +- Failed/DLQ'd messages are always traceable to a DB row. 
Non-retryable failures update the row status rather than being discarded silently. Reprocessing is a targeted DB update. +- Pro: eliminates the enqueue/QUEUED race; indefinite retention; unified state model; supports both retryable and non-retryable failure paths cleanly. +- Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled Lambda or event bridge pipe reading the outbox table). + +Regardless of which option is chosen, a triage runbook must document: how to inspect failed messages, how to distinguish transient from permanent failures, how to redrive safely (throttled), and who is accountable given the patient data involved. + +For bulk redrives from the DLQ back to the source queue: + +```bash +aws sqs start-message-move-task \ + --source-arn \ + --destination-arn \ + --max-number-of-messages-per-second 1 +``` + +Limit the rate to avoid overwhelming a supplier or DB that has just recovered. + +## 7. Supplier idempotency for DLQ redrives + +If a message lands in the DLQ after a successful supplier submission (e.g. the `SUBMITTED` status write failed after a successful HTTP call), redriving it will re-submit to the supplier. Confirm with each supplier whether their endpoint is idempotent for the same `X-Correlation-ID`. If not, redrives must be handled manually per-message rather than as a bulk operation. + +## 8. Increase `max_receive_count` for sustained outages + +With exponential backoff in place (item 2), `max_receive_count` controls the total number of attempts rather than purely the retry window. The current value of 3 is low for production — a supplier outage lasting longer than a few retries will exhaust the count before the service recovers. Consider raising to 5–10, taking into account the maximum total retry window you want before a message is dead-lettered. + +## 9. Add concurrency limit to `order_router_order_placement` ESM + +The generic `events` ESM has `maximum_concurrency = 10`, but the dedicated `order_router_order_placement` ESM does not. Without a cap, a burst of messages could spin up Lambda concurrency proportional to queue depth, potentially exhausting the shared Aurora Serverless connection pool. Set an explicit limit appropriate for the DB connection limit. + +```hcl +scaling_config { + maximum_concurrency = 10 +} +``` + +## 10. Batch size optimisation + +`order-placement` uses batch size 1, which is safe and simple but limits throughput. The existing handler already processes records concurrently via `Promise.all`, so the code change to support a larger batch is minimal — the tuning is primarily infrastructure config. + +Key constraints to measure before increasing batch size: + +- **Aurora Serverless connection pool**: each concurrently processed record in a batch needs a DB connection; connection exhaustion will cause failures. +- **Supplier rate limits**: if a supplier enforces a request rate, concurrent processing of a larger batch could breach it. +- **Lambda memory**: each in-flight record within a batch consumes memory; profile at current memory_size (512 MB) before increasing. + +The `maximum_batching_window_in_seconds` (already set to 5 s on the `events` ESM) is the complementary lever — letting messages accumulate before Lambda is invoked, improving efficiency at lower throughput. Increasing batch size without load testing risks cascading failures; instrument duration, error rate, and DB connection utilisation first and increase incrementally. 
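To make items 2, 3, and 10 concrete, here is a minimal TypeScript sketch of how a batched `order-router-lambda` handler could combine concurrent record processing, error classification, and jittered backoff. The helpers (`processOrder`, `isRetryable`, `persistFailure`) and the queue-URL environment variable are illustrative placeholders rather than existing project code; the AWS SDK v3 client and the `aws-lambda` types are real packages.

```typescript
import { SQSClient, ChangeMessageVisibilityCommand } from "@aws-sdk/client-sqs";
import type { SQSEvent, SQSBatchResponse, SQSBatchItemFailure, SQSRecord } from "aws-lambda";

// Illustrative placeholders for existing and proposed functionality; not real project code.
declare function processOrder(messageBody: string): Promise<void>;
declare function isRetryable(error: unknown): boolean;
declare function persistFailure(record: SQSRecord, error: unknown): Promise<void>;

const sqs = new SQSClient({});
const QUEUE_URL = process.env.ORDER_PLACEMENT_QUEUE_URL ?? ""; // assumed env var

// Exponential backoff with full jitter, capped at the SQS maximum of 12 hours.
function backoffSeconds(receiveCount: number): number {
  const base = 360; // matches the proposed 6x visibility timeout
  const cap = 43_200;
  const ceiling = Math.min(cap, base * 2 ** Math.max(0, receiveCount - 1));
  return Math.floor(Math.random() * ceiling);
}

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchItemFailure[] = [];

  // Process the whole batch concurrently, as the current handler already does.
  const results = await Promise.allSettled(
    event.Records.map((record) => processOrder(record.body))
  );

  for (const [index, result] of results.entries()) {
    if (result.status === "fulfilled") continue;
    const record = event.Records[index];

    if (!isRetryable(result.reason)) {
      // Non-retryable: keep it out of batchItemFailures so Lambda deletes it,
      // but only after the failure has been persisted somewhere durable.
      await persistFailure(record, result.reason);
      continue;
    }

    // Retryable: delay the next delivery with jittered exponential backoff,
    // then report the record so SQS re-delivers it instead of deleting it.
    const receiveCount = Number(record.attributes.ApproximateReceiveCount);
    await sqs.send(
      new ChangeMessageVisibilityCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: record.receiptHandle,
        VisibilityTimeout: backoffSeconds(receiveCount),
      })
    );
    batchItemFailures.push({ itemIdentifier: record.messageId });
  }

  return { batchItemFailures };
};
```

`ChangeMessageVisibility` only affects the current in-flight delivery, so the backoff is recomputed on every failed attempt; the redrive policy's `max_receive_count` still decides when a message finally moves to the DLQ.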
diff --git a/docs/developer-guides/sqs-current-state.md b/docs/developer-guides/sqs-current-state.md new file mode 100644 index 00000000..66b269fe --- /dev/null +++ b/docs/developer-guides/sqs-current-state.md @@ -0,0 +1,83 @@ +# SQS Current State + +## Queues + +Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper. + +| Queue | Type | Status | +| ----------------- | -------- | ---------------------------------------------------------------------------------------------------- | +| `order-placement` | Standard | Active — written by `order-service-lambda`, consumed by `order-router-lambda` | +| `events` | Standard | Provisioned — `order-router-lambda` is wired as a consumer (see note below), nothing publishes to it | +| `order-results` | Standard | Provisioned — no producer, no consumer | +| `notifications` | FIFO | Provisioned — no producer, no consumer | + +**`events` queue note:** The `events` event source mapping wires any Lambda flagged `sqs_trigger = true` to this queue (batch size 10, 5 s batching window). Currently `order-router-lambda` is the only such Lambda, meaning it has two event source mappings: one to `order-placement` (its operational queue) and one to the unused `events` queue. + +## Active Message Flow + +```text +Patient → POST /order + → order-service-lambda + → DB: create patient/order/consent (transaction) + → SQS: enqueue to order-placement + → DB: set status = QUEUED + → return 201 + +SQS order-placement → order-router-lambda (batch size 1) + → validate & parse message body + → DB: load supplier config + → Secrets Manager: fetch supplier OAuth token + → HTTP: POST FHIR order to supplier + → DB: set status = SUBMITTED (best-effort, see error handling) +``` + +## Queue Configuration + +| Setting | order-placement | order-results | notifications | events | +| ------------------- | --------------- | ------------- | ------------- | --------- | +| Visibility timeout | 300 s | 300 s | 180 s | 300 s | +| Message retention | 14 days | 14 days | 4 days | 14 days | +| DLQ retention | 14 days | 14 days | 14 days | 14 days | +| max receive count | 3 | 3 | 3 | 3 | +| Long polling | 20 s | 20 s | 20 s | 20 s | +| Encryption | KMS (CMK) | KMS (CMK) | KMS (CMK) | KMS (CMK) | +| DLQ redrive enabled | Yes | Yes | Yes | Yes | + +## Event Source Mapping — order-placement → order-router-lambda + +| Setting | Value | +| ------------------------------------------------- | --------------------------------------------------------- | +| Batch size | 1 | +| Partial batch failure (`ReportBatchItemFailures`) | Enabled | +| `bisect_batch_on_function_error` | Not configured | +| Max concurrency | Not configured (only set on the generic `events` ESM: 10) | + +## Error Handling + +### order-service-lambda (producer) + +- The SDK client is configured with `maxAttempts: 3`; transient send failures are retried at the SDK level before the Lambda returns an error to the API caller. +- If `sendMessage` still fails, a structured `console.error` is logged and a `500` is returned to the caller — the order record exists in the DB but is never queued. +- If the subsequent `QUEUED` status write fails, a `500` is returned but the already-sent SQS message is not rolled back. The order is in the queue but its DB status may not reflect `QUEUED`. + +### order-router-lambda (consumer) + +- Every record is processed individually; failures push `{ itemIdentifier: record.messageId }` into `batchItemFailures`. 
Only failed records are retried; successful records in the same batch are deleted by Lambda. +- Every failure path throws, with one intentional exception: after a successful supplier HTTP call, a subsequent DB status update failure is caught, logged as `console.error`, and **not rethrown**. This is explicitly to prevent SQS from re-delivering the message and causing a duplicate order submission to the supplier. The order is considered submitted. +- All other failures (invalid message body, missing supplier config, OAuth failure, non-200/201 supplier response) propagate and result in the message being retried via SQS. + +### order-result-lambda + +`order-result-lambda` is **not an SQS consumer**. It is an HTTP handler (API Gateway POST `/result`) that receives FHIR Observation resources directly from suppliers. The `order-results` queue is provisioned but has no connection to this Lambda. Errors are handled by returning structured FHIR error responses (HTTP 4xx/5xx); there is no silent failure path. + +## Monitoring + +Three CloudWatch alarms are created per queue: + +| Alarm | Metric | Threshold | +| ------------------------ | ---------------------------------------- | --------------- | +| `{queue}-age-high` | `ApproximateAgeOfOldestMessage` | 600 s | +| `{queue}-depth-high` | `ApproximateNumberOfMessagesVisible` | 1,000 | +| `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) | + +All alarms notify the SNS alerts topic. OK transitions also notify. From 8d9613f057865f60dd95eb9b5bb97069691509ae Mon Sep 17 00:00:00 2001 From: billwirz1 Date: Fri, 27 Mar 2026 12:29:33 +0000 Subject: [PATCH 2/7] docs: clarify-bulk-reprocessing --- .../sqs-and-error-handling-future-recommendations.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/developer-guides/sqs-and-error-handling-future-recommendations.md b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md index bfc476bb..2b3e3d90 100644 --- a/docs/developer-guides/sqs-and-error-handling-future-recommendations.md +++ b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md @@ -69,7 +69,9 @@ Below are three progressively more capable approaches: - Messages are written to an `outbox` DB table _inside the originating transaction_ (alongside the order record). A separate process polls the outbox and enqueues to SQS. The DB row is the source of truth; the queue is the delivery mechanism. - Failed/DLQ'd messages are always traceable to a DB row. Non-retryable failures update the row status rather than being discarded silently. Reprocessing is a targeted DB update. - Pro: eliminates the enqueue/QUEUED race; indefinite retention; unified state model; supports both retryable and non-retryable failure paths cleanly. -- Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled Lambda or event bridge pipe reading the outbox table). +- Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled Lambda or EventBridge Pipes reading the outbox table). + +**Bulk reprocessing after an outage:** each outbox row carries a `status` field (e.g. `pending`, `enqueued`, `failed`, `non_retryable`). During an outage, affected rows accumulate in `failed`. To reprocess, run a targeted SQL update — filtered by time window, supplier, error type, or any combination — flipping those rows back to `pending`. The outbox poller then picks them up and re-enqueues automatically; no manual DLQ interaction is needed. 
This also makes it straightforward to exclude specific messages (poison pills, non-retryable failures) from a bulk retry by leaving their status unchanged. Regardless of which option is chosen, a triage runbook must document: how to inspect failed messages, how to distinguish transient from permanent failures, how to redrive safely (throttled), and who is accountable given the patient data involved. From 00922f9320c670547a799c7e672f23c3c5772932 Mon Sep 17 00:00:00 2001 From: billwirz1 Date: Tue, 31 Mar 2026 14:42:36 +0100 Subject: [PATCH 3/7] chore: clarify-transient-errors --- ...d-error-handling-future-recommendations.md | 24 +++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/docs/developer-guides/sqs-and-error-handling-future-recommendations.md b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md index 2b3e3d90..93162029 100644 --- a/docs/developer-guides/sqs-and-error-handling-future-recommendations.md +++ b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md @@ -20,13 +20,29 @@ Beyond the static timeout, SQS does not natively back off between retries — ev This is more targeted than simply raising `max_receive_count` (see item 7) — backoff controls _when_ retries happen; receive count controls _how many_. Both should be set together. -## 3. Classify errors as retryable or non-retryable in `order-router-lambda` +## 3. Classify errors as auto-retryable or not in `order-router-lambda` -Currently all failures are treated as retryable — any exception pushes the message ID into `batchItemFailures`, which causes SQS to re-deliver. A 400 from the supplier API (malformed FHIR payload, invalid correlation ID, etc.) will never succeed on retry; retrying it wastes `max_receive_count` attempts and eventually sends a legitimate patient order to the DLQ unnecessarily. +Currently all failures are treated as retryable — any exception pushes the message ID into `batchItemFailures`, causing SQS to re-deliver until `max_receive_count` is exhausted. This wastes retry attempts on failures that will never self-heal, and sends legitimate patient orders to the DLQ unnecessarily. -The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost. +**The right question at the SQS loop level is not "is this error permanent?" but "will waiting and retrying automatically fix this?"** That is a narrower, more answerable question. -A `isRetryable(error)` helper can classify based on HTTP status (4xx = non-retryable, 5xx/network = retryable) and known error types (schema/validation failures = non-retryable). +Errors that belong in the SQS retry loop (add to `batchItemFailures`, apply backoff): + +- 5xx from supplier — transient server error +- 429 from supplier — rate limited; back off and retry +- Network/timeout failures — transient infrastructure + +Errors that should leave the SQS loop (do **not** add to `batchItemFailures`): + +- 4xx from supplier (other than 429) — includes 400 (bad payload), 401 (auth config), 404 (wrong endpoint), etc. 
+- Schema/validation failures on the message body — payload is malformed and won't change on retry +- Supplier not found in DB — config issue + +This second group is not necessarily _permanently_ broken — a 401 might be fixed by rotating credentials, a 404 by correcting a config value, a 400 by identifying a mapping bug. But none of these fix themselves with time; they require a deliberate action before the message can succeed. **Crucially, if a DB persistence mechanism is in place (see item 6), every message in this group remains retryable** — the SQS loop is simply not the right place to hold them while waiting for a fix. + +Messages removed from the SQS loop must not be silently discarded. They must be persisted to a durable store (see item 6) and the order status updated to reflect the failure, so they can be reprocessed once the underlying issue is resolved. + +An `isTransient(error)` helper (note: _transient_, not _retryable_ — the distinction matters) can classify based on HTTP status and known error types. When in doubt, treat as transient and let the DLQ act as the final backstop. ## 4. Enable `bisect_batch_on_function_error` From 15107b4c00d1d5c1db565cae4d6b5645417fd396 Mon Sep 17 00:00:00 2001 From: billwirz1 Date: Tue, 7 Apr 2026 14:12:22 +0100 Subject: [PATCH 4/7] chore: clarify-recommendations-based-on-feedback --- ...d-error-handling-future-recommendations.md | 51 ++++++++++++++----- 1 file changed, 37 insertions(+), 14 deletions(-) diff --git a/docs/developer-guides/sqs-and-error-handling-future-recommendations.md b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md index 93162029..94d9c969 100644 --- a/docs/developer-guides/sqs-and-error-handling-future-recommendations.md +++ b/docs/developer-guides/sqs-and-error-handling-future-recommendations.md @@ -16,9 +16,15 @@ AWS recommends visibility timeout ≥ 6× Lambda timeout. `order-router-lambda` visibility_timeout_seconds = 360 ``` -Beyond the static timeout, SQS does not natively back off between retries — every re-delivery is immediate once the visibility window expires. For a retryable failure against a recovering supplier API, hammering every 360 seconds is unhelpful. The `order-router-lambda` handler can call `ChangeMessageVisibility` on a failing record before returning it in `batchItemFailures`, extending the timeout exponentially based on `attributes.ApproximateReceiveCount` (available on every `SQSRecord`). Adding jitter prevents a convoy of messages all retrying in sync after an outage clears. +Beyond the static timeout, SQS does not natively back off between retries — every re-delivery is immediate once the visibility window expires. For an auto-retryable failure against a recovering supplier API, hammering every 360 seconds is unhelpful. The `order-router-lambda` handler can call `ChangeMessageVisibility` on a failing record before returning it in `batchItemFailures`, extending the timeout exponentially based on `attributes.ApproximateReceiveCount` (available on every `SQSRecord`). Adding jitter prevents a convoy of messages all retrying in sync after an outage clears. -This is more targeted than simply raising `max_receive_count` (see item 7) — backoff controls _when_ retries happen; receive count controls _how many_. Both should be set together. +One viable operating model is: + +- Main queue retention: 2 days, so auto-retryable failures are either processed or surfaced to the DLQ within roughly 48 hours. 
+- Auto-retryable failures: exponential backoff with jitter, sized so retries are spread across that 48-hour window rather than clustered. +- DLQ retention: keep 14 days, since once a message is dead-lettered the priority shifts from retry pacing to operator inspection and recovery time. + +This is more targeted than simply raising `max_receive_count` (see item 8) — backoff controls _when_ retries happen; receive count controls _how many_. Both should be set together. ## 3. Classify errors as auto-retryable or not in `order-router-lambda` @@ -26,21 +32,27 @@ Currently all failures are treated as retryable — any exception pushes the mes **The right question at the SQS loop level is not "is this error permanent?" but "will waiting and retrying automatically fix this?"** That is a narrower, more answerable question. -Errors that belong in the SQS retry loop (add to `batchItemFailures`, apply backoff): +Errors that belong in the SQS retry loop (add to `batchItemFailures`, apply exponential backoff with jitter): - 5xx from supplier — transient server error - 429 from supplier — rate limited; back off and retry - Network/timeout failures — transient infrastructure +- Daily test quote for supplier exceeded (must clarify how exactly this is surfaced) -Errors that should leave the SQS loop (do **not** add to `batchItemFailures`): +Errors that are not good candidates for exponential backoff: - 4xx from supplier (other than 429) — includes 400 (bad payload), 401 (auth config), 404 (wrong endpoint), etc. - Schema/validation failures on the message body — payload is malformed and won't change on retry - Supplier not found in DB — config issue -This second group is not necessarily _permanently_ broken — a 401 might be fixed by rotating credentials, a 404 by correcting a config value, a 400 by identifying a mapping bug. But none of these fix themselves with time; they require a deliberate action before the message can succeed. **Crucially, if a DB persistence mechanism is in place (see item 6), every message in this group remains retryable** — the SQS loop is simply not the right place to hold them while waiting for a fix. +This second group is not necessarily _permanently_ broken — a 401 might be fixed by rotating credentials, a 404 by correcting a config value, a 400 by identifying a mapping bug. But none of these fix themselves with time; they require a deliberate action before the message can succeed. + +There are two credible ways to handle this group: -Messages removed from the SQS loop must not be silently discarded. They must be persisted to a durable store (see item 6) and the order status updated to reflect the failure, so they can be reprocessed once the underlying issue is resolved. +- Remove them from the SQS retry loop entirely once durable failure persistence exists (see item 6), because the queue is not the right place to wait for a manual fix. +- Keep them on the queue for a small fixed number of attempts while the system remains DLQ-centric. In that model, a pragmatic policy is to leave them on the default 360-second visibility timeout and cap them at 3 receives, while reserving exponential backoff for the genuinely transient group above. + +In either model, these messages must not be silently discarded. They need an operator-visible failure path, whether that is the DLQ alone or a separate persistence mechanism. An `isTransient(error)` helper (note: _transient_, not _retryable_ — the distinction matters) can classify based on HTTP status and known error types. 
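A minimal sketch of such a helper, assuming the supplier client surfaces HTTP failures as errors carrying a numeric `status` field and that payload validation throws a dedicated error class; both are assumptions for illustration, not existing project code.

```typescript
// Hypothetical error shapes; the real supplier client may expose the status differently.
class ValidationError extends Error {}

interface HttpError extends Error {
  status?: number; // HTTP status from the supplier response, if any
}

export function isTransient(error: unknown): boolean {
  // Malformed payloads and schema failures never heal on their own.
  if (error instanceof ValidationError) return false;

  const status = (error as HttpError)?.status;

  // No HTTP status at all: network, DNS, or timeout failure, so retry.
  if (status === undefined) return true;

  // 429 and 5xx are worth waiting for; any other 4xx needs a human or a config change.
  return status === 429 || status >= 500;
}
```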
When in doubt, treat as transient and let the DLQ act as the final backstop. @@ -56,9 +68,11 @@ bisect_batch_on_function_error = true If `sendMessage` succeeds but the subsequent `QUEUED` status write fails, the order is in the queue with no `QUEUED` DB status. `order-router-lambda` will process it and attempt to write `SUBMITTED`, skipping `QUEUED` entirely. -**Option A (preferred):** Wrap the `sendMessage` and `addOrderStatusUpdate` calls in a single DB transaction with an outbox pattern — persist the message to an `outbox` table inside the transaction, then publish from the outbox. This eliminates the race entirely. +**Option A:** Wrap the `sendMessage` and `addOrderStatusUpdate` calls in a single DB transaction with an outbox pattern — persist the message to an `outbox` table inside the transaction, then publish from the outbox. This eliminates the race entirely. + +**Option B:** Reverse the order — write `QUEUED` to DB first, then send to SQS. If SQS fails, the DB record stays as `QUEUED` and a separate reconciliation process (or manual intervention) can re-queue it. Easier to implement; less robust under sustained SQS unavailability. -**Option B (simpler):** Reverse the order — write `QUEUED` to DB first, then send to SQS. If SQS fails, the DB record stays as `QUEUED` and a separate reconciliation process (or manual intervention) can re-queue it. Easier to implement; less robust under sustained SQS unavailability. +**Option C:** Treat the queue write as the source-of-truth step for the user-facing flow. If `sendMessage` succeeds but the `QUEUED` status write fails, do not return an error to the user; log the failure for alerting and investigation, and let downstream processing continue from the queued message. This is operationally simple, but it assumes later status handling is tolerant of missing intermediate states. ## 6. Integrated error handling: DLQ, persistent failure store, and triage @@ -68,7 +82,7 @@ The right mental model is: **queues are excellent for retryable failures** (outa Below are three progressively more capable approaches: -### Option A — DLQ + manual triage (current state) +### Option A — DLQ + manual triage - Inspect messages in the AWS Console or via CLI; redrive in bulk. - Pro: no additional infrastructure. @@ -89,7 +103,7 @@ Below are three progressively more capable approaches: **Bulk reprocessing after an outage:** each outbox row carries a `status` field (e.g. `pending`, `enqueued`, `failed`, `non_retryable`). During an outage, affected rows accumulate in `failed`. To reprocess, run a targeted SQL update — filtered by time window, supplier, error type, or any combination — flipping those rows back to `pending`. The outbox poller then picks them up and re-enqueues automatically; no manual DLQ interaction is needed. This also makes it straightforward to exclude specific messages (poison pills, non-retryable failures) from a bulk retry by leaving their status unchanged. -Regardless of which option is chosen, a triage runbook must document: how to inspect failed messages, how to distinguish transient from permanent failures, how to redrive safely (throttled), and who is accountable given the patient data involved. +Regardless of which option is used, a triage runbook must document: how to inspect failed messages, how to distinguish transient from permanent failures, how to redrive safely (throttled), and who is accountable given the patient data involved. 
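If the outbox pattern (Option C) is adopted, the bulk reprocessing step described above reduces to one parameterised update. A sketch using `node-postgres`, where the `outbox` table and its `status`, `supplier_id`, `failed_at`, and `requeued_at` columns are an assumed schema rather than anything that exists today.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* env vars

// Flip failed outbox rows for one supplier and time window back to 'pending' so the
// outbox poller re-enqueues them; poison pills and non_retryable rows stay untouched.
export async function requeueFailedOrders(
  supplierId: string,
  from: Date,
  to: Date
): Promise<number> {
  const result = await pool.query(
    `UPDATE outbox
        SET status = 'pending', requeued_at = now()
      WHERE status = 'failed'
        AND supplier_id = $1
        AND failed_at BETWEEN $2 AND $3`,
    [supplierId, from, to]
  );
  return result.rowCount ?? 0; // rows handed back to the poller
}
```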
For bulk redrives from the DLQ back to the source queue: @@ -104,11 +118,20 @@ Limit the rate to avoid overwhelming a supplier or DB that has just recovered. ## 7. Supplier idempotency for DLQ redrives -If a message lands in the DLQ after a successful supplier submission (e.g. the `SUBMITTED` status write failed after a successful HTTP call), redriving it will re-submit to the supplier. Confirm with each supplier whether their endpoint is idempotent for the same `X-Correlation-ID`. If not, redrives must be handled manually per-message rather than as a bulk operation. +If a message lands in the DLQ after a successful supplier submission (e.g. the `SUBMITTED` status write failed after a successful HTTP call), redriving it will re-submit to the supplier. That makes supplier idempotency a prerequisite for safe bulk redrives. + +At present, this appears to be an open question rather than a confirmed property. The practical next step is to add it to a supplier question list and confirm, for each supplier, whether re-submitting the same order with the same `X-Correlation-ID` is treated as idempotent. If not, redrives must be handled manually per-message rather than as a bulk operation. + +## 8. Set `max_receive_count` to fit the retry window + +With exponential backoff in place (item 2), `max_receive_count` controls the total number of attempts rather than purely the retry window. If the main queue retention is reduced to 2 days, the more useful design target is not an arbitrary receive count but a retry schedule that spans roughly 48 hours before the message reaches the DLQ. + +That suggests two separate policies: -## 8. Increase `max_receive_count` for sustained outages +- Auto-retryable failures: set `max_receive_count` high enough that, together with exponential backoff and jitter, retries are spread across the full 48-hour main-queue window. +- Non-auto-retryable failures that remain on SQS for now: keep the default 3 receives and the baseline 360-second visibility timeout, so they surface quickly rather than consuming the whole retry window. -With exponential backoff in place (item 2), `max_receive_count` controls the total number of attempts rather than purely the retry window. The current value of 3 is low for production — a supplier outage lasting longer than a few retries will exhaust the count before the service recovers. Consider raising to 5–10, taking into account the maximum total retry window you want before a message is dead-lettered. +This keeps the long retry budget for failures likely to self-heal, while ensuring configuration, validation, and other operator-action errors reach the DLQ quickly for inspection. ## 9. Add concurrency limit to `order_router_order_placement` ESM @@ -130,4 +153,4 @@ Key constraints to measure before increasing batch size: - **Supplier rate limits**: if a supplier enforces a request rate, concurrent processing of a larger batch could breach it. - **Lambda memory**: each in-flight record within a batch consumes memory; profile at current memory_size (512 MB) before increasing. -The `maximum_batching_window_in_seconds` (already set to 5 s on the `events` ESM) is the complementary lever — letting messages accumulate before Lambda is invoked, improving efficiency at lower throughput. Increasing batch size without load testing risks cascading failures; instrument duration, error rate, and DB connection utilisation first and increase incrementally. 
+The `maximum_batching_window_in_seconds` (already set to 5 s on the `events` ESM) is the complementary lever — letting messages accumulate before Lambda is invoked, improving efficiency at lower throughput. The right next step here is measurement: add the metrics needed to understand duration, error rate, supplier behaviour, and DB connection utilisation, and run performance tests before changing batch size. Without that data, batch-size tuning is guesswork and risks cascading failures. From b12868f96ebc16e3e018a9451284860c89c5bbaa Mon Sep 17 00:00:00 2001 From: billwirz1 Date: Tue, 7 Apr 2026 14:26:34 +0100 Subject: [PATCH 5/7] feat: add-decisions-and-combined-doc --- .../sqs-and-error-handling-decisions.md | 63 ++++ docs/developer-guides/sqs-combined.md | 306 ++++++++++++++++++ 2 files changed, 369 insertions(+) create mode 100644 docs/developer-guides/sqs-and-error-handling-decisions.md create mode 100644 docs/developer-guides/sqs-combined.md diff --git a/docs/developer-guides/sqs-and-error-handling-decisions.md b/docs/developer-guides/sqs-and-error-handling-decisions.md new file mode 100644 index 00000000..a9dba926 --- /dev/null +++ b/docs/developer-guides/sqs-and-error-handling-decisions.md @@ -0,0 +1,63 @@ +# SQS & Error Handling — Agreed Decisions + +This document records the agreed decisions for points 1 to 10 from the SQS and error-handling review, separate from the options paper in `sqs-and-error-handling-future-recommendations.md`. + +## 1. Clean up unused queues and the duplicate `events` ESM + +Decision: implement the fix as stated. + +Justification: the unused queues are creating avoidable cost and operational noise, and the duplicate `events` event source mapping is redundant unless it is intentionally serving a separate trigger path. + +## 2. Fix the visibility timeout and add exponential backoff with jitter + +Decision: set the main queue visibility timeout to 360 seconds, reduce main queue retention to 2 days, retain DLQ retention at 14 days, and implement exponential backoff with jitter for auto-retryable failures so retries are spread across a maximum 48-hour window. + +Justification: AWS recommends visibility timeout at least 6 times the Lambda timeout, and the 48-hour retry window aligns with supplier daily processing caps while still surfacing stuck messages promptly to the DLQ. + +## 3. Classify errors as auto-retryable or not in `order-router-lambda` + +Decision: apply exponential backoff with jitter only to auto-retryable failures. For non-auto-retryable failures, keep the simpler policy of 3 attempts using the default 360-second visibility timeout. + +Justification: transient failures benefit from a longer retry window, while validation, configuration, and similar operator-action failures should surface to the DLQ quickly rather than consuming that entire window. + +## 4. Enable `bisect_batch_on_function_error` + +Decision: implement the fix as stated. + +Justification: this isolates poison-pill messages more quickly and reduces wasted retry attempts before a bad message reaches the DLQ. + +## 5. Address the `QUEUED` status / enqueue race condition in `order-service-lambda` + +Decision: use the logging-based approach described as Option C. If `sendMessage` succeeds but the `QUEUED` status write fails, do not return an error to the user; log the failure and allow downstream processing to continue. + +Justification: once the message is on the queue, the order will continue to be processed. 
Logging the failure is the simplest operational response for now, and the solution architect explicitly preferred this until there is better evidence on frequency and impact. + +## 6. Integrated error handling: DLQ, persistent failure store, and triage + +Decision: use Option A for now: DLQ plus manual triage. + +Justification: this is the agreed immediate operating model. The more capable persistence options remain available for later if the current approach proves insufficient. + +## 7. Supplier idempotency for DLQ redrives + +Decision: do not assume supplier idempotency. Record this as an open supplier question and confirm, for each supplier, whether re-submitting the same order with the same `X-Correlation-ID` is idempotent before relying on bulk redrives. + +Justification: if a supplier submission succeeded before the message reached the DLQ, redriving could create a duplicate order unless the supplier endpoint is idempotent. + +## 8. Set `max_receive_count` to fit the retry window + +Decision: configure `max_receive_count` to support the 48-hour retry window for auto-retryable failures, rather than choosing an arbitrary fixed number. Keep non-auto-retryable failures at 3 receives. + +Justification: the design target is a two-day retry period for transient failures, not a specific receive count. For non-auto-retryable failures, quick DLQ visibility is more useful than a prolonged retry cycle. + +## 9. Add concurrency limit to `order_router_order_placement` ESM + +Decision: implement the fix as stated. + +Justification: an explicit concurrency cap reduces the risk of a queue burst exhausting shared Aurora Serverless database connections. + +## 10. Batch size optimisation + +Decision: do not change batch size yet. First add the required metrics and, if needed, performance testing to understand duration, error rate, supplier behaviour, and database connection usage. + +Justification: batch-size tuning without production-relevant metrics is guesswork and risks introducing cascading failures. diff --git a/docs/developer-guides/sqs-combined.md b/docs/developer-guides/sqs-combined.md new file mode 100644 index 00000000..36574378 --- /dev/null +++ b/docs/developer-guides/sqs-combined.md @@ -0,0 +1,306 @@ +# SQS and Error Handling + +## Current State + +### Queues + +Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper. + +| Queue | Type | Status | +| ----------------- | -------- | ---------------------------------------------------------------------------------------------------- | +| `order-placement` | Standard | Active — written by `order-service-lambda`, consumed by `order-router-lambda` | +| `events` | Standard | Provisioned — `order-router-lambda` is wired as a consumer (see note below), nothing publishes to it | +| `order-results` | Standard | Provisioned — no producer, no consumer | +| `notifications` | FIFO | Provisioned — no producer, no consumer | + +**`events` queue note:** The `events` event source mapping wires any Lambda flagged `sqs_trigger = true` to this queue (batch size 10, 5 s batching window). Currently `order-router-lambda` is the only such Lambda, meaning it has two event source mappings: one to `order-placement` (its operational queue) and one to the unused `events` queue. 
+ +### Active Message Flow + +```text +Patient → POST /order + → order-service-lambda + → DB: create patient/order/consent (transaction) + → SQS: enqueue to order-placement + → DB: set status = QUEUED + → return 201 + +SQS order-placement → order-router-lambda (batch size 1) + → validate & parse message body + → DB: load supplier config + → Secrets Manager: fetch supplier OAuth token + → HTTP: POST FHIR order to supplier + → DB: set status = SUBMITTED (best-effort, see error handling) +``` + +### Queue Configuration + +| Setting | order-placement | order-results | notifications | events | +| ------------------- | --------------- | ------------- | ------------- | --------- | +| Visibility timeout | 300 s | 300 s | 180 s | 300 s | +| Message retention | 14 days | 14 days | 4 days | 14 days | +| DLQ retention | 14 days | 14 days | 14 days | 14 days | +| max receive count | 3 | 3 | 3 | 3 | +| Long polling | 20 s | 20 s | 20 s | 20 s | +| Encryption | KMS (CMK) | KMS (CMK) | KMS (CMK) | KMS (CMK) | +| DLQ redrive enabled | Yes | Yes | Yes | Yes | + +### Event Source Mapping — order-placement → order-router-lambda + +| Setting | Value | +| ------------------------------------------------- | --------------------------------------------------------- | +| Batch size | 1 | +| Partial batch failure (`ReportBatchItemFailures`) | Enabled | +| `bisect_batch_on_function_error` | Not configured | +| Max concurrency | Not configured (only set on the generic `events` ESM: 10) | + +### Error Handling + +#### order-service-lambda (producer) + +- The SDK client is configured with `maxAttempts: 3`; transient send failures are retried at the SDK level before the Lambda returns an error to the API caller. +- If `sendMessage` still fails, a structured `console.error` is logged and a `500` is returned to the caller — the order record exists in the DB but is never queued. +- If the subsequent `QUEUED` status write fails, a `500` is returned but the already-sent SQS message is not rolled back. The order is in the queue but its DB status may not reflect `QUEUED`. + +#### order-router-lambda (consumer) + +- Every record is processed individually; failures push `{ itemIdentifier: record.messageId }` into `batchItemFailures`. Only failed records are retried; successful records in the same batch are deleted by Lambda. +- Every failure path throws, with one intentional exception: after a successful supplier HTTP call, a subsequent DB status update failure is caught, logged as `console.error`, and **not rethrown**. This is explicitly to prevent SQS from re-delivering the message and causing a duplicate order submission to the supplier. The order is considered submitted. +- All other failures (invalid message body, missing supplier config, OAuth failure, non-200/201 supplier response) propagate and result in the message being retried via SQS. + +#### order-result-lambda + +`order-result-lambda` is **not an SQS consumer**. It is an HTTP handler (API Gateway POST `/result`) that receives FHIR Observation resources directly from suppliers. The `order-results` queue is provisioned but has no connection to this Lambda. Errors are handled by returning structured FHIR error responses (HTTP 4xx/5xx); there is no silent failure path. 
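The `order-router-lambda` behaviour described above, reduced to a minimal TypeScript sketch; `submitToSupplier` and `markSubmitted` stand in for the real validate, config lookup, OAuth, HTTP POST, and status-update steps and are not project code.

```typescript
import type { SQSEvent, SQSBatchResponse, SQSBatchItemFailure, SQSRecord } from "aws-lambda";

// Placeholders for the existing per-record steps; signatures are assumed.
declare function submitToSupplier(record: SQSRecord): Promise<void>;
declare function markSubmitted(record: SQSRecord): Promise<void>;

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchItemFailure[] = [];

  for (const record of event.Records) {
    try {
      await submitToSupplier(record); // validation, supplier config, OAuth, HTTP POST

      try {
        await markSubmitted(record); // DB status = SUBMITTED
      } catch (err) {
        // Intentionally swallowed: the supplier has already accepted the order, so
        // letting SQS re-deliver would risk a duplicate submission. Log and move on.
        console.error("status update failed after successful submission", err);
      }
    } catch (err) {
      // Any failure before or during submission is reported so SQS re-delivers it.
      console.error("record failed and will be retried", err);
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
};
```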
+ +### Monitoring + +Three CloudWatch alarms are created per queue: + +| Alarm | Metric | Threshold | +| ------------------------ | ---------------------------------------- | --------------- | +| `{queue}-age-high` | `ApproximateAgeOfOldestMessage` | 600 s | +| `{queue}-depth-high` | `ApproximateNumberOfMessagesVisible` | 1,000 | +| `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) | + +All alarms notify the SNS alerts topic. OK transitions also notify. + +## Future Recommendations + +### 1. Clean up unused queues and the duplicate `events` ESM + +`order-results`, `notifications`, and `events` are provisioned but carry no messages. Each still incurs CloudWatch alarms and ongoing KMS/encryption costs. + +- If `order-results` and `notifications` are intended for future use, add a `TODO` comment in `sqs.tf` naming the planned producer/consumer so intent is clear. +- If they have no near-term roadmap, remove them. +- The `events` queue creates a redundant event source mapping on `order-router-lambda` (which already has its own dedicated ESM to `order-placement`). Unless `events` is intended as a separate trigger path, remove the generic `aws_lambda_event_source_mapping.sqs` resource or decouple it from `order-router-lambda`. + +### 2. Fix the visibility timeout and add exponential backoff with jitter + +AWS recommends visibility timeout ≥ 6× Lambda timeout. `order-router-lambda` has a 60 s timeout; the queue has a 300 s visibility timeout (5×). Raise to at least 360 s (6×) so a message isn't re-queued while a slow invocation is still running. + +```hcl +visibility_timeout_seconds = 360 +``` + +Beyond the static timeout, SQS does not natively back off between retries — every re-delivery is immediate once the visibility window expires. For an auto-retryable failure against a recovering supplier API, hammering every 360 seconds is unhelpful. The `order-router-lambda` handler can call `ChangeMessageVisibility` on a failing record before returning it in `batchItemFailures`, extending the timeout exponentially based on `attributes.ApproximateReceiveCount` (available on every `SQSRecord`). Adding jitter prevents a convoy of messages all retrying in sync after an outage clears. + +One viable operating model is: + +- Main queue retention: 2 days, so auto-retryable failures are either processed or surfaced to the DLQ within roughly 48 hours. +- Auto-retryable failures: exponential backoff with jitter, sized so retries are spread across that 48-hour window rather than clustered. +- DLQ retention: keep 14 days, since once a message is dead-lettered the priority shifts from retry pacing to operator inspection and recovery time. + +This is more targeted than simply raising `max_receive_count` (see item 8) — backoff controls _when_ retries happen; receive count controls _how many_. Both should be set together. + +### 3. Classify errors as auto-retryable or not in `order-router-lambda` + +Currently all failures are treated as retryable — any exception pushes the message ID into `batchItemFailures`, causing SQS to re-deliver until `max_receive_count` is exhausted. This wastes retry attempts on failures that will never self-heal, and sends legitimate patient orders to the DLQ unnecessarily. + +**The right question at the SQS loop level is not "is this error permanent?" but "will waiting and retrying automatically fix this?"** That is a narrower, more answerable question. 
+ +Errors that belong in the SQS retry loop (add to `batchItemFailures`, apply exponential backoff with jitter): + +- 5xx from supplier — transient server error +- 429 from supplier — rate limited; back off and retry +- Network/timeout failures — transient infrastructure +- Daily test quote for supplier exceeded (must clarify how exactly this is surfaced) + +Errors that are not good candidates for exponential backoff: + +- 4xx from supplier (other than 429) — includes 400 (bad payload), 401 (auth config), 404 (wrong endpoint), etc. +- Schema/validation failures on the message body — payload is malformed and won't change on retry +- Supplier not found in DB — config issue + +This second group is not necessarily _permanently_ broken — a 401 might be fixed by rotating credentials, a 404 by correcting a config value, a 400 by identifying a mapping bug. But none of these fix themselves with time; they require a deliberate action before the message can succeed. + +There are two credible ways to handle this group: + +- Remove them from the SQS retry loop entirely once durable failure persistence exists (see item 6), because the queue is not the right place to wait for a manual fix. +- Keep them on the queue for a small fixed number of attempts while the system remains DLQ-centric. In that model, a pragmatic policy is to leave them on the default 360-second visibility timeout and cap them at 3 receives, while reserving exponential backoff for the genuinely transient group above. + +In either model, these messages must not be silently discarded. They need an operator-visible failure path, whether that is the DLQ alone or a separate persistence mechanism. + +An `isTransient(error)` helper (note: _transient_, not _retryable_ — the distinction matters) can classify based on HTTP status and known error types. When in doubt, treat as transient and let the DLQ act as the final backstop. + +### 4. Enable `bisect_batch_on_function_error` + +Not currently set on either ESM. When enabled, Lambda automatically halves a failing batch to isolate a poison-pill message more quickly, reducing wasted retry attempts before the bad message reaches the DLQ. Safe to enable alongside `ReportBatchItemFailures`. + +```hcl +bisect_batch_on_function_error = true +``` + +### 5. Address the QUEUED status / enqueue race condition in `order-service-lambda` + +If `sendMessage` succeeds but the subsequent `QUEUED` status write fails, the order is in the queue with no `QUEUED` DB status. `order-router-lambda` will process it and attempt to write `SUBMITTED`, skipping `QUEUED` entirely. + +**Option A:** Wrap the `sendMessage` and `addOrderStatusUpdate` calls in a single DB transaction with an outbox pattern — persist the message to an `outbox` table inside the transaction, then publish from the outbox. This eliminates the race entirely. + +**Option B:** Reverse the order — write `QUEUED` to DB first, then send to SQS. If SQS fails, the DB record stays as `QUEUED` and a separate reconciliation process (or manual intervention) can re-queue it. Easier to implement; less robust under sustained SQS unavailability. + +**Option C:** Treat the queue write as the source-of-truth step for the user-facing flow. If `sendMessage` succeeds but the `QUEUED` status write fails, do not return an error to the user; log the failure for alerting and investigation, and let downstream processing continue from the queued message. This is operationally simple, but it assumes later status handling is tolerant of missing intermediate states. + +### 6. 
Integrated error handling: DLQ, persistent failure store, and triage + +The DLQ alone is insufficient for production. SQS has a 14-day retention limit, provides no random access, and gives no structured way to track which failed messages have been reviewed or reprocessed. Given these are patient orders, audit trail and targeted reprocessing matter. + +The right mental model is: **queues are excellent for retryable failures** (outage recovery, bad deploy — drain the DLQ and retry everything); **a database is the right home for audit, triage, and non-retryable failures**. They should be used together. + +Below are three progressively more capable approaches: + +#### Option A — DLQ + manual triage + +- Inspect messages in the AWS Console or via CLI; redrive in bulk. +- Pro: no additional infrastructure. +- Con: no random access, 14-day retention, no structured audit trail, poor UX for per-message decisions. Does not handle non-retryable errors (item 3). + +#### Option B — DLQ consumer Lambda writes to a `failed_messages` table + +- A Lambda consumes the DLQ and writes failed message payloads (with error context, receive count, timestamp) to a DB table. Reprocessing is a DB update + re-enqueue. +- Pro: indefinite retention, random access, queryable by order UID. Relatively simple to implement. +- Con: the DB table is a separate concern from the main order state; two places to check per failed order. Does not solve the enqueue race condition (item 5). + +#### Option C — Outbox pattern as primary store (addresses items 5 and 6 together) + +- Messages are written to an `outbox` DB table _inside the originating transaction_ (alongside the order record). A separate process polls the outbox and enqueues to SQS. The DB row is the source of truth; the queue is the delivery mechanism. +- Failed/DLQ'd messages are always traceable to a DB row. Non-retryable failures update the row status rather than being discarded silently. Reprocessing is a targeted DB update. +- Pro: eliminates the enqueue/QUEUED race; indefinite retention; unified state model; supports both retryable and non-retryable failure paths cleanly. +- Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled Lambda or EventBridge Pipes reading the outbox table). + +**Bulk reprocessing after an outage:** each outbox row carries a `status` field (e.g. `pending`, `enqueued`, `failed`, `non_retryable`). During an outage, affected rows accumulate in `failed`. To reprocess, run a targeted SQL update — filtered by time window, supplier, error type, or any combination — flipping those rows back to `pending`. The outbox poller then picks them up and re-enqueues automatically; no manual DLQ interaction is needed. This also makes it straightforward to exclude specific messages (poison pills, non-retryable failures) from a bulk retry by leaving their status unchanged. + +Regardless of which option is used, a triage runbook must document: how to inspect failed messages, how to distinguish transient from permanent failures, how to redrive safely (throttled), and who is accountable given the patient data involved. + +For bulk redrives from the DLQ back to the source queue: + +```bash +aws sqs start-message-move-task \ + --source-arn \ + --destination-arn \ + --max-number-of-messages-per-second 1 +``` + +Limit the rate to avoid overwhelming a supplier or DB that has just recovered. + +### 7. Supplier idempotency for DLQ redrives + +If a message lands in the DLQ after a successful supplier submission (e.g. 
the `SUBMITTED` status write failed after a successful HTTP call), redriving it will re-submit to the supplier. That makes supplier idempotency a prerequisite for safe bulk redrives. + +At present, this appears to be an open question rather than a confirmed property. The practical next step is to add it to a supplier question list and confirm, for each supplier, whether re-submitting the same order with the same `X-Correlation-ID` is treated as idempotent. If not, redrives must be handled manually per-message rather than as a bulk operation. + +### 8. Set `max_receive_count` to fit the retry window + +With exponential backoff in place (item 2), `max_receive_count` controls the total number of attempts rather than purely the retry window. If the main queue retention is reduced to 2 days, the more useful design target is not an arbitrary receive count but a retry schedule that spans roughly 48 hours before the message reaches the DLQ. + +That suggests two separate policies: + +- Auto-retryable failures: set `max_receive_count` high enough that, together with exponential backoff and jitter, retries are spread across the full 48-hour main-queue window. +- Non-auto-retryable failures that remain on SQS for now: keep the default 3 receives and the baseline 360-second visibility timeout, so they surface quickly rather than consuming the whole retry window. + +This keeps the long retry budget for failures likely to self-heal, while ensuring configuration, validation, and other operator-action errors reach the DLQ quickly for inspection. + +### 9. Add concurrency limit to `order_router_order_placement` ESM + +The generic `events` ESM has `maximum_concurrency = 10`, but the dedicated `order_router_order_placement` ESM does not. Without a cap, a burst of messages could spin up Lambda concurrency proportional to queue depth, potentially exhausting the shared Aurora Serverless connection pool. Set an explicit limit appropriate for the DB connection limit. + +```hcl +scaling_config { + maximum_concurrency = 10 +} +``` + +### 10. Batch size optimisation + +`order-placement` uses batch size 1, which is safe and simple but limits throughput. The existing handler already processes records concurrently via `Promise.all`, so the code change to support a larger batch is minimal — the tuning is primarily infrastructure config. + +Key constraints to measure before increasing batch size: + +- **Aurora Serverless connection pool**: each concurrently processed record in a batch needs a DB connection; connection exhaustion will cause failures. +- **Supplier rate limits**: if a supplier enforces a request rate, concurrent processing of a larger batch could breach it. +- **Lambda memory**: each in-flight record within a batch consumes memory; profile at current memory_size (512 MB) before increasing. + +The `maximum_batching_window_in_seconds` (already set to 5 s on the `events` ESM) is the complementary lever — letting messages accumulate before Lambda is invoked, improving efficiency at lower throughput. The right next step here is measurement: add the metrics needed to understand duration, error rate, supplier behaviour, and DB connection utilisation, and run performance tests before changing batch size. Without that data, batch-size tuning is guesswork and risks cascading failures. + +## Agreed Decisions + +This document records the agreed decisions for points 1 to 10 from the SQS and error-handling review, separate from the options paper in `sqs-and-error-handling-future-recommendations.md`. + +### 1. 
Clean up unused queues and the duplicate `events` ESM + +Decision: implement the fix as stated. + +Justification: the unused queues are creating avoidable cost and operational noise, and the duplicate `events` event source mapping is redundant unless it is intentionally serving a separate trigger path. + +### 2. Fix the visibility timeout and add exponential backoff with jitter + +Decision: set the main queue visibility timeout to 360 seconds, reduce main queue retention to 2 days, retain DLQ retention at 14 days, and implement exponential backoff with jitter for auto-retryable failures so retries are spread across a maximum 48-hour window. + +Justification: AWS recommends visibility timeout at least 6 times the Lambda timeout, and the 48-hour retry window aligns with supplier daily processing caps while still surfacing stuck messages promptly to the DLQ. + +### 3. Classify errors as auto-retryable or not in `order-router-lambda` + +Decision: apply exponential backoff with jitter only to auto-retryable failures. For non-auto-retryable failures, keep the simpler policy of 3 attempts using the default 360-second visibility timeout. + +Justification: transient failures benefit from a longer retry window, while validation, configuration, and similar operator-action failures should surface to the DLQ quickly rather than consuming that entire window. + +### 4. Enable `bisect_batch_on_function_error` + +Decision: implement the fix as stated. + +Justification: this isolates poison-pill messages more quickly and reduces wasted retry attempts before a bad message reaches the DLQ. + +### 5. Address the `QUEUED` status / enqueue race condition in `order-service-lambda` + +Decision: use the logging-based approach described as Option C. If `sendMessage` succeeds but the `QUEUED` status write fails, do not return an error to the user; log the failure and allow downstream processing to continue. + +Justification: once the message is on the queue, the order will continue to be processed. Logging the failure is the simplest operational response for now, and the solution architect explicitly preferred this until there is better evidence on frequency and impact. + +### 6. Integrated error handling: DLQ, persistent failure store, and triage + +Decision: use Option A for now: DLQ plus manual triage. + +Justification: this is the agreed immediate operating model. The more capable persistence options remain available for later if the current approach proves insufficient. + +### 7. Supplier idempotency for DLQ redrives + +Decision: do not assume supplier idempotency. Record this as an open supplier question and confirm, for each supplier, whether re-submitting the same order with the same `X-Correlation-ID` is idempotent before relying on bulk redrives. + +Justification: if a supplier submission succeeded before the message reached the DLQ, redriving could create a duplicate order unless the supplier endpoint is idempotent. + +### 8. Set `max_receive_count` to fit the retry window + +Decision: configure `max_receive_count` to support the 48-hour retry window for auto-retryable failures, rather than choosing an arbitrary fixed number. Keep non-auto-retryable failures at 3 receives. + +Justification: the design target is a two-day retry period for transient failures, not a specific receive count. For non-auto-retryable failures, quick DLQ visibility is more useful than a prolonged retry cycle. + +### 9. Add concurrency limit to `order_router_order_placement` ESM + +Decision: implement the fix as stated. 
+ +Justification: an explicit concurrency cap reduces the risk of a queue burst exhausting shared Aurora Serverless database connections. + +### 10. Batch size optimisation + +Decision: do not change batch size yet. First add the required metrics and, if needed, performance testing to understand duration, error rate, supplier behaviour, and database connection usage. + +Justification: batch-size tuning without production-relevant metrics is guesswork and risks introducing cascading failures. From 69b3487e6842ab0c5bf0820868cd3b7bed4e9660 Mon Sep 17 00:00:00 2001 From: billwirz1 Date: Tue, 7 Apr 2026 14:37:32 +0100 Subject: [PATCH 6/7] chore: add-html-version-of-combined-doc --- .../sqs-combined-confluence.html | 817 ++++++++++++++++++ docs/developer-guides/sqs-combined.md | 4 + 2 files changed, 821 insertions(+) create mode 100644 docs/developer-guides/sqs-combined-confluence.html diff --git a/docs/developer-guides/sqs-combined-confluence.html b/docs/developer-guides/sqs-combined-confluence.html new file mode 100644 index 00000000..707314fe --- /dev/null +++ b/docs/developer-guides/sqs-combined-confluence.html @@ -0,0 +1,817 @@ + + + + + SQS and Error Handling + + + +

SQS and Error Handling

+ +

Overview

+ +

+ This page consolidates the current SQS and error-handling implementation, the future + recommendations discussed on this branch, and the agreed decisions taken so far. It is + intended to provide a single view of the present design, the available improvement options, + and the decisions that will be implemented or followed up. +

+ +

Current State

+ +

Queues

+ +

+ Four queues are provisioned in + hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf via the shared + infrastructure/modules/sqs wrapper. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Queue | Type | Status |
| --- | --- | --- |
| order-placement | Standard | Active — written by order-service-lambda, consumed by order-router-lambda |
| events | Standard | Provisioned — order-router-lambda is wired as a consumer (see note below), nothing publishes to it |
| order-results | Standard | Provisioned — no producer, no consumer |
| notifications | FIFO | Provisioned — no producer, no consumer |
+ +

+ events queue note: The events event source mapping + wires any Lambda flagged sqs_trigger = true to this queue (batch size 10, 5 s + batching window). Currently order-router-lambda is the only such Lambda, meaning + it has two event source mappings: one to order-placement (its operational queue) + and one to the unused events queue. +

+ +

Active Message Flow

+ +
+Patient → POST /order
+  → order-service-lambda
+      → DB: create patient/order/consent (transaction)
+      → SQS: enqueue to order-placement
+      → DB: set status = QUEUED
+      → return 201
+
+SQS order-placement → order-router-lambda (batch size 1)
+    → validate & parse message body
+    → DB: load supplier config
+    → Secrets Manager: fetch supplier OAuth token
+    → HTTP: POST FHIR order to supplier
+    → DB: set status = SUBMITTED (best-effort, see error handling)
+ +

Queue Configuration

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Setting | order-placement | order-results | notifications | events |
| --- | --- | --- | --- | --- |
| Visibility timeout | 300 s | 300 s | 180 s | 300 s |
| Message retention | 14 days | 14 days | 4 days | 14 days |
| DLQ retention | 14 days | 14 days | 14 days | 14 days |
| max receive count | 3 | 3 | 3 | 3 |
| Long polling | 20 s | 20 s | 20 s | 20 s |
| Encryption | KMS (CMK) | KMS (CMK) | KMS (CMK) | KMS (CMK) |
| DLQ redrive enabled | Yes | Yes | Yes | Yes |
+ +

Event Source Mapping — order-placement → order-router-lambda

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
| Setting | Value |
| --- | --- |
| Batch size | 1 |
| Partial batch failure (ReportBatchItemFailures) | Enabled |
| bisect_batch_on_function_error | Not configured |
| Max concurrency | Not configured (only set on the generic events ESM: 10) |
+ +

Error Handling

+ +

order-service-lambda (producer)

+ +
    +
  • + The SDK client is configured with maxAttempts: 3; transient send failures are + retried at the SDK level before the Lambda returns an error to the API caller. +
  • +
  • + If sendMessage still fails, a structured console.error is logged + and a 500 is returned to the caller — the order record exists in the DB but is + never queued. +
  • +
  • + If the subsequent QUEUED status write fails, a 500 is returned but + the already-sent SQS message is not rolled back. The order is in the queue but its DB status + may not reflect QUEUED. +
  • +
+ +

order-router-lambda (consumer)

+ +
    +
  • + Every record is processed individually; failures push + { itemIdentifier: record.messageId } into batchItemFailures. Only + failed records are retried; successful records in the same batch are deleted by Lambda. +
  • +
  • + Every failure path throws, with one intentional exception: after a successful supplier HTTP + call, a subsequent DB status update failure is caught, logged as console.error, + and not rethrown. This is explicitly to prevent SQS from re-delivering the + message and causing a duplicate order submission to the supplier. The order is considered + submitted. +
  • +
  • + All other failures (invalid message body, missing supplier config, OAuth failure, + non-200/201 supplier response) propagate and result in the message being retried via SQS. +
  • +
+ +

order-result-lambda

+ +

+ order-result-lambda is not an SQS consumer. It is an HTTP + handler (API Gateway POST /result) that receives FHIR Observation resources + directly from suppliers. The order-results queue is provisioned but has no + connection to this Lambda. Errors are handled by returning structured FHIR error responses + (HTTP 4xx/5xx); there is no silent failure path. +

+ +

Monitoring

+ +

Three CloudWatch alarms are created per queue:

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
| Alarm | Metric | Threshold |
| --- | --- | --- |
| {queue}-age-high | ApproximateAgeOfOldestMessage | 600 s |
| {queue}-depth-high | ApproximateNumberOfMessagesVisible | 1,000 |
| {queue-dlq}-depth-high | DLQ ApproximateNumberOfMessagesVisible | 0 (any message) |
+ +

All alarms notify the SNS alerts topic. OK transitions also notify.

+ +

Future Recommendations

+ +

1. Clean up unused queues and the duplicate events ESM

+ +

+ order-results, notifications, and events are + provisioned but carry no messages. Each still incurs CloudWatch alarms and ongoing + KMS/encryption costs. +

+ +
    +
  • + If order-results and notifications are intended for future use, + add a TODO comment in sqs.tf naming the planned producer/consumer + so intent is clear. +
  • +
  • If they have no near-term roadmap, remove them.
  • +
  • + The events queue creates a redundant event source mapping on + order-router-lambda (which already has its own dedicated ESM to + order-placement). Unless events is intended as a separate trigger + path, remove the generic aws_lambda_event_source_mapping.sqs resource or + decouple it from order-router-lambda. +
  • +
+ +

2. Fix the visibility timeout and add exponential backoff with jitter

+ +

+ AWS recommends visibility timeout ≥ 6× Lambda timeout. order-router-lambda has a + 60 s timeout; the queue has a 300 s visibility timeout (5×). Raise to at least 360 s (6×) so a + message is not re-queued while a slow invocation is still running. +

+ +
visibility_timeout_seconds = 360
+ +

+ Beyond the static timeout, SQS does not natively back off between retries — every re-delivery + is immediate once the visibility window expires. For an auto-retryable failure against a + recovering supplier API, hammering every 360 seconds is unhelpful. The + order-router-lambda handler can call ChangeMessageVisibility on a + failing record before returning it in batchItemFailures, extending the timeout + exponentially based on attributes.ApproximateReceiveCount (available on every + SQSRecord). Adding jitter prevents a convoy of messages all retrying in sync + after an outage clears. +
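A minimal sketch of how the handler could implement that backoff (the helper name, base delay, and cap below are illustrative assumptions, not existing code):

```typescript
import { SQSClient, ChangeMessageVisibilityCommand } from "@aws-sdk/client-sqs";
import type { SQSRecord } from "aws-lambda";

const sqs = new SQSClient({});

// Illustrative values only; the real base and cap would be tuned to the agreed retry window.
const BASE_SECONDS = 360;        // matches the queue's baseline visibility timeout
const CAP_SECONDS = 6 * 60 * 60; // SQS allows at most 12 hours; capping lower fits a 48 h window

// Exponential backoff with full jitter, derived from how many times SQS has delivered the record.
function backoffSeconds(record: SQSRecord): number {
  const receiveCount = Number(record.attributes.ApproximateReceiveCount || "1");
  const exponential = Math.min(CAP_SECONDS, BASE_SECONDS * 2 ** (receiveCount - 1));
  return Math.floor(Math.random() * exponential); // full jitter: anywhere in [0, exponential)
}

// Called for a failing record just before it is returned in batchItemFailures.
// queueUrl is assumed to come from configuration (e.g. an environment variable).
export async function extendVisibility(record: SQSRecord, queueUrl: string): Promise<void> {
  await sqs.send(
    new ChangeMessageVisibilityCommand({
      QueueUrl: queueUrl,
      ReceiptHandle: record.receiptHandle,
      VisibilityTimeout: backoffSeconds(record),
    })
  );
}
```

Full jitter means a retry can land anywhere up to the exponential ceiling, which is what breaks up the convoy of synchronised retries after an outage clears.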

+ +

One viable operating model is:

+ +
    +
  • + Main queue retention: 2 days, so auto-retryable failures are either processed or surfaced to + the DLQ within roughly 48 hours. +
  • +
  • + Auto-retryable failures: exponential backoff with jitter, sized so retries are spread across + that 48-hour window rather than clustered. +
  • +
  • + DLQ retention: keep 14 days, since once a message is dead-lettered the priority shifts from + retry pacing to operator inspection and recovery time. +
  • +
+ +

+ This is more targeted than simply raising max_receive_count (see item 8) — + backoff controls when retries happen; receive count controls how many. Both + should be set together. +

+ +

3. Classify errors as auto-retryable or not in order-router-lambda

+ +

+ Currently all failures are treated as retryable — any exception pushes the message ID into + batchItemFailures, causing SQS to re-deliver until + max_receive_count is exhausted. This wastes retry attempts on failures that will + never self-heal, and sends legitimate patient orders to the DLQ unnecessarily. +

+ +

+ The right question at the SQS loop level is not "is this error permanent?" but "will + waiting and retrying automatically fix this?" + That is a narrower, more answerable question. +

+ +

+ Errors that belong in the SQS retry loop (add to batchItemFailures, apply + exponential backoff with jitter): +

+ +
    +
  • 5xx from supplier — transient server error
  • +
  • 429 from supplier — rate limited; back off and retry
  • +
  • Network/timeout failures — transient infrastructure
  • +
  • Daily test quota for the supplier exceeded (exactly how this is surfaced still needs to be clarified)
  • +
+ +

Errors that are not good candidates for exponential backoff:

+ +
    +
  • + 4xx from supplier (other than 429) — includes 400 (bad payload), 401 (auth config), 404 + (wrong endpoint), etc. +
  • +
  • + Schema/validation failures on the message body — payload is malformed and won't change on + retry +
  • +
  • Supplier not found in DB — config issue
  • +
+ +

+ This second group is not necessarily permanently broken — a 401 might be fixed by + rotating credentials, a 404 by correcting a config value, a 400 by identifying a mapping bug. + But none of these fix themselves with time; they require a deliberate action before the + message can succeed. +

+ +

There are two credible ways to handle this group:

+ +
    +
  • + Remove them from the SQS retry loop entirely once durable failure persistence exists (see + item 6), because the queue is not the right place to wait for a manual fix. +
  • +
  • + Keep them on the queue for a small fixed number of attempts while the system remains + DLQ-centric. In that model, a pragmatic policy is to leave them on the default 360-second + visibility timeout and cap them at 3 receives, while reserving exponential backoff for the + genuinely transient group above. +
  • +
+ +

+ In either model, these messages must not be silently discarded. They need an operator-visible + failure path, whether that is the DLQ alone or a separate persistence mechanism. +

+ +

+ An isTransient(error) helper (note: transient, not retryable — + the distinction matters) can classify based on HTTP status and known error types. When in + doubt, treat as transient and let the DLQ act as the final backstop. +
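A sketch of what that helper might look like (the error shape and flags here are assumptions for illustration, not the handler's actual types):

```typescript
// Hypothetical error shape; the real handler's error types may differ.
interface SupplierCallError extends Error {
  statusCode?: number;          // present when the supplier responded
  isValidationError?: boolean;  // set by message-body schema validation
}

// Returns true when waiting and retrying is likely to fix the failure on its own.
export function isTransient(error: unknown): boolean {
  const err = error as SupplierCallError;

  // Schema/validation failures and missing config never self-heal.
  if (err?.isValidationError) return false;

  const status = err?.statusCode;
  if (status === undefined) return true; // network/timeout: assume transient
  if (status === 429) return true;       // rate limited: back off and retry
  if (status >= 500) return true;        // supplier server error
  if (status >= 400) return false;       // other 4xx: needs operator action

  return true; // when in doubt, retry and let the DLQ act as the backstop
}
```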

+ +

4. Enable bisect_batch_on_function_error

+ +

+ Not currently set on either ESM. When enabled, Lambda automatically halves a failing batch to + isolate a poison-pill message more quickly, reducing wasted retry attempts before the bad + message reaches the DLQ. Safe to enable alongside ReportBatchItemFailures. +

+ +
bisect_batch_on_function_error = true
+ +

+ 5. Address the QUEUED status / enqueue race condition in order-service-lambda +

+ +

+ If sendMessage succeeds but the subsequent QUEUED status write + fails, the order is in the queue with no QUEUED DB status. + order-router-lambda will process it and attempt to write SUBMITTED, + skipping QUEUED entirely. +

+ +

        Option A: Replace the direct sendMessage call with an outbox pattern: inside the same DB transaction that performs addOrderStatusUpdate, persist the outgoing message to an outbox table, then publish to SQS from the outbox afterwards. Because the status write and the message are committed atomically, this eliminates the race entirely.

+ +

+ Option B: Reverse the order — write QUEUED to DB first, then + send to SQS. If SQS fails, the DB record stays as QUEUED and a separate + reconciliation process (or manual intervention) can re-queue it. Easier to implement; less + robust under sustained SQS unavailability. +

+ +

+ Option C: Treat the queue write as the source-of-truth step for the + user-facing flow. If sendMessage succeeds but the QUEUED status + write fails, do not return an error to the user; log the failure for alerting and + investigation, and let downstream processing continue from the queued message. This is + operationally simple, but it assumes later status handling is tolerant of missing intermediate + states. +
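A sketch of the Option C flow; enqueueOrder and the injected collaborators are illustrative stand-ins, not the existing order-service-lambda code:

```typescript
type SendMessage = (orderId: string) => Promise<void>;
type AddOrderStatusUpdate = (orderId: string, status: "QUEUED") => Promise<void>;

export async function enqueueOrder(
  orderId: string,
  sendMessage: SendMessage,
  addOrderStatusUpdate: AddOrderStatusUpdate
): Promise<void> {
  // Enqueue first; a failure here still surfaces as an error to the API caller.
  await sendMessage(orderId);

  try {
    await addOrderStatusUpdate(orderId, "QUEUED");
  } catch (err) {
    // The message is already on the queue, so downstream processing will continue.
    // Log for alerting and investigation rather than failing the request.
    console.error(
      JSON.stringify({ msg: "QUEUED status write failed after enqueue", orderId, error: String(err) })
    );
  }
}
```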

+ +

6. Integrated error handling: DLQ, persistent failure store, and triage

+ +

+ The DLQ alone is insufficient for production. SQS has a 14-day retention limit, provides no + random access, and gives no structured way to track which failed messages have been reviewed + or reprocessed. Given these are patient orders, audit trail and targeted reprocessing matter. +

+ +

+ The right mental model is: + queues are excellent for retryable failures (outage recovery, bad deploy — + drain the DLQ and retry everything); + a database is the right home for audit, triage, and non-retryable failures. + They should be used together. +

+ +

Below are three progressively more capable approaches:

+ +

Option A — DLQ + manual triage

+ +
    +
  • Inspect messages in the AWS Console or via CLI; redrive in bulk.
  • +
  • Pro: no additional infrastructure.
  • +
  • + Con: no random access, 14-day retention, no structured audit trail, poor UX for per-message + decisions. Does not handle non-retryable errors (item 3). +
  • +
+ +

Option B — DLQ consumer Lambda writes to a failed_messages table

+ +
    +
  • + A Lambda consumes the DLQ and writes failed message payloads (with error context, receive + count, timestamp) to a DB table. Reprocessing is a DB update + re-enqueue. +
  • +
  • + Pro: indefinite retention, random access, queryable by order UID. Relatively simple to + implement. +
  • +
  • + Con: the DB table is a separate concern from the main order state; two places to check per + failed order. Does not solve the enqueue race condition (item 5). +
  • +
+ +

Option C — Outbox pattern as primary store (addresses items 5 and 6 together)

+ +
    +
  • + Messages are written to an outbox DB table + inside the originating transaction (alongside the order record). A separate process + polls the outbox and enqueues to SQS. The DB row is the source of truth; the queue is the + delivery mechanism. +
  • +
  • + Failed/DLQ'd messages are always traceable to a DB row. Non-retryable failures update the + row status rather than being discarded silently. Reprocessing is a targeted DB update. +
  • +
  • + Pro: eliminates the enqueue/QUEUED race; indefinite retention; unified state model; supports + both retryable and non-retryable failure paths cleanly. +
  • +
  • + Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled + Lambda or EventBridge Pipes reading the outbox table). +
  • +
+ +

+ Bulk reprocessing after an outage: each outbox row carries a + status field (e.g. pending, enqueued, + failed, non_retryable). During an outage, affected rows accumulate + in failed. To reprocess, run a targeted SQL update — filtered by time window, + supplier, error type, or any combination — flipping those rows back to pending. + The outbox poller then picks them up and re-enqueues automatically; no manual DLQ interaction + is needed. This also makes it straightforward to exclude specific messages (poison pills, + non-retryable failures) from a bulk retry by leaving their status unchanged. +
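Purely as an illustration (the outbox table and column names below are assumed, not an existing schema), the targeted update could be run from a small operational script:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings taken from the standard PG* environment variables

// Hypothetical schema: outbox(id, status, supplier_id, failed_at, ...).
// Flips failed rows for one supplier within a time window back to 'pending' so the
// poller re-enqueues them; poison pills stay 'non_retryable' and are left untouched.
export async function requeueFailedOutboxRows(
  supplierId: string,
  failedAfter: Date,
  failedBefore: Date
): Promise<number> {
  const result = await pool.query(
    `UPDATE outbox
        SET status = 'pending'
      WHERE status = 'failed'
        AND supplier_id = $1
        AND failed_at BETWEEN $2 AND $3`,
    [supplierId, failedAfter, failedBefore]
  );
  return result.rowCount ?? 0;
}
```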

+ +

+ Regardless of which option is used, a triage runbook must document: how to inspect failed + messages, how to distinguish transient from permanent failures, how to redrive safely + (throttled), and who is accountable given the patient data involved. +

+ +

For bulk redrives from the DLQ back to the source queue:

+ +
aws sqs start-message-move-task \
+  --source-arn <dlq-arn> \
+  --destination-arn <source-queue-arn> \
+  --max-number-of-messages-per-second 1
+ +

Limit the rate to avoid overwhelming a supplier or DB that has just recovered.

+ +

7. Supplier idempotency for DLQ redrives

+ +

+ If a message lands in the DLQ after a successful supplier submission (e.g. the + SUBMITTED status write failed after a successful HTTP call), redriving it will + re-submit to the supplier. That makes supplier idempotency a prerequisite for safe bulk + redrives. +

+ +

+ At present, this appears to be an open question rather than a confirmed property. The + practical next step is to add it to a supplier question list and confirm, for each supplier, + whether re-submitting the same order with the same X-Correlation-ID is treated as + idempotent. If not, redrives must be handled manually per-message rather than as a bulk + operation. +

+ +

8. Set max_receive_count to fit the retry window

+ +

+ With exponential backoff in place (item 2), max_receive_count controls the total + number of attempts rather than purely the retry window. If the main queue retention is reduced + to 2 days, the more useful design target is not an arbitrary receive count but a retry + schedule that spans roughly 48 hours before the message reaches the DLQ. +

+ +

That suggests two separate policies:

+ +
    +
  • + Auto-retryable failures: set max_receive_count high enough that, together with + exponential backoff and jitter, retries are spread across the full 48-hour main-queue + window. +
  • +
  • + Non-auto-retryable failures that remain on SQS for now: keep the default 3 receives and the + baseline 360-second visibility timeout, so they surface quickly rather than consuming the + whole retry window. +
  • +
+ +

+ This keeps the long retry budget for failures likely to self-heal, while ensuring + configuration, validation, and other operator-action errors reach the DLQ quickly for + inspection. +
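As a purely illustrative sizing exercise (the real base delay and cap are still to be agreed): with a 360-second base, doubling on each failed receive and a cap of 6 hours, the gaps run 6, 12, 24, 48, 96 and 192 minutes and then roughly 6 hours each. The cumulative delay passes 48 hours after about 13 retries, so a max_receive_count of around 14, with some headroom for jitter, would fit the target window.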

+ +

9. Add concurrency limit to order_router_order_placement ESM

+ +

+ The generic events ESM has maximum_concurrency = 10, but the + dedicated order_router_order_placement ESM does not. Without a cap, a burst of + messages could spin up Lambda concurrency proportional to queue depth, potentially exhausting + the shared Aurora Serverless connection pool. Set an explicit limit appropriate for the DB + connection limit. +

+ +
scaling_config {
+  maximum_concurrency = 10
+}
+ +

10. Batch size optimisation

+ +

+ order-placement uses batch size 1, which is safe and simple but limits + throughput. The existing handler already processes records concurrently via + Promise.all, so the code change to support a larger batch is minimal — the tuning + is primarily infrastructure config. +
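For reference, a sketch of the handler shape at batch sizes above 1; routeOrder below is a stand-in for the existing per-record logic, not the actual function:

```typescript
import type { SQSBatchResponse, SQSEvent, SQSRecord } from "aws-lambda";

// Stand-in for the existing per-record routing logic
// (validate, load supplier config, call the supplier, update status).
async function routeOrder(record: SQSRecord): Promise<void> {
  void record; // real implementation lives in the existing handler
}

// Each record is processed independently and only the failures are reported back to SQS,
// keeping per-record isolation when the batch size is raised.
export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const results = await Promise.allSettled(event.Records.map((record) => routeOrder(record)));

  const batchItemFailures = results.flatMap((result, i) =>
    result.status === "rejected" ? [{ itemIdentifier: event.Records[i].messageId }] : []
  );

  return { batchItemFailures };
}
```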

+ +

Key constraints to measure before increasing batch size:

+ +
    +
  • + Aurora Serverless connection pool: each concurrently processed record in a + batch needs a DB connection; connection exhaustion will cause failures. +
  • +
  • + Supplier rate limits: if a supplier enforces a request rate, concurrent + processing of a larger batch could breach it. +
  • +
  • + Lambda memory: each in-flight record within a batch consumes memory; + profile at current memory_size (512 MB) before increasing. +
  • +
+ +

+ The maximum_batching_window_in_seconds (already set to 5 s on the + events + ESM) is the complementary lever — letting messages accumulate before Lambda is invoked, + improving efficiency at lower throughput. The right next step here is measurement: add the + metrics needed to understand duration, error rate, supplier behaviour, and DB connection + utilisation, and run performance tests before changing batch size. Without that data, + batch-size tuning is guesswork and risks cascading failures. +

+ +

Agreed Decisions

+ +

+ This document records the agreed decisions for points 1 to 10 from the SQS and error-handling + review, separate from the options paper in + sqs-and-error-handling-future-recommendations.md. +

+ +

1. Clean up unused queues and the duplicate events ESM

+ +

Decision: implement the fix as stated.

+ +

+ Justification: the unused queues are creating avoidable cost and operational noise, and the + duplicate events event source mapping is redundant unless it is intentionally + serving a separate trigger path. +

+ +

2. Fix the visibility timeout and add exponential backoff with jitter

+ +

+ Decision: set the main queue visibility timeout to 360 seconds, reduce main queue retention to + 2 days, retain DLQ retention at 14 days, and implement exponential backoff with jitter for + auto-retryable failures so retries are spread across a maximum 48-hour window. +

+ +

+ Justification: AWS recommends visibility timeout at least 6 times the Lambda timeout, and the + 48-hour retry window aligns with supplier daily processing caps while still surfacing stuck + messages promptly to the DLQ. +

+ +

3. Classify errors as auto-retryable or not in order-router-lambda

+ +

+ Decision: apply exponential backoff with jitter only to auto-retryable failures. For + non-auto-retryable failures, keep the simpler policy of 3 attempts using the default + 360-second visibility timeout. +

+ +

+ Justification: transient failures benefit from a longer retry window, while validation, + configuration, and similar operator-action failures should surface to the DLQ quickly rather + than consuming that entire window. +

+ +

4. Enable bisect_batch_on_function_error

+ +

Decision: implement the fix as stated.

+ +

+ Justification: this isolates poison-pill messages more quickly and reduces wasted retry + attempts before a bad message reaches the DLQ. +

+ +

+ 5. Address the QUEUED status / enqueue race condition in + order-service-lambda +

+ +

+ Decision: use the logging-based approach described as Option C. If + sendMessage succeeds but the QUEUED status write fails, do not + return an error to the user; log the failure and allow downstream processing to continue. +

+ +

+ Justification: once the message is on the queue, the order will continue to be processed. + Logging the failure is the simplest operational response for now, and the solution architect + explicitly preferred this until there is better evidence on frequency and impact. +

+ +

6. Integrated error handling: DLQ, persistent failure store, and triage

+ +

Decision: use Option A for now: DLQ plus manual triage.

+ +

+ Justification: this is the agreed immediate operating model. The more capable persistence + options remain available for later if the current approach proves insufficient. +

+ +

7. Supplier idempotency for DLQ redrives

+ +

+ Decision: do not assume supplier idempotency. Record this as an open supplier question and + confirm, for each supplier, whether re-submitting the same order with the same + X-Correlation-ID is idempotent before relying on bulk redrives. +

+ +

+ Justification: if a supplier submission succeeded before the message reached the DLQ, + redriving could create a duplicate order unless the supplier endpoint is idempotent. +

+ +

8. Set max_receive_count to fit the retry window

+ +

+ Decision: configure max_receive_count to support the 48-hour retry window for + auto-retryable failures, rather than choosing an arbitrary fixed number. Keep + non-auto-retryable failures at 3 receives. +

+ +

+ Justification: the design target is a two-day retry period for transient failures, not a + specific receive count. For non-auto-retryable failures, quick DLQ visibility is more useful + than a prolonged retry cycle. +

+ +

9. Add concurrency limit to order_router_order_placement ESM

+ +

Decision: implement the fix as stated.

+ +

+ Justification: an explicit concurrency cap reduces the risk of a queue burst exhausting shared + Aurora Serverless database connections. +

+ +

10. Batch size optimisation

+ +

+ Decision: do not change batch size yet. First add the required metrics and, if needed, + performance testing to understand duration, error rate, supplier behaviour, and database + connection usage. +

+ +

+ Justification: batch-size tuning without production-relevant metrics is guesswork and risks + introducing cascading failures. +

+ + diff --git a/docs/developer-guides/sqs-combined.md b/docs/developer-guides/sqs-combined.md index 36574378..1926dddd 100644 --- a/docs/developer-guides/sqs-combined.md +++ b/docs/developer-guides/sqs-combined.md @@ -1,5 +1,9 @@ # SQS and Error Handling +## Overview + +This page consolidates the current SQS and error-handling implementation, the future recommendations discussed on this branch, and the agreed decisions taken so far. It is intended to provide a single view of the present design, the available improvement options, and the decisions that will be implemented or followed up. + ## Current State ### Queues From dca809dd6f8739b1946715dfea44633c7399502a Mon Sep 17 00:00:00 2001 From: billwirz1 Date: Tue, 7 Apr 2026 14:41:46 +0100 Subject: [PATCH 7/7] chore: add-contents-section-with-links --- .../sqs-combined-confluence.html | 35 +++++++++++++++---- docs/developer-guides/sqs-combined.md | 8 ++++- 2 files changed, 36 insertions(+), 7 deletions(-) diff --git a/docs/developer-guides/sqs-combined-confluence.html b/docs/developer-guides/sqs-combined-confluence.html index 707314fe..0ab827e1 100644 --- a/docs/developer-guides/sqs-combined-confluence.html +++ b/docs/developer-guides/sqs-combined-confluence.html @@ -35,21 +35,44 @@ code { font-family: Menlo, Consolas, monospace; } + + .contents { + margin: 16px 0 24px; + padding: 12px; + border: 1px solid #ccc; + } + + .contents ul { + margin: 8px 0 0; + }

SQS and Error Handling

+ +

Overview

This page consolidates the current SQS and error-handling implementation, the future - recommendations discussed on this branch, and the agreed decisions taken so far. It is - intended to provide a single view of the present design, the available improvement options, - and the decisions that will be implemented or followed up. + recommendations discussed on + this branch, and the agreed decisions taken so far. It is intended to provide a single view of the + present design, the available improvement options, and the decisions that will be implemented + or followed up.

-

Current State

+

Current State

Queues

@@ -302,7 +325,7 @@

Monitoring

All alarms notify the SNS alerts topic. OK transitions also notify.

-

Future Recommendations

+

Future Recommendations

1. Clean up unused queues and the duplicate events ESM

@@ -684,7 +707,7 @@

10. Batch size optimisation

batch-size tuning is guesswork and risks cascading failures.

-

Agreed Decisions

+

Agreed Decisions

This document records the agreed decisions for points 1 to 10 from the SQS and error-handling diff --git a/docs/developer-guides/sqs-combined.md b/docs/developer-guides/sqs-combined.md index 1926dddd..0488050d 100644 --- a/docs/developer-guides/sqs-combined.md +++ b/docs/developer-guides/sqs-combined.md @@ -1,8 +1,14 @@ # SQS and Error Handling +## Contents + +- [Current State](#current-state) +- [Future Recommendations](#future-recommendations) +- [Agreed Decisions](#agreed-decisions) + ## Overview -This page consolidates the current SQS and error-handling implementation, the future recommendations discussed on this branch, and the agreed decisions taken so far. It is intended to provide a single view of the present design, the available improvement options, and the decisions that will be implemented or followed up. +This page consolidates the current SQS and error-handling implementation, the future recommendations discussed on [this branch](https://github.com/NHSDigital/hometest-service/tree/feature/hote-813/document-sqs-and-error-handling-approach), and the agreed decisions taken so far. It is intended to provide a single view of the present design, the available improvement options, and the decisions that will be implemented or followed up. ## Current State