Conversation
There was a problem hiding this comment.
Pull request overview
Adds internal developer documentation describing the current SQS setup/message flow and proposing future improvements to queue usage and error-handling strategy.
Changes:
- Documented the current SQS queues, configurations, message flow, and error-handling behaviour.
- Added a forward-looking recommendations guide covering queue cleanup, retry/backoff strategy, and DLQ/outbox patterns.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/developer-guides/sqs-current-state.md | New “current state” write-up for queues, flow, configuration, and monitoring. |
| docs/developer-guides/sqs-and-error-handling-future-recommendations.md | New recommendations guide for future SQS/error-handling improvements and operational runbooks. |
|
|
||
| Currently all failures are treated as retryable — any exception pushes the message ID into `batchItemFailures`, which causes SQS to re-deliver. A 400 from the supplier API (malformed FHIR payload, invalid correlation ID, etc.) will never succeed on retry; retrying it wastes `max_receive_count` attempts and eventually sends a legitimate patient order to the DLQ unnecessarily. | ||
|
|
||
| The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost. |
There was a problem hiding this comment.
“Return a 4xx error silently from batchItemFailures” is misleading for an SQS-triggered Lambda (there is no HTTP response); rephrase to say that non-retryable failures should be treated as success by omitting the message from batchItemFailures, while persisting the failure and updating order state.
| The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost. | |
| For non-retryable failures, the handler should **not** add the message to `batchItemFailures`, which causes Lambda to treat it as successfully processed and delete it from the queue. The failure must not simply be discarded, however — it must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost. |
| - Messages are written to an `outbox` DB table _inside the originating transaction_ (alongside the order record). A separate process polls the outbox and enqueues to SQS. The DB row is the source of truth; the queue is the delivery mechanism. | ||
| - Failed/DLQ'd messages are always traceable to a DB row. Non-retryable failures update the row status rather than being discarded silently. Reprocessing is a targeted DB update. | ||
| - Pro: eliminates the enqueue/QUEUED race; indefinite retention; unified state model; supports both retryable and non-retryable failure paths cleanly. | ||
| - Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled Lambda or EventBridge Pipes reading the outbox table). |
There was a problem hiding this comment.
“event bridge pipe” should use the AWS service’s proper name/capitalisation (e.g. EventBridge Pipes) to avoid confusion when someone searches for the feature in AWS docs.
|
|
||
| ## Queues | ||
|
|
||
| Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper. |
There was a problem hiding this comment.
hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf appears to be outside this repo (no other references found), so readers may not know where to find it; add a link to the repo/location or clarify that this file lives in a separate Terraform repository.
| Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper. | |
| Four queues are provisioned in the separate `hometest-mgmt-terraform` Terraform repository (file `infrastructure/src/hometest-app/sqs.tf`) via the shared `infrastructure/modules/sqs` wrapper. |
| | ------------------------ | ---------------------------------------- | --------------- | | ||
| | `{queue}-age-high` | `ApproximateAgeOfOldestMessage` | 600 s | | ||
| | `{queue}-depth-high` | `ApproximateNumberOfMessagesVisible` | 1,000 | | ||
| | `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) | |
There was a problem hiding this comment.
The alarm name placeholder {queue-dlq}-depth-high is inconsistent with the other placeholders and ambiguous (is it {queue}-dlq or a separate variable?); consider renaming the placeholder to clearly reflect the DLQ queue name pattern (e.g. {queue}-dlq-depth-high).
| | `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) | | |
| | `{queue}-dlq-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) | |
docs/developer-guides/sqs-and-error-handling-future-recommendations.md
Outdated
Show resolved
Hide resolved
|
|
||
| If `sendMessage` succeeds but the subsequent `QUEUED` status write fails, the order is in the queue with no `QUEUED` DB status. `order-router-lambda` will process it and attempt to write `SUBMITTED`, skipping `QUEUED` entirely. | ||
|
|
||
| **Option A (preferred):** Wrap the `sendMessage` and `addOrderStatusUpdate` calls in a single DB transaction with an outbox pattern — persist the message to an `outbox` table inside the transaction, then publish from the outbox. This eliminates the race entirely. |
There was a problem hiding this comment.
this is the more preferred and better solution. It would be more work but I think it is the right approach to future proof ourselves better. The approach could then be documented and implemented in other scenarios as well
Description
Context
Type of changes
Checklist
Sensitive Information Declaration
To ensure the utmost confidentiality and protect your and others privacy, we kindly ask you to NOT including PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that do contain any sensitive information. We really appreciate your cooperation in this matter.