docs: add-current-state-and-future-recs by billwirz1 · Pull Request #292 · NHSDigital/hometest-service

billwirz1 · 2026-03-27T12:30:46Z

Description

Context

Type of changes

Refactoring (non-breaking change)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would change existing functionality)
Bug fix (non-breaking change which fixes an issue)

Checklist

I am familiar with the contributing guidelines
I have followed the code style of the project
I have added tests to cover my changes
I have updated the documentation accordingly
This PR is a result of pair or mob programming

Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you NOT to include PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) or the codebase changes. We will remove any PR that contains sensitive information. We really appreciate your cooperation in this matter.

I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

Copilot

Pull request overview

Adds internal developer documentation describing the current SQS setup/message flow and proposing future improvements to queue usage and error-handling strategy.

Changes:

Documented the current SQS queues, configurations, message flow, and error-handling behaviour.
Added a forward-looking recommendations guide covering queue cleanup, retry/backoff strategy, and DLQ/outbox patterns.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
docs/developer-guides/sqs-current-state.md	New “current state” write-up for queues, flow, configuration, and monitoring.
docs/developer-guides/sqs-and-error-handling-future-recommendations.md	New recommendations guide for future SQS/error-handling improvements and operational runbooks.

Copilot · 2026-03-27T12:34:05Z

docs/developer-guides/sqs-and-error-handling-future-recommendations.md

+
+Currently all failures are treated as retryable — any exception pushes the message ID into `batchItemFailures`, which causes SQS to re-deliver. A 400 from the supplier API (malformed FHIR payload, invalid correlation ID, etc.) will never succeed on retry; retrying it wastes `max_receive_count` attempts and eventually sends a legitimate patient order to the DLQ unnecessarily.
+
+The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost.


“Return a 4xx error silently from batchItemFailures” is misleading for an SQS-triggered Lambda (there is no HTTP response); rephrase to say that non-retryable failures should be treated as success by omitting the message from batchItemFailures, while persisting the failure and updating order state.

Suggested change

The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost.

For non-retryable failures, the handler should **not** add the message to `batchItemFailures`, which causes Lambda to treat it as successfully processed and delete it from the queue. The failure must not simply be discarded, however — it must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost.
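To make the suggested wording concrete, here is a minimal sketch of the triage step, assuming the event source mapping has `ReportBatchItemFailures` enabled. The `Outcome` shape, the `triage` helper, and the rule that 408 and 429 count as transient are illustrative assumptions, not code from this repository.

```typescript
// Shape of the partial-batch response Lambda expects from an SQS-triggered
// function when ReportBatchItemFailures is enabled.
interface SqsBatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// Hypothetical per-message result produced by the handler's processing loop.
// `httpStatus` is the supplier API status when the failure came from an HTTP call.
interface Outcome {
  messageId: string;
  ok: boolean;
  httpStatus?: number;
  reason?: string;
}

// Assumption: failures without a status (network errors, timeouts), 5xx,
// 408 and 429 are transient; every other status is treated as permanent.
function isRetryable(outcome: Outcome): boolean {
  const s = outcome.httpStatus;
  if (s === undefined) return true;
  return s >= 500 || s === 408 || s === 429;
}

interface Triage {
  response: SqsBatchResponse; // returned to Lambda
  permanentFailures: Outcome[]; // to be persisted before returning
}

function triage(outcomes: Outcome[]): Triage {
  const response: SqsBatchResponse = { batchItemFailures: [] };
  const permanentFailures: Outcome[] = [];
  for (const o of outcomes) {
    if (o.ok) continue;
    if (isRetryable(o)) {
      // Listed here, so SQS re-delivers the message.
      response.batchItemFailures.push({ itemIdentifier: o.messageId });
    } else {
      // Omitted from batchItemFailures, so Lambda deletes the message.
      permanentFailures.push(o);
    }
  }
  return { response, permanentFailures };
}
```

The handler would persist every entry in `permanentFailures` and update the order status before returning `response`. If that write fails, letting the error propagate keeps the message on the queue rather than losing it.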

Copilot · 2026-03-27T12:34:06Z

docs/developer-guides/sqs-and-error-handling-future-recommendations.md

+- Messages are written to an `outbox` DB table _inside the originating transaction_ (alongside the order record). A separate process polls the outbox and enqueues to SQS. The DB row is the source of truth; the queue is the delivery mechanism.
+- Failed/DLQ'd messages are always traceable to a DB row. Non-retryable failures update the row status rather than being discarded silently. Reprocessing is a targeted DB update.
+- Pro: eliminates the enqueue/QUEUED race; indefinite retention; unified state model; supports both retryable and non-retryable failure paths cleanly.
+- Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled Lambda or EventBridge Pipes reading the outbox table).


“event bridge pipe” should use the AWS service’s proper name/capitalisation (e.g. EventBridge Pipes) to avoid confusion when someone searches for the feature in AWS docs.
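For readers unfamiliar with the poller side of the outbox pattern described in the hunk above, a rough sketch of one polling pass follows. The row shape, the `drainOutbox` name, and the `enqueue` callback are hypothetical, and the function is kept synchronous for brevity; real database and SQS calls would be awaited.

```typescript
type OutboxStatus = "PENDING" | "SENT";

// Hypothetical outbox row; column names are illustrative, not the project's schema.
interface OutboxRow {
  id: number;
  orderId: string;
  payload: string;
  status: OutboxStatus;
  attempts: number;
}

// One polling pass: try to enqueue every PENDING row, oldest first.
// `enqueue` stands in for the SQS send call and throws on failure.
function drainOutbox(rows: OutboxRow[], enqueue: (row: OutboxRow) => void): number {
  let sent = 0;
  const pending = rows
    .filter((r) => r.status === "PENDING")
    .sort((a, b) => a.id - b.id);
  for (const row of pending) {
    row.attempts += 1;
    try {
      enqueue(row);
      // Mark SENT only after the enqueue succeeded. A crash between these two
      // steps leaves the row PENDING, so it is sent again on the next pass:
      // delivery is at-least-once and consumers need to be idempotent.
      row.status = "SENT";
      sent += 1;
    } catch {
      // Leave the row PENDING; the next pass retries it.
    }
  }
  return sent;
}
```

Because the row stays in the table whatever happens to the queue, a failed or dead-lettered message can always be traced back to it, which is the traceability benefit the bullets describe.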

Copilot · 2026-03-27T12:34:06Z

docs/developer-guides/sqs-current-state.md

+
+## Queues
+
+Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper.


hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf appears to be outside this repo (no other references found), so readers may not know where to find it; add a link to the repo/location or clarify that this file lives in a separate Terraform repository.

Suggested change

Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper.

Four queues are provisioned in the separate `hometest-mgmt-terraform` Terraform repository (file `infrastructure/src/hometest-app/sqs.tf`) via the shared `infrastructure/modules/sqs` wrapper.

Copilot · 2026-03-27T12:34:06Z

docs/developer-guides/sqs-current-state.md

+| ------------------------ | ---------------------------------------- | --------------- |
+| `{queue}-age-high`       | `ApproximateAgeOfOldestMessage`          | 600 s           |
+| `{queue}-depth-high`     | `ApproximateNumberOfMessagesVisible`     | 1,000           |
+| `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) |


The alarm name placeholder {queue-dlq}-depth-high is inconsistent with the other placeholders and ambiguous (is it {queue}-dlq or a separate variable?); consider renaming the placeholder to clearly reflect the DLQ queue name pattern (e.g. {queue}-dlq-depth-high).

Suggested change

| `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) |

| `{queue}-dlq-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) |

lewisbirks · 2026-03-27T14:14:36Z

docs/developer-guides/sqs-and-error-handling-future-recommendations.md

+
+If `sendMessage` succeeds but the subsequent `QUEUED` status write fails, the order is in the queue with no `QUEUED` DB status. `order-router-lambda` will process it and attempt to write `SUBMITTED`, skipping `QUEUED` entirely.
+
+**Option A (preferred):** Wrap the `sendMessage` and `addOrderStatusUpdate` calls in a single DB transaction with an outbox pattern — persist the message to an `outbox` table inside the transaction, then publish from the outbox. This eliminates the race entirely.


This is the preferred and better solution. It would be more work, but I think it is the right approach to future-proof ourselves. The approach could then be documented and implemented in other scenarios as well.
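As a rough illustration of why Option A removes the race, the sketch below stages the `QUEUED` status and the outbox row in one all-or-nothing write. `FakeDb`, `queueOrder`, and the row shapes are hypothetical in-memory stand-ins, not the project's data layer; a real implementation would use the database's own transaction support.

```typescript
interface StatusRow {
  orderId: string;
  status: string;
}

interface OutboxEntry {
  orderId: string;
  payload: string;
}

interface Tx {
  statuses: StatusRow[];
  outbox: OutboxEntry[];
}

// Tiny in-memory stand-in for a database, only to show the all-or-nothing write.
class FakeDb {
  statuses: StatusRow[] = [];
  outbox: OutboxEntry[] = [];

  // Runs `work` against staged copies and commits both tables together,
  // or neither if `work` throws.
  transaction(work: (tx: Tx) => void): void {
    const tx: Tx = { statuses: this.statuses.slice(), outbox: this.outbox.slice() };
    work(tx); // a throw here skips the commit below
    this.statuses = tx.statuses;
    this.outbox = tx.outbox;
  }
}

// Replaces the separate sendMessage + QUEUED status write: both records are
// staged in one transaction, and a poller later publishes the outbox row to SQS.
function queueOrder(db: FakeDb, orderId: string, payload: string): void {
  db.transaction((tx) => {
    tx.statuses.push({ orderId, status: "QUEUED" });
    tx.outbox.push({ orderId, payload });
  });
}
```

With this shape there is no state in which a message exists without its `QUEUED` status: either both rows commit or neither does.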

billwirz1 added 2 commits March 27, 2026 12:20

docs: add-current-state-and-future-recs

a427daf

docs: clarify-bulk-reprocessing

8d9613f

Copilot AI review requested due to automatic review settings March 27, 2026 12:30

Copilot started reviewing on behalf of billwirz1 March 27, 2026 12:31

Copilot AI reviewed Mar 27, 2026

lewisbirks reviewed Mar 27, 2026

chore: clarify-transient-errors

00922f9

billwirz1 wants to merge 3 commits into main from feature/hote-813/document-sqs-and-error-handling-approach


