
docs: add-current-state-and-future-recs #292

Draft
billwirz1 wants to merge 3 commits into main from
feature/hote-813/document-sqs-and-error-handling-approach

Conversation

@billwirz1
Contributor

Description

Context

Type of changes

  • Refactoring (non-breaking change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would change existing functionality)
  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I am familiar with the contributing guidelines
  • I have followed the code style of the project
  • I have added tests to cover my changes
  • I have updated the documentation accordingly
  • This PR is a result of pair or mob programming

Sensitive Information Declaration

To protect your privacy and that of others, please do NOT include PII (Personally Identifiable Information), PID (Personally Identifiable Data), or any other sensitive data in this PR (Pull Request) or in the codebase changes. We will remove any PR that contains sensitive information. We appreciate your cooperation in this matter.

  • I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

Copilot AI review requested due to automatic review settings March 27, 2026 12:30
Contributor

Copilot AI left a comment

Pull request overview

Adds internal developer documentation describing the current SQS setup/message flow and proposing future improvements to queue usage and error-handling strategy.

Changes:

  • Documented the current SQS queues, configurations, message flow, and error-handling behaviour.
  • Added a forward-looking recommendations guide covering queue cleanup, retry/backoff strategy, and DLQ/outbox patterns.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| ---- | ----------- |
| `docs/developer-guides/sqs-current-state.md` | New “current state” write-up for queues, flow, configuration, and monitoring. |
| `docs/developer-guides/sqs-and-error-handling-future-recommendations.md` | New recommendations guide for future SQS/error-handling improvements and operational runbooks. |


Currently all failures are treated as retryable — any exception pushes the message ID into `batchItemFailures`, which causes SQS to re-deliver. A 400 from the supplier API (malformed FHIR payload, invalid correlation ID, etc.) will never succeed on retry; retrying it wastes `max_receive_count` attempts and eventually sends a legitimate patient order to the DLQ unnecessarily.

The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost.

Copilot AI Mar 27, 2026

“Return a 4xx error silently from batchItemFailures” is misleading for an SQS-triggered Lambda (there is no HTTP response); rephrase to say that non-retryable failures should be treated as success by omitting the message from batchItemFailures, while persisting the failure and updating order state.

Suggested change
The fix is to return a 4xx error silently from `batchItemFailures` — i.e., **not** add the message to `batchItemFailures`, causing Lambda to treat it as success and delete it from the queue. It must not simply be discarded, however — the failure must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost.
For non-retryable failures, the handler should **not** add the message to `batchItemFailures`, which causes Lambda to treat it as successfully processed and delete it from the queue. The failure must not simply be discarded, however — it must be persisted to a durable store (see item 5) and the order status updated accordingly so the record is not lost.
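
The behaviour described in the suggested wording can be sketched as follows. This is a minimal illustration in Python, not the project's handler: `NonRetryableError`, `FailureStore`, and `handle_batch` are hypothetical names, and the in-memory store stands in for whatever durable store and order-status update the guide's item 5 refers to. The response shape (`batchItemFailures` with `itemIdentifier` entries) is the one Lambda expects when `ReportBatchItemFailures` is enabled on the SQS event source mapping.

```python
from dataclasses import dataclass, field


class NonRetryableError(Exception):
    """Hypothetical marker for failures that cannot succeed on retry (e.g. a supplier 400)."""


@dataclass
class FailureStore:
    """Stand-in for the durable store; a real one would write to the database
    and update the order status in the same step."""
    records: list = field(default_factory=list)

    def persist(self, message_id: str, reason: str) -> None:
        self.records.append({"messageId": message_id, "reason": reason})


def handle_batch(event: dict, process, store: FailureStore) -> dict:
    """Process an SQS batch and report only retryable failures back to Lambda."""
    failures = []
    for record in event.get("Records", []):
        message_id = record["messageId"]
        try:
            process(record)
        except NonRetryableError as exc:
            # Omit the message from batchItemFailures so it is deleted from
            # the queue, but persist the failure so the order is not lost.
            store.persist(message_id, str(exc))
        except Exception:
            # Any other exception is assumed to be transient: ask SQS to redeliver.
            failures.append({"itemIdentifier": message_id})
    return {"batchItemFailures": failures}
```

The key design point is that classification happens inside the handler: only messages listed in `batchItemFailures` count against `max_receive_count`, so a permanently invalid payload no longer consumes retries before landing in the DLQ.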

Copilot uses AI. Check for mistakes.
- Messages are written to an `outbox` DB table _inside the originating transaction_ (alongside the order record). A separate process polls the outbox and enqueues to SQS. The DB row is the source of truth; the queue is the delivery mechanism.
- Failed/DLQ'd messages are always traceable to a DB row. Non-retryable failures update the row status rather than being discarded silently. Reprocessing is a targeted DB update.
- Pro: eliminates the enqueue/QUEUED race; indefinite retention; unified state model; supports both retryable and non-retryable failure paths cleanly.
- Con: higher implementation complexity; requires a reliable outbox poller (e.g. a scheduled Lambda or EventBridge Pipes reading the outbox table).
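
To make the outbox idea above concrete, here is a rough sketch using Python's built-in `sqlite3` module. Everything specific in it is an assumption for illustration: the `orders` and `outbox` schemas, the `PENDING`/`SENT` outbox states, and the `send` callable that stands in for the SQS publish step. The real implementation would use the project's own database and poller.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical minimal schema: one table for orders, one for the outbox.
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            order_id TEXT NOT NULL,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'PENDING'
        );
        """
    )


def create_order(conn, order_id, payload):
    """Write the order and its outbox row in a single transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'QUEUED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (order_id, payload) VALUES (?, ?)",
            (order_id, json.dumps(payload)),
        )


def publish_pending(conn, send):
    """Poll the outbox, pass each pending payload to `send`, return the count sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE status = 'PENDING' ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            send(json.loads(payload))
        except Exception:
            continue  # leave the row PENDING so the next poll retries it
        with conn:
            conn.execute("UPDATE outbox SET status = 'SENT' WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Because the row is marked `SENT` only after `send` returns, a crash between the two steps results in the message being published again on the next poll. Delivery is therefore at-least-once, and the consumer needs to be idempotent.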

Copilot AI Mar 27, 2026

“event bridge pipe” should use the AWS service’s proper name/capitalisation (e.g. EventBridge Pipes) to avoid confusion when someone searches for the feature in AWS docs.


## Queues

Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper.

Copilot AI Mar 27, 2026

hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf appears to be outside this repo (no other references found), so readers may not know where to find it; add a link to the repo/location or clarify that this file lives in a separate Terraform repository.

Suggested change
Four queues are provisioned in `hometest-mgmt-terraform/infrastructure/src/hometest-app/sqs.tf` via the shared `infrastructure/modules/sqs` wrapper.
Four queues are provisioned in the separate `hometest-mgmt-terraform` Terraform repository (file `infrastructure/src/hometest-app/sqs.tf`) via the shared `infrastructure/modules/sqs` wrapper.

| ------------------------ | ---------------------------------------- | --------------- |
| `{queue}-age-high` | `ApproximateAgeOfOldestMessage` | 600 s |
| `{queue}-depth-high` | `ApproximateNumberOfMessagesVisible` | 1,000 |
| `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) |

Copilot AI Mar 27, 2026

The alarm name placeholder {queue-dlq}-depth-high is inconsistent with the other placeholders and ambiguous (is it {queue}-dlq or a separate variable?); consider renaming the placeholder to clearly reflect the DLQ queue name pattern (e.g. {queue}-dlq-depth-high).

Suggested change
| `{queue-dlq}-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) |
| `{queue}-dlq-depth-high` | DLQ `ApproximateNumberOfMessagesVisible` | 0 (any message) |
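
As a small illustration of how the three thresholds in this table behave, the sketch below evaluates them against sample metric values. It uses the suggested `{queue}-dlq-depth-high` naming and assumes a strictly-greater-than comparison, which the table does not state explicitly (it is implied for the DLQ row by "0 (any message)"). The function name and the `orders` queue name are hypothetical.

```python
ALARM_RULES = [
    # (name suffix, metric, threshold, evaluated on the DLQ?)
    ("age-high", "ApproximateAgeOfOldestMessage", 600, False),
    ("depth-high", "ApproximateNumberOfMessagesVisible", 1000, False),
    ("dlq-depth-high", "ApproximateNumberOfMessagesVisible", 0, True),
]


def firing_alarms(queue: str, main_metrics: dict, dlq_metrics: dict) -> list:
    """Return the names of alarms whose metric is strictly above its threshold."""
    firing = []
    for suffix, metric, threshold, on_dlq in ALARM_RULES:
        metrics = dlq_metrics if on_dlq else main_metrics
        # Missing metrics are treated as 0, i.e. an empty queue.
        if metrics.get(metric, 0) > threshold:
            firing.append(f"{queue}-{suffix}")
    return firing


# A single message in the DLQ is enough to fire the DLQ alarm.
print(firing_alarms("orders", {"ApproximateAgeOfOldestMessage": 30}, {"ApproximateNumberOfMessagesVisible": 1}))
```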


If `sendMessage` succeeds but the subsequent `QUEUED` status write fails, the order is in the queue with no `QUEUED` DB status. `order-router-lambda` will process it and attempt to write `SUBMITTED`, skipping `QUEUED` entirely.

**Option A (preferred):** Wrap the `sendMessage` and `addOrderStatusUpdate` calls in a single DB transaction with an outbox pattern — persist the message to an `outbox` table inside the transaction, then publish from the outbox. This eliminates the race entirely.
Contributor

This is the preferred and better solution. It would be more work, but I think it is the right approach to future-proof ourselves. The approach could then be documented and implemented in other scenarios as well.

3 participants