
DTOSS-12504: Investigate slow container start after idle period#1212

Open
josielsouzanordcloud wants to merge 12 commits into main from
DTOSS-12504-investigate-slow-container-start-after-idle-period

Conversation

@josielsouzanordcloud
Contributor

Summary

  • Replace DefaultAzureCredential with ManagedIdentityCredential for database authentication to skip unnecessary credential chain probing on cold connections
  • Enable psycopg3 process-level connection pool (min_size=1) to keep one warm authenticated connection alive after idle periods, eliminating the TCP+TLS+Azure AD authentication cost on the first post-idle request
  • Add psycopg[pool] dependency
  • Add debug timing log for Azure AD token acquisition to confirm impact in deployed environments

Background

Investigation of a 10–20 s delay on the first request after an idle period. Container metrics (CPU ~2.5%, memory ~1 GB, stable throughout) ruled out Azure hibernation and scale-to-zero. The root causes identified were:

  1. DefaultAzureCredential probes a chain of credential providers (EnvironmentCredential, WorkloadIdentityCredential, etc.) before reaching ManagedIdentityCredential; in Azure Container Apps some of those probes may make network calls before failing through
  2. With CONN_MAX_AGE=0 and gevent, each request creates a new greenlet with its own DatabaseWrapper, paying the full TCP+TLS+Azure AD authentication cost on every connection — including the first request after idle

The connection pool (min_size=1) keeps one authenticated connection alive across requests, so the pool only calls get_connection_params() (and therefore IMDS) when creating new physical connections, not on every request checkout.
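As a concrete illustration of the intended settings shape, here is a minimal sketch, assuming Django 5.1+'s native psycopg3 pool support (the real settings module and connection details are elided and may differ):

```python
# Sketch only: Django passes OPTIONS["pool"] through to
# psycopg_pool.ConnectionPool. min_size=1 keeps one authenticated
# connection warm across idle periods. Host/user entries are elided.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "OPTIONS": {
            "pool": {"min_size": 1},
        },
    }
}
```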

Test plan

  • Deploy to review app (lowest environment with Azure PostgreSQL Flexible Server + managed identity)
  • Set LOG_LEVEL=DEBUG in the Container App environment variables
  • Confirm the "Azure DB token acquired in Xs" log appears only at pool startup, not per request
  • Manually verify the idle-period delay is reduced by leaving the app idle and timing the first request back

Jira link

DTOSS-12504

Review notes

Review checklist

  • Check database queries are correctly scoped to current_provider

@github-actions

The review app is available at this URL:
https://pr-1212.manage-breast-screening.non-live.screening.nhs.uk
You must authenticate with HTTP basic authentication. Ask the team for credentials.

@josielsouzanordcloud force-pushed the DTOSS-12504-investigate-slow-container-start-after-idle-period branch from 39c112e to 0db1749 on March 23, 2026 09:53
@josielsouzanordcloud force-pushed the DTOSS-12504-investigate-slow-container-start-after-idle-period branch from 0db1749 to 17d7303 on March 23, 2026 10:32
# connection, not on every request. The timing log below is therefore
# expected to appear only at pool initialisation and when the pool expands.
# https://docs.djangoproject.com/en/6.0/ref/databases/#connection-pool
assert self.azure_credential is not None
Contributor

@MatMoore Mar 23, 2026


This is a gotcha in python: assert should not be used for preconditions like this, as it's possible for the python interpreter to be compiled without assertion support.

Suggested change:
-assert self.azure_credential is not None
+if self.azure_credential is None:
+    raise Exception('something explaining the problem')
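To illustrate the gotcha, a self-contained sketch (names hypothetical): when the interpreter runs with `python -O`, `__debug__` is False and `assert` statements are compiled out entirely, so only an explicit check is a reliable precondition:

```python
class MissingCredentialError(RuntimeError):
    """Raised when the Azure credential was never configured."""

def require_credential(credential):
    # Unlike `assert credential is not None`, this check survives
    # `python -O`, which strips assert statements from the bytecode.
    if credential is None:
        raise MissingCredentialError(
            "azure_credential is None; is AZURE_DB_CLIENT_ID set?"
        )
    return credential
```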

Contributor

@MatMoore left a comment


The change to ManagedIdentityCredential looks fine.

I think there are a couple of problems remaining with the pooling solution though:

  1. The way Django passes the connection params to the pool means they will get stale, unless I'm misreading the code
  2. The warmup code in wsgi.py potentially blocks the app from becoming ready to serve requests, including the healthcheck

If we can't get this to work right away, perhaps another way forward would be to revert this part of the PR and choose a DATABASE_CONN_MAX_AGE that is lower than the token expiry but still non-zero. I'd like to get it working with the psycopg pool if we can though.

Comment thread on manage_breast_screening/config/wsgi.py (Outdated)
#
# Without this, the psycopg3 pool is created lazily on the first DB access —
# which means the first real request incurs a TCP+TLS+IMDS+Postgres-auth
# round-trip (typically 5–30 s in Azure Container Apps after an idle period).
Contributor


This 5–30 s round-trip still seems really excessive to me. Do we have any way of measuring where this latency is coming from, or analysing it without the Django app as a complicating factor?

Also, this is triggered by the entrypoint of the app (entrypoint calls gunicorn, which spins up workers, then the workers import this file and start the wsgi application). I think we need to make sure that this won't interfere with health checks and autoscaling.

If this can take a long time, perhaps it needs to be non-blocking. It's not my area of expertise but coupling the app availability to the database in any way feels potentially problematic.
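One way to decouple warmup from readiness is a background thread; the sketch below uses a placeholder body and is not the PR's implementation, just an illustration of the non-blocking idea:

```python
import logging
import threading

logger = logging.getLogger(__name__)
warmup_done = threading.Event()

def _warm_database():
    # Placeholder body: in the real app this would be something like
    # django.db.connection.ensure_connection(). Failures are logged and
    # never allowed to block worker startup.
    try:
        pass
    except Exception:
        logger.warning("Database warmup failed", exc_info=True)
    finally:
        warmup_done.set()

# Daemon thread: the worker can serve /healthcheck immediately while the
# first (slow) connection is established in the background.
threading.Thread(target=_warm_database, name="db-warmup", daemon=True).start()
```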

# This makes use of in-memory token caching
# https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/identity/azure-identity/TOKEN_CACHING.md#in-memory-token-caching
return self.azure_credential.get_token(
# Called by the psycopg3 connection pool when creating a new physical
Contributor


After looking more closely, I'm not sure if this is correct...

Looking at the code in django/db/backends/postgresql/base.py for the superclass, the pool is instantiated like this:

pool = ConnectionPool(
    kwargs=connect_kwargs,
    open=False,  # Do not open the pool during startup.
    configure=self._configure_connection,
    check=ConnectionPool.check_connection if enable_checks else None,
    **pool_options,
)

But this _get_azure_connection_password method is called as part of get_connection_params, which is called as a one-off before the pool is instantiated:

connect_kwargs = self.get_connection_params()

I think this means that the pool's connection params will get stale when the token expires, so although it will appear to work at deploy time we'll run into errors later(?)

Contributor


For some more context, check out the commit message here 27f70df

The original implementation for passwordless login was based on a microsoft sample app, so we could potentially ask them if they have a working solution for pooling the connections while also ensuring that the token is properly refreshed.

Also @swebberuk identified PGBouncer as another option in this PR #1078.

Contributor


Actually, I had a look at the reference docs for ConnectionPool and it seems like maybe get_connection_params could return a callable, rather than a fixed dict.

That way ConnectionPool gets instantiated with a callback for its kwargs parameter, this would get called whenever a connection is made, and then the callback can call the _get_azure_connection_password to get the password part.

This might break assumptions in Django though, as the docstring for get_connection_params says it is supposed to return a dict.

The alternative is to override in our class the pool property from base.DatabaseWrapper, but I'd really rather not have this complexity in our codebase, and it will complicate future django updates.
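The staleness concern can be demonstrated without psycopg at all. A toy sketch (all names hypothetical) contrasting kwargs captured once as a dict — the behaviour described above — with kwargs re-evaluated per physical connection:

```python
import itertools

# Hypothetical stand-in for _get_azure_connection_password: each call
# simulates acquiring a fresh short-lived token.
_tokens = itertools.count(1)

def fresh_password():
    return f"token-{next(_tokens)}"

class DictKwargsPool:
    """Kwargs captured once at pool creation: the stale-token failure mode."""
    def __init__(self, kwargs):
        self.kwargs = kwargs

    def connect(self):
        return dict(self.kwargs)

class CallableKwargsPool:
    """Kwargs factory invoked for every new physical connection."""
    def __init__(self, kwargs_factory):
        self.kwargs_factory = kwargs_factory

    def connect(self):
        return self.kwargs_factory()

# The dict pool reuses its original token forever; the callable pool
# fetches a new one each time, so it keeps working after expiry.
stale = DictKwargsPool({"password": fresh_password()})
live = CallableKwargsPool(lambda: {"password": fresh_password()})
```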

protect_keyvault = false
vnet_address_space = "10.142.0.0/16"
deploy_database_as_container = true
deploy_database_as_container = false #TODO: set to true and remove Azure Database for PostgreSQL resources once containerised DB is in place
Contributor


Assuming this is just for testing purposes?

from manage_breast_screening.config.settings import boolean_env


class _HealthCheckSampler(ParentBased):
Contributor


Regarding naming

  • A leading _ signifies private in Python, but it's not normally used for classes, so I'd remove it
  • both the class and the property have the same name which is slightly confusing. Is there a more descriptive name for what the class is doing?

params["password"] = self._get_azure_connection_password()
return params

@property
Contributor


this looks like a nice workaround (if it works!)

# expected to appear only at pool initialisation and when the pool expands.
# https://docs.djangoproject.com/en/6.0/ref/databases/#connection-pool
assert self.azure_credential is not None
if self.azure_credential is None:
Contributor


Minor thing but is there a reason we set azure_credential to None at all, rather than throwing an exception in the init?

In the init, we currently have ManagedIdentityCredential(client_id=client_id) if client_id else None.

I'm wondering if we could move the check before that, and then remove the conditional, or perhaps move environ.get("AZURE_DB_CLIENT_ID") to the settings file and validate it as part of starting the application. (Note: @malcolmbaig has just merged a change to how the settings files are structured, so you'd need to rebase)
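Validating at settings-load time could look like this fail-fast sketch (the helper name is hypothetical):

```python
import os

def require_env(name: str) -> str:
    # Fail at application startup, rather than surfacing a None credential
    # on the first database connection attempt.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value

# In the settings module, something like:
# AZURE_DB_CLIENT_ID = require_env("AZURE_DB_CLIENT_ID")
```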

_logger.warning("Database connection warmup failed", exc_info=True)

# Warm up the Jinja2 template cache so the first user request does not pay
# the PackageLoader traversal cost (~11 s in Azure Container Apps due to
Contributor


11s is a long time to render a template!

  • If this operation is actually slower in a production-like environment than local dev, shouldn't we provision it better?
  • If this is slow locally as well then I think we should have another dev ticket to look into this specifically, maybe there are ways we can speed it up

We are currently configuring it to look in multiple places, so perhaps that is adding a lot of overhead?

        env.loader = ChoiceLoader(
            [
                env.loader,
                PackageLoader(
                    "nhsuk_frontend_jinja", package_path="templates/nhsuk/components"
                ),
                PackageLoader(
                    "nhsuk_frontend_jinja", package_path="templates/nhsuk/macros"
                ),
                PackageLoader("nhsuk_frontend_jinja"),
            ]
        )

)
t3 = time.perf_counter()

logger.info(
Contributor


Are we going to leave these in? If so should they be DEBUG level if we're going to log them?

…ential for DB auth

  - Replace DefaultAzureCredential with ManagedIdentityCredential to skip unnecessary credential chain
  probing on each cold connection
  - Enable psycopg3 process-level connection pool (min_size=1) to keep one warm authenticated connection
  alive after idle periods, avoiding the TCP+TLS+Azure AD authentication cost on the first post-idle request
  - Add psycopg[pool] dependency
  - Add debug timing log for token acquisition to confirm fix in deployed envs

  Add a post-initialisation hook in wsgi.py that calls
  connection.ensure_connection() in each gunicorn worker before it
  starts accepting requests.

  Without this, the psycopg3 pool is created lazily on the first DB
  access, so the first request after an idle period pays the full
  TCP+TLS+IMDS token+Postgres auth cost (5–30 s in Azure Container
  Apps).
  connection.close() returns the connection to the pool rather than
  terminating it, so the pool stays warm for subsequent requests.

…timeouts

  Azure NAT silently drops idle TCP connections after ~4 minutes. The pool
  held a connection that appeared alive but was dead at the TCP level, so
  the first post-idle request stalled ~10 s while TCP retransmitted against
  it before giving up.
  keepalives_idle=60 sends the first probe at 60 s of inactivity, well
  within the NAT window, keeping the connection alive indefinitely.
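In Django settings terms, these keepalive parameters would presumably be passed through OPTIONS to libpq, roughly like the sketch below (the 60 s idle value comes from the commit message; interval and count are illustrative assumptions):

```python
# Sketch: libpq TCP keepalive settings, probing well inside Azure's
# ~4-minute NAT idle window. Connection details are elided.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "OPTIONS": {
            "keepalives": 1,            # enable TCP keepalives
            "keepalives_idle": 60,      # first probe after 60 s of inactivity
            "keepalives_interval": 10,  # then re-probe every 10 s (assumed)
            "keepalives_count": 3,      # declare dead after 3 failures (assumed)
        },
    }
}
```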

…spans

  azure-monitor-opentelemetry only bundles opentelemetry-instrumentation-psycopg2,
  so DB query spans were silently missing from Application Insights. Adding the
  psycopg3-specific package enables automatic instrumentation of database queries.
  - Enable capture_parameters=True to include SQL statements in dependency spans
  - Enable OTEL_PYTHON_PSYCOPG_ENABLE_SQLCOMMENTER to correlate Postgres logs
    with Application Insights traces
  - Add _HealthCheckSampler to reduce /healthcheck trace volume to 1%

  Django passes get_connection_params() as a static dict to ConnectionPool,
  so the Azure AD token is fixed at pool creation and expires after ~1hr.
  Override the pool property to replace pool.kwargs with a callable so a
  fresh token is fetched for each new physical connection.

  Also replace assert with RuntimeError as flagged in PR review.

  Replace the psycopg3 connection pool approach with Django persistent
  connections (CONN_MAX_AGE=180). Connections are recycled every 3 minutes —
  well within Azure's NAT idle timeout (~4 min) and well before the Azure AD
  token TTL (~1 hr) — so tokens are always fresh without requiring a pool or
  callable kwargs.
  - Remove pool OPTIONS block from DATABASES settings
  - Remove pool property override from DatabaseWrapper; get_connection_params()
    is now the sole token-injection point
  - Update wsgi.py DB warmup comment/log to drop pool references
  - Update base.py docstring and _get_azure_connection_password comment to
    reflect the persistent-connection model
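The persistent-connection alternative is a small settings change; a sketch of the shape it might take (CONN_HEALTH_CHECKS is not mentioned in the commit and is shown here as an assumption worth considering):

```python
# Sketch: recycle each persistent connection after 180 s, inside both the
# ~4 min NAT idle timeout and the ~1 hr Azure AD token TTL.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "CONN_MAX_AGE": 180,         # seconds; 0 means a new connection per request
        "CONN_HEALTH_CHECKS": True,  # assumption: verify connections before reuse
    }
}
```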
  Add opentelemetry-instrumentation-django so incoming HTTP requests
  are traced and visible in Application Insights alongside the existing
  psycopg DB spans.
@josielsouzanordcloud force-pushed the DTOSS-12504-investigate-slow-container-start-after-idle-period branch from b4eb1b5 to 801df10 on April 8, 2026 08:57
@sonarqubecloud

sonarqubecloud bot commented Apr 8, 2026
