DTOSS-12504: Investigate slow container start after idle period #1212
josielsouzanordcloud wants to merge 12 commits into main from
Conversation
The review app is available at this URL:
```python
# connection, not on every request. The timing log below is therefore
# expected to appear only at pool initialisation and when the pool expands.
# https://docs.djangoproject.com/en/6.0/ref/databases/#connection-pool
assert self.azure_credential is not None
```
This is a gotcha in Python: assert should not be used for preconditions like this, because the interpreter skips assertions entirely when run with optimisations enabled (python -O).
```diff
- assert self.azure_credential is not None
+ if self.azure_credential is None:
+     raise Exception('something explaining the problem')
```
The change to ManagedCredential looks fine.
I think there are a couple of problems remaining with the pooling solution though:
- the way django passes the connection params to the pool means they will get stale, unless I'm misreading the code
- the warmup code in wsgi.py potentially blocks the app from becoming ready to serve requests, including the healthcheck
If we can't get this to work right away, perhaps another way forward would be to revert this part of the PR and choose a DATABASE_CONN_MAX_AGE that is lower than the token expiry but still non-zero. I'd like to get it working with the psycopg pool if we can though.
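The fallback described here (a non-zero DATABASE_CONN_MAX_AGE below the token expiry) could look roughly like this in the Django settings; the 180 s value and database name are illustrative, and the ~1 hour token TTL comes from the later commit messages in this PR:

```python
# Sketch of the reviewer's fallback: persistent connections instead of a
# psycopg pool. CONN_MAX_AGE (seconds) must stay below the Azure AD token
# TTL (~1 hour) so a recycled connection always re-authenticates with a
# fresh token fetched in get_connection_params().
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "manage_breast_screening",  # hypothetical database name
        "CONN_MAX_AGE": 180,  # recycle well before the ~1 h token expiry
        "CONN_HEALTH_CHECKS": True,  # detect connections killed by NAT idle timeout
    }
}
```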
```python
#
# Without this, the psycopg3 pool is created lazily on the first DB access —
# which means the first real request incurs a TCP+TLS+IMDS+Postgres-auth
# round-trip (typically 5–30 s in Azure Container Apps after an idle period).
```
This 5–30 s round-trip still seems really excessive to me. Do we have any way of measuring where this latency is coming from, or analysing it without the Django app as a complicating factor?
Also, this is triggered by the entrypoint of the app (entrypoint calls gunicorn, which spins up workers, then the workers import this file and start the wsgi application). I think we need to make sure that this won't interfere with health checks and autoscaling.
If this can take a long time, perhaps it needs to be non-blocking. It's not my area of expertise but coupling the app availability to the database in any way feels potentially problematic.
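One way to decouple readiness from the database, as this comment suggests, is to run the warmup off the request path in a daemon thread. A minimal sketch; the function and logger names are illustrative, not the PR's actual code:

```python
# Sketch of a non-blocking warmup, so worker readiness (and the healthcheck)
# is not coupled to the database being reachable.
import logging
import threading

logger = logging.getLogger(__name__)

def warm_up_database(connect=lambda: None):
    """Open and release one DB connection; swallow failures so the app
    still comes up and the connection is simply retried on first request."""
    try:
        connect()  # in the real app this would be connection.ensure_connection()
    except Exception:
        logger.warning("Database connection warmup failed", exc_info=True)

# Daemon thread: the gunicorn worker starts serving immediately; warmup
# proceeds in the background and at worst the first DB request is slow.
threading.Thread(target=warm_up_database, daemon=True).start()
```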
```python
# This makes use of in-memory token caching
# https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/identity/azure-identity/TOKEN_CACHING.md#in-memory-token-caching
return self.azure_credential.get_token(
```

```python
# Called by the psycopg3 connection pool when creating a new physical
```
After looking more closely, I'm not sure if this is correct...
Looking at the code in django/db/backends/postgresql/base.py for the superclass, the pool is instantiated like this:

```python
pool = ConnectionPool(
    kwargs=connect_kwargs,
    open=False,  # Do not open the pool during startup.
    configure=self._configure_connection,
    check=ConnectionPool.check_connection if enable_checks else None,
    **pool_options,
)
```

But this _get_azure_connection_password method is called as part of get_connection_params, which is called as a one-off, before the pool is instantiated:

```python
connect_kwargs = self.get_connection_params()
```

I think this means that the pool's connection params will get stale when the token expires, so although it will appear to work at deploy time we'll run into errors later(?)
For some more context, check out the commit message here 27f70df
The original implementation for passwordless login was based on a microsoft sample app, so we could potentially ask them if they have a working solution for pooling the connections while also ensuring that the token is properly refreshed.
Also @swebberuk identified PGBouncer as another option in this PR #1078.
Actually, I had a look at the reference docs for ConnectionPool, and it seems like get_connection_params could maybe return a callable rather than a fixed dict. That way ConnectionPool gets instantiated with a callback for its kwargs parameter; the callback would be invoked whenever a connection is made, and it could call _get_azure_connection_password to get the password part.
This might break assumptions in Django though, as the docstring for get_connection_params says it is supposed to return a dict.
The alternative is to override in our class the pool property from base.DatabaseWrapper, but I'd really rather not have this complexity in our codebase, and it will complicate future django updates.
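The callable-kwargs idea can be illustrated with a toy pool. This is a sketch of the concept only, not psycopg_pool's actual API; the point is that kwargs evaluated per physical connection pick up a freshly minted token, while a dict frozen at pool creation goes stale:

```python
import itertools

_token_counter = itertools.count(1)

def fetch_fresh_token():
    # Stand-in for _get_azure_connection_password() / the IMDS token fetch.
    return f"token-{next(_token_counter)}"

class ToyPool:
    """Toy model: accepts either a dict (frozen, goes stale) or a
    zero-argument callable (re-evaluated for every new connection)."""

    def __init__(self, kwargs):
        self._kwargs = kwargs

    def new_connection(self):
        kw = self._kwargs() if callable(self._kwargs) else dict(self._kwargs)
        return kw  # a real pool would call psycopg.connect(**kw) here

pool = ToyPool(lambda: {"user": "app", "password": fetch_fresh_token()})
first = pool.new_connection()["password"]
second = pool.new_connection()["password"]
# first != second: each physical connection authenticates with a new token
```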
```diff
  protect_keyvault = false
  vnet_address_space = "10.142.0.0/16"
- deploy_database_as_container = true
+ deploy_database_as_container = false # TODO: set to true and remove Azure Database for PostgreSQL resources once containerised DB is in place
```
assuming this is just for testing purposes?
```python
from manage_breast_screening.config.settings import boolean_env
```

```python
class _HealthCheckSampler(ParentBased):
```
Regarding naming: a leading underscore signifies private in Python, but it's not normally used for classes, so I'd remove it. Also, both the class and the property have the same name, which is slightly confusing. Is there a more descriptive name for what the class is doing?
```python
params["password"] = self._get_azure_connection_password()
return params

@property
```
this looks like a nice workaround (if it works!)
```diff
  # expected to appear only at pool initialisation and when the pool expands.
  # https://docs.djangoproject.com/en/6.0/ref/databases/#connection-pool
- assert self.azure_credential is not None
+ if self.azure_credential is None:
```
Minor thing but is there a reason we set azure_credential to None at all, rather than throwing an exception in the init?
in the init, we currently have ManagedIdentityCredential(client_id=client_id) if client_id else None
I'm wondering if we could move the check before that, and then remove the conditional, or perhaps move environ.get("AZURE_DB_CLIENT_ID") to the settings file and validate it as part of starting the application. (Note: @malcolmbaig has just merged a change to how the settings files are structured, so you'd need to rebase)
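Moving the check into the constructor, as suggested here, could look like this sketch. The class and error message are illustrative; in the real code the attribute would hold a ManagedIdentityCredential, kept as a plain value here so the sketch has no Azure dependency:

```python
import os

class ManagedCredentialWrapper:
    """Sketch: fail fast at startup instead of storing None and checking later."""

    def __init__(self):
        client_id = os.environ.get("AZURE_DB_CLIENT_ID")
        if not client_id:
            # Surface the misconfiguration at application start,
            # not on the first database access.
            raise RuntimeError(
                "AZURE_DB_CLIENT_ID must be set for passwordless DB auth"
            )
        # Real code: self.azure_credential = ManagedIdentityCredential(client_id=client_id)
        self.client_id = client_id
```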
```python
_logger.warning("Database connection warmup failed", exc_info=True)
```

```python
# Warm up the Jinja2 template cache so the first user request does not pay
# the PackageLoader traversal cost (~11 s in Azure Container Apps due to
```
11s is a long time to render a template!
- If this operation is actually slower in a production-like environment than local dev, shouldn't we provision it better?
- If this is slow locally as well then I think we should have another dev ticket to look into this specifically, maybe there are ways we can speed it up
We are currently configuring it to look in multiple places, so perhaps that is adding a lot of overhead?
```python
env.loader = ChoiceLoader(
    [
        env.loader,
        PackageLoader(
            "nhsuk_frontend_jinja", package_path="templates/nhsuk/components"
        ),
        PackageLoader(
            "nhsuk_frontend_jinja", package_path="templates/nhsuk/macros"
        ),
        PackageLoader("nhsuk_frontend_jinja"),
    ]
)
```
```python
t3 = time.perf_counter()

logger.info(
```
Are we going to leave these in? If so should they be DEBUG level if we're going to log them?
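If the timing logs stay, gating them at DEBUG could look like this sketch (function and logger names illustrative), so production output stays quiet unless the logger is turned up:

```python
import logging
import time

logger = logging.getLogger("warmup")

def timed(label, work):
    """Run work() and report its duration at DEBUG level only."""
    start = time.perf_counter()
    result = work()
    if logger.isEnabledFor(logging.DEBUG):
        # Only pay for log formatting when DEBUG is actually enabled.
        logger.debug("%s took %.3fs", label, time.perf_counter() - start)
    return result
```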
…ential for DB auth

- Replace DefaultAzureCredential with ManagedIdentityCredential to skip unnecessary credential chain probing on each cold connection
- Enable psycopg3 process-level connection pool (min_size=1) to keep one warm authenticated connection alive after idle periods, avoiding the TCP+TLS+Azure AD authentication cost on the first post-idle request
- Add psycopg[pool] dependency
- Add debug timing log for token acquisition to confirm fix in deployed envs
Add a post-initialisation hook in wsgi.py that calls connection.ensure_connection() in each gunicorn worker before it starts accepting requests. Without this, the psycopg3 pool is created lazily on the first DB access, so the first request after an idle period pays the full TCP+TLS+IMDS token+Postgres auth cost (5–30 s in Azure Container Apps). connection.close() returns the connection to the pool rather than terminating it, so the pool stays warm for subsequent requests.
…timeouts

Azure NAT silently drops idle TCP connections after ~4 minutes. The pool held a connection that appeared alive but was dead at the TCP level, so the first post-idle request stalled ~10 s while TCP retransmitted against it before giving up. keepalives_idle=60 sends the first probe at 60 s of inactivity, well within the NAT window, keeping the connection alive indefinitely.
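The keepalive described in this commit maps onto libpq connection parameters, which Django forwards to psycopg via OPTIONS. keepalives_idle=60 mirrors the commit message; the engine name, interval, and count values here are illustrative:

```python
# Sketch of TCP keepalive settings for the Django postgresql backend.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "OPTIONS": {
            "keepalives": 1,            # enable TCP keepalives
            "keepalives_idle": 60,      # first probe after 60 s idle, inside the ~4 min NAT window
            "keepalives_interval": 10,  # then probe every 10 s
            "keepalives_count": 3,      # drop the connection after 3 failed probes
        },
    }
}
```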
…spans

azure-monitor-opentelemetry only bundles opentelemetry-instrumentation-psycopg2, so DB query spans were silently missing from Application Insights. Adding the psycopg3-specific package enables automatic instrumentation of database queries.
- Enable capture_parameters=True to include SQL statements in dependency spans
- Enable OTEL_PYTHON_PSYCOPG_ENABLE_SQLCOMMENTER to correlate Postgres logs with Application Insights traces
- Add _HealthCheckSampler to reduce /healthcheck trace volume to 1%
Django passes get_connection_params() as a static dict to ConnectionPool, so the Azure AD token is fixed at pool creation and expires after ~1hr. Override the pool property to replace pool.kwargs with a callable so a fresh token is fetched for each new physical connection. Also replace assert with RuntimeError as flagged in PR review.
Replace the psycopg3 connection pool approach with Django persistent
connections (CONN_MAX_AGE=180). Connections are recycled every 3 minutes —
well within Azure's NAT idle timeout (~4 min) and well before the Azure AD
token TTL (~1 hr) — so tokens are always fresh without requiring a pool or
callable kwargs.
- Remove pool OPTIONS block from DATABASES settings
- Remove pool property override from DatabaseWrapper; get_connection_params()
is now the sole token-injection point
- Update wsgi.py DB warmup comment/log to drop pool references
- Update base.py docstring and _get_azure_connection_password comment to reflect the persistent-connection model
Add opentelemetry-instrumentation-django so incoming HTTP requests are traced and visible in Application Insights alongside the existing psycopg DB spans.
Summary

- Replace DefaultAzureCredential with ManagedIdentityCredential for database authentication to skip unnecessary credential chain probing on cold connections

Background

Investigation of a 10–20 s delay on the first request after an idle period. Container metrics (CPU ~2.5%, memory ~1 GB, stable throughout) ruled out Azure hibernation and scale-to-zero. The root causes identified were:
The connection pool (min_size=1) keeps one authenticated connection alive across requests, so the pool only calls get_connection_params() (and therefore IMDS) when creating new physical connections, not on every request checkout.
Test plan
Jira link
DTOSS-12504
Review notes
Review checklist