OCPBUGS-84814: Skip chrony-wait on first node join#5990
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Note Currently processing new changes in this PR. This may take a few minutes, please wait... ⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
WalkthroughA new chrony synchronization detection mechanism is introduced: a check script monitors drift file age and creates a marker when recently synced; a systemd service runs the script with proper ordering; a drop-in config skips chrony-wait if that marker exists. ChangesChrony Sync Detection
Sequence DiagramsequenceDiagram
actor System
participant chronyd as chronyd.service
participant check as chrony-drift-check.service
participant script as drift-check.sh
participant chrony_wait as chrony-wait.service
System->>chronyd: Start
activate chronyd
chronyd->>check: (After chronyd starts)
check->>script: Execute /usr/local/bin/chrony-drift-check.sh
activate script
script->>script: Check /var/lib/chrony/drift age
alt Drift file < 3600 seconds old
script->>script: Touch /run/chrony-recently-synced
end
script-->>check: Exit 0
deactivate script
check-->>System: Service complete (RemainAfterExit=yes)
rect rgba(200, 150, 100, 0.5)
Note over chrony_wait: Now ready to start<br/>(Before chrony-drift-check)
end
chrony_wait->>chrony_wait: Check ConditionPathExists=/run/chrony-recently-synced
alt Marker exists
chrony_wait-->>System: Skip execution
else Marker absent
chrony_wait->>chrony_wait: Run normally
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
Skipping CI for Draft Pull Request. |
|
/test e2e-aws-ovn |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sdodson The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@sdodson: This pull request references Jira Issue OCPBUGS-84814, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sdodson: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
chrony-wait.service blocks boot 2 for 8-24s waiting for NTP synchronization. On first node join this wait is unnecessary — the node has been running chronyd throughout boot 1 (Ignition + MCD firstboot, ~2 minutes on AWS) and time is already synchronized before boot 2 begins. Add a ConditionPathExists drop-in to chrony-wait.service: ConditionPathExists=/var/lib/kubelet/pki/kubelet-client-current.pem This file exists only on nodes that have previously joined the cluster and received a kubelet client certificate. On first join the file is absent and chrony-wait is skipped entirely. On any subsequent reboot of an already-joined node the cert is present and chrony-wait runs normally, preserving correct behavior for long-lived nodes. This replaces an earlier drift-file-age approach that required a separate chrony-drift-check.service and helper script. The kubelet cert check is simpler (one drop-in, no helper), more semantically correct (answers "has this node ever joined?" rather than "was NTP recently active?"), and avoids the edge case where a node was powered off long enough for the drift file to appear stale. Tested on AWS 4.19 with m6i.xlarge (n=4): chrony-wait absent from systemd-analyze blame on all runs. Boot 2 SA reduced from ~20s to ~12s on fast-OVS boots. Total scale-up time reduced by ~7s on average. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
b569fd1 to
864d9bb
Compare
|
@sdodson: This pull request references Jira Issue OCPBUGS-84814, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Summary
chrony-wait.serviceblocks boot 2 for 8–24s waiting for NTP synchronization. On first node join this wait is unnecessary — the node has been running chronyd throughout boot 1 (Ignition + MCD firstboot, ~2 minutes on AWS) and time is already synchronized before boot 2 begins.This PR adds a single drop-in to
chrony-wait.service:/var/lib/kubelet/pki/kubelet-client-current.pemexists only on nodes that have previously joined the cluster and received a kubelet client certificate. On first join the file is absent and chrony-wait is skipped entirely. On any subsequent reboot of an already-joined node the cert is present and chrony-wait runs normally.Why this approach
Why not the drift-file-age check?
An earlier iteration used a
chrony-drift-check.servicethat checked whether/var/lib/chrony/driftwas modified recently. That required a helper script, a separate service, and a tmpfs flag file — three moving parts to express what is ultimately a binary question: has this node ever joined the cluster? The kubelet cert answers that directly with one line.Why is this safe?
Why not
ConditionPathExistson the drift file directly?ConditionPathExistschecks existence only, not file age. A newly installed node (never joined) will have no drift file, so the condition would correctly allow chrony-wait to run. But a node powered off for a day and then re-imaged might retain the drift file on-disk. The kubelet cert is a cleaner signal.Test results
Tested on AWS 4.19.30 with m6i.xlarge (n=4 valid rounds):
chrony-wait.serviceabsent fromsystemd-analyze blameon all 4 runsTest plan
chrony-waitabsent fromsystemd-analyze blameon boot 2chrony-waitruns normallychrony-waitruns on every node's first boot (no cert yet)chronyc trackingshows synchronized clock after boot completes🤖 Generated with Claude Code
Summary by CodeRabbit