add k8s version drift detection and remediation#89

Open
zqingqing1 wants to merge 5 commits into main from qizhe/drift-detection

@zqingqing1
Member

This PR introduces several changes:

  • Refactor the daemon loop to use goroutines instead of a single for loop with a ticker.
  • Add a bool to the config to choose between self-managed mode and "driftAndRemediate" mode; the latter is the default.
  • Support k8s version drift detection, and install a new k8s version to match the cluster version.
  • Add fields to the status file (lastUpdatedBy and lastUpdatedReason) to record who updated it and why.
  • Set up a drift detection and remediation framework that is easy to extend.

kubectl shows the correct version:

ubuntu@flex-node:/tmp$ kubectl get node -o wide
NAME                                STATUS   ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-27087470-vmss000000   Ready    <none>   132m   v1.33.2   172.19.0.47    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-1
aks-nodepool1-27087470-vmss000001   Ready    <none>   129m   v1.33.2   172.19.0.18    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-1
aks-nodepool1-27087470-vmss000002   Ready    <none>   125m   v1.33.2   172.19.0.10    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-1
flex-node                           Ready    <none>   27d    v1.33.2   10.5.202.191   <none>        Ubuntu 24.04.3 LTS   6.8.0-94-generic    containerd://1.7.20
ubuntu@flex-node:/tmp$

The status file is collected as follows:

ubuntu@flex-node:/tmp$ cat /tmp/aks-flex-node/status.json
{
  "kubeletVersion": "1.33.2",
  "runcVersion": "1.1.12",
  "containerdVersion": "1.7.20",
  "kubeletRunning": true,
  "kubeletReady": "Ready",
  "containerdRunning": true,
  "arcStatus": {
    "registered": true,
    "connected": true,
    "machineName": "flex-node",
    "resourceId": "/subscriptions/2e2f5a34-d3e4-4e62-a250-bcc49208879e/resourceGroups/qizhe-eastus2/providers/Microsoft.HybridCompute/machines/flex-node",
    "location": "eastus2",
    "resourceGroup": "qizhe-eastus2",
    "lastHeartbeat": "0001-01-01T00:00:00Z"
  },
  "lastUpdated": "2026-02-19T13:50:35.790167898-08:00",
  "lastUpdatedBy": "StatusCollectionLoop",
  "lastUpdatedReason": "perodicStatusLoop",
  "agentVersion": "v0.0.4-26-g0cfb70c-dirty"
}

The log output is shown below:

level=info msg="Starting periodic status collection at 2026-02-19 13:25:35..." func="[commands.go:247]"
level=info msg="Starting periodic managed cluster spec collection at 2026-02-19 13:25:35..." func="[commands.go:318]"
level=info msg="Collecting managed cluster spec for qizhe-eastus2/stretch" func="[collector.go:115]"
level=info msg="Status collection completed successfully at 2026-02-19 13:25:37" func="[commands.go:254]"
level=info msg="Managed cluster spec collection completed at 2026-02-19 13:25:38" func="[commands.go:325]"
level=warning msg="Drift detected: id=kubernetes-version title=Kubernetes version drift details=kubelet=\"1.32.7\" desired=\"1.33.2\"" func="[remediation.go:89]"
level=info msg="Starting AKS node drift-kubernetes-upgrade" func="[executor.go:68]"
level=info msg="Executing drift-kubernetes-upgrade step KubeletOnlyDisabled" func="[executor.go:123]"
level=info msg="Stopping and disabling kubelet service (kubelet-only)" func="[kubelet_only_uninstaller.go:35]"
UNIT FILE       STATE   PRESET
kubelet.service enabled enabled

1 unit files listed.
Removed "/etc/systemd/system/multi-user.target.wants/kubelet.service".
level=info msg="drift-kubernetes-upgrade step: KubeletOnlyDisabled completed successfully with duration 577.089745ms" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step KubeBinariesInstaller" func="[executor.go:123]"
level=info msg="Installing Kube Binaries of version 1.33.2" func="[kube_binaries_installer.go:36]"
level=info msg="Cleaning up corrupted Kubernetes installation files to start fresh" func="[kube_binaries_installer.go:49]"
pkill: killing pid 734 failed: Operation not permitted
level=info msg="Constructed Kubernetes download URL: https://acs-mirror.azureedge.net/kubernetes/v1.33.2/binaries/kubernetes-node-linux-amd64.tar.gz" func="[kube_binaries_installer.go:181]"
level=info msg="Downloading Kube binaries from https://acs-mirror.azureedge.net/kubernetes/v1.33.2/binaries/kubernetes-node-linux-amd64.tar.gz into /tmp/kubernetes-node-linux-amd64.tar.gz" func="[kube_binaries_installer.go:74]"
level=info msg="Extracting Kubernetes binaries to /usr/local/bin" func="[kube_binaries_installer.go:80]"
level=info msg="Setting executable permissions on Kubernetes binaries" func="[kube_binaries_installer.go:86]"
level=info msg="Kubernetes binaries installed successfully" func="[kube_binaries_installer.go:43]"
level=info msg="drift-kubernetes-upgrade step: KubeBinariesInstaller completed successfully with duration 8.352009603s" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step KubeletInstaller" func="[executor.go:123]"
level=info msg="Installing and configuring kubelet" func="[kubelet_installer.go:49]"
level=info msg="Configuring kubelet" func="[kubelet_installer.go:81]"
level=info msg="Checking for required kubelet packages" func="[kubelet_installer.go:194]"
/usr/bin/jq
/usr/sbin/iptables
level=info msg="All required kubelet packages are available" func="[kubelet_installer.go:208]"
level=info msg="Creating required directories for kubelet" func="[kubelet_installer.go:170]"
level=info msg="Required directories created successfully" func="[kubelet_installer.go:186]"
level=info msg="Fetching cluster credentials from Azure" func="[kubelet_installer.go:546]"
level=info msg="Fetching cluster credentials for cluster stretch in resource group qizhe-eastus2 using Azure SDK" func="[kubelet_installer.go:711]"
level=info msg="Writing API server client CA certificate" func="[kubelet_installer.go:526]"
level=info msg="API server client CA certificate written to /etc/kubernetes/pki/apiserver-client-ca.crt" func="[kubelet_installer.go:539]"
level=info msg="Kubelet installed and configured successfully" func="[kubelet_installer.go:61]"
level=info msg="drift-kubernetes-upgrade step: KubeletInstaller completed successfully with duration 2.265825941s" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step KubeletOnlyEnabled" func="[executor.go:123]"
level=info msg="Enabling and starting kubelet service (kubelet-only)" func="[kubelet_only_installer.go:37]"
Created symlink /etc/systemd/system/multi-user.target.wants/kubelet.service → /etc/systemd/system/kubelet.service.
active
level=info msg="drift-kubernetes-upgrade step: KubeletOnlyEnabled completed successfully with duration 3.126507062s" func="[executor.go:147]"
level=info msg="AKS node drift-kubernetes-upgrade completed successfully (duration: 14.331840881s, stepCount: 4)" func="[executor.go:106]"
level=info msg="drift-kubernetes-upgrade completed successfully (duration: 14.331840881s, steps: 4)" func="[remediation.go:230]"
level=info msg="Kubernetes upgrade remediation completed successfully" func="[remediation.go:128]"
level=info msg="Drift detection after spec collection completed at 2026-02-19 13:25:53" func="[commands.go:331]"
level=info msg="Starting periodic status collection at 2026-02-19 13:26:35..." func="[commands.go:247]"
level=info msg="Starting bootstrap health check at 2026-02-19 13:26:35..." func="[commands.go:278]"
level=info msg="Bootstrap health check completed at 2026-02-19 13:26:35" func="[commands.go:292]"

"strings"
)

func majorMinor(version string) string {
Member

Can we use a semver library for this parsing, so that edge cases are handled?
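
To make the edge cases concrete, here is a standard-library-only sketch of the kind of inputs such parsing would need to cope with: a leading "v", a missing patch component, and pre-release or build suffixes. It is not the `majorMinor` in this PR (whose body is not shown here), it returns an error where the original returns only a string, and it is not a strict semver parser (for example, it accepts leading zeros). A semver library, such as `golang.org/x/mod/semver` with its `MajorMinor` function, would replace most of this hand-written logic.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorMinor returns "MAJOR.MINOR" for inputs such as "1.33.2", "v1.33.2",
// "1.33", or "1.33.2-beta.1+build5", and an error for anything else.
func majorMinor(version string) (string, error) {
	v := strings.TrimPrefix(strings.TrimSpace(version), "v")

	// Drop pre-release ("-...") and build metadata ("+...") suffixes.
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}

	parts := strings.Split(v, ".")
	if len(parts) < 2 || len(parts) > 3 {
		return "", fmt.Errorf("invalid version %q", version)
	}
	// Every remaining component must be a non-negative integer.
	for _, p := range parts {
		if _, err := strconv.ParseUint(p, 10, 32); err != nil {
			return "", fmt.Errorf("invalid version %q: %w", version, err)
		}
	}
	return parts[0] + "." + parts[1], nil
}

func main() {
	for _, in := range []string{"1.33.2", "v1.33.2", "1.33", "1.33.2-beta.1", "1", "not-a-version"} {
		mm, err := majorMinor(in)
		if err != nil {
			fmt.Printf("%-16q error: %v\n", in, err)
			continue
		}
		fmt.Printf("%-16q %s\n", in, mm)
	}
}
```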

}

tempFile := path + ".tmp"
if err := os.WriteFile(tempFile, statusData, 0o600); err != nil {
Member

We can use

func WriteFile(filename string, content []byte, perm os.FileMode) error {

which handles the atomic write for you.
