bug: log pattern conditions are never cleared after node reboot (edge-triggered vs level-triggered) #19

@mattmattox

Summary

matchPatterns() sets a node condition to ConditionTrue when an error-level log pattern fires, but never sets ConditionFalse when the pattern is no longer present. After a node reboot clears the kernel ring buffer (kmsg), the condition stays True indefinitely.

Observed Behaviour

  • NodeDoctorVmxnet3TxHang: True on nodes a1ubr720p04 and a1ubopsp06 with lastTransitionTime: 2025-11-30. Both nodes have since been rebooted onto a fixed kernel (6.17), but the condition remains stuck at True 5+ months later.
  • NodeDoctorConntrackTableFull: True on a1ubr720p03. Conntrack is currently at 0.7% utilisation; the condition is stale, left over from a Cilium incident.

$ kubectl get node a1ubr720p04 -o jsonpath='{.status.conditions[?(@.type=="NodeDoctorVmxnet3TxHang")]}'
{"lastTransitionTime":"2025-11-30T05:35:19Z","message":"Pattern 'vmxnet3-tx-hang' matched in kmsg logs","reason":"LogPatternMatched","status":"True"}

Root Cause

In pkg/monitors/custom/logpattern.go, matchPatterns() (line 1076) only ever emits ConditionTrue:

// Add condition for error-level patterns
if severity == types.EventError {
    condition := types.NewCondition(
        pattern.Name,
        types.ConditionTrue,       // ← only True is ever emitted
        "LogPatternMatched",
        fmt.Sprintf("Pattern '%s' matched in %s logs", pattern.Name, source),
    )
    status.AddCondition(condition)
}

checkLogPatterns() scans all sources but emits conditions only on match. When a pattern does not match (e.g. because kmsg was cleared by reboot), nothing is emitted — the Kubernetes condition object persists unchanged.

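Kubernetes node conditions are persistent state: not sending an update is not the same as setting the condition to False. A condition keeps its last written status until something explicitly overwrites it.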
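The edge-triggered behaviour can be reproduced in isolation. The sketch below is not the node-doctor code: conditionStore and edgeTriggeredScan are hypothetical names, and the map only stands in for the node's status.conditions, assuming an entry keeps its value until it is overwritten.

```go
package main

import "fmt"

// ConditionStatus mirrors the True/False values used for node conditions.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// conditionStore is a hypothetical stand-in for status.conditions on a node:
// an entry keeps its last written value until something overwrites it.
type conditionStore map[string]ConditionStatus

// edgeTriggeredScan writes True for every pattern that matched in this scan
// and writes nothing for patterns that did not match.
func edgeTriggeredScan(store conditionStore, matched []string) {
	for _, name := range matched {
		store[name] = ConditionTrue
	}
}

func main() {
	store := conditionStore{}

	// Scan before the reboot: the pattern is present in kmsg.
	edgeTriggeredScan(store, []string{"NodeDoctorVmxnet3TxHang"})

	// Scan after the reboot: kmsg is empty, so nothing matches.
	edgeTriggeredScan(store, nil)

	// The condition is still True because no scan ever wrote False.
	fmt.Println(store["NodeDoctorVmxnet3TxHang"]) // prints "True"
}
```

The second scan is a no-op for the store, which is the stuck state shown in the kubectl output above.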

Proposed Fix

After each full scan cycle, emit ConditionFalse for any error-level pattern that was not matched:

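// matchedPatterns is assumed to hold the name of every pattern that matched
// in any source during this scan cycle; it is not defined in this snippet.
for _, pattern := range m.config.Patterns {
    // Only error-level patterns produce node conditions.
    if m.parseSeverity(pattern.Severity) == types.EventError {
        if _, matched := matchedPatterns[pattern.Name]; !matched {
            // Explicitly clear the condition so a stale True is overwritten.
            status.AddCondition(types.NewCondition(
                pattern.Name,
                types.ConditionFalse,
                "LogPatternNotFound",
                fmt.Sprintf("Pattern '%s' not found in current scan", pattern.Name),
            ))
        }
    }
}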
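For reference, here is a self-contained sketch of the level-triggered approach that can be run outside the repository. Pattern, Condition and conditionsForScan are simplified, hypothetical stand-ins rather than the project's real types, and the pattern names other than vmxnet3-tx-hang are made up for illustration.

```go
package main

import "fmt"

// Pattern and Condition are simplified, hypothetical stand-ins for the
// project's own types; only the fields needed here are modelled.
type Pattern struct {
	Name     string
	Severity string
}

type Condition struct {
	Type   string
	Status string
	Reason string
}

// conditionsForScan returns exactly one condition per error-level pattern:
// True if the pattern matched during the scan, False otherwise.
// Non-error patterns produce no condition at all.
func conditionsForScan(patterns []Pattern, matched map[string]bool) []Condition {
	conditions := make([]Condition, 0, len(patterns))
	for _, p := range patterns {
		if p.Severity != "error" {
			continue
		}
		// Default to False so the absence of a match is stated explicitly.
		c := Condition{Type: p.Name, Status: "False", Reason: "LogPatternNotFound"}
		if matched[p.Name] {
			c.Status = "True"
			c.Reason = "LogPatternMatched"
		}
		conditions = append(conditions, c)
	}
	return conditions
}

func main() {
	patterns := []Pattern{
		{Name: "vmxnet3-tx-hang", Severity: "error"},
		{Name: "conntrack-table-full", Severity: "error"}, // hypothetical name
		{Name: "example-warning", Severity: "warning"},    // hypothetical name
	}
	// Only one pattern matched in this scan cycle.
	matched := map[string]bool{"conntrack-table-full": true}

	for _, c := range conditionsForScan(patterns, matched) {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```

One design point worth noting: the matched set should be accumulated across all log sources before any False is emitted. Emitting False per source would likely make a condition flap when a pattern matches in one source but not in another.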

Impact

Every error-level log pattern condition accumulates false positives permanently across node reboots. Operators cannot trust True conditions without cross-checking lastTransitionTime.
