fix: Handle healthcheck log corruption gracefully#28803

Open
simek-m wants to merge 1 commit into
containers:mainfrom
simek-m:fix/healthcheck-log-corruption

Conversation

@simek-m simek-m commented May 27, 2026

The healthcheck log could be corrupted if the podman healthcheck process was interrupted mid-write, which led to Podman failing with an error whenever it tried to read the log.

Steps to reproduce

# Run a container with a health check.
TMPDIR=$(mktemp -d)
podman run -d --name hc-test \
  --health-cmd "true" \
  --health-interval 1s \
  --health-log-destination "$TMPDIR" \
  alpine sleep infinity

# Run a health check.
podman healthcheck run hc-test

# Corrupt the log.
CID=$(podman inspect --format '{{.Id}}' hc-test)
echo "{invalid json{" > "$TMPDIR/${CID}-healthcheck.log"

# The podman healthcheck command fails.
podman healthcheck run hc-test
# So does podman ps.
podman ps

Output:

➜  podman git:(main) ✗ podman ps
Error: unable to get healthcheck log for 405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197: failed to unmarshal existing healthcheck results in /tmp/tmp.OvthY1k4nP/405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197-healthcheck.log: define.HealthCheckResults.ReadString: expects " or n, but found i, error found in #2 byte of ...|{invalid jso|..., bigger context ...|{invalid json{
|...
➜  podman git:(main) ✗ podman healthcheck run hc-test

Error: unable to update health check log /tmp/tmp.OvthY1k4nP/405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197-healthcheck.log for 405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197: failed to unmarshal existing healthcheck results in /tmp/tmp.OvthY1k4nP/405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197-healthcheck.log: define.HealthCheckResults.ReadString: expects " or n, but found i, error found in #2 byte of ...|{invalid jso|..., bigger context ...|{invalid json{
  • Alternatively, it can be reproduced by killing the health check command while it writes to the log. I couldn't get "lucky" this way, but it should be achievable by creating many systemd units, as this script does:
#!/usr/bin/python3
for x in range(50):
  with open(f'/etc/containers/systemd/marshall-{x}.container', 'w') as f:
    f.write(f'''
[Unit]
Description=The marshall container {x}
After=local-fs.target
[Container]
Image=registry.access.redhat.com/ubi9-minimal:latest
Exec=sleep infinity
HealthCmd=printf '%%01000d' 0
HealthInterval=1s
PodmanArgs=--health-max-log-size 0 --health-max-log-count 100
[Install]
WantedBy=multi-user.target default.target
''')

Running them:

python create-quadlets.py
systemctl daemon-reload
systemctl start marshall-{0..49}

Then kill the healthcheck repeatedly until the log gets corrupted:

pkill -9 -f "podman.*healthcheck"

Fix

  • Uses ioutils.AtomicWriteFile to write the health check log, so that writes are atomic (temporary file, flush, rename).
  • It also changes the behavior of readFromFileHealthCheckLog when it reads a corrupted log file: instead of returning a hard error, it logs a warning and continues with an empty log. The corrupted file gets overwritten on the next write. The error could instead be propagated from readFromFileHealthCheckLog and handled by the callers, but this change is localized to a single place rather than spread across multiple callers.
  • Testing: it adds a new system test that intentionally corrupts the log file and checks that Podman recovers (the test fails on the current main branch).

Fixes: https://redhat.atlassian.net/browse/RHEL-178222

Checklist

Ensure you have completed the following checklist for your pull request to be reviewed:

  • Certify you wrote the patch or otherwise have the right to pass it on as an open-source patch by signing all
    commits. (git commit -s). (If needed, use git commit -s --amend). The author email must match
    the sign-off email address. See CONTRIBUTING.md
    for more information.
  • Referenced issues using Fixes: #00000 in commit message (if applicable)
  • Tests have been added/updated (or no tests are needed)
  • Documentation has been updated (or no documentation changes are needed)
  • All commits pass make validatepr (format/lint checks)
  • Release note entered in the section below (or None if no user-facing changes)

Does this PR introduce a user-facing change?

None. Unless ignoring and overwriting the corrupted log while logging a warning instead of failing is a user-facing change?


Comment thread libpod/healthcheck.go Outdated
+ // If the file is corrupted, handle it as not existing.
  if err := json.Unmarshal(b, &healthCheck); err != nil {
-     return healthCheck, fmt.Errorf("failed to unmarshal existing healthcheck results in %s: %w", path, err)
+     logrus.Warnf("Failed to unmarshal healthcheck results in %s, ignoring: %v", path, err)
Member

Should we just remove the file if it's unparsable?

Member

I don't really like this at all tbh. This more or less just eats the error, and since the hc command runs in the background, the user will not notice.

I find returning a hard error logical here, so the caller can decide what to do. In the case of hc run we can warn and then write a new log file over it; in the case of podman inspect it already gets logged as an error.

It may be a bit more complex that way, but I'd rather have this than a silent failure mode for callers, i.e. something like isUnhealthy() likely should report unhealthy in this case.

Author

Thank you for the feedback.

@mheon I was thinking of moving it for potential inspection to be safe. After these changes it gets overwritten.

@Luap99
I expected the warning to be controversial, but my reasoning was that it’s only a transient state until the next poll. Also, it’s a second line of defense, and corruption should be less likely now, although it can still happen for multiple reasons.

However, I can see now why it’s problematic.

To be on the same page, I tried to map the callers and their behavior on a corrupted log (most likely not 100% exhaustive, and there might be mistakes):

  • libpod/container_inspect.go -> getContainerInspectData()
    • logs an error but continues with an empty log (no change needed here)
  • libpod/healthcheck.go
    • updateHealthStatus() -> propagates err
      • container_internal.go: start() -> propagates err -> many callers (podman start, podman run, …) and leads to failure
      • container_internal.go: unpause() -> propagates err -> multiple callers and leads to failure (podman unpause or just logging in podman commit)
    • isUnhealthy() -> returns false and propagates err (in this PR false is returned and the error swallowed)
      • shouldRestart() -> logs an error and evaluates additional conditions
    • updateHealthCheckLog() -> propagates err
      • runHealthCheck() -> propagates err and new result
        • Runtime.HealthCheck -> propagates err and new result
          • API 500
          • healthcheck run command fails
    • healthCheckStatus() -> HealthCheckStatus() -> propagates err
      • ListContainerBatch() -> propagates err - used by podman ps -> fails
      • WaitForConditionWithInterval() -> trySend(-1, err) -> podman wait fails (correct)
      • GenerateContainerFilterFuncs() -> returned closure returns false

I'm not completely sure what the desired behavior should be (unless it's retaining the current behavior on corruption). I'll draft better error handling in the callers instead of this lazy approach.

Member

@Luap99 If the alternative is health checks failing completely with no possible remediation without going into Podman's state and manually removing the file - which we don't provide any location for in any of our commands or debug logs, last I checked - I think removing an unreadable file seems reasonable?

Member

Well, I think the file should stay so that all callers print this error. So if you do an inspect I would like to see this error logged (not hard failed, as we already do there).

updateHealthCheckLog()/updateHealthStatus() should log a warning like you did, and since they write a new file it solves the corruption naturally.

isUnhealthy() must return true I think, a corrupted log means we do not know the state so should assume the container is unhealthy in shouldRestart(), we can certainly log the error there but it should not be ignored like it is right now in shouldRestart().

> ListContainerBatch() -> propagates err - used by podman ps -> fails

That should likely not be a hard error; I would say log a warning and claim unhealthy.

> WaitForConditionWithInterval() -> trySend(-1, err) -> podman wait fails (correct)

Yes, a failure here makes sense; wait should error out.

> GenerateContainerFilterFuncs() -> returned closure returns false

Seems reasonable, but we should have a warning there when it happens.

Author

@Luap99 Thanks, that sounds good and overwriting the corrupted log in updateHealthCheckLog() / updateHealthStatus() resolves it.

Author

@Luap99 I've updated the PR based on your feedback.

I added the sentinel error define.ErrHealthCheckLogCorrupted to differentiate between a corrupted log and other possible read errors returned from readFromFileHealthCheckLog, primarily ErrPermission, which most likely requires user action. I couldn’t find custom error types in the codebase. I don’t return the sentinel directly; it is wrapped together with the underlying error to keep the context. It leads to long log messages, though.
The error is part of the public API, but I wanted to handle it at the right sites (it is not documented for the whole call path).


  • updateHealthCheckLog() / updateHealthStatus() log a warning and continue with an empty struct. The corrupted log gets overwritten.
  • isUnhealthy() now returns true instead of false on any read error and it leads to shouldRestart() returning true.
  • healthCheckStatus() now returns HealthCheckUnhealthy instead of “” (there’s no HealthCheckUnknown) on any read error.
  • GenerateContainerFilterFuncs() behaves the same, but logs a warning that the status is ignored.
  • ListContainerBatch() in ps.go now only logs a warning.

I handled all read errors in healthCheckStatus() and isUnhealthy() / shouldRestart() as unhealthy, for consistency with what you said about the unknown state:

> isUnhealthy() must return true I think, a corrupted log means we do not know the state so should assume the container is unhealthy in shouldRestart(), we can certainly log the error there but it should not be ignored like it is right now in shouldRestart().

Am I wrong to generalize here?

(I’m going to update tests properly when the behavior is approved.)

Member

@Luap99 Luap99 left a comment

I think the atomic write makes sense, but I don't think we should fully ignore the error there, as different callers likely should do different things with it.

The healthcheck log could be corrupted if the
process was interrupted mid-write. It could
lead to Podman crashing.

Write the log files atomically and differentiate
between a corrupted log and other errors in
consumers of readFromFileHealthCheckLog().
Add a system test for a corrupted log file.

Fixes: https://redhat.atlassian.net/browse/RHEL-178222
Signed-off-by: Marek Simek <msimek@redhat.com>
@simek-m simek-m force-pushed the fix/healthcheck-log-corruption branch from 868eb90 to 4b0e9cc Compare May 29, 2026 15:33
3 participants