fix: Handle healthcheck log corruption gracefully by simek-m · Pull Request #28803 · containers/podman

simek-m · 2026-05-27T16:34:19Z

The healthcheck log could be corrupted if the
podman heathcheck process was interrupted mid-write leading to Podman crashing when trying to read it.

Steps to reproduce

# Run a container with a health check.
TMPDIR=$(mktemp -d)
podman run -d --name hc-test \
  --health-cmd "true" \
  --health-interval 1s \
  --health-log-destination "$TMPDIR" \
  alpine sleep infinity

# Run a health check.
podman healthcheck run hc-test

# Corrupt the log.
CID=$(podman inspect --format '{{.Id}}' hc-test)
echo "{invalid json{" > $TMPDIR/${CID}-healthcheck.log

# The podman healthcheck command fails.
podman healthcheck run hc-test 
# so does podman ps
podman ps

Output:

➜  podman git:(main) ✗ podman ps
Error: unable to get healthcheck log for 405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197: failed to unmarshal existing healthcheck results in /tmp/tmp.OvthY1k4nP/405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197-healthcheck.log: define.HealthCheckResults.ReadString: expects " or n, but found i, error found in #2 byte of ...|{invalid jso|..., bigger context ...|{invalid json{
|...

➜  podman git:(main) ✗ podman healthcheck run hc-test

Error: unable to update health check log /tmp/tmp.OvthY1k4nP/405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197-healthcheck.log for 405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197: failed to unmarshal existing healthcheck results in /tmp/tmp.OvthY1k4nP/405cdfae215c478036f7f6cc7eed08fb866076654995fc70e14ec59c08164197-healthcheck.log: define.HealthCheckResults.ReadString: expects " or n, but found i, error found in #2 byte of ...|{invalid jso|..., bigger context ...|{invalid json{

Alternatively, it can be reproduced by killing the health check command when it writes to the log. I couldn't get "lucky" with this way, but it should be achievable with creating many systemd units like this script does:

#!/usr/bin/python3
for x in range(50):
  with open(f'/etc/containers/systemd/marshall-{x}.container', 'w') as f:
    f.write(f'''
[Unit]
Description=The marshall container {x}
After=local-fs.target
[Container]
Image=registry.access.redhat.com/ubi9-minimal:latest
Exec=sleep infinity
HealthCmd=printf '%%01000d' 0
HealthInterval=1s
PodmanArgs=--health-max-log-size 0 --health-max-log-count 100
[Install]
WantedBy=multi-user.target default.target
''')

Running them:

python create-quadlets.py
systemctl daemon-reload
systemctl start marshall-{0..49}

and trying to kill the healtcheck repeatedly until it corrupts the log:

pkill -9 -f "podman.*healthcheck"

Fix

Uses ioutils.AtomicWriteFile to write the health check logs to handle atomic writes with tmpfiles/renames and flushing.
It also changes the behavior when reading a corrupted log file in readFromFileHealthCheckLog and instead of a hard error, it logs a warning and continues with an empty log. The corrupted one gets overwritten on a next write. The error could be propagated in readFromFileHealthCheckLog and handled by callers, but this change is localized to a single place instead of multiple callers.
Testing: it adds a new system test that intentionally corrupts the log file and checks that Podman recovers (fails with the main version ).

Fixes: https://redhat.atlassian.net/browse/RHEL-178222

Checklist

Ensure you have completed the following checklist for your pull request to be reviewed:

Certify you wrote the patch or otherwise have the right to pass it on as an open-source patch by signing all
commits. (git commit -s). (If needed, use git commit -s --amend). The author email must match
the sign-off email address. See CONTRIBUTING.md
for more information.
Referenced issues using Fixes: #00000 in commit message (if applicable)
Tests have been added/updated (or no tests are needed)
Documentation has been updated (or no documentation changes are needed)
All commits pass make validatepr (format/lint checks)
Release note entered in the section below (or None if no user-facing changes)

Does this PR introduce a user-facing change?

None. Unless ignoring and overwriting the corrupted log while logging a warning instead of failing is a user-facing change?

mheon · 2026-05-27T18:57:30Z

+	// If the file is corrupted, handle it as not existing.
 	if err := json.Unmarshal(b, &healthCheck); err != nil {
-		return healthCheck, fmt.Errorf("failed to unmarshal existing healthcheck results in %s: %w", path, err)
+		logrus.Warnf("Failed to unmarshal healthcheck results in %s, ignoring: %v", path, err)


Should we just remove the file if it's unparsable?

I don't really like this at all tbh, like this just eats the error more or less since the hc command runs in the background user will not notice.

I find returning a hard error logical here and the caller can decide what to do, in the case of hc run we can warn and then write a new log file over it, in case of podman inspect it already gets logged as error.

It may be a bit more complex that way but I rather have this over the silent failure mode for callers. i.e. something like isUnhealthy() likely should report unhealthy in this case

Thank you for the feedback.

@mheon I was thinking of moving it for potential inspection to be safe. After these changes it gets overwritten.

@Luap99
I expected the warning to be controversial, but my reasoning was that it’s only a transient state until a next poll. Also, it’s a second line of defense and should be less likely now, but corruption can still happen for multiple reasons.

However, I can see now why it’s problematic.

To be on the same page, I tried to map the callers and their behavior on corrupted log (most likely not 100% exhaustive and there might be mistakes):

libpod/container_inspect.go -> getContainerInspectData()

logs an error but continues with an empty log (no change needed here)

libpod/healthcheck.go

updateHealthStatus() -> propagates err

container_internal.go: start() -> propagates err -> many callers (podman start, podman run, …) and leads to failure

container_internal.go: unpause() -> propagates err -> multiple callers and leads to failure (podman unpause or just logging in podman commit)

isUnhealthy() -> returns false and propagates err (in this PR false is returned and the error swallowed)

shouldRestart() -> logs an error and evaluates additional conditions

updateHealthCheckLog() -> propagates err

runHealthCheck() -> propagates err and new result

Runtime.HealthCheck -> propagates err and new result

API 500

healthcheck run command fails

healthCheckStatus() -> HealthCheckStatus() -> propagates err

ListContainerBatch() -> propagates err - used by podman ps -> fails

WaitForConditionWithInterval() -> trySend(-1, err) -> podman wait fails (correct)

GenerateContainerFilterFuncs() -> returned closure returns false

I'm not completely sure what the desired behavior should be (unless it's retaining the current behavior on corruption). I'll draft a better error handling in the callers instead of this lazy one.

@Luap99 If the alternative is health checks failing completely with no possible remediation without going into Podman's state and manually removing the file - which we don't provide any location for in any of our commands or debug logs, last I checked - I think removing an unreadable file seems reasonable?

Well I think the file should stay so all callers print this error. So if you do an inspect I would like to see this error logged (not hard failed as we do already there)

updateHealthCheckLog()/updateHealthStatus() should log a warning like you did and since they write a new file it does solves the corruption naturally

isUnhealthy() must return true I think, a corrupted log means we do not know the state so should assume the container is unhealthy in shouldRestart(), we can certainly log the error there but it should not be ignored like it is right now in shouldRestart().

ListContainerBatch() -> propagates err - used by podman ps -> fails

That should likely not be a hard error, I would say log warning and claim unhealthy

WaitForConditionWithInterval() -> trySend(-1, err) -> podman wait fails (correct)

yes a failure here makes sense, wait should error out

GenerateContainerFilterFuncs() -> returned closure returns false

Seems reasonable but we should have a warning there when it happens.

@Luap99 Thanks, that sounds good and overwriting the corrupted log in updateHealthCheckLog() / updateHealthStatus() resolves it.

@Luap99 I've updated the PR based on your feedback.

I added the sentinel error define.ErrHealthCheckLogCorrupted to differentiate between a corrupted log and other possible read errors returned from readFromFileHealthCheckLog - primarily ErrPermission that most likely requires a user action. I couldn’t find custom error types in the codebase. I don’t return it directly but wrapped together with the underlying error to have context. It leads to long log messages, though. The error is a part of the public API, but I wanted to handle it at the right sites (not documented for the whole call path). 

updateHealthCheckLog() / updateHealthStatus() logs a warning and continues with an empty struct. The corrupted log gets overwritten.

isUnhealthy() now returns true instead of false on any read error and it leads to shouldRestart() returning true.

healthCheckStatus() now returns HealthCheckUnhealthy instead of “” (there’s no HealthCheckUnknown) on any read error.

GenerateContainerFilterFuncs() behaves the same, but logs a warning that the status is ignored.

ListContainerBatch() in ps.go now only logs a warning.

I handled all read errors in healthCheckStatus() and isUnhealthy() / shouldRestart() as unhealthy for consistency with what you said about unknown state:

isUnhealthy() must return true I think, a corrupted log means we do not know the state so should assume the container is unhealthy in shouldRestart(), we can certainly log the error there but it should not be ignored like it is right now in shouldRestart(). 

Am I wrong here to generalize?

(I’m going to update tests properly when the behavior is approved.)

Luap99

I think the atomic write make sense but I don't think we should fully ignore the error there as different callers likely should do different things with this.

Luap99 · 2026-05-27T19:26:00Z

+	// If the file is corrupted, handle it as not existing.
 	if err := json.Unmarshal(b, &healthCheck); err != nil {
-		return healthCheck, fmt.Errorf("failed to unmarshal existing healthcheck results in %s: %w", path, err)
+		logrus.Warnf("Failed to unmarshal healthcheck results in %s, ignoring: %v", path, err)


I don't really like this at all tbh, like this just eats the error more or less since the hc command runs in the background user will not notice.

I find returning a hard error logical here and the caller can decide what to do, in the case of hc run we can warn and then write a new log file over it, in case of podman inspect it already gets logged as error.

It may be a bit more complex that way but I rather have this over the silent failure mode for callers. i.e. something like isUnhealthy() likely should report unhealthy in this case

The healthcheck log could be corrupted if the process was interrupted mid-write. It could lead to Podman crashing. Write the log files atomically and diferentiate between corrupted log and different errors in consumers of readFromFileHealthCheckLog(). Add a system test for a corrupted log file. Fixes: https://redhat.atlassian.net/browse/RHEL-178222 Signed-off-by: Marek Simek <msimek@redhat.com>

mheon reviewed May 27, 2026

View reviewed changes

Luap99 reviewed May 27, 2026

View reviewed changes

simek-m force-pushed the fix/healthcheck-log-corruption branch from 868eb90 to 4b0e9cc Compare May 29, 2026 15:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Handle healthcheck log corruption gracefully#28803

fix: Handle healthcheck log corruption gracefully#28803
simek-m wants to merge 1 commit into
containers:mainfrom
simek-m:fix/healthcheck-log-corruption

simek-m commented May 27, 2026

Uh oh!

mheon May 27, 2026

Uh oh!

Luap99 May 27, 2026

Uh oh!

simek-m May 28, 2026

Uh oh!

mheon May 28, 2026

Uh oh!

Luap99 May 28, 2026

Uh oh!

simek-m May 28, 2026

Uh oh!

simek-m May 29, 2026

Uh oh!

Luap99 left a comment

Uh oh!

Luap99 May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

simek-m commented May 27, 2026

Steps to reproduce

Fix

Checklist

Does this PR introduce a user-facing change?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Luap99 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants