fix: Handle healthcheck log corruption gracefully#28803
Conversation
| // If the file is corrupted, handle it as not existing. | ||
| if err := json.Unmarshal(b, &healthCheck); err != nil { | ||
| return healthCheck, fmt.Errorf("failed to unmarshal existing healthcheck results in %s: %w", path, err) | ||
| logrus.Warnf("Failed to unmarshal healthcheck results in %s, ignoring: %v", path, err) |
There was a problem hiding this comment.
Should we just remove the file if it's unparsable?
There was a problem hiding this comment.
I don't really like this at all tbh, like this just eats the error more or less since the hc command runs in the background user will not notice.
I find returning a hard error logical here and the caller can decide what to do, in the case of hc run we can warn and then write a new log file over it, in case of podman inspect it already gets logged as error.
It may be a bit more complex that way but I rather have this over the silent failure mode for callers. i.e. something like isUnhealthy() likely should report unhealthy in this case
There was a problem hiding this comment.
Thank you for the feedback.
@mheon I was thinking of moving it for potential inspection to be safe. After these changes it gets overwritten.
@Luap99
I expected the warning to be controversial, but my reasoning was that it’s only a transient state until a next poll. Also, it’s a second line of defense and should be less likely now, but corruption can still happen for multiple reasons.
However, I can see now why it’s problematic.
To be on the same page, I tried to map the callers and their behavior on corrupted log (most likely not 100% exhaustive and there might be mistakes):
libpod/container_inspect.go->getContainerInspectData()- logs an error but continues with an empty log (no change needed here)
libpod/healthcheck.goupdateHealthStatus()-> propagates errcontainer_internal.go:start()-> propagates err -> many callers (podman start,podman run, …) and leads to failurecontainer_internal.go:unpause()-> propagates err -> multiple callers and leads to failure (podman unpause or just logging in podman commit)
isUnhealthy()-> returns false and propagates err (in this PR false is returned and the error swallowed)shouldRestart()-> logs an error and evaluates additional conditions
updateHealthCheckLog()-> propagates errrunHealthCheck()-> propagates err and new result- Runtime.HealthCheck -> propagates err and new result
- API 500
healthcheck runcommand fails
- Runtime.HealthCheck -> propagates err and new result
healthCheckStatus()->HealthCheckStatus()-> propagates errListContainerBatch()-> propagates err - used bypodman ps-> failsWaitForConditionWithInterval()->trySend(-1, err)->podman waitfails (correct)GenerateContainerFilterFuncs()-> returned closure returns false
I'm not completely sure what the desired behavior should be (unless it's retaining the current behavior on corruption). I'll draft a better error handling in the callers instead of this lazy one.
There was a problem hiding this comment.
@Luap99 If the alternative is health checks failing completely with no possible remediation without going into Podman's state and manually removing the file - which we don't provide any location for in any of our commands or debug logs, last I checked - I think removing an unreadable file seems reasonable?
There was a problem hiding this comment.
Well I think the file should stay so all callers print this error. So if you do an inspect I would like to see this error logged (not hard failed as we do already there)
updateHealthCheckLog()/updateHealthStatus() should log a warning like you did and since they write a new file it does solves the corruption naturally
isUnhealthy() must return true I think, a corrupted log means we do not know the state so should assume the container is unhealthy in shouldRestart(), we can certainly log the error there but it should not be ignored like it is right now in shouldRestart().
ListContainerBatch() -> propagates err - used by podman ps -> fails
That should likely not be a hard error, I would say log warning and claim unhealthy
WaitForConditionWithInterval() -> trySend(-1, err) -> podman wait fails (correct)
yes a failure here makes sense, wait should error out
GenerateContainerFilterFuncs() -> returned closure returns false
Seems reasonable but we should have a warning there when it happens.
There was a problem hiding this comment.
@Luap99 Thanks, that sounds good and overwriting the corrupted log in updateHealthCheckLog() / updateHealthStatus() resolves it.
There was a problem hiding this comment.
@Luap99 I've updated the PR based on your feedback.
I added the sentinel error define.ErrHealthCheckLogCorrupted to differentiate between a corrupted log and other possible read errors returned from readFromFileHealthCheckLog - primarily ErrPermission that most likely requires a user action. I couldn’t find custom error types in the codebase. I don’t return it directly but wrapped together with the underlying error to have context. It leads to long log messages, though.
The error is a part of the public API, but I wanted to handle it at the right sites (not documented for the whole call path).
updateHealthCheckLog()/updateHealthStatus()logs a warning and continues with an empty struct. The corrupted log gets overwritten.isUnhealthy()now returns true instead of false on any read error and it leads toshouldRestart()returning true.healthCheckStatus()now returnsHealthCheckUnhealthyinstead of“”(there’s noHealthCheckUnknown) on any read error.GenerateContainerFilterFuncs()behaves the same, but logs a warning that the status is ignored.ListContainerBatch()inps.gonow only logs a warning.
I handled all read errors in healthCheckStatus() and isUnhealthy() / shouldRestart() as unhealthy for consistency with what you said about unknown state:
isUnhealthy() must return true I think, a corrupted log means we do not know the state so should assume the container is unhealthy in shouldRestart(), we can certainly log the error there but it should not be ignored like it is right now in shouldRestart().
Am I wrong here to generalize?
(I’m going to update tests properly when the behavior is approved.)
Luap99
left a comment
There was a problem hiding this comment.
I think the atomic write make sense but I don't think we should fully ignore the error there as different callers likely should do different things with this.
| // If the file is corrupted, handle it as not existing. | ||
| if err := json.Unmarshal(b, &healthCheck); err != nil { | ||
| return healthCheck, fmt.Errorf("failed to unmarshal existing healthcheck results in %s: %w", path, err) | ||
| logrus.Warnf("Failed to unmarshal healthcheck results in %s, ignoring: %v", path, err) |
There was a problem hiding this comment.
I don't really like this at all tbh, like this just eats the error more or less since the hc command runs in the background user will not notice.
I find returning a hard error logical here and the caller can decide what to do, in the case of hc run we can warn and then write a new log file over it, in case of podman inspect it already gets logged as error.
It may be a bit more complex that way but I rather have this over the silent failure mode for callers. i.e. something like isUnhealthy() likely should report unhealthy in this case
The healthcheck log could be corrupted if the process was interrupted mid-write. It could lead to Podman crashing. Write the log files atomically and diferentiate between corrupted log and different errors in consumers of readFromFileHealthCheckLog(). Add a system test for a corrupted log file. Fixes: https://redhat.atlassian.net/browse/RHEL-178222 Signed-off-by: Marek Simek <msimek@redhat.com>
868eb90 to
4b0e9cc
Compare
The healthcheck log could be corrupted if the
podman heathcheck process was interrupted mid-write leading to Podman crashing when trying to read it.
Steps to reproduce
Output:
Running them:
python create-quadlets.py systemctl daemon-reload systemctl start marshall-{0..49}and trying to kill the healtcheck repeatedly until it corrupts the log:
pkill -9 -f "podman.*healthcheck"Fix
readFromFileHealthCheckLogand instead of a hard error, it logs a warning and continues with an empty log. The corrupted one gets overwritten on a next write. The error could be propagated inreadFromFileHealthCheckLogand handled by callers, but this change is localized to a single place instead of multiple callers.mainversion ).Fixes: https://redhat.atlassian.net/browse/RHEL-178222
Checklist
Ensure you have completed the following checklist for your pull request to be reviewed:
commits. (
git commit -s). (If needed, usegit commit -s --amend). The author email must matchthe sign-off email address. See CONTRIBUTING.md
for more information.
Fixes: #00000in commit message (if applicable)make validatepr(format/lint checks)Noneif no user-facing changes)Does this PR introduce a user-facing change?
None. Unless ignoring and overwriting the corrupted log while logging a warning instead of failing is a user-facing change?